Parsing Weibo post pages with Python

Problem description

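The HTML of a Weibo page is wrapped in a packaging scheme developed in-house by Sina: the markup is stored as string data inside <script> tags, so parsing the page directly with BeautifulSoup does not work. The raw page source looks like this: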

<script>FM.view({"ns":"pl.content.homeFeed.index","domid":"Pl_Official_MyProfileFeed__22","css":["style/css/module/list/comb_WB_feed_profile.css?version=5e290f400318556c"],"js":"page/js/pl/content/homeFeed/index.js?version=0c4baf6873f57710","html":"                <div class=\"WB_feed WB_feed_v3 WB_feed_v4\" pageNum=\"\" node-type='feed_list' module-type=\"feed\">\r\n        <div style=\"position:relative;\" node-type=\"feedconfig\" data-queryfix=is_hot=1>\r\n            <div style=\"position:absolute;top:-110px;left:0;width:0;height:0;\" id=\"feedtop\" name=\"feedtop\"><\/div>\r\n        <\/div>\r\n                    \t        \t\t    \t\t    \t\t    \t\t    \t        \t<div  tbinfo=\"ouid=3800468188\" action-type=\"feed_list_item\" diss-data=\"\"  mid=\"4072273369837098\"  class=\"WB_cardwrap WB_feed_type S_bg2 \">\n        <div class=\"WB_feed_detail clearfix\" node-type=\"feed_content\"\n        >\n                        <div class=\"WB_screen W_fr\">\n    <div class=\"screen_box\"><a href=\"javascript:void(0);\" action-type=\"fl_menu\"><i class=\"W_ficon ficon_arrow_down S_ficon\">c<\/i><\/a>\n        <div class=\"layer_menu_list\" style=\"display: none; position: absolute; z-index: 999;\" node-type=\"fl_menu_right\">\n            <ul>\n                                                        ..."})</script>

As the excerpt shows, the content we want to scrape sits between <script>FM.view({"ns":"pl.content.homeFeed.index","domid":"Pl_Official_MyProfileFeed__22"……</script>, which suggests an approach.

Solution

  1. Use a regular expression to extract the content to be parsed from the full page source;

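  2. As the excerpt shows, the extracted markup is backslash-escaped. Decode the string with the string-escape codec (available in Python 2 only), then strip every backslash that remains, such as the one in <\/div>. Note that the literal '\\' in the code denotes a single backslash: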

# Python 2 only: String is the byte string extracted in step 1
String = String.decode('string-escape').replace('\\', '')
  3. At this point, the content can be parsed with BeautifulSoup.
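
The steps above can be sketched for Python 3, where the string-escape codec no longer exists. This is a minimal sketch and not the original method: it swaps the decode-and-replace step for json.loads, assuming the argument of FM.view is valid JSON, as it appears to be in the excerpt. The names SAMPLE_PAGE, FM_VIEW_RE, extract_feed_html and domid_prefix are hypothetical, and the sample page is a shortened stand-in for a real response.

```python
import json
import re

# Hypothetical sample: a shortened stand-in for one <script> block of a Weibo page.
SAMPLE_PAGE = (
    '<html><body>'
    '<script>FM.view({"ns":"pl.content.homeFeed.index",'
    '"domid":"Pl_Official_MyProfileFeed__22",'
    '"html":"<div class=\\"WB_feed\\" node-type=\'feed_list\'>\\r\\n'
    '<div id=\\"feedtop\\"><\\/div>\\n<\\/div>"})</script>'
    '</body></html>'
)

# Step 1: capture the object passed to FM.view(...) inside a <script> tag.
FM_VIEW_RE = re.compile(r'<script>FM\.view\((\{.*?\})\)</script>', re.S)


def extract_feed_html(page, domid_prefix="Pl_Official_MyProfileFeed"):
    """Return the unescaped HTML of the first matching FM.view block, or None."""
    for match in FM_VIEW_RE.finditer(page):
        try:
            # Step 2 (assumption): the payload is JSON, so json.loads undoes
            # the \" \/ \r \n \t escapes in a single pass.
            payload = json.loads(match.group(1))
        except ValueError:
            continue  # not valid JSON, skip this block
        if payload.get("domid", "").startswith(domid_prefix) and "html" in payload:
            return payload["html"]
    return None


fragment = extract_feed_html(SAMPLE_PAGE)
print(fragment)
```

For step 3, the returned string is ordinary HTML and can be handed to BeautifulSoup, for example BeautifulSoup(fragment, "html.parser"), provided the bs4 package is installed.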