[Songqin Software Test Automation] Several Ways to Parse HTML in Python

October


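Parsing HTML is an important data-processing step after crawling a page. This post records several ways to do it.
First, a basic helper function that fetches the HTML and outputs the parsed result.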

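    # Python 2 code: urllib2 and StringIO do not exist under these names in Python 3.
    import gzip
    import re
    import StringIO
    import urllib2

    from bs4 import BeautifulSoup
    from lxml import etree

    # The parser function is passed in as an argument so it is easy to swap later.
    # Note: bs4_paraser and lxml_parser (shown in the sections below) must be
    # defined before this function, because the default argument is evaluated here.
    def get_html(url, paraser=bs4_paraser):
        headers = {
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Host': 'www.360kan.com',
            'Proxy-Connection': 'keep-alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        response.encoding = 'utf-8'  # kept from the original; has no effect on a urllib2 response
        if response.code == 200:
            # The request asks for gzip, so decompress the body before parsing
            data = StringIO.StringIO(response.read())
            gzipper = gzip.GzipFile(fileobj=data)
            data = gzipper.read()
            value = paraser(data)  # open('E:/h5/haPkY0osd0r5UB.html').read()
            return value
        else:
            pass

    value = get_html('http://www.360kan.com/m/haPkY0osd0r5UB.html', paraser=lxml_parser)
    for row in value:
        print row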
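The helper above targets Python 2 (`urllib2`, `StringIO`). As a rough Python 3 sketch of the same idea, the decompress-and-decode step can be separated from the network call so that it can be tried without contacting the site. The names `decode_body` and `fetch_and_parse` are illustrative and do not come from the original article, and the sketch assumes the server marks gzip bodies with a `Content-Encoding: gzip` header.

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Decompress a gzip body if needed, then decode it to text."""
    if content_encoding and "gzip" in content_encoding.lower():
        raw = gzip.decompress(raw)
    return raw.decode(charset, errors="replace")


def fetch_and_parse(url, parser, headers=None):
    """Fetch url and hand the decoded HTML to parser (any callable)."""
    request = urllib.request.Request(url, headers=headers or {})
    # urlopen raises HTTPError for non-2xx responses, so no status check here.
    with urllib.request.urlopen(request) as response:
        html = decode_body(response.read(),
                           response.headers.get("Content-Encoding"))
    return parser(html)


# Offline check of the decoding step with a hand-made gzip payload.
sample = gzip.compress("<div class='item'>影评</div>".encode("utf-8"))
decoded = decode_body(sample, "gzip")
```

Keeping the parser as a plain callable preserves the original design: any of the parsing functions in this post could be passed in without changing the fetch code.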

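1. Parsing with lxml

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible with but superior to the well-known ElementTree API. At the time of writing, the latest release works with all CPython versions from 2.6 to 3.5. For more on the background and goals of the lxml project, see the introduction on the lxml site; some common questions are answered in its FAQ.

[Official site](http://lxml.de/)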

    def lxml_parser(page):
        data = []
        doc = etree.HTML(page)
        all_div = doc.xpath('//div[@class="yingping-list-wrap"]')
        for row in all_div:
            # Get each review, i.e. each review item
            all_div_item = row.xpath('.//div[@class="item"]')  # find_all('div', attrs={'class': 'item'})
            for r in all_div_item:
                value = {}
                # Title block of the review
                title = r.xpath('.//div[@class="g-clear title-wrap"][1]')
                value['title'] = title[0].xpath('./a/text()')[0]
                value['title_href'] = title[0].xpath('./a/@href')[0]
                # The score is read from the first number in the style attribute
                score_text = title[0].xpath('./div/span/span/@style')[0]
                score_text = re.search(r'\d+', score_text).group()
                value['score'] = int(score_text) / 20
                # Time
                value['time'] = title[0].xpath('./div/span[@class="time"]/text()')[0]
                # Number of people who liked the review
                value['people'] = int(
                    re.search(r'\d+', title[0].xpath('./div[@class="num"]/span/text()')[0]).group())
                data.append(value)
        return data
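To make the extraction steps concrete without network access, here is a small sketch that runs the same kind of lookups against a hand-written snippet. The markup is an assumption modeled on the class names in the XPath expressions above, not a copy of the real page; in particular, the `width:80%` style is a guess at how the star rating is encoded. The sketch uses the standard library's `xml.etree.ElementTree` rather than lxml (whose API is largely compatible), so it only handles well-formed markup and uses `.text` and `.get()` where lxml's XPath would use `text()` and `@attr`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup; the real 360kan page may differ.
SAMPLE = """
<div class="yingping-list-wrap">
  <div class="item">
    <div class="g-clear title-wrap">
      <a href="/review/1">Great film</a>
      <div>
        <span class="star"><span style="width:80%"></span></span>
        <span class="time">2016-08-20</span>
      </div>
      <div class="num"><span>12 likes</span></div>
    </div>
  </div>
</div>
"""


def parse_reviews(markup):
    """Return one dict per review item found in the markup."""
    root = ET.fromstring(markup)
    reviews = []
    for item in root.findall(".//div[@class='item']"):
        title = item.find("./div[@class='g-clear title-wrap']")
        link = title.find("./a")
        style = title.find("./div/span/span").get("style")
        likes = title.find("./div[@class='num']/span").text
        reviews.append({
            "title": link.text,
            "title_href": link.get("href"),
            # First number in the style divided by 20; // mirrors the
            # integer division that "/" performs in the Python 2 code.
            "score": int(re.search(r"\d+", style).group()) // 20,
            "time": title.find("./div/span[@class='time']").text,
            "people": int(re.search(r"\d+", likes).group()),
        })
    return reviews


reviews = parse_reviews(SAMPLE)
```

Real pages are rarely well-formed XML, which is one reason to prefer `etree.HTML` from lxml for actual crawling; this sketch is only meant to show the shape of the data being extracted.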

2. Using BeautifulSoup. I won't go into detail here; instead, I recommend this very well-written article: [Usage explained](http://www.bkjia.com/Pythonjc/992499.html%20%E5%BA%94%E7%94%A8%E8%AE%B2%E8%A7%A3)

    def bs4_paraser(html):
        all_value = []
        value = {}
        soup = BeautifulSoup(html, 'html.parser')
        # Get the review section
        all_div = soup.find_all('div', attrs={'class': 'yingping-list-wrap'}, limit=1)
        for row in all_div:
            # Get each review, i.e. each review item
            all_div_item = row.find_all('div', attrs={'class': 'item'})
            for r in all_div_item:
                # Title block of the review
                title = r.find_all('div', attrs={'class': 'g-clear title-wrap'}, limit=1)
                if title is not None and len(title) > 0:
                    value['title'] = title[0].a.string
                    value['title_href'] = title[0].a['href']
                    # The score is read from the first number in the style attribute
                    score_text = title[0].div.span.span['style']
                    score_text = re.search(r'\d+', score_text).group()
                    value['score'] = int(score_text) / 20
                    # Time
                    value['time'] = title[0].div.find_all('span', attrs={'class': 'time'})[0].string
                    # Number of people who liked the review
                    value['people'] = int(
                        re.search(r'\d+', title[0].find_all('div', attrs={'class': 'num'})[0].span.string).group())
                    # print r
                    all_value.append(value)
                    # Start a fresh dict for the next review
                    value = {}
        return all_value

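3. Using SGMLParser, which works mainly through start-tag and end-tag handlers. The parsing process is fairly easy to follow but somewhat tedious, and the scenario in this example is not a great fit for the approach (haha).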

    from sgmllib import SGMLParser  # Python 2 only

    class CommentParaser(SGMLParser):
        def __init__(self):
            SGMLParser.__init__(self)
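The original listing breaks off at this point. `sgmllib` was removed in Python 3; the closest standard-library counterpart is `html.parser.HTMLParser`, which uses the same start-tag and end-tag callback style. The sketch below is an assumption about how such a parser could collect review titles, not the author's missing code; the class name `TitleCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect link text and href inside each 'title-wrap' div."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # nesting depth inside a title-wrap div
        self.in_link = False  # True while inside an <a> we care about
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Enter on a title-wrap div, or track nested divs once inside.
            if self.div_depth or "title-wrap" in classes:
                self.div_depth += 1
        elif tag == "a" and self.div_depth:
            self.in_link = True
            self.titles.append({"title": "", "title_href": attrs.get("href")})

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1]["title"] += data


collector = TitleCollector()
collector.feed('<div class="item"><div class="g-clear title-wrap">'
               '<a href="/review/1">Great film</a><div><span>x</span></div>'
               '</div></div><a href="/other">ignored</a>')
titles = collector.titles
```

The bookkeeping (`div_depth`, `in_link`) shows why the callback style is more tedious for this kind of nested lookup: the parser has to track by hand the context that XPath or `find_all` expresses in a single query.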

Source: Songqin Software Academy (松勤软件学院)

Original link: https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzI3NDc4NTQ0Nw==&scene=126#wechat_redirect
