Scraping Lawyers' Phone Numbers Nationwide with Python
Published: 2019-06-24


[This article is from Tianwaiguiyun (天外归云)'s cnblogs blog]

This script scrapes lawyers' phone numbers for cities across the country from the 64365 website, using Python's lxml library to parse the HTML page content. To obtain XPath expressions and verify that they are correct, install the Firebug and FirePath add-ons in Firefox. The page looks like the screenshot below (the goal is to scrape "name + phone number" pairs):

[screenshot of the 64365 lawyer listing page]
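Before looking at the full script, here is a minimal sketch of the extraction step run against a hypothetical HTML fragment shaped like the 64365 listing page. The class names "fl" and "law-tel" come from the article's XPath expressions; the sample names and numbers are invented for illustration:

```python
import lxml.html

# Invented fragment mimicking the listing page's structure.
sample = """
<div class="fl"><p><a>Zhang San</a></p></div>
<span class="law-tel">138-0000-0000</span>
<div class="fl"><p><a>Li Si</a></p></div>
<span class="law-tel"></span>
"""

html = lxml.html.fromstring(sample)
names = html.xpath('//div[@class="fl"]/p/a')
phones = html.xpath('//span[@class="law-tel"]')

# text_content() returns "" for a lawyer who left no number.
pairs = [(n.text, p.text_content()) for n, p in zip(names, phones)]
print(pairs)  # [('Zhang San', '138-0000-0000'), ('Li Si', '')]
```

The same name/phone pairing logic appears in `get_lawyers_info` below, with an added length check because the two node lists come from separate XPath queries.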

The code is as follows:

```python
# coding: utf-8
import os

import lxml.html
import requests


class MyError(Exception):
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return repr(self.value)


def get_lawyers_info(url):
    """Scrape (name, phone) pairs from one listing page."""
    r = requests.get(url)
    html = lxml.html.fromstring(r.content)
    phones = html.xpath('//span[@class="law-tel"]')
    names = html.xpath('//div[@class="fl"]/p/a')
    if len(phones) != len(names):
        raise MyError("Lawyers amount are not equal to the amount of phone_nums: " + url)
    phone_infos = [(names[i].text, phones[i].text_content()) for i in range(len(names))]
    phone_infos_list = []
    for name, phone in phone_infos:
        if phone == "":
            info = name + ": " + u"没留电话\r\n"  # "没留电话" = "no phone number given"
        else:
            info = name + ": " + phone + "\r\n"
        print(info)
        phone_infos_list.append(info)
    return phone_infos_list


def get_pages_num(url):
    """Read the total page count from the pager's second-to-last link."""
    r = requests.get(url)
    html = lxml.html.fromstring(r.content)
    result = html.xpath('//div[@class="u-page"]/a[last()-1]')
    pages_num = result[0].text
    if pages_num.isdigit():
        return pages_num


def get_all_lawyers(cities):
    """Walk every page of every city and append results to lawyers_info.txt."""
    dir_path = os.path.abspath(os.path.dirname(__file__))
    file_path = os.path.join(dir_path, "lawyers_info.txt")
    if os.path.exists(file_path):
        os.remove(file_path)
    with open(file_path, "ab") as f:
        for city in cities:
            pages_num = get_pages_num("http://www.64365.com/" + city + "/lawyer/page_1.aspx")
            if pages_num:
                for i in range(int(pages_num)):
                    url = "http://www.64365.com/" + city + "/lawyer/page_" + str(i + 1) + ".aspx"
                    for each in get_lawyers_info(url):
                        f.write(each.encode("gbk"))


if __name__ == '__main__':
    cities = ['beijing', 'shanghai', 'guangdong', 'guangzhou', 'shenzhen',
              'wuhan', 'hangzhou', 'ningbo', 'tianjin', 'nanjing', 'jiangsu',
              'zhengzhou', 'jinan', 'changsha', 'shenyang', 'chengdu',
              'chongqing', 'xian']
    get_all_lawyers(cities)
```
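The pagination trick in `get_pages_num` deserves a note: `a[last()-1]` selects the pager's second-to-last link, which on this site holds the number of the last page. A small sketch against an invented pager fragment (only the "u-page" class and the XPath expression come from the code above):

```python
import lxml.html

# Invented pager: page links 1..12 followed by a "next" link.
pager = """
<div class="u-page">
  <a>1</a><a>2</a><a>3</a><a>12</a><a>next</a>
</div>
"""

html = lxml.html.fromstring(pager)
# last() is the "next" link, so last()-1 is the highest page number.
result = html.xpath('//div[@class="u-page"]/a[last()-1]')
print(result[0].text)  # 12
```

If the site's pager layout ever changes (for example, no trailing "next" link), this expression would silently pick the wrong anchor, which is why the script also checks `pages_num.isdigit()`.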

The popular cities listed above were scraped; the output, saved to the lawyers_info.txt file in the script's directory, looks as follows:

