Python Crawler Example: Collecting Data from CSDN

Editor: 光环大数据  Source: Internet  Time: 2017-11-13 14:35


Collecting page content from CSDN with Python is comparatively easy, because CSDN does not require a login, cookies, or custom request headers.

The following code runs under Python 2.7:

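# coding: utf-8

# This example fetches the titles, links, and read counts of a given user's CSDN articles.

import urllib2
import re
from bs4 import BeautifulSoup

# CSDN requires no login, no cookies, and no custom headers.

print('======================= CSDN data mining ==========================')

urlstr = "http://blog.csdn.net/luanpeng825485697?viewmode=contents"
host = "http://blog.csdn.net/luanpeng825485697"  # root URL of the blog
alllink = [urlstr]  # every URL that has to be visited
data = {}  # maps each article URL to the tuple of matched regex groups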

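def getdata(html, reg):  # extract values from a string using a regular expression
    pattern = re.compile(reg)
    items = re.findall(pattern, html)
    for item in items:
        urlpath = urllib2.urlparse.urljoin(urlstr, item[0])  # convert the relative URL to an absolute one
        if urlpath not in data:  # record and print each article only once
            data[urlpath] = item
            print urlpath, '',  # the trailing comma keeps print on the same line
            print item[2], '',
            print item[1]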

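# Collect the links found on a page and add them to the list of URLs to visit

def getlink(url, html):
    soup = BeautifulSoup(html, 'html.parser')  # uses Python's built-in html.parser, so html5lib does not have to be installed
    for tag in soup.find_all('a'):  # every <a> tag in the document
        link = tag.get('href')
        newurl = urllib2.urlparse.urljoin(url, link)  # absolute URL of the link, resolved against the current page
        if host not in newurl:  # skip links that leave the blog
            continue
        if newurl in alllink:  # skip URLs that are already in the list
            continue
        if "http://blog.csdn.net/luanpeng825485697/article/list" not in newurl:  # custom restriction: follow article-list pages only
            continue
        alllink.append(newurl)  # add the URL to the list of links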

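# Fetch a URL and extract the content that matches the regular expression

def craw(url):
    try:
        request = urllib2.Request(url)  # build the request
        response = urllib2.urlopen(request)  # get the response
        html = response.read()  # read the returned HTML source
        #reg=r'"link_title">\r\nhttp://blog.csdn.net/luanpeng825485697/article/details/(.*)\n.*'#matches only the article URL and title
        # NOTE: as printed here, the pattern below has only two capture groups, but getdata() reads
        # item[0], item[1] and item[2]; check it against the live page source before running.
        reg = r'"link_title">\r\nhttp://blog.csdn.net/luanpeng825485697/article/details/(.*)\r\n.*[\s\S]*?阅读\(http://blog.csdn.net/luanpeng825485697/article/details/(.*)\)'  # matches URL, title, and read count
        getdata(html, reg)
        getlink(url, html)
    except urllib2.URLError as e:
        if hasattr(e, "code"):
            print e.code
        if hasattr(e, "reason"):
            print e.reason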

# alllink grows while it is being iterated, so newly discovered list pages are crawled as well
for url in alllink:
    craw(url)
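The traversal performed by getlink and the final loop can be tried without touching the network. The sketch below is a Python 3 illustration rather than the article's own code: it assumes urllib.parse in place of urllib2, swaps BeautifulSoup for the standard-library HTMLParser, and reads pages from a made-up in-memory dictionary. The names BASE, START, PAGES, LinkCollector, discover_links, and crawl are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical blog root and start page (not real CSDN URLs).
BASE = "http://blog.example.com/user"
START = BASE + "?viewmode=contents"

# Hypothetical page contents, keyed by absolute URL, so no request is sent.
PAGES = {
    START: '<a href="/user/article/list/2">next</a>'
           '<a href="http://other.example.com/">external</a>',
    BASE + "/article/list/2": '<a href="/user/article/list/3">next</a>'
                              '<a href="/user/article/details/100">post</a>'
                              '<a href="/user/article/list/2">self</a>',
    BASE + "/article/list/3": '<a href="/user/article/list/2">prev</a>',
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def discover_links(page_url, html, frontier, host, required):
    """Append new in-site links that contain `required` to `frontier`."""
    parser = LinkCollector()
    parser.feed(html)
    for href in parser.links:
        new_url = urljoin(page_url, href)  # relative -> absolute
        if host not in new_url:            # link leaves the site
            continue
        if required not in new_url:        # not a list page
            continue
        if new_url in frontier:            # already queued
            continue
        frontier.append(new_url)


def crawl(start_url, fetch, host, required):
    """Visit every URL in a list that grows while it is being iterated."""
    frontier = [start_url]
    for url in frontier:  # entries appended below are visited later in this loop
        discover_links(url, fetch(url), frontier, host, required)
    return frontier


# Unknown URLs yield an empty page instead of raising an error.
visited = crawl(START, lambda url: PAGES.get(url, ""), BASE, "/article/list")
for url in visited:
    print(url)
```

Because the frontier is a plain list, the membership test is linear in its length. That is fine for the handful of list pages a single blog has; a larger crawl would likely keep a set next to the list for the duplicate check.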
