Python Core Modules: A Detailed Guide to urllib (光环大数据)

Editor: 光环大数据 | Source: Internet | Date: 2018-02-12 13:27

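urllib is one of Python's built-in core modules, used mainly to construct web requests of all kinds. It is simple to use yet fairly powerful, which makes it a solid first choice for getting started with web crawlers. This article collects the core uses of the urllib library to help you pick it up quickly. The requests library, widely used in Python crawler projects, builds on urllib3, a separate third-party package, rather than on the standard-library urllib itself.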

Get

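urllib's request module makes it easy to fetch the content of a URL: it sends a GET request to the given page and returns the HTTP response.

For example, fetch a Douban URL and print the response: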


from urllib import request

# Send a GET request and read the whole response body
with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    # Print each response header as "name: value"
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))

The output shows the HTTP response headers followed by the JSON data:


Status: 200 OK
Server: nginx
Date: Tue, 26 May 2015 10:02:27 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 2049
Connection: close
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Pragma: no-cache
Cache-Control: must-revalidate, no-cache, private
X-DAE-Node: pidl1
Data: {"rating":{"max":10,"numRaters":16,"average":"7.4","min":0},"subtitle":"","author":["廖雪峰编著"],"pubdate":"2007-6",...}
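
Since the body is JSON, it can be turned into Python objects with the standard json module. The following is a minimal sketch, not part of the original example: `sample_body` is a hand-written stand-in for the bytes returned by `f.read()`, trimmed to the fields visible in the output above, and `parse_book` is a hypothetical helper name.

```python
import json

# Stand-in for f.read(): a hypothetical, shortened body (not a real server response)
sample_body = (
    '{"rating": {"max": 10, "numRaters": 16, "average": "7.4", "min": 0}, '
    '"subtitle": "", "author": ["廖雪峰编著"], "pubdate": "2007-6"}'
).encode('utf-8')


def parse_book(body):
    """Decode a UTF-8 JSON response body into a Python dict."""
    return json.loads(body.decode('utf-8'))


book = parse_book(sample_body)
# Nested JSON objects become dicts, JSON arrays become lists
print(book['author'][0], book['rating']['average'])
```

Note that "average" stays a string here because the sample data encodes it as one; convert it with float() if a number is needed.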



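To make a GET request look like it comes from a browser, use a Request object. By adding HTTP headers to the Request object, we can disguise the request as a browser request. For example, to request the Douban home page as an iPhone 6: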


from urllib import request

req = request.Request('http://www.douban.com/')
# Present the request as coming from mobile Safari on an iPhone
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

Douban then returns the mobile version of the page, suited to the iPhone:
...
<meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
<meta name="format-detection" content="telephone=no">
<link rel="apple-touch-icon" sizes="57x57" href="http://img4.douban.com/pics/cardkit/launcher/57.png" />
...
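
Headers on a Request object can be checked before anything is sent, which helps when debugging a disguised request. The sketch below makes no network call; the shortened User-Agent value and the extra Accept-Language header are assumptions added for illustration. One detail worth knowing is that Request stores header names capitalized, so 'User-Agent' is looked up as 'User-agent'.

```python
from urllib import request

# Hypothetical, shortened User-Agent value used only for illustration
UA = 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)'

# Headers can be passed at construction time or added later
req = request.Request('http://www.douban.com/', headers={'User-Agent': UA})
req.add_header('Accept-Language', 'zh-CN')

# No data was supplied, so this is a GET request
print(req.get_method())

# Header names are stored via str.capitalize(): 'User-Agent' -> 'User-agent'
print(req.has_header('User-agent'))
print(req.get_header('User-agent'))
print(req.header_items())
```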



Post

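To send a request with POST, pass the data parameter as bytes.

Let's simulate a Weibo login: read the login email and password, then pass them encoded as username=xxx&password=xxx, following the format used by the weibo.cn login page: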

from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
# Build the form body as key=value pairs joined by '&'
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

# Passing data (as bytes) makes urlopen send a POST request
with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

If the login succeeds, the response looks like this:

Status: 200 OK
Server: nginx/1.2.0
...
Set-Cookie: SSOLoginState=1432620126; path=/; domain=weibo.cn
...
Data: {"retcode":20000000,"msg":"","data":{...,"uid":"1658384301"}}

If the login fails, the response looks like this:

...
Data: {"retcode":50011015,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef","data":{"username":"example@python.org","errline":536}}
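
The two mechanics of the POST example can be checked without contacting the server: how urlencode shapes the body, and how supplying data changes the request method. The sketch below is only an illustration under assumptions: the credentials are made up, and `raw` is a shortened copy of the failure response shown above. It also shows that json.loads turns the \uXXXX escapes in the error message back into readable text.

```python
import json
from urllib import parse, request

# Hypothetical credentials, for illustration only
fields = [('username', 'example@python.org'), ('password', 'p@ss word')]

# urlencode percent-escapes reserved characters, turns spaces into '+',
# and joins the pairs with '&'; the result must be encoded to bytes for POST
body = parse.urlencode(fields).encode('utf-8')
print(body)

# Supplying data switches the request method from GET to POST
req = request.Request('https://passport.weibo.cn/sso/login', data=body)
print(req.get_method())

# Shortened copy of the failure response; json.loads decodes the \uXXXX escapes
raw = r'{"retcode":50011015,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef"}'
result = json.loads(raw)
print(result['retcode'], result['msg'])
```

Checking `retcode` in the parsed result is a more reliable way to tell success from failure than the HTTP status alone, since the sample success response reports 200 OK with the outcome carried in the JSON body.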

Handler

For more complex control, such as accessing a site through a proxy, we need a ProxyHandler. Sample code:




import urllib.request

# Route HTTP traffic through the proxy and supply its credentials
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
with opener.open('http://www.example.com/login.html') as f:
    pass
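
It can help to see what build_opener actually assembles. The following sketch, which makes no network call, builds an opener with a ProxyHandler and lists the handlers it contains; the proxy address is the same placeholder as above, not a working proxy. The final install_opener call is optional and is shown only as one possible way to make later urllib.request.urlopen() calls go through the same opener.

```python
import urllib.request

# Placeholder proxy address, for illustration only
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
opener = urllib.request.build_opener(proxy_handler)

# build_opener adds the default handlers alongside the ones passed in;
# a default handler is skipped when an instance of its class was supplied
print([type(h).__name__ for h in opener.handlers])

# Optional: make every later urllib.request.urlopen() call use this opener
urllib.request.install_opener(opener)
```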



Summary

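What urllib provides is a way to perform all kinds of HTTP requests from a program. To simulate a browser carrying out a specific task, the request has to be disguised as a browser request. The approach is to first monitor the requests the browser sends and then imitate its request headers; the User-Agent header is the one that identifies the browser.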

