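A Detailed Guide to urllib, a Core Python Module (光环大数据)
urllib is a package in the Python standard library used mainly to build and send web requests. It is simple to use yet fairly capable, which makes it a natural starting point for learning web scraping. This article collects the core usage patterns of urllib to help you pick it up quickly. Note that the requests library commonly used in Python scraping projects is built on urllib3, a separate third-party package, rather than on the standard-library urllib.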
Get
The request module of urllib makes it easy to fetch the contents of a URL: it sends a GET request to the given page and returns the HTTP response.
For example, fetch the Douban URL https://api.douban.com/v2/book/2129650 and print the response:
from urllib import request

with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))
The output shows the HTTP response headers followed by the JSON data:
Status: 200 OK
Server: nginx
Date: Tue, 26 May 2015 10:02:27 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 2049
Connection: close
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Pragma: no-cache
Cache-Control: must-revalidate, no-cache, private
X-DAE-Node: pidl1
Data: {"rating":{"max":10,"numRaters":16,"average":"7.4","min":0},"subtitle":"","author":["廖雪峰编著"],"pubdate":"2007-6",...}
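The body printed above is JSON, so it can be converted into Python objects with the standard json module. The following is a minimal sketch, assuming the response bytes look like the truncated sample output; sample_body and summarize_book are hypothetical names, and only fields visible in that output are used.

```python
import json

# Hypothetical sample: bytes shaped like the (truncated) response shown above.
# In real use this would be the value returned by f.read().
sample_body = (
    '{"rating": {"max": 10, "numRaters": 16, "average": "7.4", "min": 0}, '
    '"subtitle": "", "author": ["廖雪峰编著"], "pubdate": "2007-6"}'
).encode('utf-8')


def summarize_book(body, charset='utf-8'):
    """Decode response bytes and pull a few fields out of the JSON document."""
    book = json.loads(body.decode(charset))
    return {
        'authors': book['author'],
        # The sample reports the average as a string, so convert it explicitly.
        'average': float(book['rating']['average']),
        'pubdate': book['pubdate'],
    }


summary = summarize_book(sample_body)
print(summary['authors'], summary['average'], summary['pubdate'])
```

With a live response, the charset could be taken from the Content-Type header instead of being hard-coded.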
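To make a GET request look as if it came from a browser, use a Request object: by adding HTTP headers to the Request object, the request can be disguised as a browser request. For example, the following imitates an iPhone 6 requesting the Douban home page: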
from urllib import request

req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
Douban then returns the mobile version of the page, suited to the iPhone:
...
<meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
<meta name="format-detection" content="telephone=no">
<link rel="apple-touch-icon" sizes="57x57" href="http://img4.douban.com/pics/cardkit/launcher/57.png" />
...
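A Request object only describes a request, so it can be inspected before anything is sent. The sketch below builds one offline with a hypothetical header set (the shortened User-Agent and the Accept-Language value are illustrative assumptions) and shows how urllib stores the headers.

```python
from urllib import request

# Hypothetical header set, e.g. copied from a browser's developer tools.
browser_headers = {
    'User-Agent': 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

# Headers can be passed to the constructor instead of calling add_header() repeatedly.
req = request.Request('http://www.douban.com/', headers=browser_headers)

# Nothing has been sent yet; the object only describes the request.
print(req.get_method())   # GET, because no data was supplied
print(req.full_url)

# urllib stores header names via str.capitalize(), hence 'User-agent'.
print(req.get_header('User-agent'))
for name, value in req.header_items():
    print('%s: %s' % (name, value))
```

Passing the headers as a dict keeps the disguise in one place, which is convenient when the same set is reused across several requests.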
Post
To send a request with POST, simply pass the data parameter as bytes.
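As a sketch of what "as bytes" means in practice: form fields are first URL-encoded into a string with parse.urlencode and then encoded to bytes. The field values and the example.com endpoint below are hypothetical, and the request is only constructed, never sent.

```python
from urllib import parse, request

# Hypothetical form fields; real field names depend on the target site.
fields = [('username', 'user@example.com'), ('password', 'p@ss word')]

# urlencode() percent-encodes each value and joins the pairs with '&'.
encoded = parse.urlencode(fields)
print(encoded)  # username=user%40example.com&password=p%40ss+word

# urlopen() and Request expect bytes, not str, for the data argument.
body = encoded.encode('utf-8')

# Hypothetical endpoint; the request is built here but not sent.
req = request.Request('https://example.com/login', data=body)
print(req.get_method())  # POST, because data is not None

# Round trip: parse_qs() recovers the original values.
print(parse.parse_qs(encoded))
```

Supplying data is what switches the method from GET to POST; no separate flag is needed.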
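The next example simulates a Weibo login: it reads the login email and password, then passes them in the username=xxx&password=xxx encoding expected by the weibo.cn login page: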
from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
If the login succeeds, the response looks like this:
Status: 200 OK
Server: nginx/1.2.0
...
Set-Cookie: SSOLoginState=1432620126; path=/; domain=weibo.cn
...
Data: {"retcode":20000000,"msg":"","data":{...,"uid":"1658384301"}}
If the login fails, the response looks like this:
...
Data: {"retcode":50011015,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef","data":{"username":"example@python.org","errline":536}}
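The two sample responses differ in retcode and msg, so a program can tell them apart by parsing the JSON body. The sketch below assumes that msg uses standard JSON \uXXXX escapes and that 20000000 means success, which is inferred only from the samples above; check_login and SUCCESS_RETCODE are hypothetical names.

```python
import json

# Body modelled on the failed-login sample (ASCII-only JSON text).
raw = (b'{"retcode":50011015,'
       b'"msg":"\\u7528\\u6237\\u540d\\u6216\\u5bc6\\u7801\\u9519\\u8bef",'
       b'"data":{"username":"example@python.org","errline":536}}')

# Assumption inferred from the two sample responses: 20000000 means success.
SUCCESS_RETCODE = 20000000


def check_login(body):
    """Return (ok, message) for a login response body given as bytes."""
    result = json.loads(body.decode('utf-8'))
    # json.loads() turns the \uXXXX escapes into the actual characters.
    return result['retcode'] == SUCCESS_RETCODE, result['msg']


ok, msg = check_login(raw)
print(ok, msg)
```

Here msg decodes to 用户名或密码错误, that is, "incorrect username or password".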
Handler
For more fine-grained control, such as accessing a site through a proxy, use ProxyHandler, as in the following sample code:
import urllib.request

# Route plain-HTTP requests through the proxy.
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
# Supply credentials in case the proxy asks for HTTP Basic authentication.
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
with opener.open('http://www.example.com/login.html') as f:
    pass
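To see what build_opener actually assembles, the opener can be inspected without sending any request. This is a minimal sketch: the proxy address is the same placeholder as above, and the User-Agent value is a hypothetical one.

```python
import urllib.request

# Placeholder proxy address, as in the sample above.
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
opener = urllib.request.build_opener(proxy_handler)

# Headers listed in addheaders are attached to every request sent via this opener.
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (hypothetical client)')]

# build_opener() also adds the default handlers (HTTP, redirects, errors, ...).
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
print(proxy_handler.proxies)

# Optional: make this opener the one used by plain urllib.request.urlopen().
# urllib.request.install_opener(opener)
```

Because a ProxyHandler instance was passed in, it takes the place of the default one rather than being added alongside it.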
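Summary
urllib lets a program perform all kinds of HTTP requests. To imitate a browser for a specific task, the request has to be disguised as a browser request: first observe the requests the browser sends, then copy its request headers. The User-Agent header is the one that identifies the browser.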