A Detailed Guide to urllib, a Core Python Module (光环大数据)

Editor: 光环大数据  Source: Internet  Time: 2018-02-12 13:27

 

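  urllib is one of Python's built-in core modules, used mainly to construct web requests of all kinds. It is simple to use yet quite capable, which makes it a good first choice for getting started with web crawlers. Below we have collected some of the core usages of urllib to help you pick it up faster. Note that the requests library commonly used in Python crawler projects is built on urllib3, a separate third-party package, rather than on the standard-library urllib.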

Get

The request module of urllib makes it easy to fetch the content of a URL, that is, to send a GET request to a given page and get back the HTTP response:

For example, fetch a Douban URL, https://api.douban.com/v2/book/2129650, and print the response:
   

    from urllib import request

    # urlopen sends a GET request; the response object is a context manager
    with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
        data = f.read()  # response body as bytes
        print('Status:', f.status, f.reason)
        for k, v in f.getheaders():
            print('%s: %s' % (k, v))
        print('Data:', data.decode('utf-8'))

You can see the HTTP response headers and the JSON data:

    Status: 200 OK
    Server: nginx
    Date: Tue, 26 May 2015 10:02:27 GMT
    Content-Type: application/json; charset=utf-8
    Content-Length: 2049
    Connection: close
    Expires: Sun, 1 Jan 2006 01:00:00 GMT
    Pragma: no-cache
    Cache-Control: must-revalidate, no-cache, private
    X-DAE-Node: pidl1

    Data: {"rating":{"max":10,"numRaters":16,"average":"7.4","min":0},"subtitle":"","author":["廖雪峰编著"],"pubdate":"2007-6",...}
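Because the body is JSON, it can be turned into Python objects with the standard json module. The following is a minimal sketch, not part of the original example: the `sample` bytes are a hand-written stand-in for `f.read()` that contains only the fields visible in the output above, and the name `book` is chosen here for illustration.

```python
import json

# Stand-in for the bytes returned by f.read(). Only the fields visible in the
# output above are included; this is a hypothetical sample, not a live response.
sample = (
    '{"rating": {"max": 10, "numRaters": 16, "average": "7.4", "min": 0}, '
    '"subtitle": "", "author": ["廖雪峰编著"], "pubdate": "2007-6"}'
).encode('utf-8')

# Decode the bytes to str, then parse the JSON text into dicts and lists.
book = json.loads(sample.decode('utf-8'))

print(book['rating']['average'])
print(book['author'][0])
```

In a real script, `sample` would simply be replaced by the `data` read from the response.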

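To make a GET request look as if it came from a browser, use a Request object. By adding HTTP headers to the Request object, we can disguise the request as a browser's. For example, to request the Douban home page as an iPhone 6: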
   

    from urllib import request

    req = request.Request('http://www.douban.com/')
    # Present the request as coming from mobile Safari on an iPhone
    req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
    with request.urlopen(req) as f:
        print('Status:', f.status, f.reason)
        for k, v in f.getheaders():
            print('%s: %s' % (k, v))
        print('Data:', f.read().decode('utf-8'))

Douban then returns the mobile version of the page, suited to the iPhone:

    ...
    <meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
    <meta name="format-detection" content="telephone=no">
    <link rel="apple-touch-icon" sizes="57x57" href="http://img4.douban.com/pics/cardkit/launcher/57.png" />
    ...
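A Request object can be inspected before anything is sent, which is handy for checking that a header was really attached. A small sketch under that assumption; the User-Agent value is shortened here for illustration, and no network call is made.

```python
from urllib import request

# Build the request object without sending it.
req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)')

# urllib stores header names via str.capitalize(), so the key becomes 'User-agent'.
print(req.get_method())
print(req.has_header('User-agent'))
print(req.get_header('User-agent'))
```

Looking the header up as 'User-agent' rather than 'User-Agent' matters only for these inspection helpers; the header is still sent to the server as a normal User-Agent header.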

Post

To send a request with POST, just pass the data parameter as bytes.
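The effect of the data parameter can be seen without sending anything: a Request with no data reports GET, and one with data reports POST. A minimal sketch, where the URL and the payload are placeholders chosen for illustration:

```python
from urllib import request

url = 'http://www.example.com/'  # placeholder URL; nothing is sent

get_req = request.Request(url)                     # no data: GET
post_req = request.Request(url, data=b'key=value')  # bytes data: POST

print(get_req.get_method())
print(post_req.get_method())
```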

Let's simulate a Weibo login: first read the login email and password, then pass them encoded as username=xxx&password=xxx, the format used by the weibo.cn login page:

    from urllib import request, parse

    print('Login to weibo.cn...')
    email = input('Email: ')
    passwd = input('Password: ')
    # urlencode builds the key=value&key=value form body and percent-encodes values
    login_data = parse.urlencode([
        ('username', email),
        ('password', passwd),
        ('entry', 'mweibo'),
        ('client_id', ''),
        ('savestate', '1'),
        ('ec', ''),
        ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
    ])

    req = request.Request('https://passport.weibo.cn/sso/login')
    req.add_header('Origin', 'https://passport.weibo.cn')
    req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
    req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

    # Passing data (as bytes) turns the request into a POST
    with request.urlopen(req, data=login_data.encode('utf-8')) as f:
        print('Status:', f.status, f.reason)
        for k, v in f.getheaders():
            print('%s: %s' % (k, v))
        print('Data:', f.read().decode('utf-8'))
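To see what the encoding step produces on its own, the sketch below runs parse.urlencode on two fields and converts the result to bytes. The credentials are made-up values for illustration, and `parse_qs` is used only to show that the encoding round-trips.

```python
from urllib import parse

# Hypothetical credentials, for illustration only.
fields = [('username', 'user@example.com'), ('password', 'p@ss word')]

encoded = parse.urlencode(fields)  # str: special characters are percent-encoded
body = encoded.encode('utf-8')     # bytes: the form expected by the data parameter

print(encoded)
print(parse.parse_qs(encoded))     # decodes back to the original values
```

Characters such as `@` and spaces are escaped, so user input cannot break the key=value&key=value structure of the body.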

If the login succeeds, the response looks like this:

    Status: 200 OK
    Server: nginx/1.2.0
    ...
    Set-Cookie: SSOLoginState=1432620126; path=/; domain=weibo.cn
    ...
    Data: {"retcode":20000000,"msg":"","data":{...,"uid":"1658384301"}}

If the login fails, the response looks like this:

    ...
    Data: {"retcode":50011015,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef","data":{"username":"example@python.org","errline":536}}
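A script would normally parse this JSON body instead of printing it. The sketch below is one possible way to do so; it assumes that retcode 20000000 signals success, as in the two responses shown, and the helper name `login_ok` is invented here. The sample bodies are modeled on those responses, with the elided success fields left out rather than guessed.

```python
import json

# Sample bodies modeled on the responses above. The raw-string prefix keeps
# the \uXXXX escapes literal so that json.loads is the one decoding them.
success = '{"retcode":20000000,"msg":"","data":{"uid":"1658384301"}}'
failure = (r'{"retcode":50011015,'
           r'"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef",'
           r'"data":{"username":"example@python.org","errline":536}}')


def login_ok(body):
    """Return (ok, message); assumes retcode 20000000 means success."""
    result = json.loads(body)
    return result['retcode'] == 20000000, result['msg']


print(login_ok(success))
print(login_ok(failure))
```

json.loads decodes the escaped message into readable Chinese text, here a notice that the username or password is wrong.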

Handler

If more elaborate control is needed, such as accessing a site through a proxy, we use ProxyHandler. Sample code:

    import urllib.request

    # Route http:// requests through the given proxy
    proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
    # Supply credentials if the proxy asks for basic authentication
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
    opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
    with opener.open('http://www.example.com/login.html') as f:
        pass
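If every later call to urlopen should go through the same proxy, the opener can be installed as the process-wide default. This is a sketch of that option, not something the original example does; the proxy address is the same placeholder as above, and no request is actually sent.

```python
import urllib.request

# Placeholder proxy address; building the opener does not contact it.
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
opener = urllib.request.build_opener(proxy_handler)

# After this call, plain urllib.request.urlopen() uses this opener too.
urllib.request.install_opener(opener)

print(type(opener).__name__)
```

Calling opener.open directly, as in the example above, keeps the proxy setting local to that opener, which is usually the safer choice inside library code.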

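Summary

urllib lets a program issue all kinds of HTTP requests. To simulate a browser for a specific task, the request has to be disguised as one coming from a browser. The way to do that is to first observe the requests the browser sends, then imitate its request headers; the User-Agent header is the one that identifies the browser.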
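To see why the disguise is needed at all, it helps to look at the User-Agent urllib sends by default. A minimal sketch; the replacement value is a made-up browser-like string used only for illustration.

```python
import urllib.request

opener = urllib.request.build_opener()

# Keep a copy of the defaults: a single User-agent header naming Python-urllib.
default_headers = list(opener.addheaders)
print(default_headers)

# Replace the default so requests made through this opener carry a browser-like value.
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (hypothetical browser string)')]
print(opener.addheaders)
```

A server that filters on User-Agent can easily recognize the default value as a script, which is why the examples above override it.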

