[Learning Web Scraping Together] PyQuery in Detail

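Recap

The previous article introduced the Beautiful Soup library, which lets us scrape data without writing complicated regular expressions. You may find Beautiful Soup awkward to use, though: its syntax is verbose and hard to remember. This article introduces PyQuery, a flexible and powerful HTML parsing library.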

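What is PyQuery

If you are familiar with jQuery syntax, PyQuery is an excellent choice for scraping: its API is modeled on jQuery, so most of what you already know carries over directly.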

Installing PyQuery

pip install pyquery

Using PyQuery

Unless noted otherwise, the examples below all use the following string:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''

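(1) Initializing from a string

from pyquery import PyQuery as pq
doc = pq(html)  # build a PyQuery object directly from the string
print(doc('li'))  # the argument is a CSS selector: prefix a class with '.', an id with '#', and write a tag name as-is
print(doc('.item-0')[0].text)  # text of the first element whose class is item-0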

Output:

<li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     
first item

(2) Initializing from a URL

from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com')  # pass a URL; PyQuery requests the page and parses the returned HTML
print(doc('head'))

Output:

<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下,你就知道</title></head> 

(3) Initializing from a file

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')  # path to a local HTML file
print(doc('li'))

Basic CSS selectors

from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))  # '#' selects by id, '.' by class, a bare name by tag; a space means the selector on the right is matched inside the one on the left

Output:

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

(1) Finding descendant elements

find() searches all descendants of the current selection, not only its direct children.

from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')
print(type(lis))
print(lis)

Output:

<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>

<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

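(2) Direct children:

from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
lis = items.children('.active')  # the selector is an optional second filter; children() with no argument returns every direct child
print(type(lis))
print(lis)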

(3) Parent element

from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
container = items.parent()  # the direct parent of the .list element
print(type(container))
print(container)

Output:

<class 'pyquery.pyquery.PyQuery'>
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

To get all ancestor nodes, use parents():

from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
parents = items.parents()
print(type(parents))
print(parents)

Output:

<class 'pyquery.pyquery.PyQuery'>
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div><div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>

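You can also pass a CSS selector to filter the ancestors:

parent = items.parents('.wrap')
print(parent)

This prints only the first of the two results above, the div whose class is wrap.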

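Sibling elements

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')  # no space between .item-0 and .active: match elements that carry both classes; only one element qualifies
print(li.siblings())

The output is the other four sibling li tags.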

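Iteration

from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()  # items() returns a generator of PyQuery objects
print(type(lis))
for li in lis:
    print(li)  # each li is a PyQuery object, so further PyQuery calls can be made on it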

Getting information

(1) Getting attribute values

from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.attr('href'))  # method-call form
print(a.attr.href)  # attribute-access form, same result

Output:

<a href="link3.html"><span class="bold">third item</span></a>
link3.html
link3.html

(2) Getting text

from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())  # text content with all tags stripped

Output:

<a href="link3.html"><span class="bold">third item</span></a>
third item

(3) Getting HTML

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())  # inner HTML of the element, without the li tag itself

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>

DOM operations

(1) addClass and removeClass

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')  # remove the class
print(li)
li.addClass('active')  # add it back
print(li)

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
             
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

DOM operations amount to manipulating attributes, CSS styles, classes, and so on.

(2) Adding attributes with attr and styles with css

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')  # add a new attribute name/value pair
print(li)
li.css('font-size', '14px')  # add a CSS property, stored in the style attribute
print(li)

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
             
<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>

(3) Removing elements

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()  # drop the <p> so that only "Hello, World" remains
print(wrap.text())

Output:

Hello, World This is a paragraph.
Hello, World

(4) Pseudo-class selectors

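from pyquery import PyQuery as pq
doc = pq(html)  # html must be the list sample from the start of the article, not the .wrap snippet from (3)
li = doc('li:first-child')  # the first li
print(li)
li = doc('li:last-child')  # the last li
print(li)
li = doc('li:nth-child(2)')  # the second li
print(li)
li = doc('li:gt(2)')  # li elements whose zero-based index is greater than 2
print(li)
li = doc('li:nth-child(2n)')  # li elements at even positions
print(li)
li = doc('li:contains(second)')  # li elements whose text contains "second"
print(li)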

(5) Other
