Scrapy: Crawling dushu.com Data with the Link Extractor (Part 13)

2023-12-14 11:03:04

Goal: extract the names and cover images of books across multiple pages when the total number of pages is unknown.

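1. Introduction to CrawlSpider

CrawlSpider:
1. Inherits from scrapy.Spider.

2. CrawlSpider lets you define rules. While parsing HTML, it extracts the links that match those rules and then sends requests to them. So whenever a crawl has to follow links, that is, after fetching a page you need to extract links from it and crawl those as well, CrawlSpider is a good fit.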

3. Extracting links

The link extractor is where you write the rules that select which links to extract:

scrapy.linkextractors.LinkExtractor(
    allow=(),            # (common) regular expressions; extract links that match
    deny=(),             # regular expressions; skip links that match
    allow_domains=(),    # domains to extract links from
    deny_domains=(),     # domains to skip
    restrict_xpaths=(),  # (common) XPath; extract links only from the matching regions
    restrict_css=()      # CSS selectors; extract links only from the matching regions
)

4. Example usage

    Regex: link1 = LinkExtractor(allow=r'list_23_\d+\.html')

    XPath: link2 = LinkExtractor(restrict_xpaths=r'//div[@id="d"]')

    CSS: link3 = LinkExtractor(restrict_css='#x')

5. Running the extraction

    link.extract_links(response)

2. Creating the crawler project

Create the project: scrapy startproject scrapy_101

# Note the added -t crawl option

Create the spider file: scrapy genspider -t crawl read www.dushu.com

3. Crawling dushu.com and parsing the data

Define the item in items.py:

import scrapy


class Scrapy101Item(scrapy.Item):
    # Book title and cover image URL
    name = scrapy.Field()
    src = scrapy.Field()

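The code in read.py follows.

Note that rules defines which links get extracted and followed. Here a regular expression does the selection (an XPath or CSS restriction would also work).

Every list page URL follows the pattern "... 1090_<page number> ...", so the regular expression is

r'https://www.dushu.com/book/1090_\d+\.html'

start_urls must be the first-page URL in the form that matches this regular expression (this is important; otherwise the data on the first page is not extracted).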
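The effect of this pattern can be checked with the standard re module alone. The sketch below assumes that the extractor tests each URL with a regex search, and the un-paginated URL /book/1090.html is a hypothetical example of an address that lacks the _<page> suffix.

```python
import re

# Same pattern as in the article.
PAGE_PATTERN = re.compile(r'https://www.dushu.com/book/1090_\d+\.html')


def is_list_page(url: str) -> bool:
    """Return True when the URL looks like a numbered list page."""
    return PAGE_PATTERN.search(url) is not None


candidates = [
    'https://www.dushu.com/book/1090_1.html',   # first page, numbered form
    'https://www.dushu.com/book/1090_25.html',  # a later page
    'https://www.dushu.com/book/1090.html',     # hypothetical: no page number
]
for url in candidates:
    print(url, is_list_page(url))
```

Only the URLs with the _<page> suffix match, which is why the numbered form of the first page is the one to use.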

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from scrapy_101.items import Scrapy101Item


class ReadSpider(CrawlSpider):
    name = 'read'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1090_1.html']

    rules = (
        # Follow every numbered list page and pass each response to parse_item
        Rule(LinkExtractor(allow=r'https://www.dushu.com/book/1090_\d+\.html'),
             callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        img_list = response.xpath('//div[@class="bookslist"]//li//img')
        for img in img_list:
            # Relative XPath: read the attributes of the current <img> only
            name = img.xpath('./@alt').get()
            src = img.xpath('./@data-original').get()

            book = Scrapy101Item(name=name, src=src)
            yield book

Enable the pipeline in settings.py. Setting LOG_FILE = 'XXX.log' in settings.py stops the run log from being printed to the terminal; it is written to that log file instead.
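The two settings might look like the fragment below. It assumes the default layout produced by scrapy startproject scrapy_101, so the pipeline path is scrapy_101.pipelines.Scrapy101Pipeline; 300 is the priority used in Scrapy's project template, and XXX.log is a placeholder file name.

```python
# settings.py (fragment)

# Enable the item pipeline; lower numbers run earlier when several are listed.
ITEM_PIPELINES = {
    "scrapy_101.pipelines.Scrapy101Pipeline": 300,
}

# Send log output to a file instead of the terminal (placeholder name).
LOG_FILE = "XXX.log"
```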

Write the data in pipelines.py:

class Scrapy101Pipeline:
    # Runs once, before the spider starts
    def open_spider(self, spider):
        print('++++++++++++++++++++++++++')
        self.fp = open('book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    # Runs once, after the spider has finished
    def close_spider(self, spider):
        print('----------------------')
        self.fp.close()
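Because str(item) writes a Python-style representation, the resulting book.json is not valid JSON. If machine-readable output matters, one possible variant is to write one JSON object per line. The sketch below is an alternative, not the article's pipeline: the class name JsonLinesBookPipeline, the path parameter and the book.jsonl file name are all assumptions made for illustration.

```python
import json


class JsonLinesBookPipeline:
    """Hypothetical alternative pipeline: one JSON object per line."""

    def __init__(self, path="book.jsonl"):
        self.path = path  # output file name is an assumption
        self.fp = None

    def open_spider(self, spider):
        # Runs once, before the spider starts
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and plain dicts alike;
        # ensure_ascii=False keeps Chinese titles readable in the file.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(line + "\n")
        return item

    def close_spider(self, spider):
        # Runs once, after the spider has finished
        self.fp.close()
```

To use a class like this it would have to be registered in ITEM_PIPELINES in the same way as Scrapy101Pipeline.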

Reference

尚硅谷 (Atguigu) Python crawler tutorial: a fast track for complete beginners (Python basics plus crawler case studies)

Source: https://blog.csdn.net/m0_45447650/article/details/134460670