[Advanced] [Python Web Crawling] [16. Crawler Frameworks] Scrapy deep crawling (with plenty of example code) (worth bookmarking)

2024-01-02 12:09:47

I. Scrapy deep crawling

1. How to crawl multiple pages of data (full-site crawling)

  • Send the follow-up requests manually:
# callback specifies the parsing method to use
yield scrapy.Request(url=new_url, callback=self.parse)

2. How to crawl deeply stored data

  • "Depth" simply means that the target data does not all live on the same page.
  • The request meta-passing mechanism is required to scrape it completely.
# Pass data along with the request
yield scrapy.Request(meta={}, url=detail_url, callback=self.parse_detail)

# The meta dict is handed to the callback function specified by callback
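Besides meta, newer Scrapy versions (1.7+) also accept cb_kwargs, which passes keyword arguments straight into the callback. Below is a minimal sketch of that alternative; the URLs, selectors and field names are placeholders, not part of the original case:
import scrapy

# Sketch only: request passing with cb_kwargs instead of meta (Scrapy 1.7+).
class CbKwargsDemoSpider(scrapy.Spider):
    name = "cbkwargs_demo"
    start_urls = ["https://example.com/list"]           # placeholder URL

    def parse(self, response):
        for a in response.xpath('//a[@class="item"]'):  # placeholder selector
            item = {'title': a.xpath('./text()').get()}
            detail_url = response.urljoin(a.xpath('./@href').get())
            # cb_kwargs hands the dict to parse_detail as named arguments
            yield scrapy.Request(url=detail_url, callback=self.parse_detail,
                                 cb_kwargs={'item': item})

    def parse_detail(self, response, item):
        # 'item' arrives as a normal parameter; no response.meta lookup needed
        item['content'] = ''.join(response.xpath('//text()').getall()).strip()
        yield item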
Case study - scrapy multi-page crawling
settings.py
BOT_NAME = "deepPro"

SPIDER_MODULES = ["deepPro.spiders"]
NEWSPIDER_MODULE = "deepPro.spiders"

ROBOTSTXT_OBEY = False

LOG_LEVEL = 'ERROR'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
items.py
import scrapy

class DeepproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    state = scrapy.Field()
    detail_url = scrapy.Field()
    content = scrapy.Field()
spiders
deep.py
import scrapy

from ..items import DeepproItem


class DeepSpider(scrapy.Spider):
    name = "deep"
    # allowed_domains = ["www.xxx.com"]
    start_urls = ["https://wz.sun0769.com/political/index/politicsNewest"]
    # Generic URL template
    url_model = 'https://wz.sun0769.com/political/index/politicsNewest?id=1&page=%d'
    # Page number
    page_num = 2

    def parse(self, response):
        li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
        for li in li_list:
            title = li.xpath('./span[3]/a/text()').extract_first()
            state = li.xpath('./span[2]/text()').extract_first().strip()
            detail_url = 'https://wz.sun0769.com' + li.xpath('./span[3]/a/@href').extract_first().strip()

            # Create an item object
            item = DeepproItem(title=title, state=state, detail_url=detail_url)

            # Request meta-passing: meta hands a dict to the callback specified by callback, so this item object reaches that function
            # Send a manual GET request for the detail page URL
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})

        if self.page_num <= 5:
            print(f'###################### Crawling page {self.page_num} ######################')
            new_url = self.url_model % self.page_num
            self.page_num += 1
            yield scrapy.Request(url=new_url, callback=self.parse)

    def parse_detail(self, response):
        # Parse the detail page's source for the data we need
        content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]//text()').extract()
        content = ''.join(content).strip()

        # Receive the dict passed along via request meta
        dic_meta = response.meta
        item = dic_meta['item']
        item['content'] = content
        print(item)
        yield item

II. How to improve Scrapy's crawling efficiency

# Increase concurrency:
    By default Scrapy runs 16 concurrent requests; this can be raised. In the settings file set CONCURRENT_REQUESTS = 100 to allow 100 concurrent requests.
# Lower the log level:
    Running Scrapy produces a large amount of log output. To reduce CPU usage, restrict logging to WARNING or ERROR messages by adding LOG_LEVEL = 'ERROR' to the settings file.
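Put together, the relevant lines in settings.py look roughly like this (the values are examples, tune them for the target site):
# settings.py -- example values only
CONCURRENT_REQUESTS = 100   # default is 16; raise it for more concurrency
LOG_LEVEL = 'ERROR'         # only log errors, cutting log output and CPU usage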

III. Sending POST requests with Scrapy

[Note] By default, start_requests() sends GET requests to the start URLs; to send POST requests instead, the subclass has to override this method.

  • yield scrapy.Request() : sends a GET request
  • yield scrapy.FormRequest() : sends a POST request
import scrapy

class PostdemoSpider(scrapy.Spider):
    name = "postDemo"
    # allowed_domains = ["www.xxx.com"]

    # Have Scrapy send POST requests to the URLs listed in start_urls
    start_urls = ["https://fanyi.baidu.com/sug"]

    # This method is already defined by Scrapy; when the project runs, Scrapy calls it to send the initial requests
    def start_requests(self):
        for url in self.start_urls:
            # FormRequest sends a POST request
            yield scrapy.FormRequest(url=url, callback=self.parse, formdata={'kw': 'dog'})  # formdata holds the POST parameters

    def parse(self, response):
        # Parse the data
        ret = response.json()
        print(ret)
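If the target interface expects a JSON body rather than form data, Scrapy also ships a JsonRequest class. A minimal sketch follows; the endpoint and payload are placeholders, not the interface from the example above:
import scrapy
from scrapy.http import JsonRequest

# Sketch only: POST with a JSON body via JsonRequest
class JsonPostSpider(scrapy.Spider):
    name = "jsonPost"

    def start_requests(self):
        yield JsonRequest(
            url="https://httpbin.org/post",   # placeholder endpoint that echoes the request
            data={"kw": "dog"},               # serialized into the JSON request body
            callback=self.parse,
        )

    def parse(self, response):
        print(response.json())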

IV. Scrapy's core components

These components give a rough picture of how the Scrapy framework works.

# - Engine (Scrapy)
    Handles the data flow for the whole system and triggers events (the core of the framework).
# - Scheduler
    Accepts requests from the engine and pushes them onto a queue, returning them when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL to fetch next and removes duplicate URLs.
# - Downloader
    Downloads page content and returns it to the spiders (the downloader is built on twisted, an efficient asynchronous networking framework).
# - Spiders
    Spiders do the main work: they extract the required information (the so-called items) from specific pages. They can also extract links so that Scrapy keeps crawling the following pages.
# - Item Pipeline
    Processes the items the spiders extract from pages; its main jobs are persisting items, validating them and cleaning out unwanted data. After a page is parsed by a spider, its items are sent to the pipeline and pass through several processing steps in a fixed order.
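Most of these components are wired together through settings.py. A rough sketch of where each one shows up in configuration; the class paths below are placeholders, not from this project:
# settings.py -- sketch only; class paths are placeholders
ITEM_PIPELINES = {                       # item pipelines: persist / validate / clean items
    "myproject.pipelines.MyPipeline": 300,
}
DOWNLOADER_MIDDLEWARES = {               # hooks between the engine and the downloader
    "myproject.middlewares.MyDownloaderMiddleware": 543,
}
SPIDER_MIDDLEWARES = {                   # hooks between the engine and the spiders
    "myproject.middlewares.MySpiderMiddleware": 543,
}
# The engine, scheduler and downloader themselves come with sensible defaults
# and normally do not need to be replaced.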

V. Middleware

Scrapy has two kinds of middleware:

  • Spider middleware
  • Downloader middleware

What is middleware for?

  • Look at where the middleware sits among the five core components; its position tells you what it does
    • Downloader middleware sits between the engine and the downloader
    • The engine passes request objects to the downloader, and the downloader hands response objects back to the engine.
    • Role: it can intercept every request and response that flows through the Scrapy framework.
      • What can you do with an intercepted request?
        • Change the request's IP (proxy), modify its headers, set its cookies
      • What can you do with an intercepted response?
        • Modify the response data (see the sketch right after this list)
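As a concrete illustration of "modifying the response data": process_response may return a brand-new response object to replace the one the downloader produced. A minimal sketch; the URL check and the fixed_html content are placeholders:
from scrapy.http import HtmlResponse

# Sketch only: replace the downloaded response inside a downloader middleware.
def process_response(self, request, response, spider):
    # urls_to_fix is an assumed attribute on the spider listing pages to rewrite
    if request.url in getattr(spider, 'urls_to_fix', []):
        fixed_html = '<html><body>corrected page source</body></html>'  # placeholder content
        return HtmlResponse(url=request.url, body=fixed_html,
                            encoding='utf-8', request=request)
    return response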

1. Writing a proxy middleware

import random

# Downloader middleware: sits between the engine and the downloader
class MiddleproDownloaderMiddleware:
    # Intercepts requests
    # request : the intercepted request object
    # spider  : the spider instance from the spider file (lets the spider and the middleware exchange data)
    def process_request(self, request, spider):
        print('i am process_request()')
        print('URL of the intercepted request:', request.url)
        # request.headers is a dict-like object of request headers
        print('headers of the intercepted request:', request.headers)
        # Set the request UA
        request.headers['User-Agent'] = 'xxxx'
        # Set the cookie
        request.headers['Cookie'] = 'xxxx'
        # Fetch the proxy pool defined in the spider file
        proxy_list = spider.proxy_list
        print(proxy_list)
        # Assign a random proxy to the intercepted request
        request.meta['proxy'] = random.choice(proxy_list)

    def process_response(self, request, response, spider):
        print('i am process_response()')
        return response

    # Intercepts failed requests
    # request   : the failed request object
    # exception : the exception information
    def process_exception(self, request, exception, spider):
        print('this request failed:', request.url)
        # Assign a replacement proxy to the failed request
        request.meta['proxy'] = 'https://ip:port'
        # Returning the request re-schedules it to be sent again
        return request
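The middleware above only takes effect after it is enabled in settings.py. A minimal sketch, assuming the project is named middlePro (the project name is an assumption; 543 is the priority Scrapy generates by default):
# settings.py -- project name "middlePro" is assumed here
DOWNLOADER_MIDDLEWARES = {
    "middlePro.middlewares.MiddleproDownloaderMiddleware": 543,
}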

2. Writing a UA middleware

# request.headers['User-Agent'] = ua
def process_request(self, request, spider):
    request.headers['User-Agent'] = 'a UA value picked at random from a list'
    print(request.url + ': request intercepted successfully!')
    return None
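A minimal sketch of what "picked at random from a list" might look like; the UA strings below are truncated placeholders, not real values:
import random

# Placeholder UA pool -- fill in complete, real browser User-Agent strings
ua_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
]

def process_request(self, request, spider):
    # choose a different UA for every outgoing request
    request.headers['User-Agent'] = random.choice(ua_list)
    return None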

3. Writing a cookie middleware

def process_request(self, request, spider):
    request.headers['cookie'] = 'xxx'
    # request.cookies = 'xxx'
    print(request.url + ': request intercepted successfully!')
    return None
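request.cookies also accepts a dict, which is usually cleaner than stuffing the raw Cookie header. A minimal sketch with placeholder values; note that this middleware has to be ordered before the built-in CookiesMiddleware (default priority 700) for the dict to be picked up:
def process_request(self, request, spider):
    # cookies as a dict (placeholder values); merged into the request by the
    # built-in CookiesMiddleware, provided this middleware runs before it
    request.cookies = {'sessionid': 'xxx', 'token': 'xxx'}
    return None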

4. What the spider parameter in middleware is for

import scrapy

class MiddleSpider(scrapy.Spider):
    name = "middle"
    allowed_domains = ["www.xxx.com"]
    start_urls = ["https://www.baidu.com", "https://www.sougou.com", "https://www.jd.com"]

    # Proxy pool
    proxy_list = [
        'https://ip:port', 'https://ip:port', 'https://ip:port', 'https://ip:port'
    ]  # this proxy pool is meant to be used inside the middleware

    def parse(self, response):
        pass
Case study - qd_01_quotes
items.py
import scrapy
"""
Item definition module: the data structures passed around the whole project are defined here.
A data structure defined in Scrapy's items.py is a dict-like object.
"""

class Qd01QuotesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
middlewares.py
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter

"""
headers proxies cookies
Middleware module: handles anti-scraping measures
"""
class Qd01QuotesSpiderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)


class Qd01QuotesDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
pipelines.py
from itemadapter import ItemAdapter
"""
Data pipeline (saving data, de-duplication); every item flows through the pipeline
"""
class QuotesPipeline:

    def process_item(self, item, spider):
        # Each item yielded by the spider is received by this method, one at a time
        print('item received:', item)

        d = dict(item)  # item is a dict-like data structure
        with open('quotes.csv', mode='a', encoding='utf-8') as f:
            f.write(d['text'] + ',' + d['author'] + ',' + '|'.join(d['tags']))
            f.write('\n')
        return item
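Opening the file once per item works, but Scrapy pipelines also offer open_spider / close_spider hooks, so a slightly more idiomatic variant could keep the file handle open for the whole crawl (same CSV layout assumed):
class QuotesPipeline:
    # Sketch: open the file once when the spider starts, close it when it finishes
    def open_spider(self, spider):
        self.f = open('quotes.csv', mode='a', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        d = dict(item)
        self.f.write(d['text'] + ',' + d['author'] + ',' + '|'.join(d['tags']) + '\n')
        return item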
settings.py
Changes made in the configuration file:
  • Ignore the robots protocol: ROBOTSTXT_OBEY = False
  • Set the log level: LOG_LEVEL = 'ERROR'
  • Set the UA: USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36'
""" Configuration file for the whole Scrapy project """
BOT_NAME = "qd_01_quotes"
SPIDER_MODULES = ["qd_01_quotes.spiders"]
NEWSPIDER_MODULE = "qd_01_quotes.spiders"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # robots protocol

# Log level for output
LOG_LEVEL = 'ERROR'

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # Enable the pipeline class; it is what saves the data
    # The number is the priority weight: the smaller the value, the higher the priority
    "qd_01_quotes.pipelines.QuotesPipeline": 299,
}

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
spiders
quotes.py
import scrapy
"""
Spider module:
    1. Collect the URLs to scrape
    2. Parse the data and return it
"""
# Inherits from the Scrapy spider base class
class QuotesSpider(scrapy.Spider):
    # Spider name, created by scrapy genspider; needed later when starting the crawl
    name = "quotes"

    # Domains the spider may crawl; the allowed scope is a list
    allowed_domains = []

    # Start URLs, generated by genspider; adjust them afterwards
    # URLs in this list are requested by the framework automatically
    # If the target URLs follow a pattern, a list comprehension can build the list
    start_urls = ["http://quotes.toscrape.com/"]

    # By default, every URL in start_urls is handled by the method below
    def parse(self, response):
        # Every Scrapy parse callback must take a response parameter
        # response combines the response body, the request (response.request) and a parsel.Selector (css + xpath + re)
        # print(response.text)

        divs = response.css('.quote')
        for div in divs:
            text = div.css('.text::text').get()
            author = div.css('.author::text').get()
            tags = div.css('.tags a::text').getall()

            # If the yielded data is a dict, the framework handles it automatically
            # In a Scrapy spider, all data is returned with yield
            # Return the records one by one inside the loop
            yield {
                'text': text,
                'author': author,
                'tags': tags,
            }

"""
启动项目指令: 
    1.终端进入项目目录
    2.scrapy crawl +(爬虫文件名字)
"""
"""反扒、保存、底层调度顺序"""
quotes_items.py
import scrapy
# Relative import: go up one package level from this file
from ..items import Qd01QuotesItem

class QuotesItemsSpider(scrapy.Spider):
    name = "quotes_items"
    allowed_domains = ["toscrape.com"]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        divs = response.css('.quote')
        for div in divs:
            text = div.css('.text::text').get()
            author = div.css('.author::text').get()
            tags = div.css('.tags a::text').getall()
            # yield {
            #     'text': text,
            #     'author': author,
            #     'tags': tags,
            # }
            
            # Return records one by one using the predefined item structure
            yield Qd01QuotesItem(text=text, author=author, tags=tags)
quotes_next.py
import scrapy
from ..items import Qd01QuotesItem

class QuotesNextSpider(scrapy.Spider):
    name = "quotes_next"
    allowed_domains = ["toscrape.com"]
    # Pagination, option 1
    # start_urls = [f"https://quotes.toscrape.com/page/{page}/" for page in range(1, 10)]
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        divs = response.css('.quote')
        for div in divs:
            text = div.css('.text::text').get()
            author = div.css('.author::text').get()
            tags = div.css('.tags a::text').getall()

            # Return records one by one using the predefined item structure
            yield Qd01QuotesItem(text=text, author=author, tags=tags)

        # Pagination, option 2
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            all_url = 'https://quotes.toscrape.com' + next_page  # build the full URL

            # Build the request manually
            # callback: pass in a callback so the framework knows who handles this request
            yield scrapy.Request(url=all_url, callback=self.parse)
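Instead of concatenating the domain by hand, response.follow accepts the relative href directly. A minimal sketch of the same pagination step:
        # Sketch: response.follow resolves the relative href against the current page,
        # so no manual string concatenation is needed
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)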
Case study - qd_02_fuliba
items.py
import scrapy

class Qd02FulibaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # article title
    put_time = scrapy.Field()  # publish time
    reads = scrapy.Field()  # read count
    stars = scrapy.Field()  # like count
    info = scrapy.Field()  # summary
middlewares.py
from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class Qd02FulibaSpiderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)


class Qd02FulibaDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
pipelines.py
from itemadapter import ItemAdapter

class Qd02FulibaPipeline:
    def process_item(self, item, spider):
        d = dict(item)  # item is a dict-like data structure
        with open('quotes.csv', mode='a', encoding='utf-8') as f:
            f.write(d['title'] + ',' + d['put_time'] + ',' + d['reads'] + ',' + d['stars'] + ',' + d['info'])
            f.write('\n')
        return item
settings.py
BOT_NAME = "qd_02_fuliba"

SPIDER_MODULES = ["qd_02_fuliba.spiders"]
NEWSPIDER_MODULE = "qd_02_fuliba.spiders"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    "qd_02_fuliba.pipelines.Qd02FulibaPipeline": 300,
}

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
spiders
fuliba.py
import scrapy
from ..items import Qd02FulibaItem

class FulibaSpider(scrapy.Spider):
    name = "fuliba"
    allowed_domains = ["fuliba2023.net"]
    start_urls = [f"https://fuliba2023.net/page/{page}" for page in range(1, 162)]

    def parse(self, response):
        # print(response.text)
        articles = response.css('.content article')

        for art in articles:
            title = art.css('h2>a::text').get()  # article title
            put_time = art.css('.meta>time::text').get()  # publish time
            reads = art.css('.pv::text').get()  # read count
            stars = art.css('.post-like>span::text').get()  # like count
            info = art.css('.note::text').get()  # summary
            yield Qd02FulibaItem(title=title, put_time=put_time, reads=reads,
                                 stars=stars, info=info)
Case study - qd_03_english
items.py
import scrapy

class Qd03EnglishItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
    img_url = scrapy.Field()
middlewares.py
from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class Qd03EnglishSpiderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)


class Qd03EnglishDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
pipelines.py
from itemadapter import ItemAdapter

class Qd03EnglishPipeline:
    def process_item(self, item, spider):
        with open('english.csv', mode='a', encoding='utf-8') as f:
            f.write(item['title'] + ',' + item['info'] + ',' + item['img_url'])
            f.write('\n')
        return item
settings.py
BOT_NAME = "qd_03_english"

SPIDER_MODULES = ["qd_03_english.spiders"]
NEWSPIDER_MODULE = "qd_03_english.spiders"

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    "qd_03_english.pipelines.Qd03EnglishPipeline": 300,
}

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
spiders
english.py
import scrapy
from ..items import Qd03EnglishItem

class EnglishSpider(scrapy.Spider):
    name = "english"
    allowed_domains = ["chinadaily.com.cn"]

    # By default, URLs placed in start_urls are requested by the framework automatically
    # start_urls = [f"https://language.chinadaily.com.cn/thelatest/page_{page}.html" for page in range(1, 11)]
    # Build the requests manually
    # by overriding the framework's built-in start_requests method
    def start_requests(self):
        for page in range(1, 11):
            # scrapy.Request builds GET requests only
            yield scrapy.Request(
                url=f"https://language.chinadaily.com.cn/thelatest/page_{page}.html",
                callback=self.parse
            )
            
    # Callback function
    def parse(self, response):
        # print(response.text)
        divs = response.css('.gy_box')

        for div in divs:
            title = div.css('.gy_box_txt2>a::text').get()
            info = div.css('.gy_box_txt3>a::text').get()  # summary
            if info:
                info = info.strip()

            img_url = 'https:' + div.css('.gy_box_img>img::attr(src)').get()
            yield Qd03EnglishItem(title=title, info=info, img_url=img_url)

# post
# Form data
# json data

Source: https://blog.csdn.net/weixin_43612602/article/details/135334520