[Advanced] [Python Web Crawling] [16. Crawler Frameworks] Scrapy deep crawling (with plenty of example code) (worth bookmarking)
2024-01-02 12:09:47
Python Web Crawling
I. Scrapy deep crawling
1. How to crawl multiple pages of data (site-wide crawling)
- Send the follow-up requests manually:
# callback specifies the parsing method
yield scrapy.Request(url=new_url, callback=self.parse)
2. How to crawl data stored at depth
- What does "depth" mean? Simply put, the data you want does not all live on the same page.
- You must use the request meta-passing mechanism to collect it completely.
# Passing data along with a request
yield scrapy.Request(meta={}, url=detail_url, callback=self.parse_detail)
# The meta dict is handed to the callback function
Case study - multi-page crawling with Scrapy
settings.py
BOT_NAME = "deepPro"
SPIDER_MODULES = ["deepPro.spiders"]
NEWSPIDER_MODULE = "deepPro.spiders"
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
items.py
import scrapy
class DeepproItem(scrapy.Item):
# define the fields for your item here like:
title = scrapy.Field()
state = scrapy.Field()
detail_url = scrapy.Field()
content = scrapy.Field()
spiders
deep.py
import scrapy
from ..items import DeepproItem
class DeepSpider(scrapy.Spider):
name = "deep"
# allowed_domains = ["www.xxx.com"]
start_urls = ["https://wz.sun0769.com/political/index/politicsNewest"]
# Generic URL template
url_model = 'https://wz.sun0769.com/political/index/politicsNewest?id=1&page=%d'
# Page number (next page to request)
page_num = 2
def parse(self, response):
li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
for li in li_list:
title = li.xpath('./span[3]/a/text()').extract_first()
state = li.xpath('./span[2]/text()').extract_first().strip()
detail_url = 'https://wz.sun0769.com' + li.xpath('./span[3]/a/@href').extract_first().strip()
# Create an item object
item = DeepproItem(title=title, state=state, detail_url=detail_url)
# Request meta passing: hand this item to the specified callback; meta lets you pass a dict to the callback function
# Send a manual GET request to the detail page URL
yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})
if self.page_num <= 5:
print(f'###################### crawling page {self.page_num} ######################')
new_url = self.url_model % self.page_num
self.page_num += 1
yield scrapy.Request(url=new_url, callback=self.parse)
def parse_detail(self, response):
# Parse the page source of the detail page
content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]//text()').extract()
content = ''.join(content).strip()
# Receive the dict passed in via request meta
dic_meta = response.meta
item = dic_meta['item']
item['content'] = content
print(item)
yield item
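To run this case, open a terminal in the project root and launch the spider by its name:
scrapy crawl deep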
II. How to improve Scrapy's crawling efficiency
# Increase concurrency:
By default Scrapy performs 16 concurrent requests; this can be raised. In the settings file, set CONCURRENT_REQUESTS = 100 to allow 100 concurrent requests.
# Lower the log level:
Running Scrapy produces a large amount of log output. To reduce CPU usage, set the log level to WARNING or ERROR. In the settings file: LOG_LEVEL = 'ERROR'
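A minimal settings.py sketch covering just the two tweaks described above (the concrete numbers are illustrative, not required values):
# settings.py
CONCURRENT_REQUESTS = 100  # raise concurrency from the default of 16
LOG_LEVEL = 'ERROR'        # only log errors, reducing console output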
III. Sending POST requests with Scrapy
[Note] The default implementation of start_requests() sends GET requests to the start URLs; to send POST requests instead, the subclass must override this method.
yield scrapy.Request(): sends a GET request
yield scrapy.FormRequest(): sends a POST request
import scrapy
class PostdemoSpider(scrapy.Spider):
name = "postDemo"
# allowed_domains = ["www.xxx.com"]
# Make Scrapy send POST requests to the elements listed in start_urls
start_urls = ["https://fanyi.baidu.com/sug"]
# This method is already defined by Scrapy; it is called when the project runs to send the initial requests
def start_requests(self):
for url in self.start_urls:
# FormRequest sends a POST request
yield scrapy.FormRequest(url=url, callback=self.parse, formdata={'kw': 'dog'}) # formdata carries the POST parameters
def parse(self, response):
# Parse the data
ret = response.json()
print(ret)
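Running this spider with scrapy crawl postDemo should print the JSON suggestion list that the Baidu sug endpoint returns for the keyword 'dog', which confirms the POST body was accepted.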
IV. Scrapy's core components
These give a rough idea of how the Scrapy framework operates.
# - Engine (Scrapy)
Handles the data flow of the whole system and triggers events (the core of the framework).
# - Scheduler
Accepts requests sent over by the engine, pushes them into a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl); it decides which URL to fetch next and also removes duplicate URLs.
# - Downloader
Downloads page content and returns it to the spiders (the downloader is built on Twisted, an efficient asynchronous networking framework).
# - Spiders
The spiders do the main work: they extract the information you need, the so-called items, from specific pages. You can also extract links from pages so that Scrapy keeps crawling the next ones.
# - Item Pipeline
Processes the items extracted by the spiders; its main jobs are persisting items, validating them, and discarding unneeded data. After a page has been parsed by a spider, its items are sent to the pipeline and processed through several steps in a defined order.
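As a rough sketch of how these components cooperate on a single request/response cycle (the spider below is illustrative and reuses quotes.toscrape.com from the cases later in this article):
import scrapy

class FlowDemoSpider(scrapy.Spider):
    # Spider: yields requests and parses responses
    name = "flow_demo"
    start_urls = ["http://quotes.toscrape.com/"]  # handed to the engine, then queued by the scheduler

    def parse(self, response):
        # The downloader fetched the page; the engine passed the response back here
        for div in response.css('.quote'):
            # Yielded items travel on to the item pipeline
            yield {'text': div.css('.text::text').get()}
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            # A new request re-enters the engine -> scheduler -> downloader cycle
            yield response.follow(next_page, callback=self.parse)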
V. Middleware
Scrapy has two kinds of middleware:
Spider middleware
Downloader middleware
What is middleware for?
- Look at where a middleware sits among the five core components; its position tells you its role.
- Downloader middleware sits between the engine and the downloader.
- The engine passes request objects to the downloader, and the downloader returns response objects to the engine.
- Role: it can intercept every request and response that passes through the Scrapy framework.
- What can you do when intercepting a request?
- Change the request's IP (proxy), modify its headers, set its cookies.
- What can you do when intercepting a response?
- Modify the response data.
1. Writing a proxy middleware
# Downloader middleware: sits between the engine and the downloader
class MiddleproDownloaderMiddleware:
# Intercept requests
# Parameter request: the intercepted request object
# Parameter spider: the spider instance from the spider file (lets the spider and the middleware exchange data)
def process_request(self, request, spider):
print('i am process_request()')
print('Intercepted request URL:', request.url)
# request.headers returns the request headers as a dict-like object
print('Intercepted request headers:', request.headers)
# Set the request User-Agent
request.headers['User-Agent'] = 'xxxx'
# Set the Cookie
request.headers['Cookie'] = 'xxxx'
# Fetch the proxy pool defined in the spider file
proxy_list = spider.proxy_list
print(proxy_list)
# Assign one of the proxies to the intercepted request
import random
request.meta['proxy'] = random.choice(proxy_list)
def process_response(self, request, response, spider):
print('i am process_response()')
return response
# Intercept failed requests
# Parameter request: the failed request object
# Parameter exception: the exception information
def process_exception(self, request, exception, spider):
print('This request failed:', request.url)
# Fix the failed request by assigning it a different proxy
request.meta['proxy'] = 'https://ip:port'
# Return value: the request object is resubmitted and sent again
return request
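For any custom downloader middleware to take effect, it has to be enabled in settings.py. A minimal sketch, assuming the project package is named middlePro (adjust the dotted path to your own project):
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "middlePro.middlewares.MiddleproDownloaderMiddleware": 543,  # lower values run closer to the engine
}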
2. Writing a User-Agent middleware
# request.headers['User-Agent'] = ua
def process_request(self, request, spider):
request.headers['User-Agent'] = 'a UA value picked at random from a list'
print(request.url + ': request intercepted successfully!')
return None
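A minimal sketch of what "picked at random from a list" could look like in practice (the User-Agent strings below are truncated placeholders; substitute real browser UA values):
import random

# Hypothetical pool of User-Agent strings
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
]

class RandomUADownloaderMiddleware:
    def process_request(self, request, spider):
        # Assign a randomly chosen User-Agent to every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
        return None  # let the request continue through the remaining middlewares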
3. Writing a Cookie middleware
def process_request(self, request, spider):
request.headers['cookie'] = 'xxx'
# request.cookies = 'xxx'  (request.cookies expects a dict, not a raw string)
print(request.url + ': request intercepted successfully!')
return None
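If you want to use request.cookies instead of the raw header, it expects a dict rather than a string. A small sketch for converting a "k1=v1; k2=v2" string copied from the browser (cookie_str below is a made-up example):
# Hypothetical cookie string copied from the browser's developer tools
cookie_str = 'sessionid=abc123; theme=dark'

# Turn 'k=v; k=v' pairs into the dict form Scrapy expects
cookies = dict(pair.split('=', 1) for pair in cookie_str.split('; '))

class CookieDownloaderMiddleware:
    def process_request(self, request, spider):
        request.cookies = cookies  # Scrapy serializes this dict into the Cookie header
        return None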
4. The role of the spider parameter in middleware
import scrapy
class MiddleSpider(scrapy.Spider):
name = "middle"
allowed_domains = ["www.xxx.com"]
start_urls = ["https://www.baidu.com", "https://www.sougou.com", "https://www.jd.com"]
# Proxy pool
proxy_list = [
'https://ip:port', 'https://ip:port', 'https://ip:port', 'https://ip:port'
] # this proxy pool is meant to be used inside the middleware
def parse(self, response):
pass
Case study - qd_01_quotes
items.py
import scrapy
"""
数据结构类:整个项目中传递的数据结构在此定义
在scrapy items.py 文件中定义的数据结构是一个类字典对象
"""
class Qd01QuotesItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
text = scrapy.Field()
author = scrapy.Field()
tags = scrapy.Field()
middlewares.py
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
"""
headers proxies cookies
中间件文件: 处理反扒
"""
class Qd01QuotesSpiderMiddleware:
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
return None
def process_spider_output(self, response, result, spider):
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
pass
def process_start_requests(self, start_requests, spider):
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info("Spider opened: %s" % spider.name)
class Qd01QuotesDownloaderMiddleware:
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
return None
def process_response(self, request, response, spider):
return response
def process_exception(self, request, exception, spider):
pass
def spider_opened(self, spider):
spider.logger.info("Spider opened: %s" % spider.name)
pipelines.py
from itemadapter import ItemAdapter
"""
数据管道<保存数据, 数据去重>, 所有的数据都会流经数据管道
"""
class QuotesPipeline:
def process_item(self, item, spider):
# The items yielded by the spider file are received, one by one, by this function
print('item received from the spider:', item)
d = dict(item) # note that item is a dict-like data structure
with open('quotes.csv', mode='a', encoding='utf-8') as f:
f.write(d['text'] + ',' + d['author'] + ',' + '|'.join(d['tags']))
f.write('\n')
return item
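Opening the file for every item works, but pipelines also provide open_spider/close_spider hooks, so the file only needs to be opened once per run. A sketch of that variant using the csv module (same fields as the item above; the file name is just an example):
import csv

class QuotesCsvPipeline:
    def open_spider(self, spider):
        # Open the output file once when the spider starts
        self.file = open('quotes.csv', mode='a', encoding='utf-8', newline='')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        d = dict(item)
        self.writer.writerow([d['text'], d['author'], '|'.join(d['tags'])])
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()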
settings.py
Changes to the configuration file:
Do not obey the robots protocol:
ROBOTSTXT_OBEY = False
Set the log output level:
LOG_LEVEL = 'ERROR'
Specify the UA:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36'
""" 对整个scrapy爬虫项目做配置的文件 """
BOT_NAME = "qd_01_quotes"
SPIDER_MODULES = ["qd_01_quotes.spiders"]
NEWSPIDER_MODULE = "qd_01_quotes.spiders"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False # robots protocol
# Set the log level for output
LOG_LEVEL = 'ERROR'
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
# Enable the pipeline class; its job is to save the data
# The number is a priority weight: the smaller the value, the higher the priority
"qd_01_quotes.pipelines.QuotesPipeline": 299,
}
# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
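Because FEED_EXPORT_ENCODING is already configured, the same items can also be dumped through Scrapy's built-in feed exports instead of (or in addition to) the pipeline, for example:
scrapy crawl quotes -o quotes.json  # append the yielded items to a JSON feed file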
spiders
quotes.py
import scrapy
"""
爬虫文件:
1. 收集采集数据的地址
2. 解析数据返回
"""
# 继承自爬虫基类
class QuotesSpider(scrapy.Spider):
# Name of the spider, created by scrapy genspider; you must specify this name later when starting the crawl
name = "quotes"
# Domains the spider is allowed to collect data from; the allowed scope is a list
allowed_domains = []
# Initial URLs, auto-generated by genspider and meant to be edited afterwards
# URLs in this list are requested by the framework automatically
# If the target URLs follow a pattern, a list comprehension can build them
start_urls = ["http://quotes.toscrape.com/"]
# Every URL in start_urls is handled by the function below by default
def parse(self, response):
# Every parse callback in a Scrapy spider must take a response parameter
# response = response body + request + parsel.Selector (css + xpath + re)
# print(response.text)
divs = response.css('.quote')
for div in divs:
text = div.css('.text::text').get()
author = div.css('.author::text').get()
tags = div.css('.tags a::text').getall()
# If the extracted data is yielded back as a dict, the framework handles it automatically
# In a Scrapy spider file, all data is returned with yield
# Return the records one by one inside the loop
yield {
'text': text,
'author': author,
'tags': tags,
}
"""
启动项目指令:
1.终端进入项目目录
2.scrapy crawl +(爬虫文件名字)
"""
"""反扒、保存、底层调度顺序"""
quotes_items.py
import scrapy
# Import from the parent package, relative to this .py file
from ..items import Qd01QuotesItem
class QuotesItemsSpider(scrapy.Spider):
name = "quotes_items"
allowed_domains = ["toscrape.com"]
start_urls = ["http://quotes.toscrape.com/"]
def parse(self, response):
divs = response.css('.quote')
for div in divs:
text = div.css('.text::text').get()
author = div.css('.author::text').get()
tags = div.css('.tags a::text').getall()
# yield {
# 'text': text,
# 'author': author,
# 'tags': tags,
# }
# Return records one by one using the predefined data structure
yield Qd01QuotesItem(text=text, author=author, tags=tags)
quotes_next.py
import scrapy
from ..items import Qd01QuotesItem
class QuotesNextSpider(scrapy.Spider):
name = "quotes_next"
allowed_domains = ["toscrape.com"]
# Pagination approach 1
# start_urls = [f"https://quotes.toscrape.com/page/{page}/" for page in range(1, 10)]
start_urls = ["http://quotes.toscrape.com/"]
def parse(self, response):
divs = response.css('.quote')
for div in divs:
text = div.css('.text::text').get()
author = div.css('.author::text').get()
tags = div.css('.tags a::text').getall()
# Return records one by one using the predefined data structure
yield Qd01QuotesItem(text=text, author=author, tags=tags)
# Pagination approach 2
next_page = response.css('.next a::attr(href)').get()
if next_page:
all_url = 'https://quotes.toscrape.com' + next_page # build the full URL
# Build the request manually
# callback tells the framework which function should handle this request
yield scrapy.Request(url=all_url, callback=self.parse)
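Instead of concatenating strings, the response object can resolve relative URLs itself; a small sketch of the same pagination step using response.urljoin (response.follow is an equivalent shortcut):
next_page = response.css('.next a::attr(href)').get()
if next_page:
    # urljoin resolves the relative href against the current page URL
    yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)
    # or, more compactly:
    # yield response.follow(next_page, callback=self.parse)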
Case study - qd_02_fuliba
items.py
import scrapy
class Qd02FulibaItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
title = scrapy.Field() # article title
put_time = scrapy.Field() # publish time
reads = scrapy.Field() # read count
stars = scrapy.Field() # like count
info = scrapy.Field() # summary (note text)
middlewares.py
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
class Qd02FulibaSpiderMiddleware:
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
return None
def process_spider_output(self, response, result, spider):
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
pass
def process_start_requests(self, start_requests, spider):
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info("Spider opened: %s" % spider.name)
class Qd02FulibaDownloaderMiddleware:
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
return None
def process_response(self, request, response, spider):
return response
def process_exception(self, request, exception, spider):
pass
def spider_opened(self, spider):
spider.logger.info("Spider opened: %s" % spider.name)
pipelines.py
from itemadapter import ItemAdapter
class Qd02FulibaPipeline:
def process_item(self, item, spider):
d = dict(item) # note that item is a dict-like data structure
with open('quotes.csv', mode='a', encoding='utf-8') as f:
f.write(d['title'] + ',' + d['put_time'] + ',' + d['reads'] + ',' + d['stars'] + ',' + d['info'])
f.write('\n')
return item
settings.py
BOT_NAME = "qd_02_fuliba"
SPIDER_MODULES = ["qd_02_fuliba.spiders"]
NEWSPIDER_MODULE = "qd_02_fuliba.spiders"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
"qd_02_fuliba.pipelines.Qd02FulibaPipeline": 300,
}
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
spiders
fuliba.py
import scrapy
from ..items import Qd02FulibaItem
class FulibaSpider(scrapy.Spider):
name = "fuliba"
allowed_domains = ["fuliba2023.net"]
start_urls = [f"https://fuliba2023.net/page/{page}" for page in range(1, 162)]
def parse(self, response):
# print(response.text)
articles = response.css('.content article')
for art in articles:
title = art.css('h2>a::text').get() # article title
put_time = art.css('.meta>time::text').get() # publish time
reads = art.css('.pv::text').get() # read count
stars = art.css('.post-like>span::text').get() # like count
info = art.css('.note::text').get() # summary (note text)
yield Qd02FulibaItem(title=title, put_time=put_time, reads=reads,
stars=stars, info=info)
Case study - qd_03_english
items.py
import scrapy
class Qd03EnglishItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
title = scrapy.Field()
info = scrapy.Field()
img_url = scrapy.Field()
middlewares.py
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
class Qd03EnglishSpiderMiddleware:
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
return None
def process_spider_output(self, response, result, spider):
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
pass
def process_start_requests(self, start_requests, spider):
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info("Spider opened: %s" % spider.name)
class Qd03EnglishDownloaderMiddleware:
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
return None
def process_response(self, request, response, spider):
return response
def process_exception(self, request, exception, spider):
pass
def spider_opened(self, spider):
spider.logger.info("Spider opened: %s" % spider.name)
pipelines.py
from itemadapter import ItemAdapter
class Qd03EnglishPipeline:
def process_item(self, item, spider):
with open('english.csv', mode='a', encoding='utf-8') as f:
f.write(item['title'] + ',' + item['info'] + ',' + item['img_url'])
f.write('\n')
return item
settings.py
BOT_NAME = "qd_03_english"
SPIDER_MODULES = ["qd_03_english.spiders"]
NEWSPIDER_MODULE = "qd_03_english.spiders"
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
"qd_03_english.pipelines.Qd03EnglishPipeline": 300,
}
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
spiders
english.py
import scrapy
from ..items import Qd03EnglishItem
class EnglishSpider(scrapy.Spider):
name = "english"
allowed_domains = ["chinadaily.com.cn"]
# By default, the URLs in start_urls are requested by the framework automatically
# start_urls = [f"https://language.chinadaily.com.cn/thelatest/page_{page}.html" for page in range(1, 11)]
# Build the requests manually
# by overriding a built-in framework method
def start_requests(self):
for page in range(1, 11):
# scrapy.Request sends GET requests by default
yield scrapy.Request(
url=f"https://language.chinadaily.com.cn/thelatest/page_{page}.html",
callback=self.parse
)
# Callback function
def parse(self, response):
# print(response.text)
divs = response.css('.gy_box')
for div in divs:
title = div.css('.gy_box_txt2>a::text').get()
info = div.css('.gy_box_txt3>a::text').get() # summary
if info:
info = info.strip()
img_url = 'https:' + div.css('.gy_box_img>img::attr(src)').get()
yield Qd03EnglishItem(title=title, info=info, img_url=img_url)
# post
# Form data
# json data
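The trailing notes above (post / Form data / json data) refer to the two common POST body formats. A hedged sketch of both in Scrapy terms, reusing the Baidu sug endpoint from section III for the form case; the JSON URL is a placeholder:
import scrapy
from scrapy.http import JsonRequest

class PostSketchSpider(scrapy.Spider):
    name = "post_sketch"

    def start_requests(self):
        # Form-encoded POST body (application/x-www-form-urlencoded)
        yield scrapy.FormRequest(
            url="https://fanyi.baidu.com/sug",
            formdata={'kw': 'dog'},
            callback=self.parse,
        )
        # JSON POST body (application/json); the URL here is a placeholder
        yield JsonRequest(
            url="https://example.com/api",
            data={'kw': 'dog'},
            callback=self.parse,
        )

    def parse(self, response):
        print(response.text)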
Source: https://blog.csdn.net/weixin_43612602/article/details/135334520