First, be clear about the target: what exactly do we want to scrape? We are going to scrape the cover images and listing information of books on Kongfuzi Used Books (孔夫子旧书网). The fields we need are each book's title, author, publication date, new and used prices, and cover image; with the target fixed, we can write items.py:
import scrapy

class MyscrapyItem(scrapy.Item):
    # Ordinary data fields
    title = scrapy.Field()
    author = scrapy.Field()
    time = scrapy.Field()
    new_price = scrapy.Field()
    old_price = scrapy.Field()
    # Fields required by the images pipeline to download and track images
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()
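As a quick sanity check, the item behaves like a dict and can be instantiated with dummy values (everything below is made up for illustration):

from myscrapy.items import MyscrapyItem

item = MyscrapyItem(title='Example Book', author='Someone',
                    time='2020-01', new_price='12.00', old_price='30.00',
                    image_urls=['http://example.com/cover.jpg'])
print(dict(item))  # prints the populated fields as a plain dict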
Next, analyze the page to find where each piece of information lives in the HTML. A listing page holds one batch of books, and when the current page has a "next page" link, the spider needs to follow it and collect that page too. Now write spider.py:
import scrapy
from ..items import MyscrapyItem

class KongfzSpider(scrapy.Spider):
    name = 'kongfz'
    allowed_domains = ['kongfz.com']
    start_urls = ['http://item.kongfz.com/Cjisuanji/']

    def parse(self, response):
        divs = response.xpath("//div[@id='listBox']/div")
        for div in divs:
            item = MyscrapyItem()
            item['title'] = div.xpath("./div[@class='item-info']//a/text()").get()
            item['author'] = div.xpath("./div[@class='item-info']//span[1]/text()").get()
            item['time'] = div.xpath("./div[@class='item-info']//span[3]/text()").get()
            item['new_price'] = div.xpath("./div[@class='item-other-info']/div[1]//span[@class='price']/text()").get()
            item['old_price'] = div.xpath("./div[@class='item-other-info']/div[2]//span[@class='price']/text()").get()
            item['image_urls'] = [div.xpath(".//div[@class='big-img-box']/img/@src").get()]
            yield item
        # Pagination: follow the "next page" link if there is one
        next_url = response.xpath("//a[@class='next-btn']/@href").get()
        if next_url is not None:
            yield response.follow(next_url, callback=self.parse)
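Before trusting these XPath expressions, it can help to try them interactively in the Scrapy shell (the selectors come from the spider above; the site's markup may have changed since this was written):

$ scrapy shell "http://item.kongfz.com/Cjisuanji/"
>>> divs = response.xpath("//div[@id='listBox']/div")
>>> len(divs)                                                    # listings on this page
>>> divs[0].xpath("./div[@class='item-info']//a/text()").get()   # first book title
>>> response.xpath("//a[@class='next-btn']/@href").get()         # next-page link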
Next comes middlewares.py. Every page request needs a small anti-anti-crawling trick: the process_request hook of the MyscrapyDownloaderMiddleware class (generated by scrapy startproject) sets the Referer header so each request looks like it originated on the site itself.
class MyscrapyDownloaderMiddleware:
    def process_request(self, request, spider):
        # Disguise the request as on-site traffic by pointing the
        # Referer header at the requested URL itself (anti-anti-crawling).
        referer = request.url
        if referer:
            request.headers['referer'] = referer
        return None
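A quick offline sanity check of the middleware logic, outside Scrapy's runtime (spider=None is fine here because the method never touches it):

from scrapy.http import Request
from myscrapy.middlewares import MyscrapyDownloaderMiddleware

mw = MyscrapyDownloaderMiddleware()
req = Request('http://item.kongfz.com/Cjisuanji/')
mw.process_request(req, spider=None)
print(req.headers.get('referer'))  # b'http://item.kongfz.com/Cjisuanji/'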
Then write pipelines.py to customize how the images are downloaded and where they are stored:
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request
import hashlib
from scrapy.utils.python import to_bytes

class MyscrapyPipeline:
    def process_item(self, item, spider):
        return item

class KongfzImgDownloadPipeline(ImagesPipeline):
    # Request headers used when downloading the image files
    default_headers = {
        'accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'referer': 'http://item.kongfz.com/Cjisuanji/',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3314.0 Safari/537.36 SE 2.X MetaSr 1.0',
    }

    # Disguise image requests as on-site traffic (anti-anti-crawling);
    # build a per-request copy instead of mutating the shared dict
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            headers = {**self.default_headers, 'referer': image_url}
            yield Request(image_url, headers=headers)

    # Customize the storage path and file name of each image
    def file_path(self, request, response=None, info=None, *, item=None):
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return f'full/{item["title"]}/{image_guid}.jpg'

    # Record where the downloaded files ended up
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
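One caveat with file_path above: item["title"] goes straight into the storage path, and real titles can contain characters that are illegal in Windows file names (\, /, :, *, ?, etc.). A small hypothetical helper (safe_dirname is not part of the original code) could sanitize the directory name first:

import re

def safe_dirname(title, default='untitled'):
    # Strip characters Windows forbids in file and directory names.
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title or '').strip()
    return cleaned or default

# then, inside file_path:
#     return f'full/{safe_dirname(item["title"])}/{image_guid}.jpg'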
Finally, configure settings.py:
BOT_NAME = 'myscrapy'

SPIDER_MODULES = ['myscrapy.spiders']
NEWSPIDER_MODULE = 'myscrapy.spiders'

# Keep Chinese text readable in exported feeds
FEED_EXPORT_ENCODING = 'utf-8'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3314.0 Safari/537.36 SE 2.X MetaSr 1.0'

# Obey robots.txt rules (False = do not obey robots.txt)
ROBOTSTXT_OBEY = False

# Keep Scrapy's automatic cookie handling enabled: cookies received
# from the site are stored and sent back on later requests
COOKIES_ENABLED = True

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # Sets the Referer header so requests look like on-site traffic
    'myscrapy.middlewares.MyscrapyDownloaderMiddleware': 543,
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'myscrapy.pipelines.KongfzImgDownloadPipeline': 300,
}

# Root directory where downloaded images are stored
IMAGES_STORE = 'D:\\images'
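The images pipeline also understands a few optional settings that are worth knowing about; the values below are illustrative, not from the original project:

# Skip re-downloading an image fetched within the last 30 days.
IMAGES_EXPIRES = 30
# Ignore tiny images (icons, placeholders) below these dimensions.
IMAGES_MIN_HEIGHT = 80
IMAGES_MIN_WIDTH = 80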
Every step is done, so let your spider put on its show:

scrapy crawl kongfz -o kongfz.json

...after stepping away for a glass of water, open kongfz.json to check the exported records, then browse the image directory (D:\images, as configured above). Pretty much perfect: a big batch of cover images is waiting for you.
To sum up, a Scrapy project usually means writing five files, in this order:

items.py --> spider.py --> middlewares.py --> pipelines.py --> settings.py
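For reference, all five files live in the standard layout that scrapy startproject myscrapy generates:

myscrapy/
├── scrapy.cfg
└── myscrapy/
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── kongfz.py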
With all the images from Kongfuzi Used Books pulled down, scraping images from other sites should hold no fear. If you have ideas of your own, the stage is yours.