
 

First, be clear about the goal: what exactly are we scraping?

We will scrape the cover images and listing information of the books on Kongfz (孔夫子旧书网).

The fields we want from each listing are the title, author, publication date, new-book price, used-book price, and the cover image. With the target set, we can write items.py:
import scrapy
 
 
class MyscrapyItem(scrapy.Item):
 
    # Regular data fields
    title = scrapy.Field()
    author = scrapy.Field()
    time = scrapy.Field()
    new_price = scrapy.Field()
    old_price = scrapy.Field()
 
    # Fields needed to download and save the images
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()
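By convention, Scrapy's built-in ImagesPipeline reads download URLs from item['image_urls'] and writes the download results into item['images']; image_paths is an extra field of our own, which the custom pipeline below fills with the saved file paths.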
 
Analyze the page to see where the target information lives in the HTML.

That covers a single page; when the current page has a next page, we need to follow it and scrape that as well.

Next, write spider.py:
import scrapy
from ..items import MyscrapyItem
 
 
class KongfzSpider(scrapy.Spider):
    name = 'kongfz'
    allowed_domains = ['kongfz.com']
    start_urls = ['http://item.kongfz.com/Cjisuanji/']
 
    def parse(self, response):
        # Each book listing is a direct child div of the #listBox container
        divs = response.xpath("//div[@id='listBox']/div")
        for div in divs:
 
            item = MyscrapyItem()
            item['title'] = div.xpath("./div[@class='item-info']//a/text()").get()
            item['author'] = div.xpath("./div[@class='item-info']//span[1]/text()").get()
            item['time'] = div.xpath("./div[@class='item-info']//span[3]/text()").get()
            item['new_price'] = div.xpath("./div[@class='item-other-info']/div[1]//span[@class='price']/text()").get()
            item['old_price'] = div.xpath("./div[@class='item-other-info']/div[2]//span[@class='price']/text()").get()
            item['image_urls'] = [div.xpath(".//div[@class='big-img-box']/img/@src").get()]
 
            yield item
 
        # Pagination: follow the next-page link if one exists
        next_url = response.xpath("//a[@class='next-btn']/@href").get()
        if next_url is not None:
            yield response.follow(next_url, callback=self.parse)
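Before running the full crawl, it helps to sanity-check the XPath expressions interactively with scrapy shell (part of Scrapy itself); a quick session might look like this:

scrapy shell http://item.kongfz.com/Cjisuanji/
>>> divs = response.xpath("//div[@id='listBox']/div")
>>> len(divs)                                                     # listings matched
>>> divs[0].xpath("./div[@class='item-info']//a/text()").get()   # first title
>>> response.xpath("//a[@class='next-btn']/@href").get()         # next-page URL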
 

Write middlewares.py; every page request needs a small anti-anti-scraping tweak:
    def process_request(self, request, spider):
        # Disguise the request as an on-site one by pointing the Referer
        # header at the requested URL itself, to counter anti-scraping checks
        referer = request.url
        if referer:
            request.headers['referer'] = referer
        return None
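This method is not standalone: it belongs in the downloader-middleware class that settings.py enables below. A minimal sketch of the whole class, assuming the other hooks that scrapy startproject generates (from_crawler, process_response, process_exception) are removed or kept at their defaults:

class MyscrapyDownloaderMiddleware:

    def process_request(self, request, spider):
        # Returning None tells Scrapy to continue handling the request
        request.headers['referer'] = request.url
        return None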
 

Next, write pipelines.py to customize how the images are downloaded and saved:
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request
import hashlib
from scrapy.utils.python import to_bytes
 
 
class MyscrapyPipeline:
    def process_item(self, item, spider):
        return item
 
 
class KongfzImgDownloadPipeline(ImagesPipeline):
 
    # Request headers used for the image download requests
    default_headers = {
        'accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'referer': 'http://item.kongfz.com/Cjisuanji/',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3314.0 Safari/537.36 SE 2.X MetaSr 1.0',
    }
 
    # Disguise image requests as on-site requests (anti-anti-scraping)
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            self.default_headers['referer'] = image_url
            yield Request(image_url, headers=self.default_headers)
 
    # Customize the storage path and file name (this is the signature
    # Scrapy 2.4+ calls, with item as a keyword-only argument)
    def file_path(self, request, response=None, info=None, *, item=None):
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return f'full/{item["title"]}/{image_guid}.jpg'
 
    # Record where the downloaded files were stored
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
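One caveat: item['title'] is interpolated directly into the file path, and book titles may contain characters that are illegal in Windows file names (IMAGES_STORE below is a Windows path). A drop-in variant of file_path that sanitizes the title first (the re.sub cleanup is my addition, not part of the original post):

import re  # add at the top of pipelines.py

    # Same as file_path above, but with the title sanitized
    def file_path(self, request, response=None, info=None, *, item=None):
        # Replace characters Windows forbids in file names
        safe_title = re.sub(r'[\\/:*?"<>|]', '_', item['title'] or 'untitled')
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return f'full/{safe_title}/{image_guid}.jpg'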
 

Finally, configure settings.py:
 
BOT_NAME = 'myscrapy'
 
SPIDER_MODULES = ['myscrapy.spiders']
NEWSPIDER_MODULE = 'myscrapy.spiders'
 
 
FEED_EXPORT_ENCODING = 'utf-8'
 
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3314.0 Safari/537.36 SE 2.X MetaSr 1.0'
 
# Obey robots.txt rules
ROBOTSTXT_OBEY = False     # False: do not obey robots.txt
 
# Left commented out, this defaults to True. True enables Scrapy's cookie
# middleware (it manages Request.cookies and Set-Cookie responses); False
# disables it, so only cookies hard-coded into request headers are sent.
COOKIES_ENABLED = True
 
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'myscrapy.middlewares.MyscrapyDownloaderMiddleware': 543,   # disguise page requests as on-site requests (anti-anti-scraping)
}
 
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'myscrapy.pipelines.KongfzImgDownloadPipeline': 300,
}
 
IMAGES_STORE = 'D:\\images'  # root directory for the downloaded images
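Note that ITEM_PIPELINES only registers KongfzImgDownloadPipeline; the pass-through MyscrapyPipeline defined in pipelines.py never runs. To run both, register both (lower numbers run earlier):

ITEM_PIPELINES = {
   'myscrapy.pipelines.KongfzImgDownloadPipeline': 300,
   'myscrapy.pipelines.MyscrapyPipeline': 400,   # runs after the image pipeline
}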
 

All the pieces are in place, so let your spider put on its show:
scrapy crawl kongfz -o kongfz.json
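One note on the command: -o appends to an existing file, which produces invalid JSON across repeated runs; newer Scrapy versions also provide -O kongfz.json, which overwrites instead.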
 

 

After wandering off for a glass of water, take a look at kongfz.json, then at the images saved under IMAGES_STORE.

Pretty much perfect: a whole batch of cover images is waiting for you to browse.
To wrap up: a Scrapy project typically involves writing five files, in this order:
items.py --> spider.py --> middlewares.py --> pipelines.py --> settings.py
 

With all of Kongfz's images scraped, other sites' images shouldn't scare you.
If you've got ideas of your own, it's your turn to perform.
————————————————
Copyright notice: this is an original article by CSDN blogger 「码农不是马」, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/qq_36034503/article/details/109151295