Scrapy Official Example

The Scrapy project provides an official demo project containing two spiders, one that parses page content with CSS selectors and one that uses XPath expressions.

Official Scrapy example repository: https://github.com/scrapy/quotesbot

1. Listing the spiders

After downloading the project from GitHub, run the list command from the project root to see which spiders it contains:

$ scrapy list
toscrape-css
toscrape-xpath

Both spiders extract the same data from the same website, but toscrape-css parses the page with CSS selectors, while toscrape-xpath uses XPath expressions.

The source files for the two spiders are in the quotesbot/spiders directory of the project:

$ cd quotesbot/spiders
$ ls
toscrape-css.py
toscrape-xpath.py

2. Spider code

The spider that uses CSS selectors, toscrape-css.py:

# -*- coding: utf-8 -*-
import scrapy


class ToScrapeCSSSpider(scrapy.Spider):
    name = "toscrape-css"
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").extract_first(),
                'author': quote.css("small.author::text").extract_first(),
                'tags': quote.css("div.tags > a.tag::text").extract()
            }

        next_page_url = response.css("li.next > a::attr(href)").extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
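
The last three lines of parse follow the pagination link: the href of the "next" link is typically a relative path, so it is resolved against the current page URL before a new request is issued. The sketch below illustrates that resolution step with the standard library's urljoin, which is roughly what response.urljoin does. The helper name resolve_next_page and the sample href values are assumptions made for illustration, not part of the quotesbot project.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, href):
    """Return the absolute URL of the next page, or None if there is no link.

    Hypothetical helper: it mirrors the `if next_page_url is not None` check
    in the spider, using urllib instead of a Scrapy response object.
    """
    if href is None:
        # Last page: the "next" link is missing, so there is nothing to follow.
        return None
    return urljoin(page_url, href)


if __name__ == "__main__":
    # Assumed href values; the real ones come from the li.next > a element.
    print(resolve_next_page("http://quotes.toscrape.com/", "/page/2/"))
    print(resolve_next_page("http://quotes.toscrape.com/page/2/", "/page/3/"))
    print(resolve_next_page("http://quotes.toscrape.com/page/10/", None))
```

When the helper returns None the crawl has nothing more to request, which is why the spider stops on its own after the last page.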

The spider that uses XPath expressions, toscrape-xpath.py:

# -*- coding: utf-8 -*-
import scrapy


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract()
            }

        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

 

3. Running the spiders

You can run a spider with the scrapy crawl command, for example:

$ scrapy crawl toscrape-css

To save the scraped data to a file, pass the -o option:

$ scrapy crawl toscrape-css -o quotes.json

Output of the run:

2021-08-07 17:20:13 [scrapy.core.engine] INFO: Spider opened
2021-08-07 17:20:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-08-07 17:20:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-08-07 17:20:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2021-08-07 17:20:14 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
2021-08-07 17:20:14 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
2021-08-07 17:20:14 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
......

Both spiders extract items that look like this:

{
    'author': 'Douglas Adams',
    'text': '“I may not have gone where I intended to go, but I think I ...”',
    'tags': ['life', 'navigation']
}
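
Once a crawl has written quotes.json with the -o option, the file holds a JSON array of items like the one above and can be loaded with the standard library. The sketch below counts quotes per author as one possible follow-up step. The function name count_quotes_by_author is hypothetical, and the script assumes quotes.json sits in the current directory; it does nothing if the file is absent.

```python
import json
import os
from collections import Counter


def count_quotes_by_author(path):
    """Load a Scrapy JSON export and count how many quotes each author has."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)  # a list of dicts with 'text', 'author', 'tags'
    return Counter(item["author"] for item in items)


if __name__ == "__main__":
    # Assumed location of the export produced by `scrapy crawl ... -o quotes.json`.
    export_path = "quotes.json"
    if os.path.exists(export_path):
        for author, count in count_quotes_by_author(export_path).most_common(5):
            print(f"{author}: {count}")
```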