Scrapy is an open-source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way. It's widely used for web scraping and has built-in support for extracting data from HTML/XML sources using XPath and CSS selectors. Scrapy is designed with a focus on crawling websites and extracting structured data from these pages. Here’s a concise reference guide for common use cases with Scrapy:
Scrapy Reference Guide
Installation
pip install scrapy
Creating a Scrapy Project
To start a new Scrapy project, navigate to the directory where you want your project to be and run:
scrapy startproject myproject
This command creates a myproject directory with the following structure:
myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Defining Items
Items are models that define the structure of the scraped data. They are defined in items.py:
import scrapy

class MyItem(scrapy.Item):
    name = scrapy.Field()
    description = scrapy.Field()
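Once defined, an item behaves much like a dictionary; a quick sketch (the field values below are placeholders):

item = MyItem(name='Example', description='A placeholder description')
item['name'] = 'Updated name'   # fields can also be set like dict keys
print(dict(item))               # shows both fields as a plain dict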
Writing Spiders
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        'http://example.com',
    ]

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
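To follow links from a page, parse() can also yield new requests. A minimal sketch using response.follow (the 'a.next::attr(href)' selector is a placeholder for whatever links you want to follow):

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # Follow pagination links and parse them with the same callback.
        for href in response.css('a.next::attr(href)').getall():
            yield response.follow(href, callback=self.parse)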
Running Spiders
To run a spider, use the scrapy crawl command followed by the spider’s name:
scrapy crawl my_spider
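Spiders can also be run from a plain Python script via CrawlerProcess; a minimal sketch (the spider module path is assumed from the project layout above):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.my_spider import MySpider  # assumed module path

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes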
Storing the Scraped Data
The simplest way to store the scraped data is with the -o option of the scrapy crawl command, which generates a file containing the scraped data in the format inferred from the file extension (for example JSON, CSV, or XML):
scrapy crawl my_spider -o output.json
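In Scrapy 2.1 and later, feed exports can also be configured once in settings.py through the FEEDS setting; a minimal sketch:

# settings.py
FEEDS = {
    'output.json': {
        'format': 'json',
        'encoding': 'utf8',
        'overwrite': True,  # requires Scrapy 2.4+; older versions append
    },
}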
Middleware and Pipelines
- Middlewares allow you to add custom processing to requests and responses.
- Item Pipelines allow you to process and filter the items returned by your spiders. They are defined in pipelines.py and need to be activated in your project's settings.py (see the sketch below).
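As an illustration, a minimal pipeline might drop items that lack a required field; the field name and priority value below are only examples:

# pipelines.py
from scrapy.exceptions import DropItem

class MyPipeline:
    def process_item(self, item, spider):
        # Discard items that are missing a 'name'; pass everything else along.
        if not item.get('name'):
            raise DropItem("Missing name in %s" % item)
        return item

# settings.py -- activate the pipeline; the number (0-1000) controls ordering.
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}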
Scrapy Shell
Scrapy provides an interactive shell for trying out selectors without running a spider. It's very useful for development and debugging:
scrapy shell 'http://example.com'
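Inside the shell, the downloaded page is available as response, so selectors can be tried interactively; for example (the second URL is a placeholder):

>>> response.css('title::text').get()
>>> response.xpath('//a/@href').getall()
>>> fetch('http://example.com/another-page')   # download a different page
>>> view(response)                             # open the current response in a browser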
Scrapy is built for efficiency: it handles many requests concurrently, which makes it significantly faster than sequential, hand-written scraping scripts. This guide covers the basics, but Scrapy's capabilities extend to a broad range of advanced features, including logging, extensions for deep customization, and built-in support for exporting scraped data in various formats.
Scrapy's architecture is built around "spiders": self-contained crawlers that are given a set of instructions. Combined with its powerful selectors and item abstractions, Scrapy is suited to anything from simple site crawls to large-scale scraping operations.