`Scrapy` is an open-source, collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way. It is widely used for web scraping and has built-in support for extracting data from HTML/XML sources using XPath and CSS selectors, with a focus on crawling websites and pulling structured data from their pages. Here’s a concise reference guide for common `Scrapy` use cases:
# `Scrapy` Reference Guide
## Installation
```shell
pip install scrapy
```
## Creating a Scrapy Project
To start a new Scrapy project, navigate to the directory where you want your project to be and run:
```shell
scrapy startproject myproject
```
This command creates a `myproject` directory with the following structure:
```
myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # project's Python module; you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
```
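Scrapy can also scaffold a spider file inside `spiders/` with the `genspider` command; the spider name and domain below are placeholder values:

```shell
# creates myproject/spiders/example.py with a minimal spider class
scrapy genspider example example.com
```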
## Defining Items
Items are models that define the structure of the scraped data. They are defined in `items.py`:
```python
import scrapy


class MyItem(scrapy.Item):
    name = scrapy.Field()
    description = scrapy.Field()
```
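As a sketch of how such an item might be filled in, a spider callback can instantiate it like a dictionary and yield it; the spider name, URL, and CSS selectors below are illustrative:

```python
import scrapy

from myproject.items import MyItem


class ItemSpider(scrapy.Spider):
    name = "item_spider"
    start_urls = ['http://example.com']

    def parse(self, response):
        # Items only accept the fields declared on the class
        item = MyItem()
        item['name'] = response.css('h1::text').get()
        item['description'] = response.css('p::text').get()
        yield item
```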
## Writing Spiders
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass `scrapy.Spider` and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
```python
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        'http://example.com',
    ]

    def parse(self, response):
        # parse() is called with the downloaded response for each start URL
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
```
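The paragraph above mentions following links as well; a minimal sketch of that pattern uses `response.follow` to schedule further requests (the selector and callback wiring here are illustrative):

```python
import scrapy


class CrawlingSpider(scrapy.Spider):
    name = "crawling_spider"
    start_urls = ['http://example.com']

    def parse(self, response):
        # Extract data from the current page
        yield {'title': response.css('title::text').get()}

        # Follow each link on the page and parse the linked pages the same way;
        # response.follow resolves relative URLs against the current page
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)
```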
## Running Spiders
To run a spider, use the `scrapy crawl` command followed by the spider’s name:
```shell
scrapy crawl my_spider
```
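Spiders can also be launched from a plain Python script rather than the CLI; a rough sketch using `CrawlerProcess` (the import path for `MySpider` depends on which file you placed it in):

```python
from scrapy.crawler import CrawlerProcess

# assumes the MySpider class above lives in myproject/spiders/my_spider.py
from myproject.spiders.my_spider import MySpider

process = CrawlerProcess()
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes
```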
## Storing the Scraped Data
The simplest way to store the scraped data is by using the `-o` option in the `scrapy crawl` command, which will generate a file containing the scraped data in your chosen format:
```shell
scrapy crawl my_spider -o output.json
```
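The output format is inferred from the file extension (`.json`, `.csv`, `.jl`, `.xml`, and so on). Feeds can also be configured persistently in `settings.py` via the `FEEDS` setting available in newer Scrapy releases; the filenames below are examples:

```python
# settings.py (excerpt)
FEEDS = {
    "output.json": {"format": "json", "overwrite": True},
    "output.csv": {"format": "csv"},
}
```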
## Middleware and Pipelines
- **Middlewares** let you hook custom processing into the requests Scrapy sends and the responses it receives; they are defined in `middlewares.py` and enabled in `settings.py`.
- **Item Pipelines** allow you to process and filter the items returned by your spiders. They are defined in `pipelines.py` and need to be activated in your project’s `settings.py`; a minimal sketch follows.
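Such a pipeline implements `process_item` and is switched on via the `ITEM_PIPELINES` setting (the class name, field check, and priority number here are arbitrary examples):

```python
# pipelines.py
from scrapy.exceptions import DropItem


class RequireNamePipeline:
    def process_item(self, item, spider):
        # Discard items without a 'name'; everything else passes through unchanged
        if not item.get('name'):
            raise DropItem("missing name")
        return item
```

Enabling it in `settings.py` would then look like `ITEM_PIPELINES = {"myproject.pipelines.RequireNamePipeline": 300}`, where the number controls the order in which pipelines run (lower values run first).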
## Scrapy Shell
Scrapy provides an interactive shell for trying out selectors without running a full spider; it’s very useful for developing and debugging extraction code:
```shell
scrapy shell 'http://example.com'
```
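Inside the shell, the downloaded page is exposed as `response`, so selectors can be tested interactively; a few illustrative expressions (what they return depends entirely on the page):

```python
# expressions you might try at the Scrapy shell prompt
response.url                            # the URL that was fetched
response.css('title::text').get()       # first match of a CSS selector
response.xpath('//a/@href').getall()    # all matches of an XPath selector
fetch('http://example.com/other-page')  # download another page into `response`
view(response)                          # open the response in your web browser
```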
`Scrapy` is built for efficiency: it handles many requests concurrently, which makes it significantly faster than sequential, hand-written scraping scripts. This guide covers the basics, but `Scrapy`’s capabilities extend to a broad range of advanced features, including logging, extensions for deep customization, and built-in support for exporting scraped data in various formats.

Scrapy’s architecture is built around "spiders": self-contained crawlers that are given a set of instructions. Combined with its powerful selector and item facilities, this makes Scrapy suitable for everything from simple site crawls to large-scale scraping operations.
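The concurrency mentioned above is governed by settings; a few commonly tuned ones, with illustrative values rather than recommendations:

```python
# settings.py (excerpt)
CONCURRENT_REQUESTS = 16            # total concurrent requests (Scrapy's default)
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per target domain
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to the same site
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to observed latencies
```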