# `Scrapy` Reference Guide
`Scrapy` is an open-source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way. It's widely used for web scraping and has built-in support for extracting data from HTML/XML sources using XPath and CSS selectors. Scrapy is designed with a focus on crawling websites and extracting structured data from these pages. Here's a concise reference guide for common use cases with `Scrapy`:
## Installation
```shell
pip install scrapy
```
## Creating a Scrapy Project
To start a new Scrapy project, navigate to the directory where you want your project to be and run:
```shell
scrapy startproject myproject
```
This command creates a `myproject` directory with the following structure:
```
myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
```
## Defining Items
Items are models that define the structure of the scraped data. They are defined in `items.py`:
```python
import scrapy


class MyItem(scrapy.Item):
    name = scrapy.Field()
    description = scrapy.Field()
```
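Items behave like dictionaries with a fixed set of fields. As a minimal sketch, assuming the project is named `myproject` and the target page has `h1` and `p` elements (both assumptions for illustration), an item might be populated in a spider callback like this:
```python
import scrapy

from myproject.items import MyItem  # assumes the project is named "myproject"


class ItemSpider(scrapy.Spider):
    name = "item_spider"
    start_urls = ['http://example.com']

    def parse(self, response):
        # The selectors below are placeholders; adapt them to the real markup.
        item = MyItem()
        item['name'] = response.css('h1::text').get()
        item['description'] = response.css('p::text').get()
        yield item
```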
## Writing Spiders
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass `scrapy.Spider` and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
```python
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        'http://example.com',
    ]

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
```
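The spider above only scrapes its start URLs. When you also need to follow links, `response.follow()` can queue further requests. The sketch below assumes a hypothetical `a.next` pagination link, so the selector is illustrative only:
```python
import scrapy


class LinkFollowingSpider(scrapy.Spider):
    name = "link_follower"
    start_urls = ['http://example.com']

    def parse(self, response):
        # Extract data from the current page.
        yield {'title': response.css('title::text').get()}

        # Follow pagination links (selector is an assumption) and
        # parse the next pages with the same callback.
        for href in response.css('a.next::attr(href)').getall():
            yield response.follow(href, callback=self.parse)
```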
## Running Spiders
To run a spider, use the `scrapy crawl` command followed by the spider's name:
```shell
scrapy crawl my_spider
```
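Spiders can also be launched from a plain Python script via `scrapy.crawler.CrawlerProcess`. The sketch below assumes the spider module path and user agent shown, which are illustrative:
```python
from scrapy.crawler import CrawlerProcess

from myproject.spiders.my_spider import MySpider  # assumed module path

process = CrawlerProcess(settings={
    'USER_AGENT': 'my-crawler (+http://example.com)',  # illustrative value
})
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes
```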
## Storing the Scraped Data
The simplest way to store the scraped data is by using the `-o` option in the `scrapy crawl` command, which will generate a file containing the scraped data in your chosen format:
```shell
scrapy crawl my_spider -o output.json
```
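Other formats such as CSV, XML, and JSON Lines work the same way (e.g. `-o output.csv`). In recent Scrapy versions the same feed exports can also be configured once in `settings.py` via the `FEEDS` setting; the file names below are examples:
```python
# settings.py
FEEDS = {
    'output.json': {'format': 'json', 'overwrite': True},
    'output.csv': {'format': 'csv'},
}
```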
## Middleware and Pipelines
- **Middlewares** allow you to add custom processing to requests and responses.
- **Item Pipelines** allow you to process and filter the items returned by your spiders. They are defined in `pipelines.py` and must be activated in your project's `settings.py`, as sketched below.
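As a minimal sketch of the pattern, the pipeline below drops items that lack a `name` field (the field comes from the earlier `MyItem` example) and is then enabled in `settings.py`:
```python
# pipelines.py
from scrapy.exceptions import DropItem


class ValidateItemPipeline:
    def process_item(self, item, spider):
        # Reject items without a name; otherwise pass them along unchanged.
        if not item.get('name'):
            raise DropItem(f"Missing name in {item!r}")
        return item


# settings.py -- the number (0-1000) controls the order pipelines run in.
ITEM_PIPELINES = {
    'myproject.pipelines.ValidateItemPipeline': 300,
}
```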
## Scrapy Shell
Scrapy provides an interactive shell for trying out selectors without running the spider. It's very useful for developing and debugging:
```shell
scrapy shell 'http://example.com'
```
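Inside the shell, `response` is already populated for the fetched page, and helpers such as `fetch()` and `view()` are available. The selectors below are illustrative:
```python
response.css('title::text').get()        # extract the page title
response.xpath('//a/@href').getall()     # list all link hrefs
fetch('http://example.com/other-page')   # load a different URL in place
view(response)                           # open the current response in a browser
```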
`Scrapy` is built for efficiency: it handles multiple requests concurrently, making it significantly faster than manually written scripts for web scraping. This guide covers basic functionality, but `Scrapy`'s capabilities extend to a broad range of advanced features, including logging, extensions for deep customization, and built-in support for exporting scraped data in various formats.
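Concurrency and politeness are governed by a handful of settings in `settings.py`; the values below are illustrative, not recommendations:
```python
# settings.py
CONCURRENT_REQUESTS = 16              # parallel requests across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # cap per individual domain
DOWNLOAD_DELAY = 0.5                  # seconds between requests to the same site
ROBOTSTXT_OBEY = True                 # respect robots.txt
```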
Scrapy's architecture is built around "spiders," which are self-contained crawlers that are given a set of instructions. In combination with its powerful selector and item capabilities, Scrapy is suited for complex web scraping tasks, from simple website crawls to large-scale web scraping operations.