diff --git a/docs/tech_docs/python/Scrapy.md b/docs/tech_docs/python/Scrapy.md
new file mode 100644
index 0000000..d7269cf
--- /dev/null
+++ b/docs/tech_docs/python/Scrapy.md
@@ -0,0 +1,82 @@
+`Scrapy` is an open-source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way. It is widely used for web scraping and has built-in support for extracting data from HTML/XML sources using XPath and CSS selectors. Scrapy is designed with a focus on crawling websites and extracting structured data from their pages. Here’s a concise reference guide for common use cases with `Scrapy`:
+
+# `Scrapy` Reference Guide
+
+## Installation
+```shell
+pip install scrapy
+```
+
+## Creating a Scrapy Project
+To start a new Scrapy project, navigate to the directory where you want your project to be and run:
+```shell
+scrapy startproject myproject
+```
+This command creates a `myproject` directory with the following structure:
+```
+myproject/
+    scrapy.cfg            # deploy configuration file
+    myproject/            # project's Python module; you'll import your code from here
+        __init__.py
+        items.py          # project items definition file
+        middlewares.py    # project middlewares file
+        pipelines.py      # project pipelines file
+        settings.py       # project settings file
+        spiders/          # a directory where you'll later put your spiders
+            __init__.py
+```
+
+## Defining Items
+Items are models that define the structure of the scraped data. They are defined in `items.py`:
+```python
+import scrapy
+
+class MyItem(scrapy.Item):
+    name = scrapy.Field()
+    description = scrapy.Field()
+```
+
+## Writing Spiders
+Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass `scrapy.Spider` and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
+```python
+import scrapy
+
+class MySpider(scrapy.Spider):
+    name = "my_spider"
+    start_urls = [
+        'http://example.com',
+    ]
+
+    def parse(self, response):
+        yield {
+            'url': response.url,
+            'title': response.css('title::text').get(),
+        }
+```
+
+## Running Spiders
+To run a spider, use the `scrapy crawl` command followed by the spider’s name:
+```shell
+scrapy crawl my_spider
+```
+
+## Storing the Scraped Data
+The simplest way to store the scraped data is with the `-o` option of the `scrapy crawl` command, which writes the scraped items to a file in a format inferred from the file extension (e.g. `.json`, `.csv`, `.xml`):
+```shell
+scrapy crawl my_spider -o output.json
+```
+
+## Middleware and Pipelines
+- **Middlewares** allow you to add custom processing to requests and responses.
+- **Item Pipelines** allow you to process and filter the items returned by your spiders. They are defined in `pipelines.py` and need to be activated in your project’s `settings.py` (a minimal example appears below, after the Scrapy Shell section).
+
+## Scrapy Shell
+Scrapy provides an interactive shell for trying out selectors without running a spider. It’s very useful for developing and debugging:
+```shell
+scrapy shell 'http://example.com'
+```
+
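+## Item Pipeline Example
+As a rough sketch of how the pipeline mechanics fit together (the `DescriptionRequiredPipeline` class and its drop-items-without-a-description rule are hypothetical choices for this example, not something Scrapy prescribes), a pipeline is a class with a `process_item` method defined in `pipelines.py`:
+```python
+# pipelines.py: hypothetical pipeline that filters out incomplete items
+from scrapy.exceptions import DropItem
+
+class DescriptionRequiredPipeline:
+    def process_item(self, item, spider):
+        # Items without a description are dropped; everything else passes through.
+        if not item.get('description'):
+            raise DropItem(f"Missing description in {item!r}")
+        return item
+```
+It is then activated in `settings.py`, assuming the `myproject` layout from above (lower numbers run earlier; values conventionally range from 0 to 1000):
+```python
+# settings.py: enable the pipeline for this project
+ITEM_PIPELINES = {
+    'myproject.pipelines.DescriptionRequiredPipeline': 300,
+}
+```
+Items dropped this way never reach the exported output, while returned items continue through any remaining pipelines.
+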
+`Scrapy` is built for efficiency: it handles many requests concurrently, which makes it significantly faster than sequential, hand-written scraping scripts. This guide covers the basics, but `Scrapy`’s capabilities extend to a broad range of advanced features, including logging, extensions for deep customization, and built-in support for exporting scraped data in various formats.
+
+Scrapy’s architecture is built around "spiders": self-contained crawlers that are given a set of instructions. Combined with its powerful selector and item capabilities, Scrapy is suited to a wide range of web scraping tasks, from simple website crawls to large-scale scraping operations.
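+
+As a small illustration of that model, the sketch below shows a spider that both extracts items and follows pagination links; the site structure and CSS selectors (`div.product`, `a.next`) are assumptions made for the example, while `response.follow` is standard Scrapy API:
+```python
+import scrapy
+
+class CatalogSpider(scrapy.Spider):
+    # Hypothetical spider: the start URL and selectors below are placeholders.
+    name = "catalog_spider"
+    start_urls = ['http://example.com/catalog']
+
+    def parse(self, response):
+        # Yield one item per product block found on the current page.
+        for product in response.css('div.product'):
+            yield {
+                'name': product.css('h2::text').get(),
+                'description': product.css('p::text').get(),
+            }
+
+        # Follow the "next page" link, if present; response.follow resolves
+        # relative URLs and schedules another request with this same callback.
+        next_page = response.css('a.next::attr(href)').get()
+        if next_page is not None:
+            yield response.follow(next_page, callback=self.parse)
+```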