# `Scrapy` Reference Guide
`Scrapy` is an open-source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way. It's widely used for web scraping and has built-in support for extracting data from HTML/XML sources using XPath and CSS selectors. Scrapy is designed with a focus on crawling websites and extracting structured data from their pages. Here's a concise reference guide for common use cases with `Scrapy`:
## Installation
```shell
pip install scrapy
```
## Creating a Scrapy Project
To start a new Scrapy project, navigate to the directory where you want your project to be and run:
```shell
scrapy startproject myproject
```
This command creates a `myproject` directory with the following structure:
```
myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
```
## Defining Items
Items are models that define the structure of the scraped data. They are defined in `items.py`:
```python
import scrapy


class MyItem(scrapy.Item):
    name = scrapy.Field()
    description = scrapy.Field()
```
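For a quick illustration (a minimal sketch reusing the `MyItem` class above, not part of the generated project files), items behave much like dictionaries:
```python
from myproject.items import MyItem  # assumes the project layout shown earlier

# Fields can be set at construction time or via dict-style access.
item = MyItem(name='Example')
item['description'] = 'A sample item'

print(dict(item))        # convert to a plain dict for logging or debugging
print(item.get('name'))  # dict-style .get() access -> 'Example'
```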
## Writing Spiders
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass `scrapy.Spider` and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
```python
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        'http://example.com',
    ]

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
```
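The example spider above only parses its start URL. Since `parse()` can also schedule further requests, here is a hedged sketch of following pagination links with `response.follow`; the `a.next::attr(href)` selector is an assumption about the target site's markup:
```python
import scrapy


class FollowSpider(scrapy.Spider):
    """Illustrative spider (not part of the generated project)."""

    name = "follow_spider"
    start_urls = ['http://example.com']

    def parse(self, response):
        # Extract data from the current page.
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # Follow the "next page" link, if the page has one (assumed selector).
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```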
## Running Spiders
To run a spider, use the `scrapy crawl` command followed by the spider's name:
```shell
scrapy crawl my_spider
```
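A spider can also be run from a standalone Python script instead of the CLI; a minimal sketch using `CrawlerProcess`, assuming `MySpider` from the previous section is importable from the module path shown:
```python
from scrapy.crawler import CrawlerProcess

from myproject.spiders.my_spider import MySpider  # assumed module path


if __name__ == '__main__':
    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
    process.crawl(MySpider)
    process.start()  # blocks until the crawl finishes
```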
## Storing the Scraped Data
The simplest way to store the scraped data is with the `-o` option of the `scrapy crawl` command, which writes the scraped items to a file in the format implied by its extension (JSON, CSV, XML, and so on):
```shell
scrapy crawl my_spider -o output.json
```
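Feed exports can also be configured once in `settings.py` instead of passing `-o` on every run; a minimal sketch using the `FEEDS` setting available in recent Scrapy releases (the file name and options are illustrative):
```python
# settings.py (sketch): write scraped items to a JSON Lines file on each crawl
FEEDS = {
    'output.jsonl': {
        'format': 'jsonlines',
        'overwrite': True,
    },
}
```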
## Middleware and Pipelines
- **Middlewares** allow you to add custom processing to requests and responses.
- **Item Pipelines** allow you to process and filter the items returned by your spiders. They are defined in `pipelines.py` and must be activated in your project's `settings.py`, as sketched below.
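As an illustration, here is a minimal pipeline sketch that drops items missing a `name` field (the class name and rule are hypothetical, not part of the generated project):
```python
# pipelines.py (sketch)
from scrapy.exceptions import DropItem


class RequireNamePipeline:
    def process_item(self, item, spider):
        # Discard items scraped without a 'name'; pass everything else through.
        if not item.get('name'):
            raise DropItem(f"Missing name in {item!r}")
        return item
```
Activating it means listing it in `settings.py` with a priority (lower numbers run first):
```python
# settings.py (sketch)
ITEM_PIPELINES = {
    'myproject.pipelines.RequireNamePipeline': 300,
}
```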
## Scrapy Shell
Scrapy provides an interactive shell for trying out selectors without running a spider. It's very useful for developing and debugging:
```shell
scrapy shell 'http://example.com'
```
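Inside the shell, the fetched page is available as `response`, so selectors can be tried interactively; a few typical calls (the selectors themselves are just examples, not tied to any particular page):
```python
# Inside the Scrapy shell (a Python prompt with `response` already populated)
response.css('title::text').get()      # first match as a string, or None
response.xpath('//a/@href').getall()   # every match as a list of strings
view(response)                         # shell helper: open the page in a browser
```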
`Scrapy` is built for efficiency: it handles multiple requests concurrently, which makes it significantly faster than hand-written sequential scraping scripts. This guide covers the basics, but `Scrapy`'s capabilities extend to a broad range of advanced features, including logging, extensions for deep customization, and built-in support for exporting scraped data in various formats.
Scrapy's architecture is built around "spiders," which are self-contained crawlers given a set of instructions. Combined with its powerful selector and item facilities, Scrapy is suited to complex web scraping tasks, from simple website crawls to large-scale scraping operations.