Scrapy is an open-source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way. It's widely used for web scraping and has built-in support for extracting data from HTML/XML sources using XPath and CSS selectors. Scrapy is designed with a focus on crawling websites and extracting structured data from their pages. Here's a concise reference guide for common use cases with Scrapy:

Scrapy Reference Guide

Installation

pip install scrapy

Creating a Scrapy Project

To start a new Scrapy project, navigate to the directory where you want your project to be and run:

scrapy startproject myproject

This command creates a myproject directory with the following structure:

myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
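
Once inside the project directory, Scrapy can also generate a spider skeleton for you with the genspider command (the spider name and domain below are placeholders):

cd myproject
scrapy genspider example example.com    # creates myproject/spiders/example.py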

Defining Items

Items are models that define the structure of the scraped data. They are defined in items.py:

import scrapy

class MyItem(scrapy.Item):
    name = scrapy.Field()
    description = scrapy.Field()
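
Items behave like dictionaries with a fixed set of allowed keys. A minimal usage sketch (the values are made up for illustration):

item = MyItem(name='Example', description='A sample item')
item['name']        # fields are read and written with dict-style access
dict(item)          # an item can be converted to a plain dict, e.g. for export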

Writing Spiders

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = [
        'http://example.com',
    ]

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
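
Spiders can also follow links and schedule further requests from their callbacks. A minimal sketch extending the parse method above (the a.next selector is an assumption about the target site's pagination markup):

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # Follow the "next page" link, if present, and parse it with this
        # same callback.
        next_page = response.css('a.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)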

Running Spiders

To run a spider, use the scrapy crawl command followed by the spider's name:

scrapy crawl my_spider
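
Spiders can also be run from a plain Python script instead of the command line. A rough sketch using Scrapy's CrawlerProcess (the import path for MySpider is an assumption based on the project layout above):

from scrapy.crawler import CrawlerProcess

from myproject.spiders.my_spider import MySpider  # assumed module path

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
process.crawl(MySpider)
process.start()   # blocks until the crawl is finished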

Storing the Scraped Data

The simplest way to store the scraped data is by using the -o option in the scrapy crawl command, which will generate a file containing the scraped data in your chosen format:

scrapy crawl my_spider -o output.json
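
The export format is inferred from the file extension (JSON, CSV, XML, and others are supported). Roughly the same result can be configured in settings.py via the FEEDS setting; a sketch, assuming Scrapy 2.1 or later:

FEEDS = {
    'output.json': {
        'format': 'json',
        'overwrite': True,   # the 'overwrite' option requires Scrapy 2.4+
    },
}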

Middleware and Pipelines

  • Middlewares (downloader and spider middlewares) let you add custom processing to requests and responses as they pass through the engine.
  • Item Pipelines let you process, validate, and filter the items returned by your spiders. They are defined in pipelines.py and must be activated in your project's settings.py, as shown in the sketch below.
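
A rough sketch of a pipeline that drops items missing a name, together with the settings.py entry that activates it (the class name and the priority value 300 are arbitrary choices):

# pipelines.py
from scrapy.exceptions import DropItem

class RequireNamePipeline:
    def process_item(self, item, spider):
        # Discard items without a 'name' field; pass everything else through.
        if not item.get('name'):
            raise DropItem("missing name")
        return item

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.RequireNamePipeline': 300,   # lower numbers run earlier
}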

Scrapy Shell

Scrapy provides an interactive shell for trying out selectors without running a spider. It's very useful for developing and debugging:

scrapy shell 'http://example.com'
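
Inside the shell, a response object for the fetched page is available, so selectors can be tried interactively. A brief sketch of the kind of session you might run (the second URL is just a placeholder):

>>> response.css('title::text').get()       # first matching text node, or None
>>> response.xpath('//a/@href').getall()    # every match, as a list of strings
>>> fetch('http://example.com/page2')       # load another URL into the same shell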

Scrapy is built for efficiency: it handles many requests concurrently, which makes it significantly faster than hand-written sequential scraping scripts. This guide covers the basics, but Scrapy's capabilities extend to a broad range of advanced features, including logging, extensions for deep customization, and built-in support for exporting scraped data in various formats.

Scrapy's architecture is built around "spiders": self-contained crawlers that are given a set of instructions. Combined with its powerful selector and item facilities, this makes Scrapy well suited to everything from simple website crawls to large-scale scraping operations.