Scrapy exporter for Big Data formats
Jörn Franke 6d0d1675df feat: code formatting using black 3 months ago

README.rst

scrapy-contrib-bigexporters

Overview

scrapy-contrib-bigexporters provides additional exporters for the web crawling and scraping framework Scrapy (https://scrapy.org).

The following big data formats are supported:

  • Avro
  • ORC
  • Parquet

Requirements

  • Python 3.6+
  • Scrapy 2.4+
  • Works on Linux, Windows, macOS, BSD
  • Parquet export requires fastparquet 0.4.1+
  • Avro export requires fastavro 1.1.0+
  • ORC export requires pyorc 0.4.0+

Install

The quick way (pip):

pip install scrapy-contrib-bigexporters

Alternatively, you can install it from conda-forge:

conda install -c conda-forge scrapy-contrib-bigexporters

Depending on which format you want to use, you need to install one or more of the following libraries.

Avro:

pip install fastavro

ORC:

pip install pyorc

Parquet:

pip install fastparquet

Additional libraries may be needed for specific compression algorithms. See "Use".
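To see which of the optional libraries above are present in the current environment, a small check like the following may help. It is only a sketch: the helper name `available_backends` and the mapping `OPTIONAL_BACKENDS` are hypothetical and not part of scrapy-contrib-bigexporters; the module names are the ones from the pip commands above.

```python
from importlib.util import find_spec

# Format name -> module installed by the pip commands listed above.
# This mapping is illustrative and not provided by the library itself.
OPTIONAL_BACKENDS = {
    "avro": "fastavro",
    "orc": "pyorc",
    "parquet": "fastparquet",
}


def available_backends():
    """Return {format: bool}, True when the backing module can be found."""
    return {
        fmt: find_spec(module) is not None
        for fmt, module in OPTIONAL_BACKENDS.items()
    }


if __name__ == "__main__":
    for fmt, ok in available_backends().items():
        print(f"{fmt}: {'installed' if ok else 'missing'}")
```

The check only looks for the module and does not import it or verify the minimum versions listed under Requirements.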

Use

Using the library is simple. Install it alongside your Scrapy project as described above. You then only need to configure the exporter in the Scrapy settings and run your scraper; the data is exported in the desired format. No additional development is needed.
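As a rough illustration of what such a configuration might look like, the sketch below uses Scrapy's standard FEED_EXPORTERS and FEEDS settings for a Parquet export. The dotted class path is an assumption inferred from the repository's zuinnote package directory, and the output file name is made up; check both, along with any exporter-specific options, against the project's documentation.

```python
# settings.py (sketch): the values below are assumptions, not verified defaults.

# Register the exporter under a format name.
# ASSUMPTION: the dotted path is inferred from the "zuinnote" package
# directory in the repository; confirm it in the project's documentation.
FEED_EXPORTERS = {
    "parquet": "zuinnote.scrapy.contrib.bigexporters.ParquetItemExporter",
}

# Standard Scrapy FEEDS setting: output URI -> feed options.
# The file name pattern is a hypothetical example.
FEEDS = {
    "data-%(name)s-%(time)s.parquet": {
        "format": "parquet",  # must match a key in FEED_EXPORTERS
        "encoding": "utf8",
        "store_empty": False,
    },
}
```

With settings along these lines in place, running the spider as usual (scrapy crawl followed by the spider name) should write the feed in the configured format.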

See here for configuring the exporter in settings:

Source

The source is available at: