TPB_scrapy

Description

Scraping The Pirate Bay's top 100 ebooks, and slowly seeing patterns emerge 🍆

Using the scrapy framework, tor proxies and mariadb.

Dependencies

Install through your package manager:

  • docker
  • docker-compose
  • tor
  • mariadb
  • python-pip

Installation

Arch Linux / Manjaro:

sudo pacman -S docker docker-compose tor python-pip mysql

Make sure you have virtualenv installed:

python -m pip install virtualenv

Setup:

For LAMP installation, please refer to this LAMP guide from the Manjaro forum. It goes through all the necessary steps to install Apache, MariaDB, PHP and PHPMyAdmin.

Start and enable the docker and tor services:

sudo systemctl start docker.service
sudo systemctl enable docker.service
sudo systemctl start tor.service
sudo systemctl enable tor.service
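To confirm that the tor service is actually listening after these commands, a small check along the following lines can help. This is a minimal sketch, not part of the project: `port_is_open` is a hypothetical helper, and it assumes tor uses its default SOCKS port 9050 on localhost (a different `SocksPort` in your torrc would change that).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: tor listens on its default SOCKS port 9050.
    print("tor SOCKS port reachable:", port_is_open("127.0.0.1", 9050))
```

The check only proves that something accepts TCP connections on that port; it does not verify that traffic is really routed through tor.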

Clone the repository:

git clone https://gitlab.com/Zipperflunky/tpb_scrapy.git

Then create the virtualenv and install the packages from requirements.txt in it:

cd tpb_scrapy
virtualenv venv
source venv/bin/activate
python -m pip install -r requirements.txt

Pull the docker-compose images and start the tor-proxy containers:

sudo docker-compose pull && sudo docker-compose up -d
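Once the proxy containers are up, Scrapy has to be told about them. The sketch below shows how the scrapy-rotating-proxies package from requirements.txt is typically wired into a project's settings.py. The middleware paths and priorities come from that package's documentation; the host, starting port and number of proxies are placeholder assumptions, and `build_proxy_list` is a hypothetical helper, so adjust them to whatever your docker-compose.yml actually exposes. Note that Scrapy expects HTTP proxies here, so this assumes the containers expose an HTTP proxy port in front of tor.

```python
from typing import List


def build_proxy_list(host: str, first_port: int, count: int) -> List[str]:
    """Build 'host:port' entries for consecutive ports (hypothetical helper)."""
    return [f"{host}:{first_port + i}" for i in range(count)]


# Assumption: three proxy containers on consecutive local ports starting at 8118.
ROTATING_PROXY_LIST = build_proxy_list("127.0.0.1", 8118, 3)

# Middleware names and priorities as given in the scrapy-rotating-proxies docs.
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```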

Always activate the virtualenv with source venv/bin/activate before running any Python scripts!

Usage

Run this to start crawling:

cd tpb && scrapy crawl tpbspider

Use this to run tpbspider.py on its own, without the mariadb pipeline:

cd tpb/spiders && scrapy runspider tpbspider.py -o output.json
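To get a quick overview of what ended up in output.json, something like the following can be used. It is a minimal sketch that only assumes the file is the JSON array of item objects that Scrapy's JSON feed export writes; it makes no assumption about the item fields, and `summarize_items` is a hypothetical helper, not part of the project.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_items(path):
    """Return (item count, {field name: number of items containing it})."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    # Count how many items carry each field, to spot missing values.
    field_counts = Counter(key for item in items for key in item)
    return len(items), dict(field_counts)


if __name__ == "__main__":
    if Path("output.json").exists():
        count, fields = summarize_items("output.json")
        print(f"{count} items scraped")
        for name, seen in sorted(fields.items()):
            print(f"  {name}: present in {seen} items")
```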

Support

Just open an issue.

Roadmap

  • import and install sql db
  • add sql sub routine
  • add epub generation for the almanac

Contributing

Just create an issue that states what you want to modify. I'll review that and we can discuss it there. Alternatively, you can open a merge request.

License

TPB_scrapy is licensed under the GPL, version 3 or later.