Downloader for Wikileaks' Leaked DNC Emails
Zane van Iperen 4f5bcec544
Add new method for downloading emails
8 months ago
old Add new method for downloading emails 8 months ago
.gitignore Add new method for downloading emails 8 months ago
LICENSE add WikileaksEmailDownloader + licenses 1 year ago
README.md Add new method for downloading emails 8 months ago
WikileaksEmailDownloader.py add WikileaksEmailDownloader + licenses 1 year ago
clinton-emails.metalink Add new method for downloading emails 8 months ago
dnc-emails.metalink Add new method for downloading emails 8 months ago
dncdownload.sh add WikileaksEmailDownloader + licenses 1 year ago
hash-files.py Add new method for downloading emails 8 months ago
metagen.py Add new method for downloading emails 8 months ago
podesta-emails.metalink Add new method for downloading emails 8 months ago
schema.sql Add new method for downloading emails 8 months ago
urlscrape.mt2.py Add new method for downloading emails 8 months ago
wikileaks.db.zst Add new method for downloading emails 8 months ago
wikileaks.py Add new method for downloading emails 8 months ago

README.md

Wikileaks DNC/Podesta/Clinton Email Downloader

Scripts that download DNC, Podesta, and Clinton emails from Wikileaks into their original format so they can be loaded into an email client for further perusal.

Legacy Version(s)

  • dncdownload.sh - The original version, written for bash. Only supports DNC.
  • WikileaksEmailDownloader.py - The second version, written for Python3. Supports DNC + Podesta.

Downloading

This repository contains pregenerated metalink files for each set of emails. Use aria2 to download them.

Emails are written to their respective {dnc,podesta,clinton}-emails subdirectory. DNC and Podesta emails have their zero-padded ID prefixed to the file name, because some emails share the same name.
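
As an illustration of the naming scheme described above, here is a minimal sketch of how a zero-padded ID prefix keeps duplicate file names apart. The function name `prefixed_name`, the padding width of 5, and the underscore separator are assumptions made for this example, not necessarily what the scripts in this repository use.

```python
def prefixed_name(email_id: int, original_name: str, width: int = 5) -> str:
    """Prefix a zero-padded ID so that emails sharing a name stay distinct.

    The width and the "_" separator are hypothetical choices for this sketch.
    """
    return f"{email_id:0{width}d}_{original_name}"


# Two emails with the same original name end up with different file names.
print(prefixed_name(7, "Re: Lunch.eml"))
print(prefixed_name(1234, "Re: Lunch.eml"))
```

Zero-padding also means a plain lexicographic sort of the directory matches the numeric ID order.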

To download

$ aria2c \
    --save-session=dnc.session.aria2 \
    --save-session-interval=10 \
    --continue=true \
    --max-concurrent-downloads=50 \
    --max-tries=0 \
    --retry-wait=5 \
    --allow-overwrite=true \
    --always-resume=true \
    --auto-file-renaming=false \
    dnc-emails.metalink # or podesta-emails.metalink or clinton-emails.metalink

To resume

$ aria2c \
    --save-session=dnc.session.aria2 \
    --save-session-interval=10 \
    --continue=true \
    --max-concurrent-downloads=50 \
    --max-tries=0 \
    --retry-wait=5 \
    --allow-overwrite=true \
    --always-resume=true \
    --auto-file-renaming=false \
    -i dnc.session.aria2
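
The two invocations above share every option and differ only in their final argument: the metalink file for a fresh download, or `-i` with the saved session file when resuming. The sketch below builds both argument lists from one place. The helper name `aria2c_command` is hypothetical, and it assumes the session file for the other sets is named after the set in the same way as `dnc.session.aria2`.

```python
import shlex

# Options common to the initial download and the resume invocation.
COMMON_OPTIONS = [
    "--save-session-interval=10",
    "--continue=true",
    "--max-concurrent-downloads=50",
    "--max-tries=0",
    "--retry-wait=5",
    "--allow-overwrite=true",
    "--always-resume=true",
    "--auto-file-renaming=false",
]


def aria2c_command(name: str, resume: bool = False) -> list[str]:
    """Return the aria2c argument list for one email set (dnc, podesta, clinton)."""
    # Assumption: every set uses "<name>.session.aria2" as its session file.
    session = f"{name}.session.aria2"
    cmd = ["aria2c", f"--save-session={session}", *COMMON_OPTIONS]
    if resume:
        # Resume: feed the saved session back in as the input file.
        cmd += ["-i", session]
    else:
        # Fresh download: start from the pregenerated metalink file.
        cmd.append(f"{name}-emails.metalink")
    return cmd


# Print the resume command in a form that can be pasted into a shell.
print(shlex.join(aria2c_command("dnc", resume=True)))
```

The returned list can be passed directly to `subprocess.run` if you prefer to launch aria2 from Python.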

To regenerate the metalink files yourself, use metagen.py <dnc|podesta|clinton>. This requires a fully populated wikileaks.db. A compressed copy is provided in the repository as wikileaks.db.zst. If you'd like to generate the database from scratch, continue reading.
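
For readers unfamiliar with the format, the following sketch shows roughly what generating a Metalink 4 document involves. It is not the code in metagen.py: it takes plain dictionaries rather than reading wikileaks.db (whose schema is not shown here), the function name `build_metalink` is hypothetical, the URL is a placeholder, and SHA-256 is an assumed hash type. Entries without a size or hash produce the bare "stage 1" form described later in this README.

```python
import xml.etree.ElementTree as ET

# Namespace of the Metalink 4 format (RFC 5854).
METALINK_NS = "urn:ietf:params:xml:ns:metalink"


def build_metalink(entries) -> str:
    """Build a Metalink 4 document from dicts with 'name', 'url',
    and optional 'size' and 'sha256' keys."""
    root = ET.Element("metalink", {"xmlns": METALINK_NS})
    for entry in entries:
        file_el = ET.SubElement(root, "file", {"name": entry["name"]})
        # Size and hash are only known after the files have been downloaded once.
        if entry.get("size") is not None:
            ET.SubElement(file_el, "size").text = str(entry["size"])
        if entry.get("sha256"):
            ET.SubElement(file_el, "hash", {"type": "sha-256"}).text = entry["sha256"]
        ET.SubElement(file_el, "url").text = entry["url"]
    return ET.tostring(root, encoding="unicode")


# Placeholder entry: the name and URL are illustrative only.
print(build_metalink([
    {"name": "dnc-emails/00001_example.eml", "url": "https://example.org/1.eml"},
]))
```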

Creating wikileaks.db

This is a bit of a painful process:

  1. Create the database.

    $ sqlite3 wikileaks.db < schema.sql
    
  2. Scrape the email metadata (filenames, etc.).

    $ ./urlscrape.mt2.py
    

    This will take a while. Wikileaks frequently responds with 503/504 errors, so be patient. If interrupted, the script will pick up where it left off.

  3. Generate "stage 1" metalinks.

    $ ./metagen.py dnc > dnc.stage1.metalink
    $ ./metagen.py podesta > podesta.stage1.metalink
    $ ./metagen.py clinton > clinton.stage1.metalink
    

    These are metalink files without file sizes or hashes, only URLs and names.

  4. Download the files. This is the most fragile part, because the stage 1 metalinks contain no sizes or hashes to verify the downloads against.

    $ aria2c \
       --save-session=dnc.session.aria2 \
       --save-session-interval=10 \
       --continue=true \
       --max-concurrent-downloads=50 \
       --max-tries=0 \
       --retry-wait=5 \
       --allow-overwrite=true \
       --always-resume=true \
       --auto-file-renaming=false \
       dnc.stage1.metalink # also do podesta and clinton
    
  5. Hash the downloaded files.

    $ ./hash-files.py
    

wikileaks.db should now contain all the information required to generate the completed metalink files.
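
To make the hashing step concrete, here is a minimal sketch of computing the size and digest of each downloaded file. It is not the code in hash-files.py: the function names are hypothetical, SHA-256 is an assumed choice of hash, and the sketch returns the results instead of writing them to wikileaks.db.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1 << 20) -> tuple[int, str]:
    """Return (size in bytes, SHA-256 hex digest), reading the file in chunks."""
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as fh:
        # Read fixed-size chunks so large attachments are never held in memory whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def hash_directory(directory: Path) -> dict[str, tuple[int, str]]:
    """Hash every regular file directly inside `directory`, keyed by file name."""
    return {
        p.name: hash_file(p)
        for p in sorted(directory.iterdir())
        if p.is_file()
    }
```

Calling `hash_directory(Path("dnc-emails"))` after a download would give the per-file sizes and digests that a completed metalink file needs.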

License

CC0-1.0.