Reddit has removed the timestamp search feature that timesearch was originally built on. Please message the admins by sending a PM to /r/reddit.com. Let them know that this feature is important to you, and that you would like them to restore it on the new search stack.
Thankfully, Jason Baumgartner aka /u/Stuck_in_the_Matrix, owner of Pushshift.io, has made it easy to interact with his dataset. Timesearch now queries his API to get post data, and then uses reddit's /api/info endpoint to get up-to-date information about those posts (scores, edited text bodies, ...). While we're at it, this also speeds up `get_comments`. In addition, we can get all of a user's comments, which was not possible through reddit alone.
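As a rough sketch of how that two-step pattern fits together: fetch historical IDs from Pushshift, then refresh them in batches through reddit's /api/info endpoint, which accepts up to 100 fullnames per request. The helper below only builds the request parameters; the endpoint URLs are the public ones, but timesearch's actual internals may differ.

```python
# Sketch of the Pushshift -> /api/info refresh pattern described above.
# Endpoint URLs are the public ones; everything else is illustrative.

PUSHSHIFT_URL = 'https://api.pushshift.io/reddit/search/submission/'
INFO_URL = 'https://www.reddit.com/api/info.json'

def info_batches(submission_ids, batch_size=100):
    '''
    /api/info takes up to 100 fullnames per request, so convert bare
    base36 submission IDs into "t3_" fullnames and yield comma-joined
    groups suitable for the endpoint's `id` parameter.
    '''
    fullnames = ['t3_' + sid for sid in submission_ids]
    for index in range(0, len(fullnames), batch_size):
        yield ','.join(fullnames[index:index + batch_size])
```

Each yielded string can then be passed as the `id` query parameter of a GET request to `INFO_URL` to pull current scores and text bodies.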
NOTE: Because Pushshift is an independent dataset run by a regular person, it does not contain posts from private subreddits. Without the timestamp search parameter, scanning private subreddits is now impossible. I urge you once again to contact ~~your senator~~ the admins to have this feature restored.
I don't have a test suite. You're my test suite! Messages go to /u/GoldenSights.
Timesearch is a collection of utilities for archiving subreddits.
`pip install -r requirements.txt` to get them all.
Create the app as `script` type, and set the redirect URI to `http://localhost:8080`. The title and description can be anything you want, and the about URL is not required. Choose `all` for the scopes.
`bot.py`: fill out the variables using your OAuth information, and read the instructions to see where to put the file. The useragent is a description of your API usage; typically "/u/username's praw client" is sufficient.
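For orientation, a `bot.py` along these lines could hold the credentials; the variable names here are illustrative, not necessarily the exact ones timesearch's instructions use. This is a config sketch, not runnable without your own credentials.

```python
# Hypothetical bot.py layout -- fill in the values from your reddit
# OAuth app. Variable names are assumptions for illustration.
import praw

USERAGENT = "/u/yourusername's praw client"

def login():
    # Script-type apps authenticate with the password flow.
    return praw.Reddit(
        client_id='your_app_id',
        client_secret='your_app_secret',
        username='yourusername',
        password='yourpassword',
        user_agent=USERAGENT,
    )
```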
`get_submissions`: If you try to page through `/new` on a subreddit, you'll hit a limit at or before 1,000 posts. Timesearch uses the Pushshift.io dataset to get information about very old posts, and then queries the reddit API to update their information. Previously, we used the `timestamp` cloudsearch query parameter on reddit's own API, but reddit has removed that feature and Pushshift is now the only viable source for initial data.
python timesearch.py get_submissions -r subredditname <flags>
python timesearch.py get_submissions -u username <flags>
`get_comments`: Similar to `get_submissions`, this tool queries Pushshift for comment data and updates it from reddit.
python timesearch.py get_comments -r subredditname <flags>
python timesearch.py get_comments -u username <flags>
`livestream`: `get_submissions` + `get_comments` is great for starting your database and getting the historical posts, but it's not the best for staying up to date. Instead, livestream monitors `/comments` to continuously ingest data.
python timesearch.py livestream -r subredditname <flags>
python timesearch.py livestream -u username <flags>
get_styles: Downloads the stylesheet and CSS images.
python timesearch.py get_styles -r subredditname
get_wiki: Downloads the wiki pages, sidebar, etc. from /wiki/pages.
python timesearch.py get_wiki -r subredditname
offline_reading: Renders comment threads into HTML via markdown.
Note: I'm currently using the markdown library from PyPI, and it doesn't handle reddit's custom markdown like `/u/` links, obviously. So far, I don't think anybody really uses offline_reading, so I haven't invested much time into improving it.
python timesearch.py offline_reading -r subredditname <flags>
python timesearch.py offline_reading -u username <flags>
`index`: Generates plaintext or HTML lists of submissions, sorted by a property of your choosing. You can order by date, author, flair, etc. With the `--offline` parameter, you can make all the links point to the files you generated with `offline_reading`.
python timesearch.py index -r subredditname <flags>
python timesearch.py index -u username <flags>
breakdown: Produces a JSON file indicating which users make the most posts in a subreddit, or which subreddits a user posts in.
python timesearch.py breakdown -r subredditname
python timesearch.py breakdown -u username
merge_db: Copy all new data from one timesearch database into another. Useful for syncing or merging two scans of the same subreddit.
python timesearch.py merge_db --from filepath/database1.db --to filepath/database2.db
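Conceptually, the merge described above is a copy of rows that don't already exist in the destination. A minimal sqlite sketch of that idea is below, assuming both files share a `submissions` table keyed by an `idstr` column; the real timesearch schema and table names are internals and may differ, so use the actual `merge_db` command rather than this.

```python
# Sketch of a "copy only new rows" merge between two sqlite databases.
# Assumes a table named "submissions" with a primary key -- an assumption
# about the schema, not timesearch's actual implementation.
import sqlite3

def merge_submissions(from_path, to_path):
    con = sqlite3.connect(to_path)
    con.execute('ATTACH DATABASE ? AS other', [from_path])
    # INSERT OR IGNORE skips rows whose primary key already exists
    # in the destination, so only new data is copied.
    con.execute('INSERT OR IGNORE INTO submissions SELECT * FROM other.submissions')
    con.commit()
    con.close()
```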
When you download this project, the main file that you will execute is `timesearch.py` in the root directory. It will load the appropriate module to run your command from the `timesearch_modules` folder.
You can view a summarized version of all the help text by running `timesearch.py`, and you can view a specific help text by running a command with no arguments, like `timesearch.py livestream`, etc.
I recommend sqlitebrowser if you want to inspect the database yourself.
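If you'd rather poke at the file from Python, the standard library's sqlite3 module works too. Here's a minimal sketch, assuming a `submissions` table with an `author` column; the actual timesearch schema may name things differently, so check the tables with sqlitebrowser first.

```python
# Count rows per author, most prolific first. The table and column
# names below are assumptions about the timesearch schema.
import sqlite3

def top_authors(db_path, limit=5):
    con = sqlite3.connect(db_path)
    query = '''
    SELECT author, COUNT(*) AS posts
    FROM submissions
    GROUP BY author
    ORDER BY posts DESC
    LIMIT ?
    '''
    rows = con.execute(query, [limit]).fetchall()
    con.close()
    return rows
```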
2020 01 27
`redmash`. Well, since the timesearch toolkit is meant to be a singular, cohesive package now, I decided to finally rename everything. I believe I have aliased everything properly so the old names still work for backwards compatibility, except that the modules folder is now called `timesearch_modules`, which may break your import statements if you ever imported it on your own.
2018 04 09
2017 11 13
2017 11 05
2017 11 04
2017 10 12
Added the `mergedb` utility for combining databases.
2017 06 02
You can now use `commentaugment -s abcdef` to get a particular thread even if you haven't scraped anything else from that subreddit. Previously, `-s` only worked if the database already existed and you specified it via `-r`. Now it is inferred from the submission itself.
2017 04 28
2016 08 10
2016 07 03
2016 07 02
2016 06 07
2016 06 05
Use the `migrate_20160605.py` script to convert old databases into new ones.
2015 11 11
Added `offline_reading.py`, which converts a timesearch database into a comment tree that can be rendered into HTML.
2015 09 07
Fixed a bug that caused `livestream` to crash because `bot.refresh()` was outside of the try-catch.
2015 08 19
I want to live in a future where everyone uses UTC and agrees on daylight savings.