Quick code for building and optimizing topic models in Python. Uses spaCy and gensim.

A few scripts that I cobbled together to build and evaluate quick topic models and document corpora using Gensim, starting with documents in the Sciveyor JSON schema format.


All of these scripts accept required and optional configuration parameters as command-line arguments. Run any script with --help for more information about its configuration.

Build Corpora and Models

build-corpus: Takes all JSON files in the current folder and processes them into a Gensim corpus, using spaCy.

Articles will be skipped if:

  • the langdetect package believes that they are non-English
  • they have no full-text at all

Tokens will be filtered out if:

  • they are less than 3 characters in length
  • they are stop-words
  • they contain any non-alpha characters
  • they are not nouns, verbs, adjectives, adverbs, proper nouns, or foreign words, as tagged by spaCy's part-of-speech tagger
  • they appear only one time in the entire corpus

Tokens that survive these filters are then lemmatized and lowercased before being saved.
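
The filtering and normalization rules above can be sketched as a predicate over spaCy-style token attributes. This is an illustrative reconstruction of the criteria, not the script's actual code; the Token class is a hypothetical stand-in for spaCy's token objects, and applying the length filter to the surface form (rather than the lemma) is an assumption.

```python
from dataclasses import dataclass

# POS tags kept, per the list above; "X" is spaCy's tag for foreign words.
KEPT_POS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN", "X"}

@dataclass
class Token:
    """Hypothetical stand-in for a spaCy token."""
    text: str      # surface form
    lemma: str     # lemmatized form
    pos: str       # coarse part-of-speech tag
    is_stop: bool  # stop-word flag

def keep(token, corpus_freq):
    """Return True if the token survives all of the filters above.
    corpus_freq maps normalized tokens to corpus-wide counts."""
    return (
        len(token.text) >= 3                    # assumption: length of surface form
        and not token.is_stop
        and token.text.isalpha()                # no non-alpha characters
        and token.pos in KEPT_POS
        and corpus_freq.get(token.lemma.lower(), 0) > 1  # drop hapaxes
    )

def normalize(token):
    """Lemmatize and lowercase a surviving token."""
    return token.lemma.lower()
```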

If run on an already-generated corpus, this will check all the relevant files and print out statistics.

build-models: Builds topic models on the generated corpus in the current directory. Twenty-nine models are created: one at each size from 2 through 25, and one at each size from 50 through 150 in steps of 25. Each model is then evaluated using the C_v coherence measure, and the coherence scores are printed to the console.
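
The set of model sizes can be reconstructed as follows; model_sizes is a hypothetical helper, not taken from build-models.

```python
def model_sizes():
    """Topic counts evaluated by build-models: every size from 2 through 25,
    then 50 through 150 in steps of 25 -- 29 sizes in total. Each model
    would then be scored with gensim's CoherenceModel (coherence="c_v");
    that step is omitted here."""
    return list(range(2, 26)) + list(range(50, 151, 25))
```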

build-document-topics: Builds a matrix representation of the probabilities for each topic in each document. This matrix is an important ingredient in many analyses below, so we precalculate it separately.
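
Gensim reports a document's topics as a sparse list of (topic id, probability) pairs; the precalculation amounts to densifying those lists into matrix rows. A pure-Python sketch (dense_rows is illustrative, not the script's actual code):

```python
def dense_rows(sparse_docs, num_topics):
    """Convert gensim-style sparse [(topic_id, prob), ...] lists into
    dense rows of length num_topics, one row per document."""
    rows = []
    for doc in sparse_docs:
        row = [0.0] * num_topics
        for topic_id, prob in doc:
            row[topic_id] = prob
        rows.append(row)
    return rows
```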

Visualize Models

vis-pyldavis: Builds a pyLDAvis HTML file for the requested topic model.

vis-umap: Builds an interactive graph of the UMAP embedding of the documents in the topic model.

Explore Models

corpus-find-word <word-part> [<word-part>...]: Prints out all types in the dictionary containing the given search strings.
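
A minimal sketch of the lookup, assuming a substring match where any one matching search string suffices (find_types and its inputs are hypothetical):

```python
def find_types(dictionary_tokens, parts):
    """Return all dictionary types containing any of the search strings."""
    return sorted(t for t in dictionary_tokens if any(p in t for p in parts))
```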

docs-for-topics: Prints out the top documents for each topic in the corpus. By default, it will print a pretty-formatted citation for each document, though you may request that another field be printed for each document instead.

docs-for-word-in-topic --word word --topic topic [topic...]: Ranks and prints the top documents, scored by the probability of the given topics multiplied by the number of occurrences of the given word.
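
The ranking criterion can be sketched as follows. The function and data layout are hypothetical, and summing probabilities across the requested topics is an assumption:

```python
def rank_docs(doc_topic_probs, word_counts, topics):
    """Score each document by (summed probability of the given topics)
    * (occurrences of the query word), highest first.

    doc_topic_probs: {doc_id: [p_topic0, p_topic1, ...]}
    word_counts:     {doc_id: occurrences of the query word}
    """
    scores = {
        doc: sum(probs[t] for t in topics) * word_counts.get(doc, 0)
        for doc, probs in doc_topic_probs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```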

docs-like-words <word> [<word>...]: Finds the documents most similar to the list of words passed, using a cosine-similarity measure (notably, without using any generated topic model). This has fairly large memory requirements.
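
Cosine similarity over bag-of-words vectors can be sketched in pure Python (illustrative only; the script itself works on gensim corpus structures):

```python
import math

def cosine(a, b):
    """Cosine similarity between two {term: count} vectors."""
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def docs_like_words(docs, words, top_n=5):
    """Rank documents by cosine similarity to a bag of query words.
    docs: {doc_id: {term: count}}."""
    query = {w: 1 for w in words}
    ranked = sorted(docs.items(), key=lambda kv: cosine(kv[1], query),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_n]]
```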

topics-for-words <word> [word...]: Prints out the probabilities of each word's occurrence in each topic.

topics-over-time: Loads the dates for all documents and prints out the prevalence of each topic in the corpus over time. The years are clustered into buckets of a given size, usually five years.
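
The bucketing step can be sketched as follows; the function and data layout are hypothetical, with the usual five-year bucket width as the default:

```python
from collections import defaultdict

def topic_prevalence_over_time(doc_years, doc_topics, num_topics, bucket=5):
    """Sum each topic's probability within fixed-width year buckets.

    doc_years:  {doc_id: publication year}
    doc_topics: {doc_id: [p_topic0, p_topic1, ...]}
    """
    buckets = defaultdict(lambda: [0.0] * num_topics)
    for doc, year in doc_years.items():
        start = (year // bucket) * bucket  # e.g. 1987 -> 1985 when bucket=5
        for t, p in enumerate(doc_topics[doc]):
            buckets[start][t] += p
    return dict(buckets)
```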

words-for-topics: Prints out the top N words for each topic in the model.

Analyze Models

authors-topics: Computes the summed topic probability for each author present in the corpus.
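
A sketch of the aggregation, with hypothetical data structures:

```python
from collections import defaultdict

def author_topic_sums(doc_authors, doc_topics, num_topics):
    """Sum topic probabilities over each author's documents.

    doc_authors: {doc_id: [author, ...]}
    doc_topics:  {doc_id: [p_topic0, p_topic1, ...]}
    """
    totals = defaultdict(lambda: [0.0] * num_topics)
    for doc, authors in doc_authors.items():
        for author in authors:
            for t, p in enumerate(doc_topics[doc]):
                totals[author][t] += p
    return dict(totals)
```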

Dynamic Topic Models

The dynamic folder contains versions of (at least some of) these scripts that work on dynamic topic models. These require that the Blei et al. binary for dynamic topic modeling be installed; see dynamic/dtmmodel.py for more information.

In general, those scripts have the same names and functions as above; the following are novel:

most-different-words <model.gensim> <topic> <year_1> <year_2>: Prints out the words that changed the most (both increases and decreases) within the given topic between the two years.
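
The comparison amounts to diffing the topic's word distribution between the two years. A pure-Python sketch (names are illustrative; the real script reads these distributions from the DTM output):

```python
def most_different_words(dist_year1, dist_year2, n=10):
    """Words with the largest absolute probability change within a topic
    between two years, signed deltas included. dist_*: {word: probability}."""
    words = set(dist_year1) | set(dist_year2)
    deltas = {w: dist_year2.get(w, 0.0) - dist_year1.get(w, 0.0)
              for w in words}
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]
```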


The requirements.txt file here lists all of the required Python packages; install them with pip install -r requirements.txt.


Copyright (c) 2022 Charles H. Pence.

Licensed under the GNU GPL v3. See the COPYING file for more details.