Quick-Topic
A few scripts that I cobbled together to build and evaluate quick topic models and document corpora using Gensim, starting with documents in the Sciveyor JSON schema format.
Scripts
All of these scripts accept required and optional configuration parameters as
command-line arguments. Run any script with --help to get more information about
its configuration.
Build Corpora and Models
build-corpus
: Takes all JSON files in the current folder and processes them
into a Gensim corpus, using spaCy.
Articles will be skipped if:
- the langdetect package believes that they are non-English
- they have no full-text at all
Tokens will be filtered out if:
- they are less than 3 characters in length
- they are stop-words
- they contain any non-alpha characters
- they are not nouns, verbs, adjectives, adverbs, proper nouns, or foreign words, as tagged by spaCy's part-of-speech tagger
- they appear only one time in the entire corpus
They are then lemmatized and lowercased before being preserved.
If run on an already-generated corpus, this will check all the relevant files and print out statistics.
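The token rules listed above for build-corpus can be sketched roughly as follows. This is not the script's actual code: `Tok` is a hypothetical stand-in that exposes the same attribute names as a spaCy token (`text`, `lemma_`, `pos_`, `is_stop`, `is_alpha`), the tag set (with `X` covering foreign words) is an assumption, and counting singletons after lemmatization is also an assumption.

```python
from collections import Counter
from dataclasses import dataclass

# Assumption: spaCy's coarse-grained tags, with "X" standing in for foreign words.
KEPT_POS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN", "X"}


@dataclass
class Tok:
    """Hypothetical stand-in exposing the spaCy Token attributes used below."""
    text: str
    lemma_: str
    pos_: str
    is_stop: bool = False

    @property
    def is_alpha(self) -> bool:
        return self.text.isalpha()


def keep(token) -> bool:
    """Apply the per-token rules: length, stop-word, alphabetic, part of speech."""
    return (
        len(token.text) >= 3
        and not token.is_stop
        and token.is_alpha
        and token.pos_ in KEPT_POS
    )


def filter_corpus(docs):
    """Lemmatize and lowercase surviving tokens, then drop corpus-wide singletons."""
    lemmas = [[t.lemma_.lower() for t in doc if keep(t)] for doc in docs]
    counts = Counter(w for doc in lemmas for w in doc)
    return [[w for w in doc if counts[w] > 1] for doc in lemmas]


docs = [
    [Tok("Cells", "cell", "NOUN"), Tok("the", "the", "DET", is_stop=True),
     Tok("divide", "divide", "VERB")],
    [Tok("cell", "cell", "NOUN"), Tok("42", "42", "NUM")],
]
print(filter_corpus(docs))  # [['cell'], ['cell']]
```

Here "divide" is dropped because it occurs only once in the whole (toy) corpus, which is the last rule in the list.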
build-models
: Builds topic models on the generated corpus in the current
directory. Twenty-nine models are created at sizes 2-25 (each size), and from
50-150 (every 25). They are then evaluated using the C_v coherence model, and
coherence scores are dumped to the console.
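As a rough illustration of the size schedule and the coherence check described for build-models, here is a minimal sketch. It assumes the models are Gensim `LdaModel` instances trained with default settings, which the script itself may not do; `coherence_by_size` and its arguments are hypothetical names. Gensim is imported lazily so that the size schedule can be inspected on its own.

```python
def topic_counts():
    """Every size from 2 to 25, then 50 to 150 in steps of 25."""
    return list(range(2, 26)) + list(range(50, 151, 25))


def coherence_by_size(corpus, dictionary, texts):
    """Train one model per size and return {size: C_v coherence}.

    Assumption: LDA via Gensim's LdaModel; the script's real settings may differ.
    `corpus` is a bag-of-words corpus, `dictionary` a Gensim Dictionary, and
    `texts` the tokenized documents (C_v needs the original tokens).
    """
    from gensim.models import CoherenceModel, LdaModel  # imported lazily

    scores = {}
    for k in topic_counts():
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k)
        cm = CoherenceModel(
            model=model, texts=texts, dictionary=dictionary, coherence="c_v"
        )
        scores[k] = cm.get_coherence()
    return scores


print(len(topic_counts()))  # 29
```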
build-document-topics
: Builds a matrix representation of the probabilities for
each topic in each document. This matrix is an important ingredient in many
analyses below, so we precalculate it separately.
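To make the shape of that matrix concrete, here is a small sketch. It assumes each document's topics arrive as a sparse list of `(topic_id, probability)` pairs (the form Gensim's `get_document_topics` returns for a single document) and expands them into dense rows; `document_topic_matrix` is a hypothetical helper, not a function from these scripts.

```python
def document_topic_matrix(doc_topics, num_topics):
    """Expand sparse (topic_id, probability) lists into dense rows.

    Returns one row per document and one column per topic; topics missing
    from a document's sparse list are filled in as 0.0.
    """
    matrix = []
    for pairs in doc_topics:
        row = [0.0] * num_topics
        for topic_id, prob in pairs:
            row[topic_id] = prob
        matrix.append(row)
    return matrix


sparse = [[(0, 0.9), (2, 0.1)], [(1, 1.0)]]
print(document_topic_matrix(sparse, 3))  # [[0.9, 0.0, 0.1], [0.0, 1.0, 0.0]]
```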
Visualize Models
vis-pyldavis
: Builds a pyLDAvis HTML file for the requested topic model.
vis-umap
: Builds an interactive graph of the UMAP embedding of the documents
in the topic model.
Explore Models
corpus-find-word <word-part> [<word-part>...]
: Prints out all types in the
dictionary containing the given search strings.
docs-for-topics
: Prints out the top documents for each topic in the corpus. By
default, it will print a pretty-formatted citation for each document, though you
may request that another field is printed for each document instead.
docs-for-word-in-topic --word word --topic topic [topic...]
: Ranks and prints
the top documents by the probability of the given topics multiplied by the
number of occurrences of the given word.
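The scoring used by docs-for-word-in-topic can be sketched like this. The sketch assumes that, when several topics are given, their probabilities are summed per document before multiplying by the word count; that detail is not stated above, and `rank_documents` is a hypothetical name.

```python
def rank_documents(doc_topic, word_counts, topics, top_n=10):
    """Rank documents by (probability of `topics`) * (occurrences of the word).

    doc_topic:   dense rows, one per document, one column per topic
    word_counts: occurrences of the chosen word in each document
    topics:      topic ids whose probabilities are summed (an assumption)
    """
    scores = []
    for doc_id, row in enumerate(doc_topic):
        topic_prob = sum(row[t] for t in topics)
        scores.append((topic_prob * word_counts[doc_id], doc_id))
    scores.sort(reverse=True)
    # Documents that never use the word (score 0) are left out.
    return [doc_id for score, doc_id in scores[:top_n] if score > 0]


doc_topic = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
print(rank_documents(doc_topic, word_counts=[4, 1, 0], topics=[0]))  # [0, 1]
```

Document 0 outranks document 1 even though it is less strongly about topic 0, because it uses the word four times.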
docs-like-words <word> [<word>...]
: Finds the documents that are most similar
to the list of words passed, using a cosine-similarity measure (notably, not
using any generated topic model). This has fairly large memory requirements.
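For readers unfamiliar with the measure, here is a minimal sketch of cosine-similarity ranking. It assumes plain bag-of-words count vectors for both the query words and the documents; how docs-like-words actually vectorizes them is not described here, so treat the representation and the helper names as assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def docs_like_words(docs, words, top_n=5):
    """Return the indices of the documents most similar to the query words."""
    query = Counter(words)
    scored = [(cosine(query, Counter(doc)), i) for i, doc in enumerate(docs)]
    scored.sort(reverse=True)
    return [i for score, i in scored[:top_n]]


docs = [["gene", "cell"], ["rock", "soil"], ["gene", "gene", "drift"]]
print(docs_like_words(docs, ["gene", "drift"]))  # [2, 0, 1]
```

Comparing the query against every document in this way is one plausible reason for the memory requirements noted above.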
topics-for-words <word> [<word>...]
: Prints out the probabilities of each word's
occurrence in each topic.
topics-over-time
: Loads the dates for all documents and prints out the
prevalence of each topic in the corpus over time. The years are clustered into
buckets of a given size, usually five years.
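The bucketing can be sketched as follows. This assumes that "prevalence" means the mean topic probability over the documents in a bucket, and that buckets start at multiples of the bucket size (2000-2004, 2005-2009, and so on); both are assumptions, and `topics_over_time` here is a hypothetical function rather than the script's code.

```python
from collections import defaultdict


def topics_over_time(years, doc_topic, bucket_size=5):
    """Average each topic's probability over the documents in each year bucket.

    Returns {bucket_start_year: [mean probability per topic]}, sorted by year.
    """
    sums = {}
    counts = defaultdict(int)
    for year, row in zip(years, doc_topic):
        start = year - year % bucket_size  # e.g. 2007 -> 2005
        acc = sums.setdefault(start, [0.0] * len(row))
        for topic_id, prob in enumerate(row):
            acc[topic_id] += prob
        counts[start] += 1
    return {s: [v / counts[s] for v in sums[s]] for s in sorted(sums)}


years = [2001, 2004, 2007]
doc_topic = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(topics_over_time(years, doc_topic))
# {2000: [0.5, 0.5], 2005: [0.5, 0.5]}
```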
words-for-topics
: Prints out the top N words for each topic in the model.
Analyze Models
authors-topics
: Compute the summed topic probability for each author
present in the corpus.
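A minimal sketch of that summation, assuming each document carries a list of author names and that a co-authored document contributes its full topic probabilities to every one of its authors (the weighting is an assumption; `author_topic_sums` is a hypothetical name):

```python
def author_topic_sums(doc_authors, doc_topic):
    """Sum topic probabilities per author across all of their documents."""
    totals = {}
    for authors, row in zip(doc_authors, doc_topic):
        for author in authors:
            acc = totals.setdefault(author, [0.0] * len(row))
            for topic_id, prob in enumerate(row):
                acc[topic_id] += prob
    return totals


doc_authors = [["A", "B"], ["A"]]
doc_topic = [[0.5, 0.5], [0.25, 0.75]]
print(author_topic_sums(doc_authors, doc_topic))
# {'A': [0.75, 1.25], 'B': [0.5, 0.5]}
```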
Dynamic Topic Models
The dynamic folder contains versions of (at least some of) these scripts that
work on dynamic topic models. They require that the Blei et al. binary for
dynamic topic modeling be installed; see dynamic/dtmmodel.py for more
information.
In general, those scripts have the same names and functions as above; the following are novel:
most-different-words <model.gensim> <topic> <year_1> <year_2>
: Print out the
words that changed the most (both increase and decrease) within the given
topic, between the two years.
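The comparison behind most-different-words can be sketched as below. It assumes that the topic's word distribution for each year is available as a word-to-probability mapping and that "changed the most" means the largest raw difference in probability; the script may use another measure, and the function shown is a hypothetical illustration rather than its implementation.

```python
def most_different_words(probs_year_1, probs_year_2, top_n=10):
    """Return (increases, decreases) as lists of (word, change) pairs.

    Change is the probability in year 2 minus the probability in year 1;
    a word missing from one year is treated as having probability 0.0.
    """
    vocab = set(probs_year_1) | set(probs_year_2)
    deltas = {
        w: probs_year_2.get(w, 0.0) - probs_year_1.get(w, 0.0) for w in vocab
    }
    ranked = sorted(deltas.items(), key=lambda kv: kv[1])
    decreases = [kv for kv in ranked[:top_n] if kv[1] < 0]
    increases = [kv for kv in reversed(ranked[-top_n:]) if kv[1] > 0]
    return increases, decreases


before = {"gene": 0.5, "cell": 0.25, "drift": 0.25}
after = {"gene": 0.25, "cell": 0.25, "drift": 0.5}
increases, decreases = most_different_words(before, after)
print(increases)  # [('drift', 0.25)]
print(decreases)  # [('gene', -0.25)]
```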
Development
There's a requirements.txt file here that should serve for installing all of
the required Python packages.
License
Copyright (c) 2022 Charles H. Pence.
Licensed under the GNU GPL v3. See the COPYING file for more details.