doc2vec: Create and analyze document vector models
This code will read a document corpus in our document corpus format and then train, visualize, and analyze document-vector models on that corpus.
The easiest way to run this script is to check out this repository and then run hatch run doc2vec (arguments). You can also install it with pip if you prefer (directly from this repository; we do not publish modules to PyPI).
Available commands are:
doc2vec build --corpus-base BASE --output OUTPUT.gensim [--vector-size 100] [--window 5] [--min-occurrence 2] [--[no-]debug]
Build a vector model from the given corpus. You must specify the path to the already constructed corpus (built with doc-corpus) and the output file in which we should save the trained model. Options:
--vector-size: The dimensionality of the trained vectors. 100 is the default from the original published doc2vec article, and there is very little advice online about tuning this option.
--window: The size of the sliding window used to infer the document vectors. This will control the extent to which the algorithm is sensitive to the order of terms in the corpus and the immediate neighborhood in which terms appear. Common practice is to optimize this value over a range like 2, 5, 10, and 20.
--min-occurrence: It is occasionally useful to set a more aggressive minimum occurrence threshold for vector models than the one that was used for the corpus (recall that doc-corpus create also carries a --min-occurrence option). Lower values of this option often add noise; consider tuning this option across values of 2, 5, 10, and 20.
For the first real vector analyses the lab performed using these scripts (for our project on disagreement in biodiversity), we used a window of 20 and a minimum occurrence of 10, with the default vector size.
If set, the --debug option will add a substantial amount of debug printing about the training of the model, which may be useful if you’re seeing results that you don’t expect.
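The tuning ranges suggested above for --window and --min-occurrence lend themselves to a small parameter sweep. The following is a minimal sketch, not part of this repository: it only generates command lines using the flags documented above. The function name build_commands, the output directory, and the model file naming scheme are hypothetical choices made for illustration.

```python
import itertools
import shlex

# Candidate values taken from the option descriptions above.
WINDOWS = [2, 5, 10, 20]
MIN_OCCURRENCES = [2, 5, 10, 20]


def build_commands(corpus_base, output_dir, vector_size=100):
    """Return one `doc2vec build` command line per (window, min-occurrence) pair."""
    commands = []
    for window, min_occ in itertools.product(WINDOWS, MIN_OCCURRENCES):
        # Hypothetical naming scheme so that each trained model gets its own file.
        output = f"{output_dir}/model-w{window}-m{min_occ}.gensim"
        args = [
            "hatch", "run", "doc2vec", "build",
            "--corpus-base", corpus_base,
            "--output", output,
            "--vector-size", str(vector_size),
            "--window", str(window),
            "--min-occurrence", str(min_occ),
        ]
        # shlex.join quotes arguments where needed for safe pasting into a shell.
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for command in build_commands("corpus/base", "models"):
        print(command)
```

Printing the commands rather than running them keeps the sketch free of side effects; you can review the list and run only the combinations you care about.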
doc2vec similarity --corpus-base BASE --input INPUT.gensim [--num-docs N] [--max-distance D] [--distribution] [--inverse] [--field FIELD] WORDS...
Build a vector from the given list of words, and compare all documents in the corpus to see how similar they are to this list of words. You must specify the path to the already constructed corpus and trained model file.
You can specify one of three operating modes:
--num-docs N: Return the N documents that are closest to the vector formed by the given words.
--max-distance D: Return all documents that are less than the provided distance D from the vector formed by the given words.
--distribution: Print the entire distribution of distances. This mode will print a CSV file containing document IDs and their distance from the vector formed by the given words. (Tip: Inspect this distribution to see what values you might consider using for --max-distance.)
These three options are mutually incompatible; setting more than one of them will cause an error.
--inverse: If using the --num-docs or --max-distance modes, invert the sort order. This will make --num-docs return the N least similar documents, and make --max-distance function as a minimum distance, returning documents farther away than D.
--field FIELD: If using the --num-docs or --max-distance modes, results will be printed in the form of distances along with some information about each document. The information defaults to a citation, but can be customized here.
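To make the selection modes concrete, here is a minimal sketch of how --num-docs, --max-distance, and --inverse could interact. It is an illustration and not the script's actual implementation: it assumes cosine distance (the distance measure is not stated above), takes the query and document vectors as plain lists, and uses the hypothetical function names cosine_distance and select_documents.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select_documents(query, doc_vectors, num_docs=None, max_distance=None, inverse=False):
    """Select (doc_id, distance) pairs, mimicking the modes described above."""
    # The modes are mutually exclusive, so exactly one must be given.
    if (num_docs is None) == (max_distance is None):
        raise ValueError("specify exactly one of num_docs or max_distance")

    distances = [
        (doc_id, cosine_distance(query, vec)) for doc_id, vec in doc_vectors.items()
    ]
    # Closest first by default; farthest first when inverted.
    distances.sort(key=lambda pair: pair[1], reverse=inverse)

    if num_docs is not None:
        return distances[:num_docs]
    if inverse:
        # Inverted threshold: keep documents farther away than max_distance.
        return [pair for pair in distances if pair[1] > max_distance]
    return [pair for pair in distances if pair[1] < max_distance]


if __name__ == "__main__":
    docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    for doc_id, distance in select_documents([1.0, 0.0], docs, num_docs=2):
        print(doc_id, round(distance, 3))
```

With the toy vectors in the usage example, document a coincides with the query, c lies at an angle of 45 degrees, and b is orthogonal, so the two closest documents are a and c.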
doc2vec visualize --corpus-base BASE --input INPUT.gensim [--output umap] [--random-seed 1337] [--document-list to_highlight.txt] [--highlight-color #cc0000] [--clusters/--no-clusters] [--field citation]
Visualize the trained vector model using a UMAP dimensionality reduction algorithm. Options:
--output umap: The script will create two files, one HTML and one SVG, with this base filename.
--random-seed 1337: If you would like multiple runs of this script to distribute and cluster in the same way, specify the same random seed here.
--document-list to_highlight.txt: If you specify a list of document filenames here, then the nodes for those documents will be highlighted in the final visualization.
--highlight-color #cc0000: The color in which to highlight the documents in the list above.
--clusters/--no-clusters: By default, all documents are colored using a basic clustering algorithm, simply to make the resulting visualization slightly easier to read. If you set --no-clusters, however, all documents will be a uniform gray (allowing, for example, the documents tagged with --highlight-color to be more visible).
--field citation: The document information to display in the hover details for each document node.
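The interplay between --document-list, --highlight-color, and --clusters/--no-clusters can be sketched as a simple color assignment. This is an illustrative sketch under stated assumptions, not the script's own code: it assumes the document list holds one filename per line, that highlighting takes precedence over cluster colors, and that the uniform gray is #999999. The function names load_document_list and node_colors are hypothetical.

```python
def load_document_list(lines):
    """Parse a highlight list, assuming one document filename per line."""
    return {line.strip() for line in lines if line.strip()}


def node_colors(filenames, highlighted, cluster_colors=None,
                highlight_color="#cc0000", default_color="#999999"):
    """Map each filename to a display color.

    cluster_colors is a filename-to-color mapping when clustering is on,
    or None to model --no-clusters (every non-highlighted node is gray).
    """
    colors = {}
    for name in filenames:
        if name in highlighted:
            # Assumption: highlighted documents override any cluster color.
            colors[name] = highlight_color
        elif cluster_colors is not None:
            colors[name] = cluster_colors.get(name, default_color)
        else:
            colors[name] = default_color
    return colors


if __name__ == "__main__":
    highlighted = load_document_list(["a.txt\n", "\n", " b.txt \n"])
    print(node_colors(["a.txt", "c.txt"], highlighted))
```

Passing cluster_colors=None shows why --no-clusters makes highlighted documents stand out: every other node falls back to the same gray.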
All code in this repository is released under the GNU GPL v3.