|Charles Pence 1dc1ff274e|
Document Corpus Construction
This code will read a set of document files in the lab’s JSON schema format and produce a tokenized Gensim corpus object.
The easiest way to run this script is to check out this repository and then run
hatch run doc-corpus (arguments). You can also install it using pip, if you
would like (directly from this repository; we do not push modules to PyPI).
A document corpus is a collection of files whose names start with a common base,
which is suffixed with an underscore and various file extensions. (So, for
example, if you pass
~/corpus as the base, you will see files like
Available commands are:
doc-corpus create [--min-occurrence N] [--output BASE] FILES...
Build a corpus (a collection of files whose path will start with
BASE) from a collection of JSON files specified by FILES. The paths in FILES can also be (optionally recursive) glob patterns, which will be expanded by the script.
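As a rough illustration of how optionally recursive glob patterns can be expanded, here is a minimal sketch using Python's standard glob module. The helper name expand_patterns is hypothetical and is not taken from this repository; the script's actual expansion logic may differ.

```python
import glob
from pathlib import Path


def expand_patterns(patterns):
    """Expand each pattern into a sorted, de-duplicated list of file paths.

    Hypothetical helper for illustration only; not part of doc_corpus.
    """
    found = set()
    for pattern in patterns:
        # recursive=True lets "**" match any number of nested directories.
        matches = glob.glob(pattern, recursive=True)
        # A plain path without wildcards matches itself if it exists.
        found.update(Path(m) for m in matches if Path(m).is_file())
    return sorted(found)
```

With this sketch, a pattern such as data/**/*.json would pick up JSON files directly in data as well as in any of its subdirectories.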
By default, we filter out terms that appear only once in the corpus; you can filter more aggressively by setting the --min-occurrence option to a value higher than 2.
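The effect of this kind of threshold can be sketched in plain Python. This is only an illustration under stated assumptions: the function name filter_rare_terms is hypothetical, and it counts total occurrences across the whole corpus, which may not be exactly how the script counts them.

```python
from collections import Counter


def filter_rare_terms(documents, min_occurrence=2):
    """Drop every token that occurs fewer than min_occurrence times overall.

    Hypothetical helper; `documents` is a list of token lists.
    """
    # Count each token across all documents in the corpus.
    counts = Counter(token for doc in documents for token in doc)
    # Keep only the tokens that reach the threshold, preserving order.
    return [[t for t in doc if counts[t] >= min_occurrence] for doc in documents]


docs = [["gene", "drift", "gene"], ["drift", "selection"]]
filtered = filter_rare_terms(docs)  # "selection" occurs once and is dropped
```

Raising min_occurrence removes progressively more of the vocabulary, which shrinks the dictionary at the cost of discarding rarer terms.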
doc_corpus info --corpus-base BASE
Print out basic summary statistics for the corpus at BASE.
doc_corpus search --corpus-base BASE SEARCH...
For each search pattern that you specify (in the SEARCH arguments), look through the corpus for all words that contain that pattern, and print them out, along with the number of documents in the corpus in which they appear.
For example, running a search for “progress” would return not only the number of documents in which “progress” appears, but also “progressive”, “progression”, and so forth.
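The matching behaviour described here can be sketched as a substring test over a mapping from words to document counts. Both the function name search_terms and the sample counts below are hypothetical, and the sketch assumes plain, case-sensitive substring matching, which the script may or may not use.

```python
def search_terms(doc_freqs, pattern):
    """Return (word, document count) pairs for words containing pattern.

    Hypothetical helper; `doc_freqs` maps each word to the number of
    documents in which it appears.
    """
    return sorted((term, n) for term, n in doc_freqs.items() if pattern in term)


# Made-up counts, purely for illustration.
doc_freqs = {"progress": 12, "progressive": 5, "progression": 3, "regress": 2}
hits = search_terms(doc_freqs, "progress")
```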
For other scripts that need to consume these document corpora, there’s a minimal
API that you can use to figure out what to load. There are two decorators,
doc_corpus.options_required, to add the
--corpus-base option to a click-based CLI application, and the *_path
methods in the module will return the path to the various parts of the corpus
that you might want to load.
Each of those
*_path methods also includes a
check argument which, if set to
True, will verify that the file actually exists and raise a
FileNotFoundError if it does not.
    import click
    from gensim.corpora import Dictionary

    from doc_corpus import options_required, dict_path


    @click.command()
    @options_required
    def do_stuff(corpus_base):
        # Resolve the dictionary file and fail early if it is missing
        path = dict_path(corpus_base, check=True)
        dictionary = Dictionary.load(str(path))
        # do things with the dictionary
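To make the check behaviour more concrete, here is a minimal sketch of how a path helper of this kind could work. It is not the module's actual implementation: the name part_path and its suffix parameter are hypothetical, and the only assumptions taken from this README are that corpus files are named with the base, an underscore, and a suffix, and that check=True raises FileNotFoundError for a missing file.

```python
from pathlib import Path


def part_path(corpus_base, suffix, check=False):
    """Build the path "<corpus_base>_<suffix>" and optionally verify it.

    Hypothetical helper for illustration; not part of doc_corpus.
    """
    path = Path(f"{corpus_base}_{suffix}").expanduser()
    if check and not path.exists():
        # Fail early with a clear error instead of a later load failure.
        raise FileNotFoundError(f"corpus file not found: {path}")
    return path
```

Checking at path-resolution time means a consuming script can report a missing corpus before it starts any expensive work.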
All code in this repository is released under the GNU GPL v3.