Document Corpus Construction
This code will read a set of document files in the lab’s JSON schema format and produce a tokenized Gensim corpus object.
Documentation
The easiest way to run this script is to check out this repository and then run hatch run doc-corpus (arguments). You can also install it with pip directly from this repository, if you would like; we do not publish modules to PyPI.
A document corpus is a collection of files whose names all start with a common base, followed by an underscore and one of several suffixes and file extensions. For example, if you pass ~/corpus as the base, you will see files like ~/corpus_filenames.pickle and ~/corpus_dates.pickle.
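As a rough illustration of the naming scheme described above, the sketch below builds corpus file paths from a base. The helper name corpus_file is hypothetical and is not part of this repository; only the two suffixes shown come from the example above.

```python
from pathlib import Path


def corpus_file(base: str, suffix: str) -> Path:
    """Hypothetical helper: join a corpus base and a suffix with an underscore."""
    expanded = Path(base).expanduser()
    # The suffix is appended to the final path component, not added as a child.
    return expanded.with_name(f"{expanded.name}_{suffix}")


print(corpus_file("/data/corpus", "filenames.pickle"))
print(corpus_file("/data/corpus", "dates.pickle"))
```

The point of the sketch is that the base is a path prefix rather than a directory: every file in the corpus sits next to the others and differs only in what follows the underscore.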
Available commands are:
- doc-corpus create [--min-occurrence N] [--output BASE] FILES...
  Build a corpus (a collection of files whose paths start with BASE) from the JSON files specified by FILES. The paths in FILES can also be glob patterns (optionally recursive), which the script will expand.
  By default, we filter out terms that appear only once in the corpus; you can filter more aggressively by setting the --min-occurrence option to a value higher than 2.
- doc-corpus info --corpus-base BASE
  Print basic summary statistics for the corpus at BASE.
- doc-corpus search --corpus-base BASE SEARCH...
  For each search pattern that you specify (in the SEARCH arguments), look through the corpus for all words that contain that pattern, and print them out along with the number of documents in the corpus in which they appear.
  For example, a search for “progress” would report the number of documents containing “progress”, and would also report “progressive”, “progression”, and so forth.
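The filtering and search behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the script's actual implementation: the function names are hypothetical, the default threshold of 2 is inferred from the description, and the sketch counts total occurrences across the corpus (the real script may count documents instead).

```python
from collections import Counter


def filter_rare_terms(docs, min_occurrence=2):
    """Drop terms whose total count across all documents is below min_occurrence."""
    totals = Counter(token for doc in docs for token in doc)
    return [[t for t in doc if totals[t] >= min_occurrence] for doc in docs]


def search_terms(docs, pattern):
    """Map each term containing `pattern` to the number of documents it appears in."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term at most once per document
    return {term: n for term, n in doc_freq.items() if pattern in term}


# Toy tokenized documents (made up for illustration).
docs = [
    ["progress", "is", "slow"],
    ["progressive", "progress", "report"],
    ["slow", "progression", "progression"],
]
filtered = filter_rare_terms(docs)
hits = search_terms(docs, "progress")
print(filtered)
print(hits)
```

Note that the search is a substring match, so the second document contributes to both "progress" and "progressive", while "progression" is counted once for the third document even though it occurs there twice.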
API
For other scripts that need to consume these document corpora, there’s a minimal API that you can use to figure out what to load. There are two decorators, doc_corpus.options and doc_corpus.options_required, which add the --corpus-base option to a click-based CLI application, and the *_path functions in the module return the paths to the various parts of the corpus that you might want to load.
Each of those *_path functions also accepts a check argument which, if set to True, will verify that the file actually exists and raise a FileNotFoundError if it does not.
For example:
    import click
    from gensim.corpora import Dictionary

    from doc_corpus import options_required, dict_path

    @click.command()
    @options_required  # adds the --corpus-base option
    def do_stuff(corpus_base):
        # check=True raises FileNotFoundError if the dictionary file is missing
        path = dict_path(corpus_base, check=True)
        dictionary = Dictionary.load(str(path))
        # do things with the dictionary
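To make the check argument concrete, here is a minimal, self-contained sketch of how such a path function might behave. The function example_path and its suffix parameter are hypothetical stand-ins and are not the module's real code; the sketch only assumes the base-plus-underscore naming described earlier.

```python
from pathlib import Path
import tempfile


def example_path(corpus_base, suffix, check=False):
    """Hypothetical stand-in for a *_path function; not the module's real code."""
    path = Path(f"{corpus_base}_{suffix}")
    if check and not path.is_file():
        raise FileNotFoundError(f"corpus file not found: {path}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "corpus"
    # Without check, the path is returned even though nothing exists yet.
    unchecked = example_path(base, "dates.pickle")
    try:
        example_path(base, "dates.pickle", check=True)
        raised = False
    except FileNotFoundError:
        raised = True
    # Once the file exists, the checked call succeeds.
    unchecked.touch()
    checked = example_path(base, "dates.pickle", check=True)
```

Passing check=True is useful in scripts that load a corpus right away, since a missing file is reported with a clear error before any loading is attempted.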
License
All code in this repository is released under the GNU GPL v3.