Document Corpus Construction

This code will read a set of document files in the labs JSON schema format and produce a tokenized Gensim corpus object.

Table of Contents

  • Documentation
  • API
  • License

Documentation

The easiest way to run this script is to check out this repository and then run hatch run doc-corpus (arguments). You can also install it with pip directly from this repository, if you would like (we do not publish packages to PyPI).

A document corpus is a collection of files whose name starts with a base, which will be suffixed with an underscore and some various file extensions. (So, for example, if you pass ~/corpus as the base, you will see files like ~/corpus_filenames.pickle and ~/corpus_dates.pickle.)

Available commands are:

  • doc-corpus create [--min-occurrence N] [--output BASE] FILES...

    Build a corpus (a collection of files whose path will start with BASE) from a collection of JSON files specified by FILES. The paths in FILES can also be (optionally recursive) glob patterns, which will be expanded by the script.

    By default, we filter out terms that appear only once in the corpus; you may filter even more aggressively by setting the --min-occurrence option to a value higher than 2.

  • doc-corpus info --corpus-base BASE

    Print out basic summary statistics for the corpus at BASE.

  • doc-corpus search --corpus-base BASE SEARCH...

    For each search pattern that you specify (in the SEARCH arguments), look through the corpus for all words that contain that pattern, and print them out, along with the number of documents in the corpus in which they appear.

    For example, running a search for “progress” would return not only the number of documents in which “progress” appears, but also “progressive”, “progression”, and so forth.
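The --min-occurrence filtering used by create can be sketched in plain Python. This is a minimal illustration of the idea, not the repository's actual implementation; the function name and shape here are assumptions:

```python
from collections import Counter

def filter_min_occurrence(tokenized_docs, min_occurrence=2):
    # Hypothetical sketch: count corpus-wide occurrences of every term...
    counts = Counter(term for doc in tokenized_docs for term in doc)
    # ...then drop any term that appears fewer than min_occurrence times.
    return [[t for t in doc if counts[t] >= min_occurrence]
            for doc in tokenized_docs]

docs = [["progress", "rapid"], ["progress", "slow"]]
filter_min_occurrence(docs)  # [["progress"], ["progress"]]
```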

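Likewise, the substring matching behind search can be approximated as follows. The doc_freqs mapping (term to document frequency) is a stand-in for whatever the corpus actually stores; this is an illustrative sketch, not the script's code:

```python
def search_terms(pattern, doc_freqs):
    # Return every vocabulary term containing the pattern, along with
    # the number of documents in which that term appears.
    return {term: n for term, n in doc_freqs.items() if pattern in term}

freqs = {"progress": 12, "progressive": 4, "progression": 2, "decline": 7}
search_terms("progress", freqs)
# {"progress": 12, "progressive": 4, "progression": 2}
```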
API

For other scripts that need to consume these document corpora, there's a minimal API that you can use to figure out what to load. Two decorators, doc_corpus.options and doc_corpus.options_required, add the --corpus-base option to a click-based CLI application, and the *_path methods in the module return the paths to the various parts of the corpus that you might want to load.

Each of those *_path methods also includes a check argument which, if set to True, will verify that the file actually exists and raise a FileNotFoundError if it does not.
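A rough equivalent of what one of these helpers does, assuming a hypothetical _dictionary.pickle suffix (consult the module for the real file names; this is not the library's code):

```python
from pathlib import Path

def dict_path(corpus_base, check=False):
    # Hypothetical reimplementation; the real dict_path lives in doc_corpus.
    path = Path(f"{corpus_base}_dictionary.pickle")
    if check and not path.exists():
        raise FileNotFoundError(path)
    return path
```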

For example:

import click
from doc_corpus import options_required, dict_path
from gensim.corpora import Dictionary

@click.command()
@options_required
def do_stuff(corpus_base):
    path = dict_path(corpus_base, check=True)
    dictionary = Dictionary.load(str(path))
    # do things with the dictionary

License

All code in this repository is released under the GNU GPL v3.