A multi-purpose utility for managing our internal corpora
 
 
Charles Pence a3312f976b
Move from sciveyor-tool to corpus-tool, strip out superfluous stuff.
2023-06-12 10:07:40 +02:00

Corpus Tool

This is a swiss-army-knife utility that I use to administer our local corpus, including collections of JSON files in our internal format and MongoDB servers containing that same data.

FIXME: We need to document, somewhere or other, the format of our internal MongoDB servers.

Building

  1. Check out the code, including the submodules: git clone --recurse-submodules ...
  2. Build: go build
  3. Run: ./corpus-tool ...

Requirements

A MongoDB server, with a collection of documents that follow the schema spelled out in util/schema.go. FIXME: At some point in the future, this server will become more complex, with support for other collections carrying information about disambiguated authors, journals, and institutions. That support is not currently available in this tool.

Usage

You can get a list of all the available major commands by running corpus-tool --help, and you can get more help on any command by running corpus-tool <command> --help.

You can also set persistent configuration flags for corpus-tool by creating a configuration file. The default path for the configuration file is ~/.corpus-tool.yaml, and you can set a custom path for the configuration file by passing --config <path>.

json import: Import JSON Files to MongoDB

This tool should be used whenever you want to import JSON documents into the MongoDB server.

To use it, call corpus-tool as follows:

./corpus-tool json import \
  --batch-size NUM \
  --mongo-address mongodb://localhost \
  --mongo-database YourDatabase \
  --mongo-collection documents \
  --mongo-timeout SECS \
  <files> ...

The mongo-address is a URL, which can specify a username, password, and port (mongodb://user:pass@address:port). The mongo-database flag names the database to use, as in any other MongoDB connection. In almost all cases, mongo-collection should be set to documents. The mongo-timeout flag sets how long, in seconds, to wait for a MongoDB operation before timing out. It defaults to 30, which suits small import jobs. For a very large import (e.g., tens of thousands of documents), you will want to set this to a much higher number.
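To make the parts of the address concrete, here is a minimal sketch that splits a single-host mongodb:// URL with Go's standard net/url package. The MongoEndpoint type and parseMongoAddress function are hypothetical names used only for illustration. This is not how corpus-tool itself handles the address, and it does not cover multi-host or mongodb+srv connection strings.

```go
package main

import (
	"fmt"
	"net/url"
)

// MongoEndpoint holds the pieces of a single-host mongodb:// address.
// (Hypothetical type, for illustration only.)
type MongoEndpoint struct {
	User     string
	Password string
	Host     string
	Port     string
}

// parseMongoAddress splits an address of the form
// mongodb://user:pass@address:port into its components.
func parseMongoAddress(addr string) (MongoEndpoint, error) {
	u, err := url.Parse(addr)
	if err != nil {
		return MongoEndpoint{}, err
	}
	if u.Scheme != "mongodb" {
		return MongoEndpoint{}, fmt.Errorf("unexpected scheme %q", u.Scheme)
	}
	ep := MongoEndpoint{Host: u.Hostname(), Port: u.Port()}
	if u.User != nil {
		ep.User = u.User.Username()
		ep.Password, _ = u.User.Password()
	}
	return ep, nil
}

func main() {
	ep, err := parseMongoAddress("mongodb://user:pass@localhost:27017")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", ep)
}
```

The username, password, and port are all optional. For mongodb://localhost, only Host is populated.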

The files argument may either refer to specific files or to glob patterns.

The --batch-size flag can be set to any number of documents (it defaults to 100). The optimal size will depend on your connection to your MongoDB server, your document sizes, and your network configuration, but 100 works for most purposes.
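As a rough illustration of what batching means here, the sketch below splits a slice of documents into consecutive groups of at most batch-size elements. The batches function is a hypothetical helper, not the tool's actual import code; it only shows how a batch size of 100 divides up a larger input.

```go
package main

import "fmt"

// batches splits docs into consecutive slices of at most size elements.
// (Hypothetical helper; the real import code may be organized differently.)
func batches[T any](docs []T, size int) [][]T {
	if size <= 0 {
		size = 100 // fall back to the documented default
	}
	var out [][]T
	for start := 0; start < len(docs); start += size {
		end := start + size
		if end > len(docs) {
			end = len(docs)
		}
		out = append(out, docs[start:end])
	}
	return out
}

func main() {
	docs := make([]int, 250) // stand-ins for 250 parsed documents
	for i, b := range batches(docs, 100) {
		fmt.Printf("batch %d: %d documents\n", i, len(b))
	}
}
```

With 250 documents and a batch size of 100, this yields two full batches and a final batch of 50.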

Note that no schema validation at all will be done on these documents, though they will be passed through several kinds of essential transformations (for example, converting the dates from JSON string format to MongoDB date format).

json validate: Validate JSON Files

The tool can be used to check whether or not a collection of JSON files on disk conforms to the JSON schema. To use it, call corpus-tool as follows:

./corpus-tool json validate [--loose] [--unique] /path/to/*.json

The file arguments may either refer to specific files or to glob patterns.

For information about the --loose flag, see the mongo validate command. If --unique is passed, then the validation will parse each file, load its ID value, and check to see if there are any duplicate ID values among the JSON files that are passed. This will slow down validation, so it is disabled by default.
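A uniqueness check of this kind might look like the sketch below. It assumes the ID lives in a top-level field named id, which is a guess; the real field name comes from the schema. The findDuplicateIDs function is hypothetical and takes file contents in memory so that the example stays self-contained. The extra parse and the map of every ID seen so far are what make this mode slower than plain validation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// findDuplicateIDs returns, for each ID that occurs more than once,
// the sorted names of the files that carry it.
func findDuplicateIDs(files map[string][]byte) (map[string][]string, error) {
	seen := make(map[string][]string)
	for name, data := range files {
		var doc struct {
			ID string `json:"id"` // assumed field name
		}
		if err := json.Unmarshal(data, &doc); err != nil {
			return nil, fmt.Errorf("%s: %w", name, err)
		}
		seen[doc.ID] = append(seen[doc.ID], name)
	}
	dups := make(map[string][]string)
	for id, names := range seen {
		if len(names) > 1 {
			sort.Strings(names)
			dups[id] = names
		}
	}
	return dups, nil
}

func main() {
	files := map[string][]byte{
		"a.json": []byte(`{"id": "doc-1"}`),
		"b.json": []byte(`{"id": "doc-2"}`),
		"c.json": []byte(`{"id": "doc-1"}`),
	}
	dups, err := findDuplicateIDs(files)
	if err != nil {
		panic(err)
	}
	for id, names := range dups {
		fmt.Printf("duplicate ID %q in %v\n", id, names)
	}
}
```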

mongo validate: Validate MongoDB Documents

The tool can be used to check whether or not the contents of a given MongoDB server conform to the JSON schema. To use it, call corpus-tool as follows:

./corpus-tool mongo validate [--loose] \
  --mongo-address mongodb://localhost \
  --mongo-database YourDatabase \
  --mongo-collection documents \
  --mongo-timeout SECS

For information about the MongoDB connection flags, see the json import command.

By default, the tool operates in "strict mode": it checks that the attributes of each document are valid, and it also prints an error for any "extra" field, that is, any field in a document which does not appear in the JSON schema. To ignore these errors, deactivate strict mode by passing the --loose flag. corpus-tool will then silently ignore extra fields and print errors only when known fields contain invalid data.
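The difference between the two modes can be sketched as follows. This covers only the extra-field part of validation; checking the values of known fields is left out. The extraFields and checkExtra functions and the set of known field names are all hypothetical, and the tool's real validation works against the JSON schema rather than a hand-written map.

```go
package main

import (
	"fmt"
	"sort"
)

// extraFields returns the sorted names of fields in doc that are not
// listed in known.
func extraFields(doc map[string]any, known map[string]bool) []string {
	var extra []string
	for k := range doc {
		if !known[k] {
			extra = append(extra, k)
		}
	}
	sort.Strings(extra)
	return extra
}

// checkExtra reports one error per extra field in strict mode, and
// nothing at all in loose mode.
func checkExtra(doc map[string]any, known map[string]bool, loose bool) []error {
	if loose {
		return nil
	}
	var errs []error
	for _, f := range extraFields(doc, known) {
		errs = append(errs, fmt.Errorf("unexpected field %q", f))
	}
	return errs
}

func main() {
	known := map[string]bool{"id": true, "title": true} // illustrative only
	doc := map[string]any{"id": "doc-1", "title": "Example", "notes": "?"}

	fmt.Println("strict:", checkExtra(doc, known, false))
	fmt.Println("loose: ", checkExtra(doc, known, true))
}
```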

Configuration File

The following settings may be persistently configured by editing the configuration file, located by default at ~/.corpus-tool.yaml:

mongo:
  address: string
  collection: string
  database: string
  timeout: 30
verbose: true

General Options

  • --config <path>: Specify an alternative path to a YAML-format configuration file.
  • --verbose, -v: By default, basic information (and, if on an interactive terminal, progress bars) will be printed to the console. To see more information, pass the --verbose flag.

Glob Patterns

Any file argument can also be given as a glob pattern. We use an extended syntax with support for:

  • *: any sequence of non-separator characters
  • **: any sequence of characters, including separators (recursive glob)
  • ?: any single non-separator character
  • [class]: character classes, of the form [abcd] (character list), [a-z] (character range), or [^a-z] (negated class)
  • {alt1,alt2,...}: a finite list of alternatives
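To try out the single-segment parts of this syntax, the sketch below uses path.Match from Go's standard library, which handles *, ?, and character classes with the same meanings as listed above. The standard library does not support ** or {alt1,alt2,...}, so those two forms are not demonstrated here; the matcher that corpus-tool actually uses is not shown. The matches helper is a hypothetical wrapper.

```go
package main

import (
	"fmt"
	"path"
)

// matches reports whether name matches pattern, treating a malformed
// pattern as a non-match.
func matches(pattern, name string) bool {
	ok, err := path.Match(pattern, name)
	return err == nil && ok
}

func main() {
	cases := []struct{ pattern, name string }{
		{"*.json", "doc.json"},           // * matches within one path segment
		{"*.json", "sub/doc.json"},       // * does not cross a separator
		{"doc-?.json", "doc-1.json"},     // ? matches exactly one character
		{"doc-[0-9].json", "doc-7.json"}, // character range
		{"doc-[^0-9].json", "doc-7.json"}, // negated class rejects digits
	}
	for _, c := range cases {
		fmt.Printf("%-18s %-14s %v\n", c.pattern, c.name, matches(c.pattern, c.name))
	}
}
```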

Changelog

  • v0.9: Rename to corpus-tool, remove Solr support (after the end of the Sciveyor project); we're now using this tool only internally.
  • v0.8: (abandoned)
  • v0.7: Rewrite mongo-tool as sciveyor-tool, using Cobra and Viper instead of Kong.
  • v0.6: Fix our entirely broken Mongo date handling, and export in a different format to allow for storing them in Solr date objects. Fix a small bug with batched import.
  • v0.5: Add a batch-size flag to import.
  • v0.4: Move glob handling into the app, allowing for a --unique test in validate-files.
  • v0.3: Port command-line handling to Kong, and introduce a robust sub-command interface. Rename from mongo-solr to mongo-tool. Integrate the functionality of schema-tool into mongo-tool.
  • v0.2: Store all the date values in documents as ISODate in MongoDB.
  • v0.1: Initial support for only the fields mentioned in the JSON document schema.

License

The code here is copyright © 2021–2023 Charles H. Pence, and released under the GNU GPL v3.