Corpus Tool
This is a swiss-army-knife utility that I use to administer our local corpus, including collections of JSON files in our internal format and MongoDB servers containing that same data.
FIXME: We need to document, somewhere or other, the format of our internal MongoDB servers.
Building
- Check out the code, including the submodules:
git clone --recurse-submodules ...
- Build:
go build
- Run:
./corpus-tool ...
Requirements
A MongoDB server, with a collection of documents that follow the schema spelled out in util/schema.go. FIXME: At some point in the future, this server will become more complex, with support for other collections carrying information about disambiguated authors, journals, and institutions. That support is not currently available in this tool.
Usage
You can get a list of all the available major commands by running corpus-tool --help, and you can get more help on any command by running corpus-tool <command> --help.
You can also set persistent configuration flags for corpus-tool by creating a configuration file. The default path for the configuration file is ~/.corpus-tool.yaml, and you can set a custom path by passing --config <path>.
json import: Import JSON Files to MongoDB
This tool should be used whenever you want to import JSON documents into the MongoDB server.
To use it, call corpus-tool as follows:
./corpus-tool json import \
--batch-size NUM \
--mongo-address mongodb://localhost \
--mongo-database YourDatabase \
--mongo-collection documents \
--mongo-timeout SECS \
<files> ...
The mongo-address is a URL, which can specify username, password, and port (mongodb://user:pass@address:port). The mongo-database flag should be familiar from any connection to MongoDB. In almost all cases, mongo-collection should be set to documents. The mongo-timeout flag sets how long, in seconds, the tool waits on MongoDB operations before timing out. It defaults to 30, which suits small import jobs. For a very large import (e.g., tens of thousands of documents), you will want to set this to a much higher number.
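To make the parts of a mongo-address concrete, here is a minimal Go sketch that splits one with the standard net/url package. The splitAddress helper, the mongoAddress type, and the example credentials are hypothetical and only for illustration; this is not how corpus-tool itself handles the flag.

```go
package main

import (
	"fmt"
	"net/url"
)

// mongoAddress holds the pieces of a connection URL (hypothetical type).
type mongoAddress struct {
	User, Pass, Host, Port string
}

// splitAddress parses a mongodb:// URL into its components. Missing parts
// (no credentials, no port) are returned as empty strings.
func splitAddress(address string) (mongoAddress, error) {
	u, err := url.Parse(address)
	if err != nil {
		return mongoAddress{}, err
	}
	if u.Scheme != "mongodb" {
		return mongoAddress{}, fmt.Errorf("unexpected scheme %q", u.Scheme)
	}
	out := mongoAddress{Host: u.Hostname(), Port: u.Port()}
	if u.User != nil {
		out.User = u.User.Username()
		out.Pass, _ = u.User.Password()
	}
	return out, nil
}

func main() {
	addr, err := splitAddress("mongodb://user:pass@localhost:27017")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", addr)
}
```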
The files argument may either refer to specific files or to glob patterns.
The --batch-size flag can be set to any number of documents (it defaults to 100). The optimal size will depend on your connection to your MongoDB server, your document sizes, and your network configuration, but 100 works for most purposes.
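For a sense of what batching means here, the following Go sketch splits a list of items into groups of at most the batch size. The splitBatches function is a hypothetical stand-in, not the tool's actual import loop; it only shows how a batch size of 100 divides up a larger job.

```go
package main

import "fmt"

// splitBatches divides items into consecutive groups of at most size
// elements. A non-positive size falls back to 100, mirroring the
// documented default of --batch-size.
func splitBatches(items []string, size int) [][]string {
	if size <= 0 {
		size = 100
	}
	var out [][]string
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	// 250 placeholder file names, split with the default batch size.
	files := make([]string, 250)
	for i := range files {
		files[i] = fmt.Sprintf("doc-%03d.json", i)
	}
	for i, batch := range splitBatches(files, 100) {
		fmt.Printf("batch %d: %d documents\n", i, len(batch))
	}
}
```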
Note that no schema validation at all will be done on these documents, though they will be passed through several kinds of essential transformations (for example, converting the dates from JSON string format to MongoDB date format).
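As an illustration of the date transformation, here is a small Go sketch that turns a date string into a time.Time, which is the type the official MongoDB Go driver stores as a BSON date. The accepted layouts (RFC 3339 and a bare YYYY-MM-DD) and the parseDate helper are assumptions made for this example; the formats that corpus-tool really accepts are defined in its source.

```go
package main

import (
	"fmt"
	"time"
)

// dateLayouts lists the string formats this sketch accepts. They are
// assumptions for illustration, not the formats corpus-tool expects.
var dateLayouts = []string{time.RFC3339, "2006-01-02"}

// parseDate tries each layout in turn and returns the first successful
// parse, normalized to UTC.
func parseDate(s string) (time.Time, error) {
	for _, layout := range dateLayouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t.UTC(), nil
		}
	}
	return time.Time{}, fmt.Errorf("unrecognized date %q", s)
}

func main() {
	for _, s := range []string{"2021-03-04", "2021-03-04T10:30:00+01:00", "yesterday"} {
		t, err := parseDate(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(s, "->", t.Format(time.RFC3339))
	}
}
```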
json validate: Validate JSON Files
The tool can be used to check whether or not a collection of JSON files on disk conforms to the JSON schema. To use it, call corpus-tool as follows:
./corpus-tool json validate [--loose] [--unique] /path/to/*.json
The files argument may either refer to specific files or to glob patterns.
For information about the --loose flag, see the mongo validate command. If --unique is passed, then the validation will parse each file, load its ID value, and check to see if there are any duplicate ID values among the JSON files that are passed. This will slow down validation, so it is disabled by default.
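The idea behind the --unique check can be sketched in Go as follows. This is not the tool's implementation: the findDuplicateIDs helper is hypothetical, it works on in-memory file contents rather than reading from disk, and the JSON field name "id" is an assumption (the real schema lives in util/schema.go).

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// idOnly decodes just the identifier of a document. The field name "id"
// is an assumption for this sketch.
type idOnly struct {
	ID string `json:"id"`
}

// findDuplicateIDs maps each ID that occurs in more than one file to the
// sorted list of file names that carry it.
func findDuplicateIDs(files map[string][]byte) (map[string][]string, error) {
	seen := make(map[string][]string)
	for name, data := range files {
		var doc idOnly
		if err := json.Unmarshal(data, &doc); err != nil {
			return nil, fmt.Errorf("%s: %w", name, err)
		}
		seen[doc.ID] = append(seen[doc.ID], name)
	}
	dups := make(map[string][]string)
	for id, names := range seen {
		if len(names) > 1 {
			sort.Strings(names)
			dups[id] = names
		}
	}
	return dups, nil
}

func main() {
	files := map[string][]byte{
		"a.json": []byte(`{"id": "doc-1"}`),
		"b.json": []byte(`{"id": "doc-2"}`),
		"c.json": []byte(`{"id": "doc-1"}`),
	}
	dups, err := findDuplicateIDs(files)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for id, names := range dups {
		fmt.Printf("duplicate ID %s in %v\n", id, names)
	}
}
```

Every file has to be parsed before duplicates can be reported, which is why a check of this kind slows validation down.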
mongo validate: Validate MongoDB Documents
The tool can be used to check whether or not the contents of a given MongoDB server conform to the JSON schema. To use it, call corpus-tool as follows:
./corpus-tool mongo validate [--loose] \
--mongo-address mongodb://localhost \
--mongo-database YourDatabase \
--mongo-collection documents \
--mongo-timeout SECS
For information about the MongoDB connection flags, see the json import command.
By default, the tool operates in "strict mode": it checks not only that the attributes of each document are valid, but also prints errors for any fields in a document that do not appear in the JSON schema (that is, any "extra" fields). If you want to ignore these errors, you can deactivate strict mode by passing the --loose flag, in which case corpus-tool will silently ignore any extra fields and print errors only when known fields contain invalid data.
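The difference between strict and loose checking can be illustrated with Go's standard encoding/json package, whose decoder can optionally reject unknown fields. This is only an analogy under stated assumptions: the document struct and its two fields are made up for the example, and corpus-tool validates against its own JSON schema rather than a Go struct.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// document is a hypothetical two-field stand-in for the real schema.
type document struct {
	ID    string `json:"id"`
	Title string `json:"title"`
}

// validate decodes data into a document. In strict mode (loose == false)
// any field not declared on the struct is an error; in loose mode extra
// fields are ignored, but wrongly typed known fields still fail.
func validate(data []byte, loose bool) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	if !loose {
		dec.DisallowUnknownFields()
	}
	var doc document
	return dec.Decode(&doc)
}

func main() {
	extra := []byte(`{"id": "doc-1", "title": "A title", "notes": "extra field"}`)
	invalid := []byte(`{"id": 5, "title": "A title"}`)

	fmt.Println("strict, extra field:  ", validate(extra, false))
	fmt.Println("loose, extra field:   ", validate(extra, true))
	fmt.Println("loose, invalid field: ", validate(invalid, true))
}
```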
Configuration File
The following settings may be persistently configured by editing the configuration file, located by default at ~/.corpus-tool.yaml:
mongo:
  address: string
  collection: string
  database: string
  timeout: 30
verbose: true
General Options
- --config <path>: Specify an alternative path to a YAML-format configuration file.
- --verbose, -v: By default, basic information (and, if on an interactive terminal, progress bars) will be printed to the console. To see more information, pass the --verbose flag.
Glob Patterns
Any file argument can also be given as a glob pattern. We use an extended syntax with support for:
- *: any sequence of non-separator characters
- **: any sequence of characters, including separators (recursive glob)
- ?: any single non-separator character
- [class]: character classes, of the form [abcd] (character list), [a-z] (character range), or [^a-z] (negated class)
- {alt1,alt2,...}: a finite list of alternatives
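To show how most of these patterns behave, here is a minimal Go sketch built on the standard path.Match function, which already understands *, ?, and [class]. The expandBraces and matchGlob helpers are hypothetical: they add simple, non-nested {alt1,alt2} expansion on top, and they do not implement the recursive ** pattern. The glob library that corpus-tool actually uses is not named in this README, so its edge-case behavior may differ.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// expandBraces expands the first {a,b,...} group in pattern and recurses
// on the results. Nested braces are not supported in this sketch.
func expandBraces(pattern string) []string {
	open := strings.Index(pattern, "{")
	if open < 0 {
		return []string{pattern}
	}
	end := strings.Index(pattern[open:], "}")
	if end < 0 {
		return []string{pattern} // unbalanced brace: keep it literal
	}
	end += open
	var out []string
	for _, alt := range strings.Split(pattern[open+1:end], ",") {
		out = append(out, expandBraces(pattern[:open]+alt+pattern[end+1:])...)
	}
	return out
}

// matchGlob reports whether name matches pattern, where pattern may use
// *, ?, [class], and {alt1,alt2}. It does not implement **.
func matchGlob(pattern, name string) bool {
	for _, p := range expandBraces(pattern) {
		if ok, err := path.Match(p, name); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(expandBraces("data/{2021,2022}/*.json"))
	fmt.Println(matchGlob("data/{2021,2022}/*.json", "data/2022/a.json"))
	// A single * stops at the separator, so this does not match.
	fmt.Println(matchGlob("data/*.json", "data/2022/a.json"))
}
```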
Changelog
- v0.9: Rename to corpus-tool, remove Solr support (after the end of the Sciveyor project); we’re now using this tool only internally
- v0.8: (abandoned)
- v0.7: Rewrite mongo-tool as sciveyor-tool, using Cobra and Viper instead of Kong.
- v0.6: Fix our entirely broken Mongo date handling, and export in a different format to allow for storing them in Solr date objects. Fix a small bug with batched import.
- v0.5: Add a batch-size flag to import.
- v0.4: Move glob handling into the app, allowing for a --unique test in validate-files.
- v0.3: Port command-line handling to Kong, and introduce a robust sub-command interface. Rename from mongo-solr to mongo-tool. Integrate the functionality of schema-tool into mongo-tool.
- v0.2: Store all the date values in documents as ISODate in MongoDB.
- v0.1: Initial support for only the fields mentioned in the JSON document schema.
License
The code here is copyright © 2021–2023 Charles H. Pence, and released under the GNU GPL v3.