A MongoDB corpus tool for the Sciveyor project
Charles Pence 68989353a6
Do some very serious cleanup of, well, all our Solr handling.
1 month ago
schema@317dde2737 Update schema. 2 months ago
statik Update schema. 2 months ago
transform Do some very serious cleanup of, well, all our Solr handling. 1 month ago
util Ugliest regex I've ever written, but the workaround works. 2 months ago
.gitignore Rename to mongo-tool. 2 months ago
.gitmodules Add schema submodule. 3 months ago
LICENSE Initial commit borrowing some Mongo code from schema-tool. 3 months ago
README.md Fix a number of bugs with date handling and batch processing. 1 month ago
go.mod Move globbing support into the tool itself, taking the shell out of the loop. 1 month ago
go.sum Move globbing support into the tool itself, taking the shell out of the loop. 1 month ago
import.go Fix a number of bugs with date handling and batch processing. 1 month ago
main.go Add an initial import command. It found bugs in the data. Augh. 2 months ago
mongo.go Add a batch-size parameter to the import command. 1 month ago
solr.go Do some very serious cleanup of, well, all our Solr handling. 1 month ago
sync.go Add support for force-sync, for my own local test-server purposes. 2 months ago
sync_mongo.go DRY up the Mongo document retrieval code. 2 months ago
sync_solr.go Do some very serious cleanup of, well, all our Solr handling. 1 month ago
validate_files.go Missing newline in error message. 1 month ago
validate_mongo.go Move globbing support into the tool itself, taking the shell out of the loop. 1 month ago

README.md

Sciveyor MongoDB Tool

This is a utility designed to manipulate the MongoDB and Solr servers used to store and search the documents in Sciveyor.

Building

  1. Check out the code, including the submodules: git clone --recurse-submodules ...
  2. Install statik: go get github.com/rakyll/statik
  3. Generate static schema blob: go generate
  4. Build: go build
  5. Run: ./mongo-tool ...

Requirements

  1. A MongoDB server, with a collection of documents that follow the schema spelled out here. FIXME: At some point in the future, this server will become more complex, with support for other collections carrying information about disambiguated authors, journals, and institutions. That support is not currently available in this tool.
  2. A Solr server, pre-loaded with the schema described here. (FIXME: Not currently available for public consumption. Watch this space; it needs more debugging.)

Usage

A number of sub-commands can then be used to perform various maintenance tasks on the MongoDB and Solr servers. You can get a list of all those tasks by running mongo-tool --help, and you can get more help on any command by running mongo-tool <command> --help.

sync: Synchronize MongoDB to Solr

The tool can be used to synchronize the content of the MongoDB server to the Solr server. This is an extremely simple sync, with three possible actions per document (re-create, create, or delete):

  1. For each document in the MongoDB database:
    1. If it is present in the Solr database, but either its version or its dataSourceVersion parameters have changed, delete and re-create it in the Solr database.
    2. If it is not present in the Solr database, create it.
  2. For each document in the Solr database:
    1. If it is not present in the Mongo database, delete it.

Notably, this is not a proper atomic synchronization. Documents are deleted and re-created, not partially updated (in Solr's terminology, we do not use "atomic updates"). We also do not detect any changes other than in the two version parameters. Version numbers must be bumped to trigger a sync. (This is an intentional policy choice.)
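The decision logic described above can be sketched in Go as follows. This is only an illustration, not the code in sync.go: the Versions and Plan types and the PlanSync function are hypothetical names, and we assume for simplicity that both version fields are integers.

```go
package main

import (
	"fmt"
	"sort"
)

// Versions holds the two fields the sync compares.
// Assumption: both are integers; the real schema may differ.
type Versions struct {
	Version           int
	DataSourceVersion int
}

// Plan lists the document IDs affected by each action.
type Plan struct {
	Create   []string // in Mongo, missing from Solr
	Recreate []string // in both, but a version field differs
	Delete   []string // in Solr, missing from Mongo
}

// PlanSync compares two ID-to-versions maps and returns the actions to take.
func PlanSync(mongo, solr map[string]Versions) Plan {
	var p Plan
	for id, mv := range mongo {
		sv, ok := solr[id]
		switch {
		case !ok:
			p.Create = append(p.Create, id)
		case sv != mv:
			p.Recreate = append(p.Recreate, id)
		}
	}
	for id := range solr {
		if _, ok := mongo[id]; !ok {
			p.Delete = append(p.Delete, id)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(p.Create)
	sort.Strings(p.Recreate)
	sort.Strings(p.Delete)
	return p
}

func main() {
	mongo := map[string]Versions{"a": {1, 1}, "b": {2, 1}, "c": {1, 1}}
	solr := map[string]Versions{"a": {1, 1}, "b": {1, 1}, "d": {1, 1}}
	fmt.Printf("%+v\n", PlanSync(mongo, solr))
}
```

Note that a document whose content changed without a version bump would land in none of the three lists, which matches the policy stated above.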

To use it, then, call mongo-tool as follows:

./mongo-tool sync \
  --mongo-address=mongodb://localhost \
  --mongo-database=YourDatabase \
  --mongo-collection=documents \
  --mongo-timeout=SECS \
  --solr-address=http://localhost:8983/solr \
  --solr-collection=sciveyor

The parameters are simply the various connection options for the two servers. The mongo-address is a URL, which can specify username, password, and port (mongodb://user:pass@address:port). The mongo-database parameter should be familiar from any connection to MongoDB. In almost all Sciveyor cases, the mongo-collection should be set to documents. The mongo-timeout parameter sets how long, in seconds, we will wait on MongoDB before timing out. It defaults to 30, but might need to be much higher in some applications.

The two Solr parameters are the URL to the root of the server (which will almost always end with /solr), and the collection or core name currently in use. (The final Solr URLs, then, will append the collection to the address.)
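As a small illustration of how the two Solr parameters combine, here is a sketch in Go. The CollectionURL helper is a hypothetical name, and we only show the base URL; whatever request paths the tool appends after the collection are not covered here.

```go
package main

import (
	"fmt"
	"strings"
)

// CollectionURL appends the collection (or core) name to the Solr root
// address, tolerating a trailing slash on the address.
func CollectionURL(address, collection string) string {
	return strings.TrimRight(address, "/") + "/" + collection
}

func main() {
	// Prints http://localhost:8983/solr/sciveyor
	fmt.Println(CollectionURL("http://localhost:8983/solr", "sciveyor"))
}
```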

For debugging purposes, it is occasionally helpful to force a sync -- that is, to delete and re-create every document in Solr with the corresponding copy from MongoDB. If this behavior is desired, you can pass --force. We strongly recommend that you do not use this feature.

import: Import JSON Files to MongoDB

This tool should be used whenever you want to import JSON documents (once again, in the JSON schema specified by Sciveyor) into the MongoDB server.

To use it, call mongo-tool as follows:

./mongo-tool import \
  --batch-size=NUM \
  --mongo-address=mongodb://localhost \
  --mongo-database=YourDatabase \
  --mongo-collection=documents \
  --mongo-timeout=SECS \
  <files> ...

For information about the MongoDB connection parameters, see the sync command above.

The files parameter may either refer to specific files or to glob patterns.

The --batch-size parameter can be set to any number of documents (it defaults to 100). The optimal size will depend on your connection to your MongoDB server, your document sizes, and your network configuration, but 100 works for most purposes.
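To make the effect of --batch-size concrete, the following Go sketch splits a list of documents into batches. The Batches function is a hypothetical helper, not the tool's own code, and documents are represented as plain strings for brevity.

```go
package main

import "fmt"

// Batches splits items into consecutive slices of at most size elements.
func Batches(items []string, size int) [][]string {
	if size <= 0 {
		size = 100 // fall back to the documented default
	}
	var out [][]string
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	docs := make([]string, 250)
	for i := range docs {
		docs[i] = fmt.Sprintf("doc-%d", i)
	}
	for i, b := range Batches(docs, 100) {
		fmt.Printf("batch %d: %d documents\n", i, len(b))
	}
}
```

With 250 documents and a batch size of 100, this yields two full batches and a final batch of 50. Each batch would then be sent to the server in a single round trip.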

Note that no schema validation at all will be done on these documents, though they will be passed through several kinds of essential transformations (for example, converting the dates from JSON string format to MongoDB date format).
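The date transformation mentioned above might look roughly like the following Go sketch. It rests on two assumptions that this README does not confirm: that the JSON dates are RFC 3339 strings, and that the field is named date. ConvertDates is a hypothetical helper. The official MongoDB Go driver encodes time.Time values as BSON dates, which is why converting to time.Time is enough.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ConvertDates replaces the named string fields of doc with time.Time values.
// Assumption: dates are RFC 3339 strings; the real format may differ.
func ConvertDates(doc map[string]interface{}, fields ...string) error {
	for _, f := range fields {
		s, ok := doc[f].(string)
		if !ok {
			continue // absent or not a string: leave untouched
		}
		t, err := time.Parse(time.RFC3339, s)
		if err != nil {
			return fmt.Errorf("field %q: %w", f, err)
		}
		doc[f] = t.UTC()
	}
	return nil
}

func main() {
	var doc map[string]interface{}
	raw := []byte(`{"id": "doc-1", "date": "2021-03-04T05:06:07Z"}`)
	if err := json.Unmarshal(raw, &doc); err != nil {
		panic(err)
	}
	if err := ConvertDates(doc, "date"); err != nil {
		panic(err)
	}
	fmt.Printf("%T\n", doc["date"]) // the field is now a time.Time
}
```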

validate: Validate MongoDB Documents

The tool can be used to check whether or not the contents of a given MongoDB server conform to the Sciveyor JSON schema. To use it, call mongo-tool as follows:

./mongo-tool validate [--strict] \
  --mongo-address=mongodb://localhost \
  --mongo-database=YourDatabase \
  --mongo-collection=documents \
  --mongo-timeout=SECS

For information about the MongoDB connection parameters, see the sync command above.

If --strict is set (the default), the validation checks that the attributes of each document are valid and also prints an error for any field which does not appear in the JSON schema (that is, for any "extra" field). Passing --strict=false will silently ignore the presence of any extra fields, only printing errors if there are fields containing invalid data.
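The extra-field check that strict mode adds can be sketched in Go as below. This is a simplification: the real tool validates against the JSON schema, whereas this hypothetical ExtraFields helper only compares top-level keys against a hand-written list of allowed names (id and title here are placeholders, not the actual schema).

```go
package main

import (
	"fmt"
	"sort"
)

// ExtraFields returns the top-level keys of doc that are not in allowed.
func ExtraFields(doc map[string]interface{}, allowed map[string]bool) []string {
	var extra []string
	for k := range doc {
		if !allowed[k] {
			extra = append(extra, k)
		}
	}
	sort.Strings(extra) // stable order for reporting
	return extra
}

func main() {
	allowed := map[string]bool{"id": true, "title": true} // placeholder schema
	doc := map[string]interface{}{"id": "doc-1", "title": "Example", "notes": "x"}

	strict := true // mirrors the --strict default
	if extra := ExtraFields(doc, allowed); strict && len(extra) > 0 {
		fmt.Println("extra fields:", extra)
	}
}
```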

validate-files: Validate JSON Files

The tool can be used to check whether or not a collection of JSON files on disk conforms to the Sciveyor JSON schema. To use it, call mongo-tool as follows:

./mongo-tool validate-files [--strict] [--unique] /path/to/*.json

The files parameter may either refer to specific files or to glob patterns.

For information about the --strict parameter, see the validate command above. If --unique is passed, then the validation will parse each file, load its ID value, and check to see if there are any duplicate ID values among the JSON files that are passed. This will slow down validation, so it is disabled by default.
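A minimal Go sketch of the --unique check follows. It assumes the ID lives in a top-level field named id, which this README does not confirm, and DuplicateIDs is a hypothetical helper. To keep the example self-contained, file contents are passed in as a map from file name to bytes rather than read from disk.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// DuplicateIDs returns every ID found in more than one file, mapped to the
// sorted names of the files that contain it.
func DuplicateIDs(files map[string][]byte) (map[string][]string, error) {
	seen := make(map[string][]string)
	for name, data := range files {
		var doc struct {
			ID string `json:"id"` // assumed field name
		}
		if err := json.Unmarshal(data, &doc); err != nil {
			return nil, fmt.Errorf("%s: %w", name, err)
		}
		seen[doc.ID] = append(seen[doc.ID], name)
	}
	dups := make(map[string][]string)
	for id, names := range seen {
		if len(names) > 1 {
			sort.Strings(names)
			dups[id] = names
		}
	}
	return dups, nil
}

func main() {
	files := map[string][]byte{
		"a.json": []byte(`{"id": "doc-1"}`),
		"b.json": []byte(`{"id": "doc-2"}`),
		"c.json": []byte(`{"id": "doc-1"}`),
	}
	dups, err := DuplicateIDs(files)
	if err != nil {
		panic(err)
	}
	for id, names := range dups {
		fmt.Printf("duplicate ID %s in %v\n", id, names)
	}
}
```

Because every file has to be parsed and its ID held in memory until the end, a check like this costs more than schema validation alone, which is consistent with it being off by default.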

General Options

  • --verbose, -v: By default, basic information about the sync will be printed to the console. To see much more information (including printed dumps of the IDs present in both the Mongo and Solr databases), pass the --verbose flag.

Glob Patterns

All file parameters can also be passed a glob matching pattern. We use an extended syntax with support for:

  • *: any sequence of non-separator characters
  • **: any sequence of characters, including separators (recursive glob)
  • ?: any single non-separator character
  • [class]: character classes, of the form [abcd] (character list), [a-z] (character range), or [^a-z] (negated class)
  • {alt1,alt2,...}: a finite list of alternatives
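To get a feel for the simpler patterns, the following Go sketch uses path.Match from the standard library. Note that path.Match only covers *, ?, and [class]; it does not understand ** or {alt1,alt2,...}, so it is not a stand-in for the extended matcher used by the tool. Matches is a hypothetical helper.

```go
package main

import (
	"fmt"
	"path"
)

// Matches reports whether name matches pattern, treating a malformed
// pattern as a non-match.
func Matches(pattern, name string) bool {
	ok, err := path.Match(pattern, name)
	return err == nil && ok
}

func main() {
	fmt.Println(Matches("*.json", "a.json"))          // true
	fmt.Println(Matches("*.json", "dir/a.json"))      // false: * stops at separators
	fmt.Println(Matches("doc-?.json", "doc-1.json"))  // true
	fmt.Println(Matches("[a-c]*.json", "b1.json"))    // true
	fmt.Println(Matches("[^a-c]*.json", "b1.json"))   // false: negated class
}
```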

Changelog

  • v0.6: Fix our entirely broken Mongo date handling, and export in a different format to allow for storing them in Solr date objects. Fix a small bug with batched import.
  • v0.5: Add a batch-size parameter to import.
  • v0.4: Move glob handling into the app, allowing for a --unique test in validate-files.
  • v0.3: Port command-line handling to Kong, and introduce a robust sub-command interface. Rename from mongo-solr to mongo-tool. Integrate the functionality of schema-tool into mongo-tool.
  • v0.2: Store all the date values in documents as ISODate in MongoDB.
  • v0.1: Initial support for only the parameters mentioned in the JSON document schema.

License

The code here is copyright © 2021 Charles H. Pence, and released under the GNU GPL v3.