A collection of maintenance scripts for the Sciveyor database.


Sciveyor Scripts

This repository contains a set of scripts used for maintaining the documents in the Sciveyor article database. They may or may not work for you, be useful, or explode.

Some of the scripts are documented in this README file, and others are not. We're aiming to improve the status of this documentation, but make no guarantees.



check/folder_extensions

In our work, we often produce multiple versions of the same file -- for instance, a.pdf might be transformed into a.txt and a.json, and then supplemented by a.pubmed.xml and a.crossref.json. This script walks through the current directory, looks at every basename, and checks that each of the provided file extensions can be found for it. If any extension is missing for a basename, all of the available versions of that file are moved into a folder called orphans.


check/folder_extensions [list,of,extensions]

The only parameter is a comma-separated list of file extensions to check. (These may be provided with or without a dot.) Files are left alone if all of the given extensions are found for their basename; otherwise, they are moved.


In a folder containing:

  • a.pdf
  • a.xml
  • a.txt
  • b.pdf
  • c.xml


check/folder_extensions xml,pdf,txt

will produce:

  • a.pdf
  • a.xml
  • a.txt
  • orphans
    • b.pdf
    • c.xml
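
The check above can be sketched in a few lines of Ruby. This is only an illustration and not the code of check/folder_extensions itself: the helper name find_orphans is hypothetical, it works on a list of names rather than touching the disk, and it assumes that the basename is everything before the first dot.

```ruby
# Hypothetical helper, not taken from check/folder_extensions.
# Given a list of filenames and the required extensions, return the files
# whose basename is missing at least one required extension.
def find_orphans(filenames, extensions)
  # Accept extensions with or without a leading dot.
  required = extensions.map { |ext| ext.start_with?('.') ? ext : ".#{ext}" }

  # Assumption: the basename is everything before the first dot, so that
  # "a.pubmed.xml" is grouped together with "a.pdf".
  groups = filenames.group_by { |name| name.split('.', 2).first }

  groups.flat_map do |base, files|
    complete = required.all? { |ext| files.include?("#{base}#{ext}") }
    complete ? [] : files
  end
end

files = %w[a.pdf a.xml a.txt b.pdf c.xml]
puts find_orphans(files, %w[xml pdf txt]).inspect
# prints ["b.pdf", "c.xml"]
```

The returned names are the ones that would end up in orphans in the example above.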


move/change_extension

This script renames every file with a given extension, found anywhere under the paths provided, so that it carries a different extension instead.


move/change_extension <.oldex> <.newex> <paths>

This will change all files with the extension .oldex under paths to .newex. If the destination file already exists, an error is printed and that move is skipped; the script then continues with any other movable files.


In a folder containing:

  • a.json
  • a.pdf
  • b.json
  • c
    • d.json


move/change_extension .json .backup .

will produce:

  • a.backup
  • a.pdf
  • b.backup
  • c
    • d.backup
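
For readers who want to see the skip-and-continue behavior spelled out, here is a rough sketch. It is not the code of move/change_extension: the method name and return value are assumptions made for this example, and it uses only Dir.glob and FileUtils.mv from the Ruby standard library.

```ruby
require 'fileutils'

# Hypothetical sketch, not the code of move/change_extension.
# Renames every file ending in old_ext under the given paths so that it
# ends in new_ext, and returns the destination paths actually written.
def change_extension(old_ext, new_ext, paths)
  moved = []
  paths.each do |path|
    # '**' makes the search recursive, so c/d.json is found as well.
    Dir.glob(File.join(path, '**', "*#{old_ext}")).sort.each do |source|
      next unless File.file?(source)

      target = source.delete_suffix(old_ext) + new_ext
      if File.exist?(target)
        # Report the conflict, skip this file, and keep going.
        warn "Skipping #{source}: #{target} already exists"
        next
      end

      FileUtils.mv(source, target)
      moved << target
    end
  end
  moved
end

# Example call (commented out, since it renames files on disk):
# change_extension('.json', '.backup', ['.'])
```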


Running this script will remove all characters from filenames in the current directory other than a-z, A-Z, 0-9, dash, and underscore, leaving exactly one dotted file extension at the end of the filename.

Note that the assumption of one file extension means that both files that are supposed to have no extension at all and files with double-dotted extensions (like file.pubmed.xml, where .pubmed.xml is supposed to be "the file extension") will produce unexpected behavior.
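
The rule, and the double-extension caveat, can be illustrated with a small sketch. The method name clean_filename is hypothetical and the real script may be written quite differently; in particular, this sketch always emits exactly one dot, so a file with no extension at all would gain a trailing dot here.

```ruby
# Hypothetical sketch of the cleaning rule described above; not the
# script's actual code.
def clean_filename(name)
  ext = File.extname(name)          # only the last dotted part, e.g. ".xml"
  base = File.basename(name, ext)   # everything before that extension

  # Keep only a-z, A-Z, 0-9, dash, and underscore.
  strip = ->(text) { text.gsub(/[^a-zA-Z0-9_-]/, '') }

  "#{strip.call(base)}.#{strip.call(ext)}"
end

puts clean_filename('my file (1).xml')
# prints myfile1.xml
puts clean_filename('file.pubmed.xml')
# prints filepubmed.xml
```

The second call shows why double-dotted extensions are a problem: the inner dot is treated like any other disallowed character and removed.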


in_hashed_directories

Directories that are too large tend to upset operating systems. Above about 10,000 files (at least in our testing), network shares and even basic local commands like ls stop being very responsive. This script is designed to fix this in a way that still lets you quickly determine whether a file is present on disk. It takes a number of files with names like journal-article-10_2307_1689205.xml, journal-article-10_2307_382953.xml, etc., and files them in folders corresponding to parts of the filename. For instance, the example files above could be placed in:

  • journal-article-10_2307_1689205.xml -> 1/6/8/journal-article-10_2307_1689205.xml
  • journal-article-10_2307_382953.xml -> 3/8/journal-article-10_2307_382953.xml

where the folder names have been extracted from the first "variable" parts of the filenames (1689205 and 382953, respectively).

The script, then, works through the directories given and moves files into the output directory. If the output directory passes a given threshold size, it is split along the first non-ignored character. The process repeats, further subdividing folders as needed until all files have been moved.

With files stored in this way, one can write a quick algorithm for determining whether or not a file is present on disk. Starting at the root of the tree, take the variable characters of the filename one at a time and descend into the matching folder for as long as one exists. When no matching folder remains, check the folder you have reached for the file you're searching for.
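
A minimal sketch of that lookup might look like the following. The method name find_hashed and its ignore_chars keyword are assumptions made for this example (they mirror the --ignore-chars option but are not part of the scripts), and the sketch assumes that only letters and digits are ever used as folder names.

```ruby
# Hypothetical lookup helper for a hashed directory tree.
# Returns the full path of the file if it is present, or nil otherwise.
def find_hashed(root, filename, ignore_chars: 0)
  dir = root
  variable_part = filename[ignore_chars..] || ''

  variable_part.each_char do |char|
    # Assumption: folder names are single letters or digits, so stop at
    # the dot before the extension (or at any other punctuation).
    break unless char.match?(/[A-Za-z0-9]/)

    subdir = File.join(dir, char)
    break unless File.directory?(subdir) # ran out of folders to walk

    dir = subdir
  end

  candidate = File.join(dir, filename)
  File.file?(candidate) ? candidate : nil
end

# Example call, assuming the tree lives in a directory called FilesHashed:
# find_hashed('FilesHashed', 'journal-article-10_2307_1689205.xml', ignore_chars: 24)
```

The lookup costs at most one directory check per variable character, no matter how many files the tree holds.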


in_hashed_directories [--max-files NUM] [--ignore-chars NUM] [--output DIR] [--main-extension EXT] [directories to search]

  • -mNUM, --max-files NUM: Control the maximum number of files that the script will allow within a given folder before splitting it. It defaults to 10,000.
  • -iNUM, --ignore-chars NUM: Ignore a given number of characters as "constant" at the beginning of every filename, before looking for "splitting" characters. It defaults to zero, skipping no characters.
  • -xEXT, --main-extension EXT: The script will look for all files which share the same basename and move them all at once (that is, a.xml, a.pdf, and a.txt will always wind up in the same folder). This parameter tells the script which extension should be the "primary" one to search. It defaults to .xml.
  • -oDIR, --output DIR: The output directory which will be the root of the hashed directory tree. Defaults to . (the current directory).
  • Then pass a list of directories to search. Files will be moved into subdirectories of the output directory.


in_hashed_directories --max-files 5000 --ignore-chars 24 --main-extension .xml --output ~/FilesHashed ~/Files

Move all files in ~/Files to hashed subdirectories of ~/FilesHashed, letting no directory grow larger than 5,000 files, and ignoring the first 24 characters of every filename (in the example above, journal-article-10_2307_).
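
To make the effect of --ignore-chars concrete, here is a small sketch that maps a filename to its place in the tree. The method hashed_path and its depth parameter are hypothetical: the script itself decides how deep to split based on --max-files, whereas this sketch simply takes the depth as given.

```ruby
# Hypothetical helper: build the relative path for a file, skipping
# ignore_chars constant characters and using the next depth characters
# as nested folder names.
def hashed_path(filename, ignore_chars:, depth:)
  folders = filename[ignore_chars, depth].to_s.chars
  File.join(*folders, filename)
end

puts hashed_path('journal-article-10_2307_1689205.xml', ignore_chars: 24, depth: 3)
# prints 1/6/8/journal-article-10_2307_1689205.xml
puts hashed_path('journal-article-10_2307_382953.xml', ignore_chars: 24, depth: 2)
# prints 3/8/journal-article-10_2307_382953.xml
```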


License

All scripts here, unless otherwise specified, are released under the Creative Commons CC0 license, making them as far as possible public domain content in every local jurisdiction. Some scripts will have other licensing information, which will be indicated at the top of the file.