This repository contains a set of scripts used for maintaining the documents in the Sciveyor article database. They may or may not work for you, be useful, or explode.
Some of the scripts are documented in this README file, and others are not. We're aiming to improve the status of this documentation, but make no guarantees.
In our work, we often produce multiple versions of the same file; for example, `a.pdf` might be transformed into `a.json`, and then `a.crossref.json`. This script will walk through the current directory and all sub-directories recursively, looking at every basename and ensuring that, for each one, each of the provided file extensions can be found. If not, all of the versions available will be moved into a sub-folder of their current location.
The only parameter is a comma-separated list of file extensions to check. (These may be provided with or without a dot.) Files will be left alone if all of the given extensions are found, otherwise they will be moved.
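The check described above can be sketched in Python as follows. This is an illustration only, not the script shipped in this repository. The function names are hypothetical, and two details are assumptions: that a basename ends at the first dot in the filename, and that versions are grouped per directory. The sketch only reports incomplete groups and does not move anything.

```python
import os
from collections import defaultdict


def normalize_extensions(spec):
    """Split 'pdf,.json' into ['.pdf', '.json']; the leading dot is optional."""
    return ["." + ext.strip().lstrip(".") for ext in spec.split(",") if ext.strip()]


def split_name(path):
    """Assumption: the basename is everything before the first dot."""
    directory, filename = os.path.split(path)
    base, dot, rest = filename.partition(".")
    return directory, base, dot + rest


def find_incomplete(paths, spec):
    """Return {(directory, basename): [paths]} for groups missing an extension."""
    wanted = set(normalize_extensions(spec))
    groups = defaultdict(dict)
    for path in paths:
        directory, base, ext = split_name(path)
        groups[(directory, base)][ext] = path
    return {
        key: sorted(found.values())
        for key, found in groups.items()
        if not wanted.issubset(found)  # iterates over the extensions found
    }
```

To run it against a real tree, the `paths` list could be built with `os.walk`, joining each directory with its filenames.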
In a folder containing:
This script changes one file extension to a different one for all files of that extension found in the paths provided.
move/change_extension <.oldex> <.newex> <paths>
This will change all files with the extension `.oldex` to `.newex`. An error will be printed and the move will be skipped if the destination file already exists. The script will attempt to continue and move any other movable files, however.
In a folder containing:
move/change_extension .json .backup .
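The skip-on-conflict behavior can be sketched as a small planning function in Python. This is a hedged illustration rather than the repository's own code: `plan_renames` is a hypothetical name, and it works on a list of known paths instead of touching the disk.

```python
def plan_renames(existing, old_ext, new_ext):
    """Decide which files can be renamed from old_ext to new_ext.

    `existing` is every path currently known to exist. Returns two lists of
    (source, target) pairs: renames that are safe, and renames that are
    skipped because the target is already present.
    """
    existing = set(existing)
    moves, skipped = [], []
    for source in sorted(existing):
        if not source.endswith(old_ext):
            continue
        target = source[: -len(old_ext)] + new_ext
        if target in existing:
            skipped.append((source, target))  # report the error, keep going
        else:
            moves.append((source, target))
    return moves, skipped
```

A caller would then apply each pair in `moves` with `os.rename` and print one error line for each pair in `skipped`.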
Running this script will remove all characters from filenames in the current
directory other than
0-9, dash, and underscore, leaving exactly
one dotted file extension at the end of the filename.
Note that the assumption of one file extension means that both files that are supposed to have no extension at all and files with double-dotted extensions (e.g., where `.pubmed.xml` is supposed to be "the file extension") will produce unexpected behavior.
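A minimal Python sketch of this kind of cleanup follows. It is not the repository's script. `clean_filename` is a hypothetical name, and the default set of characters to keep (ASCII letters, digits, dash, and underscore) is an assumption made for illustration.

```python
import re


def clean_filename(name, keep="A-Za-z0-9_-"):
    """Strip disallowed characters, keeping one dotted extension at the end.

    `keep` is a regex character-class body; its default is an assumption.
    """
    stem, dot, ext = name.rpartition(".")
    if not dot:
        # No dot at all: treat the whole name as the stem.
        stem, ext = name, ""
    disallowed = re.compile(f"[^{keep}]")
    stem = disallowed.sub("", stem)
    ext = disallowed.sub("", ext)
    return f"{stem}.{ext}" if dot else stem
```

Because only the last dot is treated as the extension separator, `a.pubmed.xml` becomes `apubmed.xml`, which is the double-dotted caveat described above.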
Directories that are too large tend to upset operating systems. Over about
10,000 files (at least in our testing), network shares and even basic local
commands such as `ls` stop being very responsive. This script is designed to fix
this in a way that still allows for one to quickly determine if a file is
present on disk or not. It will take a number of files with names like
etc., and file them in folders corresponding to parts of the filename. For
instance, the example files above could be placed in:
where here, the folder names have been extracted from the first "variable" parts
of the filenames (
The script, then, works through the directories given and moves files into the output directory. If the output directory passes a given threshold size, it is split along the first non-ignored character. The process repeats, further subdividing folders as needed until all files have been moved.
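The splitting idea can be sketched in Python as a pure function over a list of filenames. This is a simplified batch version written for illustration, not the repository's implementation, which moves files incrementally. `hash_names` is a hypothetical name, and the sketch assumes that each folder level is named after a single character of the filename.

```python
import posixpath
from collections import defaultdict


def hash_names(names, max_files=10000, ignore_chars=0, depth=0, prefix=""):
    """Return {relative_folder: [filenames]} with no folder over max_files.

    Assumption: level N of the tree is named after the character at
    position ignore_chars + N of the filename.
    """
    if len(names) <= max_files:
        return {prefix: sorted(names)}
    position = ignore_chars + depth
    buckets = defaultdict(list)
    for name in names:
        key = name[position] if position < len(name) else ""
        buckets[key].append(name)
    layout = {}
    for key, members in buckets.items():
        if key == "":
            # Name is too short to split any further; it stays where it is.
            layout[prefix] = sorted(members)
        else:
            layout.update(
                hash_names(members, max_files, ignore_chars, depth + 1,
                           posixpath.join(prefix, key))
            )
    return layout
```

For example, with `max_files=2` and `ignore_chars=4`, the names `doc_aa1`, `doc_aa2`, `doc_ab1`, and `doc_b01` end up in the folders `a/a`, `a/a`, `a/b`, and `b` respectively.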
With files stored in this way, one can write a quick algorithm for determining whether or not a file is present on disk. Start looking at the variable characters in the filename, walk down the folders present on disk, until you run out, and then look for the presence or absence of the file for which you're searching.
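That lookup can be sketched in Python as follows. The function name `folder_for` is hypothetical, and the sketch assumes that each folder level is named after one character of the filename, starting after the ignored prefix. It takes the set of folders that exist (relative to the output root) rather than reading the disk, so that the walk itself is easy to follow.

```python
import posixpath


def folder_for(filename, directories, ignore_chars=0):
    """Walk down existing folders, one filename character per level.

    `directories` is the set of folder paths that exist, relative to the
    root of the hashed tree. Returns the deepest matching folder ("" means
    the root itself).
    """
    current = ""
    for char in filename[ignore_chars:]:
        candidate = posixpath.join(current, char)
        if candidate not in directories:
            break  # ran out of folders; the file must be here or nowhere
        current = candidate
    return current
```

The file is then present exactly when it exists inside the returned folder, for instance via `os.path.isfile(os.path.join(root, folder, filename))`.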
in_hashed_directories [--max-files NUM] [--ignore-chars NUM] [--output DIR] --main-extension [EXT] [directories to search]
- `--max-files NUM`: Control the maximum number of files that the script will allow within a given folder before splitting it. It defaults to 10,000.
- `--ignore-chars NUM`: Ignore a given number of characters as "constant" at the beginning of every filename, before looking for "splitting" characters. It defaults to zero, skipping no characters.
- `--main-extension EXT`: The script will look for all files which share the same basename and move them all at once (that is,
a.txt will always wind up in the same folder). This parameter tells the script which extension should be the "primary" one to search. It defaults to
- `--output DIR`: The output directory which will be the root of the hashed directory tree. Defaults to
- Then pass a list of directories to search. Files will be moved into subdirectories of the output directory.
in_hashed_directories --max-files 5000 --ignore-chars 24 --main-extension .xml --output ~/FilesHashed ~/Files
Move all files in `~/Files` to hashed subdirectories of `~/FilesHashed`, letting
no directory grow larger than 5,000 files, and ignoring the first 24 characters
of every filename (in the example above,
This is our master script for converting PDF files to plain text. A lot of decisions went into the construction of this script, so it's worth spending some time detailing why we've done what we have.
First: this script does not, under any circumstances, extract native digital text from PDFs. This may seem like a surprising choice. Why not use that included text if it's available to us? Unfortunately, it suffers from two general problems. First, it strongly tends to arrive in the wrong order. As you know if you've tried to copy and paste from a PDF, often text blocks connect text together in nonsensical ways. Converting and passing through OCR does a better job detecting page layout. Second, font problems are rampant. Technically, no glyph displayed in a PDF has to have any connection to any Unicode character whatsoever; we rely on the accuracy of the conversion tables in each PDF to do the job. For many publishers, especially older PDF files, those tables just don't work. It's more reliable, in general, to rasterize to images and then OCR.
Second: we rasterize all PDFs at 600 DPI (higher than usual to offer some cushion for PDF files that have broken physical size information), and in greyscale. This seems reasonably optimal for Tesseract 5.0, our OCR system.
Third: we've chosen Tesseract 5 (currently
alpha-20210401) after some testing
on various ages of published materials. It provides output at least as good as
that of ABBYY 10, the OCR system that we were using initially for the evoText
project, and it is an open source solution, which makes long-term maintenance
easier. The new LSTM neural-network OCR engine in Tesseract 4+, plus the
tessdata_best models trained by Google, has accuracy on par with any extant
OCR system that we are aware of.
Finally: we run cleanup scripts on all OCR text files that Tesseract generates after it finishes. Currently there is only one such script:
`ocr/fix_hyphenation`: Tesseract does not often merge hyphenated words occurring at the ends of lines. To solve this problem, we scan each line of the text file, looking for lines that end either with an ASCII dash or a Unicode hyphen character. If we find one, we check each of the two partial words at the end of that line and the start of the next, and the merged word created by concatenating them, against a spelling dictionary. If the merged word is a correct spelling, we accept it. If it is not, but both of the words on either side of the hyphen are, we create a hyphenated word (e.g., a line ending with 'drug-' and a line beginning with 'free' will produce 'drug-free'). If neither word is found in the dictionary, the merged word is used (assuming the presence of a technical term). Note that this script requires that the user have installed `aspell` and its English-language dictionary data.
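The decision rule above can be sketched in Python as follows. This is an illustration, not the `ocr/fix_hyphenation` script itself. The function names are hypothetical, and the spelling check is passed in as a plain callable (`is_word`) rather than calling `aspell`. It also skips details such as punctuation attached to the second word fragment.

```python
HYPHENS = ("-", "\u2010")  # ASCII dash and the Unicode hyphen character


def join_parts(left, right, is_word):
    """Apply the three-way rule to the two fragments around a line break."""
    merged = left + right
    if is_word(merged):
        return merged                 # e.g. 'hyphen' + 'ation'
    if is_word(left) and is_word(right):
        return left + "-" + right     # e.g. 'drug' + 'free'
    return merged                     # assume an unknown technical term


def fix_hyphenation(lines, is_word):
    """Merge words split across line ends; returns the corrected lines."""
    lines = [line.rstrip() for line in lines]
    for i in range(len(lines) - 1):
        head_words = lines[i].split(" ")
        tail_words = lines[i + 1].lstrip().split(" ", 1)
        last = head_words[-1]
        if not last.endswith(HYPHENS) or len(last) < 2 or not tail_words[0]:
            continue
        head_words[-1] = join_parts(last[:-1], tail_words[0], is_word)
        lines[i] = " ".join(head_words)
        # The first word of the next line has moved up; keep the remainder.
        lines[i + 1] = tail_words[1] if len(tail_words) > 1 else ""
    return lines
```

In this sketch the completed word is pulled up onto the earlier line, so a line that contained only the second fragment is left empty.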
ocr pdf.pdf out.txt
This script simply expects to be passed two filenames, first the PDF to convert and second the text file to be created.
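The rasterize-then-OCR flow described in this section can be sketched in Python as follows. This is not the `ocr` script itself. In particular, the use of Poppler's `pdftoppm` for rasterization is an assumption made for illustration, since this README does not name the rasterizer, and the function names are hypothetical. The sketch also leaves out the cleanup step.

```python
import glob
import os
import subprocess
import tempfile


def rasterize_command(pdf_path, prefix, dpi=600):
    """Build a pdftoppm call: greyscale PNG pages at the given DPI."""
    return ["pdftoppm", "-r", str(dpi), "-gray", "-png", pdf_path, prefix]


def tesseract_command(image_path, language="eng"):
    """Build a tesseract call that writes recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", language]


def ocr_pdf(pdf_path, text_path):
    """Rasterize every page of a PDF, OCR each page, and join the text."""
    with tempfile.TemporaryDirectory() as workdir:
        prefix = os.path.join(workdir, "page")
        subprocess.run(rasterize_command(pdf_path, prefix), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        pages = sorted(glob.glob(prefix + "-*.png"))
        with open(text_path, "w", encoding="utf-8") as output:
            for page in pages:
                result = subprocess.run(
                    tesseract_command(page),
                    check=True, capture_output=True, encoding="utf-8",
                )
                output.write(result.stdout)
```

Calling `ocr_pdf("pdf.pdf", "out.txt")` mirrors the two-argument usage shown above, and requires both `pdftoppm` and `tesseract` to be installed.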
This script simply OCRs all PDF files named on the command line (including
recursively in any subdirectories) to text files in the same location (with the
extension `.txt`). Files will be skipped if the output is
ocr_multiple folder_1 folder_2 3.pdf
All scripts here, unless otherwise specified, are released under the Creative Commons CC0 license, making them as far as possible public domain content in every local jurisdiction. Some scripts will have other licensing information, which will be indicated at the top of the file.