filecabinet is a minimal document management system for your computer. It has metadata per document and supports fulltext search in various document types.
The easiest way to install is to use
pip install filecabinet
Alternatively you can get the source code at codeberg:
git clone https://codeberg.org/vonshednob/filecabinet
pip install ./filecabinet
filecabinet requires the xapian Python bindings, which cannot be installed through pip.
Other automatically installed required dependencies are:
Even though optional, I strongly recommend installing Tesseract OCR to enable fulltext search in scanned documents.
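As a quick sanity check before the first run, the two external prerequisites mentioned above can be probed from Python. This is only a minimal sketch: it assumes the xapian bindings are importable under the module name xapian and that the Tesseract executable is called tesseract and is on your PATH. The helper name check_prerequisites is made up for this example and is not part of filecabinet.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Report whether the xapian bindings and the tesseract binary can be found.

    Hypothetical helper for illustration; filecabinet does not ship it.
    """
    return {
        # Assumption: the bindings install a top-level module named "xapian".
        "xapian": importlib.util.find_spec("xapian") is not None,
        # Assumption: the OCR executable is named "tesseract" and is on PATH.
        "tesseract": shutil.which("tesseract") is not None,
    }


status = check_prerequisites()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

A missing xapian entry means filecabinet cannot work at all, while a missing tesseract entry only disables OCR.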
To initialize your file cabinet, run
filecabinet init and provide a new
path where you would like to store your documents:
filecabinet init ~/Documents/cabinet
Now you can either copy files into the cabinet's inbox directory (here ~/Documents/cabinet/inbox) and run filecabinet pickup to process them, or add files manually via filecabinet add:
filecabinet add ~/some_scanned_document.jpg
To get a basic overview of documents, you can use the Shell.
There’s a basic shell that allows you to inspect indexed documents, edit their metadata (by means of an external text editor), or view the documents.
To open the shell, run filecabinet shell.
Type help inside the shell to see what your options are.
If you want to use a specific text editor to modify metadata, consider
updating your configuration file’s
Shell section and add a
document_editor, like this:
[Shell]
editor = subl -w
In this example we set up SublimeText as the external editor. Note that the
-w option is necessary to make filecabinet wait until you’re done editing
the file before returning into the shell.
Visual Studio Code uses the --wait flag to accomplish the same thing.
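To see why the wait flag matters, here is a rough sketch of how an editor command string like the ones above could be turned into a process invocation. It is not filecabinet's actual implementation; the function name editor_argv and the file path are made up for illustration.

```python
import shlex


def editor_argv(editor_setting, path):
    """Build the argument list for launching an external editor on one file.

    Hypothetical helper: splits the configured command like a shell would
    and appends the file to edit.
    """
    return shlex.split(editor_setting) + [str(path)]


print(editor_argv("subl -w", "/tmp/metadata.yaml"))
print(editor_argv("code --wait", "/tmp/metadata.yaml"))
```

A caller would typically hand such a list to subprocess.run, which returns as soon as the launched process exits. Editors that hand the file over to an already running window exit immediately unless told to wait, so without the flag the caller would continue before you have finished editing.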
Searching for tags is case-insensitive and uses the tag: prefix.
For example, if you're looking for a document that's tagged with banana, you can search for it with tag:banana.
To find new documents, search for tag:new.
If you only want to find documents that are not new, you can also search for -tag:new. Unless specified, a search will ignore whether or not a document is new.
You can search for any metadata value, like title, author, or language, by searching with the metadata name and a colon, like author:gumby.
Everything else that does not match the special search terms will be used in the fulltext search.
If you want to search for terms containing whitespace, put them in quotes, for example title:"The Larch".
The title contains "brain", is from author "Gumby" and it was set to some time
before August 2005:
title:brain author:gumby date:2015-08-01
Looking for a newly added document with the title "The Larch":
tag:new title:"The Larch"
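To make the structure of such queries more concrete, the following sketch splits a search string into prefixed metadata terms and free fulltext terms. It only illustrates the syntax described above and is not how filecabinet or xapian actually parses queries; split_query is a made-up name.

```python
import shlex


def split_query(query):
    """Split a search string into (name, value, negated) terms and free text.

    Illustrative only: quoting is handled by shlex, and a leading "-" marks
    a term as negated.
    """
    fields = []
    fulltext = []
    for token in shlex.split(query):
        negated = token.startswith("-")
        body = token[1:] if negated else token
        name, sep, value = body.partition(":")
        if sep and name and value:
            fields.append((name.lower(), value, negated))
        else:
            # No "name:value" shape, so treat the token as fulltext.
            fulltext.append(token)
    return fields, fulltext


fields, fulltext = split_query('-tag:new title:"The Larch" parrot')
print(fields)    # [('tag', 'new', True), ('title', 'The Larch', False)]
print(fulltext)  # ['parrot']
```

The quoted title stays together as one value, and the word without a prefix ends up in the fulltext part of the search.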
filecabinet can use Tesseract OCR to do character recognition on pictures and scanned PDFs, so you can search the text of images.
In order for that to work, you have to install Tesseract and some language packages, depending on the languages of the documents you wish to scan.
If you don't have Tesseract OCR installed, filecabinet will still work, but it will be much less useful.
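To find out which language packages your Tesseract installation already has, you can ask Tesseract itself through its --list-langs option. The sketch below wraps that call; it assumes the executable is named tesseract and is on your PATH, and the function name tesseract_languages is made up for this example.

```python
import shutil
import subprocess


def tesseract_languages():
    """Return installed Tesseract language codes, or None if Tesseract is missing."""
    if shutil.which("tesseract") is None:
        return None
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some versions print the list to stderr
        text=True,
        check=False,
    )
    lines = [line.strip() for line in proc.stdout.splitlines() if line.strip()]
    # Drop the header line, e.g. "List of available languages (3):".
    return [line for line in lines if not line.lower().startswith("list of")]


langs = tesseract_languages()
if langs is None:
    print("Tesseract is not installed; scanned documents will not be searchable.")
else:
    print("Installed OCR languages:", ", ".join(langs))
```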
Rule based tagging
By using metaindex, filecabinet inherits its powerful rule-based tagging. This allows you to automatically add metadata tags to documents based on their text (which might have come from OCR).
Rules are defined in text files, and you have to point filecabinet to the rule files that you want it to use. To do that, add a [Rules] section to your configuration file (usually at ~/.config/filecabinet/filecabinet.conf) and list your rule files like this:
[Rules]
base = ~/.config/filecabinet/basic_rules.txt
companies = ~/Document/company_rules.txt
The names (before the
=) are somewhat free-form descriptors.
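If you want to double-check that every file listed in the Rules section actually exists, a few lines of Python with the standard configparser module are enough. This is a minimal sketch that works on the example section from above as a string; rule_files is a made-up helper, and filecabinet itself may read its configuration differently.

```python
import configparser
from pathlib import Path

CONFIG_TEXT = """
[Rules]
base = ~/.config/filecabinet/basic_rules.txt
companies = ~/Document/company_rules.txt
"""


def rule_files(config_text):
    """Map each descriptor in the [Rules] section to an expanded file path."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section("Rules"):
        return {}
    return {name: Path(value).expanduser() for name, value in parser.items("Rules")}


rules = rule_files(CONFIG_TEXT)
for name, path in rules.items():
    print(name, path, "exists" if path.exists() else "missing")
```

To check your real configuration, read the text of your configuration file and pass it to rule_files instead of CONFIG_TEXT.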
To understand how to write these rule files, please have a look at the metaindex documentation.
To test your rules on documents, you can use the test-rules command. It will run all indexers on a file and show you what tags have been found by your rules.
When you use test-rules, the tested document will not be added to your cabinet.
Cabinet Directory Structure
Assuming a cabinet is set up at
~/cabinet, the directory structure is:
~/cabinet
├── inbox
├── metaindex.conf
├── metaindex.log
└── documents
    └── <partial document id>
        └── <full document id>
            ├── <document id>.yaml
            ├── <document id>.<suffix>
            └── <document id>.txt
inbox will be processed (and emptied) when filecabinet pickup is run
documents contains the documents
<document id>.yaml contains the metadata
<document id>.<suffix> is the original document (usually a PDF)
<document id>.txt is the extracted full text, if it could be extracted
metaindex.conf, the configuration file for filecabinet's metaindexserver
metaindex.log, the log file of filecabinet's metaindexserver
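Because the layout is plain files and directories, it can be inspected without filecabinet. The following sketch walks the documents folder and groups the files of each document by role. It relies only on the structure described above; list_documents and the document id abc123 in the demo are made up, and the sketch assumes that document ids contain no dots.

```python
import tempfile
from pathlib import Path


def list_documents(cabinet):
    """Map each full document id to its metadata, fulltext, and original file."""
    result = {}
    # documents/<partial document id>/<full document id>/
    for doc_dir in sorted(Path(cabinet, "documents").glob("*/*")):
        if not doc_dir.is_dir():
            continue
        doc_id = doc_dir.name
        entry = {"metadata": None, "fulltext": None, "original": None}
        for path in sorted(doc_dir.iterdir()):
            if path.stem != doc_id:
                continue
            if path.suffix == ".yaml":
                entry["metadata"] = path
            elif path.suffix == ".txt":
                entry["fulltext"] = path
            else:
                entry["original"] = path
        result[doc_id] = entry
    return result


# Demo on a throwaway directory that mimics the layout with one document.
with tempfile.TemporaryDirectory() as tmp:
    doc_dir = Path(tmp, "documents", "ab", "abc123")
    doc_dir.mkdir(parents=True)
    for name in ("abc123.yaml", "abc123.pdf", "abc123.txt"):
        (doc_dir / name).touch()
    found = list_documents(tmp)
    summary = {
        doc_id: {role: path.name for role, path in entry.items() if path}
        for doc_id, entry in found.items()
    }
    print(summary)
```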
Usage from Python
To use filecabinet from Python, you can use this boilerplate:
from filecabinet import Manager
manager = Manager()
manager.launch_server()
session = manager.new_session()
session will be an instance of
Session which, together with
allows manipulation of metadata and querying of documents.