Bash script for adding a text layer to PDF files and converting images in PDFs (with OCR). https://decatec.de

OCRmyFiles

Bash script for adding a text layer to PDF files and converting images in PDFs (with OCR).

Adds an OCR text layer to all PDF files in the given input directory and saves the new PDF files to the output directory.

When the input directory also contains image files (e.g. jpg, png), these are converted to (OCR’ed) PDFs.

All other file types are just copied from the input directory to the output directory.
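
The three rules above can be sketched as a dispatch on the file extension. This is only an illustration and not the code of OCRmyFiles.sh itself: the function names `classify_file` and `plan_directory` are hypothetical, and treating `jpeg` as an image extension is an assumption (the text only names jpg and png as examples). The sketch prints the planned action and runs no OCR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the actual OCRmyFiles.sh.

# Decide what to do with a file based on its extension.
# Prints one of: ocr-pdf, image-to-pdf, copy
classify_file() {
  local name="${1##*/}"   # strip the directory part
  local ext="${name##*.}" # text after the last dot
  if [ "$ext" = "$name" ]; then
    ext=""                # no dot in the name: no extension
  fi
  # Compare case-insensitively (e.g. .PDF and .pdf are treated alike).
  ext="$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')"
  case "$ext" in
    pdf)          echo "ocr-pdf" ;;
    jpg|jpeg|png) echo "image-to-pdf" ;; # jpeg is an assumption
    *)            echo "copy" ;;
  esac
}

# Dry run: print the planned action for every regular file in a directory.
plan_directory() {
  local dir="$1" file
  for file in "$dir"/*; do
    [ -f "$file" ] || continue
    printf '%s\t%s\n' "$(classify_file "$file")" "${file##*/}"
  done
}
```

Calling `plan_directory /path/to/input` (a placeholder path) lists one line per file, with the action first and the file name second.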

Requirements

Usage

  • Download script or clone repository
  • Make the script executable: sudo chmod +x OCRmyFiles.sh
  • Modify the script to fit your needs:
    • Set default input/output directories
    • Modify the OCRmyPDF command line arguments (the OCRmyPDF documentation gives an overview of the available arguments)
    • Modify the Tesseract command line arguments (the Tesseract documentation gives an overview of the available arguments)
  • Call the script:
    • OCRmyFiles.sh (no parameter): using default directories for input/output (as defined in the script itself)
    • OCRmyFiles.sh <inputDir> <outputDir>: using specified directories for input/output
  • The script might print some warnings/errors from Tesseract. In most cases these can be ignored, because the OCR text layer is created anyway.
  • You can also call this script with a cronjob for automated processing of PDFs/images:
    • Logged in as the user that should run the cronjob, call crontab -e
    • Add the following to run the script e.g. every 30 minutes: */30 * * * * /path/to/the/script/OCRmyFiles.sh > /dev/null 2>&1
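
With a 30-minute schedule, a long OCR run could still be active when cron starts the next one. The following is a minimal sketch of a wrapper that skips a run in that case. It is not part of OCRmyFiles: the function name `run_exclusive`, the exit code 75, and the lock path in the example are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, NOT part of OCRmyFiles.sh.

# run_exclusive LOCKDIR COMMAND [ARGS...]
# mkdir either creates the directory or fails, atomically, so only one
# caller at a time can hold the lock.
run_exclusive() {
  local lock_dir="$1"
  shift
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "previous run still active, skipping" >&2
    return 75 # arbitrary "try again later" code (assumption)
  fi
  local status=0
  "$@" || status=$?  # run the command and remember its exit status
  rmdir "$lock_dir"  # release the lock even if the command failed
  return "$status"
}

# Example call; all paths are placeholders:
# run_exclusive /tmp/ocrmyfiles.lock /path/to/the/script/OCRmyFiles.sh /path/to/input /path/to/output
```

If the wrapper is killed before it reaches `rmdir`, the lock directory stays behind and has to be removed by hand. That is the main trade-off of this simple approach.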