
README.md

What will it do?

Have you scanned documents but cannot search them for text?

This script is for you.

What it does:

  • search a directory for PDFs
  • read every PDF
    • try to extract text from the PDF
    • create a new PDF containing the original content AND the text that was recognized
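
The steps above could be sketched roughly as follows. This is not the code from ocr.sh: the function names (`ocr_name`, `ocr_pdf`, `ocr_dir`), the `-ocr.pdf` output suffix and the 300 dpi resolution are assumptions made for illustration. It only relies on ImageMagick's `convert` and on tesseract's `pdf` output mode.

```shell
#!/bin/sh
# Sketch only: all function names and the "-ocr.pdf" suffix are hypothetical.

# Derive the output name: scan.pdf -> scan-ocr.pdf
ocr_name() {
    printf '%s-ocr.pdf\n' "${1%.pdf}"
}

# Rasterize one PDF into a multi-page TIFF, then let tesseract write a
# PDF that contains the page images plus the recognized text.
ocr_pdf() {
    src=$1
    out=$(ocr_name "$src")
    tmp=$(mktemp -d) || return 1
    # 300 dpi is an assumed resolution; tesseract appends ".pdf" to the
    # output base on its own, so the extension is stripped here.
    convert -density 300 "$src" -depth 8 -alpha off "$tmp/pages.tiff" &&
        tesseract "$tmp/pages.tiff" "${out%.pdf}" -l deu pdf
    status=$?
    rm -rf "$tmp"
    return "$status"
}

# Process every PDF in a directory, skipping earlier OCR results.
ocr_dir() {
    for f in "$1"/*.pdf; do
        [ -e "$f" ] || continue
        case $f in *-ocr.pdf) continue ;; esac
        ocr_pdf "$f" || echo "failed: $f" >&2
    done
}
```

Calling `ocr_dir ./scans` would then leave a `name-ocr.pdf` next to each `name.pdf`. Writing to a separate file keeps the original scan untouched if recognition fails.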

The script does the same for TIFFs.
If your scan is spread over several TIFFs, e.g.

  • a-scan-1.tiff
  • a-scan-2.tiff
  • ...

no problem:

  • The script will concatenate the TIFFs for you and
  • create one single PDF
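
A minimal sketch of the concatenation, assuming the pages follow the `name-N.tiff` pattern shown above. The helper names `scan_base` and `merge_scan` are hypothetical and not taken from ocr.sh. `convert` writes several input images into one multi-page TIFF, which tesseract can read directly.

```shell
#!/bin/sh
# Sketch only: scan_base and merge_scan are hypothetical helper names.

# a-scan-1.tiff -> a-scan (strips a trailing "-<number>.tiff" or ".tif")
scan_base() {
    printf '%s\n' "$1" | sed -E 's/-[0-9]+\.tiff?$//'
}

# Join all pages of one scan into a multi-page TIFF and OCR it into a
# single PDF named "<base>.pdf".
merge_scan() {
    base=$1
    tmp=$(mktemp -d) || return 1
    # The glob expands in lexical order, so "-10" sorts before "-2";
    # zero-pad the page numbers if a scan has more than nine pages.
    convert "$base"-*.tiff "$tmp/all.tiff" &&
        tesseract "$tmp/all.tiff" "$base" -l deu pdf
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

With the files above, `merge_scan "$(scan_base a-scan-1.tiff)"` would produce `a-scan.pdf`.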

The script was tested on Debian 10 (buster).

Preparation

apt-get install imagemagick tesseract-ocr tesseract-ocr-deu
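
To confirm the tools are actually on the PATH after installation, a small check like the following may help. `missing_tools` is a hypothetical helper and not part of ocr.sh; `convert` is the ImageMagick command and `tesseract` the OCR engine installed above.

```shell
#!/bin/sh
# Sketch only: missing_tools is a hypothetical helper.
# Prints the name of every given command that cannot be found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running `missing_tools convert tesseract` prints nothing when both commands are available.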

See ocr.sh for troubleshooting.

Run

./ocr.sh

This processes all TIFFs first and then all PDFs.

Be aware that, at the moment, the OCR uses German (deu) as its recognition language. To use another language, install the matching tesseract language package and change the language in the script.
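
As a hedged illustration of switching languages: on Debian the language data ships in packages named `tesseract-ocr-<code>`, with underscores in the code replaced by dashes. The helper `lang_package` below is hypothetical. How ocr.sh passes the language to tesseract is not shown here; the sketch assumes it uses tesseract's `-l` option.

```shell
#!/bin/sh
# Sketch only: lang_package is a hypothetical helper.
# Map a tesseract language code to its Debian package name,
# e.g. eng -> tesseract-ocr-eng, chi_sim -> tesseract-ocr-chi-sim
lang_package() {
    printf 'tesseract-ocr-%s\n' "$(printf '%s' "$1" | tr '_' '-')"
}

# Typical use (not executed here):
#   apt-get install "$(lang_package eng)"
#   tesseract --list-langs              # confirm "eng" is listed
#   tesseract page.tiff page -l eng pdf # -l selects the language
```

After installing the package, replace `deu` with the new code wherever the script hands the language to tesseract.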