
What will it do?

You scanned documents but cannot search them for text?

This script is for you.

What it does...

  • search a directory for PDFs
  • read every PDF
    • try to extract text from the PDF
    • create a new PDF containing the original content AND the text that was recognized
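
The PDF steps above could look roughly like the following sketch. It is not the script itself: the function names, the 300 dpi setting and the `-ocr.pdf` output suffix are assumptions made for illustration. It relies on ImageMagick's `convert` to rasterize the PDF and on tesseract's `pdf` output mode, which writes the page image together with the recognized text.

```shell
#!/usr/bin/env bash
# Sketch only: function names, 300 dpi and the "-ocr.pdf" suffix are assumptions.

# Derive the output name: scan.pdf -> scan-ocr.pdf
ocr_output_name() {
    printf '%s-ocr.pdf\n' "${1%.pdf}"
}

# Rasterize one PDF, then let tesseract write image plus text layer.
ocr_pdf() {
    local input="$1" lang="${2:-deu}"
    local output tmp status
    output="$(ocr_output_name "$input")"
    tmp="$(mktemp -d)" || return 1
    # ImageMagick renders all pages into one multi-page TIFF;
    # tesseract then appends ".pdf" to the output base on its own.
    convert -density 300 "$input" -background white -alpha remove "$tmp/pages.tiff" &&
        tesseract "$tmp/pages.tiff" "${output%.pdf}" -l "$lang" pdf
    status=$?
    rm -rf "$tmp"
    return "$status"
}

# Walk a directory and process every PDF that is not already an OCR result.
ocr_all_pdfs() {
    local dir="${1:-.}"
    find "$dir" -type f -name '*.pdf' ! -name '*-ocr.pdf' -print0 |
        while IFS= read -r -d '' pdf; do
            ocr_pdf "$pdf"
        done
}
```

Calling `ocr_all_pdfs ~/scans` would then leave a `name-ocr.pdf` next to each `name.pdf`. Writing to a new file keeps the original scan untouched in case the OCR result is poor.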

The script does the same for TIFFs.
If your scan is spread over several TIFFs, e.g.

  • a-scan-1.tiff
  • a-scan-2.tiff
  • ...

that is no problem:

  • The script concatenates the TIFFs for you and
  • creates one single PDF.
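
One way to group numbered TIFFs such as `a-scan-1.tiff` and `a-scan-2.tiff` is sketched below. The helper names are hypothetical, and the sketch assumes every page ends in `-<number>.tiff`; TIFFs without a page number are not covered. Pages are sorted with `sort -V` so that page 10 comes after page 2, and `convert` then writes them into one multi-page TIFF that can be passed to the OCR step.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the "-<number>.tiff" naming rule are assumptions.

# a-scan-2.tiff -> a-scan
scan_group() {
    local name="${1##*/}"      # drop the directory part
    name="${name%.tiff}"       # drop the extension
    printf '%s\n' "$name" | sed -E 's/-[0-9]+$//'
}

# Print all pages of one scan in natural page order.
list_pages() {
    local dir="$1" group="$2"
    find "$dir" -maxdepth 1 -type f -name "${group}-*.tiff" | sort -V
}

# Print each scan name found in a directory once.
list_groups() {
    local dir="$1"
    find "$dir" -maxdepth 1 -type f -name '*.tiff' |
        while IFS= read -r page; do
            scan_group "$page"
        done | sort -u
}

# Concatenate the pages of one scan into a single multi-page TIFF.
merge_scan() {
    local dir="$1" group="$2" output="$3"
    list_pages "$dir" "$group" | tr '\n' '\0' |
        xargs -0 sh -c 'convert "$@" "$0"' "$output"
}
```

For example, `merge_scan ~/scans a-scan /tmp/a-scan.tiff` would combine all `a-scan-*.tiff` pages; the merged file can then go through tesseract like any other TIFF.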

The script was tested on Debian 10 (buster).


apt-get install imagemagick tesseract-ocr tesseract-ocr-deu

See the for troubleshooting.



This will grab all TIFFs and then all PDFs.

Be aware that the OCR currently uses DEU (German) as its language. To recognize another language, change the language code in the script.
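
As a rough illustration of what changing the language involves: tesseract selects the language with its `-l` option, and `tesseract --list-langs` prints the language data that is installed. The sketch below assumes a hypothetical `OCR_LANG` environment variable and helper names that are not part of the script.

```shell
#!/usr/bin/env bash
# Sketch only: OCR_LANG and both helper names are assumptions.

# Language code to pass to "tesseract -l"; falls back to German.
ocr_lang() {
    printf '%s\n' "${OCR_LANG:-deu}"
}

# Succeeds if tesseract reports the given language as installed.
lang_installed() {
    tesseract --list-langs 2>&1 | grep -qxF "$1"
}
```

With these helpers, `OCR_LANG=eng` would switch to English, provided the matching language data is installed, for example via `apt-get install tesseract-ocr-eng` on Debian.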