What will it do?
Have you scanned documents but cannot search them for text?
This script is for you.
What it does...
- search a directory for PDFs
- read every PDF
- try to extract text from the PDF
- create a new PDF containing the original content AND the text that was recognized
The script does the same for TIFFs.
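The steps above can be sketched roughly as follows. This is a minimal illustration and not the actual contents of ocr.sh: the function names `ocr_pdf` and `ocr_dir`, the `.ocr.pdf` output suffix, the 300 dpi setting and the intermediate TIFF are all assumptions. It also assumes ImageMagick can read PDFs on your system, which requires Ghostscript.

```shell
#!/bin/sh
# Hypothetical sketch, not the real ocr.sh.

# ocr_pdf FILE.pdf
# Rasterize the PDF with ImageMagick, then let tesseract write a new PDF
# that contains the page images plus the recognized text.
ocr_pdf() {
    src=$1
    base=${src%.pdf}
    # Assumed intermediate step: render all pages into one multi-page TIFF.
    convert -density 300 "$src" -depth 8 -alpha off "$base.ocr.tif" || return 1
    # The trailing "pdf" tells tesseract to produce a searchable PDF
    # named "$base.ocr.pdf"; "deu" selects the German language data.
    tesseract "$base.ocr.tif" "$base.ocr" -l deu pdf || return 1
    rm -f "$base.ocr.tif"
}

# ocr_dir [DIR]
# Search a directory (default: current directory) and process every PDF,
# skipping files that this sketch produced earlier.
ocr_dir() {
    find "${1:-.}" -type f -name '*.pdf' ! -name '*.ocr.pdf' |
    while IFS= read -r f; do
        ocr_pdf "$f"
    done
}
```

Calling `ocr_dir /path/to/scans` would then leave a `name.ocr.pdf` next to each `name.pdf`; the original file is not modified.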
If your scan stretches over several TIFFs, the script will:
- concatenate the TIFFs for you and
- create one single PDF
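One way to combine several TIFFs into a single searchable PDF is shown below. It is a sketch under assumptions, not the code from ocr.sh: the function name `ocr_tiffs` and the file names in the usage line are hypothetical. It relies on ImageMagick writing all input images into one multi-page TIFF, which tesseract can then read page by page.

```shell
#!/bin/sh
# Hypothetical sketch, not the real ocr.sh.

# ocr_tiffs OUTPUT_BASENAME PAGE.tif...
# Concatenate the given TIFFs in argument order and OCR the result.
ocr_tiffs() {
    out=$1
    shift
    # Given several inputs and one output, convert writes a multi-page TIFF.
    convert "$@" "$out.tif" || return 1
    # Produces "$out.pdf" containing every page plus its recognized text.
    tesseract "$out.tif" "$out" -l deu pdf || return 1
    rm -f "$out.tif"
}
```

For example, `ocr_tiffs letter letter-1.tif letter-2.tif` would create `letter.pdf` with two pages. The page order follows the argument order, so a shell glob only works if the file names sort correctly.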
The script was tested on Debian 10 (buster).
apt-get install imagemagick tesseract-ocr tesseract-ocr-deu
See ocr.sh for troubleshooting.
This will grab all TIFFs and then all PDFs.
Be aware that the OCR currently uses German (DEU) as its dictionary. To use another language, change the script.
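If you adapt the script, the language could be made configurable along these lines. This is only a sketch: the `OCR_LANG` variable and the `ocr_image` function are hypothetical and do not exist in ocr.sh. The language data must also be installed, for example the Debian package `tesseract-ocr-eng` for English; `tesseract --list-langs` shows what is available.

```shell
#!/bin/sh
# Hypothetical sketch, not the real ocr.sh.

# Take the tesseract language code from the environment,
# falling back to German as the script does today.
OCR_LANG=${OCR_LANG:-deu}

# ocr_image FILE
# Write a searchable PDF next to FILE (same name, .pdf extension).
ocr_image() {
    tesseract "$1" "${1%.*}" -l "$OCR_LANG" pdf
}
```

With this in place, setting `OCR_LANG=eng` before running the script would switch the recognition to English without editing the tesseract call.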