By John Glynn
I wrote an article on Optical Character Recognition (OCR) in the July 2011 Bits'N'Bytes, which detailed how to convert an image file into a single-bit TIFF file using GIMP.
The conversion program, tesseract, requires a TIFF file to produce a text file.
This manual process works well, but it can be a little tedious if you want to convert a lot of image files to text.
The backend program tesseract has improved and now handles columns, so I have written a bash shell script which automates the whole process.
You can download the scripts here.
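To give a feel for what such automation involves, here is a minimal sketch of the image-to-TIFF-to-text pipeline described above. It is not the downloadable script itself: the function names base_name and ocr_image are made up for illustration, and the use of convert's -monochrome option to produce the single-bit TIFF is an assumption about one reasonable way to do that step.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; NOT the author's script.

# Strip the last extension from a path: photo.jpg -> photo
base_name() {
    printf '%s\n' "${1%.*}"
}

# Convert one image to a black-and-white TIFF, then run tesseract on it.
# tesseract appends .txt to the output base name it is given.
ocr_image() {
    local img base
    img=$1
    base=$(base_name "$img")
    convert "$img" -monochrome "$base.tif" &&
        tesseract "$base.tif" "$base"
}

# Process every image file named on the command line.
for img in "$@"; do
    ocr_image "$img"
done
```

Saved under a name of your choice and run with a list of image files as arguments, this would leave a .tif and a .txt file beside each image, provided both tools are installed.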
Installation:
The script requires two programs: tesseract and convert. The convert command is part of the ImageMagick suite.
1. Install tesseract and ImageMagick on your OS.
2. Make a directory called Convert directly under your home directory, e.g. /home/john/Convert
3. Download a copy of the shell script MultiConvert at a Club meeting from the ??? and place it in the /home/$USER/bin directory.
4. Make sure this program is executable!
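
The installation steps above can be checked from a terminal with something like the following sketch. The helper name missing_tools is made up for illustration, and the sketch assumes your home directory is /home/$USER (so that $HOME/bin is the same place as /home/$USER/bin) and that the script was saved under the name MultiConvert.

```shell
#!/usr/bin/env bash
# Sketch only: verifies the setup described in the installation steps.

# Print the name of each given command that is not on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools tesseract convert)
if [ -n "$missing" ]; then
    echo "Not installed or not on PATH:" $missing
fi

# Create the working directory under your home directory if needed.
mkdir -p "$HOME/Convert"

# Assumed location and name of the script; adjust if yours differs.
SCRIPT="$HOME/bin/MultiConvert"
if [ -f "$SCRIPT" ]; then
    chmod u+x "$SCRIPT"    # make the script executable for your user
else
    echo "MultiConvert not found at $SCRIPT"
fi
```

If nothing is printed, both programs were found and the script is in place and executable.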