Friday, April 18, 2008 2:50 PM
Improving OCR Accuracy
OCR engines are very good at reading text from clean, machine-printed documents. If you have older scans, or documents that were never meant to be easily read by a machine, there are still some things you can do to improve your accuracy.
This report from the GPO on Optimizing OCR Accuracy has some interesting findings, but it also contains some errors, which I will try to explain.
The report states that thresholding the documents didn't improve accuracy and shows this result:
The issue is not that thresholding can't help. In fact, since most OCR engines can only work on thresholded (black-and-white) images, they will threshold for you if you do not. The report's authors are right to point out that the scans should be done in full color, but that's because you then get a chance to apply the thresholding yourself (instead of letting the scanner do it). If you use a good thresholding algorithm, you can do quite a bit better.
Using DynamicThresholdCommand from DotImage Document Imaging and SpeckRemovalCommand from Advanced Document Cleanup with default parameters, I got this result:
I don't have the original, so I cannot check OCR accuracy, but I bet this would give a better result than the one they got using the default threshold in Photoshop. In any case, a threshold must be applied before OCR, so either you do it under your control or the OCR engine will do it for you.
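To make the difference between a fixed cutoff and a locally adapted one concrete, here is a minimal sketch in plain Python. It is not the algorithm behind DynamicThresholdCommand (I am not showing that here); it is a generic local-mean threshold, and the function names, the `window` and `offset` parameters, and the synthetic shaded page are all assumptions made up for illustration.

```python
def global_threshold(gray, level=128):
    """Binarize with one fixed cutoff for the whole page."""
    return [[0 if p < level else 255 for p in row] for row in gray]


def adaptive_threshold(gray, window=15, offset=15):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel turns black (0) when it is more than `offset` darker than the
    mean of its window x window neighborhood; otherwise it turns white (255).
    """
    height, width = len(gray), len(gray[0])
    # Integral image with a zero border: integral[y][x] is the sum of
    # all pixels in rows 0..y-1 and columns 0..x-1.
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    half = window // 2
    out = []
    for y in range(height):
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        out_row = []
        for x in range(width):
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            mean = total / ((y1 - y0) * (x1 - x0))
            out_row.append(0 if gray[y][x] < mean - offset else 255)
        out.append(out_row)
    return out


def make_shaded_page(width=40, height=20):
    """Synthetic page: background darkens left to right, strokes are 60 darker."""
    gray, truth = [], []
    for y in range(height):
        gray_row, truth_row = [], []
        for x in range(width):
            background = 220 - 3 * x
            is_stroke = (x % 8 == 4) and (5 <= y < 15)
            gray_row.append(background - 60 if is_stroke else background)
            truth_row.append(0 if is_stroke else 255)
        gray.append(gray_row)
        truth.append(truth_row)
    return gray, truth


def count_errors(binary, truth):
    """Number of pixels where the binarized image disagrees with the truth."""
    return sum(b != t
               for b_row, t_row in zip(binary, truth)
               for b, t in zip(b_row, t_row))


if __name__ == "__main__":
    gray, truth = make_shaded_page()
    print("global errors:  ", count_errors(global_threshold(gray), truth))
    print("adaptive errors:", count_errors(adaptive_threshold(gray), truth))
```

On this made-up page the fixed cutoff both drops a stroke on the bright side and blackens the background on the dark side, while the local threshold follows the shading. Real scans are messier, so treat `window` and `offset` as values to tune, not as recommendations.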
Another problem they found is with downsampling. They had scans at a high DPI, but the OCR vendor's recommendation was 300 DPI, so they downsampled. I am sure the OCR vendor meant at least 300 DPI, so they did not have to do this. Downsampling necessarily throws away information, so you should expect it to reduce OCR accuracy. If you do apply it, make sure to choose a good algorithm. Downsampling does have benefits (increased speed), but if accuracy is the main concern, you should not do it.
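As a small illustration of why the choice of downsampling algorithm matters, the sketch below compares two generic methods on a tiny grayscale image: dropping pixels (nearest neighbor) and averaging blocks (a box filter). Both functions are hypothetical helpers written for this post, not part of any toolkit mentioned here, and they assume an integer reduction factor.

```python
def downsample_nearest(gray, factor):
    """Keep every `factor`-th pixel in each direction and drop the rest."""
    return [row[::factor] for row in gray[::factor]]


def downsample_box(gray, factor):
    """Replace each factor x factor block with its (integer) average."""
    # Crop to a whole number of blocks so every block is full size.
    height = len(gray) // factor * factor
    width = len(gray[0]) // factor * factor
    out = []
    for y in range(0, height, factor):
        out_row = []
        for x in range(0, width, factor):
            block_sum = sum(gray[y + dy][x + dx]
                            for dy in range(factor)
                            for dx in range(factor))
            out_row.append(block_sum // (factor * factor))
        out.append(out_row)
    return out


if __name__ == "__main__":
    # A 6x6 white image with a one-pixel-wide black vertical stroke at x = 1.
    page = [[0 if x == 1 else 255 for x in range(6)] for _ in range(6)]
    print(downsample_nearest(page, 2)[0])  # the stroke falls between samples
    print(downsample_box(page, 2)[0])      # the stroke survives as gray
```

With pixel dropping, a thin stroke that falls between the sampled columns disappears completely. With averaging, it survives as a gray column that a later threshold can still pick up. Either way the half-resolution image carries less detail than the original, which is the reason to avoid downsampling when accuracy is what you care about.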
The use of image processing before OCR can increase accuracy, but you must use the proper algorithm. A limiting factor of their tests is that they applied their preprocessing steps manually in Photoshop and therefore could not try a lot of different options. By using an image processing toolkit, you can easily run a lot of tests in batch. You are essentially solving an optimization problem, so applying a hill-climbing or genetic algorithm would help decide the best processing choice for your collection of documents.
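Here is a rough sketch of what the hill-climbing idea could look like. The `hill_climb` function and its parameters are my own illustration. In a real run, the `score` callback would preprocess a sample of your documents with the given settings, run OCR, and return the accuracy against known text; below it is replaced by a made-up stand-in function so that the example runs on its own.

```python
def hill_climb(score, start, steps, bounds, max_iters=100):
    """Simple hill climbing over named integer parameters.

    score  -- function taking a dict of parameters, higher is better
    start  -- dict of starting parameter values
    steps  -- dict of step sizes, one per parameter
    bounds -- dict of (low, high) limits, one per parameter
    """
    current = dict(start)
    best = score(current)
    for _ in range(max_iters):
        improved = False
        for name, step in steps.items():
            low, high = bounds[name]
            for delta in (-step, step):
                candidate = dict(current)
                candidate[name] = min(high, max(low, current[name] + delta))
                candidate_score = score(candidate)
                # Move as soon as a neighbor scores better.
                if candidate_score > best:
                    current, best = candidate, candidate_score
                    improved = True
        if not improved:
            break  # no neighbor is better: a local optimum
    return current, best


def fake_ocr_score(params):
    """Stand-in for 'preprocess, OCR, compare to ground truth'.

    Purely synthetic: it peaks at window=25, offset=12.
    """
    return -((params["window"] - 25) ** 2 + (params["offset"] - 12) ** 2)


if __name__ == "__main__":
    best_params, best_score = hill_climb(
        fake_ocr_score,
        start={"window": 15, "offset": 10},
        steps={"window": 2, "offset": 1},
        bounds={"window": (3, 51), "offset": (0, 30)},
    )
    print(best_params, best_score)
```

Hill climbing only finds a local optimum, so with a real OCR score it is worth restarting from several different starting points, or switching to something like a genetic algorithm if the search space has many peaks.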