Did you know that .NET provides an easy way to interact with and control console programs? In this article I will walk you through the process by creating a wrapper class for Google’s Tesseract OCR application. At the end of this post, I will provide a complete WinForms-based frontend for the Tesseract OCR engine.
Initial Tesseract Setup
First, download the precompiled version from the Tesseract download site. As 2.03 does not yet have a precompiled version available, we will be using 2.01. You will also need to download the corresponding language files for Tesseract. I will be using the English language dataset. Extract your dataset archive into the same directory where you placed your precompiled binaries. Please ensure that the directory structure remains intact, as otherwise Tesseract will not work.
Unfortunately, the precompiled version does not include any of the help, so if you wish to read up on how to use Tesseract you need to view the online documentation. Only one line from the documentation is necessary for our purposes:
tesseract <image.tif> <output> [-l langid]
Here we see that Tesseract takes three parameters, one of them optional. The first is a TIFF file to perform OCR on, the second is an output text file, and the final, optional parameter is a language identifier. Please note that whatever filename you specify as output will have .txt appended internally by Tesseract. For this simple example, I’m going to ignore the langid parameter, as Tesseract defaults to English. I also know from my own experience implementing Tesseract support in our DotImage OCR Toolkit that, by default, Tesseract only supports uncompressed TIFFs at 1 bit per pixel.
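To make the command line concrete, here is a minimal sketch of how the argument string and the resulting file name fit together. The class and method names (TesseractCommandLine, BuildArguments, ResultPath) are hypothetical helpers made up for this illustration; they are not part of Tesseract or of the wrapper developed later in this post.

```csharp
using System;
using System.IO;

// Hypothetical helper, for illustration only.
public static class TesseractCommandLine
{
    // Builds the arguments for: tesseract <image.tif> <output> [-l langid]
    // Pass null (or an empty string) as langId to omit the optional parameter.
    public static string BuildArguments(string imageName, string outputBase, string langId)
    {
        string args = imageName + " " + outputBase;
        if (!string.IsNullOrEmpty(langId))
            args += " -l " + langId;
        return args;
    }

    // Tesseract appends ".txt" to the output name itself, so the file to
    // read back afterwards is <workingDir>/<outputBase>.txt.
    public static string ResultPath(string workingDir, string outputBase)
    {
        return Path.Combine(workingDir, outputBase + ".txt");
    }
}
```

For example, BuildArguments("input.tif", "output", null) yields the string "input.tif output", and passing "eng" as an example language id yields "input.tif output -l eng".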
The Basics of Running Tesseract
You would think that this is a particularly simple case, as Tesseract only needs to be passed parameters and requires no flow control. Ideally, we would simply use the Process class to launch Tesseract and then read its output. Initially, this is only a small jump from what we learned in Processes in .NET Part 2. The only real difference is that instead of using Verbs we specify behavior through the ProcessStartInfo’s Arguments property.
    Process tessProc = new Process();
    tessProc.StartInfo.WorkingDirectory = @"C:\Tess";
    tessProc.StartInfo.FileName = @"C:\Tess\tesseract.exe";
    tessProc.StartInfo.Arguments = "input.tif output";
    tessProc.Start();
    tessProc.WaitForExit();
    string output = File.ReadAllText(@"C:\Tess\output.txt");
Unfortunately, while this very simple example will work in many cases, this is not one of them. Tesseract.exe quietly launches a separate process and immediately exits. This makes the WaitForExit() call look successful but, as OCR takes a while, the output file will either not yet exist or still be locked for writing by the Tesseract process when you try to read it.
There are many different ways to approach this problem. In this case an easy method would be to try repeatedly to access Tesseract’s log file using a timeout to ensure our program doesn't lock up. Successful or not, the log file is always written right before the process terminates. Additionally, in the case of failure, it has information about what happened.
    // Requires: using System.Diagnostics; using System.IO;
    Process tessProc = new Process();
    tessProc.StartInfo.WorkingDirectory = @"C:\Tess";
    tessProc.StartInfo.FileName = @"C:\Tess\tesseract.exe";
    tessProc.StartInfo.Arguments = "input.tif output";
    tessProc.Start();

    // Poll for the log file, giving up once the timeout budget is spent.
    int timeout = 10000, increment = 1000;
    string logtext = null;
    while ((timeout -= increment) > 0)
    {
        try
        {
            logtext = File.ReadAllText(@"C:\Tess\tesseract.log");
            break;
        }
        catch (IOException)
        {
            // The log does not exist yet or is still locked; wait and retry.
            System.Threading.Thread.Sleep(increment);
        }
    }

    // Only read the OCR output if the log was read and reports no error.
    string output;
    if (logtext != null && !logtext.Contains("Error"))
        output = File.ReadAllText(@"C:\Tess\output.txt");
    else
        output = "";

    return output;
This example only provides the very basics of what is necessary to run Tesseract and get a result. In order to use Tesseract in a robust and extensible way, we will need to build a complete wrapper class.
Designing a Wrapper Class
Tesseract has a number of quirks that make it somewhat annoying to deal with, at least when compared with most other command line applications. It’s important to be on the lookout for these kinds of small quirks when building an interface to an application. For completeness, I’ll list what I’ve found for Tesseract here, each followed by its solution.
- Tesseract always appends .txt to the output filename.
  - As above, append “.txt” when you read the output file.
- Tesseract launches a second background process which does the work.
  - As above, I chose to wait on the output with a timeout. However, this technique does not allow multiple instances of Tesseract to run at the same time, as the log file is shared. For now, I perform OCR inside a lock to prevent concurrent executions.
- Tesseract does not like paths with spaces.
  - Copy the image locally before processing. This way Tesseract never has to leave its working directory.
- Tesseract provides no useful feedback outside of the log file.
  - As above, wait on and parse the log file when it becomes available. Be sure to delete the file when you are finished; otherwise you may end up parsing old log data.
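Two of these workarounds, copying the image locally and cleaning up stale files, can be sketched roughly as follows. This is only a guess at what such helpers might look like, not the code from the full source: the TesseractQuirks class and the CopyToWorkingDirectory method are hypothetical names, and the GUID-based temporary file name is an assumption chosen simply because it never contains spaces.

```csharp
using System;
using System.IO;

// Hypothetical helpers, for illustration only.
public static class TesseractQuirks
{
    // Copies the image into Tesseract's working directory under a generated
    // name that contains no spaces, and returns just that file name so it
    // can be passed to Tesseract as a relative path.
    public static string CopyToWorkingDirectory(string imagePath, string workingDir)
    {
        string tempName = Guid.NewGuid().ToString("N") + Path.GetExtension(imagePath);
        File.Copy(imagePath, Path.Combine(workingDir, tempName));
        return tempName;
    }

    // Deletes each file that exists; missing files are silently skipped.
    // Calling this on the log path before a run avoids parsing old log data.
    public static void CleanUpFiles(params string[] paths)
    {
        foreach (string path in paths)
        {
            if (File.Exists(path))
                File.Delete(path);
        }
    }
}
```

Because the process is started with its WorkingDirectory set to the Tesseract folder, only the generated file name needs to appear on the command line.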
The wrapper’s public interface essentially boils down to a constructor and the ExtractText method, with the private RunOCR method doing the actual work. The complexity stems from the very quirks we just enumerated.
    // Requires: using System; using System.Diagnostics; using System.IO;
    public class TesseractWrapper
    {
        public TesseractWrapper(string programLoc)
        {
            DirectoryInfo dinfo = new DirectoryInfo(programLoc);
            ValidateTesseractDirectory(dinfo);
            _tesseractLocation = dinfo.FullName;
        }

        public string ExtractText(string imageFile)
        {
            if (!File.Exists(imageFile))
                throw new ArgumentException("Specified file must exist.");

            lock (_lockObj)
            {
                // Init the instance timeout inside the lock so that a waiting
                // caller cannot reset the budget of a run already in progress.
                _timeLeft = _tessTimeout;

                Process tessProc = BuildTesseractProcess();
                return RunOCR(tessProc, imageFile);
            }
        }

        private string RunOCR(Process tessProc, string imagePath)
        {
            string imgParameter, outputParameter, outputText;

            // Temporary files, be sure to clean up.
            string textOutputPath, logPath, tempImagePath;

            GenerateFilePathsAndParameters(out imgParameter, out tempImagePath,
                out outputParameter, out textOutputPath, out logPath);

            // Ensure no previous log file is hanging around.
            CleanUpFiles(logPath);

            try
            {
                // Copy image locally to avoid issues with spaces in paths.
                File.Copy(imagePath, tempImagePath);

                LaunchTesseract(tessProc, imgParameter, outputParameter);

                string logText = ReadFileWithInstanceTimeout(logPath);
                CheckTesseractLogTextAndThrowOnError(logText);
                outputText = ReadFileWithInstanceTimeout(textOutputPath);
            }
            finally
            {
                CleanUpFiles(tempImagePath, textOutputPath, logPath);
            }

            return outputText;
        }

        /* Download the full source to see private members and methods */
    }
Overall, it’s not exactly what I would call production quality but it’s acceptable for when I need to quickly test a file in Tesseract.
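The private helpers are only available in the full source, but the idea behind a helper such as ReadFileWithInstanceTimeout can be sketched under a few assumptions. In the sketch below, the FilePolling class, the ReadFileWithTimeout method, and its parameters are all hypothetical; the time budget is passed by reference so that reading the log and then the output file draw from the same budget, and returning null on timeout is an assumption rather than the behavior of the real helper.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical helper, for illustration only.
public static class FilePolling
{
    // Repeatedly tries to read the file until it succeeds or the remaining
    // time budget (in milliseconds) is used up. Returns null on timeout.
    public static string ReadFileWithTimeout(string path, ref int timeLeftMs, int incrementMs)
    {
        while (timeLeftMs > 0)
        {
            try
            {
                return File.ReadAllText(path);
            }
            catch (IOException)
            {
                // The file does not exist yet or is still locked by the
                // writer; wait one increment and charge it to the budget.
                Thread.Sleep(incrementMs);
                timeLeftMs -= incrementMs;
            }
        }
        return null;
    }
}
```

A caller would initialize the budget once per OCR run, then pass the same variable to both reads so that a slow log file leaves less time for the output file.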
An Asynchronous Wrapper for Easy WinForms Integration
Once you have all of the little quirks of your application covered, the only issue left is that calling your ExtractText method leaves your application locked up until it has returned. The best way to deal with this is to call BeginInvoke on a delegate and handle the update to your application via a callback. To make this easy I wrote an asynchronous child class.
    public class AsyncTesseractWrapper : TesseractWrapper
    {
        // Passes the Tesseract directory through to TesseractWrapper.
        public AsyncTesseractWrapper(string programLoc) : base(programLoc)
        {
        }

        public void AsyncRunTesseract(string imageFile)
        {
            AsyncRunTesseract(imageFile, null);
        }

        public void AsyncRunTesseract(string imageFile, TesseractOcrComplete callback)
        {
            if (_runner != null && !IsCompleted)
                throw new ApplicationException("Please wait until the previous Async call is complete before running another.");

            ResetAsyncFields();
            _completeCallback = callback;
            _runner = new TesseractRunner(PerformOCRExceptionCatcher);
            _asyncResult = _runner.BeginInvoke(imageFile, AsyncCallback, null);
        }

        public bool IsCompleted
        {
            get
            {
                return _asyncResult != null && _asyncResult.IsCompleted;
            }
        }

        public string GetAsyncResult()
        {
            // Rethrow any exception captured on the worker thread.
            if (_lastException != null) throw _lastException;
            if (_asyncResult == null) return null;
            if (_output == null) _output = _runner.EndInvoke(_asyncResult);
            return _output;
        }

        /* Download the full source to see private members and methods */
    }
With this in place, it’s a simple matter to call Tesseract from a WinForms application:
    _wrapper = new AsyncTesseractWrapper(tesseractPath);

    try
    {
        _wrapper.AsyncRunTesseract(dialog.FileName);
        ocrResultCheckTimer.Enabled = true;
    }
    catch (WrapperException ex)
    {
        MessageBox.Show(this, ex.Message);
    }
Because WinForms controls don’t like it when you try to change their values from another thread, the simplest approach is to poll our asynchronous class from the UI thread instead of updating the controls from the callback. I implemented this using a simple form timer.
    private void ocrResultCheckTimer_Tick(object sender, EventArgs e)
    {
        if (_wrapper.IsCompleted)
        {
            try
            {
                richTextBox1.Text = _wrapper.GetAsyncResult();
            }
            finally
            {
                ocrResultCheckTimer.Enabled = false;
            }
        }
    }
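One caveat worth noting: delegate BeginInvoke and EndInvoke, which the asynchronous wrapper relies on, work on the .NET Framework but throw PlatformNotSupportedException on .NET Core and .NET 5 or later. The following is a minimal sketch of the same start, poll, and fetch pattern built on Task.Run instead, assuming .NET Framework 4.5 or newer. It is a swapped-in technique, not the wrapper from the full source: TaskOcrRunner is a hypothetical class, and the Func<string, string> passed to its constructor merely stands in for a method like ExtractText.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical alternative to the delegate-based wrapper, for illustration only.
// Like the original, it is meant to be driven from a single (UI) thread.
public class TaskOcrRunner
{
    private readonly Func<string, string> _extractText;
    private Task<string> _task;

    // extractText stands in for a method such as TesseractWrapper.ExtractText.
    public TaskOcrRunner(Func<string, string> extractText)
    {
        if (extractText == null) throw new ArgumentNullException("extractText");
        _extractText = extractText;
    }

    // Starts OCR on a thread pool thread; only one run at a time is allowed.
    public void Start(string imageFile)
    {
        if (_task != null && !_task.IsCompleted)
            throw new InvalidOperationException("Previous OCR run is still in progress.");
        _task = Task.Run(() => _extractText(imageFile));
    }

    // Safe to poll from a form timer.
    public bool IsCompleted
    {
        get { return _task != null && _task.IsCompleted; }
    }

    // Returns null if nothing was started. Otherwise blocks until the run has
    // finished and rethrows the worker's original exception if it failed.
    public string GetResult()
    {
        return _task == null ? null : _task.GetAwaiter().GetResult();
    }
}
```

The timer handler would look the same as before, checking IsCompleted and then calling GetResult in place of GetAsyncResult.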
Conclusion
Well, that finishes up my walkthrough on writing a wrapper class for Tesseract’s console build. This is, of course, quite different from how I implemented Tesseract support in our toolkit; for that, I wrote hooks into the actual DLL file. If you have any questions, or have a better way to implement part of this project, please leave a comment so that everyone can benefit from your knowledge.