Peter + Transym OCR versus The World: For The Win!

Oh yes. Ooh yes. Oooooh yessss. This is what I wanted to be capable of doing: getting a scan result in a viewer, selecting an area in the scan result, and then OCR'ing that area so the result lands in the selected text box control. <*fist pump*>


Concept proven; mission accomplished.
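For the curious: the wiring behind that moment boils down to something like the sketch below, where GetSelectionBitmap and OcrBitmap are placeholder names for the viewer helper and the OCR wrapper – not the real code.

```csharp
using System;
using System.Drawing;
using System.Windows.Forms;

public class IndexerForm : Form
{
    private readonly TextBox txtField = new TextBox(); // the selected text box control

    // Wired to the "OCR selection" button: take the dragged area, OCR it,
    // and drop the text into the selected field.
    private void BtnOcrSelection_Click(object sender, EventArgs e)
    {
        using (Bitmap region = GetSelectionBitmap())
        {
            txtField.Text = OcrBitmap(region).Trim();
        }
    }

    // Placeholders standing in for code discussed elsewhere in this series.
    private Bitmap GetSelectionBitmap() { throw new NotImplementedException(); }
    private string OcrBitmap(Bitmap region) { throw new NotImplementedException(); }
}
```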

Please note that Transym OCR is still confused about the lower case l (L for Lima) embedded in a run of digits: it reads it as a 1 rather than an l. I can’t blame it – I have yet to come across an OCR engine south of 5000 euros that can sort this out reliably.

Jeez, I’m good!*
On to the next hurdle. There are quite a few that need to be taken – the most important one being that the PictureBox control doesn’t let you zoom. Google is usually my friend, but every solution it comes up with involves resizing the original bitmap to fit the picture box – which will NOT help OCR one bit!
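One way out – a sketch of the idea, not necessarily what I will end up building: let the PictureBox show a scaled copy purely for display, keep the original bitmap untouched for OCR, and only translate the selection rectangle between the two coordinate systems (ZoomMapper and zoomFactor are made-up names):

```csharp
using System.Drawing;

public static class ZoomMapper
{
    // Convert a rectangle dragged on the zoomed display back to the
    // coordinates of the original, full-resolution bitmap.
    public static Rectangle ToOriginal(Rectangle displayRect, float zoomFactor)
    {
        return new Rectangle(
            (int)(displayRect.X / zoomFactor),
            (int)(displayRect.Y / zoomFactor),
            (int)(displayRect.Width / zoomFactor),
            (int)(displayRect.Height / zoomFactor));
    }
}
```

That way the OCR engine always gets the original pixels, no matter what the display does.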

But: I have now proven that Transym OCR does everything I want it to do for my application! It covers all the must-haves. There are quite a few nice-to-haves that were not available to me when I was using Microsoft MODI, such as confidence (a property of the OCR result that indicates how sure the engine is that it’s got it right), which I could use to colour-code automatically-indexed fields. But first, I am going to concentrate on replicating the functionality that I had.
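Just to sketch the colour-coding idea – the thresholds below are made up, and I am assuming the confidence comes back normalised between 0 and 1:

```csharp
using System.Drawing;

public static class ConfidenceColours
{
    // Map an OCR confidence value to a background colour for an indexed field.
    public static Color For(double confidence)
    {
        if (confidence >= 0.95) return Color.LightGreen;  // almost certainly right
        if (confidence >= 0.80) return Color.Khaki;       // worth a second glance
        return Color.LightSalmon;                         // needs human eyes
    }
}
```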

*) every programmer has to say this to himself or herself every now and then. The rest of the world has no idea what you’re doing, and how hard it is.

Peter versus Transym OCR: 2-0

As of today, the DnlCore.Shared library contains an OCR namespace.
There is not much that is exposed to the outside world yet. It contains one method: ScanAndOcrDocument. It will fire up the TWAIN interface to a TWAIN scanner and ask for a stack of paper to be scanned (note how this supports an automatic sheet feeder to scan multiple pages),

and when it’s done doing that, it will return a string containing the OCR result of the entire document.
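From the caller’s side it looks something like this. ScanAndOcrDocument is the real method name; the static Engine class and the console host are assumptions for the sake of the sketch:

```csharp
using DnlCore.Shared.OCR;

class Program
{
    static void Main()
    {
        // Fires up the TWAIN UI, scans the stack (sheet feeder included),
        // and returns the OCR result of the entire document as one string.
        string text = Engine.ScanAndOcrDocument();
        System.Console.WriteLine(text);
    }
}
```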


Using the DnlCore.Shared.OCR.Engine library doesn’t get any harder than this. Of course, now we will need to be able to create scans of selected regions from the document. The only challenge there is to get a selected area in a viewer to be available as a System.Drawing.Bitmap… which won’t be rocket science.
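Indeed not rocket science – Bitmap.Clone does the crop in a single call. A sketch, assuming the selection rectangle is already expressed in the original scan’s coordinates:

```csharp
using System.Drawing;
using System.Drawing.Imaging;

public static class SelectionCropper
{
    // Copy the selected area of the scan into a stand-alone Bitmap,
    // ready to be handed to the OCR engine.
    public static Bitmap Crop(Bitmap originalScan, Rectangle selectionRect)
    {
        return originalScan.Clone(selectionRect, PixelFormat.Format24bppRgb);
    }
}
```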
I am definitely on to something very, very good with this Transym OCR engine!

Peter versus Transym TOCR: 1-0

So… after having downloaded and reviewed the Transym TOCR documentation and samples, I discovered something “interesting” – the Transym OCR API is not a .NET API, or even a COM API, but a standard, Win32-style API. There is not much isolation, which makes it incredibly powerful, but also quite intimidating.

Tonight, I have abstracted the functionality into a separate class library that I can now call from my test program without too much ado. I may even go ahead and do a .NET wrapper around that. I’m not overly worried about the performance penalty, as for me this is essentially going to work within a desktop application, and furthermore the performance penalty is negligible in the context of the OCR process. I can then either tweak the workings of the old-style API class to suit my needs, or make it configurable (either from an app.config or by creating overrides).
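To give an idea of the shape of that abstraction: all the Win32-style plumbing goes into one internal class, and the rest of the code only ever sees plain .NET methods. The DLL name and the import below are placeholders, not the actual TOCR exports – those are in the Transym documentation:

```csharp
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    // Placeholder import: the real TOCR DLL exports its own function names
    // and parameter lists, which the wrapper class translates to and from .NET.
    [DllImport("TOCR.dll", CharSet = CharSet.Ansi)]
    internal static extern int InitialiseEngine(out int jobSlot);
}
```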

The class library seems to work: I can scan a document from a TWAIN source and load the results into a viewer. Up to this point, the results are so good that I am wondering why this OCR engine doesn’t get more exposure… for OCR’ing machine type, its value for money is spectacular!

That’s enough for today – midnight is approaching. Time to hit the sack.

Peter versus Transym TOCR: 0-0

Some of you probably know that, a few years ago, over a couple of rainy Saturday afternoons at the kitchen table, I cobbled together a Document Management System.

A Document Management System (hereafter called DMS) manages documents. We distinguish between live documents (documents that are still being worked on, such as what you would expect to find in SharePoint or other collaborative systems) and ‘dead’ documents, which are in a final state, and have been sent or received. My little DMS was for dead documents, mainly to be capable of handling incoming mail other than birthday cards. We all know the feeling when we think “yes, I must have that insurance policy/warranty receipt/invoice somewhere. If only I knew where…”.

Another reason to do this was the well-known “because I can”. Before, I had worked with a company that also sold DMS software, and I felt that, in a couple of ways, I would be able to do better.

One of the key requirements of a good DMS is the ability to quickly index scanned-in documents. This process is called heads-up indexing; heads-up referring to the data entry worker keeping his or her eyes on the screen, rather than looking down at a sheet of paper on the desk. An indexing program would show the scanned document, and next to it a UI that held the fields to be populated with the document’s metadata (information that helps to classify and archive the document so that it can be found when needed).
Heads-up indexing can be made a lot easier if the user of the indexing program can select the field to be indexed, drag a selection rectangle around a specific area in the viewer that shows the document, and then hit a button so that the text content is automatically filled in in the selected field. To do this, we use a technology called Optical Character Recognition (OCR), which attempts to “read” the image and returns whatever it has read as text.

For this, I used Microsoft MODI, which was supplied with Microsoft Office. It worked great – the OCR was adequate, considering the fact that it was basically a free add-on. In fact, MODI on its own was good enough to justify buying Microsoft Office.

And of course, Microsoft then decided to discontinue MODI. They said they had brought that functionality to OneNote, but the OCR facility in OneNote was completely undocumented as an API, and using it proved all but impossible.

Consequently, I was forced to keep using my existing DMS on a suitable system (a 32-bit Windows system running Office 2007).

Looking at alternatives didn’t make me very happy. There is, of course, Tesseract, which is open source. Unfortunately, in terms of recognition quality, it sucks eggs, and it is not particularly stable. And the commercial engines that do come with an API start at around $2000, which I am unable to justify for something that is primarily meant for our own use.

When I did another search, a thread on StackOverflow mentioned Transym TOCR, saying it was fine for machine-printed type (as opposed to handwriting). I looked at their web site and discovered that the engine is built primarily for integration, and that TOCR has the facility for OCR’ing bitmaps in memory, which would enable me to process selections (drawing a rectangle in a viewer, exporting that rectangle as an in-memory bitmap, and using the result to populate fields in the indexer). Then I looked at the price.

Usually, when something seems too good to be true, it is because it is too good to be true. One hundred and thirteen euros is silly money for a decent OCR engine with an API. But I downloaded the evaluation version and gave it a run. Or ten, actually. The first document, a letter in Dutch with some amounts, scanned to JPEG (which is not a nice thing to do to an OCR engine), came out 100% correct.
Hmm, must be a fluke. I gave it another shot with another JPEG. Same result.

OK, I’ll make it suffer.
Most OCR engines first process the document with a raw OCR pass, meaning they just try to recognise the characters. Then they usually apply some intelligence to the result, often using dictionaries and stuff. Suppose the raw OCR comes up with “intell1gence”; run it through a dictionary and it will suggest “intelligence”, and lo and behold, we have a good result.
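A toy version of that dictionary pass, just to illustrate the principle (real engines are far cleverer than this):

```csharp
using System.Collections.Generic;

public static class DictionaryPass
{
    // The usual digit/letter look-alikes.
    private static readonly Dictionary<char, char> LookAlikes =
        new Dictionary<char, char> { ['1'] = 'l', ['0'] = 'o', ['5'] = 's' };

    // Substitute look-alikes and keep the result only if it is a known word.
    public static string Correct(string raw, HashSet<string> words)
    {
        char[] candidate = raw.ToCharArray();
        for (int i = 0; i < candidate.Length; i++)
        {
            if (LookAlikes.TryGetValue(candidate[i], out char letter))
                candidate[i] = letter;
        }
        string fixedUp = new string(candidate);
        // "intell1gence" becomes "intelligence" if the dictionary knows it.
        return words.Contains(fixedUp) ? fixedUp : raw;
    }
}
```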

But sometimes, you are trying to process stuff that defies any attempt at common sense. Suppose you have an insurance policy with a policy number that you won’t find in ANY dictionary: your poor OCR engine is left trying to make sense of the Oh’s and the zeros, and to add insult to injury there is also a lower case L (an l) in it, which, in a font like Arial, even people have a hard time recognising correctly. So I made up a fictional document with such a number in it.
The only thing TOCR got wrong was the lower case l, which it recognised as a 1. The rest of the document was 100% correct.

Impressive. Too good to be true? Possibly, but I’m going to try it anyway.

So, yes, I am going to see if I can integrate that engine into my DMS. While I’m at it, I’m going to redo it from the ground up, using the repository design pattern and Entity Framework code-first in the back end, and MVVM at the front end. Not because the current version is buggy as hell, but more because I can. It should make the thing easier to maintain, and it allows me to work with technologies that I am also using in my everyday work.
The indexing front end will remain a Windows Forms program, since I believe that, when doing indexing, performance (and therefore tight coupling) is of paramount importance – indexers do not want to wait. But the search client will probably get both a Windows Forms and a web-based front end, sharing the same business logic.
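To give an idea of the shape I have in mind – all names invented for the occasion – both front ends would talk to a repository contract like this, with the Entity Framework code-first implementation sitting behind it:

```csharp
using System.Collections.Generic;

// Code-first entity: Entity Framework derives the database schema from this.
public class Document
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string OcrText { get; set; }
}

// The contract shared by the Windows Forms and web front ends; the
// EF-backed implementation lives in the back end.
public interface IDocumentRepository
{
    Document GetById(int id);
    IEnumerable<Document> Search(string term);
    void Add(Document document);
    void Save();
}
```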

What I am mainly going to blog about is my progress with the Transym TOCR engine – I will try not to bore you with the rest of the stuff. After all, if you have come this far, you probably already know how to do that…