Peter versus Transym TOCR: 2-0

As of today, the DnlCore.Shared library contains an OCR namespace.
There is not much that is exposed to the outside world yet. It contains one method: ScanAndOcrDocument. It will fire up the TWAIN interface to a TWAIN scanner and ask for a stack of paper to be scanned (note how this supports an automatic sheet feeder to scan multiple pages), and when it’s done doing that, it will return a string containing the OCR result of the entire document.

Using the DnlCore.Shared.OCR.Engine library, it doesn’t get any harder than this. Of course, now we will need to be able to OCR selected regions of the document. The only challenge there is to get a selected area in a viewer to be available as a System.Drawing.Bitmap… which won’t be rocket science.
I am definitely on to something very, very good with this Transym OCR engine!
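
The selection step can be sketched roughly as follows. This is only an illustration, not the code from DnlCore.Shared: the names SelectionHelper and ToImageRegion are made up for the example, and it assumes the drag coordinates have already been converted from viewer coordinates to image pixels (a zoomed viewer would need scaling first).

```csharp
using System;
using System.Drawing;

public static class SelectionHelper
{
    // Hypothetical helper: turns a mouse drag (start and end point, in image
    // pixel coordinates) into a rectangle that lies entirely inside the image.
    // The result has zero width or height when nothing usable was selected.
    public static Rectangle ToImageRegion(Point dragStart, Point dragEnd, Size imageSize)
    {
        // The user may drag in any direction, so normalise to top-left + size.
        int left = Math.Min(dragStart.X, dragEnd.X);
        int top = Math.Min(dragStart.Y, dragEnd.Y);
        int width = Math.Abs(dragEnd.X - dragStart.X);
        int height = Math.Abs(dragEnd.Y - dragStart.Y);

        var selection = new Rectangle(left, top, width, height);
        var imageBounds = new Rectangle(Point.Empty, imageSize);

        // Clip to the image, because cropping outside its bounds would fail.
        return Rectangle.Intersect(selection, imageBounds);
    }
}
```

A non-empty result can then be handed to Bitmap.Clone(Rectangle, PixelFormat), which returns the selected region as a new System.Drawing.Bitmap, ready to be passed to the OCR engine.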

Peter versus Transym TOCR: 1-0

So… after having downloaded and reviewed the Transym TOCR documentation and samples, I discovered something “interesting” – the Transym OCR API is not a .NET API, or even a COM API, but a standard, Win32-style API. There is not much isolation, which makes it incredibly powerful, but also quite intimidating.

Tonight, I have abstracted the functionality into a separate class library that I can now call from my test program without too much ado. I may even go ahead and do a .NET wrapper around that. I’m not overly worried about the performance penalty, as for me this is essentially going to work within a desktop application, and furthermore the performance penalty is negligible in the context of the OCR process. I can then either tweak the workings of the old-style API class to suit my needs, or make it configurable (either from an app.config or by creating overrides).

The class library seems to work: I can scan a document from a TWAIN source and load the results into a viewer. Up to this point, the results are so good that I am wondering why this OCR engine doesn’t get more exposure… for OCR’ing machine type, its value for money is spectacular!

That’s enough for today – midnight is approaching. Time to hit the sack.

Peter versus Transym TOCR: 0-0

Some of you probably know that, a few years ago, on a couple of rainy Saturday afternoons at the kitchen table, I cobbled together a Document Management System.

A Document Management System (hereafter called DMS) manages documents. We distinguish between live documents (documents that are still being worked on, such as what you would expect to find in SharePoint or other collaborative systems) and ‘dead’ documents, which are in a final state, and have been sent or received. My little DMS was for dead documents, mainly to be capable of handling incoming mail other than birthday cards. We all know the feeling when we think “yes, I must have that insurance policy/warranty receipt/invoice somewhere. If only I knew where…”.

Another reason to do this was the well-known “because I can”. Before, I had worked with a company that also sold DMS software, and I felt that, in a couple of ways, I would be able to do better.

One of the key requirements of a good DMS is the ability to quickly index scanned-in documents. This process is called heads-up indexing; heads-up referring to the data entry worker keeping his eye on the screen, rather than looking down at a sheet of paper on the desk. An indexing program would show the scanned document, and next to it a UI that held the fields to be populated with the document’s metadata (information that helps to classify and archive the document so that it can be found when needed).
This heads-up indexing can be made a lot easier if the user of the indexing program can select the field to be indexed, then, in the viewer that shows the document, drag a selection rectangle around a specific area and hit a button so that the text content is automatically filled into the selected field. To do this, we use a technology called Optical Character Recognition (OCR), which attempts to “read” the image and returns whatever it has read as text.

For this, I used Microsoft MODI, which was supplied with Microsoft Office. Worked great – OCR was adequate, considering the fact that it was basically a free add-on. In fact, MODI by itself was good enough to justify buying Microsoft Office.

And of course, Microsoft then decided to discontinue MODI. They said they had brought that functionality to OneNote, but using the OCR facility from OneNote through an API was completely undocumented, and proved all but impossible.

Consequently, I was forced to keep running my existing DMS on a suitable system (a 32-bit Windows system running Office 2007).

Looking at alternatives didn’t make me very happy. There is, of course, Tesseract, which is open source. Also, in terms of recognition quality, it sucks eggs, and it is not particularly stable. And the commercial engines that do come with an API start at around $2000, which I am unable to justify for something that is primarily meant for our own use.

When I did another search, a thread on Stack Overflow mentioned Transym TOCR, and said it was fine for machine-type (as opposed to handwriting). I looked at their web site, and discovered that the engine is mainly built for integration, and that TOCR had the facility for OCR’ing bitmaps in memory, which would enable me to process selections (drawing a rectangle in a viewer, exporting that rectangle as an in-memory bitmap, and using the result to populate fields in the indexer). Then I looked at the price.

Usually, when something seems too good to be true, it is because it is too good to be true. One hundred and thirteen euros is silly money for a decent OCR engine with an API. But I downloaded the evaluation version, and gave it a run. Or ten, actually. The first document, a letter in Dutch with some amounts, scanned to JPEG (which is not a nice thing to do to an OCR engine), came out 100% correct.
Hmm, must be a fluke. I gave it another shot with another JPEG. Same result.

OK, I’ll make it suffer.
Most OCR engines first process the document with a raw OCR pass, meaning they just try to recognise the characters. Then, they usually try to apply some intelligence to it, often using dictionaries and stuff. Suppose the raw OCR comes up with “intell1gence”: it is then run through a dictionary, which will suggest “intelligence”, and lo and behold, we have a good result.

But sometimes, you are trying to process stuff that defies any attempt at common sense. Suppose you have an insurance policy that has a policy number like
VO3014l072
You won’t find that in ANY dictionary. Your poor OCR engine is trying to make sense of the Oh’s and the zeros, and to add insult to injury there is also a lower case L (an l) in it, which, in a font like Arial, even people have a hard time recognising correctly. So I made up a fictional document with that number in it.
The only thing TOCR got wrong is the lower case L, which it recognised as a 1. The rest of the document was 100% correct.
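
To make the dictionary step a bit more tangible, here is a deliberately naive sketch of it. It is not how TOCR (or any particular engine) works internally; the class name, the list of confusable characters and the one-character-swap strategy are all assumptions made for the illustration.

```csharp
using System;
using System.Collections.Generic;

public static class OcrPostCorrection
{
    // Characters that OCR engines commonly mix up (illustrative, not exhaustive).
    private static readonly Dictionary<char, char[]> Confusables = new Dictionary<char, char[]>
    {
        { '1', new[] { 'l', 'i' } },
        { 'l', new[] { '1', 'i' } },
        { '0', new[] { 'o' } },
        { 'o', new[] { '0' } },
    };

    // Returns the word itself when it is in the dictionary, otherwise the first
    // dictionary word reachable by swapping one confusable character, otherwise
    // the raw word unchanged.
    public static string Correct(string rawWord, ISet<string> dictionary)
    {
        if (dictionary.Contains(rawWord)) return rawWord;

        for (int i = 0; i < rawWord.Length; i++)
        {
            char[] alternatives;
            if (!Confusables.TryGetValue(rawWord[i], out alternatives)) continue;

            foreach (char alternative in alternatives)
            {
                string candidate = rawWord.Substring(0, i) + alternative + rawWord.Substring(i + 1);
                if (dictionary.Contains(candidate)) return candidate;
            }
        }

        return rawWord; // the dictionary has nothing to offer
    }
}
```

With a dictionary that contains “intelligence”, Correct("intell1gence", dictionary) returns “intelligence”. For the policy number VO3014l072 there is no dictionary entry to steer towards, so the raw result comes back untouched: the engine is on its own.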

Impressive. Too good to be true? Possibly, but I’m going to try it anyway.

So, yes, I am going to see if I can integrate that engine into my DMS. While I’m at it, I’m going to redo it from the ground up, using the repository design pattern and Entity Framework code-first in the back end, and using MVVC at the front end. Not because the current version is buggy as hell, but more because I can. This should make it easier to maintain, and it also allows me to work with technologies that I am also using for my everyday work.
The indexing front end will remain a Windows Forms program, since I believe that, when doing indexing, performance (and therefore tight coupling) is of paramount importance – indexers do not want to wait. But the search client will probably get both a Windows Forms and a web-based front end, sharing the same business logic.

What I am mainly going to blog about is my proceedings with the Transym TOCR engine – I will try to not bore you with the rest of the stuff. After all, if you have come this far, you probably already know how to do that…

System.Timers.Timer doesn’t seem to fire

If you are using System.Timers.Timer to fire off a work loop at a set interval, you may sometimes come across the situation that it seems as if your timer’s Elapsed event doesn’t fire – you know that the timer’s interval has expired, but nothing happens.

I’ve been tearing my hair out on a few occasions about this. Usually, if I did a rebuild, the problem would automagically disappear, but today it didn’t.

Apparently, I was not the only one suffering from this problem, because Stack Overflow had various threads inquiring about this. Most of the answers pointed to “use System.Threading.Timer instead”. One of them actually hinted at the problem: if the event handler throws an exception, System.Timers.Timer quietly swallows this exception and then chokes.

You might say “yes, but I do exception handling in my handler… so the exception should be caught in the handler itself rather than be thrown up to the calling timer… right?”

Most of the time, right. But if calling your handler immediately results in an exception, such as when your handler uses a type from, say, a DLL and the runtime cannot find the correct version of said DLL, you are toast. The runtime has to compile the method, and typically load the assemblies it references, before the first statement can run. So even if the first statement in your method is a try, it is never hit – the simple act of calling the method results in an exception. Which is thrown back up to the caller, which just so happens to be your misbehaving timer.

Unfortunately, using System.Threading.Timer is not really an option. First, it is a lot more cumbersome to use than System.Timers.Timer… and secondly (and more importantly), it has no SynchronizingObject, so it will not marshal its callback onto a particular thread for you.

So I went down for a smoke, which for me is a proven method to find my way around a problem.

When I came back up, this is what I did:

I wrote a wrapper method around my original handler. The wrapper method simply calls the original handler from a try… catch block. And then I changed the handler for the timer’s Elapsed event to point to the wrapper rather than to the original worker method. A simple solution, really… it is fairly transparent, and it does not require a lot of refactoring. And because we stick to System.Timers.Timer, the threading behaviour stays exactly as it was. Most of the work went into typing in the comment lines – I believe in the concept of “don’t comment on HOW you do something, that should be clear when looking at the code, but do comment on WHY you chose to do it like this”, and with that in mind, this did require some commenting.

If the original handler throws an exception, this is handled by the wrapper rather than by your timer, and you can take proper care of it in the wrapper method, and your timer never notices it.

So, here’s what it looks like now. First, in my initialisation code, I instantiate the timer:

try {
    myTimer = new System.Timers.Timer();
    myTimer.Interval = PollInterval * 1000;   // PollInterval is in seconds, Interval is in milliseconds
    myTimer.AutoReset = true;                 // keep firing after every Elapsed event
    myTimer.Elapsed += DoStuffWrapper;        // hook up the wrapper, not DoStuff itself
    myTimer.Enabled = true;                   // start the timer last, once it is fully configured
    AddLogEntry("Starting work loop in " + PollInterval + " seconds.", EventLevel.Debug, EventID.GeneralInfo);
} catch (Exception ex) {
    AddLogEntry("Timer not started: " + ex.Message, EventLevel.Critical, EventID.GeneralError);
}

The handler for the myTimer.Elapsed event is my wrapper, which looks like this:

public static void DoStuffWrapper(object sender, System.Timers.ElapsedEventArgs e)
{
    //The only raison d’être for this method is that it is a wrapper around DoStuff. If we would add the DoStuff method as the handler to our 
    //Timer.Elapsed event, and DoStuff would throw an error immediately on being invoked (referencing the wrong version of a DLL for instance),
    //you would NEVER see this, because the timer would eat up the exception and then choke silently. 
    //The (sender, e) parameters are required to match the Elapsed event's delegate; we do not use them here.
    try {
        DoStuff();
    } catch (Exception ex) {
        AddLogEntry("Work loop cannot be initiated - processing will stop (" + ex.Message + ")", EventLevel.Critical, EventID.GeneralError);
        myTimer.Enabled = false;
    }
}

This wrapper then calls the method DoStuff, where the actual work is done. If DoStuff throws an exception up the food chain, DoStuffWrapper will handle this.

 

Never assume that a timestamp is unique

These days, I am writing a logger class. Since I want to also create a unique log transaction ID, to identify log entries belonging together from the moment the log class is initialised to the moment it is disposed, I thought I could use the time ordinal… because time always moves forward… right?

Wrong.

If you create the identifier for instance A, and then the NTP client queries the NTP server, finds out the clock is fifty seconds ahead and sets it back fifty seconds, the identifier for the next instance is turned back fifty seconds’ worth of ticks as well (500,000,000 of them, at 100 nanoseconds per DateTime tick).

Chances that you’ll get duplicates are, of course, extremely slim, but you cannot exclude it, and if you use this identifier to determine the order in which logging sessions were initiated, that’s broken as well.

OK, back to the drawing board.
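
One possible direction, purely as a sketch and not necessarily where the logger will end up: take uniqueness from a GUID and ordering from a counter that can only move forward, and keep the timestamp as information only. The class and member names below are made up for the example.

```csharp
using System;
using System.Threading;

public sealed class LogSessionId
{
    // Process-wide counter; it only ever moves forward, whatever the clock does.
    private static long _lastSequence;

    public Guid Id { get; private set; }             // uniqueness
    public long Sequence { get; private set; }       // order of creation within this process
    public DateTime CreatedUtc { get; private set; } // informational only, never used for ordering

    private LogSessionId() { }

    public static LogSessionId Create()
    {
        return new LogSessionId
        {
            Id = Guid.NewGuid(),
            // Interlocked.Increment is atomic, so two threads never get the same number.
            Sequence = Interlocked.Increment(ref _lastSequence),
            CreatedUtc = DateTime.UtcNow
        };
    }
}
```

Note the limitation: the counter restarts at 1 with every process, so it orders sessions only within a single run. Ordering across restarts or across machines would need a sequence that is persisted somewhere, such as an identity column in the log database.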