Some of you probably know that, a few years ago, over a couple of rainy Saturday afternoons at the kitchen table, I cobbled together a Document Management System.
A Document Management System (hereafter called DMS) manages documents. We distinguish between live documents (documents that are still being worked on, such as what you would expect to find in SharePoint or other collaborative systems) and ‘dead’ documents, which are in a final state, and have been sent or received. My little DMS was for dead documents, mainly to be capable of handling incoming mail other than birthday cards. We all know the feeling when we think “yes, I must have that insurance policy/warranty receipt/invoice somewhere. If only I knew where…”.
Another reason to do this was the well-known “because I can”. Before, I had worked with a company that also sold DMS software, and I felt that, in a couple of ways, I would be able to do better.
One of the key requirements of a good DMS is the ability to quickly index scanned-in documents. This process is called heads-up indexing; heads-up referring to the data entry worker keeping his eyes on the screen, rather than looking down at a sheet of paper on the desk. An indexing program shows the scanned document, and next to it a UI that holds the fields to be populated with the document’s metadata (information that helps to classify and archive the document so that it can be found when needed).
This heads-up indexing becomes a lot easier if the user of the indexing program can select the field to be indexed, drag a selection rectangle around a specific area in the viewer that shows the document, and then hit a button so that the text in that area is automatically filled in in the selected field. To do this, we use a technology called Optical Character Recognition (OCR), which attempts to “read” the image and returns whatever it has read as text.
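To make that selection workflow a bit more concrete, here is a minimal sketch in Python. It is not the code of my DMS and not the API of any real OCR engine: the `Rect` type, the list-of-rows image format, the `fill_field_from_selection` helper and the `ocr` callable are all made up for illustration. The only point is the sequence: crop the selected rectangle in memory, hand it to whatever OCR engine you have, and put the result in the selected field.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical in-memory image format: a list of rows of grayscale pixel values.
Image = List[List[int]]


@dataclass
class Rect:
    """Selection rectangle in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    width: int
    height: int


def crop(image: Image, rect: Rect) -> Image:
    """Copy the pixels inside the rectangle, clamped to the image bounds."""
    if not image:
        return []
    top = max(rect.top, 0)
    left = max(rect.left, 0)
    bottom = min(rect.top + rect.height, len(image))
    right = min(rect.left + rect.width, len(image[0]))
    return [row[left:right] for row in image[top:bottom]]


def fill_field_from_selection(
    image: Image,
    rect: Rect,
    field_name: str,
    fields: Dict[str, str],
    ocr: Callable[[Image], str],
) -> str:
    """Crop the selection, run the supplied OCR callable on it, store the text.

    `ocr` stands in for a real engine call; its signature is an assumption.
    """
    region = crop(image, rect)
    fields[field_name] = ocr(region).strip()
    return fields[field_name]
```

In a real indexer, `ocr` would wrap the actual engine call and `image` would be the bitmap held by the viewer; the rest of the flow stays the same.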
For this, I used Microsoft MODI, which was supplied with Microsoft Office. Worked great – OCR was adequate, considering the fact that it was basically a free add-on. In fact, MODI by itself was good enough to justify buying Microsoft Office.
And of course, Microsoft then decided to discontinue MODI. They said they had brought that functionality to OneNote, but using the OCR facility from OneNote through an API was completely undocumented, and proved all but impossible.
Consequently, I was forced to keep using my existing DMS on a suitable system (a 32-bit Windows system running Office 2007).
Looking at alternatives didn’t make me very happy. There is, of course, Tesseract, which is open source. Also, in terms of recognition quality, it sucks eggs, and it is not particularly stable. And the commercial engines that do come with an API start at around $2000, which I am unable to justify for something that is primarily meant for our own use.
When I did another search, a thread on StackOverflow mentioned Transym TOCR, and said it was fine for machine-typed text (as opposed to handwriting). I looked at their web site, and discovered that the engine is mainly built for integration, and that TOCR had the facility for OCR’ing bitmaps in memory, which would enable me to process selections (drawing a rectangle in a viewer, exporting that rectangle as an in-memory bitmap, and using the result to populate fields in the indexer). Then I looked at the price.
Usually, when something seems too good to be true, it is because it is too good to be true. One hundred and thirteen euros is silly money for a decent OCR engine with an API. But I downloaded the evaluation version, and gave it a run. Or ten, actually. The first document, a letter in Dutch with some amounts, scanned to JPEG (which is not a nice thing to do to an OCR engine) came out 100% correct.
Hmm, must be a fluke. I gave it another shot with another JPEG. Same result.
OK, I’ll make it suffer.
Most OCR engines first process the document with a raw OCR pass, meaning they just try to recognise the characters. Then, they usually try to apply some intelligence to the result, often using dictionaries and stuff. Suppose the raw OCR comes up with “intell1gence”; it is then run through a dictionary, which will suggest “intelligence”, and lo and behold, we have a good result.
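As a rough illustration of that dictionary step, here is a small Python sketch. It uses plain fuzzy matching from the standard library (`difflib.get_close_matches`), which is my stand-in for whatever a real engine does internally; the `correct_word` helper, the tiny word list and the 0.8 cutoff are assumptions chosen for the example, not anything taken from an actual OCR product.

```python
import difflib
from typing import Iterable


def correct_word(raw: str, dictionary: Iterable[str], cutoff: float = 0.8) -> str:
    """Return the closest dictionary word, or the raw OCR text if nothing is close.

    The cutoff is an arbitrary similarity threshold picked for this sketch.
    """
    matches = difflib.get_close_matches(
        raw.lower(), list(dictionary), n=1, cutoff=cutoff
    )
    return matches[0] if matches else raw


# Hypothetical, deliberately tiny dictionary.
WORDS = ["intelligence", "insurance", "policy", "invoice"]

print(correct_word("intell1gence", WORDS))
print(correct_word("X9Q7ZK", WORDS))
```

The first call finds “intelligence” because only one character differs. The second has nothing similar in the word list, so the raw text is passed through unchanged, and the dictionary has not helped at all.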
But sometimes, you are trying to process stuff that defies any attempt at common sense. Suppose you have an insurance policy that has a policy number like
You won’t find that in ANY dictionary. Your poor OCR engine is trying to make sense of the Oh’s and the zeros, and to add insult to injury there is also a lower case L (an l) in it, which, in a font like Arial, even people have a hard time recognising correctly. So I made up a fictional document with that number in it.
The only thing TOCR got wrong is the lower case L, which it recognised as a 1. The rest of the document was 100% correct.
Impressive. Too good to be true? Possibly, but I’m going to try it anyway.
So, yes, I am going to see if I can integrate that engine into my DMS. While I’m at it, I’m going to redo it from the ground up, using the repository design pattern and Entity Framework code-first under the back end, and using MVVC at the front end. Not because the current version is buggy as hell, but more because I can. This should make it easier to maintain, and it also allows me to work with technologies that I use in my everyday work.
The indexing front end will remain a Windows Forms program, since I believe that, when doing indexing, performance (and therefore tight coupling) is of paramount importance — indexers do not want to wait. But the search client will probably get both a Windows Forms and a web-based front end, sharing the same business logic.
What I am mainly going to blog about is my proceedings with the Transym TOCR engine – I will try to not bore you with the rest of the stuff. After all, if you have come this far, you probably already know how to do that…