Transym OCR & Peter – Proof of Concept completed and approved

This week’s work consisted of readying the DnlCore.Shared.OCR classes and the DnlCore.Shared.ImageViewer control… and then building the Indexing Client Proof of Concept on top of that.

I wanted to prove that I could create a Win32 form with the following:

  • a custom PictureBox-like control that would allow
    • scrolling, and (more importantly)
    • zooming in the scrolled-out image representation, while preserving the image itself at its original resolution
    • selecting an area and returning the contents of the selection rectangle (from the preserved image to retain full scan resolution) as a bitmap
  • the facility to OCR the result of selected areas and use these to populate a selected field
  • the facility to load image files from a TWAIN scanner or from disk.

This has now been accomplished.

It goes like so:
The client (with the scanned image loaded at the right):
image

We start out at a zoom level of 33%, which is perfectly readable on a screen that was cutting-edge 10 years ago.

We double-click the Sender field (which makes it go powder blue) and select the name of the sender:
image

and then we choose Selection/OCR (or we hit Alt-O because we are lazy), and lo and behold:
image

We double-click the Reference field, draw an area around the 5928 bit, hit Alt-O, and hey look here:
image

The main work went into creating the image viewer control (read: stealing the image control from the sample viewer application that comes with Transym OCR, and poking it in the eye until it does what I want). It inherits from the PictureBox control and now (after some eye-poking) adds the functionality mentioned in the requirements above.
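In outline, the control works something like this. This is a bare-bones, hypothetical sketch: only the `Zoom` and `SelectionImage` members are taken from the actual control as used in the POC code below; everything else (the class name, the `Selection` property, the internals) is invented for illustration.

```csharp
using System;
using System.Drawing;
using System.Windows.Forms;

// Bare-bones sketch of the viewer control: it keeps the original
// full-resolution Bitmap and only scales a copy for display, so a
// selection can always be cropped from the original at scan resolution.
public class ZoomablePictureBox : PictureBox
{
    private Bitmap original;           // the untouched, full-resolution image
    private float zoom = 100f;         // zoom level as a percentage

    // Selection rectangle in display (scaled) coordinates, as set by
    // the control's mouse handling (omitted here for brevity).
    public Rectangle Selection { get; set; }

    public new Bitmap Image
    {
        get { return original; }
        set { original = value; UpdateDisplay(); }
    }

    public float Zoom
    {
        get { return zoom; }
        set { zoom = value; UpdateDisplay(); }
    }

    // Crop the selected area from the original bitmap, not from the
    // scaled display copy, so full scan resolution is retained for OCR.
    public Bitmap SelectionImage
    {
        get
        {
            Rectangle src = DisplayToImage(Selection, zoom);
            return original.Clone(src, original.PixelFormat);
        }
    }

    // Map a rectangle in display coordinates back to original-image
    // coordinates, given the zoom percentage.
    public static Rectangle DisplayToImage(Rectangle r, float zoomPercent)
    {
        float f = 100f / zoomPercent;
        return new Rectangle((int)(r.X * f), (int)(r.Y * f),
                             (int)(r.Width * f), (int)(r.Height * f));
    }

    private void UpdateDisplay()
    {
        if (original == null) return;
        int w = (int)(original.Width * zoom / 100f);
        int h = (int)(original.Height * zoom / 100f);
        base.Image = new Bitmap(original, w, h);   // scaled copy, display only
    }
}
```

The essential trick is that zooming only ever touches the display copy; the coordinate mapping in `DisplayToImage` is what makes a rectangle drawn at 33% zoom come back out at 100% resolution.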

A fair bit of work went into isolating the API code of TOCR and creating the .NET wrapper around it.

The actual logic in this POC amounts to only 208 lines of generously formatted C#. I reckon I could compress that to something like 150 lines, but that would be at the expense of code readability and not add a bit to performance or ease of maintenance.
Of course, as PoCs go, it contains very little exception handling, so it’ll probably grow – but still… impressive!

(“Jeez, I’m good!”) Winking smile

The indexing client logic, all 202 lines of it:

using System;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using DnlCore.Shared;
using DnlCore.Shared.OCR;

namespace pocIndexing
{
    public partial class Form1 : Form
    {
        private Bitmap[] bitmaps;
        private bool haveSelection = false;
        private Rectangle selection;
        private string currentFile;
        private System.Windows.Forms.TextBox selectedControl = null;

        public Form1()
        {
            InitializeComponent();
        }

        private void LoadFromFileButton_Click(object sender, EventArgs e)
        {
            if (LoadFromFileDialog.ShowDialog() == DialogResult.OK)
            {
                currentFile = LoadFromFileDialog.FileName;
                loadImage(currentFile);   // loadImage also resets the zoom to 33%
            }
        }

        private void loadImage(string filePath)
        {
            editableImage1.Image = new Bitmap(filePath);
            setZoomRate(33);
        }

        private void ScanButton_Click(object sender, EventArgs e)
        {
            loadFromScanner();
        }

        private void loadFromScanner()
        {
            IEngine engine = new Engine();
            bitmaps = engine.GetDocumentFromScanner();
            editableImage1.Image = bitmaps[0];
        }

        private void setZoomRate(float rate)
        {
            editableImage1.Zoom = rate;
        }

        private void Form1_FormClosing(object sender, FormClosingEventArgs e)
        {
            editableImage1.Dispose();
        }

        private void page_SelectionChanged(Rectangle rect)
        {
            if (rect.Width != 0)
            {
                imageToolStripMenuItem.Text = "&Selection";
                selection = rect;
                haveSelection = true;
            }
            else
            {
                imageToolStripMenuItem.Text = "&Image";
                haveSelection = false;
            }
        }

        private void openToolStripMenuItem_Click(object sender, EventArgs e)
        {
            if (LoadFromFileDialog.ShowDialog() == DialogResult.OK)
            {
                loadImage(LoadFromFileDialog.FileName);
            }
        }

        private void scanToolStripMenuItem_Click(object sender, EventArgs e)
        {
            loadFromScanner();
        }

        private void zoom20MenuItem_Click(object sender, EventArgs e)
        {
            setZoomRate(20);
        }

        private void zoom33MenuItem_Click(object sender, EventArgs e)
        {
            setZoomRate(33);
        }

        private void zoom50MenuItem_Click(object sender, EventArgs e)
        {
            setZoomRate(50);
        }

        private void zoom75MenuItem_Click(object sender, EventArgs e)
        {
            setZoomRate(75);
        }

        private void zoom100MenuItem_Click(object sender, EventArgs e)
        {
            setZoomRate(100);
        }

        private string getOcrResult()
        {
            string result = string.Empty;
            IEngine ocrEngine = new Engine();
            Bitmap bitmapToOcr = null;
            if (!haveSelection)
            {
                bitmapToOcr = editableImage1.Image;
            }
            else
            {
                bitmapToOcr = editableImage1.SelectionImage;
            }
            result = ocrEngine.GetOcrFromBitmap(bitmapToOcr);

            return result;
        }

        private string detectFileType(string filePath)
        {
            string result = string.Empty;
            // Magic numbers embedded in files - these are mutually exclusive
            const short BMP_ID = 0x4D42;                // bitmap fileheader file type
            const short TIF_BO_LE = 0x4949;             //TIFF byte order little endian
            const short TIF_BO_BE = 0x4d4d;             // TIFF byte order big endian
            const short TIF_ID_LE = 0x2a;               // TIFF version little endian
            const short TIF_ID_BE = 0x2a00;             // TIFF version big endian
            const short GIF_ID1 = 0x4947;               // GIF 1st short
            const short GIF_ID2 = 0x3846;               // GIF 2nd short

            short value;

            using (System.IO.BinaryReader reader = new System.IO.BinaryReader(System.IO.File.Open(filePath, System.IO.FileMode.Open)))
            {
                value = reader.ReadInt16();
                switch (value)
                {
                    case BMP_ID:
                        result = "BMP";
                        break;
                    case TIF_BO_LE:
                    case TIF_BO_BE:
                    case TIF_ID_BE:
                    case TIF_ID_LE:
                        result = "TIF";
                        break;
                    case GIF_ID1:
                    case GIF_ID2:
                        result = "GIF";
                        break;
                    default:
                        result = "";
                        break;
                }
            }

            return result;
        }

        private void oCRToolStripMenuItem_Click(object sender, EventArgs e)
        {
            if (selectedControl == null) return;   // no target field selected yet
            selectedControl.Text = getOcrResult();
            selectedControl.BackColor = Color.White;
        }

        private void textBox_DoubleClick(object sender, EventArgs e)
        {
            selectedControl = (TextBox)sender;
            selectedControl.BackColor = Color.PowderBlue;
            foreach (var t in this.Controls)
            {
                if (t.GetType() == typeof(TextBox) && t != sender)
                {
                    TextBox tb = (TextBox)t;
                    tb.BackColor = Color.White;
                }
            }
        }

        private void textBox1_TextChanged(object sender, EventArgs e)
        {
            TextBox t = (TextBox)sender;
            t.BackColor = Color.White;
        }

    }
}

Peter + Transym OCR versus The World: For The Win!

Oh yes. Ooh yes. Oooooh yessss. This is what I wanted to be capable of doing: getting a scan result in a viewer, selecting an area in the scan result, and then OCR that area, to get the result in the selected text box control. <*fist pump*>

image

Concept proven; mission accomplished.

Please note that Transym OCR still gets confused by a lower case l (L for Lima) embedded in a run of digits, reading it as a 1 rather than an l. I can’t blame it – I have yet to come across an OCR engine south of 5000 euros that can sort this out reliably.
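This particular confusion can’t be fixed in post-processing when the field is genuinely alphanumeric, but since the indexing client knows which field a selection belongs to, purely numeric fields can at least be cleaned up safely. A hedged sketch of my own (not something the POC does):

```csharp
using System.Text;

// Hypothetical post-processing (my suggestion, not part of the POC):
// for a field known to contain only digits, such as the Reference
// field, an 'l', 'I' or 'O' in the OCR result can only be a misread
// '1' or '0', so substituting them is safe in that context.
public static class FieldCleanup
{
    public static string CleanNumericField(string ocrResult)
    {
        var sb = new StringBuilder(ocrResult.Length);
        foreach (char c in ocrResult)
        {
            switch (c)
            {
                case 'l': case 'I': case '|': sb.Append('1'); break;
                case 'O': case 'o': sb.Append('0'); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

The point is that the field type supplies the context that no dictionary can: inside a digits-only field, the l-versus-1 ambiguity simply disappears.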

Jeez, I’m good!*
On to the next hurdle. There are quite a few still to clear – the most important being that the PictureBox control doesn’t let you zoom. Google is usually my friend, but every solution it comes up with involves resizing the original bitmap to fit the picture box – which will NOT help OCR one bit!

But: I have now proven that Transym OCR does everything I want it to do for my application! It covers all the must-haves. There are quite a few nice-to-haves that were not available to me when I was using Microsoft MODI, such as confidence (a property of the OCR result that indicates how sure the engine is that it has got it right), which I could use to colour-code automatically-indexed fields. But first, I am going to concentrate on replicating the functionality that I had.

*) every programmer has to say this to himself or herself every now and then. The rest of the world has no idea what you’re doing, and how hard it is. Winking smile

Peter versus Transym OCR: 2-0

As of today, the DnlCore.Shared library contains an OCR namespace.
There is not much that is exposed to the outside world yet. It contains one method: ScanAndOcrDocument. This fires up the TWAIN interface to a scanner and asks for a stack of paper to be scanned (note how this supports an automatic sheet feeder for scanning multiple pages),
image

and when it’s done doing that, it will return a string containing the OCR result of the entire document.

image

Using the DnlCore.Shared.OCR.Engine library doesn’t get any harder than this. Of course, we will now need to be able to create scans of selected regions of the document. The only challenge there is to make a selected area in a viewer available as a System.Drawing.Bitmap… which won’t be rocket science.
I am definitely on to something very, very good with this Transym OCR engine! Open-mouthed smile
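In code, the whole exercise boils down to something like this — a sketch, assuming ScanAndOcrDocument hangs off the same Engine class that the library exposes elsewhere:

```csharp
using System;
using DnlCore.Shared.OCR;

class ScanDemo
{
    static void Main()
    {
        // Fires up the TWAIN UI, scans the stack of paper in the
        // feeder, and returns the OCR result of the entire document
        // as a single string.
        IEngine engine = new Engine();
        string text = engine.ScanAndOcrDocument();
        Console.WriteLine(text);
    }
}
```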

Today’s partial solar eclipse – proof that it happened! ;-)

Right then. Today, we were supposed to see a partial solar eclipse. We were told yesterday that it would most likely be clear skies, so we would be lucky. Solar ecli… fat chance!
Here’s what we did get to see: various shades of gray (yes… I know. Spare me the pun. Life has become hard enough for us black&white photographers as it is).
The red arrow points to where approximately the sun would have been visible. So, that was a letdown.

I did, however, get some evidence. We have a little weather station on our roof that measures solar input as well. Here’s the graph for this morning:

Eclipse 20-03-2015

The pink line shows the solar input as it would have been on a perfectly clear morning. The yellow line shows the actual input.
See the dip starting at 09:40, and reaching the bottom at 10:35-ish? There you have it! Proof of today’s solar eclipse! Remember: you read it here first. Winking smile

Funny detail: the solar input in watts per square meter plummets dramatically, from more than 100 to approximately 20. But we did not perceive it as such – I thought it did get noticeably darker, but not 80% darker! The graph shows how much our senses fool us: in fact, at around 10:35, it was about as dark as it was three hours earlier!

It would be interesting to see if that would also show up in the outside temperature. Let’s have a look:
solar-eclipse-temp

And there it is. With about half an hour’s delay, as you’d expect.
In this graph, it looks a bit more dramatic than it is, due to the graph resolution on the Y-axis, but it’s still there.
So… if you missed it like we did here in Waarder, just take a look at these two graphs to relive the moment.
It’s debatable whether it’s worth it to save this for posterity, though. Smile

Peter versus Transym TOCR: 1-0

So… after having downloaded and reviewed the Transym TOCR documentation and samples, I discovered something “interesting” – the Transym OCR API is not a .NET API, or even a COM API, but a standard, Win32-style API. There is not much isolation, which makes it incredibly powerful, but also quite intimidating.

Tonight, I have abstracted the functionality into a separate class library that I can now call from my test program without too much ado. I may even go ahead and put a .NET wrapper around that. I’m not overly worried about the performance penalty, as for me this is essentially going to work within a desktop application, and furthermore the penalty is negligible in the context of the OCR process. I can then either tweak the workings of the old-style API class to suit my needs, or make it configurable (either from an app.config or by creating overrides).
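In C#, isolating a flat Win32-style API typically means a handful of [DllImport] declarations plus a small disposable wrapper. A sketch under assumptions: TOCRInitialise and TOCRShutdown are function names from the TOCR documentation, but I haven’t verified the exact signatures or status codes here, so treat the declarations as illustrative only:

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative P/Invoke bindings - the real TOCR API has many more
// functions and richer signatures; consult the TOCR API guide.
internal static class TocrNative
{
    [DllImport("TOCRdll.dll")]
    internal static extern int TOCRInitialise(ref int jobNo);

    [DllImport("TOCRdll.dll")]
    internal static extern int TOCRShutdown(int jobNo);
}

// A thin .NET wrapper that hides the job-number bookkeeping and
// guarantees the engine is shut down, even if the caller forgets.
public sealed class TocrSession : IDisposable
{
    private readonly int jobNo;

    public TocrSession()
    {
        int job = 0;
        // The return value signals success or failure - see the
        // TOCR API guide for the actual status codes.
        TocrNative.TOCRInitialise(ref job);
        jobNo = job;
    }

    public void Dispose()
    {
        TocrNative.TOCRShutdown(jobNo);
    }
}
```

With the raw calls fenced off like this, the rest of the application only ever sees an IDisposable session object rather than job numbers and status codes.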

The class library seems to work: I can scan a document from a TWAIN source and load the results into a viewer. Up to this point, the results are so good that I am wondering why this OCR engine doesn’t get more exposure… for OCR’ing machine type, its value for money is spectacular!

That’s enough for today – midnight is approaching. Time to hit the sack.

Peter versus Transym TOCR: 0-0

Some of you probably know that, a few years ago, over a couple of rainy Saturday afternoons at the kitchen table, I cobbled together a Document Management System.

A Document Management System (hereafter called DMS) manages documents. We distinguish between live documents (documents that are still being worked on, such as what you would expect to find in SharePoint or other collaborative systems) and ‘dead’ documents, which are in a final state, and have been sent or received. My little DMS was for dead documents, mainly to be capable of handling incoming mail other than birthday cards. We all know the feeling when we think “yes, I must have that insurance policy/warranty receipt/invoice somewhere. If only I knew where…”.

Another reason to do this was the well-known “because I can”. Before, I had worked with a company that also sold DMS software, and I felt that, in a couple of ways, I would be able to do better.

One of the key requirements of a good DMS is the ability to quickly index scanned-in documents. This process is called heads-up indexing; “heads-up” referring to the data entry worker keeping his eyes on the screen, rather than looking down at a sheet of paper on the desk. An indexing program shows the scanned document and, next to it, a UI holding the fields to be populated with the document’s metadata (information that helps to classify and archive the document so that it can be found when needed).
Heads-up indexing becomes a lot easier if the user of the indexing program can select the field to be indexed, drag a selection rectangle around a specific area in the viewer that shows the document, and then hit a button so that the text content is automatically filled in in the selected field. To do this, we use a technology called Optical Character Recognition (OCR), which attempts to “read” the image and returns whatever it has read as text.

For this, I used Microsoft MODI, which was supplied with Microsoft Office. Worked great – OCR was adequate, considering the fact that it was basically a free add-on. In fact, MODI was good enough to justify buying Microsoft Office by itself.

And of course, Microsoft then decided to discontinue MODI. They said they had brought that functionality to OneNote, but using the OCR facility from OneNote through an API was completely undocumented, and proved all but impossible.

Consequently, I was forced to use my existing DMS on a suitable system (a 32-bit Windows system running Office 2007).

Looking at alternatives didn’t make me very happy. There is, of course, Tesseract, which is open source. Unfortunately, in terms of recognition quality, it sucks eggs, and it is not particularly stable. And the commercial engines that do come with an API start at around $2000, which I cannot justify for something that is primarily meant for our own use.

When I did another search, a thread on StackOverflow mentioned Transym TOCR and said it was fine for machine type (as opposed to handwriting). I looked at their web site and discovered that the engine is mainly built for integration, and that TOCR had the facility for OCR’ing bitmaps in memory, which would enable me to process selections (drawing a rectangle in a viewer, exporting that rectangle as an in-memory bitmap, and using the result to populate fields in the indexer). Then I looked at the price.

Usually, when something seems too good to be true, it is because it is too good to be true. One hundred and thirteen euros is silly money for a decent OCR engine with an API. But I downloaded the evaluation version and gave it a run. Or ten, actually. The first document, a letter in Dutch with some amounts, scanned to JPEG (which is not a nice thing to do to an OCR engine), came out 100% correct.
Hmm, must be a fluke. I gave it another shot with another JPEG. Same result.

OK, I’ll make it suffer.
Most OCR engines first process the document with a raw OCR pass, meaning the engine just tries to recognise the characters. Then they usually try to apply some intelligence to the result, often using dictionaries and the like. Suppose the raw OCR comes up with “intell1gence”: run it through a dictionary, which will suggest “intelligence”, and lo and behold, we have a good result.
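The dictionary pass described above can be illustrated with a toy function: try the common digit-for-letter misreads and accept the candidate only if it turns the raw token into a known word. (Illustrative only — real engines are far more sophisticated, and the substitution table here is my own invention.)

```csharp
using System.Collections.Generic;

// Toy illustration of a dictionary-based correction pass.
public static class OcrDictionaryPass
{
    // Common OCR misreads: a digit that was probably a letter.
    private static readonly Dictionary<char, char> Misreads =
        new Dictionary<char, char>
        {
            { '1', 'l' }, { '0', 'o' }, { '5', 's' }, { '8', 'b' }
        };

    public static string Correct(string raw, ISet<string> dictionary)
    {
        if (dictionary.Contains(raw)) return raw;   // already a known word

        // Swap each suspected misread for its letter counterpart.
        var chars = raw.ToCharArray();
        for (int i = 0; i < chars.Length; i++)
        {
            if (Misreads.TryGetValue(chars[i], out char letter))
                chars[i] = letter;
        }

        string candidate = new string(chars);
        // Only accept the substitution if it produces a known word;
        // otherwise keep the raw OCR output untouched.
        return dictionary.Contains(candidate) ? candidate : raw;
    }
}
```

Note how a token like “VO3014l072” sails straight through unchanged: no dictionary contains it, so the pass (correctly) leaves it alone — which is exactly why such strings are the hard case described next.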

But sometimes, you are trying to process stuff that defies any attempt at commonsense. Suppose you have an insurance policy that has a policy number like
VO3014l072
You won’t find that in ANY dictionary. Your poor OCR engine is trying to make sense of the Oh’s and the zeros, and to add insult to injury there is also a lower case L (an l) in it, which, in a font like Arial, even people have a hard time recognising correctly. So I made up a fictional document with that number in it.
The only thing TOCR got wrong is the lower case L, which it recognised as a 1. The rest of the document was 100% correct.

Impressive. Too good to be true? Possibly, but I’m going to try it anyway.

So, yes, I am going to see if I can integrate that engine into my DMS. While I’m at it, I’m going to redo it from the ground up, using the repository design pattern and Entity Framework code-first for the back end, and MVVM for the front end. Not because the current version is buggy as hell, but more because I can. It should make the system easier to maintain, and it allows me to work with technologies that I am also using in my everyday work.
The indexing front end will remain a Windows Forms program, since I believe that, when doing indexing, performance (and therefore tight coupling) is of paramount importance — indexers do not want to wait. But the search client will probably get both a Windows Forms and a web-based front end, sharing the same business logic.
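What the repository pattern buys here is that both front ends talk to the same storage-agnostic interface. A sketch under assumptions — the Document entity and the interface below are invented for illustration (the real model doesn’t exist yet); the in-memory implementation would be swapped for one backed by an Entity Framework code-first DbContext:

```csharp
using System.Collections.Generic;
using System.Linq;

// Invented entity - a stand-in for the future DMS document model.
public class Document
{
    public int Id { get; set; }
    public string Sender { get; set; }
    public string Reference { get; set; }
}

// The repository interface both front ends would code against.
public interface IDocumentRepository
{
    Document GetById(int id);
    IEnumerable<Document> FindBySender(string sender);
    void Add(Document document);
}

// In-memory implementation, handy for tests; the production one
// would delegate these calls to an EF code-first DbContext.
public class InMemoryDocumentRepository : IDocumentRepository
{
    private readonly List<Document> store = new List<Document>();

    public Document GetById(int id) =>
        store.FirstOrDefault(d => d.Id == id);

    public IEnumerable<Document> FindBySender(string sender) =>
        store.Where(d => d.Sender == sender);

    public void Add(Document document) => store.Add(document);
}
```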

What I am mainly going to blog about is my progress with the Transym TOCR engine – I will try not to bore you with the rest of the stuff. After all, if you have come this far, you probably already know how to do that…