Naive Web Data Engineering Tales: Building Robust Optical Character Recognition #1

I <3 Data Engineering, BUT .. #

Problem is I am not a data engineer. I am a hacker. I only make things so let me introduce you to my new series Naive Data Engineering!

Over the span of a few years, every once in a while a client or project will come across my hands with OCR (Optical Character Recognition) plastered all over it. Especially when the data that will need to be engineered and massaged are images or video. I remember some infamous drunken episodes of trying to get reliable Computer Vision web applications to work during my Masters’ work and other similar projects. Often times I have come to realize that CV (especially OCR) was not the answer.

I have decided to write about OCR after several attempts of explaining the costs required to make robust OCR happen. Often times for clients I will even recommend against OCR using Jedi mind tricks (a staple in software engineering).

OCR? That is not the solution you are looking for

But I digress, let’s check out the exciting world of OCR!

First, What is heck is OCR? #

Optical Character Recognition involves an intersection of pattern recognition, computer vision and artificial intelligence. Usually, OCR is used to convert text from images/video in to computer readable text.

Some examples include:

  1. Reading license plate numbers and searching for the car in a database
  2. Storing collected business cards into contacts
  3. Converting class handwritten notes into a word document, etc.

Demo #

There are several libraries to try OCR with, for example OCRAD.js.

OCRAD.js Demo
Click on the image to go to the demo page

OCRAD is GNU C library that was compiled to Javascript using Emenscripten. This library is amazing for frontend/client side OCR but if you try several differing fonts (the blue round arrow) you will notice that the text extraction is not as desirable.

Examples:First #

Font Problems Digit Errors
For example Also

There are several other open source libraries such as Tesseract and more. But the point to notice is that OCR is not a silver bullet. Or rather OCR alone is usually not enough for a real world web application.

Next time on Naive Data Engineering #

In the next version of this series I will discuss what can and shouldn’t be done with OCR. We will then look at how to digest OCR results and post process them to make proper use of the results.

 
4
Kudos
 
4
Kudos

Now read this

Building MVPs cheap

Friday come’s along … let’s make an MVP # Jason Moreau from GrabScience reached out to me to work on a MVP for an idea he has. I won’t go into details but I will discuss some of the problems we ran into and how we solved it. Large File... Continue →