This is how we rock the OCR!


OCR – Optical Character Recognition 

We were building a receipt-processing OCR service: a web service that extracts text from receipt images provided by users.

We had to build this service in Ruby. Tesseract, an open-source OCR engine sponsored by Google, was our best option for text extraction.

Tesseract is an engine that extracts text from images. It does a lot for us, but the accuracy of the extracted text was not satisfactory.

So after the demo version, our focus was entirely on accuracy. We planned to start with image preprocessing. Our experiments showed that cleaner receipt images gave higher accuracy in the extracted text. The preprocessing steps available to us were removing the background of the receipt image, removing noise, and deskewing the image so that the text would align horizontally.
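Here is a minimal sketch of what such a pipeline could look like in Ruby. It is not the exact code from our service. It assumes the ImageMagick `convert` command (`magick` in ImageMagick 7) and the `tesseract` command are installed and on the PATH. The helper names, the 40% deskew threshold, and the `cleaned.png` file name are made up for illustration. Background removal is left out because it depends heavily on how the receipts were photographed.

```ruby
require "open3"

# Build the ImageMagick argument list that cleans up a receipt image.
# The flag values here are illustrative assumptions, not tuned settings.
def preprocess_command(input_path, output_path)
  [
    "convert", input_path,
    "-colorspace", "Gray", # drop colour information
    "-despeckle",          # reduce speckle noise
    "-deskew", "40%",      # straighten the image so text lines run horizontally
    "+repage",             # reset the virtual canvas after deskewing
    output_path
  ]
end

# Build the Tesseract argument list; "stdout" makes it print the text
# instead of writing it to a file.
def ocr_command(image_path, lang = "eng")
  ["tesseract", image_path, "stdout", "-l", lang]
end

# Run a command and return its standard output, raising if it fails.
def run!(cmd)
  out, err, status = Open3.capture3(*cmd)
  raise "#{cmd.first} failed: #{err}" unless status.success?
  out
end

# Hypothetical entry point: preprocess first, then extract the text.
def extract_text(input_path, cleaned_path = "cleaned.png")
  run!(preprocess_command(input_path, cleaned_path))
  run!(ocr_command(cleaned_path))
end
```

Calling `extract_text("receipt.jpg")` would then return the recognized text as a string. Keeping the command builders separate from the code that runs them makes it easy to try different preprocessing flags and compare the accuracy.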

We used ImageMagick as the preprocessing…
