An HTML GUI for training Tesseract on character sets

The Tesseract OCR Chopper, by data journalist Dino Beslagic.

I’m making this short stub post because ever since I started using Tesseract to convert scanned documents into text, I’ve wondered why the hell it is so hard to train Tesseract (that is, to make it better at recognizing a particular font). As it turns out, Beslagic created a web app that makes the task comparatively easy and platform-independent.

He posted it about two years ago and recently updated it. I can’t believe I didn’t find it until now. How did I find it? By stumbling upon the “AddOns” wiki for the Tesseract project. I love Tesseract, but I’m surprised that such a useful and popular utility has such scattered resources.

I'm a programmer journalist, currently teaching computational journalism at Stanford University. I'm trying to do my new blogging at blog.danwin.com.