An HTML GUI for training Tesseract on character sets

The Tesseract OCR Chopper, by data journalist Dino Beslagic.

I’m making this short stub post because ever since I started using Tesseract to convert scanned documents into text, I’ve wondered why the hell it is so hard to train Tesseract (that is, to make it better at recognizing a particular font). As it turns out, Beslagic created a web app that makes the task comparatively easy and platform-independent.
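
For context, here is roughly what the scanned-document-to-text step looks like when scripted. This is only a minimal sketch, assuming the `tesseract` command-line program is installed and on the PATH; the function names and the file names (`scan.png`, `scan_text`) are made up for illustration. The `-l` option is where a custom-trained language or font would be selected, under whatever name it was trained with.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for a basic `tesseract` CLI call."""
    # -l picks which <lang>.traineddata file Tesseract loads; "eng" is an
    # assumed default for this sketch.
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run tesseract if it is installed; return the text file path, or None."""
    if shutil.which("tesseract") is None:
        return None  # tesseract is not on the PATH
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # By default Tesseract writes its plain-text result to <output_base>.txt
    return output_base + ".txt"


if __name__ == "__main__":
    # Placeholder file names; this only prints the command that would run.
    print(build_tesseract_command("scan.png", "scan_text"))
```

Keeping the command-building separate from the actual run makes it easy to check what will be executed before pointing it at a folder of scans.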

He first posted it about two years ago and updated it recently. I can’t believe I didn’t find it until now. How did I find it? By stumbling upon the “AddOns” wiki page for the Tesseract project. I love Tesseract, but I’m surprised that such a useful and popular utility has such scattered resources.


My name is Dan Nguyen and I am the Head of Data at Skift.com. I do a lot of programming and photography.


Dan Nguyen is a developer, data analyst, journalist, and photographer. You can contact him at dan@danwin.com. He is trying to do more of his blogging at blog.danwin.com.