Adding New Fonts to Tesseract 3 OCR Engine

Update: I've turned off commenting on this article because it was just a bunch of people asking for help and never getting any. If you need help with these instructions, go to Stack Overflow and ask there. If you have corrections to the article, please send them directly to me using the Contact form.

Tesseract is a great and powerful OCR engine, but its instructions for adding a new font are incredibly long and complicated. At CourtListener we have to handle several unusual blackletter fonts, so we had to go through this process a few times. Below I've explained the process so others may more easily add fonts to their system.

The process has two major steps: creating training documents, then training Tesseract.

Create training documents

To create training documents, open up MS Word or LibreOffice and paste in the contents of the attached file named 'standard-training-text.txt'. This file contains the training text that is used by Tesseract for the included fonts.

Set your line spacing to at least 1.5, and space out the letters by about 1 pt using character spacing. I've attached a sample doc too, if that helps. Set the text to the font you want to use, and save it as font-name.doc.

Save the document as a PDF (call it lang.font-name.exp0.pdf, where lang is the three-letter ISO 639 code for your language), and then use the following command to convert it to a 300 dpi TIFF (requires ImageMagick):

convert -density 300 -depth 4 lang.font-name.exp0.pdf lang.font-name.exp0.tif

You'll now have a good training image called lang.font-name.exp0.tif. If you're adding multiple fonts, or bold, italic or underline, repeat this process multiple times, creating one doc → pdf → tiff per font variation.
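
If you have several font variations, the conversion step can be scripted. The following is a minimal Python sketch and not part of the original workflow: convert_command is a hypothetical helper that only builds the ImageMagick argument list shown above, and the file names in the example are illustrative.

```python
from pathlib import Path


def convert_command(pdf_name, density=300, depth=4):
    """Build the ImageMagick `convert` arguments for one training PDF.

    Hypothetical helper: it mirrors the command in this article and only
    swaps the .pdf extension for .tif.
    """
    pdf = Path(pdf_name)
    if pdf.suffix.lower() != ".pdf":
        raise ValueError(f"expected a .pdf file, got {pdf_name!r}")
    tif = pdf.with_suffix(".tif")
    return ["convert", "-density", str(density), "-depth", str(depth),
            str(pdf), str(tif)]


if __name__ == "__main__":
    # Illustrative names: one PDF per font variation.
    for name in ["eng.old-english.exp0.pdf", "eng.old-english-bold.exp0.pdf"]:
        print(" ".join(convert_command(name)))
```

To actually run a conversion, each list can be passed to subprocess.run(cmd, check=True).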

Train Tesseract

The next step is to run tesseract over the image(s) we just created, and to see how well it can do with the new font. After it's taken its best shot, we then give it corrections. It'll provide us with a box file, which is just a file containing x,y coordinates of each letter it found along with what letter it thinks it is. So let's see what it can do:

tesseract lang.font-name.exp0.tif lang.font-name.exp0 batch.nochop makebox

You'll now have a file called lang.font-name.exp0.box, and you'll need to open it in a box-file editor. There are a bunch of these on the Tesseract wiki. The one that works for me (on Ubuntu) is moshpytt, though it doesn't support multi-page tiffs. If you need to use a multi-page tiff, see the issue on the topic for tips. Once you've opened it, go through every letter, and make sure it was detected correctly. If a letter was skipped, add it as a row to the box file. Similarly, if two letters were detected as one, break them up into two lines.
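
Before opening the editor, it can help to look at the box file as plain text. The sketch below is my own illustration, not a Tesseract tool. It assumes the six-field row layout written by Tesseract 3 (symbol, left, bottom, right, top, page, with coordinates measured from the bottom-left of the image) and flags unusually wide boxes, which are sometimes two letters detected as one. The names Box, parse_box_line and wide_boxes, and the 2x threshold, are hypothetical choices.

```python
from statistics import median
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box-file row, assuming six whitespace-separated fields."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(parts[0], left, bottom, right, top, page)


def wide_boxes(boxes, factor=2.0):
    """Return boxes wider than `factor` times the median width.

    This is only a heuristic for spotting merged letters; every hit
    still needs a human look in the box editor.
    """
    if not boxes:
        return []
    typical = median(b.right - b.left for b in boxes)
    return [b for b in boxes if (b.right - b.left) > factor * typical]
```

For example, parse_box_line("s 734 494 751 519 0") gives a box for the letter s that is 17 pixels wide on page 0.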

When that's done, you feed the box file back into tesseract:

tesseract lang.font-name.exp0.tif lang.font-name.exp0 nobatch box.train.stderr

Next, you need to detect the character set used in all your box files:

unicharset_extractor *.box
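
The *.box pattern above relies on your shell expanding it into a list of file names. If your shell passes the literal string *.box instead, the tool will report that it cannot open that file. As a hedged workaround, this small Python sketch (extractor_command is a hypothetical helper name) expands the pattern itself and builds the same command with every box file listed explicitly.

```python
import glob
import os


def extractor_command(directory="."):
    """Build a unicharset_extractor call that names each .box file."""
    box_files = sorted(glob.glob(os.path.join(directory, "*.box")))
    if not box_files:
        raise FileNotFoundError(f"no .box files found in {directory!r}")
    return ["unicharset_extractor"] + box_files


if __name__ == "__main__":
    # Prints the command for the current directory, or fails loudly
    # if there are no box files to process.
    print(" ".join(extractor_command(".")))
```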

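When that's complete, you need to create a font_properties file. It should list every font you're training, one per line, and flag with a 1 or a 0 whether it has each of the following characteristics: <fontname> <italic> <bold> <fixed> <serif> <fraktur>. Depending on your Tesseract version, <fontname> may need to be the bare font name from your file names (for example old-english) rather than the box file name; a reader reports in the comments below that only the bare name worked for them.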

So, for example, if you use the standard training data, you might end up with a file like this:

eng.arial.box 0 0 0 0 0
eng.arialbd.box 0 1 0 0 0
eng.arialbi.box 1 1 0 0 0
eng.ariali.box 1 0 0 0 0
eng.b018012l.box 0 0 0 1 0
eng.b018015l.box 0 1 0 1 0
eng.b018032l.box 1 0 0 1 0
eng.b018035l.box 1 1 0 1 0
eng.c059013l.box 0 0 0 1 0
eng.c059016l.box 0 1 0 1 0
eng.c059033l.box 1 0 0 1 0
eng.c059036l.box 1 1 0 1 0
eng.cour.box 0 0 1 1 0
eng.courbd.box 0 1 1 1 0
eng.courbi.box 1 1 1 1 0
eng.couri.box 1 0 1 1 0
eng.georgia.box 0 0 0 1 0
eng.georgiab.box 0 1 0 1 0
eng.georgiai.box 1 0 0 1 0
eng.georgiaz.box 1 1 0 1 0
eng.lincoln.box 0 0 0 0 1
eng.old-english.box 0 0 0 0 1
eng.times.box 0 0 0 1 0
eng.timesbd.box 0 1 0 1 0
eng.timesbi.box 1 1 0 1 0
eng.timesi.box 1 0 0 1 0
eng.trebuc.box 0 0 0 1 0
eng.trebucbd.box 0 1 0 1 0
eng.trebucbi.box 1 1 0 1 0
eng.trebucit.box 1 0 0 1 0
eng.verdana.box 0 0 0 0 0
eng.verdanab.box 0 1 0 0 0
eng.verdanai.box 1 0 0 0 0
eng.verdanaz.box 1 1 0 0 0

Note that this is the standard font_properties file that should be supplied with Tesseract, to which I've added two rows (eng.lincoln.box and eng.old-english.box) for the blackletter fonts I'm training. You can also see which fonts are included out of the box.
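
Because the five flags are positional, it's easy to put a 1 in the wrong column. The following Python sketch is an illustration only (format_font_line and parse_font_line are hypothetical helper names): it writes and reads rows in the <fontname> <italic> <bold> <fixed> <serif> <fraktur> order described above.

```python
FLAGS = ("italic", "bold", "fixed", "serif", "fraktur")


def format_font_line(fontname, **flags):
    """Return one font_properties row; unspecified flags default to 0."""
    unknown = set(flags) - set(FLAGS)
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    if not fontname or any(c.isspace() for c in fontname):
        raise ValueError("font name must be non-empty and contain no spaces")
    bits = ["1" if flags.get(name) else "0" for name in FLAGS]
    return " ".join([fontname] + bits)


def parse_font_line(line):
    """Split one row into (fontname, {flag: bool}) and validate it."""
    parts = line.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    bits = parts[1:]
    if any(b not in ("0", "1") for b in bits):
        raise ValueError(f"flags must be 0 or 1: {line!r}")
    return parts[0], {flag: b == "1" for flag, b in zip(FLAGS, bits)}
```

For instance, format_font_line("eng.old-english.box", fraktur=True) reproduces the blackletter row from the table above.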

We're getting near the end. Next, create the clustering data:

mftraining -F font_properties -U unicharset -O lang.unicharset *.tr
cntraining *.tr

If you want, you can create a wordlist or a unicharambigs file. If you don't plan on doing that, the last step is to combine the various files we've created.

To do that, rename each of the language files (normproto, Microfeat, inttemp, pffmtable) to have your lang prefix, and run (mind the dot at the end):

combine_tessdata lang.

This will combine everything into a single lang.traineddata file, and you just need to move it to the correct place on your OS. On Ubuntu, I was able to move mine with:

sudo mv eng.traineddata /usr/local/share/tessdata/
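
The renaming that has to happen before combine_tessdata is easy to get wrong by hand. Here is a minimal Python sketch of that step, assuming the four output names listed above sit in one directory; add_lang_prefix is a hypothetical helper, and it skips any file that isn't present rather than failing.

```python
import os

# Output names as listed in this article.
OUTPUT_NAMES = ("normproto", "Microfeat", "inttemp", "pffmtable")


def add_lang_prefix(directory, lang):
    """Rename each training output to <lang>.<name>; return the new paths."""
    renamed = []
    for name in OUTPUT_NAMES:
        src = os.path.join(directory, name)
        if not os.path.exists(src):
            continue  # this run did not produce the file
        dst = os.path.join(directory, f"{lang}.{name}")
        os.rename(src, dst)
        renamed.append(dst)
    return renamed
```

Calling add_lang_prefix(".", "eng") in the training directory would leave files such as eng.normproto and eng.inttemp ready for combine_tessdata eng.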

And that, good friend, is it. Worst process for a human, ever.

Attachment                    Size
old-english.doc               19.5 KB
standard-training-text.txt    5.01 KB

The unicharset_extractor *.box command does not work for me, but good tutorial anyway, thanks:
Extracting unicharset from *.box
Cannot open box file *.box

Should one use another training text if tesseract is supposed to detect only digits?
Thanks

I believe so, yes.

Hi. Great tutorial, I'm starting to get somewhere with it. But I have a couple of questions:

I'm trying to train tesseract to read a digital display. I've made a sample sheet containing only digits, the decimal point (.) and the colon (:), with all the possible combinations: one digit, two digits, three digits, all digits, digit . digit, etc. Is it right to produce such a document, given that I only need to read digits?

If I create a new eng.traineddata with only the new font and digits, how do I add this file to the standard eng.traineddata in tessdata? Wouldn't it overwrite the existing one when we copy it to the tessdata directory?

Thanks in advance

That sounds like the right approach to me in terms of your input document, but I'm no expert.

To your second question, yeah, I think the point is to replace the traineddata that comes with tesseract so that it instead uses your special data. If you're only doing digits, you shouldn't need all the other stuff, and indeed, I'd expect worse quality with it than without it.

Well, it worked; I can now recognize digits. I used a free digital display font named ds-digital (you can download it for free anywhere), created an image, preprocessed it with dither and erode (to close the gaps), and it is working.

Thank you very much for your help.

Nice tut, I managed to generate a .traineddata file, but it does a poor job of recognizing. I tried this with the 04b font (http://www.urbanfonts.com/fonts/04b_30.htm) with only digits. Could someone try this out for me? I can't get this to work :(

So, I am using Tesseract 3.0 and I am just using the traineddata out of the box (I know that I need to follow this tut, and I am). What I am seeing is a lot of fatal errors, and my app dies. So... if Tesseract is going to continue to kill my app, then I am in trouble. Have you seen this? Does it get better after you train it?

I didn't explain myself very well. I have a Java Web Service that invokes the calls to tesseract. At a random number of calls, Tesseract throws a FATAL ERROR in the JRE, and I have to restart my app. Has anyone else seen this?

This will have more to do with your app than with Tesseract. You should ask this same question on a site that will have better information about that. You should also invoke your commands to Tesseract in such a way that they will be wrapped in try/catch blocks.

Agreed. And again, great tutorial! Serious life saver bro.

For all readers... make sure that you have the latest version of leptonica installed (1.67 or higher) and version 3.01 of tesseract. I am using Linux and the repos had 3.00, so I had to manually download and install both of those. Other than that, your tut took me 100% of the way.

Very good tutorial, thank you!!

I have a problem when following it. When trying to use tesseract command:

tesseract lang.font-name.exp0.tiff lang.font-name.exp0 batch.nochop makebox

I get the following error:

TIFFstream: Sorry, can not handle image.
Unsupported image type.

The image I am trying to use is a multipage TIFF image. When using the eurotext.tif image that comes with tesseract, the box file is created, so I think it must be a problem with the multipage feature. Do you have any idea about how to solve the issue?

Thank you in advance

Nice to see some more notes on Tesseract Training to create Language Data Files.

I am trying to enhance Tesseract's ability to OCR German Fraktur script in a 1000-page Heinrich Brugsch text.
While your tutorial is relevant for Linux users, some variations need to be made for Windows users.

Have you ever tried a Windows project to create a new deu-frak.traineddata file?

Nope.

Hi and thanks for the tutorial.

I have a couple of questions:

1) If I want to include all of the built-in fonts in the font_properties file (like you did in the tutorial), shouldn't I have also the training data for those fonts (i.e. tif/box pairs)? If yes, where can I find them?

2) in the font_properties file, it seems to me that I have to write
old-english 0 0 0 0 1
instead of
eng.old-english.box 0 0 0 0 1

otherwise tesseract tells me it can't find font properties for old-english font... am I doing something wrong?

Thanks,
Philip

Hello everyone,
Thanks Michael for such a great tutorial. I tried this out, but the problem was that it overwrites the original eng.traineddata file shipped with tesseract. Is there any way to add new font training to the existing training data file? Thank you for your help.

Well, if you follow the tutorial, you'll end up training using the same thing that was used to generate the eng.traineddata file, so you do have to overwrite it, but you do so with your own data that's better than what's there.

Hi.
First of all thank you for the Tutorial, you made things much easier!
By the way, I think I had problems with the font_properties file.
I am trying to train tesseract on a new font for the German language. As a first try I made a small training file (only a few letters, numbers and symbols, 30 in total) and followed your tutorial step by step.
When it was time to make the font_properties file, I could not find the one that should be supplied with Tesseract (neither German nor English), so I ended up making a new file with a single line describing the new font I'd like to train.
Everything worked well until the end, but it is clear that I obtained a "new language set" based only on the new font I trained.
Is this caused by an incomplete font_properties file?
Is there a way to supplement an already existing language with a new font?
Thank you for your time.

The problem you're having probably isn't due to the incomplete font_properties file. If you need a complete one, the one that I provided in the post is the default one.

There's no way to supplement an already existing language with a new font. You have to create all new training files. It's dumb.

Good luck.

Thank you for this tutorial. :)
I just want to share with you this problem I'm having:

I've already completed the language files (normproto, Microfeat, inttemp, pffmtable) but when I entered
combine_tessdata eng.

an error says:
combine_tessdata: command not found

I've already done this successfully: sudo apt-get install tesseract-ocr

mftraining, unicharset_extractor, etc. are all working, except for combine_tessdata.

Please help me & Thank you!

Hello,
at the step
tesseract lang.font-name.exp0.tiff lang.font-name.exp0 batch.nochop makebox
why can't I make a box file?
I ran this command, and the result is:
E:\soth\Tesseract-OCR>tesseract "E:\soth\Tesseract-OCR\demo\eng.arial.exp0.tiff" eng.arial.exp0 batch.nochop makebox
Tesseract Open Source OCR Engine v3.01 with Leptonica
Cannot open input file: E:\HocTap\Tesseract-OCR\demo\eng.arial.exp0.tiff

Hi everybody. I use Tesseract 3.0.2 and I'm trying to train it, but I have a problem. When I run: mftraining -F font_properties -U unicharset -O eng.unicharset eng.timesitalic.exp0.tr eng.timesitalic.exp1.tr, it crashes: Error:Assert failed:in file '...\....\trainingsampleset.cpp, line 662. Thanks!

Hi expert tesseract-ocr group

I tried to train tesseract-ocr but it failed.
Do you have a video on training tesseract-ocr to add new fonts or a new language, as a torrent?
Where can I get the video torrent?
Please help me.
Regards,
souk

Thanks for this tutorial :)
The best part of this tutorial is the font_properties file's required list, i.e. the font name list.
This helped me a lot.
Thanks for this tutorial again :)

Hi, when I use this command it shows "Invalid Parameter -300":

convert -density 300 -depth 4 eng.font-name.exp0.pdf eng.font-name.exp0.tif

send one sample

Thank you

Make sure you downloaded the binary release of ImageMagick and not the source. Also, if you're running this on Windows, try Cygwin instead of cmd.

Same error above

Hi everybody. I use Tesseract 3.0.2 and I'm trying to train it, but I have a problem. When I run: mftraining -F font_properties -U unicharset -O eng.unicharset eng.timesitalic.exp0.tr eng.timesitalic.exp1.tr, it crashes: Error:Assert failed:in file '...\....\trainingsampleset.cpp, line 662. Thanks!

Need help urgently..

Thanks

In Tesseract 3.02 training, there is a new shapeclustering command that generates a shapetable file to be used in subsequent steps.

http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Thanks so much. A very nice tutorial!
I'm beginning to train a new font, but I don't know how to add the trained data (with the new font) to the existing eng.traineddata package.
When I replace it with the new traineddata package, I can't recognize any of the old fonts. Help me!!!
(I use MacOS & Tesseract 3.02)