<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Michael Jay Lissner</title><link href="https://michaeljaylissner.com/" rel="alternate"></link><link href="https://michaeljaylissner.com/feeds/tag/howto" rel="self"></link><id>https://michaeljaylissner.com/</id><updated>2014-10-06T00:00:00-07:00</updated><entry><title>Editing a File on Github</title><link href="https://michaeljaylissner.com/posts/2014/10/06/editing-on-github-a-non-technical-explainer/" rel="alternate"></link><updated>2014-10-06T00:00:00-07:00</updated><author><name>Mike Lissner</name></author><id>tag:michaeljaylissner.com,2014-10-06:posts/2014/10/06/editing-on-github-a-non-technical-explainer/</id><summary type="html">
&lt;p&gt;When writing programs, developers have a choice of whether they want their work to be public or private. Programs that are made public are called “open source” and ones that are not are called “closed source”. In both cases the developer can share a program with the world as a website or iPhone app, or whatever, but in the case where the code is shared publicly it’s &lt;em&gt;also&lt;/em&gt; possible for anybody anywhere in the world to change the program to make it better. (For more detail on this and other jargon, see the &lt;a href="#some-definitions"&gt;definitions at the end&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;This is very cool! &lt;/p&gt;
&lt;p&gt;But I hear you asking, “How do I, a non-developer, make use of this system to make the world a better place?” I’m glad you asked — this article is for you.&lt;/p&gt;
&lt;h2 id="and-then-there-was-git"&gt;And then there was Git&lt;/h2&gt;
&lt;p&gt;&lt;a href="http://git-scm.com/"&gt;Git&lt;/a&gt; is an extremely popular system that developers use to keep track of the code they write. The main thing it does is let two developers work on the same file, track their individual changes, and then combine their work, as you might do in Microsoft Word. Since all programs are just collections of lots of files that are together known as a “repository”, this lets a number of developers work together without trampling on each other’s changes.&lt;/p&gt;
&lt;p&gt;There are a million ways to use Git but lately a lot of people use Git through a website called &lt;a href="https://github.com/"&gt;Github&lt;/a&gt;. Github makes it super-easy to use Git, but you still need to understand a few steps that are necessary to make changes. The basic steps we’ll take are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;You: Find the file&lt;/li&gt;
&lt;li&gt;You: Change the file and save your changes&lt;/li&gt;
&lt;li&gt;You: Create a pull request&lt;/li&gt;
&lt;li&gt;The manager (me or somebody else): Merges the pull request, making your changes live&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For the purpose of this article, I’ve created a new repository as a playground where you can try this out. &lt;/p&gt;
&lt;p&gt;The playground is here: &lt;a href="https://github.com/mlissner/git-tutorial/tree/master"&gt;https://github.com/mlissner/git-tutorial/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Go check out the playground and create a Github account, then come back here and continue to the next step, changing a file. &lt;/p&gt;
&lt;h2 id="make-your-change"&gt;Make your change&lt;/h2&gt;
&lt;p&gt;Like the rest of this, the process of making a change is actually pretty easy. All you have to do is find the file, make your change, and then save it. So:&lt;/p&gt;
&lt;h3 id="find-the-file"&gt;Find the file&lt;/h3&gt;
&lt;p&gt;When you look at &lt;a href="https://github.com/mlissner/git-tutorial/tree/master"&gt;the playground&lt;/a&gt;, you’ll see a bunch of files like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="File List" src="https://michaeljaylissner.com/images/github/file-list.png"/&gt;&lt;/p&gt;
&lt;p&gt;Click the file you want to edit. In this case, we’ll be changing the file called “your-name.txt”, so click that one.&lt;/p&gt;
&lt;p&gt;Once you do that, you’ll see the contents of the file — a list of names, mine at the top — and you’ll see a pencil that lets you edit the file. &lt;/p&gt;
&lt;p&gt;Click the pencil! &lt;/p&gt;
&lt;h3 id="change-the-file"&gt;Change the file&lt;/h3&gt;
&lt;p&gt;At this point you’ll see a message saying something like: &lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You are editing a file in a project you do not have write access to. We are forking this project for you (if one does not yet exist) to write your proposed changes to. Submitting a change to this file will write it to a new branch in your fork so you can send a pull request. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Groovy. If you ignore both the jargon and the bad grammar, you can go ahead and add your name to the bottom of the file, and then you’ll see two fields at the bottom that you can use to explain your change:&lt;/p&gt;
&lt;p&gt;&lt;img alt="Explain Thyself" src="https://michaeljaylissner.com/images/github/explain-thyself.png"/&gt;&lt;/p&gt;
&lt;p&gt;This is like an email. The first field is the subject of your change, something brief and to the point. The second field lets you flesh out in more detail what you did, why it’s useful, etc. In many cases — like simply adding your name to this file — your changes are obvious and you can just hit the big green “Propose file change” button.&lt;/p&gt;
&lt;p&gt;Let’s press the big green button, shall we? &lt;/p&gt;
&lt;h3 id="send-a-pull-request"&gt;Send a “pull request”&lt;/h3&gt;
&lt;p&gt;At this point you’ll see another form with another somewhat cryptic message:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The change you just made was written to a new branch in your fork of this project named patch-1. If you’d like the author of the original project to merge these changes, submit a pull request.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I think the important part of that message is the second sentence:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you’d like the author of the original project to merge these changes, submit a pull request.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Ok, so how do you do that? Well, it turns out that the page we’re looking at is very similar to the one we were just on. It has two fields, one for a subject and one for a comment. You can fill these out, but if it’s a simple change you don’t need to, and anyway, anything you entered on the previous page will already be copied here.&lt;/p&gt;
&lt;p&gt;So: Press the big green button that says “Create pull request”. &lt;/p&gt;
&lt;p&gt;You’re now done, but what did you do, exactly? &lt;/p&gt;
&lt;h3 id="lets-parse-whats-happened-so-far"&gt;Let’s parse what’s happened so far&lt;/h3&gt;
&lt;p&gt;At this point, you’ve found a file, changed it, and submitted a pull request. Along the way, the system told you that it was “forking this project for you” and that your changes were, “written to a new branch in your fork of this project”. &lt;/p&gt;
&lt;p&gt;Um, what? &lt;/p&gt;
&lt;p&gt;The most amazing thing that Git does is allow many developers to work on the same file at the same time. It does this by creating what it calls forks and branches. For our purposes these are basically the same thing. The idea behind both is that every so often people working on a file save a copy of the entire repository into what’s called a commit. A commit is a copy of the code that is saved forever so anybody can travel back in time and see the code from two weeks ago or a month ago or whatever. Most of any Git repository is just a bunch of these copies, and you actually created one when you saved your changes to the file. &lt;/p&gt;
&lt;p&gt;This is super useful on its own, but when somebody forks or branches the repository, what they do is say, “I want a perfect copy of all the old stuff, but from here on, I’m going my own way whenever I save things.” Over time, everybody working in the repository does this, creating their own work in their own branches, and amazingly, one person’s work doesn’t interfere with another’s. &lt;/p&gt;
&lt;p&gt;Later, once somebody thinks that their work is good enough to share with everybody, they create what’s called a “Pull Request”, just like you did a moment ago, and the owner of the repository — in this case, me — gets an email asking him or her to “pull” the code into the main repository and “merge” the changes into the files that are there. Once this is done, everybody gets those changes from then on. &lt;/p&gt;
&lt;p&gt;It’s a brilliant system. &lt;/p&gt;
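&lt;p&gt;If you’re curious what this looks like without the website, here is a rough sketch of the same cycle using the Git command line in a throwaway folder. It assumes Git is installed on your computer. The name, the email address, and the branch name &lt;code&gt;main&lt;/code&gt; are placeholders I made up for the example, and the final merge is only an approximation of what the “Merge pull request” button does.&lt;/p&gt;

```shell
# A local sketch of the branch, commit, and merge cycle (no Github needed).
demo=$(mktemp -d)                          # throwaway folder standing in for the playground
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main      # call the main line of work "main"
git config user.name "Example Person"      # placeholder identity for the commits
git config user.email "person@example.com"

# The owner's first commit: a list of names with one entry.
echo "Mike" | tee your-name.txt
git add your-name.txt
git commit -q -m "Start the list of names"

# "Going my own way": a branch named like the one Github creates for you.
git checkout -q -b patch-1
echo "Your Name" | tee -a your-name.txt
git commit -q -a -m "Add my name"

# The owner pulls the branch back in, roughly what merging a pull request does.
git checkout -q main
git merge -q patch-1
cat your-name.txt
```

&lt;p&gt;Until the last two commands run, the owner’s copy of the file still has only one name in it, which is the whole point of working in your own branch.&lt;/p&gt;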
&lt;h3 id="my-turn-merging-the-pull-request"&gt;My turn: Merging the pull request&lt;/h3&gt;
&lt;p&gt;When you created that pull request a moment ago, you actually sent me an email and now you have to wait for me to do something. Eventually, I’ll get your email, and when I do I’ll go to Github and see a screen like this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="PR Screen" src="https://michaeljaylissner.com/images/github/pr-screen.png"/&gt;&lt;/p&gt;
&lt;p&gt;I’ll probably make a comment saying thank you, and then &lt;em&gt;I’ll&lt;/em&gt; press the Big Green Button that says, “Merge pull request”.&lt;/p&gt;
&lt;p&gt;This will merge your changes into mine and we’ll both go about our merry way. Mission accomplished! &lt;/p&gt;
&lt;h2 id="why-this-works-so-well"&gt;Why this works so well&lt;/h2&gt;
&lt;p&gt;This system is pretty amazing and it works very well for tiny little projects and massive ones alike (for example, &lt;a href="https://github.com/torvalds/linux/network"&gt;some projects have thousands of active forks&lt;/a&gt;). What’s great about this system is that it allows anybody to do whatever they want in their fork without requiring any permission from the owner of the code. I’m happy to see people experimenting, and their work will never affect me until they issue a pull request and I merge it in, accepting their proposed changes.&lt;/p&gt;
&lt;p&gt;This process mirrors a lot of real-world processes between writers and editors, but solidifies and equalizes them so that there’s a &lt;em&gt;right&lt;/em&gt; way to do things and so that nobody can cause any trouble. The process itself can be a little overwhelming at first, with lots of jargon and steps, but once you get it down, it’s smooth and quick and works very well. &lt;/p&gt;
&lt;p&gt;As you might expect, there are tons of resources about this on the Web. Some really good ones &lt;a href="https://guides.github.com/introduction/flow/"&gt;are at Github&lt;/a&gt; and there are even &lt;a href="http://git-scm.com/book"&gt;entire online books&lt;/a&gt; going into these topics. Like all things, you can go as deep as you want, but the above should give you some good basics to get you started. &lt;/p&gt;
&lt;h2 id="some-definitions"&gt;Some Definitions&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open vs Closed Source&lt;/strong&gt;: This is a topic entire theses and books have been written about, but in general open source is a way of creating a program where a developer shares all of their code so anybody can see it. In general when a program is open source, people are welcome to edit the code, help file and fix bugs, etc. On the other hand, closed source development is a way of creating a program so that only the developers can see the code, and the public at large is generally not welcome to contribute, except to sometimes email the developer with comments. &lt;/p&gt;
&lt;p&gt;In a way, the product of open source development is a combination of the code itself plus the program it creates, while in closed source projects the product is the program alone. There are thousands of examples of each of these ways of developing software. For example, &lt;a href="https://source.android.com/"&gt;Android&lt;/a&gt; and the &lt;a href="https://github.com/torvalds/linux/"&gt;Linux Kernel&lt;/a&gt; are open source, while Microsoft Word and iPhones are not. (See how I couldn’t link to the latter two?)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: A collection of files, images, and other stuff that are kept together for a common purpose. Generally it’s a bunch of files that create a website or program, but some people use repositories for all kinds of things, like dealing with &lt;a href="https://github.com/mlissner/identity-theft"&gt;identity theft&lt;/a&gt; (shameless plug), &lt;a href="https://github.com/mlissner/michaeljaylissner.com/edit/master/content/editing-on-github-a-non-technical-explainer.md"&gt;holding the contents of this very webpage&lt;/a&gt; (shameless plug), or even &lt;a href="https://github.com/vzvenyach/codingforlawyers/"&gt;writing online books teaching lawyers to code&lt;/a&gt; (&lt;em&gt;not&lt;/em&gt; a shameless plug!).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pull request&lt;/strong&gt;: A polite way to say, “This code is ready to get included in the main repository. Please pull it in.”&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Merging&lt;/strong&gt;: The process of taking a branch or fork and merging the changes in it into another branch or fork. This combines two people’s work into a single place. &lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;</summary><category term="GitHub"></category><category term="howto"></category><category term="CourtListener"></category></entry><entry><title>How to help end Boy Scouts of America’s ban on gays</title><link href="https://michaeljaylissner.com/posts/2013/04/19/help-end-the-bsa-ban-on-gays/" rel="alternate"></link><updated>2013-04-19T10:18:05-07:00</updated><author><name>Mike Lissner</name></author><id>tag:michaeljaylissner.com,2013-04-19:posts/2013/04/19/help-end-the-bsa-ban-on-gays/</id><summary type="html">&lt;p&gt;On May 3rd, the Boy Scouts are considering lifting their ban on gays, and are putting a vote to the local and national councils. This means that it&amp;#8217;s easy to influence the vote by calling in and expressing your opinion. It&amp;#8217;s simple to do so, and more voices could change the direction of the Boy Scouts of America, allowing all boys to be included and&amp;nbsp;accepted. &lt;/p&gt;
&lt;p&gt;Here&amp;#8217;s what to&amp;nbsp;do:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="http://www.scouting.org/LocalCouncilLocator.aspx"&gt;Find your local&amp;nbsp;council&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Give them a call or send them an email telling them your&amp;nbsp;opinion.&lt;/li&gt;
&lt;li&gt;Contact the national council via email: &lt;a href="mailto:feedback@scouting.org"&gt;feedback@scouting.org&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Most important is to contact your local council. We need their phones to be ringing off the hook with people expressing their opinions. It truly takes no more than two minutes. Here&amp;#8217;s the San Diego council: (619) 298-6121, and the East Bay council: (925)&amp;nbsp;674-6100.&lt;/p&gt;
&lt;p&gt;If you prefer a form letter, you can &lt;a href="https://secure3.convio.net/hrc/site/Advocacy?cmd=display&amp;amp;page=UserAction&amp;amp;id=1623&amp;amp;autologin=true&amp;amp;utm_term=link2&amp;amp;JServSessionIdr004=on4q7x9ly4.app304a"&gt;just do this one through the Human Rights Campaign&lt;/a&gt; (34,000 people already&amp;nbsp;have).&lt;/p&gt;
&lt;p&gt;Here&amp;#8217;s a simple transcript for you to&amp;nbsp;follow:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&amp;#8217;m [calling/writing] to express my desire that the Boy Scouts immediately lift their ban on gays. This ban is discriminatory, outdated and pointless. Scouting teaches many great lessons to thousands of adolescents across the &lt;span class="caps"&gt;U.S.&lt;/span&gt; One lesson that Scouting should not teach is that homosexuality is somehow wrong or grounds for&amp;nbsp;discrimination. &lt;/p&gt;
&lt;p&gt;On May 3rd, I hope that you will help the &lt;span class="caps"&gt;BSA&lt;/span&gt; finally make the right decision so it can continue to lead boys into being mature and accepting&amp;nbsp;men.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Send this to the email above, and call your local council. This could make a&amp;nbsp;difference.&lt;/p&gt;</summary><category term="howto"></category><category term="homosexuality"></category><category term="boy scouts"></category></entry><entry><title>Adding New Fonts to Tesseract 3 OCR Engine</title><link href="https://michaeljaylissner.com/posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/" rel="alternate"></link><updated>2014-10-06T00:00:00-07:00</updated><author><name>Mike Lissner</name></author><id>tag:michaeljaylissner.com,2012-02-11:posts/2012/02/11/adding-new-fonts-to-tesseract-3-ocr-engine/</id><summary type="html">
&lt;h2 id="status"&gt;Status&lt;/h2&gt;
&lt;p&gt;I’m attempting to keep this up to date as Tesseract changes. If you have 
corrections, please send them directly using the &lt;a href="https://michaeljaylissner.com/contact"&gt;contact page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I’ve turned off commenting on this article because it was just a 
bunch of people asking for help and never getting any. If you need help 
with these instructions, go to &lt;a href="https://stackoverflow.com/questions/tagged/tesseract"&gt;Stack Overflow&lt;/a&gt; and ask there. If you send
me a link to a question on such a site, I’m much more likely to respond positively.&lt;/p&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;a href="http://code.google.com/p/tesseract-ocr/"&gt;Tesseract&lt;/a&gt; is a great and powerful &lt;span class="caps"&gt;OCR&lt;/span&gt; engine, but their &lt;a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3"&gt;instructions 
for adding a new font&lt;/a&gt; are incredibly long and complicated. At 
CourtListener we have to handle several unusual &lt;a href="http://en.wikipedia.org/wiki/Blackletter"&gt;blackletter fonts&lt;/a&gt;, 
so we had to go through this process a few times. Below I’ve explained the 
process so others may more easily add fonts to their system.&lt;/p&gt;
&lt;p&gt;The process has a few major steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#create-training-documents"&gt;Create training documents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="#train-tesseract"&gt;Teach Tesseract about the documents&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="create-training-documents"&gt;Create training documents&lt;/h2&gt;
&lt;p&gt;To create training documents, open up &lt;span class="caps"&gt;MS&lt;/span&gt; Word or LibreOffice and paste in the contents of the attached file named ‘standard-training-text.txt’. This file contains the training text that is used by Tesseract for the included fonts.&lt;/p&gt;
&lt;p&gt;Set your line spacing to at least 1.5, and space out the letters by about 1pt. using character spacing. I’ve attached a sample doc too, if that helps. Set the text to the font you want to use, and save it as font-name.doc.&lt;/p&gt;
&lt;p&gt;Save the document as a &lt;span class="caps"&gt;PDF&lt;/span&gt; (call it [lang].font-name.exp0.pdf, with lang being an &lt;a href="http://www.sil.org/iso639-3/iso-639-3_Name_Index_20120203.tab"&gt;&lt;span class="caps"&gt;ISO&lt;/span&gt;-639 three letter abbreviation&lt;/a&gt; for your language), and then use the following command to convert it to a 300dpi tiff (requires imagemagick):&lt;/p&gt;
&lt;p&gt;&lt;code&gt;convert -density 300 -depth 4 lang.font-name.exp0.pdf lang.font-name.exp0.tif&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;You’ll now have a good training image called lang.font-name.exp0.tif. If you’re adding multiple fonts, or bold, italic or underline, repeat this process multiple times, creating one doc → pdf →  tiff per font variation.&lt;/p&gt;
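&lt;p&gt;When you have several variations, the file names are the easiest thing to get wrong. Here is a small sketch that prints the conversion command for each variation so you can check the names before running anything. The function name &lt;code&gt;training_image_command&lt;/code&gt; and the three variation names are made up for illustration; swap in your own.&lt;/p&gt;

```shell
# Build the ImageMagick command for one font variation.
# It prints the command instead of running it, so nothing needs to be installed.
training_image_command() {
    lang=$1    # three-letter language code, e.g. eng
    font=$2    # font name as it should appear in the file names
    base="$lang.$font.exp0"
    echo "convert -density 300 -depth 4 $base.pdf $base.tif"
}

# Hypothetical variation names; substitute the fonts you are training.
for font in old-english old-english-bold old-english-italic; do
    training_image_command eng "$font"
done
```

&lt;p&gt;Once the printed commands look right, you can paste them into your terminal one at a time.&lt;/p&gt;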
&lt;h2 id="train-tesseract"&gt;Train Tesseract&lt;/h2&gt;
&lt;p&gt;The next step is to run tesseract over the image(s) we just created, and to see how well it can do with the new font. After it’s taken its best shot, we then give it corrections. It’ll provide us with a box file, which is just a file containing x,y coordinates of each letter it found along with what letter it thinks it is. So let’s see what it can do:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="n"&gt;tesseract&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tif&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp0&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nochop&lt;/span&gt; &lt;span class="n"&gt;makebox&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;You’ll now have a file called lang.font-name.exp0.box, and you’ll need to open it in a box-file editor. There are a bunch of these &lt;a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Box_File_Editors"&gt;on the Tesseract wiki&lt;/a&gt;. The one that works for me (on Ubuntu) is &lt;a href="http://code.google.com/p/moshpytt/"&gt;moshpytt&lt;/a&gt;, though it doesn’t support multi-page tiffs. If you need to use a multi-page tiff, see &lt;a href="http://code.google.com/p/moshpytt/issues/detail?id=2"&gt;the issue on the topic&lt;/a&gt; for tips. Once you’ve opened it, go through &lt;strong&gt;every&lt;/strong&gt; letter, and make sure it was detected correctly. If a letter was skipped, add it as a row to the box file. Similarly, if two letters were detected as one, break them up into two lines.&lt;/p&gt;
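&lt;p&gt;After hand-editing, it can be worth checking that no row lost a field along the way. The sketch below assumes the Tesseract 3 box layout as I understand it: one row per letter, holding the letter, then the left, bottom, right, and top coordinates, then a page number. The function name &lt;code&gt;check_box&lt;/code&gt; and the sample coordinates are made up for the example.&lt;/p&gt;

```shell
# Succeeds only if every row has six fields:
# letter, left, bottom, right, top, page.
check_box() {
    awk 'NF != 6 { print "row " NR " has " NF " fields: " $0; bad = 1 }
         END { exit bad }' "$1"
}

demo=$(mktemp -d)                      # throwaway directory for the sample file
box="$demo/eng.font-name.exp0.box"
# Made-up coordinates for the letters T, h, e on page 0.
printf '%s\n' 'T 36 92 53 116 0' 'h 55 92 70 116 0' 'e 72 92 85 108 0' | tee "$box"

if check_box "$box"; then
    echo "box file looks well formed"
fi
```

&lt;p&gt;This only catches structural slips, such as a row that lost its letter. It can’t tell you whether a letter was identified correctly, so it doesn’t replace going through the file by eye.&lt;/p&gt;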
&lt;p&gt;When that’s done, you feed the box file back into tesseract:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="n"&gt;tesseract&lt;/span&gt; &lt;span class="n"&gt;eng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tif&lt;/span&gt; &lt;span class="n"&gt;eng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;font&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;box&lt;/span&gt; &lt;span class="n"&gt;nobatch&lt;/span&gt; &lt;span class="n"&gt;box&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Next, you need to detect the character set used in all your box files:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="n"&gt;unicharset_extractor&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;box&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When that’s complete, you need to create a &lt;code&gt;font_properties&lt;/code&gt; file. It should list every font you’re training, one per line, and identify whether it has the following characteristics: &amp;lt;fontname&amp;gt; &amp;lt;italic&amp;gt; &amp;lt;bold&amp;gt; &amp;lt;fixed&amp;gt; &amp;lt;serif&amp;gt; &amp;lt;fraktur&amp;gt;&lt;/p&gt;
&lt;p&gt;So, for example, if you use the standard training data, you might end up with a file like this:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;eng.arial.box 0 0 0 0 0
eng.arialbd.box 0 1 0 0 0
eng.arialbi.box 1 1 0 0 0
eng.ariali.box 1 0 0 0 0
eng.b018012l.box 0 0 0 1 0
eng.b018015l.box 0 1 0 1 0
eng.b018032l.box 1 0 0 1 0
eng.b018035l.box 1 1 0 1 0
eng.c059013l.box 0 0 0 1 0
eng.c059016l.box 0 1 0 1 0
eng.c059033l.box 1 0 0 1 0
eng.c059036l.box 1 1 0 1 0
eng.cour.box 0 0 1 1 0
eng.courbd.box 0 1 1 1 0
eng.courbi.box 1 1 1 1 0
eng.couri.box 1 0 1 1 0
eng.georgia.box 0 0 0 1 0
eng.georgiab.box 0 1 0 1 0
eng.georgiai.box 1 0 0 1 0
eng.georgiaz.box 1 1 0 1 0
eng.lincoln.box 0 0 0 0 1
eng.old-english.box 0 0 0 0 1
eng.times.box 0 0 0 1 0
eng.timesbd.box 0 1 0 1 0
eng.timesbi.box 1 1 0 1 0
eng.timesi.box 1 0 0 1 0
eng.trebuc.box 0 0 0 1 0
eng.trebucbd.box 0 1 0 1 0
eng.trebucbi.box 1 1 0 1 0
eng.trebucit.box 1 0 0 1 0
eng.verdana.box 0 0 0 0 0
eng.verdanab.box 0 1 0 0 0
eng.verdanai.box 1 0 0 0 0
eng.verdanaz.box 1 1 0 0 0
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Note that this is the standard font_properties file that should be supplied with Tesseract, plus two rows I’ve added for the blackletter fonts I’m training (eng.lincoln.box and eng.old-english.box, the two with the fraktur flag set). You can also see which fonts are included out of the box.&lt;/p&gt;
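&lt;p&gt;A file like this is easy to mistype, so here is a rough sketch of a check that every line has a name followed by exactly five flags, each 0 or 1. The function name &lt;code&gt;check_font_properties&lt;/code&gt; is mine and isn’t part of Tesseract. The two sample rows are copied from the list above.&lt;/p&gt;

```shell
# Succeeds only if every line is a name followed by five 0/1 flags
# (italic, bold, fixed, serif, fraktur).
check_font_properties() {
    awk '
        NF != 6 { print "line " NR ": expected a name and five flags"; bad = 1; next }
        {
            for (i = 2; i != 7; i++)
                if ($i !~ /^[01]$/) { print "line " NR ": flag " $i " is not 0 or 1"; bad = 1 }
        }
        END { exit bad }
    ' "$1"
}

demo=$(mktemp -d)                      # throwaway directory for the sample file
printf '%s\n' 'eng.times.box 0 0 0 1 0' 'eng.old-english.box 0 0 0 0 1' | tee "$demo/font_properties"

if check_font_properties "$demo/font_properties"; then
    echo "font_properties looks well formed"
fi
```

&lt;p&gt;To use it on your real file, call &lt;code&gt;check_font_properties font_properties&lt;/code&gt; from the folder you’re training in.&lt;/p&gt;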
&lt;p&gt;We’re getting near the end. Next, create the clustering data:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;mftraining -F font_properties -U unicharset -O lang.unicharset *.tr 
cntraining *.tr
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If you want, you can &lt;a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Dictionary_Data_(Optional)"&gt;create a wordlist&lt;/a&gt; or a &lt;a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#The_last_file_(unicharambigs)"&gt;unicharambigs file&lt;/a&gt;. If you don’t plan on doing that, the last step is to combine the various files we’ve created. &lt;/p&gt;
&lt;p&gt;To do that, rename each of the language files (normproto, Microfeat, inttemp, pffmtable) to have your lang prefix, and run (mind the dot at the end):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="n"&gt;combine_tessdata&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
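&lt;p&gt;The renaming step before that command can be done with a short loop. This sketch runs in a throwaway folder with empty stand-in files, so it’s safe to try. It assumes &lt;code&gt;eng&lt;/code&gt; as the language prefix. In your real training folder you’d skip the &lt;code&gt;mktemp&lt;/code&gt; and &lt;code&gt;touch&lt;/code&gt; lines.&lt;/p&gt;

```shell
lang=eng                               # your language prefix
demo=$(mktemp -d)                      # throwaway folder for the example
cd "$demo"
touch normproto Microfeat inttemp pffmtable    # empty stand-ins for the real files

for f in normproto Microfeat inttemp pffmtable; do
    mv "$f" "$lang.$f"                 # e.g. normproto becomes eng.normproto
done
ls
```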
&lt;p&gt;This will create all the data files you need, and you just need to move them to the correct place on your &lt;span class="caps"&gt;OS&lt;/span&gt;. On Ubuntu, I was able to move them to:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="n"&gt;sudo&lt;/span&gt; &lt;span class="n"&gt;mv&lt;/span&gt; &lt;span class="n"&gt;eng&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;traineddata&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;share&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;tessdata&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;And that, good friend, is it. Worst process for a human, ever.&lt;/p&gt;
&lt;h2 id="enclosures"&gt;Enclosures&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://michaeljaylissner.com/archive/ocr/standard-training-text.txt"&gt;Training data file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://michaeljaylissner.com/archive/ocr/old-english.doc"&gt;Old English example file&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</summary><category term="Tesseract"></category><category term="OCR"></category><category term="HowTo"></category><category term="Font"></category></entry><entry><title>Installing Tracker from Source</title><link href="https://michaeljaylissner.com/posts/2009/02/23/installing-tracker-from-source/" rel="alternate"></link><updated>2009-02-23T00:13:40-08:00</updated><author><name>Mike Lissner</name></author><id>tag:michaeljaylissner.com,2009-02-23:posts/2009/02/23/installing-tracker-from-source/</id><summary type="html">&lt;p&gt;I&amp;#8217;ve been working over the past several weeks on getting Tracker to work better on my system. There are a couple reasons that I&amp;#8217;m doing this. The first is that by default on Ubuntu, Tracker doesn&amp;#8217;t support a number of meta formats (such as the tags in JPEGs, &lt;span class="caps"&gt;ID3&lt;/span&gt; info in MP3s, and the like). The second was that the &lt;span class="caps"&gt;RDF&lt;/span&gt; parsing code in the default Ubuntu version is a bit buggy, and the new version is better. It&amp;#8217;s been a bit of a pain figuring out the install process, so I figured I&amp;#8217;d post here so others might have an easier&amp;nbsp;time.&lt;/p&gt;
&lt;p&gt;The &lt;a href="http://projects.gnome.org/tracker/start.html"&gt;online instructions&lt;/a&gt; say to simply download the code, and to install it. No big deal, right? Well&amp;#8230;in reality, it&amp;#8217;s a bit harder than that. The process I went through was to download the source from &lt;a href="http://projects.gnome.org/tracker/download.html"&gt;here&lt;/a&gt; per the instructions, unpack the source files, and to run the configure&amp;nbsp;command. &lt;/p&gt;
&lt;p&gt;After the configure command is run each time, it will give you a summary of which components will be installed, and which will not. If you have all the dependencies necessary, and include a couple of arguments to the configure command, everything will get installed. If not, certain pieces will be&amp;nbsp;missing. &lt;/p&gt;
&lt;p&gt;The list of dependencies is a bit long, so before you run ./configure, you might as well install them. To do so,&amp;nbsp;run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;sudo aptitude install libgmime-2.0-2a libgmime-2.0-2-dev dbus-glib-1-dev libdbus-glib-1-dev libhal-dev libhal-storage-dev sqlite3-dev libsqlite3-dev libexif-dev libdeskbar-tracker libgsf-1-dev libjpeg62-dev libtiff4-dev libxine-dev libpoppler-dev libgstreamer0.10-dev libpoppler-glib-dev libtotem-plparser-dev libunac1-dev libexempi-dev libraptor1-dev libtracker-gtk-dev libgnome-desktop-dev libnotify-dev
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Once those are installed,&amp;nbsp;run:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;./configure --enable-deskbar-applet --enable-tracker-applet --prefix&lt;span class="o"&gt;=&lt;/span&gt;/usr --sysconfdir&lt;span class="o"&gt;=&lt;/span&gt;/etc
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;After this, it should say that pretty much everything will be installed. If so, you can proceed to the commands below, and once those are complete, the latest version should be installed with full&amp;nbsp;functionality.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;make
sudo make install
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;If after running your configure script (&lt;code&gt;./configure&lt;/code&gt;) it doesn&amp;#8217;t 
indicate that everything will be installed, put in a comment below, and we&amp;#8217;ll 
see what we can&amp;nbsp;do.&lt;/p&gt;</summary><category term="Tracker"></category><category term="install"></category><category term="howto"></category><category term="desktop search"></category></entry></feed>