Our New Zealand Blog

If you seek our blog about our journey on the Te Araroa trail, it lies down this path.

Snow Mountain Distance Map

I put on my cartographer's hat today and made a distance map for Snow Mountain. The regular map for this area doesn't have any distance markers on it, so these are based on talking with the ranger in the park. He seemed knowledgeable, so these numbers are probably the best we'll get for a while.

If somebody discovers that the rangers in Snow Mountain finally have firm distances, let me know what they are, and I'll happily update this map. Until then, attached is the map as both an SVG and a PNG. Enjoy.

Best Energy Pack for Ultralight, Long-Distance Backpacking

Obviously, the best ultralight battery pack is no battery pack, but on our trip we're going to be bringing a phone, a camera, an MP3 player and two headlamps. The phone doubles as our GPS, and the headlamps might be USB rechargeable. Having spare energy to get us through longer sections is important.

I've done some research into which options are best. As I reviewed the many battery pack and solar charger solutions out there, I came to a few conclusions right off the bat:

  • The models with turbines, like the Eton BoostTurbine2000, are crap. I called Eton to ask them how many cranks it would take to charge a phone. The technical support guy's response: "You'd be at it all day. These are meant for emergency use only." So turbines are out.
  • The idea of using a solar panel is great, but doing so will suck. I looked far and wide for an ultralight solar panel, but they're all heavier than just getting a battery pack. The problem with solar panels is that they need to have their own battery packs, so you end up carrying the extra battery pack and the panel itself. If you're out for really long trips, these are probably worth it, but for trips where you'll be in and out of towns, there are lighter options.
  • There are a million wannabe battery pack manufacturers, like Timetec, Powergen, Rokit and Unu. They're strong on branding, but when you go to their website — if they have one — you'll find more marketing but not much real innovation or information. For example, in my attached chart, there are many blank spots and question marks for these brands.

Looking for battery packs for backpackers means finding one that's light, durable and efficient. There are a few features you probably want in particular:

  • You want a battery pack that will charge your devices as quickly as they support. Many battery packs, particularly the small ones, only provide 1 amp of output from a single USB port. If you have a device that can handle more (like a tablet), it'll charge quite slowly. The best ones have as much as 2.5A output, and will adapt to send the right amount of power to your device.
  • You want a battery pack that charges quickly when you plug it into the wall. You don't want to get stuck in town overnight because of how slowly your battery pack charges.
  • You want a battery pack that offers "pass through charging". This allows you to plug the battery into the wall, and plug your device into the battery -- simultaneously -- allowing you to charge both at the same time. Most battery packs don't offer this.
  • Finally, you want the most energy packed into the smallest, lightest device.

When looking at all these features, there are three standout energy packs: the Innergie PocketCell Duo, the Anker Astro 5600 and the Just Mobile Gum++.

The Anker Astro ($30) is the lightest of the bunch, but it charges slowly when plugged in and has a relatively complex body. It has a flashlight, for what that's worth, and, at 4.2 ounces, has an energy density of 1333 mAh/oz. It does not appear to have pass-through functionality.

The Just Mobile Gum++ ($90) is a very simple, lightweight and durable option. At 4.6 ounces, it's still very light. It has a slightly lower energy density than the Astro (1304 mAh/oz), but boasts an ABS shell, the fastest charge rate when plugged in (2500 mA), and just generally looks like a good, simple option.

Finally, there's the Innergie PocketCell Duo ($90). It's the most powerful of the bunch at 6800 mAh, has the highest energy density (1456 mAh/oz), and has two output ports (both at 2100 mA). Unfortunately, unlike the Gum++, it only charges at 1500 mA, so it will take longer to charge from the wall.
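
For the curious, here's how those energy density figures shake out. A quick Python sketch (note that the Gum++ capacity of about 6000 mAh and the PocketCell's weight of about 4.7 ounces are back-calculated from the numbers above, not pulled from spec sheets):

# Energy density (mAh per ounce) for the three contenders.
# The Gum++ capacity and the PocketCell Duo's weight are inferred
# from the densities quoted above rather than from spec sheets.
packs = {
    "Anker Astro 5600": (5600, 4.2),
    "Just Mobile Gum++": (6000, 4.6),
    "Innergie PocketCell Duo": (6800, 4.67),
}

for name, (capacity_mah, weight_oz) in packs.items():
    print(f"{name}: {capacity_mah / weight_oz:.0f} mAh/oz")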

The PocketCell and Gum++ are both very new products and don't have many reviews. The Anker has been around a while and has decent reviews, though there are some upset customers who aren't happy with their products.

Here's my recommendation: If you need a battery pack that can charge two devices at the same time, get the Innergie PocketCell Duo. If not, get the Just Mobile Gum++. If you want to do your own analysis, check out the attached spreadsheet and go wild.

Update: Since Innergie does not mention whether their device supports pass-through charging, I've contacted them via their contact form and via Twitter. After more than two months, they've been entirely non-responsive. There is another person on their Twitter stream complaining (loudly) that they never responded to his broken-device complaint. I can't recommend Innergie despite their slick device, great marketing, etc. Something is going wrong at that company.

New Tool to Remove Dead Feeds from OPML Files

Since Google Reader's closure is imminent, a lot of folks are looking for solutions. One problem I've run into many times over the years is that the feeds I have in Google Reader are largely dead, and there's no way to get rid of the ones that are no longer updated or are simply gone.

So I built a tool.

Check it out on BitBucket and give it a whirl. It'll go through an OPML file, check all the feeds in it, and make a new file for you that has them cleaned out. In my case, it purged about 20% of the feeds I had -- a big improvement.
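
If you'd rather roll your own, the core of the idea is simple. Here's a rough Python sketch (not the actual tool) that treats any feed that errors out or returns a non-200 status as dead; it assumes the requests library and an OPML file called subscriptions.opml:

# Rough sketch: strip dead feeds out of an OPML file.
# Any feed that errors or doesn't return HTTP 200 is treated as dead.
import xml.etree.ElementTree as ET

import requests

def is_alive(url):
    try:
        return requests.get(url, timeout=10).status_code == 200
    except requests.RequestException:
        return False

tree = ET.parse("subscriptions.opml")
# Snapshot the element lists first so we can safely remove dead
# <outline> entries while walking the tree.
for parent in list(tree.iter()):
    for outline in list(parent):
        url = outline.get("xmlUrl")
        if url and not is_alive(url):
            parent.remove(outline)

tree.write("subscriptions-cleaned.opml", encoding="utf-8", xml_declaration=True)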

It's a simple tool, but look, if your friend hasn't updated their blog in two years...they're probably not going to. If the feed is gone, you no longer need it cluttering your life. If a feed has been redirected, you should fix that in your reader! Little things, right?

Desolation Wilderness

Z and I went to Desolation Wilderness this weekend in preparation for our thru-hike of the Te Araroa trail in November. It was a great trip, and photos are here. If you need permission to see them, let me know.

The route

Our route brought us to Wrights Lake at about 10pm on Friday night. From there, we hiked up to Twin Lakes and found a spot by the south western edge. This was pretty lucky. Getting across the lake's outlet at midnight when we got there was a pain, especially for me, since I lacked a flashlight.

On Saturday, we hiked halfway to Island Lakes, then south to an east-west ridge that forms one of the edges of Mt. Price. This ridge was stupidly dangerous, hopping from boulder to boulder with no safety net and a giant fall below. In wet or snowy conditions, it'd be impossible, but with some patience we were able to do it without too much trouble. Once we had traversed the ridge to the east, walking up the southern side of Mt. Price was a cakewalk.

After spending some time at the top, where we observed some friendly (but skinny) marmots, we headed east towards Mosquito Pass. Getting down from the top was a hairy mess, with lots of rocky cliffs we had to very carefully pick our way down. Eventually, we made it off the cliffs and into the snow fields, which happily were soft enough to glissade down. We didn't have ice axes, so if we couldn't glissade, we would have had to wait for the sun to get higher and melt more snow. The snow here was full of medium-sized sun cups, which Z naturally (and quite understandably, given her running shoes) slid around on and hated. I hated them too, but my shoes at least afforded me some traction. The snow went down almost to the top of the ridge between Clyde and Aloha lakes, but not quite, and once we got there, we clambered over the rocks for a while until arriving back on trail at Mosquito Pass. From here, we walked along Lake Aloha to camp at its southern edge. Mozzies were pretty bad here, but bearable.

On Sunday, we woke up, hiked back over Mosquito Pass, and beyond it through a long valley, up and over Rockbound Pass. From there we hiked down, back into the mosquito zone, and beyond to the car.

Regrets

  • I forgot to put new batteries in my flashlight, and the old ones were all but dead.
  • The deet that we brought was old and didn't seem to work too well.
  • Z needs real shoes.
  • Ice axes would have been safer.

Awesome things

  • Lake Aloha is amazing but crowded.
  • The snow was just right, despite it being WAY too early for a trip to the High Sierras.
  • My legs again were abused: Scratches from the night hike with no flashlight, and sunburns from poor sunscreen application.
  • The itinerary pushed our muscles without injury (unlike last time).
  • Z is a natural backpacker and a great hiking partner.

How to help end Boy Scouts of America's ban on gays

On May 3rd, the Boy Scouts are considering lifting their ban on gays, and are putting a vote to the local and national councils. This means that it's easy to influence the vote by calling in and expressing your opinion. It's simple to do so, and more voices could change the direction of the Boy Scouts of America, allowing all boys to be included and accepted.

Here's what to do:

  1. Find your local council
  2. Give them a call or send them an email telling them your opinion.
  3. Contact the national council via email: feedback@scouting.org.

Most important is to contact your local council. We need their phones to be ringing off the hook with people expressing their opinions. It truly takes no more than two minutes. Here's the San Diego council: (619) 298-6121, and the East Bay council: (925) 674-6100.

If you prefer a form letter, you can just do this one through the Human Rights Campaign (34,000 people already have).

Here's a simple transcript for you to follow:

I'm [calling/writing] to express my desire that the Boy Scouts immediately lift their ban on gays. This ban is discriminatory, outdated and pointless. Scouting teaches many great lessons to thousands of adolescents across the U.S. One lesson that Scouting should not teach is that homosexuality is somehow wrong or grounds for discrimination.

On May 3rd, I hope that you will help the BSA finally make the right decision so it can continue to lead boys into becoming mature and accepting men.

Send this to the email above, and call your local council. This could make a difference.

So you wanna buy a bike

Another of my friends is asking me questions about buying a bike. I love that they do this, but since it's become a trend, I figure I should throw my thoughts together here as a reference.

Buying a bike is actually pretty simple:

  1. You want a used road bike.
  2. You want gears; more is better.
  3. Lighter is better.
  4. Size is important.
  5. Exotic isn't for you (yet).

That's pretty much all there is to it, at least on the macro level. Let's dive into each of these a tad, shall we?

First, you want a used road bike. The reason for this is pretty logical: As a commuter, you'll be riding mostly on...roads! You could get a cruiser (but that defies rule #2), or a mountain bike (defies rule #3) or a recumbent (see rule #5). But ultimately, road bikes are the best for what you want to do: travel quickly and safely to and from work and around town with minimal fuss. New ones cost a fortune, so get one used.

I wish I didn't have to mention rule number two, but for all the hipsters out there, you're wrong about fixies. If you live in San Francisco, a city with special maps just for its hills, you should use the things that were invented to get you up hills: gears! Even if you live in a flat-ish city, you should get gears because they give you flexibility. Sure, your bike is a commuter today, but tomorrow your friend might want to go on a bike ride with you. Or maybe you move to a new city. Who knows? Get gears. Don't be trendy.

Enough said. Rule number three is that lighter is better. If you look at new road bikes, you'll quickly learn that you can spend an incredible amount on a bike. And naturally, the lighter they are, the more expensive they are. So how light is right? Well, this one is tough and somewhat subjective, so I say: find one that's as light as you can get for your dollar. Off the cuff, I'd wager that around $400 is the point of diminishing returns for most people.

Size is important. If it doesn't fit, it's useless. Go to a bike shop and get sized before you do much shopping. It'll help you winnow the stuff you're looking at anyway. This may as well be the first step of your search.

Rule five explains itself. When you know more about bikes, branch out if you care. You probably won't, so save yourself the effort of looking at exotic stuff. It's exotic for a reason.

Bonus questions for the avid reader

What brand should I buy? Doesn't much matter, surprisingly. There are better and worse brands, but if you're buying a used road bike, and follow rule number 3, your goals will be accomplished.

What material should my bike be? Probably steel. Aluminum is good too, but probably out of your price range. Steel's a very reputable material though. If you can find Reynolds steel, all the better.

How do I know the bike I'm buying isn't stolen? Good question! This one's hard. You can look for the serial number or try to only deal with people that seem legit. There is a national bike registry (which you should use!), but otherwise there's not a whole lot you can do...yet.

Comments? Thoughts? Email me. You're probably my friend already if you're reading this...

Enabling Two-Factor Authentication

This post is as much a Public Service Announcement as anything else. I didn't realize that two-factor authentication had finally taken off. It's practically vital for your email account (you're asking for trouble without it), but in the past year or so, a bunch of other services have begun offering it.

Today I went on a little security binge, and found that I could turn on two-factor authentication at:

  • Google/Gmail
  • Yahoo
  • Dropbox
  • Charles Schwab (they send you a fob for free)
  • Facebook
  • Paypal
  • Amazon Web Services

One note about Charles Schwab is that getting their fob is great, but it's hardly all you should do to secure your account. You should also set up what they call a "verbal password" that you have to provide whenever you call in. Without it, it's pretty easy to get into an account via their surprisingly weak phone security.

Anyway, this is a pretty good list so far. The companies are using a handful of different techniques for doing this, but they all seem pretty solid in the end. Google's, naturally, seems to be one of the most robust, but I'm impressed there's so much offered.

Go set these up!

2013 Donations

Long-time friends will probably realize that with the coming of the new year comes a revisit to my annual donations.

This year's donations are larger than in any previous year, but largely follow the same trends as in the past. The larger donations this year (about $1,000 worth) go towards non-profit organizations. The choices this year were hard. After consulting with a few friends, I decided to donate to two new categories: Environmental and Anti-Gun.

Finding a good environmental organization to give your money to is HARD. After a few hours of research, I had looked at many organizations that were doing good work. But a lot of those organizations were still trying to prove that climate change is an issue, or were focused on small-scale issues. These are both noble goals, but I think what we need now are big solutions on an international level. I'm no expert in this topic, by far, but I'm fairly convinced that individual decision making isn't going to solve the problem fast enough. It's great if we all learn to recycle and to consider environmental impact in our daily lives. That, I don't disagree with. But I don't think it's enough. I think we need to start forcing governments and organizations to be cleaner. I'm convinced that so long as the economic incentives that have led to the current behaviors remain in place, the market will follow those incentives. I'm hopeful that my donation to the Center for Climate and Energy Solutions will help bring changes to these incentives.

Finding an anti-gun organization is easier, especially given the current state of affairs after Sandy Hook Elementary School. While I'm not so sure that anti-gun legislation is going to solve any truly big problems, I hope that donating my money here will help strike while the iron is hot. I simply can't believe that the pro-gun 2nd Amendment lobby is as successful as it is, and I am hopeful that we'll be able to change the dialog around guns over the next few years. Gun ownership is trending down in the U.S., and I hope that we can accelerate that trend, bringing an end to the needless gun violence we currently live with.

The other big donations in this year's list go mostly towards organizations that I've donated to in the past. FairVote and Rootstrikers are organizations that work to fix the current political system. Most Americans (about 70%, I believe) agree that the current federal legislative system is corrupt, and these organizations are working to fix that. I'm convinced that until these organizations find success, we won't be able to deal with the small or large issues facing the country, so they continue to get the plurality of my donations ($400 between them). I think the ridiculous fiscal cliff "negotiations" are a testament to how bad things have gotten. Our political system is paralyzed.

Other organizations that did well this year include a handful of open-source foundations that I rely on, but which otherwise give away their work for free. My livelihood and these very donations rely on these bits of infrastructure we take for granted, so I figure I should give them some money to keep 'em going.

Here's the nitty-gritty breakdown of my donations this year (as well as last):

As always, I welcome input on these decisions, and suggestions for the years ahead. Those that made suggestions for this year, I truly appreciate your help.

Year in Review: Travel Edition

I did a lot of traveling this year; more than anybody should ever really do. Since I'm already forgetting all the places I went to, I figured I'd write it all down.

Here's the tally:

Trip | Flights | Distance (miles)
London, Germany, Turkey | 9 | 15,500
Germany for work | 2 | 11,362
Montreal Bike Trip and Visit to Montreal | 2 | 5,221
L.A., San Diego, Colorado backpacking | 5 | 2,586
Olympic Peninsula Backpacking | 2 | 1,356
Paris, Brussels | 2 | 11,092
Law via the Internet Conference at Ithaca | 4 | 5,454
TOTAL | 26 | 52,571

I'm pretty sure some things are left out from the beginning of the year, and I'm still trying to figure out how I ended up doing so much goddamn traveling. For comparison's sake, I must note that the Earth is 24,901 miles around its belly, and my total is more than twice that.

Next year will be another banner year, as I already have seven weddings on the books. I don't think there will be so many trips to Europe though. That'll make the biggest difference.

I just hope I have enough suits. It's gonna be crazy.

Setting up etherpad with postgres on Windows

There don't seem to be any successful instructions out there for setting up Etherpad with Postgres on Windows. It's not that hard, but there are a couple things you need to do.

I haven't gone through these instructions to make sure they work, but this is roughly what I've done to get my Postgres/Windows/Node/Etherpad working together:

  • Install git
  • Install node.js
  • Install python
    • add PYTHON as an env
  • Install postgres
    • add C:\Program Files\PostgreSQL\9.2\bin to your path
  • Download etherpad-lite with git
  • Run the etherpad-lite windows installer per the instructions
  • start etherpad-lite
    • make sure it works with the dirty DB before getting exotic
  • Set up postgres
    • npm install pg (it will throw an error about the msbuild version, but ignore it — the JavaScript driver still gets installed, and that's all you need)
    • add a user using pgadmin
    • add a DB using pgadmin and the user created a second ago
    • reconfigure to use postgres in the settings.json file
  • Run start.bat to make sure it works
  • Turn down the log messages to only ERROR in the settings.json file.
  • Use NSSM to daemonize it, per the instructions here.

Note that NSSM doesn't yet have stdout and stderr redirection built in. Thus, to start the daemon with these working, you have to create a little script like this:

@ECHO OFF
@REM This runs etherpad with stdout and stderr getting redirected to special logs
call D:\etherpad\etherpad-lite\start.bat >> D:\etherpad\etherpad-lite\logs\stdout.log 2>> D:\etherpad\etherpad-lite\logs\stderr.log

Presentation on Juriscraper and CourtListener for LVI2012

Yesterday and today I've been in Ithaca, New York, participating in the Law via the Internet Conference (LVI), where I've been learning tons!

I had the good fortune to have my proposal topic selected for Track 4: Application Development for Open Access and Engagement.

In the interest of sharing, I've attached the latest version of my slides to this Blog post, and the audio for the talk may eventually get posted on the LVI site.

Calculating the average elevation of a trip using a TCX file

If you use a site like ridewithgps, something you may want to know is how to calculate the average elevation for a trip. Unfortunately, most sites don't seem to provide this, so we have to do a little hacking.

Here's what worked for me:

  • download the GPS TCX file
  • grep out the altitude lines (grep -i 'altitude' your_file.tcx)
  • find & replace out the remaining XML tags and whitespace using a basic text editor
  • average the remaining values in a spreadsheet

Takes about five minutes. For my trip the number came out to be 10,753 feet!
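
If you'd rather script it than fuss with grep and a spreadsheet, here's a rough Python sketch of the same process. It assumes a standard TCX file, where altitude samples live in <AltitudeMeters> elements and are in meters:

# Average the altitude samples in a TCX file and report the result in feet.
import xml.etree.ElementTree as ET

METERS_TO_FEET = 3.28084

tree = ET.parse("your_file.tcx")
# TCX elements are namespaced, so match on the tag's suffix.
altitudes = [
    float(el.text)
    for el in tree.iter()
    if el.tag.endswith("AltitudeMeters") and el.text
]

average = sum(altitudes) / len(altitudes) * METERS_TO_FEET
print(f"Average elevation: {average:,.0f} feet over {len(altitudes)} samples")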

URL Hacking at REI.com

I'm about two hours away from heading on vacation to Montreal, but I wanted to post a quick update about a vulnerability I found on REI.com last night.

The vulnerability was a simple one. A few days ago, to get a 15% off coupon, I signed up for their Gear Mail newsletter. It eventually came, and at the bottom it had a link to unsubscribe, which I clicked (I was only after the 15% sign-up coupon).

The link led to:

http://email.rei.com/cgi-bin12/DM/t/nCT4n0N3xbv0ESo05DPf0Et&EmailAddr=ml...

Which redirects to:

https://preferences.rei.com/rei/rei_PrefCtr.asp?EmailAddr=mlissner@micha...

I immediately noticed the badness in these URLs, and at a whim, I tried modifying the URL to use a friend's email address. Sure enough it worked, and I could look up the full name and zip code of anybody who had an email address that was in REI's system.

Around midnight last night, I sent REI an email informing them of the problem, giving them a month to fix it, and I posted on Twitter that I had found a vulnerability on REI.com. Naively, I thought that if I didn't post the link on Twitter, nobody would be able to figure it out, but of course, by morning a friend of mine (a security/privacy researcher, sigh) had found the link and posted it. Not only that, but for fun, he had tried his address book against the link and turned up 30 of his friends' names and zip codes out of a sample of about 200.

I sent another note to REI to make sure that they knew about the link now being in the open, and that the month I promised them had been curtailed by my own mistake.

It's now 7:15pm, about 19 hours after I first informed them of the problem, and it's fixed. It still seems to be possible for me to update your email subscriptions, but at least I can't look up information about you.

New tool for testing lxml XPath queries

I got a bit frustrated today, and decided that I should build a tool to fix my frustration. The problem was that we're using a lot of XPath queries to scrape various court websites, but there was no tool that could be used to test XPath expressions efficiently.

There are a couple tools that are quite similar to what I just built: there's one called Xacobeo, Eclipse has one built in, and even Firebug has a tool that does something similar. Unfortunately though, these each operate on a different DOM interpretation than the one that lxml builds.

So while these tools helped, I consistently found that when the HTML got nasty, they'd start falling over.

No more! Today I built a quick Django app that can be run locally or on a server. It's quite simple. You input some HTML and an XPath expression, and it will tell you the matches for that expression. It has syntax highlighting, and a few other tricks up its sleeve, but it's pretty basic on the whole.
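
If you just want the same behavior at a Python prompt, without the Django wrapping, the core of it looks something like this (the HTML and the XPath expression here are made-up examples; lxml required):

# Evaluate an XPath expression against HTML the same way lxml sees it.
from lxml import html

some_html = """
<html><body>
  <div class="opinion"><a href="/foo.pdf">Smith v. Jones</a></div>
</body></html>
"""

tree = html.fromstring(some_html)
for match in tree.xpath("//div[@class='opinion']/a/text()"):
    print(match)  # -> Smith v. Jones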

I'd love to get any feedback I can about this. It's probably still got some bugs, but it's small enough that they should be quite easy to stamp out.

Update: I got in touch with the developer of Xacobeo. There's an --html flag that you can pass to it at startup, if that's your intention. If you use that, it indeed uses the same DOM parser that my tool does. Sigh. Affordances are important, especially in a GUI-based tool.

Further privacy protections at CourtListener

I've written previously about the lengths we go to at CourtListener to protect people's privacy, and today we completed one more privacy enhancement.

After my last post on this topic, we discovered that although we had already blocked cases from appearing in the search results of all major search engines, we had a privacy leak in the form of our computer-readable sitemaps. These sitemaps contain links to every page within a website, and since those links contain the names of the parties in a case, it's possible that a Google search for the party name could turn up results that should be hidden.

This was problematic, and as of now we have changed the way we serve sitemaps so that they use the noindex X-Robots-Tag HTTP header. This tells search crawlers that they are welcome to read our sitemaps, but that they should avoid serving them or indexing them.
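
If you want to do the same thing on your own Django site, the change is tiny. Here's a minimal sketch (the view and the sitemap body are placeholders, not our actual code):

# Minimal sketch: serve a sitemap with a noindex X-Robots-Tag header in Django.
from django.http import HttpResponse

def sitemap_view(request):
    # Placeholder body; a real view would build the sitemap from the database.
    xml = '<?xml version="1.0" encoding="UTF-8"?><urlset></urlset>'
    response = HttpResponse(xml, content_type="application/xml")
    # Crawlers may fetch the sitemap, but shouldn't index it or serve it
    # in search results.
    response["X-Robots-Tag"] = "noindex"
    return response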

My Presentation Proposal for LVI 2012

The Law Via the Internet conference is celebrating its 20th anniversary at Cornell University on October 7-9th. I will be attending, and with any luck, I'll be presenting on the topic proposed below.

Wrangling Court Data on a National Level

Access to case law has recently become easier than ever: By simply visiting a court's website it is now possible to find and read thousands of cases without ever leaving your home. At the same time, there are nearly a hundred court websites, many of which suffer from poor funding or prioritization, and gaining a higher-level view of the law can be challenging. “Juriscraper” is a new project designed to ease these problems for all those that wish to collect these court opinions daily. The project is under active development, and we are looking for others to get involved.

Juriscraper is a liberally-licensed open source library that can be picked up and used by any organization to scrape the case data from court websites. In addition to simply scraping the websites and extracting metadata from them, Juriscraper has a number of other design goals:

  • Extensibility to support video, oral argument audio, and other media types
  • Support for all metadata provided by court websites
  • Extensibility to support varied geographies and jurisdictions
  • Generalized object-oriented architecture with little or no code repetition
  • Standardized coding techniques using the latest libraries and standards (Python, xpath, lxml, requests, chardet)
  • Simple installation, configuration, and API
  • Friendly and transparent to court websites

As well as a number of features:

  • Harmonization of metadata (US, USA, United States of America, etc. → United States; et al, et. al., etc. get eliminated; vs., v, vs → v.; all dates are Python objects; etc.) — a rough sketch of this follows the list
  • Smart title-casing of case names (several courts provide case names in uppercase only)
  • Sanity checking and sorting of metadata values returned by court websites
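
To make the harmonization point concrete, here's a rough sketch of the kind of normalization involved (illustrative only — the actual rules and regular expressions in Juriscraper differ):

# Illustrative only: normalize a few common variations in scraped case names.
import re

def harmonize(case_name):
    # US, U.S., USA, United States of America -> United States
    case_name = re.sub(
        r"\b(U\.?S\.?A?\.?|United States of America)\b",
        "United States",
        case_name,
    )
    # v, vs, vs. -> v.
    case_name = re.sub(r"\bvs?\.?\s+", "v. ", case_name)
    # Drop trailing "et al." variants.
    case_name = re.sub(r",?\s*et\.?\s+al\.?\s*$", "", case_name)
    return case_name.strip()

print(harmonize("USA vs Smith, et al."))  # -> United States v. Smith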

Once implemented, Juriscraper is part of a two-part system. The second part is the caller, which uses the API, and which itself has to answer some interesting questions:

  • How are duplicates detected and avoided?
  • How can the impact on court websites be minimized?
  • How can mime type detection be completed successfully so that textual contents can be extracted?
  • What should we do if it is an image-based PDF?
  • How should HTML be tidied?
  • How often should we check a court website for new content?
  • What should we do in case of failure?

Juriscraper is currently deployed by CourtListener.com to scrape all of the Federal Appeals courts, and we are slowly adding additional state courts over the coming weeks.

We have been scraping these sites in various ways for several years, and Juriscraper is the culmination of what we've learned. We hope that by presenting our work at LVI 2012, we will be able to share what we have learned and gain additional collaborators in our work.

Adding New Fonts to Tesseract 3 OCR Engine

Update: I've turned off commenting on this article because it was just a bunch of people asking for help and never getting any. If you need help with these instructions, go to Stack Overflow and ask there. If you have corrections to the article, please send them directly to me using the Contact form.

Tesseract is a great and powerful OCR engine, but their instructions for adding a new font are incredibly long and complicated. At CourtListener we have to handle several unusual blackletter fonts, so we had to go through this process a few times. Below I've explained the process so others may more easily add fonts to their system.

The process has a few major steps: creating training documents, training Tesseract on them, and installing the resulting data files.

Create training documents

To create training documents, open up MS Word or LibreOffice, paste in the contents of the attached file named 'standard-training-text.txt'. This file contains the training text that is used by Tesseract for the included fonts.

Set your line spacing to at least 1.5, and space out the letters by about 1pt. using character spacing. I've attached a sample doc too, if that helps. Set the text to the font you want to use, and save it as font-name.doc.

Save the document as a PDF (call it [lang].font-name.exp0.pdf, with lang being an ISO-639 three letter abbreviation for your language), and then use the following command to convert it to a 300dpi tiff (requires imagemagick):

convert -density 300 -depth 4 lang.font-name.exp0.pdf lang.font-name.exp0.tif

You'll now have a good training image called lang.font-name.exp0.tif. If you're adding multiple fonts, or bold, italic or underline, repeat this process multiple times, creating one doc → pdf → tiff per font variation.
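
If you're training several fonts, a tiny script saves some clicking. Here's a rough Python sketch that runs the same ImageMagick command over every exp0 PDF in the current directory (it assumes the file naming convention above and that convert is on your path):

# Convert every [lang].font-name.exp0.pdf in this directory to a 300dpi tiff,
# using the same ImageMagick command as above.
import glob
import subprocess

for pdf in glob.glob("*.exp0.pdf"):
    tif = pdf[: -len(".pdf")] + ".tif"
    subprocess.run(
        ["convert", "-density", "300", "-depth", "4", pdf, tif],
        check=True,
    )
    print("wrote", tif)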

Train Tesseract

The next step is to run tesseract over the image(s) we just created, and to see how well it can do with the new font. After it's taken its best shot, we then give it corrections. It'll provide us with a box file, which is just a file containing x,y coordinates of each letter it found along with what letter it thinks it is. So let's see what it can do:

tesseract lang.font-name.exp0.tif lang.font-name.exp0 batch.nochop makebox

You'll now have a file called lang.font-name.exp0.box, and you'll need to open it in a box-file editor. There are a bunch of these on the Tesseract wiki. The one that works for me (on Ubuntu) is moshpytt, though it doesn't support multi-page tiffs. If you need to use a multi-page tiff, see the issue on the topic for tips. Once you've opened it, go through every letter, and make sure it was detected correctly. If a letter was skipped, add it as a row to the box file. Similarly, if two letters were detected as one, break them up into two lines.

When that's done, you feed the box file back into tesseract:

tesseract eng.font-name.exp0.tif eng.font-name.exp0 nobatch box.train.stderr

Next, you need to extract the character set used in all your box files:

unicharset_extractor *.box

When that's complete, you need to create a font_properties file. It should list every font you're training, one per line, and identify whether it has the following characteristics: <fontname> <italic> <bold> <fixed> <serif> <fraktur>

So, for example, if you use the standard training data, you might end up with a file like this:

eng.arial.box 0 0 0 0 0
eng.arialbd.box 0 1 0 0 0
eng.arialbi.box 1 1 0 0 0
eng.ariali.box 1 0 0 0 0
eng.b018012l.box 0 0 0 1 0
eng.b018015l.box 0 1 0 1 0
eng.b018032l.box 1 0 0 1 0
eng.b018035l.box 1 1 0 1 0
eng.c059013l.box 0 0 0 1 0
eng.c059016l.box 0 1 0 1 0
eng.c059033l.box 1 0 0 1 0
eng.c059036l.box 1 1 0 1 0
eng.cour.box 0 0 1 1 0
eng.courbd.box 0 1 1 1 0
eng.courbi.box 1 1 1 1 0
eng.couri.box 1 0 1 1 0
eng.georgia.box 0 0 0 1 0
eng.georgiab.box 0 1 0 1 0
eng.georgiai.box 1 0 0 1 0
eng.georgiaz.box 1 1 0 1 0
eng.lincoln.box 0 0 0 0 1
eng.old-english.box 0 0 0 0 1
eng.times.box 0 0 0 1 0
eng.timesbd.box 0 1 0 1 0
eng.timesbi.box 1 1 0 1 0
eng.timesi.box 1 0 0 1 0
eng.trebuc.box 0 0 0 1 0
eng.trebucbd.box 0 1 0 1 0
eng.trebucbi.box 1 1 0 1 0
eng.trebucit.box 1 0 0 1 0
eng.verdana.box 0 0 0 0 0
eng.verdanab.box 0 1 0 0 0
eng.verdanai.box 1 0 0 0 0
eng.verdanaz.box 1 1 0 0 0

Note that this is the standard font_properties file that should be supplied with Tesseract; I've added the two rows for the blackletter fonts I'm training (eng.lincoln.box and eng.old-english.box). You can also see which fonts are included out of the box.

We're getting near the end. Next, create the clustering data:

mftraining -F font_properties -U unicharset -O lang.unicharset *.tr
cntraining *.tr

If you want, you can create a wordlist or a unicharambigs file. If you don't plan on doing that, the last step is to combine the various files we've created.

To do that, rename each of the language files (normproto, Microfeat, inttemp, pffmtable) to have your lang prefix, and run (mind the dot at the end):

combine_tessdata lang.

This will create all the data files you need, and you just need to move them to the correct place on your OS. On Ubuntu, I was able to move them with:

sudo mv eng.traineddata /usr/local/share/tessdata/

And that, good friend, is it. Worst process for a human, ever.

The Winning Font in Court Opinions

At CourtListener, we're developing a new system to convert scanned court documents to text. As part of our development we've analyzed more than 1,000 court opinions to determine what fonts courts are using.

Now that we have this information, our next step is to create training data for our OCR system so that it specializes in these fonts. For now, we've attached a spreadsheet with our findings, and a script that can be used by others to extract font metadata from PDFs.
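
I won't reproduce the attached script here, but if you want the general idea, here's a rough sketch of one way to pull font names out of a PDF using pdfminer.six (illustrative only — the attached script may work differently):

# Rough sketch: count which fonts appear in a PDF using pdfminer.six.
from collections import Counter

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

def count_fonts(path):
    """Tally the font name of every character pdfminer finds."""
    fonts = Counter()
    for page in extract_pages(path):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextContainer):
                    continue
                for char in line:
                    if isinstance(char, LTChar):
                        fonts[char.fontname] += 1
    return fonts

for font, count in count_fonts("opinion.pdf").most_common():
    print(font, count)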

Unsurprisingly, the top font — drumroll please — is Times New Roman.

Font | Regular | Bold | Italic | Bold Italic | Total
Times | 1454 | 953 | 867 | 47 | 3321
Courier | 369 | 333 | 209 | 131 | 1042
Arial | 364 | 39 | 11 | 41 | 455
Symbol | 212 | 0 | 0 | 0 | 212
Helvetica | 24 | 161 | 2 | 2 | 189
Century Schoolbook | 58 | 54 | 52 | 9 | 173
Garamond | 44 | 42 | 41 | 0 | 127
Palatino Linotype | 36 | 24 | 24 | 1 | 85
Old English | 42 | 0 | 0 | 0 | 42
Lincoln | 27 | 0 | 0 | 0 | 27

Support for x-robots-tag and robots HTML meta tag

As part of our research for our post on how we block search engines, we looked into which search engines support which privacy standards. This information doesn't seem to exist anywhere else on the Internet, so below are our findings, starting with the big guys, and moving towards more obscure or foreign search engines.

Google, Bing

Google (known as Googlebot) and Bing (known as Bingbot) support the x-robots-tag header and the robots HTML tag. Here's Google's page on the topic. And here's Bing's. The msnbot is retired.

Yahoo, AOL

Yahoo!'s search engine is provided by Bing. AOL's is provided by Google. These are easy ones.

Ask, Yandex, Nutch

Ask (known as teoma) and Yandex (Russia's search engine, known as yandex) support the robots meta tag, but do not appear to support the x-robots-tag. Ask's page on the topic is here, and Yandex's is here. The popular open source crawler, Nutch, also supports the robots HTML tag, but not the x-robots-tag header. Update: Newer versions of Nutch now support x-robots-tag!

The Internet Archive, Alexa

The Internet Archive uses Alexa's crawler, which is known as ia_archiver. This crawler does not seem to support either the HTML robots meta tag or the x-robots-tag HTTP header. Their page on the subject is here. I have requested more information from them, and will update this page if I hear back.

Duckduckgo, Blekko, Baidu

Duckduckgo and Blekko support neither the robots meta tag nor the x-robots-tag header, per emails I've had with each of them. I also requested information from Baidu, but their response totally ignored my question and was in Chinese. They do have some information here, but it does not seem to provide any information on the noindex value for the robots tag. In any case, the only way to block these crawlers seems to be via a robots.txt file.
