Michael Jay Lissner

Converting PDF Files to HTML

For my final project, we are considering posting court cases on our site, so I did some work today analyzing how best to convert the PDF files the courts give us into HTML that people can actually use. I looked briefly at Google Docs, since it has an amazing tool that converts PDF files to something resembling text, but short of spending a few days hacking the site, I couldn’t figure out any easy way to use their technology in an automated way.

The other two tools I have looked at today are pdftotext and pdftohtml, which, not surprisingly, do what their names claim they do. Since we’re going to be pulling cases from the 13 federal circuit courts, I wanted to figure out which method works best for which court, and which method will provide us with the most generalizable solution across whatever PDF a court may crank out.

The short version is that the best option seems to be:

pdftotext -htmlmeta -layout -enc 'UTF-8' yourfile.pdf

This creates an HTML file with the text of the case laid out as faithfully as possible, some basic HTML metadata added, and UTF-8 encoding applied.
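
To run the same command over a whole directory of opinions, it can be wrapped in a short script. What follows is only a minimal sketch, assuming pdftotext is installed and on the PATH. The function names and the choice to write each .html file next to its PDF are hypothetical, chosen for illustration.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, html_path):
    """Return the argument list for the pdftotext call shown above."""
    return [
        "pdftotext",
        "-htmlmeta",       # wrap the text in a basic HTML page with meta tags
        "-layout",         # keep the original physical layout as far as possible
        "-enc", "UTF-8",   # write the output as UTF-8
        str(pdf_path),
        str(html_path),    # explicit output name (hypothetical naming scheme)
    ]


def convert_directory(pdf_dir):
    """Convert every PDF in pdf_dir, writing name.html next to name.pdf."""
    converted = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        html_path = pdf_path.with_suffix(".html")
        # check=True raises CalledProcessError if pdftotext exits non-zero.
        subprocess.run(build_pdftotext_command(pdf_path, html_path), check=True)
        converted.append(html_path)
    return converted
```

Calling convert_directory on a folder of downloaded opinions returns the list of HTML files it wrote, and stops at the first PDF that pdftotext cannot process.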

Before coming to this conclusion though, I looked at two settings that pdftohtml has. With the -c argument, it can generate a ‘complex’ HTML document that closely resembles the original. Without the -c argument, it creates a simpler document. Although the complex documents are rather impressive in appearance, they’re abysmal when it comes to the quality of the HTML that is generated. For an example, look at the source code for this file. If, on the other hand, the -c argument is omitted and the simple documents are generated, the final product looks worse than the plain text documents created by pdftotext. Check out this one for example.
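
To repeat this comparison on another PDF, the three variants can be generated side by side. This is a rough sketch, not the exact procedure used for this post. The function name and the output file names are hypothetical, and pdftohtml treats the output name as a base name, so it may write several files derived from it.

```python
from pathlib import Path


def comparison_commands(pdf_path, out_dir):
    """Build one command per conversion variant for a single PDF."""
    pdf = Path(pdf_path)
    out = Path(out_dir)
    stem = pdf.stem
    return {
        # Plain text wrapped in minimal HTML (the option recommended above).
        "pdftotext": [
            "pdftotext", "-htmlmeta", "-layout", "-enc", "UTF-8",
            str(pdf), str(out / f"{stem}-text.html"),
        ],
        # -c asks pdftohtml for the 'complex' output that mimics the page.
        "pdftohtml complex": [
            "pdftohtml", "-c", str(pdf), str(out / f"{stem}-complex.html"),
        ],
        # Without -c, pdftohtml produces the simple output.
        "pdftohtml simple": [
            "pdftohtml", str(pdf), str(out / f"{stem}-simple.html"),
        ],
    }
```

Each command in the returned dictionary can then be passed to subprocess.run(cmd, check=True), and the resulting files opened in a browser to compare them.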

For thoroughness, here is a table containing the results from this test.

Court            pdftotext  pdftohtml complex  pdftohtml simple  Original PDF
1st              The First Circuit publishes in HTML format by default
2nd              link       link               link              link
3rd              link       link               link              link
4th              link       link               link              link
5th              link       link               link              link
6th              link       link               link              link
7th              link       link               link              link
8th              link       link               link              link
9th              link       link               link              link
10th             link       link               link              link
11th             link       link               link              link
DC Circuit       link       link               link              link
Federal Circuit  link       link               link              link

A caveat regarding pdftotext: it is developed by a company called Glyph & Cog. Although the code is open source, I couldn’t for the life of me figure out how to file a bug against it. That doesn’t bode well for something we would rely on as a dependency. On the flip side, Glyph & Cog is happy to provide support for the product.

Published

Feb 6, 2010

Category

Tech

  • Unless mentioned otherwise, all material on this site is licensed under a Creative Commons copyright or the GNU Affero GPL. Privacy Policy.