<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Michael Jay Lissner</title><link href="https://michaeljaylissner.com/" rel="alternate"></link><link href="https://michaeljaylissner.com/feeds/tag/pdftohtml" rel="self"></link><id>https://michaeljaylissner.com/</id><updated>2010-02-06T15:03:18-08:00</updated><entry><title>Converting PDF Files to HTML</title><link href="https://michaeljaylissner.com/posts/2010/02/06/converting-pdf-files-to-html/" rel="alternate"></link><updated>2010-02-06T15:03:18-08:00</updated><author><name>Mike Lissner</name></author><id>tag:michaeljaylissner.com,2010-02-06:posts/2010/02/06/converting-pdf-files-to-html/</id><summary type="html">&lt;p&gt;For my final project, we are considering posting court cases on our site, so I did some work today analyzing how best to convert the &lt;span class="caps"&gt;PDF&lt;/span&gt; files the courts give us into &lt;span class="caps"&gt;HTML&lt;/span&gt; that people can actually use. I looked briefly at Google Docs, since it has an amazing tool that converts &lt;span class="caps"&gt;PDF&lt;/span&gt; files to something resembling text, but short of spending a few days hacking the site, I couldn&amp;#8217;t figure out an easy way to use their technology in an automated&amp;nbsp;way.&lt;/p&gt;
&lt;p&gt;The other two tools I have looked at today are &lt;a href="http://www.foolabs.com/xpdf/"&gt;pdftotext&lt;/a&gt; and &lt;a href="http://pdftohtml.sourceforge.net/"&gt;pdftohtml&lt;/a&gt;, which, not surprisingly, do what their names claim they do. Since we&amp;#8217;re going to be pulling cases from the 13 federal circuit courts, I wanted to figure out which method works best for which court, and which method will provide us with the most generalizable solution across whatever &lt;span class="caps"&gt;PDF&lt;/span&gt; a court may crank&amp;nbsp;out.&lt;/p&gt;
&lt;p&gt;The short version is that the best option seems to&amp;nbsp;be:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;pdftotext -htmlmeta -layout -enc &lt;span class="s1"&gt;&amp;#39;UTF-8&amp;#39;&lt;/span&gt; yourfile.pdf
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This creates an &lt;span class="caps"&gt;HTML&lt;/span&gt; file with the text of the case laid out as faithfully as possible, some basic &lt;span class="caps"&gt;HTML&lt;/span&gt; metadata added, and &lt;span class="caps"&gt;UTF&lt;/span&gt;-8&amp;nbsp;encoding.&lt;/p&gt;
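&lt;p&gt;Since the goal is to run this over many cases, here is a minimal Python sketch of how the same command might be applied to a whole directory of &lt;span class="caps"&gt;PDF&lt;/span&gt; files. The function names and the directory argument are hypothetical, not part of any existing code; the sketch assumes pdftotext is installed and on the path, and it leaves the output file name to pdftotext&amp;#8217;s&amp;nbsp;default.&lt;/p&gt;

```python
import subprocess
from pathlib import Path


def build_command(pdf_path):
    """Return the pdftotext argument list for one PDF (same flags as above)."""
    return ["pdftotext", "-htmlmeta", "-layout", "-enc", "UTF-8", str(pdf_path)]


def convert_all(pdf_dir):
    """Run pdftotext on every .pdf in pdf_dir; return the PDFs that failed."""
    failed = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # A non-zero exit status is treated as a failed conversion.
        result = subprocess.run(build_command(pdf_path))
        if result.returncode != 0:
            failed.append(pdf_path)
    return failed
```

&lt;p&gt;Calling convert_all on a directory of downloaded opinions would return the list of files that did not convert cleanly, which is a reasonable starting point for spotting courts whose &lt;span class="caps"&gt;PDF&lt;/span&gt; files need special&amp;nbsp;handling.&lt;/p&gt;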
&lt;p&gt;Before coming to this conclusion, though, I looked at two modes that pdftohtml offers. With the -c argument, it generates a &amp;#8216;complex&amp;#8217; &lt;span class="caps"&gt;HTML&lt;/span&gt; document that closely resembles the original. Without the -c argument, it creates a simpler document. Although the complex documents are rather impressive in appearance, the quality of the generated &lt;span class="caps"&gt;HTML&lt;/span&gt; code is abysmal. For an example, look at the source code for &lt;a href="/archive/shared/pdf-to-html-test/pdftohtml-complex-noframes-noimages-2ndCircuit-08-6301-cv_opn.html"&gt;this file&lt;/a&gt;. If, on the other hand, the -c argument is omitted and the simple documents are generated, the final product looks worse than the plain text documents created by pdftotext. Check out &lt;a href="/archive/shared/pdf-to-html-test/pdftohtml-simple-noframes-noimages-2ndCircuit-08-6301-cv_opn.html"&gt;this one&lt;/a&gt; for&amp;nbsp;example.&lt;/p&gt;
&lt;p&gt;For thoroughness, here is a table containing the results from this test.
&lt;table&gt;
&lt;tr&gt;
  &lt;th&gt;Court&lt;/th&gt;
  &lt;th&gt;pdftotext&lt;/th&gt;
  &lt;th&gt;pdftohtml complex&lt;/th&gt;
  &lt;th&gt;pdftohtml simple&lt;/th&gt;
  &lt;th&gt;Original &lt;span class="caps"&gt;PDF&lt;/span&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;1&lt;sup&gt;st&lt;/sup&gt;&lt;/td&gt;
  &lt;td colspan="4" align="center"&gt;The first circuit publishes in &lt;span class="caps"&gt;HTML&lt;/span&gt; Format by default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;2&lt;sup&gt;nd&lt;/sup&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftotext-layout-htmlmeta-utf-8-2ndCircuit-08-6301-cv_opn.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-complex-noframes-noimages-2ndCircuit-08-6301-cv_opn.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-simple-noframes-noimages-2ndCircuit-08-6301-cv_opn.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/2ndCircuit-08-6301-cv_opn.pdf"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;3&lt;sup&gt;rd&lt;/sup&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftotext-layout-htmlmeta-utf-8-3rdCircuit-091225p.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-complex-noframes-noimages-3rdCircuit-091225p.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-simple-noframes-noimages-3rdCircuit-091225p.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/3rdCircuit-091225p.pdf"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;4&lt;sup&gt;th&lt;/sup&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftotext-layout-htmlmeta-utf-8-4thCircuit-082373.P.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-complex-noframes-noimages-4thCircuit-082373.P.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-simple-noframes-noimages-4thCircuit-082373.P.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/4thCircuit-082373.P.pdf"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;5&lt;sup&gt;th&lt;/sup&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftotext-layout-htmlmeta-utf-8-5thCircuit-07-30815-CR0.wpd.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-complex-noframes-noimages-5thCircuit-07-30815-CR0.wpd.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-simple-noframes-noimages-5thCircuit-07-30815-CR0.wpd.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/5thCircuit-07-30815-CR0.wpd.pdf"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;6&lt;sup&gt;th&lt;/sup&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftotext-layout-htmlmeta-utf-8-6thCircuit-10a0023p-06.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-complex-noframes-noimages-6thCircuit-10a0023p-06.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-simple-noframes-noimages-6thCircuit-10a0023p-06.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/6thCircuit-10a0023p-06.pdf"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;7&lt;sup&gt;th&lt;/sup&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftotext-layout-htmlmeta-utf-8-7thCircuit-UZ1FFY4T.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-complex-noframes-noimages-7thCircuit-UZ1FFY4T.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-simple-noframes-noimages-7thCircuit-UZ1FFY4T.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/7thCircuit-UZ1FFY4T.pdf"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8&lt;sup&gt;th&lt;/sup&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftotext-layout-htmlmeta-utf-8-8thCircuit-071306U.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-complex-noframes-noimages-8thCircuit-071306U.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-simple-noframes-noimages-8thCircuit-071306U.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/8thCircuit-071306U.pdf"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;9&lt;sup&gt;th&lt;/sup&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftotext-layout-htmlmeta-utf-8-9thCircuit-07-55393.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-complex-noframes-noimages-9thCircuit-07-55393.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-simple-noframes-noimages-9thCircuit-07-55393.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/9thCircuit-07-55393.pdf"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;10&lt;sup&gt;th&lt;/sup&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftotext-layout-htmlmeta-utf-8-10thCircuit-06-6247.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-complex-noframes-noimages-10thCircuit-06-6247.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-simple-noframes-noimages-10thCircuit-06-6247.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/10thCircuit-06-6247.pdf"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;11&lt;sup&gt;th&lt;/sup&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftotext-layout-htmlmeta-utf-8-11thCircuit-200814991.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-complex-noframes-noimages-11thCircuit-200814991.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-simple-noframes-noimages-11thCircuit-200814991.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/11thCircuit-200814991.pdf"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;span class="caps"&gt;DC&lt;/span&gt; Circuit&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftotext-layout-htmlmeta-utf-8-DC-Circuit-07-3125-1229519.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-complex-noframes-noimages-DC-Circuit-07-3125-1229519.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-simple-noframes-noimages-DC-Circuit-07-3125-1229519.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/DC-Circuit-07-3125-1229519.pdf"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Federal Circuit&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftotext-layout-htmlmeta-utf-8-FederalCircuit-09-1361.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-complex-noframes-noimages-FederalCircuit-09-1361.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/pdftohtml-simple-noframes-noimages-FederalCircuit-09-1361.html"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&lt;a href="https://michaeljaylissner.com/archive/pdf-to-html-test/FederalCircuit-09-1361.pdf"&gt;&lt;em&gt;link&lt;/em&gt;&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A caveat regarding pdftotext:&lt;/strong&gt; This tool is developed by a company called &lt;a href="http://www.glyphandcog.com/index.html"&gt;Glyph &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; Cog&lt;/a&gt;. Although the code is open source, I couldn&amp;#8217;t for the life of me figure out how to file a bug against it, which doesn&amp;#8217;t bode well for using it as a dependency. On the flip side, Glyph &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; Cog is happy to provide support for the&amp;nbsp;product.&lt;/p&gt;</summary><category term="pdftotext"></category><category term="pdftohtml"></category><category term="pdf"></category><category term="Final Project"></category><category term="CourtListener"></category></entry></feed>