Final Project

Using Revision Control on a Django Project Without Revealing Your Passwords

Just a quick post today, since this took me way too long to figure out. If you have a Django project that you want to share without sharing the private bits of settings.py, there is an easy way to do this.

I tried for a while to set up Mercurial hooks that would strip out my passwords before each commit and then put them back afterwards, thus avoiding uploading them publicly. This does not work, however, because all of the Mercurial hooks run after snapshots of the modified files have been made. So you can edit the files using a hook, but your edits will only take effect upon the next commit. Clearly, this will not do.

Another solution that I tried was the Mercurial keyword extension. This could work, but ultimately it does not, because you have to remember to run it before and after each commit, which is something I know I'd forget sooner or later.

The solution that does work is to split your settings.py file into multiple pieces, so that there is a private file and a public file. I followed the instructions here, with the resulting code checked in here and here. There is also a file called "20-private.conf", which is not uploaded publicly and which contains all the private bits of code that would normally be found in settings.py. Thus, all of my settings can be found by Django, but I do not have to share my private ones.
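For anyone who can't follow the links, here is a minimal sketch of how this kind of split can work. It is not the exact code I checked in: the function name load_settings, the "settings" directory, and the "10-public.conf" file name are assumptions made up for illustration (only "20-private.conf" comes from my actual setup). The idea is simply that settings.py executes every .conf file in a directory in sorted order, so later files can override earlier ones.

```python
import glob
import os


def load_settings(conf_dir, namespace):
    """Execute every *.conf file in conf_dir, in sorted order, into namespace.

    Files are plain Python. Because they run in name order, a file such as
    20-private.conf can override values set in a (hypothetical) 10-public.conf.
    """
    for path in sorted(glob.glob(os.path.join(conf_dir, "*.conf"))):
        with open(path) as handle:
            code = compile(handle.read(), path, "exec")
        exec(code, namespace)
    return namespace


# In settings.py this might be called as follows (the directory name is an
# assumption, not something Django requires):
#
#   here = os.path.dirname(os.path.abspath(__file__))
#   load_settings(os.path.join(here, "settings"), globals())
```

The private file is then listed in .hgignore so that it never gets committed, while the public files are checked in as usual.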

Converting PDF Files to HTML

For my final project, we are considering posting court cases on our site, so I did some work today analyzing how best to convert the PDF files the courts give us to HTML that people can actually use. I looked briefly at Google Docs, since it has an amazing tool that converts PDF files to something resembling text, but short of spending a few days hacking the site, I couldn't figure out any easy way to use their technology in an automated way.

The other two tools I have looked at today are pdftotext and pdftohtml, which, not surprisingly, do what their names claim they do. Since we're going to be pulling cases from the 13 federal circuit courts, I wanted to figure out which method works best for which court, and which method will provide us with the most generalizable solution across whatever PDF a court may crank out.

The short version is that the best option seems to be:

pdftotext -htmlmeta -layout -enc 'UTF-8' yourfile.pdf

This creates an HTML file with the text of the case laid out as faithfully as possible, with some basic HTML metadata added and the output encoded as UTF-8.
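Since we'll eventually need to run this over many cases, here is a rough sketch of how the same command could be called from Python. The function names build_pdftotext_command and convert_pdf are made up for this example, and it assumes pdftotext is installed and on your PATH. The flags are exactly the ones shown above.

```python
import subprocess


def build_pdftotext_command(pdf_path, output_path=None):
    """Return the pdftotext argument list used in this post.

    output_path is optional; when it is omitted, pdftotext picks the output
    file name itself, based on the name of the PDF.
    """
    command = ["pdftotext", "-htmlmeta", "-layout", "-enc", "UTF-8", pdf_path]
    if output_path is not None:
        command.append(output_path)
    return command


def convert_pdf(pdf_path, output_path=None):
    """Run pdftotext and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_pdftotext_command(pdf_path, output_path), check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with odd file names, which court PDFs seem to have plenty of.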

Before coming to this conclusion, though, I looked at two modes that pdftohtml has. With the -c argument, it generates a 'complex' HTML document that closely resembles the original. Without the -c argument, it creates a simpler document. Although the complex documents are rather impressive in appearance, they're abysmal when it comes to the quality of the HTML code that is generated. For an example, look at the source code for this file. If, on the other hand, the -c argument is omitted and the simple documents are generated, the appearance of the final product is worse than the plain text documents that are created by pdftotext. Check out this one for example.

For thoroughness, here is a table containing the results from this test.

Court | pdftotext | pdftohtml complex | pdftohtml simple | Original PDF
1st | The First Circuit publishes in HTML format by default
2nd | link | link | link | link
3rd | link | link | link | link
4th | link | link | link | link
5th | link | link | link | link
6th | link | link | link | link
7th | link | link | link | link
8th | link | link | link | link
9th | link | link | link | link
10th | link | link | link | link
11th | link | link | link | link
DC Circuit | link | link | link | link
Federal Circuit | link | link | link | link

A caveat regarding pdftotext: This library is developed by a company called Glyph & Cog. Although the code is open source, I couldn't for the life of me figure out how to file a bug against it. This doesn't particularly bode well for using something as a dependency. On the flip side, Glyph & Cog is happy to provide support for the product.

How to Protect Your Open Source Code from Theft and a Mercurial Hook to Help

Updated, 2010-01-24: Some edits regarding the Affero license (thanks to Brian at http://cyberlawcases.com for the corrections).

I've finally begun doing some of the actual coding for my final project, so the time has come to set up a Mercurial repository to hold the code.

Once we complete our project, we will have built a free product that competes with some of the core functionality of both LexisNexis and Westlaw, so something we wanted to do was make sure they couldn't steal our code, enhance their product and thus moot ours.

To achieve this, we're using the GNU Affero General Public License v3, which allows people to take our code for free, but requires that they publicly share any modifications that they make to the code. The normal GNU General Public License allows the code to be used at no cost, but only requires that changes to the code be shared with the public if one distributes the changed version to the public. With a server-based project, like ours, one could operate modified versions of the code without ever having a need to distribute any of the software to the public. This loophole is closed by the Affero license.

In order to license our work, we must be its copyright holder. This is easy enough, since we get copyright instantly in the U.S., but, as has been demonstrated in Jacobsen v. Katzer, in order to seek remedies for copyright violations, we would have to register everything we made with the copyright office. This costs $35 per registration, and with open source software, it's not clear whether each and every version needs to be registered or just major releases, or what.

Since this is too onerous to be practical, an additional approach to protecting our works is useful, and in the DMCA (17 U.S.C. § 506(d)), remedies are provided for the "fraudulent removal of copyright notice." Although these do not (in any way) match the protections provided by normal copyright registration, they are a useful place to begin. Thus, if we place a copyright notice into each file of our code, those using our code must either risk violating the DMCA by removing these notices, or leave our copyright information intact. (Placing such notices in each file is also the recommendation of the Free Software Foundation.)

To place our information into each and every file of code that we upload publicly, I wrote a short Mercurial hook that adds copyright and licensing information to the top of every file that is modified or added to the repository. To use the script, simply make it executable, place it in the .hg directory of your project, and add the following lines to .hg/hgrc:

[hooks]
pretxncommit = .hg/checklicense.py

A couple of things I should note about this script are that it currently only checks Java and Python files, and that it requires files called java_license.txt and python_license.txt to be in the root of your repository. It should be fairly easy to modify to fit your own needs, though.
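To give a feel for what a script like this has to do, here is a simplified sketch of the core logic. This is not the actual checklicense.py: the names LICENSE_FILES, license_file_for and ensure_license are invented for this example, and it leaves out the part that asks Mercurial which files changed. It only shows how a license file could be picked by extension and how a header could be added without duplicating it.

```python
import os

# Maps a source file extension to the license file expected in the
# repository root. The two file names are the ones the script requires.
LICENSE_FILES = {".java": "java_license.txt", ".py": "python_license.txt"}


def license_file_for(path):
    """Return the license file name for path, or None for unhandled types."""
    _, extension = os.path.splitext(path)
    return LICENSE_FILES.get(extension)


def ensure_license(source_path, license_text):
    """Prepend license_text to the file unless it already starts with it.

    Returns True if the file was modified and False if it was left alone.
    """
    with open(source_path, encoding="utf-8") as handle:
        contents = handle.read()
    if contents.startswith(license_text):
        return False
    with open(source_path, "w", encoding="utf-8") as handle:
        handle.write(license_text + contents)
    return True
```

One limitation of this sketch is that it puts the header at the very top of the file, so a Python script that begins with a #! line would need special handling to keep that line first.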
