Final Project

Announcing CourtListener.com

I'm elated to announce today that I am officially taking the ropes off my final project and letting it loose into the wild. It's been seven months since development on it officially started, and the beta version is finally done.

If you haven't been following along, the project itself is an open source legal research tool which allows anybody to keep up to date with federal precedents as they are set by the 13 Federal Circuit courts. Right now, it has more than 130,000 documents in its corpus, including almost all of the Supreme Court record dating back to 1754. Every day it downloads the latest documents within about a half hour of when each court publishes them.

One thing we've focused on while building the site has been making it as useful as possible for as many people as possible. Since not everybody likes getting updates in their inbox, we've also tied the search engine in with an Atom feed generator so that you can search for whatever you want, and then follow updates in your feed reader.

Everything we've built uses a powerful boolean search engine on the backend. At present, there are a ton of boolean connectors that you can use on our site to search our corpus or create alerts and feeds. Unlike the full text search that most people are familiar with, boolean search allows incredibly complex queries, such as every document mentioning Attorney General Holder that is published in the Third Circuit Court of Appeals (@court ca3 @doctext holder), or perhaps every document that mentions "Roe" and "Wade" within ten words of each other (@doctext "roe wade"~10).
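
To make the shape of those two queries concrete, here is a minimal Python sketch that assembles them from their parts. The helper names field_query and proximity are made up for illustration and are not part of the site's code; only the @court and @doctext fields and the ~10 proximity syntax come from the examples above.

```python
def field_query(field, terms):
    """Return a field-scoped clause such as '@court ca3'."""
    return "@{} {}".format(field, terms)


def proximity(words, distance):
    """Return a quoted phrase with a proximity limit, such as '"roe wade"~10'."""
    return '"{}"~{}'.format(" ".join(words), distance)


# Every Third Circuit document whose text mentions "holder".
holder_query = " ".join(
    [field_query("court", "ca3"), field_query("doctext", "holder")]
)

# Every document with "roe" and "wade" within ten words of each other.
roe_query = field_query("doctext", proximity(["roe", "wade"], 10))

print(holder_query)
print(roe_query)
```

The strings this produces are the same ones shown in parentheses above, so they can be pasted into the search box or saved as an alert or feed.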

But that's not all. Because we also want you to be able to use this efficiently during your day-to-day searching, we've built an add-on that will work in most browsers, which allows you to search CourtListener.com without first going to our homepage.

You can also browse all of the documents in our corpus, or you can go to the details page for an opinion, where you can read the text of its body without having to download a PDF and crank up Adobe Acrobat.

As I mentioned earlier, this project has been designed as an open source project, so if you're looking for something to contribute to, look no further. We have a very active bug list where you can dip your toes in, or if you prefer something meatier, we can cook something up specifically for you.

I've greatly enjoyed working on this project so far, and I'd love to get more people using it, working on it, and recommending it to their friends. We're already planning version 1.0, so drop me a line if you're interested in helping out. Otherwise, go check it out already, and see all that it has to offer!

Designing the Final Project

Over the past week, I've been working to create scrapers for each of the 13 federal appeals courts. Last night I finally finished the last of them, so today I'm moving on to the design of the site. Design is always much better when people work in a team, so I'm putting these designs here so others can look at them and give me feedback. Please, please do!

So far, I've sketched out four of the major pages that the site will have. A user will begin using the site on its homepage. Here, they will be given a few options. Basically, they can log in, register for an account, make a search, or read one of the ancillary pages such as the "About" or "Privacy" page:

Also, note the advanced button under the search field. When this is clicked, it expands to show the advanced search queries that the site will support, as you can see on the next page.

If people are logged in, their homepage becomes the "Create new alert" page, which you can see below. For now, this allows users to create very complicated queries by hand. In the future, it would be nice to build their queries for them. By default, the advanced section will be collapsed, but in the wireframe, I sketched it out expanded. Also, if users click on "More details" (in the bottom-right of the "Advanced" box), they can get explanations and examples of all the connectors shown.

From that page, they would normally be redirected to their settings page, where their alerts are listed. Here, they can edit and see their alerts.

Clicking the "Edit" button takes a user back to the "create alert" page, except that it will be pre-filled with the alert they're trying to edit.

Of course, users can also edit their profile by clicking on the settings link at the top of every page. This page isn't too special, though it does have a couple of unusual features, such as the bar memberships the user holds and whether they prefer HTML or plain text emails (not shown in the version below - sorry).

And that's it for now. I'd LOVE any feedback anybody has on these. Typing this up, I've already come across a couple problems:

  • Users currently get to their alerts by clicking settings - that ain't intuitive.
  • The about page is pretty hard to find. It may need more emphasis.

I'm sure there are more problems I'm not seeing. That's why I need your help. What am I missing? What should I change? What's stupid? What's outmoded?

Using Revision Control on a Django Project Without Revealing Your Passwords

Just a quick post today, since this took me way too long to figure out. If you have a Django project that you want to share without sharing the private bits of settings.py, there is an easy way to do this.

I tried for a while to set up Mercurial hooks that would strip out my passwords before each commit, and then place them back after each commit, thus avoiding uploading them publicly. This does not work, however, because all of the Mercurial hooks happen after snapshots of the modified files have been made. So you can edit the files using a hook, but your edits will only go into effect upon the next check in. Clearly, this will not do.

Another solution that I tried was the mercurial keyword extension. This could work, but ultimately it does not because you have to remember to run it before and after each commit — something I know I'd forget sooner or later.

The solution that does work is to split up your settings.py file into multiple pieces such that there is a private file and a public file. I followed the instructions here, with the resulting code being checked in here and here. There is also a file called "20-private.conf" which is not uploaded publicly, and which contains all the private bits of code that would normally be found in settings.py. Thus, all of my settings can be found by Django, but I do not have to share my private ones.
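
Since the linked instructions and code are not reproduced in this post, here is a minimal sketch of one way such a split can work, and not necessarily what was checked in. It assumes a directory of numbered .conf files that settings.py executes in sorted order. The directory name settings.d and the file name 10-public.conf are hypothetical; only 20-private.conf comes from the description above.

```python
import glob
import os


def load_settings(conf_dir, namespace):
    """Execute every *.conf file in conf_dir, in sorted order, into namespace.

    Files are plain Python. Because they run in name order, a later file
    such as 20-private.conf can override values set by an earlier one.
    """
    for path in sorted(glob.glob(os.path.join(conf_dir, "*.conf"))):
        with open(path) as conf_file:
            code = compile(conf_file.read(), path, "exec")
        exec(code, namespace)
    return namespace


# In settings.py this could be called as (settings.d is a hypothetical name):
# load_settings(os.path.join(os.path.dirname(__file__), "settings.d"), globals())
```

With this layout, the public .conf files are committed as usual, while the private one is kept out of the repository, for example by listing it in .hgignore.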

Converting PDF Files to HTML

For my final project, we are considering posting court cases on our site, and so I did some work today analyzing how best to convert the PDF files the courts give us to HTML that people can actually use. I looked briefly at Google Docs, since it has an amazing tool that converts PDF files to something resembling text, but short of spending a few days hacking the site, I couldn't figure out any easy way to leverage their technology in any sort of automated way.

The other two tools I have looked at today are pdftotext and pdftohtml, which, not surprisingly, do what their names claim they do. Since we're going to be pulling cases from the 13 federal circuit courts, I wanted to figure out which method works best for which court, and which method will provide us with the most generalizable solution across whatever PDF a court may crank out.

The short version is that the best option seems to be:

pdftotext -htmlmeta -layout -enc 'UTF-8' yourfile.pdf

This creates an HTML file with the text of the case laid out as well as possible, some basic HTML metadata applied, and the UTF-8 encoding applied.
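
For running the same conversion over many files, the command can be wrapped in a few lines of Python. This is only a sketch: the function names are invented for illustration, it assumes pdftotext is installed and on the PATH, and it passes an explicit output file name as a second positional argument rather than relying on the default.

```python
import subprocess


def pdftotext_command(pdf_path, html_path):
    """Build the pdftotext argument list from the command above."""
    return [
        "pdftotext",
        "-htmlmeta",       # wrap the text in simple HTML with meta tags
        "-layout",         # keep the original physical layout where possible
        "-enc", "UTF-8",   # write the output as UTF-8
        pdf_path,
        html_path,
    ]


def convert(pdf_path, html_path):
    """Run pdftotext; raises CalledProcessError if it exits non-zero."""
    subprocess.check_call(pdftotext_command(pdf_path, html_path))
```

Keeping the argument list in its own function makes it easy to swap flags per court if one circuit's PDFs turn out to need different options.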

Before coming to this conclusion though, I looked at two settings that pdftohtml has. With the -c argument, it can generate a 'complex' HTML document that closely resembles the original. Without the -c argument, it will create a simpler document. Although the complex documents are rather impressive in appearance, they're abysmal when it comes to the quality of the HTML code that is generated. For an example, look at the source code for this file. If, on the other hand, the -c argument is not used, and the simple documents are generated, the appearance of the final product is worse than the simple text documents that are created by pdftotext. Check out this one for example.

For thoroughness, here is a table containing the results from this test.

Court            pdftotext  pdftohtml complex  pdftohtml simple  Original PDF
1st              The First Circuit publishes in HTML format by default
2nd              link       link               link              link
3rd              link       link               link              link
4th              link       link               link              link
5th              link       link               link              link
6th              link       link               link              link
7th              link       link               link              link
8th              link       link               link              link
9th              link       link               link              link
10th             link       link               link              link
11th             link       link               link              link
DC Circuit       link       link               link              link
Federal Circuit  link       link               link              link

A caveat regarding pdftotext: This library is developed by a company called Glyph & Cog. Although the code is open source, I couldn't for the life of me figure out how to file a bug against it. This doesn't particularly bode well for using something as a dependency. On the flip side, Glyph & Cog is happy to provide support for the product.

How to Protect Your Open Source Code from Theft and a Mercurial Hook to Help

Updated, 2010-01-24: Some edits regarding the Affero license (thanks to Brian at http://cyberlawcases.com for the corrections).

I've finally begun doing some of the actual coding for my final project so the time has come to set up a mercurial repository to hold the code.

Once we complete our project, we will have built a free product that competes with some of the core functionality of both LexisNexis and Westlaw, so something we wanted to do was make sure they couldn't steal our code, enhance their product and thus moot ours.

To achieve this, we're using the GNU Affero General Public License v3, which allows people to take our code for free, but requires that they publicly share any modifications that they make to the code. The normal GNU General Public License allows the code to be used at no cost, but only requires that changes to the code be shared with the public if one distributes the changed version to the public. With a server-based project, like ours, one could operate modified versions of the code without ever having a need to distribute any of the software to the public. This loophole is closed by the Affero license.

In order to license our work, we must be its copyright holder. This is easy enough, since we get copyright instantly in the U.S., but, as has been demonstrated in Jacobsen v. Katzer, in order to seek remedies for copyright violations, we would have to register everything we made with the copyright office. This costs $35 per registration, and with open source software, it's not clear whether each and every version needs to be registered or just major releases, or what.

Since this is too onerous to be practical, an additional approach to protecting our works is useful, and in the DMCA (17 U.S.C. § 506(d)), remedies are provided for the "fraudulent removal of copyright notice." Although these do not (in any way) match the protections provided by normal copyright registration, they are a useful place to begin. Thus, if we place a copyright notice into each file of our code, those using our code must either risk violating the DMCA by removing these notices, or leave our copyright information intact. (Placing such notices in each file is also the recommendation of the Free Software Foundation.)

To place our information into each and every file of code that we upload publicly, I wrote a short Mercurial hook that adds copyright and licensing information to the top of every file that is modified or added to the repository. To use the script, simply make it executable, place it in the .hg directory of your project, and add the following lines to .hg/hgrc:

[hooks]
pretxncommit = .hg/checklicense.py

A couple of things I should note about this script are that it currently only checks Java and Python files, and that it requires files called java_license.txt and python_license.txt to be in the root of your repository. It should be fairly easy to modify to fit your own needs, though.
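
The script itself is not reproduced in this post, so the following is only a sketch of the core step such a hook needs: prepending the right license text to a file that lacks it. The function name ensure_license and the LICENSE_FILES mapping are made up for illustration. How the list of changed files is obtained from Mercurial is deliberately left out.

```python
import os

# Hypothetical mapping from file extension to the license file in the repo root.
LICENSE_FILES = {".py": "python_license.txt", ".java": "java_license.txt"}


def ensure_license(path, repo_root):
    """Prepend the matching license text to path unless it already starts with it.

    Returns True if the file was changed, False otherwise.
    """
    license_name = LICENSE_FILES.get(os.path.splitext(path)[1])
    if license_name is None:
        return False  # not a file type this sketch handles
    with open(os.path.join(repo_root, license_name)) as license_file:
        header = license_file.read()
    with open(path) as source_file:
        body = source_file.read()
    if body.startswith(header):
        return False  # notice already present
    with open(path, "w") as source_file:
        source_file.write(header + body)
    return True
```

One caveat with this simple approach: a Python file that begins with a shebang line would need the header inserted after that line, which the sketch does not handle.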
