courtlistener.com

New tool for testing lxml XPath queries

I got a bit frustrated today and decided to build a tool to fix it. The problem was that we use a lot of XPath queries to scrape various court websites, but there was no tool for testing XPath expressions efficiently.

There are a couple of tools that are quite similar to what I just built: there's one called Xacobeo, Eclipse has one built in, and even Firebug has a tool that does something similar. Unfortunately, each of these operates on a different DOM interpretation than the one that lxml builds.

So while these tools helped, they consistently started falling over when the HTML got nasty.

No more! Today I built a quick Django app that can be run locally or on a server. It's quite simple. You input some HTML and an XPath expression, and it will tell you the matches for that expression. It has syntax highlighting, and a few other tricks up its sleeve, but it's pretty basic on the whole.

I'd love to get any feedback I can about this. It's probably still got some bugs, but it's small enough that they should be quite easy to stamp out.

Update: I got in touch with the developer of Xacobeo. There's an --html flag that you can pass to it at startup, if that's your intention. If you use that, it indeed uses the same DOM parser that my tool does. Sigh. Affordances are important, especially in a GUI-based tool.

Further privacy protections at CourtListener

I've written previously about the lengths we go to at CourtListener to protect people's privacy, and today we completed one more privacy enhancement.

After my last post on this topic, we discovered that although we had already blocked cases from appearing in the search results of all major search engines, we had a privacy leak in the form of our computer-readable sitemaps. These sitemaps contain links to every page within a website, and since those links contain the names of the parties in a case, it's possible that a Google search for the party name could turn up results that should be hidden.

This was problematic, and as of now we have changed the way we serve sitemaps so that they use the noindex X-Robots-Tag HTTP header. This tells search crawlers that they are welcome to read our sitemaps, but that they should avoid serving them or indexing them.
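As a rough sketch of what serving a sitemap this way looks like, here is a minimal WSGI application using only the Python standard library. It is not our production code (CourtListener runs on Django); the sitemap body, the example URL and the function names are placeholders I made up for illustration.

```python
from wsgiref.util import setup_testing_defaults

# Placeholder sitemap body; a real one would list every case URL.
SITEMAP_XML = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    b'  <url><loc>http://example.com/some-case/</loc></url>\n'
    b'</urlset>\n'
)


def sitemap_app(environ, start_response):
    """Serve the sitemap while asking crawlers not to index the file itself."""
    headers = [
        ("Content-Type", "application/xml"),
        ("Content-Length", str(len(SITEMAP_XML))),
        # Crawlers may read the sitemap, but should not show it in results.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [SITEMAP_XML]


def call_app(app):
    """Call a WSGI app in-process and return (status, headers dict, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {}
    setup_testing_defaults(environ)  # fill in a minimal fake request
    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

Calling call_app(sitemap_app) is a quick way to confirm that the header is actually present on the response before deploying anything.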

My Presentation Proposal for LVI 2012

The Law Via the Internet conference is celebrating its 20th anniversary at Cornell University on October 7-9th. I will be attending, and with any luck, I'll be presenting on the topic proposed below.

Wrangling Court Data on a National Level

Access to case law has recently become easier than ever: By simply visiting a court's website it is now possible to find and read thousands of cases without ever leaving your home. At the same time, there are nearly a hundred court websites, many of them suffering from poor funding or prioritization, and gaining a higher-level view of the law can be challenging. “Juriscraper” is a new project designed to ease these problems for all those who wish to collect these court opinions daily. The project is under active development, and we are looking for others to get involved.

Juriscraper is a liberally-licensed open source library that can be picked up and used by any organization to scrape the case data from court websites. In addition to simply scraping the websites and extracting metadata from them, Juriscraper has a number of other design goals:

  • Extensibility to support video, oral argument audio, and other media types
  • Support for all metadata provided by court websites
  • Extensibility to support varied geographies and jurisdictions
  • Generalized object-oriented architecture with little or no code repetition
  • Standardized coding techniques using the latest libraries and standards (Python, XPath, lxml, requests, chardet)
  • Simple installation, configuration, and API
  • Friendly and transparent to court websites

As well as a number of features:

  • Harmonization of metadata (US, USA, United States of America, etc → United States; et al, et. al., etc. get eliminated; vs., v, vs → v.; all dates are Python objects; etc.)
  • Smart title-casing of case names (several courts provide case names in uppercase only)
  • Sanity checking and sorting of metadata values returned by court websites
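To make the harmonization idea concrete, here is a small sketch of the kind of cleanup described in the first bullet. It is not Juriscraper's actual code: the function name and the regular expressions are my own assumptions, and they only cover the handful of variants listed above.

```python
import re


def harmonize(case_name):
    """Normalize a few common variations in a case name (illustrative only)."""
    # "United States of America", "USA" and "US" all become "United States".
    name = re.sub(r"\b(?:United States of America|USA|US)\b",
                  "United States", case_name)
    # Drop "et al", "et al." and "et. al.", plus a comma directly before it.
    name = re.sub(r",?\s*\bet\.?\s+al\b\.?", "", name, flags=re.IGNORECASE)
    # "vs.", "vs", "v" and "v." all become "v.".
    name = re.sub(r"\s+vs?\.?\s+", " v. ", name, flags=re.IGNORECASE)
    # Collapse any leftover runs of whitespace.
    return re.sub(r"\s+", " ", name).strip()
```

A real implementation needs far more rules than this (a company called "US Airways" would be mangled by the first substitution, for example), which is part of why it makes sense to solve the problem once in a shared library.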

Once implemented, Juriscraper is part of a two-part system. The second part is the caller, which uses the API, and which itself solves some interesting questions:

  • How are duplicates detected and avoided?
  • How can the impact on court websites be minimized?
  • How can mime type detection be completed successfully so that textual contents can be extracted?
  • What should we do if it is an image-based PDF?
  • How should HTML be tidied?
  • How often should we check a court website for new content?
  • What should we do in case of failure?

Juriscraper is currently deployed by CourtListener.com to scrape all of the Federal Appeals courts, and we will be gradually adding state courts over the coming weeks.

We have been scraping these sites in various ways for several years, and Juriscraper is the culmination of what we've learned. We hope that by presenting our work at LVI 2012, we will be able to share what we have learned and gain additional collaborators in our work.

The Winning Font in Court Opinions

At CourtListener, we're developing a new system to convert scanned court documents to text. As part of our development we've analyzed more than 1,000 court opinions to determine what fonts courts are using.

Now that we have this information, our next step is to create training data for our OCR system so that it specializes in these fonts. For now, we've attached a spreadsheet with our findings and a script that others can use to extract font metadata from PDFs.

Unsurprisingly, the top font — drumroll please — is Times New Roman.

Font                 Regular   Bold   Italic   Bold Italic   Total
Times                   1454    953      867            47    3321
Courier                  369    333      209           131    1042
Arial                    364     39       11            41     455
Symbol                   212      0        0             0     212
Helvetica                 24    161        2             2     189
Century Schoolbook        58     54       52             9     173
Garamond                  44     42       41             0     127
Palatino Linotype         36     24       24             1      85
Old English               42      0        0             0      42
Lincoln                   27      0        0             0      27

Respecting privacy while providing hundreds of thousands of public documents

At CourtListener, we have always taken privacy very seriously. We have over 600,000 cases currently, most of which are available on Google and other search engines. But in the interest of privacy, we make two broad exceptions to what's available on search engines:

  1. As is stated in our removal policy, if someone gets in touch with us in writing and requests that we block search engines from indexing a document, we generally attempt to do so within a few hours.
  2. If we discover a privacy problem within a case, we proactively block search engines from indexing it.

Each of these exceptions presents interesting problems. In the case of requests to prevent indexing by search engines, we're often faced with an ethical dilemma, since in many instances, the party making the request is merely displeased that their involvement in the case is easy to discover, or they are simply embarrassed by their past. In this case, the question we have to ask ourselves is: Where is the balance between the person's right to privacy and the public's need to access court records, and to what extent do changes in practical obscurity compel action on our part? For example, if someone convicted of murder or child molestation is trying to make information about their past harder to discover, how should we weigh the public's interest in easily locating this information via a search engine? In the case of convicted child molesters, we can look to Megan's law for a public policy stance on the issue, but even that forces us to ask to what extent we should chart our own path, and to what extent we should follow public policy decisions.

On the opposite end of the spectrum, many of the cases that we block search engines from indexing are asylum cases where a person has lost an attempt to stay in the United States, and been sent back to a country where they feel unsafe. In such cases, it seems clear that it's important to keep the person's name out of search engine results, but still we must ask to what extent do we have an obligation to identify and block such cases from appearing proactively rather than post hoc?

In both of these scenarios, we have taken a middle ground that we hope strikes a balance between the public's need for court documents and an individual's desire or need for privacy. Instead of either proactively blocking search engines from indexing cases or keeping cases in search results against a party's request, our current policy is to block search engines from indexing a web page as each request comes in. We currently have 190 cases that are blocked from search results, and the number increases regularly.

Where we do take proactive measures to block cases from search results is where we have discovered unredacted or improperly redacted social security numbers in a case. Taking a cue from the now-defunct Altlaw, whenever a case is added, we look for character strings that appear to be social security numbers, tax ID numbers or alien ID numbers. If we find any such strings, we replace them with x's, and we try to make sure the unredacted document does not appear in search results outside of CourtListener.
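A minimal sketch of that kind of check is below. The pattern is an assumption on my part: it only catches social security numbers written as 3-2-4 digit groups separated by dashes or spaces, and it leaves out tax ID and alien ID numbers entirely. The function name is hypothetical too.

```python
import re

# Assumed format: nine digits grouped 3-2-4, separated by dashes or spaces.
SSN_PATTERN = re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")


def redact_ssns(text):
    """Replace SSN-like strings with x's.

    Returns the cleaned text and a flag saying whether anything was found,
    so the caller can also block the document from search engines.
    """
    cleaned, count = SSN_PATTERN.subn("xxx-xx-xxxx", text)
    return cleaned, count > 0
```

The flag matters as much as the replacement: a document that needed redaction is exactly the kind we want to keep out of outside search results.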

The methods we have used to block cases from appearing in search results have evolved over time, and I'd like to share what we've learned so others can give us feedback and learn from our experiences. There are five technical measures we use to keep a case out of search results:

  1. robots.txt file
  2. HTML meta noindex tags
  3. HTTP X-Robots-Tag headers
  4. sitemaps.xml files
  5. The webmaster tools provided by the search engines themselves

Each of these deserves a moment of explanation. robots.txt is a protocol that is respected by all major search engines internationally, and which allows site authors (such as myself) to identify web pages that shouldn't be crawled. Note that I said crawled, not indexed. This is a very important distinction, as I'll explain momentarily.

The HTML meta noindex tag is a tag that you can place into the HTML of a page, and which instructs search engines not to index that page. Since this is an HTML format, this method only works on HTML pages.

HTTP X-Robots-Tag headers are similar to HTML meta tags, but because they are sent with the HTTP response, they allow site authors to request that any kind of item not be indexed. That item may be an HTML page, but equally, it may be a PDF or even an image that should not be searchable.

Further, we provide an XML sitemap that search engines can understand, and which tells them about every page on the site that they should crawl and index.

All of these elements fit together into a complicated mélange that has absorbed many development hours over the past two years, as different search engines interpret these standards in different ways.

For example, Google and Bing interpret the robots.txt files as blocks to their crawlers. This means that web pages listed in robots.txt will not be crawled by Google or Bing, but that does not mean those pages will not be indexed. Indeed, if Google or Bing learn of the existence of a web page (for example, because another page linked to it), then they will include it in their indexes. This is true even if robots.txt explicitly blocks robots from crawling the page, because to include it in their indexes, they don't have to crawl it — they just need to know about it! Even your own link to a page is sufficient for Google or Bing to know about the page. And what's worse, if you have a good URL with descriptive words within it, Google or Bing will know the terms in the URLs even when they haven't crawled the page. So if your URL is example.com/private-page-about-michael-jackson, a query for [ Michael Jackson ] could certainly bring it up, even if it were never crawled.

The solution to this is to allow Google and Bing to crawl the pages, but to use noindex meta or HTTP tags. If these are in place, the pages will not appear in the index at all. This sounds paradoxical: to exclude pages from appearing in Google and Bing, you have to allow them to be crawled? Yes, that's correct. Furthermore, it's theoretically possible that Google or Bing could learn about a page on your site from a link, and then not crawl it immediately or at all. In this case, they will know the URL, but won't know about any X-Robots-Tag headers or meta tags. Thus, they might include the document against your wishes. For this reason, it's important to include private pages in your sitemap.xml file, inviting and encouraging Google and Bing to crawl the page specifically so the page can be excluded from their indexes.
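Here is a small sketch of how the two pieces fit together: every page, private or not, goes into the sitemap, and private pages additionally carry a noindex meta tag. The page dictionaries and function names are invented for the example; only the sitemap namespace and the meta tag itself are standard.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """List every page, including private ones, so crawlers visit them."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = page["url"]
    return ET.tostring(urlset, encoding="unicode")


def robots_meta(page):
    """Return the tag to place in the page's <head>, or '' for public pages."""
    if page["private"]:
        return '<meta name="robots" content="noindex">'
    return ""
```

The important detail is that build_sitemap does not filter on the private flag. Leaving a private page out of the sitemap would make it less likely to be crawled, and so less likely that the crawler ever sees the noindex instruction.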

Yahoo! uses Bing to power their search engine, and AOL uses Google, so the above strategy applies to them as well.

Other search engines take a different approach to robots.txt. Ask.com, The Internet Archive and the Russian search engine Yandex.ru all respect the robots meta tag, but not the x-robots-tag HTTP header. Thus, for these search engines, the strategy above works for HTML files, but not for any other files. These crawlers therefore need to be blocked from accessing those other files. On the upside, unlike Google and Bing, it appears that these search engines will not show a document in their results if they have not crawled it. Thus, using robots.txt alone should be sufficient.

A third class of search engines supports neither the robots HTML meta tag, nor the x-robots-tag HTTP header. These are typically less popular or less mature crawlers, and so they must be blocked using robots.txt. There are two approaches to this. The first is to list blocked pages individually in the robots.txt file, and the second is to simply block these search engines from all access. While it's possible to list each private document in robots.txt, doing so creates a privacy loophole, since it lists all private documents in one place. At CourtListener, therefore, we take a conservative approach, and completely block all search engines that do not support the HTML meta tag or the x-robots-tag HTTP header.

The final action we take when we receive a request that a document on our site stop appearing in search results is to use the webmaster tools provided by the major search engines [1] to explicitly ask those search engines to exclude the document(s) from their results.

Between these measures, private documents on CourtListener should be removed from all major and minor search engines. Where possible this strategy takes a very granular approach, and where minor search engines do not support certain standards, we take a conservative approach, blocking them entirely.
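The conservative approach can be sketched as a robots.txt that welcomes the crawlers known to honor noindex and shuts out everyone else. The user-agent tokens below are assumptions for the sake of the example, not our actual list, and the helper names are made up. The check uses Python's own robots.txt parser from the standard library.

```python
from urllib.robotparser import RobotFileParser

# Assumed tokens for crawlers that honor noindex directives.
SUPPORTED_CRAWLERS = ["Googlebot", "bingbot"]


def build_robots_txt(supported):
    """Allow supported crawlers everywhere; block all others entirely."""
    lines = []
    for agent in supported:
        # An empty Disallow line means nothing is off limits.
        lines += ["User-agent: %s" % agent, "Disallow:", ""]
    lines += ["User-agent: *", "Disallow: /"]
    return "\n".join(lines) + "\n"


def can_crawl(robots_txt, agent, url):
    """Ask the standard-library parser whether this agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Note that no private URL appears anywhere in the generated file, which avoids the loophole of listing all private documents in one place.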

Update, 2012-04-29: You may also want to look at our discussion of the impact of putting people's names into your URLs, and the way that affects your sitemap files.

  1. We use Google's Webmaster Tools and Bing's Webmaster Tools. Before it was merged into Bing's tools, we also previously used Yahoo's Site Explorer.

Integrating Solr Search with Django at CourtListener

Over the past few weeks, I've been hard at work on the new version of CourtListener. Unfortunately, progress has been slower than I'd like due to the limitations of the Solr frameworks I've been using. There are a number of competing frameworks available, each with its own strengths and pitfalls.

So far, I've tried two of the popular ones, Haystack and Sunburnt. I'm pretty impressed by both, but today's blog post is to outline the problems I'm having with these frameworks so that others that are faced with choosing one might be better informed. The difference between these frameworks is vast. Haystack aims to solve all of your integration needs, while Sunburnt is a fairly lightweight wrapper around Solr.

CourtListener's needs

At CourtListener, we have some big goals for the new search version. At its core, it's essentially a search-powered site, so we have some big needs:

  • Parallel Faceted Search
  • Highlighting
  • Complex boolean searches supported by Solr's eDisMax syntax
  • Snippets below search results and in emails
  • Standard search stuff: field-level boosting, result and facet counts, field-level searching, result pagination, performance, etc.

We're currently using Sphinx Search with django-sphinx, which does a fine job, but it has some problems:

  • django-sphinx hasn't been maintained in years, and requires patching
  • django-sphinx doesn't support snippets
  • Sphinx doesn't (yet) support real time indexing (though it's in beta, I believe)
  • Sphinx doesn't have the community and features that Solr does
  • Unfamiliar syntax for users

In general, these problems aren't too difficult, but in combination, they make for a poor user experience. The last point is a real deal breaker, since most users are accustomed to making queries like [ site:google.com ], which works for Solr and Google, but not for Sphinx. In Sphinx, your query is [ @site(google.com) ]. While we could post-process the user's query to convert it from Google/Solr-style syntax to Sphinx's, that approach is unreliable and prone to failing in corner cases. Parsing queries is hard. More on this in a moment.

Let's try Haystack

In switching from Sphinx, I first tried Haystack as a solution, since it has excellent documentation and seems to be the most popular solution. I spent about two weeks learning about it and getting it in place, but ultimately, I gave up on it because I found that I was subclassing it everywhere. Haystack is a good solution, to be sure, but I found that I was:

  • Subclassing the FacetView so it could support parallel facet counts
  • Subclassing the FacetForm for another feature I needed
  • Subclassing the Solr backend so it could support Solr's highlighting syntax
  • Further subclassing the Solr backend so it can support additional Solr parameters that aren't built in
  • ...etc...

I worked on that third point for the better part of a day before deciding that Haystack wasn't for me. Rather than spending my time working on the search needs of CourtListener, I was spending most of it hacking on Haystack, and trying to understand the way it fits together. It's not unreasonably complex, but there is a LOT of documentation, and a lot of complexity that I don't need (such as the ability to switch search backends). Instead of a big solution that allows me to subclass whatever I need (which is good), I needed a lighter-weight solution that was more nimble, and which allowed me to interact with Solr in a more direct way.

Enter Sunburnt

Sunburnt is a lightweight solution that is everything that Haystack isn't. From the moment it's installed, you can start making queries without configuring Django to use it, and without really knowing much else. Its documentation is a single page, which is actually a big relief after coming from Haystack. But Sunburnt has a major problem in its design: It doesn't support just sending queries to Solr. The expectation in Sunburnt is that each system using it does post-processing on the user's query, and then submits the query to Sunburnt in stages.

So, if a user searches for "foo bar", rather than just passing that to Sunburnt, you have to split on the white space, then pass:
si.query('foo').query('bar')

At first you think, "OK, I can do that - just split on white space, no big deal." Then you start thinking about the other syntax that Solr supports, and you realize that you have a real problem if you have to split up queries appropriately. Trust me when I say that you don't want to be thinking about how to send a query like this one to Sunburnt: [ foo bar "jakarta apache"~10 ].
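To see how quickly the naive approach breaks, here is a short illustration in plain Python (no Sunburnt involved). The function names are mine, and the second tokenizer is only a sketch that handles one extra construct, quoted phrases with an optional proximity suffix.

```python
import re


def naive_split(query):
    """Split on white space, as you'd be tempted to do at first."""
    return query.split()


def phrase_aware_split(query):
    """Keep a quoted phrase, and any ~N proximity suffix, as one term."""
    return re.findall(r'"[^"]*"(?:~\d+)?|\S+', query)


query = 'foo bar "jakarta apache"~10'
print(naive_split(query))         # ['foo', 'bar', '"jakarta', 'apache"~10']
print(phrase_aware_split(query))  # ['foo', 'bar', '"jakarta apache"~10']
```

The second version fixes this one query, but field names, boolean operators, ranges, boosts and nested parentheses each need their own handling, at which point you are rewriting Solr's query parser.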

The author of Sunburnt will point out that there's a workaround for this problem. You can use
si.search(q='"jakarta apache"~10')

That works, to a point, but that syntax isn't supported on facets, so your facet counts won't match your results. And so, Sunburnt, though powerful and lightweight, fails.

What now?

Good question.

The abolishment of the Emergency Court of Appeals (April 18, 1962)

One of the coming features at CourtListener is an API for the law. Part of that feature is going to be some basic information about the courts themselves, so I spent some time over the weekend researching courts that served a special purpose but were since abolished.

One such court was the Emergency Court of Appeals. It was created during World War II to review price control orders, and, naturally, was the court of appeals for many cases. The creation date of the court is prominently published in various places on the Internet, but the abolishment history of the court was very difficult to find. After researching online for some time, and learning that my library card had expired (sigh), I put in a query with the Library of Congress, which provides free research of these types of things.

Within a couple of days, they provided me with this amazing response, which I'm sharing here, and on the Wikipedia article about the court:

As stated in the Legislative Notes to 50 U.S. Code Appendix §§ 921 to 926, as posted at
http://www.law.cornell.edu/uscode/html/uscode50a/usc_sec_50a_00000921---..., the following explanation is given regarding the amendment and repeal of Act of Jan. 30, 1942, ch. 26, title II, § 204, 56 Stat. 23, 31-33:

"Section 924, acts Jan. 30, 1942, ch. 26, title II, § 204, 56 Stat. 31; June 30, 1944, ch. 325, title I, § 107, 58 Stat. 639; June 30, 1945, ch. 214, § 6, 59 Stat. 308; July 30, 1947, ch. 361, title I, § 101, 61 Stat. 619; June 25, 1948, ch. 646, § 32(a), 62 Stat. 991; May 24, 1949, ch. 139, § 127, 63 Stat. 107, authorized review of orders of the Office of Price Administrator under the Emergency Price Control Act of 1942, and created the Emergency Court of Appeals for this purpose. The Emergency Price Control Act of 1942 terminated on June 30, 1947, under the provisions of act July 25, 1946, ch. 671, § 1, 60 Stat. 664. The Housing and Rent Act of 1948, act Mar. 30, 1948, ch. 161, 62 Stat. 93, classified to section 1881 of this Appendix, continued the Court for the purpose of reviewing recommendations of local advisory boards for the decontrol or adjustment of maximum rents. Later, the Defense Production Act of 1950, act Sept. 8, 1950, ch. 932, 64 Stat. 798, classified to sections 2061 to 2166 of this Appendix, continued the Court to review regulations and orders relating to price control. The Housing and Rent Act of 1948 and the Defense Production Act of 1950 both terminated, however, the Court remained in existence “to complete the adjudication of rights and liabilities incurred prior to their termination dates.” (Transcript of Proceedings of the Final Session of the Court, 299 F.2d 1.) The final decision of the Court, Rosenzweig v. General Services Administration, 1961, 299 F.2d 22, was decided on Dec. 6, 1961. A petition for rehearing was denied on Jan. 2, 1962, and a petition for writ of certiorari to the Supreme Court of the United States was denied on Mar. 19, 1962, 82 S. Ct. 830.

The order of Chief Judge Albert B. Maris, set forth in 299 F.2d 20, provided:

“The business of this Court having been completed, it is ordered that at the expiration of 30 days from this date, if a petition for certiorari has not been filed in the Supreme Court in Case No. 676 [Rosenzweig v. General Services Administration], just decided, the acting clerk shall deliver the records and papers of the Court in his office to the General Services Administration for permanent custody as records of the Government, and shall thereupon inform the Chief Justice of the United States that the work of the Court has been completed and that the designations of the judges of the Court may therefore appropriately be terminated.

“If a petition for certiorari is filed in Case No. 676 this order shall take effect and be carried out at the expiration of 30 days after the final disposition of Case No. 676.”

In accordance with the terms of this order, the petition for certiorari having been filed, and denied Mar. 19, 1962, the Court terminated on Apr. 18, 1962."

Pretty fantastic research. And for free! Thanks, LOC.

Changes and Plans at CourtListener.com

A few weeks ago, we made a fairly major change at CourtListener.com to include ID numbers in all of our case URLs. This change meant that links that were previously like this:

http://courtlistener.com/scotus/Wong-v.-Smith/

Are now like this:

http://courtlistener.com/scotus/V5o/wong-v-smith/

Most of the old links should continue to work, but using the new links should be much faster and more reliable. The major difference between the two is the ID number, which is encoded as a short string of letters and numbers (in this case V5o). This ID corresponds directly with the ID number in our database, aiding us greatly in serving up cases quickly and accurately.
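For readers curious how a database ID can become a short token like the one in the URL, here is one common way to do it. This is an illustration only: the base-62 alphabet and its ordering are assumptions I picked for the example, so these functions will not necessarily reproduce the tokens used on the site.

```python
import string

# Assumed alphabet: digits, then lowercase, then uppercase (62 symbols).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_id(number):
    """Turn a non-negative integer ID into a short alphanumeric token."""
    if number < 0:
        raise ValueError("IDs must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, len(ALPHABET))
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_id(token):
    """Turn a token back into the integer ID for a direct database lookup."""
    number = 0
    for char in token:
        number = number * len(ALPHABET) + ALPHABET.index(char)
    return number
```

Because decoding gives back the primary key directly, the server can look the case up by ID instead of trying to match on the case name in the URL.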

Around the same time as this change, we added social networking links to all of our case pages to make them easier to share with friends and colleagues. These links use our new tiny domain, http://crt.li/, and should thus be ideal for websites like Twitter or Reddit.

In the next few months we will be getting a major new server, and will be migrating our data to it. This will allow us to serve more data, and, drum roll please, will allow us to begin serving audio content on the site. That's right, in the next few months, we will begin getting oral arguments from the circuit courts, and will be serving them directly to you on the case pages.

We also have plans to revisit our search interface in order to add date filtering and query building, so look for that soon.

As always, we welcome your feedback and support, so don't hesitate to get in touch with us if you have any questions or suggestions.

Announcing CourtListener.com

I'm elated to announce today that I am officially taking the ropes off my final project and letting it loose into the wild. It's been seven months since development on it officially started, and finally the beta version is done.

If you haven't been following along, the project itself is an open source legal research tool which allows anybody to keep up to date with federal precedents as they are set by the 13 Federal Circuit courts. Right now, it has more than 130,000 documents in its corpus, including almost all of the Supreme Court record dating back to 1754. Every day it downloads the latest documents within about a half hour of when each court publishes them.

One thing we've focused on while building the site has been making it as useful as possible for as many people as possible. Since not everybody likes getting updates in their inbox, we've also tied the search engine in with an Atom feed generator so that you can search for whatever you want, and then follow updates in your feed reader.

Everything we've built uses a powerful boolean search engine on the backend. At present, there are a ton of boolean connectors that you can use on our site to search our corpus or create alerts and feeds. Unlike the full text search that most people are familiar with, boolean search allows incredibly complex queries, such as every document mentioning Attorney General Holder that is published in the Third Circuit Court of Appeals (@court ca3 @doctext holder), or perhaps every document that mentions "Roe" and "Wade" within ten words of each other (@doctext "roe wade"~10).

But that's not all. Because we also want you to be able to use this efficiently during your day-to-day searching, we've built an add-on that will work in most browsers, which allows you to search CourtListener.com without first going to our homepage.

You can also browse all of the documents in our corpus, or you can go to the details page for an opinion, where you can read the text of its body without having to download a PDF and crank up Adobe Acrobat.

As I mentioned earlier, this project has been designed as an open source project, so if you're looking for something to contribute to, look no further. We have a very active bug list where you can dip your toes in, or if you prefer something meatier, we can cook something up specifically for you.

I've greatly enjoyed working on this project so far, and I'd love to get more people using it, working on it, and recommending it to their friends. We're already planning version 1.0, so drop me a line if you're interested in helping out. Otherwise, go check it out already, and see all that it has to offer!

Converting PDF Files to HTML

For my final project, we are considering posting court cases on our site, and so I did some work today analyzing how best to convert the PDF files the courts give us to HTML that people can actually use. I looked briefly at Google Docs, since it has an amazing tool that converts PDF files to something resembling text, but short of spending a few days hacking the site, I couldn't figure out any easy way to leverage their technology in any sort of automated way.

The other two tools I have looked at today are pdftotext and pdftohtml, which, not surprisingly, do what their names claim they do. Since we're going to be pulling cases from the 13 federal circuit courts, I wanted to figure out which method works best for which court, and which method will provide us with the most generalizable solution across whatever PDF a court may crank out.

The short version is that the best option seems to be:

pdftotext -htmlmeta -layout -enc 'UTF-8' yourfile.pdf

This creates an HTML file with the text of the case laid out as well as possible, some basic HTML metadata applied, and the UTF-8 encoding applied.

Before coming to this conclusion though, I looked at two settings that pdftohtml has. With the -c argument, it can generate a 'complex' HTML document that closely resembles the original. Without the -c argument, it creates a simpler document. Although the complex documents are rather impressive in appearance, they're abysmal when it comes to the quality of the HTML code that is generated. For an example, look at the source code for this file. If, on the other hand, the -c argument is not used and the simple documents are generated, the appearance of the final product is worse than the simple text documents that are created by pdftotext. Check out this one for example.
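Since we'll be running this over every document a court publishes, it will need to be scripted. A minimal sketch of wrapping the command in Python follows; the function names and the output path argument are my own choices for illustration, and it assumes pdftotext is installed and on the PATH.

```python
import subprocess


def pdftotext_command(pdf_path, html_path):
    """Build the argument list for the command shown above."""
    return ["pdftotext", "-htmlmeta", "-layout", "-enc", "UTF-8",
            pdf_path, html_path]


def convert(pdf_path, html_path):
    """Run pdftotext; raises CalledProcessError if the conversion fails."""
    subprocess.run(pdftotext_command(pdf_path, html_path), check=True)
```

Passing the arguments as a list rather than a single shell string means file names with spaces or quotes in them don't need any escaping.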

For thoroughness, here is a table containing the results from this test.

Court             pdftotext   pdftohtml complex   pdftohtml simple   Original PDF
1st               The first circuit publishes in HTML format by default
2nd               link        link                link               link
3rd               link        link                link               link
4th               link        link                link               link
5th               link        link                link               link
6th               link        link                link               link
7th               link        link                link               link
8th               link        link                link               link
9th               link        link                link               link
10th              link        link                link               link
11th              link        link                link               link
DC Circuit        link        link                link               link
Federal Circuit   link        link                link               link

A caveat regarding pdftotext: This library is developed by a company called Glyph & Cog. Although the code is open source, I couldn't for the life of me figure out how to file a bug against it. This doesn't particularly bode well for using something as a dependency. On the flip side, Glyph & Cog is happy to provide support for the product.

How to Protect Your Open Source Code from Theft and a Mercurial Hook to Help

Updated, 2010-01-24: Some edits regarding the Affero license (thanks to Brian at http://cyberlawcases.com for the corrections).

I've finally begun doing some of the actual coding for my final project, so the time has come to set up a Mercurial repository to hold the code.

Once we complete our project, we will have built a free product that competes with some of the core functionality of both LexisNexis and Westlaw, so something we wanted to do was make sure they couldn't steal our code, enhance their product and thus moot ours.

To achieve this, we're using the GNU Affero General Public License v3, which allows people to take our code for free, but requires that they publicly share any modifications that they make to the code. The normal GNU General Public License allows the code to be used at no cost, but only requires that changes to the code be shared with the public if one distributes the changed version to the public. With a server-based project, like ours, one could operate modified versions of the code without ever having a need to distribute any of the software to the public. This loophole is closed by the Affero license.

In order to license our work, we must be its copyright holder. This is easy enough, since we get copyright instantly in the U.S., but, as has been demonstrated in Jacobsen v. Katzer, in order to seek remedies for copyright violations, we would have to register everything we made with the copyright office. This costs $35 per registration, and with open source software, it's not clear whether each and every version needs to be registered or just major releases, or what.

Since this is too onerous to be practical, an additional approach to protecting our works is useful, and the Copyright Act (17 U.S.C. § 506(d)) provides penalties for the "fraudulent removal of copyright notice." Although these do not (in any way) match the protections provided by normal copyright registration, they are a useful place to begin. Thus, if we place a copyright notice into each file of our code, those using our code must either risk violating this provision by removing these notices, or leave our copyright information intact. (Placing such notices in each file is also the recommendation of the Free Software Foundation.)

To place our information into each and every file of code that we upload publicly, I wrote a short Mercurial hook that adds copyright and licensing information to the top of every file that is modified or added to the repository. To use the script, simply make it executable, place it in the .hg directory of your project, and add the following lines to .hg/hgrc:

[hooks]
pretxncommit = .hg/checklicense.py

A couple of things I should note about this script are that it currently only checks Java and Python files, and that it requires files called java_license.txt and python_license.txt to be in the root of your repository. It should be fairly easy to modify to fit your own needs, though.
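The core of such a script is small. The sketch below is not the hook itself, just an illustration of the central step under my own assumptions: the function name is hypothetical, and it leaves out how the hook finds the files touched by a commit.

```python
import os

# License file to use for each extension, looked up in the repository root.
LICENSE_FILES = {".java": "java_license.txt", ".py": "python_license.txt"}


def add_license(source_path, repo_root):
    """Prepend the matching license text if the file doesn't start with it.

    Returns True if the file was changed, False otherwise.
    """
    extension = os.path.splitext(source_path)[1]
    license_name = LICENSE_FILES.get(extension)
    if license_name is None:
        return False  # not a file type we handle
    with open(os.path.join(repo_root, license_name), encoding="utf-8") as f:
        license_text = f.read()
    with open(source_path, encoding="utf-8") as f:
        source = f.read()
    if source.startswith(license_text):
        return False  # already licensed
    with open(source_path, "w", encoding="utf-8") as f:
        f.write(license_text + source)
    return True
```

Supporting another language is a matter of adding an entry to LICENSE_FILES and dropping the corresponding license file into the repository root.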
