
Further privacy protections at CourtListener

I've written previously about the lengths we go to at CourtListener to protect people's privacy, and today we completed one more privacy enhancement.

After my last post on this topic, we discovered that although we had already blocked cases from appearing in the search results of all major search engines, we had a privacy leak in the form of our computer-readable sitemaps. These sitemaps contain links to every page within a website, and since those links contain the names of the parties in a case, it's possible that a Google search for the party name could turn up results that should be hidden.

This was problematic, and as of now we have changed the way we serve sitemaps so that they use the noindex X-Robots-Tag HTTP header. This tells search crawlers that they are welcome to read our sitemaps, but that they should avoid serving them or indexing them.
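As a rough illustration of the header in question, here is a minimal sketch of a WSGI application that serves a sitemap with `X-Robots-Tag: noindex`. This is not CourtListener's actual code; the `sitemap_app` name, the sample URL, and the use of plain WSGI are all assumptions made for the example.

```python
# Hypothetical sitemap body; the URL is a placeholder, not a real case page.
SITEMAP_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/some-case/</loc></url>
</urlset>
"""


def sitemap_app(environ, start_response):
    """Serve the sitemap so crawlers can read it but are asked not to index it."""
    headers = [
        ("Content-Type", "application/xml; charset=utf-8"),
        ("Content-Length", str(len(SITEMAP_XML))),
        # The noindex directive applies to the sitemap file itself,
        # not to the pages that the sitemap lists.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [SITEMAP_XML]
```

The same header could be set by the web server or framework instead; the point is only that the sitemap response itself carries the directive.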

Respecting privacy while providing hundreds of thousands of public documents

At CourtListener, we have always taken privacy very seriously. We have over 600,000 cases currently, most of which are available on Google and other search engines. But in the interest of privacy, we make two broad exceptions to what's available on search engines:

  1. As is stated in our removal policy, if someone gets in touch with us in writing and requests that we block search engines from indexing a document, we generally attempt to do so within a few hours.
  2. If we discover a privacy problem within a case, we proactively block search engines from indexing it.

Each of these exceptions presents interesting problems. In the case of requests to prevent indexing by search engines, we're often faced with an ethical dilemma, since in many instances, the party making the request is merely displeased that their involvement in the case is easy to discover, or is simply embarrassed by their past. In this case, the question we have to ask ourselves is: Where is the balance between the person's right to privacy and the public's need to access court records, and to what extent do changes in practical obscurity compel action on our part? For example, if someone convicted of murder or child molestation is trying to make information about their past harder to discover, how should we weigh the public's interest in easily locating this information via a search engine? In the case of convicted child molesters, we can look to Megan's Law for a public policy stance on the issue, but even that forces us to ask to what extent we should chart our own path, and to what extent we should follow public policy decisions.

On the opposite end of the spectrum, many of the cases that we block search engines from indexing are asylum cases where a person has lost an attempt to stay in the United States, and been sent back to a country where they feel unsafe. In such cases, it seems clear that it's important to keep the person's name out of search engine results, but still we must ask to what extent do we have an obligation to identify and block such cases from appearing proactively rather than post hoc?

In both of these scenarios, we have taken a middle ground that we hope strikes a balance between the public's need for court documents and an individual's desire or need for privacy. Instead of either proactively blocking search engines from indexing cases or keeping cases in search results against a party's request, our current policy is to block search engines from indexing a web page as each request comes in. We currently have 190 cases that are blocked from search results, and the number increases regularly.

Where we do take proactive measures to block cases from search results is where we have discovered unredacted or improperly redacted social security numbers in a case. Taking a cue from the now-defunct Altlaw, whenever a case is added, we look for character strings that appear to be social security numbers, tax ID numbers or alien ID numbers. If we find any such strings, we replace them with x's, and we try to make sure the unredacted document does not appear in search results outside of CourtListener.
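A minimal sketch of this kind of check might look like the following. The patterns are assumptions chosen for illustration (a hyphenated 3-2-4 digit social security number, a 2-7 digit tax ID, and an "A" followed by eight or nine digits for an alien ID); they are not the patterns CourtListener or Altlaw actually use, and real documents would need more careful matching.

```python
import re

# Assumed formats, for illustration only.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")      # e.g. 123-45-6789
TAX_ID = re.compile(r"\b\d{2}-\d{7}\b")         # e.g. 12-3456789
ALIEN_ID = re.compile(r"\bA[- ]?\d{8,9}\b")     # e.g. A12345678


def redact(text):
    """Replace the digits of any matching ID with x's.

    Returns (redacted_text, found), where found says whether anything
    matched, so the caller can also block the document from search engines.
    """
    found = False

    def mask(match):
        nonlocal found
        found = True
        # Keep punctuation and the leading "A", but hide every digit.
        return re.sub(r"\d", "x", match.group(0))

    for pattern in (SSN, TAX_ID, ALIEN_ID):
        text = pattern.sub(mask, text)
    return text, found
```

Returning the `found` flag alongside the text keeps the two steps described above together: the string is masked, and the caller learns that the case should be kept out of outside search results.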

The methods we have used to block cases from appearing in search results have evolved over time, and I'd like to share what we've learned so others can give us feedback and learn from our experiences. There are five technical measures we use to keep a case out of search results:

  1. robots.txt file
  2. HTML meta noindex tags
  3. HTTP X-Robots-Tag headers
  4. sitemaps.xml files
  5. The webmaster tools provided by the search engines themselves

Each of these deserves a moment of explanation. robots.txt is a protocol that is respected by all major search engines internationally, and which allows site authors (such as myself) to identify web pages that shouldn't be crawled. Note that I said crawled, not indexed. This is a very important distinction, as I'll explain momentarily.

An HTML meta noindex tag is a tag that you can place into the HTML of a page, and which instructs search engines not to index that page. Since this is an HTML format, this method only works on HTML pages.

HTTP X-Robots-Tag headers are similar to HTML meta tags, but they allow site authors to request that any item not be indexed. That item may be an HTML page, but equally, it may be a PDF or even an image that should not be searchable.
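To make the difference between the two mechanisms concrete, here is a small sketch. The `apply_noindex` helper and its naive `<head>` replacement are hypothetical and only meant to show where each directive lives: the header works for any content type, while the meta tag can only be added to HTML.

```python
NOINDEX_META = '<meta name="robots" content="noindex">'


def apply_noindex(content_type, body):
    """Return (headers, body) for an item that should not be indexed."""
    # The HTTP header can accompany any response: HTML, PDF, image, etc.
    headers = [("Content-Type", content_type), ("X-Robots-Tag", "noindex")]

    # The meta tag only makes sense inside an HTML document.
    is_html = content_type.split(";")[0].strip().lower() == "text/html"
    if is_html:
        # Naive insertion right after the opening <head> tag (an assumption
        # about the page markup; a template would normally handle this).
        body = body.replace("<head>", "<head>" + NOINDEX_META, 1)
    return headers, body
```

For example, `apply_noindex("application/pdf", pdf_bytes)` leaves the PDF untouched and relies on the header alone.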

Further, we provide an XML sitemap that search engines can understand, and which tells them about every page on the site that they should crawl and index.

All of these elements fit together into a complicated mélange that has absorbed many development hours over the past two years, as different search engines interpret these standards in different ways.

For example, Google and Bing interpret the robots.txt files as blocks to their crawlers. This means that web pages listed in robots.txt will not be crawled by Google or Bing, but that does not mean those pages will not be indexed. Indeed, if Google or Bing learn of the existence of a web page (for example, because another page linked to it), then they will include it in their indexes. This is true even if robots.txt explicitly blocks robots from crawling the page, because to include it in their indexes, they don't have to crawl it — they just need to know about it! Even your own link to a page is sufficient for Google or Bing to know about the page. And what's worse, if you have a good URL with descriptive words within it, Google or Bing will know the terms in the URLs even when they haven't crawled the page. So if your URL is example.com/private-page-about-michael-jackson, a query for [ Michael Jackson ] could certainly bring it up, even if it were never crawled.

The solution to this is to allow Google and Bing to crawl the pages, but to use noindex meta or HTTP tags. If these are in place, the pages will not appear in the index at all. This sounds paradoxical: to exclude pages from appearing in Google and Bing, you have to allow them to be crawled? Yes, that's correct. Furthermore, it's theoretically possible that Google or Bing could learn about a page on your site from a link, and then not crawl it immediately or at all. In this case, they will know the URL, but won't know about any X-Robots-Tag headers or meta tags. Thus, they might include the document against your wishes. For this reason, it's important to include private pages in your sitemap.xml file, inviting and encouraging Google and Bing to crawl the page specifically so the page can be excluded from their indexes.
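A sketch of that last point, assuming a hypothetical `build_sitemap` helper and placeholder URLs: the private pages are deliberately listed next to the public ones, so that crawlers fetch them and see the noindex directive.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(public_urls, private_urls):
    """Build a sitemap that lists private pages alongside public ones."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # Private pages are included on purpose: a crawler has to fetch a
    # page before it can see that page's noindex tag or header.
    for loc in list(public_urls) + list(private_urls):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
    return ET.tostring(urlset, encoding="unicode")
```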

Yahoo! uses Bing to power their search engine, and AOL uses Google, so the above strategy applies to them as well.

Other search engines take a different approach to robots.txt. Ask.com, The Internet Archive and the Russian search engine Yandex.ru all respect the robots meta tag, but not the x-robots-tag HTTP header. Thus, for these search engines, the strategy above works for HTML files, but not for any other files. These crawlers therefore need to be blocked from accessing those other files. On the upside, unlike Google and Bing, it appears that these search engines will not show a document in their results if they have not crawled it. Thus, using robots.txt alone should be sufficient.

A third class of search engines supports neither the robots HTML meta tag nor the x-robots-tag HTTP header. These are typically less popular or less mature crawlers, and so they must be blocked using robots.txt. There are two approaches to this. The first is to list blocked pages individually in the robots.txt file, and the second is to simply block these search engines from all access. While it's possible to list each private document in robots.txt, doing so creates a privacy loophole, since it lists all private documents in one place. At CourtListener, therefore, we take a conservative approach, and completely block all search engines that do not support the HTML meta tag or the x-robots-tag HTTP header.
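The conservative approach can be sketched as follows. The crawler names passed in are made up, and `build_robots_txt` and `can_crawl` are hypothetical helpers; the check uses Python's standard `urllib.robotparser` to confirm how a compliant crawler would read the file.

```python
from urllib import robotparser


def build_robots_txt(blocked_agents):
    """Block the named crawlers entirely and leave everyone else unrestricted."""
    lines = []
    for agent in blocked_agents:
        # "Disallow: /" shuts this crawler out of the whole site, so no
        # private URL ever has to be listed in robots.txt.
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # An empty Disallow means "crawl anything", which lets the remaining
    # crawlers reach the noindex tags and headers.
    lines += ["User-agent: *", "Disallow:"]
    return "\n".join(lines) + "\n"


def can_crawl(robots_txt, agent, url):
    """Ask the standard-library parser whether agent may fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With `robots_txt = build_robots_txt(["ExampleBot"])`, the call `can_crawl(robots_txt, "ExampleBot", "/some-case/")` is false, while the same call for a crawler that is not listed is true.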

The final action we take when we receive a request that a document on our site stop appearing in search results is to use the webmaster tools provided by the major search engines [1] to explicitly ask those search engines to exclude the document(s) from their results.

Between these measures, private documents on CourtListener should be removed from all major and minor search engines. Where possible, this strategy takes a very granular approach, and where minor search engines do not support certain standards, we take a conservative approach, blocking them entirely.

Update, 2012-04-29: You may also want to look at our discussion of the impact of putting people's names into your URLs, and the way that affects your sitemap files.

  1. We use Google's Webmaster Tools and Bing's Webmaster Tools. Before it was merged into Bing's tools, we also used Yahoo's Site Explorer.

Project Idea: "Bug Trackers for Cities."

Well, today's project idea was to post about the use of bug trackers for the management of city problems, but as it turns out, I'm behind the curve on this one, so I'll just explain the concept, and post some links to people who have live implementations or have already blogged about this. When I first researched this idea about six months ago, I didn't find anything, but it seems that steam is building behind it.

Essentially, the idea is this: Cities have problems that citizens know about such as potholes, busted lampposts, gang activity, etc. They want to report these things to the city, but unfortunately reporting the problems by the phone or navigating the city websites is usually an awful, time-consuming, and unrewarding experience. It goes like this: First you get bumped from one department to another, eventually finding somebody who seems like they care. You tell them about the problem and feel satisfied that you've done your part, but you don't know if it's really in their system, or when it's going to get fixed or anything. You hang up the phone, and the problem is still a part of your daily life. You know if you call again, you won't be able to get an update, and you resign yourself to simply hoping that the problem will eventually be resolved. The next time you notice something that's in need of fixing, you're less likely to try to help. As this goes on, eventually the people that once cared no longer do, and getting residents of a city engaged in the problems in their community becomes increasingly difficult.

In the software world, there is a similar phenomenon, except instead of infrastructure and safety problems, the problems are errors in the software that need to be fixed – bugs. The solution to getting these bugs triaged and managed is to use what's known as a bug tracker. These systems allow the programmers behind the software to respond to problems that people find, and to triage them appropriately. In addition, they allow other people to vote on bugs, and help solve them. They allow careful prioritization of the bugs, and they allow visualizations of the bugs to be created, such as how quickly each department fixes them, the oldest bug in the system, etc.

If such a system were used for citizens to track problems they find in their city, it would have all kinds of benefits, and indeed a few such systems have been created. The most popular that I have found is called SeeClickFix, and looking at the page for Berkeley, it seems to be a system that is at least used by Berkeley residents. Another popular one is http://www.fixmystreet.com/. Of course, for the system to be truly effective, it would have to be endorsed by the city itself, and used by its employees as well, which is something I have yet to find an example of.

Other people have also written about this idea, and Portland appears to be considering it, so it seems this idea is ripe on the vine and ready to be picked.

The question now is what will it take to implement it correctly, and what system will be the one that gains usage. I fully expect to see more cities using this type of technology in the next few years.

An Analysis of FTC Behavioral Advertising and an End of Semester Countdown

It's coming down to the end of the semester, and after I finished the attached paper on FTC laws as they apply to online advertising, I did some calculations to figure out what I still have to do.

Turns out I have 68-95 pages to write (give or take), and two projects to complete between today and early May. Things are going to get interesting.

The lay of the land looks like this:

  • Two law/policy papers - total of 35-50 pages
  • One sociology paper - 25-35 pages
  • A final project combining some aesthetics work I have been doing
  • Two technology strategy assessments - total of eight pages
  • And an online project - watch for this soon

For now, I'll reserve my thoughts on the attached analysis, but I tried to analyze the ways that the FTC regulates online advertising...within an eight page limit.

Rebuilding the 9th Ward in New Orleans

I've spent Thanksgiving in the Deep South this year at my girlfriend's grandparents' house. It's been a great trip, but one thing about it keeps plaguing me.

We spent a couple of days in New Orleans, and one of the places I insisted upon seeing was the 9th Ward, where Katrina dealt some of the worst damage. I wanted to see if people were recovering, and how much the place had been fixed up since 2005 when the storm blew through.

What surprised me (aside from the markings) is that people are moving back into this area, rebuilding their houses, and generally pouring money into the location. On the one hand, it makes sense since it's their home, but on the other, I can't help but think that somebody should stop this from happening. From what I've learned while here, the 9th Ward is an area of New Orleans that is going to flood if there is another Hurricane Katrina...which there will be sooner or later. Considering that Katrina killed almost 2,000 people, this seems rather unwise.

So, rather than moving back in, shouldn't we be cutting our losses right about now, finding somewhere for these people to live, and preventing another disaster? Or is my logic flawed?

A Music Cost Inventory


According to Title 17, Chapter 5, section 504(c)(2) of US copyright law, if you get caught with music that you have downloaded illegally from the Internet, you can be charged up to $150,000 per infringement. I thought I would do a little experiment to see how much I would be in for if my entire collection were to be found to be illegal.

Let's do some math. I have 3,876 tracks, at $150,000 each. So if my entire collection were to be found illegal, that means it would cost me $581.4 million, or about $0.6 billion.

OK, let's assume that I can live with that reality. It just seems odd that I could have bought those songs for $3,876 on amazon.com, or iTunes.

Something isn't quite right here. Also, did I mention that all US digital music sales are estimated to total $2.9B in 2007? That makes my music worth about 20% of the 2007 revenue.
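For anyone who wants to check the arithmetic, here is a quick sketch. The $1-per-track price is an assumption implied by the $3,876 figure above; the other numbers come straight from the post.

```python
TRACKS = 3_876
MAX_DAMAGES_PER_TRACK = 150_000        # dollars, the statutory maximum cited above
PRICE_PER_TRACK = 1                    # dollars, assumed from the $3,876 total
US_DIGITAL_SALES_2007 = 2_900_000_000  # dollars, the estimate cited above

max_liability = TRACKS * MAX_DAMAGES_PER_TRACK
purchase_cost = TRACKS * PRICE_PER_TRACK
share_of_sales = max_liability / US_DIGITAL_SALES_2007

print(f"Maximum liability: ${max_liability:,}")        # $581,400,000
print(f"Cost to buy the tracks: ${purchase_cost:,}")   # $3,876
print(f"Share of 2007 sales: {share_of_sales:.1%}")    # 20.0%
```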
