juriscraper

Presentation on Juriscraper and CourtListener for LVI2012

Yesterday and today I've been in Ithaca, New York, participating in the Law via the Internet Conference (LVI), where I've been learning tons!

I had the good fortune to have my proposal topic selected for Track 4: Application Development for Open Access and Engagement.

In the interest of sharing, I've attached the latest version of my slides to this blog post, and the audio for the talk may eventually get posted on the LVI site.

New tool for testing lxml XPath queries

I got a bit frustrated today and decided to build a tool to fix my frustration. We use a lot of XPath queries to scrape various court websites, but there was no tool for testing XPath expressions efficiently.

A few tools are quite similar to what I just built: there's one called Xacobeo, Eclipse has one built in, and even Firebug has a tool that does something similar. Unfortunately, each of these operates on a different DOM interpretation than the one lxml builds.

So while these tools helped, they consistently started falling over when the HTML got nasty.

No more! Today I built a quick Django app that can be run locally or on a server. It's quite simple. You input some HTML and an XPath expression, and it will tell you the matches for that expression. It has syntax highlighting, and a few other tricks up its sleeve, but it's pretty basic on the whole.
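To give a flavor of the core idea, here's a minimal sketch of running an XPath expression against the DOM that lxml builds from messy HTML. This is not the app's actual code: the `xpath_matches` function and the sample markup are made up for illustration, and it assumes lxml is installed.

```python
from lxml import html


def xpath_matches(html_text, expression):
    """Parse HTML with lxml's lenient HTML parser and return each match as a string."""
    tree = html.fromstring(html_text)
    matches = tree.xpath(expression)
    # Expressions such as count(//a) return a single number or boolean, not a list.
    if not isinstance(matches, list):
        matches = [matches]
    results = []
    for match in matches:
        if isinstance(match, html.HtmlElement):
            # Serialize element matches back to markup, without trailing text.
            results.append(html.tostring(match, encoding="unicode", with_tail=False))
        else:
            # Text nodes and attribute values already behave like strings.
            results.append(str(match))
    return results


# Hypothetical input: note the unclosed <li> tags, typical of nasty court HTML.
sample = (
    "<html><body><ul>"
    "<li><a href='/a.pdf'>Smith v. Jones</a>"
    "<li><a href='/b.pdf'>Doe v. Roe</a>"
    "</ul></body></html>"
)

case_names = xpath_matches(sample, "//li/a/text()")
links = xpath_matches(sample, "//li/a/@href")
print(case_names)
print(links)
```

Because the parsing is done by lxml itself, whatever an expression matches here is what it will match in a scraper built on lxml.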

I'd love to get any feedback I can about this. It's probably still got some bugs, but it's small enough that they should be quite easy to stamp out.

Update: I got in touch with the developer of Xacobeo. There's an --html flag that you can pass to it at startup if you want it to parse HTML. If you use that, it does indeed use the same DOM parser that my tool does. Sigh. Affordances are important, especially in a GUI-based tool.

My Presentation Proposal for LVI 2012

The Law via the Internet conference is celebrating its 20th anniversary at Cornell University on October 7-9. I will be attending, and with any luck, I'll be presenting on the topic proposed below.

Wrangling Court Data on a National Level

Access to case law has recently become easier than ever: by simply visiting a court's website, it is now possible to find and read thousands of cases without ever leaving your home. At the same time, there are nearly a hundred court websites, many of which suffer from poor funding or prioritization, and gaining a higher-level view of the law can be challenging. “Juriscraper” is a new project designed to ease these problems for all those who wish to collect these court opinions daily. The project is under active development, and we are looking for others to get involved.

Juriscraper is a liberally-licensed open source library that can be picked up and used by any organization to scrape the case data from court websites. In addition to simply scraping the websites and extracting metadata from them, Juriscraper has a number of other design goals:

  • Extensibility to support video, oral argument audio, and other media types
  • Support for all metadata provided by court websites
  • Extensibility to support varied geographies and jurisdictions
  • Generalized object-oriented architecture with little or no code repetition
  • Standardized coding techniques using the latest libraries and standards (Python, XPath, lxml, requests, chardet)
  • Simple installation, configuration, and API
  • Friendly and transparent to court websites

As well as a number of features:

  • Harmonization of metadata (US, USA, United States of America, etc. → United States; et al, et. al., etc. get eliminated; vs., v, vs → v.; all dates are Python objects; etc.)
  • Smart title-casing of case names (several courts provide case names in uppercase only)
  • Sanity checking and sorting of metadata values returned by court websites
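As an illustration of the kind of harmonization described above, here's a rough sketch using only the standard library. It is not Juriscraper's implementation: the `harmonize` and `parse_date` helpers, the regular expressions, and the list of date formats are all assumptions chosen for the example.

```python
import re
from datetime import date, datetime

# Hypothetical set of date formats; real court sites vary far more than this.
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d")


def harmonize(case_name):
    """Normalize common variations in a case name."""
    # US, USA, U.S., United States of America -> United States
    name = re.sub(
        r"\b(?:United States of America|U\.S\.A\.|USA|U\.S\.|US)(?=\s|,|$)",
        "United States",
        case_name,
    )
    # vs., vs, v, v. -> v.
    name = re.sub(r"\s+(?:vs\.?|v\.?)\s+", " v. ", name, flags=re.IGNORECASE)
    # Drop "et al", "et. al.", and similar, along with a leading comma.
    name = re.sub(r",?\s*\bet\.?\s+al\.?(?=\s|,|$)", "", name, flags=re.IGNORECASE)
    # Collapse any leftover runs of whitespace.
    return re.sub(r"\s+", " ", name).strip()


def parse_date(text):
    """Return a datetime.date for the first format that fits, else raise ValueError."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError("Unrecognized date format: %r" % text)


print(harmonize("USA vs Jones, et. al."))
print(parse_date("October 7, 2012"))
```

Raising an error on an unrecognized date, rather than guessing, fits the sanity-checking goal: a bad value from a court website gets noticed instead of stored.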

Once implemented, Juriscraper is one half of a two-part system. The second half is the caller, which uses Juriscraper's API and which itself must answer some interesting questions:

  • How are duplicates detected and avoided?
  • How can the impact on court websites be minimized?
  • How can MIME type detection be completed successfully so that textual contents can be extracted?
  • What should we do if it is an image-based PDF?
  • How should HTML be tidied?
  • How often should we check a court website for new content?
  • What should we do in case of failure?
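To make the first of these questions concrete, one possible approach to duplicate detection is to hash the content of each downloaded document and skip anything already seen. The sketch below is only an assumption about how a caller might do this, not a description of what CourtListener does; the `DuplicateChecker` class is hypothetical, and a real caller would keep the hashes in a database rather than in memory.

```python
import hashlib


class DuplicateChecker:
    """Track content hashes so the same document is only processed once."""

    def __init__(self):
        # In-memory store for the example; a real caller would persist this.
        self._seen = set()

    def is_new(self, content):
        """Return True the first time these bytes are seen, False afterwards."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


checker = DuplicateChecker()

# Hypothetical downloads: the third item repeats the first.
downloads = [b"Opinion A", b"Opinion B", b"Opinion A"]
fresh = [doc for doc in downloads if checker.is_new(doc)]
print(len(fresh))
```

Hashing the document itself rather than its URL means a duplicate is still caught when a court re-posts the same file at a new address, though it will not catch a re-issued file whose bytes differ.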

Juriscraper is currently deployed by CourtListener.com to scrape all of the federal appeals courts, and we will be gradually adding state courts over the coming weeks.

We have been scraping these sites in various ways for several years, and Juriscraper is the culmination of what we've learned. We hope that by presenting our work at LVI 2012, we will be able to share what we have learned and gain additional collaborators in our work.
