<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Michael Jay Lissner</title><link href="https://michaeljaylissner.com/" rel="alternate"></link><link href="https://michaeljaylissner.com/feeds/tag/proposal" rel="self"></link><id>https://michaeljaylissner.com/</id><updated>2012-03-15T20:09:29-07:00</updated><entry><title>My Presentation Proposal for LVI 2012</title><link href="https://michaeljaylissner.com/posts/2012/03/15/my-presentation-proposal-for-lvi-2012/" rel="alternate"></link><updated>2012-03-15T20:09:29-07:00</updated><author><name>Mike Lissner</name></author><id>tag:michaeljaylissner.com,2012-03-15:posts/2012/03/15/my-presentation-proposal-for-lvi-2012/</id><summary type="html">&lt;p&gt;The &lt;a href="http://blog.law.cornell.edu/lvi2012/"&gt;Law Via the Internet&lt;/a&gt; conference is  celebrating its 20th anniversary
at Cornell University on October 7-9th. I will be attending, 
and with any luck, I&amp;#8217;ll be presenting on the topic proposed&amp;nbsp;below.&lt;/p&gt;
&lt;h3 id="wrangling-court-data-on-a-national-level"&gt;Wrangling Court Data on a National&amp;nbsp;Level&lt;/h3&gt;
&lt;p&gt;Access to case law has recently become easier than ever: By simply visiting 
a court&amp;#8217;s website it is now possible to find and read thousands of cases 
without ever leaving your home. At the same time, there are nearly a hundred
 court websites, many of which suffer from poor funding or low 
 prioritization, and gaining a higher-level view of the law can be 
 challenging. &amp;#8220;&lt;a href="https://github.com/freelawproject/juriscraper/"&gt;Juriscraper&lt;/a&gt;&amp;#8221; is a new project designed to ease these 
 problems for all those who wish to collect these court opinions daily. The
  project is under active development, and we are looking for others to get&amp;nbsp;involved.&lt;/p&gt;
&lt;p&gt;Juriscraper is a liberally-licensed open source library that can be picked 
up and used by any organization to scrape the case data from court websites.
 In addition to simply scraping the websites and extracting metadata from 
 them, Juriscraper has a number of other design&amp;nbsp;goals:   &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extensibility to support video, oral argument audio, and other media&amp;nbsp;types&lt;/li&gt;
&lt;li&gt;Support for all metadata provided by court&amp;nbsp;websites&lt;/li&gt;
&lt;li&gt;Extensibility to support varied geographies and&amp;nbsp;jurisdictions&lt;/li&gt;
&lt;li&gt;Generalized object-oriented architecture with little or no code&amp;nbsp;repetition&lt;/li&gt;
&lt;li&gt;Standardized coding techniques using the latest libraries and standards (Python, xpath, lxml, requests,&amp;nbsp;chardet)&lt;/li&gt;
&lt;li&gt;Simple installation, configuration, and &lt;span class="caps"&gt;API&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Friendly and transparent to court&amp;nbsp;websites&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As well as a number of&amp;nbsp;features:  &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Harmonization of metadata (&lt;span class="caps"&gt;US&lt;/span&gt;, &lt;span class="caps"&gt;USA&lt;/span&gt;, United States of America, 
 etc. &amp;#8594; United States; et al, et. al., etc. get eliminated; vs., v, 
 vs &amp;#8594; v.; all dates are Python objects;&amp;nbsp;etc.)&lt;/li&gt;
&lt;li&gt;Smart title-casing of case names (several courts provide case names in 
 uppercase&amp;nbsp;only)&lt;/li&gt;
&lt;li&gt;Sanity checking and sorting of metadata values returned by court&amp;nbsp;websites&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once implemented, Juriscraper is one half of a two-part system. The second part 
is the caller, which uses the &lt;span class="caps"&gt;API&lt;/span&gt; and must itself answer some interesting&amp;nbsp;questions:  &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How are duplicates detected and&amp;nbsp;avoided? &lt;/li&gt;
&lt;li&gt;How can the impact on court websites be&amp;nbsp;minimized?&lt;/li&gt;
&lt;li&gt;How can mime type detection be completed successfully so that textual contents can be&amp;nbsp;extracted?&lt;/li&gt;
&lt;li&gt;What should we do if it is an image-based &lt;span class="caps"&gt;PDF&lt;/span&gt;?&lt;/li&gt;
&lt;li&gt;How should &lt;span class="caps"&gt;HTML&lt;/span&gt; be&amp;nbsp;tidied?&lt;/li&gt;
&lt;li&gt;How often should we check a court website for new&amp;nbsp;content?&lt;/li&gt;
&lt;li&gt;What should we do in case of&amp;nbsp;failure?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Juriscraper is currently deployed by CourtListener.com to scrape all of the 
Federal Appeals courts, and we will be gradually adding state courts 
over the coming&amp;nbsp;weeks. &lt;/p&gt;
&lt;p&gt;We have been scraping these sites in various ways for several years, 
and Juriscraper is the culmination of what we&amp;#8217;ve learned. We hope that by 
presenting our work at &lt;span class="caps"&gt;LVI&lt;/span&gt; 2012, we will be able to share what we have 
learned and gain additional collaborators in our&amp;nbsp;work.&lt;/p&gt;</summary><category term="proposal"></category><category term="presentations"></category><category term="me"></category><category term="lvi2012"></category><category term="juriscraper"></category><category term="CourtListener"></category></entry></feed>