<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Michael Jay Lissner</title><link href="https://michaeljaylissner.com/" rel="alternate"></link><link href="https://michaeljaylissner.com/feeds/tag/juriscraper" rel="self"></link><id>https://michaeljaylissner.com/</id><updated>2012-10-09T19:42:47-07:00</updated><entry><title>Presentation on Juriscraper and CourtListener for LVI2012</title><link href="https://michaeljaylissner.com/posts/2012/10/09/presentation-on-juriscraper-and-courtlistener-for-lvi2012/" rel="alternate"></link><updated>2012-10-09T19:42:47-07:00</updated><author><name>Mike Lissner</name></author><id>tag:michaeljaylissner.com,2012-10-09:posts/2012/10/09/presentation-on-juriscraper-and-courtlistener-for-lvi2012/</id><summary type="html">&lt;p&gt;Yesterday and today I&amp;#8217;ve been in Ithaca, New York, participating in the Law 
via the Internet Conference (&lt;span class="caps"&gt;LVI&lt;/span&gt;), where I&amp;#8217;ve been learning&amp;nbsp;tons!&lt;/p&gt;
&lt;p&gt;I had the good fortune to have my proposal topic selected for &lt;a href="http://blog.law.cornell.edu/lvi2012/overview/track-4-application-development-for-open-access-and-engagement/"&gt;Track 4: 
Application Development for Open Access and Engagement&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the interest of sharing, I&amp;#8217;ve &lt;a href="https://michaeljaylissner.com/pdfs/LVI-Presentation-Lissner-Juriscraper.pdf"&gt;attached the latest version of my slides&lt;/a&gt; 
to this blog post, and the audio for the talk may eventually get posted &lt;a href="http://blog.law.cornell.edu/lvi2012/presentation/wrangling-court-data-on-a-national-level/"&gt;on
the &lt;span class="caps"&gt;LVI&lt;/span&gt; site&lt;/a&gt;.&lt;/p&gt;</summary><category term="lvi2012"></category><category term="lvi"></category><category term="juriscraper"></category><category term="CourtListener"></category><category term="Cornell"></category></entry><entry><title>New tool for testing lxml XPath queries</title><link href="https://michaeljaylissner.com/posts/2012/05/20/new-tool-for-testing-lxml-xpath-queries/" rel="alternate"></link><updated>2012-05-20T15:48:06-07:00</updated><author><name>Mike Lissner</name></author><id>tag:michaeljaylissner.com,2012-05-20:posts/2012/05/20/new-tool-for-testing-lxml-xpath-queries/</id><summary type="html">&lt;p&gt;I got a bit frustrated today, and decided that I should build a tool to fix my frustration. The problem was that we&amp;#8217;re using a lot of XPath queries to scrape various court websites, but there was no tool that could be used to test xpath expressions&amp;nbsp;efficiently.&lt;/p&gt;
&lt;p&gt;There are a couple of tools that are quite similar to what I just built: there&amp;#8217;s one called Xacobeo, Eclipse has one built in, and even Firebug has a tool that does something similar. Unfortunately, each of these operates on a different &lt;span class="caps"&gt;DOM&lt;/span&gt; interpretation than the one that lxml&amp;nbsp;builds.&lt;/p&gt;
&lt;p&gt;So while these tools helped, I consistently found that when the &lt;span class="caps"&gt;HTML&lt;/span&gt; got nasty, they&amp;#8217;d start falling&amp;nbsp;over.&lt;/p&gt;
&lt;p&gt;No more! Today I built &lt;a href="https://github.com/mlissner/lxml-xpath-tester/"&gt;a quick Django app&lt;/a&gt; that can be run locally or on a server. It&amp;#8217;s quite simple. You input some &lt;span class="caps"&gt;HTML&lt;/span&gt; and an XPath expression, and it will tell you the matches for that expression. It has syntax highlighting, and a few other tricks up its sleeve, but it&amp;#8217;s pretty basic on the&amp;nbsp;whole.&lt;/p&gt;
&lt;p&gt;I&amp;#8217;d love to get any feedback I can about this. It&amp;#8217;s probably still got some bugs, but it&amp;#8217;s small enough that they should be quite easy to stamp&amp;nbsp;out.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; I got in touch with the developer of Xacobeo. There&amp;#8217;s an &lt;code&gt;--html&lt;/code&gt; 
flag that you can pass to it at startup, if that&amp;#8217;s your intention. If you use 
that, it indeed uses the same &lt;span class="caps"&gt;DOM&lt;/span&gt; parser that my tool does. Sigh. Affordances 
are important, especially in a &lt;span class="caps"&gt;GUI&lt;/span&gt;-based&amp;nbsp;tool.&lt;/p&gt;</summary><category term="Python"></category><category term="lxml"></category><category term="juriscraper"></category><category term="CourtListener"></category></entry><entry><title>My Presentation Proposal for LVI 2012</title><link href="https://michaeljaylissner.com/posts/2012/03/15/my-presentation-proposal-for-lvi-2012/" rel="alternate"></link><updated>2012-03-15T20:09:29-07:00</updated><author><name>Mike Lissner</name></author><id>tag:michaeljaylissner.com,2012-03-15:posts/2012/03/15/my-presentation-proposal-for-lvi-2012/</id><summary type="html">&lt;p&gt;The &lt;a href="http://blog.law.cornell.edu/lvi2012/"&gt;Law Via the Internet&lt;/a&gt; conference is  celebrating its 20th anniversary
at Cornell University on October 7-9th. I will be attending, 
and with any luck, I&amp;#8217;ll be presenting on the topic proposed&amp;nbsp;below.&lt;/p&gt;
&lt;h3 id="wrangling-court-data-on-a-national-level"&gt;Wrangling Court Data on a National&amp;nbsp;Level&lt;/h3&gt;
&lt;p&gt;Access to case law has recently become easier than ever: By simply visiting 
a court&amp;#8217;s website it is now possible to find and read thousands of cases 
without ever leaving your home. At the same time, there are nearly a hundred
 court websites, many of which suffer from poor funding or 
 prioritization, and gaining a higher-level view of the law can be 
 challenging. &amp;#8220;&lt;a href="https://github.com/freelawproject/juriscraper/"&gt;Juriscraper&lt;/a&gt;&amp;#8221; is a new project designed to ease these 
 problems for all those who wish to collect these court opinions daily. The
  project is under active development, and we are looking for others to get&amp;nbsp;involved.&lt;/p&gt;
&lt;p&gt;Juriscraper is a liberally-licensed open source library that can be picked 
up and used by any organization to scrape the case data from court websites.
 In addition to simply scraping the websites and extracting metadata from 
 them, Juriscraper has a number of other design&amp;nbsp;goals:   &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extensibility to support video, oral argument audio, and other media&amp;nbsp;types&lt;/li&gt;
&lt;li&gt;Support for all metadata provided by court&amp;nbsp;websites&lt;/li&gt;
&lt;li&gt;Extensibility to support varied geographies and&amp;nbsp;jurisdictions&lt;/li&gt;
&lt;li&gt;Generalized object-oriented architecture with little or no code&amp;nbsp;repetition&lt;/li&gt;
&lt;li&gt;Standardized coding techniques using the latest libraries and standards (Python, xpath, lxml, requests,&amp;nbsp;chardet)&lt;/li&gt;
&lt;li&gt;Simple installation, configuration, and &lt;span class="caps"&gt;API&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Friendly and transparent to court&amp;nbsp;websites&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As well as a number of&amp;nbsp;features:  &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Harmonization of metadata (&lt;span class="caps"&gt;US&lt;/span&gt;, &lt;span class="caps"&gt;USA&lt;/span&gt;, United States of America, 
 etc. &amp;#8594; United States; et al, et. al., etc. get eliminated; vs., v, 
 vs &amp;#8594; v.; all dates are Python objects;&amp;nbsp;etc.)&lt;/li&gt;
&lt;li&gt;Smart title-casing of case names (several courts provide case names in 
 uppercase&amp;nbsp;only)&lt;/li&gt;
&lt;li&gt;Sanity checking and sorting of metadata values returned by court&amp;nbsp;websites&lt;/li&gt;
&lt;/ul&gt;
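&lt;p&gt;To make the harmonization and title-casing features concrete, here is a minimal Python sketch of the rules listed above. It is an illustration only, not Juriscraper&amp;#8217;s actual code: the function names, the regular expressions, the list of small words, and the assumed month/day/year date format are all assumptions made for this example.&lt;/p&gt;

```python
import re
from datetime import datetime

# Hypothetical rules for illustration; not Juriscraper's real implementation.
US_VARIANTS = re.compile(r"\b(?:United States of America|USA|US)\b")
ET_AL = re.compile(r",?\s*\bet\.?\s*al\b\.?", re.IGNORECASE)
VERSUS = re.compile(r"\s+(?:vs|v)\.?\s+", re.IGNORECASE)
SMALL_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "v."}


def harmonize_case_name(name):
    """Normalize country names, drop 'et al', and standardize 'v.'."""
    name = US_VARIANTS.sub("United States", name)
    name = ET_AL.sub("", name)
    name = VERSUS.sub(" v. ", name)
    return name.strip()


def titlecase_case_name(name):
    """Title-case a name only when the court supplied it in all caps."""
    if not name.isupper():
        return name
    words = []
    for index, word in enumerate(name.lower().split()):
        # Keep small words lowercase unless they start the name.
        if index and word in SMALL_WORDS:
            words.append(word)
        else:
            words.append(word.capitalize())
    return " ".join(words)


def parse_date(text):
    """Turn a date string into a Python date (format is an assumption)."""
    return datetime.strptime(text, "%m/%d/%Y").date()


print(harmonize_case_name("USA vs Smith, et al."))    # United States v. Smith
print(titlecase_case_name("UNITED STATES V. SMITH"))  # United States v. Smith
print(parse_date("10/09/2012"))                       # 2012-10-09
```

&lt;p&gt;A sketch this small has obvious blind spots. For example, the &lt;code&gt;VERSUS&lt;/code&gt; pattern would also rewrite a middle initial such as &amp;#8220;V.&amp;#8221;, so real rules need more context than a single regular expression provides.&lt;/p&gt;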
&lt;p&gt;Once implemented, Juriscraper is one half of a two-part system. The second half 
is the caller, which uses the &lt;span class="caps"&gt;API&lt;/span&gt; and must itself answer some interesting&amp;nbsp;questions:  &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How are duplicates detected and&amp;nbsp;avoided? &lt;/li&gt;
&lt;li&gt;How can the impact on court websites be&amp;nbsp;minimized?&lt;/li&gt;
&lt;li&gt;How can mime type detection be completed successfully so that textual contents can be&amp;nbsp;extracted?&lt;/li&gt;
&lt;li&gt;What should we do if it is an image-based &lt;span class="caps"&gt;PDF&lt;/span&gt;?&lt;/li&gt;
&lt;li&gt;How should &lt;span class="caps"&gt;HTML&lt;/span&gt; be&amp;nbsp;tidied?&lt;/li&gt;
&lt;li&gt;How often should we check a court website for new&amp;nbsp;content?&lt;/li&gt;
&lt;li&gt;What should we do in case of&amp;nbsp;failure?&lt;/li&gt;
&lt;/ul&gt;
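&lt;p&gt;The first question, duplicate detection, has more than one reasonable answer. As one hedged illustration (not necessarily what the CourtListener caller does), the caller could hash each downloaded file and skip any hash it has already seen. The class name &lt;code&gt;SeenDocuments&lt;/code&gt; and the choice of &lt;span class="caps"&gt;SHA&lt;/span&gt;-1 are assumptions made for this sketch.&lt;/p&gt;

```python
import hashlib


class SeenDocuments:
    """Hypothetical helper: remembers content hashes of saved documents."""

    def __init__(self):
        self._hashes = set()

    def is_new(self, content):
        """Return True the first time these bytes are seen, else False."""
        digest = hashlib.sha1(content).hexdigest()
        if digest in self._hashes:
            return False
        self._hashes.add(digest)
        return True


seen = SeenDocuments()
print(seen.is_new(b"opinion text"))  # True
print(seen.is_new(b"opinion text"))  # False
```

&lt;p&gt;This version keeps its hashes in memory, so it forgets everything between runs. A long-running caller would more plausibly store them in a database.&lt;/p&gt;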
&lt;p&gt;Juriscraper is currently deployed by CourtListener.com to scrape all of the 
Federal Appeals courts, and we will be adding state courts 
over the coming&amp;nbsp;weeks.&lt;/p&gt;
&lt;p&gt;We have been scraping these sites in various ways for several years, 
and Juriscraper is the culmination of what we&amp;#8217;ve learned. We hope that by 
presenting our work at &lt;span class="caps"&gt;LVI&lt;/span&gt; 2012, we will be able to share what we have 
learned and gain additional collaborators in our&amp;nbsp;work.&lt;/p&gt;</summary><category term="proposal"></category><category term="presentations"></category><category term="me"></category><category term="lvi2012"></category><category term="juriscraper"></category><category term="CourtListener"></category></entry></feed>