Michael Jay Lissner

New tool for testing lxml XPath queries

I got a bit frustrated today and decided to build a tool to fix it. The problem was that we’re using a lot of XPath queries to scrape various court websites, but there was no tool that could be used to test XPath expressions efficiently.

There are a couple of tools that are quite similar to what I just built: there’s one called Xacobeo, Eclipse has one built in, and even Firebug has a tool that does something similar. Unfortunately, each of these operates on a different DOM interpretation than the one that lxml builds.

So while these tools helped, I consistently found that once the HTML got nasty, they’d start falling over.
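To make the "different DOM" point concrete, here is a small sketch of one well-known mismatch. It is my own illustration, not a case from the scrapers: browsers typically insert a tbody element into tables that lack one, while lxml's HTML parser leaves the table as written. The sample markup and docket number are made up.

```python
from lxml import html

# Hypothetical page fragment: a table written without an explicit <tbody>.
page = "<table><tr><td>Docket 12-345</td></tr></table>"
doc = html.document_fromstring(page)

# An expression copied from a browser-based tool often includes tbody,
# because the browser's DOM has one. lxml's tree does not.
browser_style = doc.xpath("//table/tbody/tr/td/text()")

# The same cell, addressed the way lxml actually built the tree.
lxml_style = doc.xpath("//table/tr/td/text()")

print(browser_style)
print(lxml_style)
```

An expression that is correct in one tool can therefore match nothing in lxml, which is why testing against lxml's own tree matters.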

No more! Today I built a quick Django app that can be run locally or on a server. It’s quite simple. You input some HTML and an XPath expression, and it will tell you the matches for that expression. It has syntax highlighting, and a few other tricks up its sleeve, but it’s pretty basic on the whole.

I’d love to get any feedback I can about this. It’s probably still got some bugs, but it’s small enough that they should be quite easy to stamp out.

Update: I got in touch with the developer of Xacobeo. There’s an --html flag that you can pass to it at startup if you intend to work with HTML. If you use that, it does indeed use the same DOM parser that my tool does. Sigh. Affordances are important, especially in a GUI-based tool.
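Why the parsing mode matters so much can be sketched with lxml itself. This is only an illustration of XML versus HTML parsing in lxml, and says nothing about how Xacobeo is implemented; the snippet is made up.

```python
from lxml import etree

# Ordinary HTML, but not well-formed XML: the <br> is never closed.
snippet = "<p>one<br>two</p>"

# A strict XML parser rejects the markup outright.
try:
    etree.fromstring(snippet, etree.XMLParser())
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser accepts it and builds a tree that XPath can query.
root = etree.fromstring(snippet, etree.HTMLParser())
br_count = len(root.xpath("//br"))

print(xml_ok, br_count)
```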


Published

May 20, 2012

Category

Tech

Tags

  • CourtListener 17
  • juriscraper 3
  • lxml 1
  • Python 9


  • Unless mentioned otherwise, all material on this site is licensed under a Creative Commons copyright or the GNU Affero GPL. Privacy Policy.