<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Michael Jay Lissner</title><link href="https://michaeljaylissner.com/" rel="alternate"></link><link href="https://michaeljaylissner.com/feeds/tag/data" rel="self"></link><id>https://michaeljaylissner.com/</id><updated>2010-08-02T13:02:42-07:00</updated><entry><title>Project Idea: “Community-Curated Data Repository”</title><link href="https://michaeljaylissner.com/posts/2010/08/02/project-idea-community-curated-data-repository/" rel="alternate"></link><updated>2010-08-02T13:02:42-07:00</updated><author><name>Mike Lissner</name></author><id>tag:michaeljaylissner.com,2010-08-02:posts/2010/08/02/project-idea-community-curated-data-repository/</id><summary type="html">&lt;p&gt;There&amp;#8217;s an interesting problem that I&amp;#8217;ve run into a number of times that goes 
like this: You want to start a new project studying &lt;strong&gt;X&lt;/strong&gt; dump of 
data, and you have a great idea of how to do &lt;strong&gt;Y&lt;/strong&gt; with it. You 
go download the data, but then you spend hours (or even days and weeks) manipulating 
it, manicuring it, and stuffing it neatly into a database. The problem is that 
the data is in &lt;em&gt;their&lt;/em&gt; format, and they probably haven&amp;#8217;t told you much 
about it, much less put it into a useful format for other people. You have no 
option but to figure it out, optimize it, make it queryable, etc., when really, 
what you wish you were doing was simply &lt;em&gt;working with it&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;In other words, the data format and quality keep you from working with the 
data itself. I&amp;#8217;ve run into this a number of times, most notably when trying to 
work with the &lt;a href="http://www.recovery.gov/FAQ/Pages/DownLoadCenter.aspx"&gt;Recovery 
Data&lt;/a&gt;. I&amp;#8217;ve also had fun working with &lt;a href="http://census.gov"&gt;census 
data&lt;/a&gt;, geographic data, and the list goes on. There are any number of useful 
data sources that are provided by non-profits and government bodies, such as 
population, economic, health, and agricultural&amp;nbsp;data.&lt;/p&gt;
&lt;p&gt;The solution to this problem is simple. A community needs to be built around 
curating the data and providing it in useful formats, and a repository of some 
sort needs to be made so people can download &lt;em&gt;and install&lt;/em&gt; the data. 
Similar ideas have come up a few times in various forms. Most notably, 
Google has taken a stab at solving this with their &lt;a href="http://www.google.com/publicdata/home" 
target="_blank"&gt;public data sets&lt;/a&gt;, and back around the turn of the 
millennium, Debian &lt;a href="http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=38902" 
target="_blank"&gt;considered making a repository&lt;/a&gt; for the&amp;nbsp;data.&lt;/p&gt;
&lt;p&gt;Neither of these solutions is good enough, though. In Google&amp;#8217;s case, they&amp;#8217;re 
providing a one-way street: They choose the data source, they tune up the 
data, and they provide the data. If there&amp;#8217;s a source you don&amp;#8217;t like, or if 
it&amp;#8217;s in a format you don&amp;#8217;t like, well, too bad. In the case of Debian, they 
decided not to go for it, but they should have. They had the right idea, but 
weren&amp;#8217;t prepared to give the idea its&amp;nbsp;due.&lt;/p&gt;
&lt;p&gt;The right solution will be one in which the community can suggest and debate 
data sources, and which treats the data with the respect it deserves. I think 
we&amp;#8217;ll see a repository like this eventually, but I fear that until we do, 
researchers around the world will be stuck doing unnecessary data&amp;nbsp;transformations.&lt;/p&gt;</summary><category term="recovery"></category><category term="Project idea"></category><category term="debian"></category><category term="data"></category><category term="curation"></category></entry></feed>