Wednesday 28 April 2010

data.gov.uk

Over the past couple of weeks at work I have been attempting to search for and download datasets from data.gov.uk for use in a project, and it has been rather frustrating.

For those who haven't come across data.gov.uk, it is perhaps the greatest opportunity for HM Government to open its wealth of non-personal data to the public, research groups and private companies. More than that, there actually seems to be a group of people behind the project who are really passionate about the provision of data. The site is currently still in a beta phase, with new datasets being added constantly.


The first thing to establish is quite what data.gov.uk is. Before I started this project I thought that it was a data repository from which I could download data tables. To some extent it is, but arguably it is more of a data search tool which provides links to data held by different government departments and other organisations. SPARQL adds a bit of a twist to this, but more of SPARQL later.


The issue here is that each department or organisation is only interested in the data that it provides: the Justice department provides information on crime and justice, health data comes from the NHS and Primary Care Trusts, and so forth. This is understandable I suppose, but it means that there is no overall attempt to make any of the data comparable with other datasets. Data from the Justice department is often recorded against a different geography from that of the NHS, either in terms of boundaries or of geographic resolution. Census data is recorded against another level of geography, while election results are stored against yet another.

A further issue posed by the lack of coordination seems pretty straightforward but makes the analysis of data even more difficult. As yet there is no agreed naming convention for the different levels of geography. For example, a local Primary Care Trust might record data against the “London Borough of Brent”, a police authority may record data against simply “Brent”, while census information may be recorded against “Brent LB” or even a borough code, “00AE”. While it is simple for a human to see that at least three of the names here refer to the same place, a computer cannot do so without being told. Consequently a manual process of ensuring that all names are the same is required. This is a very time-consuming process and has to be repeated for almost every dataset that you wish to download and map.
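As a rough sketch of what that clean-up looks like when scripted, the snippet below reduces the four Brent labels above to a single code. The lookup table, the list of "noise" words and the function names are my own illustrative assumptions, not anything published on data.gov.uk, and a real dataset would need a far longer lookup.

```python
import re
from typing import Optional

# Hypothetical lookup from a cleaned-up name to an area code. "00AE" is the
# Brent code quoted above; a real table would cover every area.
CODE_BY_NAME = {"brent": "00AE"}

# Words that carry no identifying information in this toy example.
_NOISE = re.compile(r"\b(london borough of|lb|borough|council)\b")


def normalise(name: str) -> str:
    """Reduce a borough label to a bare lower-case name."""
    cleaned = _NOISE.sub(" ", name.lower())
    cleaned = re.sub(r"[^a-z ]", " ", cleaned)  # drop punctuation and digits
    return " ".join(cleaned.split())            # collapse repeated spaces


def to_code(label: str) -> Optional[str]:
    """Map a label (a name or a code) to one area code, or None if unknown."""
    label = label.strip()
    if label.upper() in CODE_BY_NAME.values():
        return label.upper()                    # already a known code
    return CODE_BY_NAME.get(normalise(label))


for label in ["London Borough of Brent", "Brent", "Brent LB", "00AE"]:
    print(label, "->", to_code(label))
```

Anything the lookup does not recognise comes back as None, which at least makes the rows that still need checking by hand easy to find.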

The temporal aspects of data also have to be considered, as different datasets are recorded over different time periods. Data could be recorded for the calendar year, the financial year, quarterly, over a rolling period and so forth. In many cases, because the data is coming from different departments, there is no commonality between them.
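To make the mismatch concrete, here is a minimal sketch that measures how much a calendar year and a financial year actually overlap. It assumes a financial year running from 1 April to 31 March; the function names are mine, and other reporting periods would need their own definitions.

```python
from datetime import date


def calendar_year(year):
    """Return the (first day, last day) of a calendar year."""
    return date(year, 1, 1), date(year, 12, 31)


def financial_year(start_year):
    """Return the (first day, last day) of a financial year.

    Assumes the year runs from 1 April to 31 March.
    """
    return date(start_year, 4, 1), date(start_year + 1, 3, 31)


def overlap_days(a, b):
    """Count the days two inclusive date ranges have in common."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, (end - start).days + 1)


# Calendar 2008 shares only part of its days with financial year 2008/09.
print(overlap_days(calendar_year(2008), financial_year(2008)))
```

Two figures labelled "2008" can therefore describe periods that share only about three quarters of their days, which is worth knowing before putting them on the same map.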

This sounds very negative, and to some extent that is the case. It has been a very frustrating process of identifying data, downloading it and formatting it, only to find that the dataset against which I want to perform a comparison is in a different format, covers a different date range and is at a different geographic scale, rendering comparisons impossible.


This is perhaps the time to talk about SPARQL. If, like me, you don't know your SPARQL from your elbow, then data.gov.uk isn't really the place to look for answers. To understand SPARQL you need to understand the Semantic Web and RDF. So let's start with the Semantic Web, which was thought up by Tim Berners-Lee, inventor of the World Wide Web, URLs, HTTP and HTML. Web sites across the world store huge amounts of data, whether that data contains football results, weather reports, demographic information or crime statistics. In HTML, however, this data is difficult to use in the way that you might like, as it is generally unstructured.


What the Semantic Web attempts to do is to provide a structured format (built on syntaxes which use URIs to represent data) which can be queried or processed by machines. Incidentally, the framework behind these syntaxes is called the Resource Description Framework (RDF).
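The basic unit in RDF is a statement of three parts: a subject, a predicate and an object, with URIs used as the identifiers. The toy example below holds a few such statements as plain Python tuples and matches them against a pattern. The example.org URIs and property names are invented for illustration, and a real application would use a proper RDF library rather than a list.

```python
# Each fact is a (subject, predicate, object) triple. The example.org URIs
# and property names are made up; "00AE" is the Brent code quoted earlier.
EX = "http://example.org/"

TRIPLES = [
    (EX + "area/brent", EX + "prop/name", "Brent"),
    (EX + "area/brent", EX + "prop/code", "00AE"),
    (EX + "area/camden", EX + "prop/name", "Camden"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Everything known about Brent:
for triple in match(TRIPLES, s=EX + "area/brent"):
    print(triple)
```

Because every statement has the same shape, a machine can combine statements from different sources without knowing in advance how each source laid out its tables.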


So what do the Semantic Web and RDF have to do with data.gov.uk? Well, the clear plan of data.gov.uk is to store government data in RDF syntaxes which can then be queried using SPARQL, which is a query language and data access protocol for the Semantic Web. So if you know SPARQL you can get the data, or at least that's the idea. But SPARQL isn't a simple language to pick up and it isn't something that your average Joe is going to get into, and at the moment only a limited number of datasets are suitable for SPARQL queries.
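For a flavour of what a query looks like, here is a small sketch that holds a SPARQL query in a Python string and encodes it into a GET request URL, using the "query" parameter defined by the SPARQL protocol. The endpoint address is a placeholder of my own, not a real data.gov.uk service, and the query only asks for things that have a label.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the address of a real SPARQL service.
ENDPOINT = "http://example.org/sparql"

# Ask for up to ten resources that have an rdfs:label, along with the label.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thing ?label
WHERE { ?thing rdfs:label ?label }
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of an HTTP GET URL."""
    return endpoint + "?" + urlencode({"query": query.strip()})


print(build_request_url(ENDPOINT, QUERY))
```

Fetching that URL from a real endpoint would return the matching rows. The hard part is not the plumbing but knowing which URIs and properties a given dataset actually uses.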


SPARQL does provide an opportunity to query data in a way that wasn't previously possible, but it's pretty heavyweight and difficult to get into, and for many the barrier will be set too high.


Perhaps I'm asking for too much. Just releasing the data is a major step forward and for that we should be thankful, and yes, it is still in beta and I'm sure that over time it will improve. Actually being able to search for data and find it is something that I shouldn't understate either. It's as big as the Ordnance Survey finally making some of their data public (incidentally, data.gov.uk does find OS data as well), but for the geographer it's all a little frustrating.
