Making open data consumable

 Open Data Index

There is an interesting organization which assesses the state of open data around the world’ – the Open Data Index.  The need for open data is to make governments transparent and accountable.  An important goal of open data is to make the data machine readable.  The organization maintains an index/score of open data for various countries.

Open Data Index

Open Data Index

Although the organization tracks whether the data format is in spreadsheet or PDF format, I think current technologies can benefit much more if the data is further refined.  I think with very little effort the data can be made available in a format which would make it much more consumable.  Creating the data in a simple JSON format can preserve the relationships among the data.  This can make it much easier to analyze and understand.

In a previous blog post I have converted Singapore’s 2014 budget expenditure to JSON format.  You can see the visualization and notice that it allows you to easily drill down into the details of the data.  In the coming weeks I will show other visualizations of this data.  In this post I explain how I converted the data to JSON.

Original PDF

Singapore’s budget is available for download in PDF format.

Original PDF

Original PDF

 

Copy PDF data to excel

I copy the data from the PDF to excel.  I use ‘Alt’ + left mouse click and drag on the pdf to copy columns.  I create indents using ‘Insert’ (in excel).  I first copy the titles and paste them in excel.  I then copy and paste the values.  I do this for each page in the PDF. It takes me around 15 minutes to convert each pdf to excel (It helps if you are comfortable with working with raw data.  It might take others more time.)

Excel from PDF

Excel from PDF

Labeling nodes

This is the critical step in the conversion to excel.

  1. I label the items which do not have children as ‘nodes’ – notice the letter ‘n’ in the first column of the excel.
  2. Note also that I create a hierarchy in excel.  Each subcategory is in the column after that of the parent.

Convert to JSON

I exported the excel created in the last step into a .csv file.  I used Ruby to convert the .csv to .json.  This is very straightforward for most programmers.  Here is how I converted the file:

I read the file and convert it into a tree

Read each line

If it’s a leaf create a node with content

    else create a node with children

         recursively add each child to the tree

If it’s a parent

   I save the node as a parent at that level

  I add the node to the parent at the previous level

Traverse through the tree and convert into arrays of hashes

Write to JSON

 I created JSON files for each ministry. I then wrote another script to combine the individual files into a single file.  After creating each JSON file I used a visualization to check if there were any problems.

Singapore’s Budget Data – Exemplary

Singapore’s budget data is very easy to understand (relatively).  Of course this probably is helped because of the size of Singapore.  Regardless, there are some good lessons for open data.  The most important property of the data is that it is very consistent.  The broad categories are the same across all the ministries – operational expenditure, development expenditure.  The sub-categories are also very consistent.  Finally, in PDF, the format is very clear.

 Downloads/links

Before downloading any file, please remember my disclaimer, “IT’S ALL YOUR FAULT”.

Singapore 2014 budget data

Sample JSON file of Ministry of Health

JSON file used in my visualization

The Ministry of Health file shows the heirarchy clearly.