Making open data consumable
Open Data Index
There is an interesting organization which assesses the state of open data around the world’ – the Open Data Index. The need for open data is to make governments transparent and accountable. An important goal of open data is to make the data machine readable. The organization maintains an index/score of open data for various countries.
Although the organization tracks whether the data format is in spreadsheet or PDF format, I think current technologies can benefit much more if the data is further refined. I think with very little effort the data can be made available in a format which would make it much more consumable. Creating the data in a simple JSON format can preserve the relationships among the data. This can make it much easier to analyze and understand.
In a previous blog post I have converted Singapore’s 2014 budget expenditure to JSON format. You can see the visualization and notice that it allows you to easily drill down into the details of the data. In the coming weeks I will show other visualizations of this data. In this post I explain how I converted the data to JSON.
Singapore’s budget is available for download in PDF format.
Copy PDF data to excel
I copy the data from the PDF to excel. I use ‘Alt’ + left mouse click and drag on the pdf to copy columns. I create indents using ‘Insert’ (in excel). I first copy the titles and paste them in excel. I then copy and paste the values. I do this for each page in the PDF. It takes me around 15 minutes to convert each pdf to excel (It helps if you are comfortable with working with raw data. It might take others more time.)
This is the critical step in the conversion to excel.
- I label the items which do not have children as ‘nodes’ – notice the letter ‘n’ in the first column of the excel.
- Note also that I create a hierarchy in excel. Each subcategory is in the column after that of the parent.
Convert to JSON
I exported the excel created in the last step into a .csv file. I used Ruby to convert the .csv to .json. This is very straightforward for most programmers. Here is how I converted the file:
I read the file and convert it into a tree
Read each line
If it’s a leaf create a node with content
else create a node with children
recursively add each child to the tree
If it’s a parent
I save the node as a parent at that level
I add the node to the parent at the previous level
Traverse through the tree and convert into arrays of hashes
Write to JSON
I created JSON files for each ministry. I then wrote another script to combine the individual files into a single file. After creating each JSON file I used a visualization to check if there were any problems.
Singapore’s Budget Data – Exemplary
Singapore’s budget data is very easy to understand (relatively). Of course this probably is helped because of the size of Singapore. Regardless, there are some good lessons for open data. The most important property of the data is that it is very consistent. The broad categories are the same across all the ministries – operational expenditure, development expenditure. The sub-categories are also very consistent. Finally, in PDF, the format is very clear.
Before downloading any file, please remember my disclaimer, “IT’S ALL YOUR FAULT”.
The Ministry of Health file shows the heirarchy clearly.