
We needed to stop waving our arms around while talking about “knowledge visualization” and be able to point to something real. It also seemed like a great opportunity to try out some software: the Tornado web server, the Mappa implementation of Topic Maps for Python, and interactive graphics using HTML5 open standards (SVG, AJAX and JavaScript) rather than Flash, long a proprietary bête noire, now thankfully on its way out.
I remembered orinoco’s lovely Find the Country game (which I had earlier made an OLPC Activity out of) and kunuk’s further work making the SVG world map nicely zoomable, so it was clear what to start with on the tech front. Wolf found some great global energy data from the UN and had a vision for a punchy look for the site too.
There is a vast amount of legacy data in flat files, and when most people have some data they’d like to put shape to, they turn first to the venerable spreadsheet program, not to relational databases or some frisky new knowledge management app that lets them collaborate semantically (hmm, where are those, anyway? :-) So flat files are not going away anytime soon, and besides, they are the lowest common denominator for structured information, so we’d best have some tools for upgrading them semantically, no? It seemed appropriate, then, to make a tool to renormalize this flattened relational data so it can be processed semantically.
What do I mean by flattened relational data? It is a pattern of flat file organization that shows up when relational data is denormalized for export. Here’s an example. This is a very small sample from the original data:
Note the duplication of data in the ‘Series Name’ and ‘Country Name’ columns. For example, once the first row has established “Energy production (kt of oil equivalent)” as the human readable expansion of the code “EG.EGY.PROD.KT.OE”, every later appearance of it is redundant. Strictly speaking this violates the Second Normal Form (2NF) of the relational model, since the name depends on the code alone rather than on the whole row.
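One way to make that redundancy concrete is to check that the name column is fully determined by the code column. The sketch below is only an illustration: the ‘Series Code’ header, the year column and every value in the rows are placeholders assumed here, not rows from the UN data. Only the ‘Series Name’ and ‘Country Name’ headers and the EG.EGY.PROD.KT.OE code with its expansion come from the text above.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows in the flattened layout. The values are made up for
# illustration; the 'Series Code' and '2005' headers are assumptions.
FLAT_CSV = """\
Series Name,Series Code,Country Name,2005
Energy production (kt of oil equivalent),EG.EGY.PROD.KT.OE,Country A,100
Energy production (kt of oil equivalent),EG.EGY.PROD.KT.OE,Country B,200
Energy production (kt of oil equivalent),EG.EGY.PROD.KT.OE,Country C,300
"""

def names_per_code(rows, code_col, name_col):
    """Map each code to the set of names that appear alongside it."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[code_col]].add(row[name_col])
    return dict(seen)

rows = list(csv.DictReader(io.StringIO(FLAT_CSV)))
mapping = names_per_code(rows, "Series Code", "Series Name")

# If every code maps to exactly one name, the name is redundant in each row
# and could live in a separate lookup table keyed by the code.
redundant = all(len(names) == 1 for names in mapping.values())

# Number of name copies that carry no new information.
repeats = len(rows) - len(mapping)
print(redundant, repeats)
```

If a code ever showed up with two different names, `redundant` would come out false, which is also a handy sanity check on a messy export.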
So the data format above is a ‘flattened relational’ version of a relational database which looked like this:
So, using TDD (test-driven development), I wrote a tool, flatfile.py, for reading CSV files which contain flattened relational data and creating a normalized in-RAM representation. Then I wrote another tool, flatfile_to_ltm.py, to convert such documents to Linear Topic Map (.ltm) format, to facilitate import into Mappa. Why .ltm? It is arguably the most human-writable topic map file format Mappa supports, and since I’m all about full round-trip knowledge management (i.e. being able to edit knowledge either graphically or in emacs, sorry sorry… textually), it seemed to be the way to go. Here is a sample.
What are the advantages? Well, brevity; ease of data update; ease of querying; lowered cognitive load dealing with the data; and much greater flexibility in the use of the data. For example, at http://mapdemo.ponderate.com the pop-up menu is populated from the “Energy Statistic” table.
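To make those advantages a little more concrete, here is a minimal sketch of the kind of renormalization described above. It is not the actual flatfile.py: the function name `renormalize`, the ‘Series Code’ and ‘Country Code’ headers, the year column, the second series and all the values are assumptions made up for the example.

```python
import csv
import io

# Placeholder flattened data. Everything except the EG.EGY.PROD.KT.OE code
# and its name is invented for illustration, including all numeric values.
FLAT_CSV = """\
Series Name,Series Code,Country Name,Country Code,2005
Energy production (kt of oil equivalent),EG.EGY.PROD.KT.OE,Country A,AAA,100
Energy production (kt of oil equivalent),EG.EGY.PROD.KT.OE,Country B,BBB,200
Another statistic (hypothetical),XX.HYPO.STAT,Country A,AAA,5
"""

def renormalize(text):
    """Split flattened rows into two lookup tables and one fact table."""
    series = {}     # series code -> human readable name
    countries = {}  # country code -> country name
    facts = []      # (series code, country code, year, value)
    for row in csv.DictReader(io.StringIO(text)):
        s_code = row.pop("Series Code")
        c_code = row.pop("Country Code")
        series[s_code] = row.pop("Series Name")
        countries[c_code] = row.pop("Country Name")
        # Whatever columns remain are assumed to be year -> value pairs.
        for year, value in row.items():
            facts.append((s_code, c_code, year, float(value)))
    return series, countries, facts

series, countries, facts = renormalize(FLAT_CSV)

# Populating a menu only needs the small series table, not the fact rows.
menu_items = sorted(series.values())

# Renaming a statistic is now a single update instead of one per row.
series["EG.EGY.PROD.KT.OE"] = "Energy production (kt oil eq.)"
```

The menu example mirrors the demo: once the series names live in their own small table, listing them is a one-liner, and fixing a label touches one entry rather than every row that mentions it.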