Open data, open dictionaries

The Isle of Man branch of the British Computer Society had a fascinating presentation on open data and mash-ups on Friday. The talk was given by Prof. Robert Barr OBE, and the gist of the session was that data should flow freely to the people in a useful data structure, yet also that this openness should be weighed against commercial considerations such as intellectual property and the benefits to the wider economy.

While listening to Robert, it struck me that I am in my very own battle for the extraction of data that should be more readily available. As you may know, I am learning Manx. As part of this, I am generating my own revision notes, references, blog posts and the like that may someday see the light of day. Part of this work is the development of a Manx language dictionary for Windows Phone 7.

To achieve my goal, I needed a copy of the Manx dictionary. Having asked around and done some research myself, I gathered a number of links to existing on-line resources. These ranged from PDF formatted documents to fully indexed dictionaries. The PDF version (English to Manx, Manx to English) was unsuitable because it would be difficult to accurately extract the words from the PDF “printed page”. The RoadLingua and FreeLang dictionaries appeared promising, and their content appeared to be out of copyright, but they were encoded in proprietary dictionary file formats. So ironically, even though the dictionary was “open”, the software would need to be reverse engineered to access it, which would itself be a violation of copyright. That left me with two remaining options that might prove useful: the Phil Kelly dictionary and Faragher’s. These were, however, only HTML sites. Of the two, Faragher’s seemed the better, as it provided value-added content such as use of the words within sentences and Manx phrases – ideal if you are interested in the many idioms in use in Manx Gaelic.

So it seemed that I would need to use the Faragher’s site as a “back end” to my application, essentially screen-scraping the site for translations. To accomplish this, I would be best served by writing my own web site to act as a bridge between my Windows Phone 7 application and the dictionary itself. This would double my work, but there were good reasons: the extended platform on a server would allow me to parse the HTML from the site more reliably, and by caching words as they were requested I could – over time – create a reliability buffer in case the original site were to fail. I set about the task and have just launched the site in a very early form of initial testing (take a look at http://taggloo.im). This was particularly challenging, as the HTML from the Faragher’s dictionary is flaky at best. However, by inserting that middle layer, I could hide this trickery from the user.
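For the curious, the “cache as you go” idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not the code behind Taggloo: the class name `CachingBridge` and the `fetch_upstream` callable (standing in for “request the page and parse its HTML”) are my own placeholders, and a real bridge would keep its cache in a database rather than in memory.

```python
from typing import Callable, Dict, Optional


class CachingBridge:
    """Read-through cache sitting between a client app and a fragile upstream site."""

    def __init__(self, fetch_upstream: Callable[[str], Optional[str]]) -> None:
        # fetch_upstream is a hypothetical stand-in for the screen-scraping step:
        # it takes a word and returns a translation, None if not found,
        # or raises if the site is down or the HTML cannot be parsed.
        self._fetch_upstream = fetch_upstream
        self._cache: Dict[str, str] = {}

    def translate(self, word: str) -> Optional[str]:
        key = word.strip().lower()

        # Serve from the cache first, so a word that has been requested once
        # stays available even if the original site later fails.
        if key in self._cache:
            return self._cache[key]

        try:
            result = self._fetch_upstream(key)
        except Exception:
            # Upstream failed and we have nothing cached for this word.
            return None

        # Only successful lookups are remembered.
        if result is not None:
            self._cache[key] = result
        return result
```

The client only ever talks to `translate`, so any messiness in the upstream HTML stays behind the bridge.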

All this, because the dictionary was not available electronically in an indexed form. And this resonates with Robert Barr’s point about open data. Open data should not only be open, but also be usefully formatted to allow for its use. An unindexed dictionary is hardly a dictionary! A further frustration was that the indexed dictionary was encapsulated within copyrighted software, which was quite closed! I approached RoadLingua to ask how they would feel about releasing the file format of their dictionary, but I received no response.

So it was with great surprise and relief that I realised that, by navigating to an unpublished URL (that should have been concealed from internet users), I could extract the entire Faragher’s dictionary from the site and put it to my own use! So, after playing with the MySQL scripts in order to convert them into T-SQL, I now have two 50,000-word dictionaries, one for each direction (Manx to English, English to Manx). Am I going to keep this to myself?

No. I’ve checked on copyright, and I’m informed that this is not an issue, certainly in the spirit of expanding the availability of Manx learning resources. So, as part of my Taggloo project, which already has an effective and reliable API for XML and JSON consumers, I’m going to make the entire database available for use by other applications (maybe mobile phone applications, competing with my own) and web-sites (it becomes possible to “embed” Manx dictionaries on even the simplest of sites). The final API has yet to be defined, and there will likely be changes to it in the coming weeks. The data will be free for use by anyone and everyone (subject to fair use, i.e. not crashing my server), but the API will ask for one thing in return: the opportunity to record the words being looked up. This itself, over time, will create a second rich data-set. What words are people regularly using? Do these correlate to students’ progress in classes, or do the translations point to any cultural significance such as house names, which are regularly seen in Manx, yet seldom understood?
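To give a feel for how little is needed to start building that second data-set, here is a rough Python sketch of a lookup log. It is not the Taggloo API (which, as I say, is not final); `LookupLog`, `record` and `most_requested` are names invented for this example, and it simply counts each requested word in memory.

```python
from collections import Counter
from typing import List, Tuple


class LookupLog:
    """Counts how often each word is requested (hypothetical, in-memory only)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, word: str) -> None:
        # Normalise case and whitespace so "Thie" and "thie " count as one word.
        self._counts[word.strip().lower()] += 1

    def most_requested(self, n: int = 10) -> List[Tuple[str, int]]:
        # Most frequently requested words first, as (word, count) pairs.
        return self._counts.most_common(n)
```

Calling `record` on every translation request and reading back `most_requested` now and then would already begin to answer the “what words are people regularly using?” question.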

I have many plans around this project, with further data-sets springing from them and adding further depth to what will hopefully become a reliable and rich data-set containing both formal dictionary content and community contributions. This complements the learning resources already available to the user, particularly those found at LearnManx.com. I’ll be blogging about them very soon, hopefully in line with an exciting new blog design.