As part of our experiments on publicly available data, we recently had a go at dismembering the Hansard “daily parts” that are published on publications.parliament.uk and seeing if placing the bits into a MongoDB store was useful.
Originally a sideline for a project to create a Hansard-specific search demo (spun up as soon as we realised we were about to start scraping the same pages over and over for multiple projects), this has become a worthwhile thing in its own right. It turns out that, even in an early form, the end product is really useful: for this data, MongoDB is much more usable and flexible (both in terms of storage and data retrieval) than traditional SQL solutions, and there’s more data embedded in the publications.parliament.uk HTML than has previously been realised or exploited.
After an extended period of trial and error, the data broke down quite nicely into an object model that allowed us to address it in chunks of a day, a section (a first-pass abstraction of “Debates and Oral Answers” vs, say, “Written Answers to Questions”) or even a paragraph. Mongo’s flexible “schemaless” approach has allowed for a nice simple typing system. It lets us differentiate between paragraphs which are contributions (and need to have Member details attached) and those which are not, and it lets us create a conceptual fragment (a collection of paragraphs) that can be typed as a Debate, a Statement, a Question and so on. It’s even allowed us to start to solve the age-old problem of wanting to fetch things both as a Debate and as a Column (the column view isn’t quite correct yet, because the HTML is missing some data about which column(s) tables and divisions are supposed to be in).
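To make that a bit more concrete, here is a minimal sketch in Python of how typed paragraphs and fragments might hang together. It uses plain dictionaries shaped like MongoDB documents rather than a live database, and every field name, ID and value in it is a made-up assumption for illustration, not the schema we actually use.

```python
# Hypothetical documents, shaped the way they might sit in MongoDB
# collections. All field names, IDs and text are illustrative assumptions.
fragments = [
    {"_id": "f1", "type": "Debate", "section": "s1", "title": "Example Debate"},
    {"_id": "f2", "type": "Question", "section": "s1", "title": "Example Question"},
]

paragraphs = [
    {"_id": "p1", "type": "NonContribution", "fragment": "f1", "column": 101,
     "text": "The House met at half-past Two o'clock."},
    {"_id": "p2", "type": "Contribution", "fragment": "f1", "column": 101,
     "member": "Example Member A", "text": "I beg to move..."},
    {"_id": "p3", "type": "Contribution", "fragment": "f1", "column": 102,
     "member": "Example Member B", "text": "Will the Minister give way?"},
    {"_id": "p4", "type": "Contribution", "fragment": "f2", "column": 102,
     "member": "Example Member A", "text": "To ask the Secretary of State..."},
]


def paragraphs_in_fragment(fragment_id):
    """Fetch a Debate, Question, etc. as its ordered list of paragraphs."""
    return [p for p in paragraphs if p["fragment"] == fragment_id]


def paragraphs_in_column(column):
    """Fetch the same data sliced by column instead of by fragment."""
    return [p for p in paragraphs if p["column"] == column]


def contributions_by(member):
    """Only paragraphs typed as contributions carry Member details."""
    return [p for p in paragraphs
            if p["type"] == "Contribution" and p["member"] == member]


if __name__ == "__main__":
    print([p["_id"] for p in paragraphs_in_fragment("f1")])
    print([p["_id"] for p in paragraphs_in_column(102)])
    print([p["_id"] for p in contributions_by("Example Member A")])
```

The point of the sketch is that the same paragraph can be reached either through its fragment or through its column, and that the type field decides which extra fields (such as the Member) a paragraph is expected to carry. Against a real MongoDB collection the list comprehensions would presumably become simple equality queries on the same fields.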
As a bonus feature we’ve also been able to tinker around and generate some extra fields, which have led to the production of a few Kindle-formatted Hansards: complex enough to feature a newspaper-style layout for easy navigation, but using vastly simplified HTML for the content.
Hopefully we can take this further. A quick look at some data with a much more controlled circulation has revealed that the missing column data is captured as part of the original publication process in a form we have a chance of interpreting but this is not something we’ve been able to spend much time on. At least the idea’s out there now. Object stores aren’t the answer for everything but they definitely have the edge in this particular case.
While we haven’t quite worked out the mechanics of releasing the dataset just yet, the source code is available to view on GitHub if you find yourself in a tinkering mood.