Typos, spelling mistakes and the future

Robert Brook

The pages displayed on the Hansard Prototype are generated from XML, which in turn is the output of several OCR systems run on scanned images of the original printed copies of Hansard.

There are typographical errors, OCR errors and spelling errors, introduced at one or more stages in the process of getting from the printed page onto the screen. Material printed in the original Hansard is properly considered to be part of the Official Report.

The Hansard Prototype presents information from Hansard. It is not the Official Report.

We are aware that there are many, many examples of incorrect text. We are using several methods to reduce the number of occurrences, but it is unlikely that we’ll get to a very high level of accuracy soon.

If you have seen a textual error, we’re unlikely to be able to do much with that information presently. In the future, we may experiment with a publicly accessible function to ‘flag’ textual errors for further investigation.

We’re concentrating our rather limited resources on the areas where we can make the most progress: manual intervention is not something we can support at the moment.

Posted in: Historic Hansard, OCR
