The HisDoc research project, short for “Historical Document Analysis, Recognition, and Retrieval”, has brought together three partners from the Universities of Fribourg, Bern, and Neuchâtel under the Sinergia program of the SNF. Its purpose is to develop tools that support cultural heritage preservation by integrating historical manuscripts into digital libraries. Methods from artificial intelligence and pattern recognition have been developed that attempt to automatically extract the textual content from scanned handwritten documents and thus create an electronic manuscript edition that can be searched and browsed in digital libraries. In a pioneering research effort, a complete processing chain has been investigated that includes image analysis, text recognition, and information retrieval. Each partner has treated one of these tasks in an individual research module: image analysis at the University of Fribourg, text recognition at the University of Bern, and information retrieval at the University of Neuchâtel. The research modules have then been integrated into a complete system.
The research project had three major aims. First, we wanted to develop generic tools that can be adapted with little effort to different types of documents and languages. Secondly, after an interactive training phase, we wanted to perform image analysis and text recognition fully automatically. Thirdly, the text search engine should be able to cope with old languages as well as with errors in the automatic transcription. We consider all of these aims to have been reached. Although automatically created electronic manuscript editions contain textual errors, as expected, the accuracy achieved by the HisDoc methods is promising for the integration of historical manuscripts into digital libraries.
Sample pages of the IAM-HistDB datasets: Parzival, Saint Gall, George Washington
While a number of methods for finding paragraphs, text lines, graphical elements, and other layout components have become available for contemporary documents, the analogous problem for historical documents was largely unsolved. Historical documents contain artifacts, such as ink bleed-through, paper degradation, and stains, which make these problems more difficult than for contemporary documents of good printing quality.
The extraction of meta-information also serves the purpose of locating the individual lines on a page of text. This step is a prerequisite for automatic document transcription, whose goal is to transform the image of a text line into its corresponding string of Unicode characters.
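To make the idea of locating text lines concrete, the following is a minimal sketch of one classical approach, the horizontal projection profile: count the ink pixels in each pixel row and treat consecutive non-empty rows as one text line. It is not the method developed in HisDoc, which is not described here, and such a simple scheme would likely struggle with the degraded or skewed pages discussed above. The function name `find_text_lines`, the `min_ink` threshold, and the toy page are assumptions made for illustration.

```python
from typing import List, Tuple


def find_text_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of text lines; bottom is exclusive.

    `image` is a binarized page given as rows of pixels,
    where 1 marks an ink pixel and 0 marks background.
    """
    # Horizontal projection profile: number of ink pixels per pixel row.
    profile = [sum(row) for row in image]

    lines = []
    start = None  # first row of the text line currently being scanned
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # the text line ended at row y
            start = None
    if start is not None:                  # page ends inside a text line
        lines.append((start, len(profile)))
    return lines


# Hypothetical toy page: two text lines separated by an empty pixel row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
]
print(find_text_lines(page))  # [(1, 3), (4, 5)]
```

Each returned row range could then be cropped from the page image and passed on to a text recognizer.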
While commercial optical character recognition systems already existed for machine-printed text of good quality, the problem of handwritten text recognition, as encountered in the project, was still largely unsolved. Historical documents exhibit a wide variety of writing styles, with character shapes quite different from the ones used today.
Once the transcription of a document has been automatically created, its content and meta-information are made available for searching via a browser. In this context, a number of open problems were addressed.
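One of these problems, stated among the project aims, is that the search engine must cope with errors in the automatic transcription. As an illustration only, the sketch below matches a query against transcribed words while tolerating a small number of character errors, measured by the Levenshtein edit distance. This is a generic technique chosen for the example, not a description of the HisDoc search engine; the function names, the `max_dist` parameter, and the sample sentence are all hypothetical.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_search(query: str, transcription: str, max_dist: int = 1) -> List[Tuple[int, str]]:
    """Return (word index, word) for every transcribed word that lies
    within `max_dist` edits of the query, ignoring case."""
    q = query.lower()
    return [(i, w) for i, w in enumerate(transcription.split())
            if edit_distance(q, w.lower()) <= max_dist]


# Hypothetical recognizer output with two errors ("kinq", "castel").
transcription = "the kinq rode to the castel at dawn"
print(fuzzy_search("king", transcription))                # [(1, 'kinq')]
print(fuzzy_search("castle", transcription, max_dist=2))  # [(5, 'castel')]
```

A larger `max_dist` finds more misrecognized words but also returns more false matches, so the threshold would need to be tuned to the error rate of the transcription.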