Integrated System

As envisaged in the research plan, we have integrated the three modules into a complete system, which is described in [C12.1, Cn1]. In addition, we plan to publish a journal article in 2014 that presents the latest results of the integrated HisDoc system.

Two interfaces were important for successful system integration. First, the image analysis module was required to extract text line images instead of word images. Word segmentation is a challenging problem, especially in Latin manuscripts, and should be handled by the text recognition module in order to achieve high accuracy. The HisDoc handwriting recognition methods are indeed able to cope with complete text lines. Second, the transcription output was required to provide several recognition hypotheses per word instead of only the best transcription result. These word confusion candidates have proven to be of central importance for coping with transcription errors in the information retrieval module.
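The second interface can be illustrated with a minimal sketch. It assumes a transcription format in which each word position of a text line carries a ranked list of (word, confidence) candidates, and it indexes every candidate so that a query can still match when the top hypothesis is wrong. The names LineTranscription, WordHypotheses, build_index, and search, as well as the confidence values, are hypothetical and chosen for illustration only; they are not the actual HisDoc data format or retrieval model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WordHypotheses:
    # Hypothetical structure: candidates for one word position,
    # ordered best first, each as (word, confidence).
    candidates: list[tuple[str, float]]


@dataclass
class LineTranscription:
    # One recognized text line; no word segmentation is assumed upstream.
    line_id: str
    words: list[WordHypotheses] = field(default_factory=list)

    def best(self) -> str:
        """Return the single best transcription of the line."""
        return " ".join(w.candidates[0][0] for w in self.words)


def build_index(lines: list[LineTranscription]) -> dict[str, dict[str, float]]:
    """Map each candidate word to the lines it may occur in.

    Every candidate is indexed, not only the best one. For each
    (word, line) pair the highest confidence seen is kept.
    """
    index: dict[str, dict[str, float]] = {}
    for line in lines:
        for position in line.words:
            for word, confidence in position.candidates:
                postings = index.setdefault(word.lower(), {})
                previous = postings.get(line.line_id, 0.0)
                postings[line.line_id] = max(previous, confidence)
    return index


def search(index: dict[str, dict[str, float]], term: str) -> list[tuple[str, float]]:
    """Return (line_id, confidence) pairs for a query term, best first."""
    postings = index.get(term.lower(), {})
    return sorted(postings.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Invented example data, for illustration only.
    line = LineTranscription(
        "p1-l3",
        [
            WordHypotheses([("rex", 0.9)]),
            WordHypotheses([("magnus", 0.5), ("magnum", 0.4)]),
        ],
    )
    index = build_index([line])
    print(line.best())
    print(search(index, "magnum"))
```

In this sketch, a query for the second-ranked candidate still retrieves the line, although the best transcription alone would have missed it. A real retrieval model would weight the candidates within its ranking function rather than return raw confidences.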

As expected, electronic manuscript editions that are automatically created by the integrated HisDoc system are not error-free. However, the achieved accuracies are promising for content-based indexing. Even if the textual content can only partly be captured from scanned images, the readable parts can be sufficient to find a manuscript of interest in a digital library.

Publications

[C12.1] A. Fischer, H. Bunke, N. Naji, J. Savoy, M. Baechler, and R. Ingold. HisDoc: Historical Document Analysis, Recognition, and Retrieval. In Digital Humanities, Book of Abstracts, pp. 94–97, 2012.

[Cn1] A. Fischer, H. Bunke, N. Naji, J. Savoy, M. Baechler, and R. Ingold. The HisDoc Project. Automatic Analysis, Recognition, and Retrieval of Handwritten Historical Documents for Digital Libraries. In Proc. InterNational and InterDisciplinary Aspects of Scholarly Editing, in print, Bern, 2012.