Module 3 - Information Retrieval

The goal of the information retrieval module was to create a search engine for the automatically transcribed manuscripts. When compared with a standard search on the Internet, additional challenges include spelling variants in old languages and recognition errors in the automatic transcriptions.

The information retrieval module receives machine-readable texts as input and performs a full text search on them. The users of digital libraries can then be presented with a ranked list of manuscripts as a result, similarly to a search on the Internet. When compared with a search on a perfect transcription, a special challenge is to cope with the errors of the handwriting recognition module.

First, a basic search engine was implemented based on several state-of-the-art information retrieval models. For Old German manuscripts, a special text analysis tool was developed that includes a decompounder to deal with compound German words and a stemmer to deal with spelling variants.


Multiple hypotheses representation

We have addressed the problem of errors in the automatic transcriptions by taking multiple recognition hypotheses into account. Besides the most likely transcription, the text recognition module provides a list of word recognition hypotheses. Expanding the text with the most likely confusion candidates has significantly increased the retrieval performance [C11.1] and made it much more robust against transcription errors when compared with OCR errors in printed documents [C11.8, C11.9].

Then the potential of recognition hypotheses was explored in the context of the thesaurus. Adding likely word confusion candidates as synonyms is not only useful for query suggestion and autocompletion but has also proven to be very successful for query expansion [C12.5], especially for misspelled one-term queries. Inflectional variants (e.g., “Parcival”, “Parcivale”, and “Parcivals” in the Cod. 857) as well as orthographical variants (e.g., “Parcival”, “Parcifal”, and “Parzifal” in the Cod. 857) can be treated as the same entity.

The main aim of the information retrieval module was to develop a search engine that is robust to spelling variants and transcription errors. The proposed methods have achieved this aim. To give an example of the performance, the loss in information retrieval performance was significantly below the error rate of the transcription for the Old German manuscripts [Cn1]. That is, the system could recover from transcription errors to some degree.

Publications

Thesis

[T2] N. Naji. Information Retrieval of Digitized Medieval Manuscripts. PhD thesis, University of Neuchâtel, to be defended in September 2013.

Conference Papers

[C13.2] N. Naji and J. Savoy. Back to Our Roots for Retrieving Very Short Passages. In Proc. 76th Annual Meeting of the Association for Information Science and Technology, in print, 2013.

[C12.1] A. Fischer, H. Bunke, N. Naji, J. Savoy, M. Baechler, and R. Ingold. HisDoc: Historical Document Analysis, Recognition, and Retrieval. In Digital Humanities, Book of Abstracts, pp. 94–97, 2012. Digital Humanities-Link

[C12.5] N. Naji and J. Savoy. Etude comparative de l’efficacité du dépistage de l’information dans des manuscrits médiévaux. In Proc. 11th Int. Conf. on Statistical Analysis of Textual Data, pp. 1–12, 2012. oai:doc.rero.ch:20130103104948-QM

[C11.8] N. Naji, J. Savoy, and L. Dolamic. Recherche d’information dans un corpus bruité (OCR). In Proc. 8ième Conférence en Recherche d’Information et Applications, pp. 271–286, 2011. Link

[C11.9] N. Naji and J. Savoy. Comparative Information Retrieval Evaluation for Scanned Documents. In Proc. 15th WSEAS Int. Conf. on Computers, pp. 527–534, 2011. ACM-Link

[C11.1] N. Naji and J. Savoy. Information retrieval strategies for digitized handwritten medieval documents. In Proc. 7th Asia Information Retrieval Societies Conference, pp. 103–114, 2011. 10.1007/978-3-642-25631-8_10

[Cn1] A. Fischer, H. Bunke, N. Naji, J. Savoy, M. Baechler, and R. Ingold. The HisDoc Project. Automatic Analysis, Recognition, and Retrieval of Handwritten Historical Documents for Digital Libraries. In Proc. InterNational and InterDisciplinary Aspects of Scholarly Editing, in print, Bern, 2012. Link