HisDoc III

HisDoc III is the official successor project of HisDoc 2.0.

NEWS: The HisDoc III team is well represented at DAS 2018 in Vienna.


Goals

In HisDoc III we target historical document classification for large amounts of uncategorized
facsimiles, with the intent to provide new capabilities for researchers in the Digital Humanities.
In particular, we will address the task of categorizing document images with respect to
content, language, script, and layout. To do so, we will leverage the expertise gained from
our previous projects HisDoc and HisDoc 2.0. In HisDoc we have shown that historical
Document Image Analysis (DIA) can be effectively applied to extract layout structures and
textual transcriptions, and in the current HisDoc 2.0 project we have successfully retrieved
additional paleographic information. These methods will complement the novel contributions
of HisDoc III to cope with large document collections.

Existing methods are largely based on supervised learning and thus require an extensive
amount of labeled training data. They are therefore not directly applicable to classifying
collections of heterogeneous manuscripts with a large variety of layout structures, textual
content, degradation traces, and other artifacts. While this problem is already relevant for
homogeneously digitized books, it becomes even more crucial for isolated pages and the
tremendous number of yet uncataloged and unexplored fragments distributed across libraries
around the world.

The objective of HisDoc III is twofold: (i) fundamental research on combined text- and
image-based classification methods, and (ii) making the developed technology useful for
libraries, archives, and researchers in the Humanities. Firstly, for the classification of
documents we will study novel deep learning methods for large amounts of unlabeled text
and image data. These methods will be complemented by structural approaches based on
document graphs. To combine these diverse approaches, we will investigate Multiple
Classifier Systems (MCS) on the one hand and integrated neural network architectures on the other.
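
To make the MCS idea concrete, the following is a minimal sketch of one standard combination
rule, weighted averaging of per-classifier class probabilities. The classifier names, scores,
and weights are purely illustrative assumptions, not part of the project.

    import numpy as np

    def combine_classifiers(prob_outputs, weights=None):
        """Combine per-classifier class-probability vectors by weighted averaging."""
        probs = np.stack(prob_outputs)             # (n_classifiers, n_classes)
        if weights is None:
            weights = np.ones(len(prob_outputs))
        weights = np.asarray(weights, dtype=float)
        weights /= weights.sum()                   # normalize reliability weights
        combined = weights @ probs                 # weighted average per class
        return combined.argmax(), combined

    # Hypothetical classifiers (script, language, layout) scoring one page
    # against three document categories.
    label, scores = combine_classifiers(
        [np.array([0.6, 0.3, 0.1]),    # script classifier
         np.array([0.5, 0.4, 0.1]),    # language classifier
         np.array([0.2, 0.7, 0.1])],   # layout classifier
        weights=[0.5, 0.3, 0.2],       # e.g. per-classifier validation accuracy
    )

In practice, the fixed weights could be replaced by a trained combiner or by an integrated
neural architecture, as mentioned above.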
Secondly, we will combine three ideas for making our methods useful for libraries: (i) novel
means for reducing the amount of ground truth needed, through unsupervised machine learning
or, alternatively, bootstrapping combined with active learning (see the sketch after this
paragraph); (ii) intuitive computer-assisted presentation and annotation tools; and
(iii) making our systems publicly available as Web services.
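
As an illustration of idea (i), below is a minimal sketch of one classic active learning
step, least-confidence sampling: a model trained on a small labeled seed set queries a human
annotator for the pages it is least sure about. The synthetic features and the scikit-learn
classifier are stand-ins for the project's actual pipeline, not a description of it.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic stand-ins for page feature vectors; real features would come
    # from the text- and image-based classifiers described above.
    X_labeled = rng.normal(size=(40, 8))
    y_labeled = (X_labeled[:, 0] > 0).astype(int)    # toy binary labels
    X_unlabeled = rng.normal(size=(500, 8))

    model = LogisticRegression().fit(X_labeled, y_labeled)

    # Least-confidence sampling: query the pages the model is least sure about.
    confidence = model.predict_proba(X_unlabeled).max(axis=1)
    query_idx = np.argsort(confidence)[:20]          # 20-page annotation budget
    print("pages to send to the annotator:", query_idx)

Once the queried pages are annotated, they are moved into the labeled set and the model is
retrained; repeating this loop reduces the amount of manual ground truth needed.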

To demonstrate the suitability of the HisDoc III research results, we will design novel
computer-assisted workflows in collaboration with an advisory board composed of scholars,
librarians, and archivists. A particular focus is on speeding up the generation of catalog and
database entries and on devising ways to present methods and results in an understandable way.
In HisDoc III, we formulate novel research ideas, solve fundamental problems in DIA, and
make innovative tools and services available for the research community. We expect this
project to become a catalyst for the development of innovative solutions for the Digital
Humanities.


Materials

Sample images illustrating the various challenges to be addressed in HisDoc III. a) Fragments of
a scroll exhibiting high levels of degradation, different layouts, and an exotic page format, where text and
image could give clues for the classification. b) A bill from the 15th century, where OCR is needed for a robust
classification. c) A fragment used as a book cover for a manuscript (from the Fragmentarium project) containing
only very little textual information, where image features as well as script and language information could
improve the classification accuracy.


a) Aarau, Aargauer Kantonsbibliothek, MsMurF 31a

b) Archives de l'Etat de Neuchâtel (AEN), Rec. Div. Vol. 203, n°891

c) Fragment of a Breviary (15th century); St. Gallen, Stiftsbibliothek, Cod. Sang. 635, Buchrücken | Paulus Diaconus, Historia Longobardorum


For further information, you can get in touch with:

Marcus Liwicki