DIVA-HisDB
Within HisDoc 2.0 we have developed DIVA-HisDB a precisely annotated large dataset of challenging medieval manuscripts for the evaluation of several Document Image Analysis (DIA) tasks such as layout analysis, text line segmentation, binarization and writer identification. The database consists of 150 annotated pages of three different medieval manuscripts with challenging layouts.
DIVA-HisDB is a collection of three medieval manuscripts that have been chosen regarding the complexity of their layout [1], together with partners from e-codices (http://www.e-codices.unifr.ch/) and the Humanities faculty in the University of Fribourg:
- St. Gallen, Stiftsbibliothek, Cod. Sang. 18, codicological unit 4 (CSG18),
- St. Gallen, Stiftsbibliothek, Cod. Sang. 863 (CSG863),
- Cologny-Geneve, Fondation Martin Bodmer, Cod. Bodmer 55 (CB55).
Following are illustrated sample pages from the three medieval manuscripts:
CSG18, p.116 | CSG863, p.17 | CB55, p.67v |
And their corresponding GT visualization of the three annotation categories (main text body, comments, decorations):
CSG18, p.116 | CSG863, p.17 | CB55, p.67v |
DIVA-HisDB consists of 150 pages in total, 50 pages from each manuscript. For the dataset, as well as for the division into training, validation, and test set we have selected a representative set of pages.
Creation of DIVA-HisDB
For annotating the three selected medieval manuscripts we use GraphManuscribble, a semi-automatic tool which is based on document graphs and pen-based scribbling interaction. DIVA-HisDB provides the GT in the PAGE format [2].
Seperation line between marginal glosses | Corrections above and below the line |
Correction overwriting the text | Correction above the line |
We distinguish between the main text body, the comments (marginal and interlinear glosses, explanations, corrections) and decorations (every character/sign that exceeds the size of a text line and/or is written in red). Paleographic features like display scripts or black majuscules are not marked as a decoration, since they are too difficult to distinguish for the automatic layout analysis tool in this first step. We totally disregard elements that don't belong to a text line or are not characters/signs, like separation lines between two marginal glosses.
Within the comments' category, we don't distinguish between the original glosses, added by the writer of the main text, and the foliation added in the modern period. Also, if corrections have been made by overwriting the main text, they are not marked as posterior to the original main text; but if the corrections are made above and below the line, e.g., with a sign below the line to mark the affected letters, and the correct letter(s) above the line, they are annotated as comments.
Distribution of the annotation categories
For the whole DIVA-HisDB 63.97% of the total number of regions per annotated category are comments, 9.36% are decorations and 26.67% consists the main text body. Note that the proportions of the categories differ between the manuscripts. While the amount of textlines and comments is almost the same in CSG863, significantly less comments are present in CSG18. Additionally, by measuring the surface area (number of pixels) for each annotated category, 41.37% of the total surface area are comments, 1.69% are decorations and 56.94% consists the main text body.
Download DIVA-HisDB
download here the CB55 image files (205 MB)
download here the CB55 PAGE XML files (2.5 MB)
download here the CB55 pixel-labeled image files (23.2 MB)
download here the CB55 PAGE XML files for textline segmentation and baseline detection (Task 2 & Task 3) (1.5 MB)
download here the CSG18 image files (82.7 MB)
download here the CSG18 PAGE XML files (1.7 MB)
download here the CSG18 pixel-labeled image files (12.7 MB)
download here the CSG18 PAGE XML files for textline segmentation and baseline detection (Task 2 & Task 3) (795 KB)
download here the CSG863 image files (113 MB)
download here the CSG863 PAGE XML files (1.6 MB)
download here the CSG863 pixel-labeled image files (12.4 MB)
download here the CSG863 PAGE XML files for textline segmentation and baseline detection (Task 2 & Task 3) (1.5 MB)
download all
Download DIVA-HisDB private test new!!
The private test of ICDAR2017 Competition on Layout Analysis for Challenging Medieval Manuscripts is now available online:
download here the CB55 input files for all Tasks (52.4 MB)
download here the CB55 pixel-labeled gt files for Task 1 (5.7 MB)
download here the CB55 PAGE XML gt files for Task 2 & Task 3 (387 KB)
download here the CB55 PAGE XML input files for Task 2 & Task 3 (7 KB)
download here the CSG18 input files for all Tasks (21.2 MB)
download here the CSG18 pixel-labeled gt files for Task 1 (2.9 MB)
download here the CSG18 PAGE XML gt files for Task 2 & Task 3 (190 KB)
download here the CSG18 PAGE XML input files for Task 2 & Task 3 (7 KB)
download here the CSG863 input files for all Tasks (30.1 MB)
download here the CSG863 pixel-labeled gt files for Task 1 (3.8 MB)
download here the CSG863 PAGE XML gt files for Task 2 & Task 3 (352 KB)
download here the CSG863 PAGE XML input files for Task 2 & Task 3 (7 KB)
download all-private-test (117.2 MB)
Terms of Use
The DIVA-HisDB is publicly accessible and freely available for non-commercial research purposes. If you are publishing scientific work based on the DIVA-HisDB, we request you to include a reference to the paper [1]. If you are using images for the publications, please cite the original source as we have done on our website, e.g., St. Gallen, Stiftsbibliothek, Cod. Sang. 18, codicological unit 4 (CSG18).
The images on this page and the original dataset are published under Creative Commons license. "The use of our digitally reproduced images is regulated by a Creative Commons License Creative Commons License. This permits the use of individual images for non-commercial purposes (particularly in the areas of education and research), as long as proper source citation is used."
Contact
Publications
1. F. Simistira, M. Seuret, N. Eichenberger, A. Garz, M. Liwicki, and R. Ingold (2016). DIVA-HisDB: A Precisely Annotated Large Dataset of Challenging Medieval Manuscripts, International Conference on Frontiers in Handwriting Recognition, pp. 471-476.
PDF
2. S. Pletschacher and A. Antonacopoulos (2010). The PAGE (Page Analysis and Ground-Truth Elements) format framework, in 20th ICPR, pp. 257–260.