Ground truth description

Each word image in the APTI Database is fully described by an XML file containing ground truth information about the sequence of characters, as well as information about how the image was generated. An example of such an XML file is given in Fig. 1.

Fig. 1: Example of an XML file including ground truth information about a given word image

The XML file is composed of four principal markup sections.

Letter label Number of occurrences Isolated Beginning Middle End
Alif 90353
Baa 28119
Taaa 59343
Thaa 3803
Jiim 11455
Haaa 17866
Xaa 8492
Daal 18399
Thaal 3100
Raa 37571
Zaay 6325
Siin 21648
Shiin 8668
Saad 8310
Daad 5548 ﺿ
Thaaa 8610
Taa 1438
Ayn 16552
Ghayn 5912
Faa 13749
Gaaf 16819
Kaaf 12711
Laam 41159
Miim 47084
Nuun 44186
NuunChadda 1343 ﻥّ ﻧّ ﻨّ ﻦّ
Haa 16094
Waaw 26008
Yaa 40215
YaaChadda 4348 ﻱّ ﻳّ ﻴّ ﻲّ
Hamza 1142 ء
HamzaAboveAlif 8770 أ
TaaaClosed 8376    
HamzaUnderAlif 1501
AlifBroken 972    
TildAboveAlif 500
HamzaAboveAlifBroken 1253
HamzaAboveWaaw 538 ؤ ـؤ
Quantity of characters 648’280
Quantity of PAWs 274’833
Quantity of words 113’284

Table 1: Arabic letters with their labels and number of occurrences in the APTI Database

The different character labels are summarized in Table 1. As the shape of a character varies according to its position in the word, the character labels also include a suffix specifying that position: “B” for beginning, “M” for middle, “E” for end and “I” for isolated. The character “Hamza” is always isolated, so we do not use the position suffix for it. We also artificially inserted character labels such as “NuunChadda” and “YaaChadda” to represent the character shapes resulting from the combination of “Nuun” and “Chadda” or of “Yaa” and “Chadda”.
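The labelling scheme above can be illustrated with a short Python sketch that splits a character label into its letter name and position. The text does not state how the suffix is attached to the label, so the underscore separator used here (as in "Alif_B") is an assumption, and the names `parse_label` and `CharLabel` are hypothetical helpers introduced only for illustration.

```python
from typing import NamedTuple

# Position suffixes as described in the text.
POSITIONS = {"B": "beginning", "M": "middle", "E": "end", "I": "isolated"}

# "Hamza" is always isolated and therefore carries no position suffix.
NO_SUFFIX_LABELS = {"Hamza"}


class CharLabel(NamedTuple):
    letter: str    # label from Table 1, e.g. "Alif" or "NuunChadda"
    position: str  # "beginning", "middle", "end" or "isolated"


def parse_label(label: str, sep: str = "_") -> CharLabel:
    """Split a label such as "Alif_B" into letter name and position.

    The separator is an assumption: the text only says that a suffix
    is included, not how it is joined to the letter name.
    """
    if label in NO_SUFFIX_LABELS:
        return CharLabel(label, "isolated")
    letter, found, suffix = label.rpartition(sep)
    if not found or not letter or suffix not in POSITIONS:
        raise ValueError(f"unrecognized character label: {label!r}")
    return CharLabel(letter, POSITIONS[suffix])


if __name__ == "__main__":
    for raw in ["Alif_B", "NuunChadda_M", "Hamza"]:
        print(raw, "->", parse_label(raw))
```

Treating "Hamza" as a special case mirrors the exception described above. Combined labels such as "NuunChadda" need no special handling, because they behave like any other letter name followed by a position suffix.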