Ground truth description
Each word image in APTI Database is fully described using an XML file containing ground truth information about the sequence of characters as well as information about the generation. An example of such XML file is given in the following Figure (Fig. 1).Fig. 1: Example of XML file including ground truth information about a given word image
The XML file is composed by four principal markups sections:
- Content: in this element, we have the transcription of Arabic word, the number of Piece of Arabic Word (nPaws) and sub-elements for each PAW with the sequence of characters. In our representation, characters are identified using plain English labels as described below.
- Font: in this element, we specify the font name, font style and size used to generate the image word.
- Specs: in this element, we present the encoding of image, width, height and eventual addition effect. In the current version of APTI, there is actually no added effects but we have planned to use this attribute for later versions of image rendering where effects could be present.
- Generation: in this element, we indicate the type of generation, the tool used for generation and the used filter in generation. In the current version of APTI, this element is constant as the same generation procedure has been applied. The type ‘downsampling5’ is here indicating that the generation procedure correspond to a downsampling, using factor 5, from high resolution source images.
Letter label | Number of Occurence | Isolate | Begin | Middle | End |
Alif | 90353 | ﺍ | ﺎ | ||
Baa | 28119 | ﺏ | ﺑ | ﺒ | ﺐ |
Taaa | 59343 | ﺕ | ﺗ | ﺘ | ﺖ |
Thaa | 3803 | ﺙ | ﺛ | ﺜ | ﺚ |
Jiim | 11455 | ﺝ | ﺟ | ﺠ | ﺞ |
Haaa | 17866 | ﺡ | ﺣ | ﺤ | ﺢ |
Xaa | 8492 | ﺥ | ﺧ | ﺨ | ﺦ |
Daal | 18399 | ﺩ | ﺪ | ||
Thaal | 3100 | ﺫ | ﺬ | ||
Raa | 37571 | ﺭ | ﺮ | ||
Zaay | 6325 | ﺯ | ﺰ | ||
Siin | 21648 | ﺱ | ﺳ | ﺴ | ﺲ |
Shiin | 8668 | ﺵ | ﺷ | ﺸ | ﺶ |
Saad | 8310 | ﺹ | ﺻ | ﺼ | ﺺ |
Daad | 5548 | ﺽ | ﺿ | ﻀ | ﺾ |
Thaaa | 8610 | ﻁ | ﻃ | ﻄ | ﻂ |
Taa | 1438 | ﻅ | ﻇ | ﻈ | ﻆ |
Ayn | 16552 | ﻉ | ﻋ | ﻌ | ﻊ |
Ghayn | 5912 | ﻍ | ﻏ | ﻐ | ﻎ |
Faa | 13749 | ﻑ | ﻓ | ﻔ | ﻒ |
Gaaf | 16819 | ﻕ | ﻗ | ﻘ | ﻖ |
Kaaf | 12711 | ﻙ | ﻛ | ﻜ | ﻚ |
Laam | 41159 | ﻝ | ﻟ | ﻠ | ﻞ |
Miim | 47084 | ﻡ | ﻣ | ﻤ | ﻢ |
Nuun | 44186 | ﻥ | ﻧ | ﻨ | ﻦ |
NuunChadda | 1343 | ﻥّ | ﻧّ | ﻨّ | ﻦّ |
Haa | 16094 | ﻩ | ﻫ | ﻬ | ﻪ |
Waaw | 26008 | ﻭ | ﻮ | ||
Yaa | 40215 | ﻱ | ﻳ | ﻴ | ﻲ |
YaaChadda | 4348 | ﻱّ | ﻳّ | ﻴّ | ﻲّ |
Hamza | 1142 | ء | |||
HamzaAboveAlif | 8770 | أ | ﺄ | ||
TaaaClosed | 8376 | ﺓ | ﺔ | ||
HamzaUnderAlif | 1501 | ﺇ | ﺈ | ||
AlifBroken | 972 | ﻯ | ﻰ | ||
TildAboveAlif | 500 | ﺁ | ﺂ | ||
HamzaAboveAlifBroken | 1253 | ﺉ | ﺋ | ﺌ | ﺊ |
HamzaAboveWaaw | 538 | ؤ | ـؤ | ||
Quantity of Characters | 648’280 | ||||
Quantity of PAWs | 274833 | ||||
Quantity of words | 113’284 |
Table 1: Arabic letters with used labels and occurrence in APTI Database
The different character labels are summarized in Table 1. As the shape of characters are varying according to their position in the word, the character labels also include a suffix to specify the position of the character in the word: “B” standing for beginning, “M” for Middle, “E” for end and “I” for isolated. The character “Hamza” being always isolated, we don’t use the position suffix for this character. We also artificially inserted characters labels such as “NuunChadda” or “YaaChadda” to represent the character shape issued from the combination of “Nuun” and “Chadda” or “Yaa” and “Chadda”.
Recent News
[23/01/2017] The third edition of the ICDAR2017 Competition on Multi-font and Multi-size Digitally Represented Arabic Text will be organized at ICDAR'2017 using APTI Database.
[03/01/2013] The second edition of the Competition on Multi-font and Multi-size Digitally Represented Arabic Text will be organized at ICDAR'2013 using APTI Database.
[14/02/2011] The first edition of the Arabic Recognition Competition: Multi-font Multi-size Digitally Represented Text was organized at ICDAR'2011 using APTI Database.
[06/06/2009] APTI Database was officially presented at ICDAR'09.
This work is a joint collaboration between diferent research groups:
http://diuf.unifr.ch/diva
DIVA Group from University of Fribourg (Switzerland)
REGIM Group from University of Sfax (Tunisia)
http://iig.hevs.ch/valais/software-engineering.html
Software Engineering Unit from Business Information System Institute (HES-SO //Wallis - Switzerland)