APTI-Database

Ground truth description

Each word image in the APTI Database is fully described by an XML file containing ground truth information about the sequence of characters, as well as information about how the image was generated. An example of such an XML file is given in Fig. 1.

Fig. 1: Example of XML file including ground truth information about a given word image

The XML file is composed of four principal markup sections:

Content: this element holds the transcription of the Arabic word, the number of Pieces of Arabic Word (nPaws), and one sub-element per PAW giving its sequence of characters. Characters are identified using plain English labels, as described below.
Font: this element specifies the font name, font style and size used to generate the word image.
Specs: this element gives the image encoding, width, height and any added effect. The current version of APTI has no added effects; the attribute is reserved for later versions of image rendering in which effects could be present.
Generation: this element indicates the type of generation, the tool used for generation and the filter applied. In the current version of APTI this element is constant, as the same generation procedure was applied to every image. The type ‘downsampling5’ indicates that the image was produced by downsampling a high-resolution source image by a factor of 5.
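As a rough illustration of the four sections listed above, the following Python sketch reads such a file with the standard library. Fig. 1 is not reproduced here, so the tag and attribute names (wordImage, content, paw, font, specs, generation, transcription, nPaws, and so on), the sample values and the helper parse_ground_truth are all assumptions made for this example, not the actual APTI schema; adapt them to the real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical ground-truth file. Every tag name, attribute name and value
# below is an assumption for illustration; only the four sections
# (content, font, specs, generation) and 'downsampling5' come from the text.
SAMPLE_XML = """
<wordImage>
  <content transcription="بيت" nPaws="1">
    <paw id="0" nbChars="3">BaaB YaaM TaaaE</paw>
  </content>
  <font name="ExampleFont" fontStyle="plain" size="10"/>
  <specs encoding="png" width="45" height="20" effect="none"/>
  <generation type="downsampling5" renderer="example-tool" filtering="example-filter"/>
</wordImage>
"""


def parse_ground_truth(xml_text):
    """Return the four ground-truth sections as plain Python values."""
    root = ET.fromstring(xml_text)
    content = root.find("content")
    # One list of character labels per PAW, in document order.
    paws = [paw.text.split() for paw in content.findall("paw")]
    return {
        "transcription": content.get("transcription"),
        "n_paws": int(content.get("nPaws")),
        "paws": paws,
        "font": dict(root.find("font").attrib),
        "specs": dict(root.find("specs").attrib),
        "generation": dict(root.find("generation").attrib),
    }


if __name__ == "__main__":
    info = parse_ground_truth(SAMPLE_XML)
    print(info["transcription"], info["paws"], info["generation"]["type"])
```

Comparing len(info["paws"]) with info["n_paws"] is a cheap consistency check when loading many files.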

Letter label	Number of occurrences	Isolated	Begin	Middle	End
Alif	90353	ﺍ		ﺎ
Baa	28119	ﺏ	ﺑ	ﺒ	ﺐ
Taaa	59343	ﺕ	ﺗ	ﺘ	ﺖ
Thaa	3803	ﺙ	ﺛ	ﺜ	ﺚ
Jiim	11455	ﺝ	ﺟ	ﺠ	ﺞ
Haaa	17866	ﺡ	ﺣ	ﺤ	ﺢ
Xaa	8492	ﺥ	ﺧ	ﺨ	ﺦ
Daal	18399	ﺩ		ﺪ
Thaal	3100	ﺫ		ﺬ
Raa	37571	ﺭ		ﺮ
Zaay	6325	ﺯ		ﺰ
Siin	21648	ﺱ	ﺳ	ﺴ	ﺲ
Shiin	8668	ﺵ	ﺷ	ﺸ	ﺶ
Saad	8310	ﺹ	ﺻ	ﺼ	ﺺ
Daad	5548	ﺽ	ﺿ	ﻀ	ﺾ
Thaaa	8610	ﻁ	ﻃ	ﻄ	ﻂ
Taa	1438	ﻅ	ﻇ	ﻈ	ﻆ
Ayn	16552	ﻉ	ﻋ	ﻌ	ﻊ
Ghayn	5912	ﻍ	ﻏ	ﻐ	ﻎ
Faa	13749	ﻑ	ﻓ	ﻔ	ﻒ
Gaaf	16819	ﻕ	ﻗ	ﻘ	ﻖ
Kaaf	12711	ﻙ	ﻛ	ﻜ	ﻚ
Laam	41159	ﻝ	ﻟ	ﻠ	ﻞ
Miim	47084	ﻡ	ﻣ	ﻤ	ﻢ
Nuun	44186	ﻥ	ﻧ	ﻨ	ﻦ
NuunChadda	1343	ﻥّ	ﻧّ	ﻨّ	ﻦّ
Haa	16094	ﻩ	ﻫ	ﻬ	ﻪ
Waaw	26008	ﻭ		ﻮ
Yaa	40215	ﻱ	ﻳ	ﻴ	ﻲ
YaaChadda	4348	ﻱّ	ﻳّ	ﻴّ	ﻲّ
Hamza	1142	ء
HamzaAboveAlif	8770	أ		ﺄ
TaaaClosed	8376	ﺓ			ﺔ
HamzaUnderAlif	1501	ﺇ		ﺈ
AlifBroken	972	ﻯ			ﻰ
TildAboveAlif	500	ﺁ		ﺂ
HamzaAboveAlifBroken	1253	ﺉ	ﺋ	ﺌ	ﺊ
HamzaAboveWaaw	538	ؤ		ـؤ
*Quantity of Characters*	*648’280*
*Quantity of PAWs*	*274’833*
*Quantity of words*	*113’284*

Table 1: Arabic letters with their labels and number of occurrences in the APTI Database

The character labels are summarized in Table 1. Because the shape of a character varies with its position in the word, each label also carries a suffix specifying that position: “B” for beginning, “M” for middle, “E” for end and “I” for isolated. The character “Hamza” is always isolated, so it takes no position suffix. We also introduced artificial character labels such as “NuunChadda” and “YaaChadda” to represent the shapes resulting from the combination of “Nuun” or “Yaa” with “Chadda”.
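The suffix convention can be sketched as a small Python helper that splits a label into its base letter and position. The text does not say exactly how the suffix is attached, so this sketch assumes it is the last character of the label, optionally preceded by an underscore (for example "BaaB" or "Baa_B"); the function name split_label is hypothetical.

```python
# Position suffixes as described in the text.
POSITIONS = {"B": "begin", "M": "middle", "E": "end", "I": "isolated"}

# Labels that carry no suffix; per the text, Hamza is always isolated.
UNSUFFIXED = {"Hamza"}


def split_label(label):
    """Split a character label into (base letter, position).

    Assumption: the suffix is the final character of the label, with an
    optional underscore separator before it.
    """
    if label in UNSUFFIXED:
        return label, "isolated"
    suffix = label[-1:]
    base = label[:-1].rstrip("_")
    if suffix not in POSITIONS or not base:
        raise ValueError(f"label without a valid position suffix: {label!r}")
    return base, POSITIONS[suffix]


if __name__ == "__main__":
    for label in ["BaaB", "NuunChaddaM", "TaaaClosedE", "Hamza"]:
        print(label, split_label(label))
```

Handling "Hamza" before looking at the last character keeps the one suffix-free label from being rejected as malformed.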

Recent News

[23/01/2017] The third edition of the ICDAR2017 Competition on Multi-font and Multi-size Digitally Represented Arabic Text will be organized at ICDAR'2017 using APTI Database.

[03/01/2013] The second edition of the Competition on Multi-font and Multi-size Digitally Represented Arabic Text will be organized at ICDAR'2013 using APTI Database.

[14/02/2011] The first edition of the Arabic Recognition Competition: Multi-font Multi-size Digitally Represented Text was organized at ICDAR'2011 using APTI Database.

[06/06/2009] APTI Database was officially presented at ICDAR'09.


This work is a joint collaboration between different research groups:

http://diuf.unifr.ch/diva
DIVA Group from University of Fribourg (Switzerland)

http://www.regim.org
REGIM Group from University of Sfax (Tunisia)

http://iig.hevs.ch/valais/software-engineering.html
Software Engineering Unit from Business Information System Institute (HES-SO //Wallis - Switzerland)

APTI Database Arabic Printed Text Image Database