APTI Database is the large-scale benchmarking of open-vocabulary, multi-font, multi-size and multi-style text recognition systems in Arabic. The database is called APTI for Arabic Printed Text Image. The challenges that are addressed by the database are in the variability of the sizes, fonts and style used to generate the images. A focus is also given on low-resolution images where anti-aliasing is generating noise on the characters to recognize. The database is synthetically generated using a lexicon of 113'284 words, 10 Arabic fonts, 10 font sizes and 4 font styles. The database contains 45'313'600 single word images totaling to more than 250 million characters. Ground truth annotation is provided for each image thanks to a XML file. The annotation includes the number of characters, the number of PAWs (Pieces of Arabic Word), the sequence of characters, the size, the style, the font used to generate each image, etc.

Lexicon of APTI Database

The APTI Database contains a mix of decomposable and non-decomposable word images. Decomposable words are generated from root Arabic verbs using Arabic schemes whereas non-decomposable words are formed by Arabic proper names, general names, country/town/village names, Arabic prepositions, etc. To generate the lexicon, we have parsed different Arabic books such as The Muqaddimah - An introduction to history of Ibn Khaldun and Al-bukhala of Gahiz as well as a collection of recent Arabic newspapers articles taken from the Internet and a large lexicon file produced by Kanoun in 2005. This parsing procedure totalled 113'284 single different Arabic words, leading to a pretty good coverage of the Arabic words mostly used in texts.

Fonts, styles and sizes

Taking as input the words in the lexicon, the images of APTI are generated using 10 different fonts presented in Fig. 1: Andalus, Arabic Transparent, AdvertisingBold, Diwani Letter, DecoType Thuluth, Simplified Arabic, Tahoma, Traditional Arabic, DecoType Naskh, M Unicode Sara. These fonts have been selected to cover different complexity of shapes of Arabic printed characters, going from simple fonts with no or few overlaps and ligatures (AdvertisingBold) to more complex fonts rich in overlaps, ligatures and flourishes (Diwani Letter or Thuluth).
Different sizes are also used in APTI: 6 points, 7 points, 8 points, 9 points, 10 points, 12 points, 14 points, 16 points, 18 points and 24 points. We also used 4 different styles namely plain, italic, bold and combination of italic and bold.
These sizes, fonts and styles are widely used on computer screen, Arabic newspapers, books and many other documents. The combination of fonts, styles and sizes guaranties a wide variability of images in the database.
Overall, the APTI Database contains 45’313’600 single words images, taking into account the full lexicon where the different combinations of fonts, style and sizes are applied.

Fig. 1: Fonts used to generate the APTI Database: (A) Andalus, (B) Arabic Transparent, (C) AdvertisingBold, (D) Diwani Letter, (E) DecoType Thuluth, (F) Simplified Arabic, (G) Tahoma, (H) Traditional Aatbic, (I) DecoType Naskh, (J) M Unicode Sara

Sources of variability

The sources of variability in the generation procedure of text images in APTI are the following:
1. 10 different fonts: Andalus, Arabic Transparent, AdvertisingBold, Diwani Letter, DecoType Thuluth, Simplified Arabic, Tahoma, Traditional Arabic, DecoType Naskh, M Unicode Sara;
2. 10 different sizes: 6, 7, 8, 9, 10, 12, 14, 16, 18 and 24 points;
3. 4 different styles: plain, bold, italic, italic and bold;
4. Various forms of ligatures and overlaps of characters thanks to the large combination of characters in the lexicon and thanks to the used fonts;
5. Very large vocabulary that allows to test systems on unseen data;
6. Various artefacts of the downsampling and antialiasing filters due to the random insertion of columns of white pixels at the beginning of image words;
7. Variability of the height of each word image.

The last point of the previous list is actually intrinsic to the sequence of characters appearing in the word. In APTI, there is actually no a priori knowledge of the position of the baseline and it is up to the recognition algorithm to compute the baseline, if needed.

The APTI Database was developed by:

 

 




Document, Image and Voice Analysis (DIVA)
University of Fribourg (Unifr), Switzerland







REsearch Group on Intelligent Machines (REGIM)
National Engineering School of Sfax (ENIS), Tunisia


 


Software Engineering Unit
Business Information System Institute, (HES-SO //Wallis), Switzerland