Competition on Superimposed Text Detection and Recognition in Arabic News Video Frames
Held within the framework of the 25th International Conference on Pattern Recognition (ICPR2020), Milan, Italy, 13-18 September 2020.
AcTiVComp20 includes three main tasks: text detection, text recognition and end-to-end text recognition in news video frames. Each of these tasks may include one or more evaluation protocols. In order to participate in this challenge, you have to take part in at least one task. The first two tasks are similar to those of AcTiVcomp17, but they are re-opened for this edition with a special focus on the channel-free and YouTube-quality protocols (red protocols in Tables I and II).
The objective of this task is to obtain an estimation of the text regions in a video frame in terms of bounding boxes <x, y, w, h>. In what follows, we present details about the datasets, evaluation protocols and metrics used for each task.
AcTiV-D represents a subset of non-redundant frames collected from the AcTiV dataset [2,3] and used here to measure the performance of participating methods in the text detection task. It contains a total of 2,557 news video frames that have been hand-selected with particular attention to achieving a high diversity in text regions. Figure 1 shows some examples from AcTiV-D.
The ground-truth information (an XML file) for the detection task is provided at the line level for each TV channel's frame. Figure 2 shows part of an XML file of the evaluation protocol p4.3 (TunisiaNat1 TV). A text bounding box is described by the element <Rectangle>, which contains the rectangle's attributes: the x,y coordinates (upper-left pixel), width and height. The output of a text detector is an XML file having the same structure as the ground-truth file. The output image and the original one should have the same label: [channel_source_frame_id] (e.g. TunisiaNat1_vd01_frame_7). Note that the proposed dataset includes some frames that do not contain any text, and others that contain the same text regions but with different backgrounds.
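As an illustration, ground-truth files in this format can be read with a few lines of Python. The <Rectangle> element is described above, but the exact attribute names (x, y, width, height) are an assumption and may differ slightly in the released files:

```python
import xml.etree.ElementTree as ET

def load_rectangles(xml_path):
    """Return the list of (x, y, w, h) boxes found in a ground-truth file.

    Assumes each text region is a <Rectangle> element whose position is
    stored in 'x'/'y' (upper-left pixel) and 'width'/'height' attributes;
    these attribute names are an assumption, not a confirmed schema.
    """
    tree = ET.parse(xml_path)
    boxes = []
    for rect in tree.getroot().iter("Rectangle"):
        boxes.append((int(rect.get("x")), int(rect.get("y")),
                      int(rect.get("width")), int(rect.get("height"))))
    return boxes
```

A detector's output file, having the same structure, can be checked with the same routine before submission.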
| Resolution | Protocol | TV Channel | Training set | Test set | cTest set | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|---|---|
| SD | 4.1 | France 24 Arabe | 331 | 80 | 104 | % | % | % |
| SD | 4.2 | Russia Today Arabic | 323 | 79 | 100 | % | % | % |
| SD | 4.4 | All SD channels | 1,146 | 275 | 310 | % | % | % |
Metrics: The performance of the participating text detectors will be evaluated based on precision, recall and F-measure using our evaluation tool. This tool takes into account all types of matching cases between ground-truth bounding boxes and detected bounding boxes: one-to-one, one-to-many and many-to-one matching. The proposed performance metrics are similar to those used in the ICDAR Robust Reading Competition (RRC) series.
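To make the scoring concrete, here is a minimal sketch of how precision, recall and F-measure follow from box matching. It only implements the simple one-to-one case with a greedy overlap threshold; the actual evaluation tool additionally handles one-to-many and many-to-one matches, so this is an illustration, not the official scorer:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def detection_scores(gt_boxes, det_boxes, thr=0.5):
    """Greedy one-to-one matching; returns (precision, recall, f_measure)."""
    matched_gt = set()
    tp = 0
    for d in det_boxes:
        best, best_iou = None, thr
        for i, g in enumerate(gt_boxes):
            if i in matched_gt:
                continue
            v = iou(d, g)
            if v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            matched_gt.add(best)
            tp += 1
    precision = tp / len(det_boxes) if det_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

The overlap threshold of 0.5 is an assumed value chosen for the sketch; participants should rely on the provided evaluation tool for official figures.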
Note: Participants should hide the scrolling bar (news ticker) for AljazeeraHD and France24 TV channels before any processing. No groundtruth data are provided for dynamic text in the AcTiV-D dataset.
Taking a textline image as input, the objective of this task is to generate the corresponding text transcriptions.
AcTiV-R represents a subset of cropped textline images created from the AcTiV dataset [2,3] and used to evaluate the performance of participating methods in the Arabic textline recognition task. As shown in Figure 4, this dataset includes five different text fonts with various sizes and colors. AcTiV-R consists of 10,415 cropped textline images, comprising 44,583 words and 259,192 characters.
The recognition ground-truth information is provided at the line level for each cropped image. Figure 5 depicts an example of a ground-truth XML file and its corresponding textline image. The XML file is composed of two principal markup sections: ArabicTranscription and LatinTranscription. In order to have an easily accessible representation of Arabic text, it is transformed into a set of Latin labels with a suffix that refers to the letter's position in the word, i.e. B: Begin, M: Middle, E: End and I: Isolate. The Latin transcription of the two Arabic words "نشرة الأخبار" is "Nuun_B Shiin_M Raa_E TaaaClosed_I Space Alif_I Laam_E HamzaAboveAlif_E Xaa_B Baa_M Alif_E Raa_I". During the annotation process, 165 character glyphs (i.e. shapes) were considered, including 10 digits and 12 punctuation marks.
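A transcription in this label scheme can be decomposed mechanically. The sketch below assumes the labels are space-separated `Name_P` tokens with P one of B/M/E/I, and that the bare token `Space` marks a word boundary, as in the example above:

```python
# Position suffixes used in the Latin label scheme.
POSITIONS = {"B": "Begin", "M": "Middle", "E": "End", "I": "Isolate"}

def parse_labels(transcription):
    """Split a Latin transcription into (glyph, position) pairs.

    Assumes space-separated 'Name_P' tokens (P in B/M/E/I) and a bare
    'Space' token as the word separator, following the example format.
    """
    pairs = []
    for token in transcription.split():
        if token == "Space":
            pairs.append(("Space", None))
        else:
            name, _, pos = token.rpartition("_")
            pairs.append((name, POSITIONS[pos]))
    return pairs
```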
Note: The participants are free to use Arabic characters or their corresponding Latin transcriptions to train their OCR systems. However, the output of the recognition task should be transcribed with Latin labels. Hereafter is a typical example of text recognizer input/output files.
$ ProposedRec input.txt output.txt
The output file should contain the path and the generated character labels of the recognized image.
For example, the correct output result for the image in Figure 4 should be as follows:
| Resolution | Protocol | TV Channel | Training set (lines / words / chars) | Test set (lines / words / chars) | cTest set (lines / words / chars) | CRR | WRR | LRR |
|---|---|---|---|---|---|---|---|---|
| SD | 6.2 | Russia Today Arabic | 2,127 / 13,462 / 78,936 | 250 / 1,483 / 8,749 | 256 / 1,598 / 9,305 | % | % | % |
| SD | 6.4 | All SD channels | 6,034 / 28,483 / 165,830 | 618 / 2,856 / 16,671 | 668 / 3,286 / 19,502 | % | % | % |
Metrics: To evaluate an OCR system, we use the Line Recognition Rate (LRR), the Word Recognition Rate (WRR) and the Character Recognition Rate (CRR). These metrics are based on the computation of insertion, deletion and substitution errors at the character level.
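A minimal sketch of such a metric, assuming the standard Levenshtein edit distance and a rate of the form 1 - errors / reference length; the official tool may normalize differently, so treat this as an illustration of the principle rather than the competition scorer:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    m, n = len(ref), len(hyp)
    d = list(range(n + 1))          # DP row for the previous ref prefix
    for i in range(1, m + 1):
        prev, d[0] = d[0], i        # prev holds d[i-1][j-1]
        for j in range(1, n + 1):
            cur = d[j]              # d[i-1][j]
            d[j] = min(d[j] + 1,                      # delete ref[i-1]
                       d[j - 1] + 1,                  # insert hyp[j-1]
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitute
            prev = cur
    return d[n]

def recognition_rate(ref, hyp):
    """CRR-style rate over characters; applied to word tokens it gives
    a WRR-style score. The exact normalization used by the official
    evaluation tool is an assumption here."""
    if not ref:
        return 0.0
    return 1.0 - edit_distance(ref, hyp) / len(ref)
```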
The end-to-end task represents the main novelty of this edition: in contrast to the other tasks, all textlines in a given input frame should be localized and recognized in a single step. For this purpose, we provide the participants with a new version of the AcTiV dataset with a different format of ground-truth files.
Input: A set of 588 frames is used in this task. It includes almost all the data of protocol 1 and some added frames, and is distributed as 355, 116 and 117 frames for training, validation and testing, respectively. A total of 2,534 textlines, including 167 new ones, is provided for the recognition part. Figure 6 depicts a ground-truth file example for this task.
[1] O. Zayene, J. Hennebert, R. Ingold and N. E. Ben Amara, "ICDAR2017 Competition on Arabic Text Detection and Recognition in Multi-resolution Video Frames", in Proc. of the 14th International Conference on Document Analysis and Recognition (ICDAR), pp. 1460-1465, November 2017.
[2] O. Zayene, J. Hennebert, S. M. Touj, R. Ingold and N. Essoukri Ben Amara, "A dataset for Arabic text detection, tracking and recognition in news videos - AcTiV", in Proc. of the 13th International Conference on Document Analysis and Recognition (ICDAR'15), Nancy, France, August 2015.
[3] O. Zayene, S. M. Touj, J. Hennebert, R. Ingold and N. E. Ben Amara, "Open Datasets and Tools for Arabic Text Detection and Recognition in News Video Frames", Journal of Imaging, 4(2), pp. 1-19, January 2018.
[4] O. Zayene, S. M. Touj, J. Hennebert, R. Ingold and N. E. Ben Amara, "Data, protocol and algorithms for performance evaluation of text detection in Arabic news video", in Proc. of the 2nd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), pp. 258-263, March 2016.