AcTiVComp20

Competition on Superimposed Text Detection and Recognition in Arabic News Video Frames

Held within the framework of the 25th International Conference on Pattern Recognition (ICPR2020), Milan, Italy, 13-18 September 2020

Call For Participation (PDF)

Tasks and evaluation procedure




AcTiVComp20 includes three main tasks: text detection, text recognition, and end-to-end text recognition in news video frames. Each of these tasks may include one or more evaluation protocols. To participate in this challenge, you have to take part in at least one task. The first two tasks are similar to those of AcTiVcomp17 [1], but they are re-opened for this edition with a special focus on the channel-free and YouTube-quality protocols (shown in red in Tables I and II).

Task 1: Text detection in Arabic news video frames

The objective of this task is to estimate the text regions in a video frame as bounding boxes <x, y, w, h>. In what follows, we present details of the datasets, evaluation protocols and metrics used for each task.
AcTiV-D is a subset of non-redundant frames collected from the AcTiV dataset [2,3], used here to measure the performance of participating methods on the text detection task. It contains a total of 2,557 news video frames that have been hand-selected with particular attention to achieving high diversity in text regions. Figure 1 shows some examples from AcTiV-D.


Fig.1. Typical video frames from AcTiV-D. From left to right (clockwise): examples of Russia Today Arabic, AljazeeraHD, TunisiaNat1 and France24 Arabe frames.

The groundtruth information (XML file) for the detection task is provided at the line level for each TV channel's frame. Figure 2 shows a part of an XML file for evaluation protocol p4.3 (TunisiaNat1 TV). A text bounding box is described by the element <Rectangle>, which carries the rectangle's attributes: x, y coordinates (upper-left pixel), width and height. The output of a text detector is an XML file having the same structure as the groundtruth file. The output image and the original one should have the same label: [channel_source_frame_id] (e.g. TunisiaNat1_vd01_frame_7). Note that the proposed dataset includes some frames which do not contain any text and others which contain the same text regions but with different backgrounds.


Fig.2. A part of the detection XML file of protocol 4.3 (TunisiaNat1 TV channel)
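For illustration, here is a minimal Python sketch of reading such a groundtruth file. It assumes the <Rectangle> attributes are literally named x, y, width and height, as described above; the exact names should be checked against the provided XML files.

import xml.etree.ElementTree as ET

def load_boxes(xml_path):
    """Return all groundtruth boxes of a frame as (x, y, w, h) tuples."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for rect in root.iter("Rectangle"):  # one element per text bounding box
        boxes.append((int(rect.get("x")), int(rect.get("y")),
                      int(rect.get("width")), int(rect.get("height"))))
    return boxes

# e.g. boxes = load_boxes("TunisiaNat1_vd01_frame_7.xml")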

The evaluation protocols of the detection task are as follows:
  • Protocol 1 aims to measure the performance of single-frame-based methods in localizing text regions in still HD images.
  • Protocol 4 is similar to protocol 1, differing only in channel resolution. All SD (720x576) channels in our database are targeted by this protocol, which is split into four sub-protocols: three channel-dependent protocols (p4.1, p4.2 and p4.3) and one channel-free protocol (p4.4).
  • Protocol 4bis is dedicated to the newly collected data (480x360) from the official YouTube channel of TunisiaNat1 TV. The main idea of this protocol is to train a given system with SD (720x576) data, i.e. protocol p4.3 or p4.4, and test it with a different resolution and quality.
  • Protocol 7 is the generic version of the previous protocols, where text detection is evaluated regardless of data quality.
More details are presented in Table I in terms of resolution, protocol, data (frames) distribution and metrics.
Table I. Detection Dataset and Evaluation Protocols. "cTest" stands for the competition test set.
Resolution      Protocol  TV Channel           Training set  Test set  cTest set  Precision  Recall  F-measure
HD (1920x1080)  1         AljazeeraHD          337           88        103        %          %       %
SD (720x576)    4.1       France 24 Arabe      331           80        104        %          %       %
SD (720x576)    4.2       Russia Today Arabic  323           79        100        %          %       %
SD (720x576)    4.3       TunisiaNat1          492           116       106        %          %       %
SD (720x576)    4.4       All SD channels      1,146         275       310        %          %       %
SD (480x360)    4bis      TunisiaNat1 YouTube  -             150       149        %          %       %
-               7         All channels         1,483         362       413        %          %       %

Metrics: The performance of the participating text detectors will be evaluated based on precision, recall and F-measure using our evaluation tool [4]. This tool takes into account all types of matching cases between groundtruth bounding boxes and detected bounding boxes: one-to-one, one-to-many and many-to-one matching. The proposed performance metrics are similar to those used in the ICDAR Robust Reading Competition (RRC) series.
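As a rough illustration only, the sketch below computes precision, recall and F-measure with greedy one-to-one IoU matching; the official tool [4] additionally scores the one-to-many and many-to-one cases, so its results may differ.

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ix = max(a[0], b[0])
    iy = max(a[1], b[1])
    ix2 = min(a[0] + a[2], b[0] + b[2])
    iy2 = min(a[1] + a[3], b[1] + b[3])
    if ix2 <= ix or iy2 <= iy:
        return 0.0
    inter = (ix2 - ix) * (iy2 - iy)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def precision_recall_f(gt_boxes, det_boxes, thr=0.5):
    """Greedy one-to-one matching at a fixed IoU threshold."""
    matched, tp = set(), 0
    for d in det_boxes:
        best_i, best_v = None, thr
        for i, g in enumerate(gt_boxes):
            v = iou(d, g)
            if i not in matched and v >= best_v:
                best_i, best_v = i, v
        if best_i is not None:
            matched.add(best_i)
            tp += 1
    p = tp / len(det_boxes) if det_boxes else 0.0
    r = tp / len(gt_boxes) if gt_boxes else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f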

Note: Participants should hide the scrolling bar (news ticker) of the AljazeeraHD and France24 TV channels before any processing. No groundtruth data are provided for dynamic text in the AcTiV-D dataset.

Fig.3. No groundtruth information for scrolling text bar in the detection task

Task 2: Text recognition in cropped news images

Taking a textline image as input, the objective of this task is to generate the corresponding text transcription.
AcTiV-R is a subset of cropped textline images created from the AcTiV dataset [2,3], used to evaluate the performance of participating methods on the Arabic textline recognition task. As shown in Figure 4, this dataset includes five different text fonts with various sizes and colors. AcTiV-R consists of 10,415 cropped textline images, 44,583 words and 259,192 characters.


Fig.4. Examples of textline images from the AcTiV-R dataset

The recognition groundtruth information is provided at the line level for each cropped image. Figure 5 depicts an example of a groundtruth XML file and its corresponding textline image. The XML file is composed of two principal markup sections: ArabicTranscription and LatinTranscription. In order to have an easily accessible representation of Arabic text, it is transformed into a set of Latin labels with a suffix that refers to the letter's position in the word, i.e. B: Begin, M: Middle, E: End and I: Isolated. The Latin transcription of the two Arabic words "نشرة الأخبار" is "Nuun_B Shiin_M Raa_E TaaaClosed_I Space Alif_I Laam_E HamzaAboveAlif_E Xaa_B Baa_M Alif_E Raa_I". During the annotation process, 165 character glyphs (i.e. shapes) were considered, including 10 digits and 12 punctuation marks.

Fig.5. Example of a recognition xml file and its corresponding textline image
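To make the labeling scheme concrete, the sketch below decodes a sequence of positional Latin labels back to Arabic characters. The mapping table here is a tiny illustrative excerpt; the complete 165-glyph inventory is defined by the dataset annotation, not by this example.

# Illustrative excerpt of the label-to-character mapping (not the full table).
LABEL_TO_CHAR = {
    "Nuun": "ن", "Shiin": "ش", "Raa": "ر", "TaaaClosed": "ة",
    "Alif": "ا", "Laam": "ل", "HamzaAboveAlif": "أ",
    "Xaa": "خ", "Baa": "ب", "Space": " ",
}

def labels_to_text(label_string):
    """Strip the positional suffix (_B/_M/_E/_I) and map back to Arabic."""
    chars = []
    for lab in label_string.split():
        base = lab.rsplit("_", 1)[0] if "_" in lab else lab
        chars.append(LABEL_TO_CHAR.get(base, "?"))
    return "".join(chars)

# labels_to_text("Nuun_B Shiin_M Raa_E TaaaClosed_I Space Alif_I Laam_E "
#                "HamzaAboveAlif_E Xaa_B Baa_M Alif_E Raa_I")  ->  "نشرة الأخبار"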


Note: Participants are free to use Arabic characters or their corresponding Latin transcriptions to train their OCR systems. However, the output of the recognition task should be transcribed with Latin labels. Hereafter is a typical example of the text recognizer's input/output files.


$ ProposedRec input.txt output.txt
input.txt:
/home/AcTiVComp20/AcTiV-R/p4.3/testImages/TunisiaNat1_vd01_frame_7-1.png
/home/AcTiVComp20/AcTiV-R/p4.3/testImages/TunisiaNat1_vd01_frame_7-2.png
/home/AcTiVComp20/AcTiV-R/p4.3/testImages/TunisiaNat1_vd01_frame_16-1.png
...
output.txt:
The output file should contain the path and the generated character labels of each recognized image.
For example, the correct output for the image in Figure 4 would be as follows:
#!MLF!#
"/home/AcTiVComp20/AcTiV-R/p4.3/outputRec/TunisiaNat1_vd07_frame_131-4.rec"
Jiim
Baa
Laam
Space
Alif
Laam
Shiin
Ayn
Alif
Nuun
Baa
Yaa
.
"/home/AcTiVComp20/task2/p4.3/outputRec/TunisiaNat1_vd07_frame_131-5.rec"
...
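The output follows an MLF-style layout (as used by HTK): a "#!MLF!#" header, then for each image a quoted .rec path, one label per line, and a "." terminator. A minimal Python sketch that writes this format is given below; the function name is illustrative.

def write_mlf(results, out_path):
    """results: list of (rec_file_path, [label, label, ...]) pairs."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("#!MLF!#\n")                 # MLF header
        for rec_path, labels in results:
            f.write('"%s"\n' % rec_path)     # quoted path of the .rec entry
            for lab in labels:
                f.write(lab + "\n")          # one character label per line
            f.write(".\n")                   # entry terminator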

The evaluation protocols of the recognition task are as follows:
  • Protocol 3 aims to evaluate the performance of OCR systems in recognizing texts in HD images.
  • Protocol 6 is similar to protocol 3, differing only in channel resolution. All SD (720x576) channels in our database are targeted by this protocol, which is split into four sub-protocols: three channel-dependent protocols (p6.1, p6.2 and p6.3) and a channel-free one (p6.4).
  • Protocol 6bis is dedicated to the new stream resolution (480x360) for TunisiaNat1 TV. The idea is to train a given system with SD (720x576) data and test it with a different data resolution and quality.
  • Protocol 9 is the generic version of the previous protocols, where text recognition is evaluated independently of data quality.
More details about the protocols and statistics of the used dataset are given in Table II.
Table II. Recognition Dataset and Evaluation Protocols. "Lns", "Wds" and "Chars" respectively denote "Lines", "Words" and "Characters".
Resolution      Protocol  TV Channel           Training set (#Lns/#Wds/#Chars)  Test set (#Lns/#Wds/#Chars)  cTest set (#Lns/#Wds/#Chars)  CRR  WRR  LRR
HD (1920x1080)  3         AljazeeraHD          1,909 / 8,110 / 46,563           196 / 766 / 4,343            262 / 1,082 / 6,283           %    %    %
SD (720x576)    6.1       France24 Arabe       1,906 / 5,683 / 32,085           179 / 667 / 3,835            191 / 734 / 4,600             %    %    %
SD (720x576)    6.2       Russia Today Arabic  2,127 / 13,462 / 78,936          250 / 1,483 / 8,749          256 / 1,598 / 9,305           %    %    %
SD (720x576)    6.3       TunisiaNat1          2,001 / 9,338 / 54,809           189 / 706 / 4,087            221 / 954 / 5,597             %    %    %
SD (720x576)    6.4       All SD channels      6,034 / 28,483 / 165,830         618 / 2,856 / 16,671         668 / 3,286 / 19,502          %    %    %
SD (480x360)    6bis      TunisiaNat1 YouTube  - / - / -                        320 / 1,487 / 8,726          311 / 1,148 / 6,645           %    %    %
-               9         All channels         7,943 / 36,593 / 212,393         814 / 3,622 / 21,014         930 / 4,368 / 25,785          %    %    %

Metrics: To evaluate an OCR system, we use the Line Recognition Rate (LRR), the Word Recognition Rate (WRR) and the Character Recognition Rate (CRR). These metrics are based on the computation of insertion, deletion and substitution errors at the character level.
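As an illustration, CRR is commonly computed as 1 - (insertions + deletions + substitutions) / (number of reference characters); the sketch below implements this with a standard edit-distance dynamic program. The organizers' exact normalization may differ slightly, and WRR/LRR apply the same idea at the word and line levels.

def edit_distance(ref, hyp):
    """Levenshtein distance: minimal insertions + deletions + substitutions."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def crr(ref, hyp):
    """Character Recognition Rate for one line (strings of characters)."""
    return 1.0 - edit_distance(ref, hyp) / max(len(ref), 1)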


Task 3: End-to-End Arabic video text recognition

The End-to-End task is the main novelty of this edition: in contrast to the other tasks, all textlines in a given input frame should be localized and recognized in a single step. For this purpose, we provide the participants with a new version of the AcTiV dataset using a different ground-truth file format.
Input: A set of 588 frames is used in this task. It includes almost all the data of protocol 1 plus some added frames, and is distributed as 355, 116 and 117 frames for training, validation and test purposes, respectively. A total of 2,534 textlines, including 167 new ones, is provided for the recognition part. Figure 6 depicts an example groundtruth file for this task.

Fig.6. Example of a video frame and its corresponding new GT file format for the End-to-End recognition task

Output: End-to-end results should be provided in a similar format as the groundtruth text file.
Metrics: Average Precision (AP) calculated at IoU = 0.5 will be taken as the primary challenge metric. The metrics are calculated in the same way as in Task 1, except that the recognition results are also taken into consideration: a detection is counted as a true positive only if its bounding box sufficiently overlaps the matching groundtruth box and its recognition matches the groundtruth line. It is worth noting that these metrics are similar to those used in the ICDAR2017 COCO-Text Challenge, with the only difference that the evaluation is made at the line level for the recognition task (not at the word level).
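In code, the true-positive test described above reduces to the check sketched below, reusing the iou() helper from the detection sketch of Task 1; an exact string match is assumed here for the transcription comparison.

def is_end_to_end_tp(det_box, det_text, gt_box, gt_text, thr=0.5):
    """A detection counts only if the box overlaps enough AND the text matches."""
    return iou(det_box, gt_box) >= thr and det_text == gt_text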



References
[1] O. Zayene, J. Hennebert, R. Ingold and N. E. Ben Amara, "ICDAR2017 Competition on Arabic Text Detection and Recognition in Multi-resolution Video Frames", in Proc. of the 14th International Conference on Document Analysis and Recognition (ICDAR), pp. 1460-1465, November 2017.
[2] O. Zayene, J. Hennebert, S. M. Touj, R. Ingold and N. Essoukri Ben Amara, "A dataset for Arabic text detection, tracking and recognition in news videos - AcTiV", in Proc. of the 13th International Conference on Document Analysis and Recognition (ICDAR'15), Nancy, France, August 2015.
[3] O. Zayene, S. M. Touj, J. Hennebert, R. Ingold and N. E. Ben Amara, "Open Datasets and Tools for Arabic Text Detection and Recognition in News Video Frames", Journal of Imaging, 4(2), pp. 1-19, January 2018.
[4] O. Zayene, S. M. Touj, J. Hennebert, R. Ingold and N. E. Ben Amara, "Data, protocol and algorithms for performance evaluation of text detection in Arabic news video", in Proc. of the 2nd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), pp. 258-263, March 2016.