Department of Informatics

VLR-OCR

Summary: Current Web indexing technologies suffer from a severe drawback: web documents often present textual information that is encapsulated in digital images and therefore not available as actual coded text. Moreover, such images are not suited to processing by existing OCR software, since the latter is generally designed to recognize binary document images produced by scanners at resolutions between 200 and 600 dpi, whereas web images are colored and generally have less than 100 dpi. The proposed project shall contribute significantly to overcoming this severe limitation of current character recognition methodology. The idea is to take advantage of the image quality, which is assumed to be unskewed and noise-free. Furthermore, the proposed strategy includes explicit knowledge of digital typography and introduces innovative methods by judiciously combining font recognition with character recognition.


Difficulties of Web-Image-OCR:


1. Sources of variability

a) Anti-aliasing


Fig. 1 illustrates the images of the bi-level “A” and the anti-aliased “A”. Anti-aliasing is a rendering method that smooths the edges and diagonals of very low resolution characters, i.e. when only a few pixels are available for rendering the image. Anti-aliasing uses 256 gray levels, exploiting the way our eyes tend to average adjacent pixels.
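The averaging behind anti-aliasing can be simulated in a few lines. This is a toy sketch, not the project's code: it assumes the simplest rendering model, in which each low-resolution pixel is the area average of a block of high-resolution bi-level pixels, mapped to a 0-255 gray value.

```python
def antialias(bitmap, factor):
    """Downsample a bi-level bitmap by `factor`, averaging each
    factor x factor block into one 0-255 gray value."""
    h, w = len(bitmap), len(bitmap[0])
    out = []
    for y in range(0, h, factor):
        row = []
        for x in range(0, w, factor):
            block = [bitmap[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(round(255 * sum(block) / len(block)))
        out.append(row)
    return out

# A crude 8x8 bi-level diagonal stroke (1 = ink, 0 = background).
glyph = [[1 if x == y else 0 for x in range(8)] for y in range(8)]
low_res = antialias(glyph, 4)
print(low_res)  # [[64, 0], [0, 64]]
```

The sharp bi-level diagonal becomes intermediate gray values at low resolution — exactly the variability a recognizer must cope with.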

 

 

b) Grid-alignment

As illustrated in Fig. 2, the anti-aliased image of the same character in web images varies according to its position relative to the sampling grid, which we call the grid alignment.
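The effect can be made concrete with a toy illustration (again assuming a simple area-averaging rendering model): the same stroke, shifted by a few high-resolution pixels before downsampling, produces a different low-resolution gray image even though the ink is identical.

```python
def downsample(bitmap, factor):
    """Average each factor x factor block of a bi-level bitmap into a 0-255 gray."""
    h, w = len(bitmap), len(bitmap[0])
    return [[round(255 * sum(bitmap[y + dy][x + dx]
                             for dy in range(factor)
                             for dx in range(factor)) / factor ** 2)
             for x in range(0, w, factor)]
            for y in range(0, h, factor)]

def shift_right(bitmap, dx):
    """Shift a bitmap right by dx >= 1 pixels, padding with background."""
    return [[0] * dx + row[:-dx] for row in bitmap]

# A vertical bar of ink in columns 2-3 of an 8x8 bi-level image.
stroke = [[1 if 2 <= x < 4 else 0 for x in range(8)] for _ in range(8)]
a = downsample(stroke, 4)                   # one grid alignment
b = downsample(shift_right(stroke, 2), 4)   # same stroke, shifted 2 pixels
print(a)  # [[128, 0], [128, 0]]
print(b)  # [[0, 128], [0, 128]]
```

Identical ink, two different low-resolution images — so a single template per character cannot capture all grid alignments.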

 

c) Influence of adjacent characters

The images of characters in context are influenced by their adjacent characters at their left and right borders. This is another source of variability, since the images of characters in context differ from the images of their isolated counterparts, as illustrated in Fig. 3:
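A one-dimensional toy example (assuming the same area-averaging rendering model as above) shows how this happens: when a sampling cell straddles the character boundary, ink from the neighbor bleeds into the glyph's own edge pixel, so the glyph's low-resolution image changes with its context.

```python
def gray_profile(row, factor=2):
    """Average `factor`-pixel cells of a bi-level row into 0-255 gray values."""
    return [round(255 * sum(row[i:i + factor]) / factor)
            for i in range(0, len(row), factor)]

isolated   = [0, 0, 0, 1, 1, 0, 0, 0]   # glyph ink at columns 3-4
in_context = [0, 0, 0, 1, 1, 1, 1, 0]   # a neighbor's ink starts at column 5

print(gray_profile(isolated))    # [0, 128, 128, 0]
print(gray_profile(in_context))  # [0, 128, 255, 128]
```

The third gray value — the cell covering the glyph's right edge — changes from 128 to 255 purely because of the adjacent character.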

 

2. Segmentation problem

As Fig. 4 illustrates, there are no character interspaces available to segment the characters within the word “School”. This is a major challenge, as segmentation and recognition cannot be treated separately. To develop a Web-Image-OCR, segmentation and recognition have to be combined in the same process.

The current state of research: We have conducted two preliminary studies on very low resolution character identification. The first experiment, on the identification of isolated characters, delivered very accurate results, whereas the second experiment, on characters in context, delivered less accurate results. Therefore, we plan to design a more reliable character identification system using Hidden Markov Models (HMMs), which are able to combine segmentation and recognition within the same process. Furthermore, HMMs are well suited to include linguistic knowledge, such as probabilities of character and word sequences, which can further increase the recognition rate.
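The principle of combining segmentation and recognition in one process can be sketched with a highly simplified dynamic-programming decoder, in the spirit of HMM decoding but not the project's actual system. All names and values below are illustrative assumptions: characters are modeled as hypothetical column-profile templates, template matching stands in for emission probabilities, and a toy bigram table stands in for the linguistic knowledge. Every cut point and label compete jointly for the best total score, so no prior segmentation is needed.

```python
import math

TEMPLATES = {            # hypothetical character models: label -> column profile
    "c": [255, 128, 128],
    "l": [255, 0],
    "d": [255, 128, 255],
}
BIGRAM = {("c", "l"): 0.5, ("l", "c"): 0.4}   # toy character-sequence probabilities

def emission(cols, template):
    """Negative mean absolute difference, a crude log-likelihood stand-in."""
    if len(cols) != len(template):
        return -math.inf
    return -sum(abs(a - b) for a, b in zip(cols, template)) / len(cols)

def decode(columns):
    """Jointly segment and recognize: best[i] = (score, labels) for columns[:i]."""
    best = {0: (0.0, [])}
    for i in range(1, len(columns) + 1):
        for label, tpl in TEMPLATES.items():
            j = i - len(tpl)
            if j < 0 or j not in best:
                continue
            prev_score, prev_labels = best[j]
            score = prev_score + emission(columns[j:i], tpl)
            if prev_labels:  # add (toy) bigram log-probability
                score += math.log(BIGRAM.get((prev_labels[-1], label), 0.01))
            if i not in best or score > best[i][0]:
                best[i] = (score, prev_labels + [label])
    return best.get(len(columns), (-math.inf, []))[1]

# "cl" rendered with no interspace: segmentation and labels are found jointly.
print(decode([255, 128, 128, 255, 0]))  # ['c', 'l']
```

A real HMM replaces the templates with trained state models and the table with estimated n-gram probabilities, but the joint optimization over cuts and labels is the same idea.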

 

Period: The project started in April 2003 and is planned to last 4 years.

Funding: The project is funded by the University of Fribourg.

Participants:

  • Farshideh Einsele AT unifr.ch
  • Jean Hennebert AT unifr.ch
  • Rolf Ingold AT unifr.ch

Publications:
  • F. Einsele, R. Ingold, "A Study of the Variability of Very Low Resolution Characters and the Feasibility of their Discrimination Using Geometrical Features", Proc. of the 4th World Enformatica Congress, International Conference on Pattern Recognition and Computer Vision, Istanbul (Turkey), June 24-26, 2005, pp. 213-217.
  • F. Einsele, J. Hennebert, R. Ingold, "Towards Identification Of Very Low Resolution, Anti-Aliased Characters", IEEE International Symposium on Signal Processing and its Applications (ISSPA'07), Sharjah, United Arab Emirates, 2007.