department of informatics

XED - Extracting Hidden Structures from Electronic Documents

Keywords: PDF conversion, electronic document analysis, physical and logical layout, XML format, XCDF

Summary:

XED is a reverse-engineering tool for PDF documents that discovers and extracts the original layout structure of a document. It combines electronic extraction methods with state-of-the-art document analysis techniques and outputs the layout structure in XCDF, a hierarchical canonical format that is universal and independent of the document type.

 



XED proceeds in two main steps. First, it converts the PDF document into an internal Java tree, normalizing and cleaning the primitives of the original document and taking into account all types of embedded resources, such as raw images and fonts. Second, it analyzes this internal tree to recover the physical structures and represents them in the canonical XCDF format. XCDF represents the reorganized document in a structured and unique manner, which makes the document content easier to access in later processing. XED generates XCDF files without any additional, document-specific calibration of the system.
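
The two steps above can be illustrated with a minimal sketch. It is written in Python for brevity and is not XED's Java implementation: the `TextPrimitive` class, the baseline `tolerance`, and the `page` and `textline` element names are all assumptions made for this example and do not reflect the actual XCDF schema.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class TextPrimitive:
    """Hypothetical text fragment with its position on the page."""
    text: str
    x: float
    y: float  # baseline position, measured from the top of the page


def normalize(primitives):
    """Step 1 (sketch): drop empty fragments and sort by baseline, then by x."""
    kept = [p for p in primitives if p.text.strip()]
    return sorted(kept, key=lambda p: (p.y, p.x))


def group_into_lines(primitives, tolerance=2.0):
    """Step 2 (sketch): group fragments whose baselines lie within `tolerance`."""
    lines = []
    for p in primitives:
        if lines and abs(lines[-1][0].y - p.y) <= tolerance:
            lines[-1].append(p)
        else:
            lines.append([p])
    return lines


def to_xml(lines):
    """Serialize the recovered lines as XML (element names are illustrative only)."""
    page = ET.Element("page")
    for line in lines:
        ordered = sorted(line, key=lambda p: p.x)  # left-to-right within a line
        el = ET.SubElement(page, "textline", x=str(ordered[0].x), y=str(ordered[0].y))
        el.text = " ".join(p.text.strip() for p in ordered)
    return ET.tostring(page, encoding="unicode")


# Made-up input: fragments arrive out of order, one of them is empty.
primitives = [
    TextPrimitive("world", 60.0, 100.5),
    TextPrimitive("Hello", 10.0, 100.0),
    TextPrimitive("  ", 0.0, 50.0),
    TextPrimitive("Second line", 10.0, 120.0),
]

lines = group_into_lines(normalize(primitives))
xml_output = to_xml(lines)
print(xml_output)
```

The sketch only recovers text lines. A real physical-structure analysis would also have to handle blocks, columns, images and fonts, which are left out here.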


Period: The project started in summer 2003 and is still ongoing.

 

Participants:


Achievements

XED has already been integrated into several projects and applications that use PDF as their document representation format: FaericWorld, SMAC, DOLORES and JFriDoc.

 

Applications extending XED

  • PurifyDoc converts XCDF files into clean, structured PDF.
  • Xed(dot)net is a web service for testing XED's capabilities.
  • Inquisitor is an interface for editing document structures (see FaericWorld); it can be used to validate the results of XED and to create high-level annotations from XCDF.


  • Structexed analyzes XCDF files in order to extract the logical structure of newspapers.
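
As a rough illustration of what deriving a logical structure from a physical one can involve, the sketch below labels text blocks as headlines or body text. It does not describe how Structexed works: the font-size heuristic, the `headline_ratio` threshold, and the `textblock` element with its `font-size` attribute are assumptions made for this example and are not taken from the XCDF schema.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical physical-layout input; NOT the real XCDF schema.
SAMPLE = """
<page>
  <textblock font-size="28">City council approves new budget</textblock>
  <textblock font-size="10">The council voted on Tuesday evening.</textblock>
  <textblock font-size="10">Opposition members asked for a recount.</textblock>
</page>
"""


def label_blocks(xml_text, headline_ratio=1.5):
    """Label each block 'headline' or 'body' relative to the most common font size."""
    root = ET.fromstring(xml_text.strip())
    blocks = [
        (float(b.get("font-size")), (b.text or "").strip())
        for b in root.iter("textblock")
    ]
    # Assume the most frequent font size on the page is the body-text size.
    body_size = Counter(size for size, _ in blocks).most_common(1)[0][0]
    return [
        ("headline" if size >= headline_ratio * body_size else "body", text)
        for size, text in blocks
    ]


labeled = label_blocks(SAMPLE)
for label, text in labeled:
    print(f"{label}: {text}")
```

A single threshold on font size is a deliberately simple choice; newspaper layouts would likely need further cues such as position, column membership and reading order.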



Publications related to this project

  • J.-L. Bloechle, M. Rigamonti, D. Lalanne, R. Ingold, "XCDF : un format canonique pour la représentation de documents." In Proc. of Colloque International Francophone sur l'Ecrit et le Document (CIFED'06), Fribourg (Switzerland), September 18-22, 2006, pp. 19-23.
  • J.-L. Bloechle, M. Rigamonti, K. Hadjar, D. Lalanne, R. Ingold, "XCDF: A Canonical and Structured Document Format." In Horst Bunke, A. Lawrence Spitz (eds.), LNCS: "7th International Workshop, DAS 2006, Nelson, New Zealand, February 13-15, 2006, Proceedings", Springer-Verlag, vol. 3872, ISBN: 3-540-32140-3, 2006, pp. 141-152.
  • M. Rigamonti, J.-L. Bloechle, K. Hadjar, D. Lalanne, R. Ingold, "Towards a Canonical and Structured Representation of PDF Documents through Reverse Engineering." In Proc. of 8th International Conference on Document Analysis and Recognition (ICDAR'05), Seoul (Korea), August 29 - September 1, 2005, pp. 1050-1054.
  • M. Rigamonti, K. Hadjar, D. Lalanne, R. Ingold, "Xed: un outil pour l'extraction et l'analyse de documents PDF." In Proc. of Huitième Colloque International Francophone sur l'Ecrit et le Document (CIFED'04), La Rochelle (France), June 21-25, 2004, pp. 85-90.
  • K. Hadjar, M. Rigamonti, D. Lalanne, R. Ingold, "Xed: a new tool for eXtracting hidden structures from Electronic Documents." In Proc. of International Workshop on Document Image Analysis for Libraries (DIAL'04), Palo Alto, CA (USA), January 23-24, 2004, pp. 211-221.