

Transforming Logical Structures of Documents into SGML Markups

Rolf Brugger

October 1994

1. Introduction

This article presents a generator for SGML marked-up ASCII text. It has been integrated into the OSCAR-II prototype (referred to below as the document recognition module), which is part of a project for the structural optical recognition of printed documents. The module has been used in the HIPOCAMPE project [5], whose goal was to realize a prototype of an interactive computer-assisted instruction system.

2. The Hipocampe Project

2.1 The Document Recognition Module

The document recognition module plays an important role in the HIPOCAMPE system. Its task is to recognize two kinds of information from a printed and scanned document. It recognizes the logical structure by using segmentation together with character and font recognition. Merging these two intermediate results leads to the specific logical structure of a document, which is the result used by the subsequent modules.

   
2.1.1 Overview of the architecture of the document recognition module

This section briefly explains how the document recognition module can be decomposed into submodules, as shown in Figure 1. The root box symbolizes the generic logical document structure. It describes in a generic way the logical structure of the documents that the system should potentially be able to recognize. Here, the generic logical structure is formalized in a document type definition (DTD) according to the SGML standard. It describes, for example, the set of all possible chapters of a textbook.

The given DTD must then be translated by hand into another formalism called the document description. The document description was considered the more appropriate generic formalism for the OSCAR-II recognition approach. In Section 3.2 we explain how a DTD can be translated correctly into a document description.

The document description is a human-readable and editable ASCII file. To be processable by the analyzer, it has to be transformed into a finite state automaton in a machine-readable binary form.

The main data stream that leads to the analyzer is the actual document image to be processed. A printed document is scanned and transformed to a digital image format. Three different processes extract OCR, OFR and segmentation information from the image. The OCR (optical character recognition) process recognizes the pure textual information consisting of letters, digits and special characters. The OFR (optical font recognition) process recognizes font information of the text like font family, font size and font style. Finally, the segmentation process recognizes the position in the document and the shape of the envelope of entire text blocks, text lines, figures etc.


  
Figure 1: Architecture of the document recognition and related modules.

These intermediate results are then passed to the analyzer, which is parameterized by a document description represented as an automaton. In brief, the analyzer is an error-tolerant parser that scans the OCR data while simultaneously taking into account the OFR and segmentation data. This matching of the actual document against a document description results in the specific logical structure of the actual document. Typically, the specific structure is represented as a tree where the nodes correspond to nonterminal symbols of the document description and the leaves correspond to the actual characters of the recognized text.

The specific document structure can then be postprocessed in two different ways: either its tree structure is visualized on the screen, or it is transformed into an ASCII text document enriched with SGML tags. As the SGML tags have to correspond to the DTD, a relation between the document description and the SGML entities has to be established. This relation is defined in the translation table, which parameterizes the translator from specific structure to SGML. The translation table has to be created by hand. The relations between DTD, document description and translation table, as well as the SGML generator algorithm, are documented in detail in Section 3, and the editing of a translation table in Section 4.

2.1.2 The generic structure used by the document recognition module

As explained in the previous section, the document recognition module (more precisely, the analyzer) is parameterized by a generic description of the document class to be recognized. On the one hand, it defines the entities that may exist in a document and how they can be arranged logically. On the other hand, it defines the physical appearance of these entities or, in other words, how they are typeset in a printed document. These two aspects of a document class description are called the generic logical structure and the generic physical structure.

The formalism used to describe the generic logical structure was proposed by Ingold [3]. It is a grammar expressed in an EBNF-like notation (refer to Figure 2). A nonterminal symbol can be composed of other symbols using lists $s_1, s_2, \ldots, s_n$, iterations $\{ s\}$, alternatives $s_1\mid s_2\mid \ldots \mid s_n$ or options $[s]$. Parentheses can be used to group subexpressions in an expression that would otherwise be ambiguous.


  
Figure 2: Definition of the generic logical structure using grammatical rules.

Hipsix: DOC => Chapter;
Chapter: PRT => ChapTitle {SectOne};
ChapTitle: FRG => LevelOneNum {Word};
LevelOneNum: STR => FDigit Period;
Word: STR => {Letter};
SectOne: PRT => SectOneTitle SectOneCont;
SectOneTitle: FRG => LevelTwoNum  {Word};
SectOneCont: PRT => MainText | {SectTwo};




The generic physical structure is described by attributes associated to the symbols of the above grammar. As can be seen in Figure 3 the attributes specify typographical parameters like alignment, font, line height, line distance, margins etc.


  
Figure 3: Definition of the generic physical structure using attributes.

  ChapTitle.zone = main;
  ChapTitle.alignment = (Allowed, Leftadjusted, [ -3 pt, 3 pt],
		      [-3 pt, 3 pt], [-3 pt, 3 pt]);
  ChapTitle.lineHeight = [18pt, 18pt];
  ChapTitle.spaceBefore = (Obligatory, [0 pt, 100pt]);
  ChapTitle.interSpace = (Forbidden, [5 pt, 25 pt]);
  ChapTitle.font = (Times, 18pt, Bold, Roman);

  SectOne.zone = main; 
  SectOne.alignment = (Allowed, Justified, [-3 pt, 3 pt],
		      [-3 pt, 3 pt], [-3 pt, 3 pt]);
  SectOne.lineHeight = [10 pt, 10pt];
  SectOne.spaceBefore = (Allowed, [0 pt, 50pt]);
  SectOne.interSpace = (Allowed, [0 pt, 4 pt]);
  SectOne.font = (Times, 10pt, Normal, Roman);




Because an EBNF-like notation has been adopted, context-free languages could in principle be described with it. However, by limiting the document description to non-recursive production rules, it can also be formalized by regular expressions. That the document description covers only regular languages is an important property, because it guarantees that an automaton can be constructed for the parsing process.
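
The non-recursiveness requirement can be checked mechanically. The following Python sketch is not part of OSCAR-II; it assumes rules written as in Figure 2 (Name: TYPE => body;) and treats every identifier in a rule body as a reference to another symbol. The helper names parse_rules and find_recursion are hypothetical and serve only as an illustration.

```python
import re

FIGURE_2 = """
Hipsix: DOC => Chapter;
Chapter: PRT => ChapTitle {SectOne};
ChapTitle: FRG => LevelOneNum {Word};
LevelOneNum: STR => FDigit Period;
Word: STR => {Letter};
SectOne: PRT => SectOneTitle SectOneCont;
SectOneTitle: FRG => LevelTwoNum  {Word};
SectOneCont: PRT => MainText | {SectTwo};
"""

# Assumed rule shape: "Name: TYPE => body;" (as in Figure 2).
RULE = re.compile(r"(\w+)\s*:\s*\w+\s*=>\s*([^;]*);")
IDENT = re.compile(r"[A-Za-z]\w*")


def parse_rules(text):
    """Map each defined symbol to the set of symbols its body mentions."""
    return {name: set(IDENT.findall(body)) for name, body in RULE.findall(text)}


def find_recursion(rules):
    """Return a list of symbols forming a cycle, or None if there is none."""
    state = {}  # symbol -> "active" (on the current path) or "done"

    def visit(symbol, path):
        if state.get(symbol) == "done":
            return None
        if state.get(symbol) == "active":
            # The symbol is reachable from itself: report the cycle.
            return path[path.index(symbol):] + [symbol]
        state[symbol] = "active"
        for child in sorted(rules.get(symbol, ())):
            cycle = visit(child, path + [symbol])
            if cycle:
                return cycle
        state[symbol] = "done"
        return None

    for symbol in sorted(rules):
        cycle = visit(symbol, [])
        if cycle:
            return cycle
    return None


rules = parse_rules(FIGURE_2)
print(find_recursion(rules))  # no cycle is expected for the rules of Figure 2
```

A description that passes this check can be expanded into a single regular expression by repeatedly substituting rule bodies for symbols, which is what makes the construction of an automaton possible.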

  
3. Generation of SGML tagged Text

3.1 Quick Introduction to SGML

The Standard Generalized Markup Language (SGML) is a widely adopted document description standard [4]. In the SGML approach, the ASCII text of a specific document is logically structured by ASCII markups. The markups themselves and the way they can be arranged are defined in the document type definition (DTD), which corresponds to the generic document structure of a document class. Figure 4 shows an example of the DTD for a very simple document class and an instance of it.


  
Figure 4: Comparison of an SGML DTD and an instance of it.
SGML DTD (first listing below): generic logical structure, describing a class of documents.
SGML data (second listing below): specific logical structure, describing one specific document.


<!DOCTYPE simple [
<!ELEMENT doc - - (doctit, maintext)>
<!ELEMENT doctit - - (#PCDATA) >
<!ELEMENT maintext - - (tit, parag*)* >
<!ELEMENT tit - - (#PCDATA) >
<!ELEMENT parag - - (#PCDATA) >
<!ENTITY kwdoc "DOCUMENT:" >
]>






<doc>
<doctit>&kwdoc A sample document</doctit>
<maintext>
<tit>Introduction</tit>
<parag>
In this text we would like ...
</parag>
<parag> ... </parag>
...
</maintext>
</doc>




SGML defines three types of markups:

Elements:
Elements are used to define the logical structure of a document.
Attributes:
Attributes are objects that are attached to elements and contain an attribute value. Thus attributes are always used when an element should be enriched by additional information. Examples are hypertext references where the attribute is a pointer to a text location or inclusion of a figure in a text where the attribute contains the filename of the figure to be included.
Entities:
Entities are symbolic names for any type of data. Examples are keywords like ``&fig'' that may symbolize the string ``Figure'' or special characters like ``&eacute;'' defining ``é''.

   
3.2 Dependencies between DTD, Document Description and Translation Table

In Section 2.1.1 we pointed out that the document recognition module is based on a proprietary document description. Therefore, the DTD has to be translated into the document description format. Implementing this translation would be very difficult because the two formalisms differ in expressiveness, so we decided to do the translation by hand. This has two consequences: first, only a subset of the expressiveness of SGML can be used for the DTD, and second, all information used by the document recognition module that is not defined in the DTD must be added by the user.

The document description is then passed to the recognition module. Its output is a tree structure where the nodes correspond to the nonterminal symbols of the document description and where the leaves contain the actual text information. To transform this structure into an SGML text, the node labels (which are identical to the identifiers of the nonterminal symbols) have to be translated to the respective SGML identifiers. This is what the translation table is used for: it lists all document description identifiers that can appear in a specific document and associates with each of them the corresponding SGML identifier (Figure 5).


  
Figure 5: Architecture of the SGML generator and its dependence on the DTD, the document description and the translation table.

For the SGML text generator and the document recognizer to work correctly, several restrictions on the DTD and the document description have to be respected:

Non recursiveness:
In the document recognition module the generic document structure is represented as an automaton. Therefore the DTD (and consequently the document description) should only contain structures that could also be formalized as regular expressions. This forbids the use of recursive structures in the DTD.
No SGML attributes:
The document recognizer cannot generate attributes for logical elements. Therefore, it is not forbidden but useless to define attributes in a DTD.
L(DTD) $\subseteq$ L(doc descr):
The document description must at least cover the language defined by the DTD. In general, the document description is more refined and more precise than the DTD in order to enable adequate document recognition. In such a case, the SGML text generator translates a specific document by mapping several document description identifiers to one SGML identifier where necessary.

3.3 Generating SGML Text

In this section we will explain in detail the algorithm of the SGML text generator. It consists of two parts: first, the SGML text is created and second, some simple formatting operations are applied on it, leading to the marked up ASCII file.

SGML formatted documents contain pure text that is structured by three types of markups -- elements, entities and attributes. As the document recognizer cannot generate attributed logical elements it is not possible to generate SGML attributes. Thus the SGML text generator must be able to create elements, entities and the characters of the text. The main principle is to traverse the document tree in a depth first manner. Whenever a node is encountered the appropriate character or SGML tag is created (Figure 6).


  
Figure 6: Traversing the document tree (specific logical structure of a document). The tree refers to the example document from Figure 4. Note that the node identifiers have changed with respect to the SGML tag identifiers.

Translation of elements:
SGML elements correspond to nodes in the document tree that are neither leaves nor entities (see ``Translation of entities''). When an element identifier has been encountered its corresponding SGML identifier is looked up in the translation table. For an element node two SGML tags have to be created -- on node entry an opening tag (e.g. <paragraph>) and on node exit a closing tag (e.g. </paragraph>). When a node is encountered its corresponding SGML tag is written directly to the result file.
Translation of entities:
SGML entities correspond to nodes in the document tree that are not leaves and whose document description identifier corresponds to an SGML identifier that begins with an ampersand character ``&''. As entities are mainly used as symbols for keywords or special signs, their SGML tag is printed only once to the result file, on node entry. Additionally, any output to the result file is inhibited for the whole subtree of an entity node.
Translation of characters:
All leaves in a document tree correspond to characters of the actual text. Whenever a character node is encountered, it is printed directly to the result file. Exception: A character that is part of an entity node is not printed to the result output, because it is already represented by the entity tag.
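
The three translation rules above can be sketched as a recursive depth-first traversal. The following Python code is an illustration only and not the Ada implementation of the generator. The Node class, the dictionary form of the translation table and the example tree are assumptions made for this sketch. It also assumes that a node whose identifier has no entry in the table produces no tag while its subtree is still traversed. Word separation and line breaking (Section 3.4) are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                   # identifier, or the character of a leaf
    children: list = field(default_factory=list)
    is_char: bool = False                        # True for leaves carrying one character


def generate(node, table, out):
    """Depth-first traversal that appends SGML fragments to the list `out`."""
    if node.is_char:
        out.append(node.label)                   # characters are copied to the output
        return
    sgml = table.get(node.label)                 # None: identifier is not translated
    if sgml is not None and sgml.startswith("&"):
        out.append(sgml)                         # entity: printed once, subtree suppressed
        return
    if sgml is not None:
        out.append(f"<{sgml}>")                  # opening tag on node entry
    for child in node.children:
        generate(child, table, out)
    if sgml is not None:
        out.append(f"</{sgml}>")                 # closing tag on node exit


def chars(text):
    """Build one character leaf per character of `text`."""
    return [Node(c, is_char=True) for c in text]


# Hypothetical tree and table, loosely modelled on the document class of Figure 4.
tree = Node("Document", [
    Node("Title", [Node("Keyword", chars("DOCUMENT:"))]),
    Node("Body", [Node("Word", chars("Hello"))]),
])
table = {"Document": "doc", "Title": "doctit", "Keyword": "&kwdoc", "Body": "maintext"}

out = []
generate(tree, table, out)
result = "".join(out)
print(result)
```

Note that the characters below the Keyword node do not appear in the output, because the entity tag already represents them.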

3.4 SGML Text Formatting

The following formatting operations are applied while the SGML text generation proceeds:

Word separation:
The specific document structure has no notion of space characters. Words are delimited by a logical entity ``string'' that contains the word's characters. Therefore extra space characters have to be introduced to generate correct SGML text. Thus, on every string node entry, a space character is written to the result stream.
Line length limitation:
An ASCII file should not have text lines longer than about 80-100 characters; otherwise it becomes difficult to handle with standard tools like text editors. Therefore, the current column position of the output is held in an internal counter and updated on every write to the result file. Whenever the column counter exceeds a certain threshold on a node entry, a new line is started. Line breaking is inhibited on character nodes in order not to split words across lines.
Element pretty printing:
Every element or entity tag can optionally be preceded or followed by a newline character. The pretty print option is specified in the translation table for every opening and closing tag (see Section 4.1.1).
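
The formatting operations above can be sketched with a small output class that tracks the column position. This Python code is an illustration under assumptions, not the original implementation: the class name LineWriter, its methods and the threshold value are hypothetical. As a simplification, a line break started on a string node entry stands in for the word-separating space, a detail the description above does not fix.

```python
class LineWriter:
    """Collects output fragments and tracks the current column position."""

    def __init__(self, max_col=80):
        self.max_col = max_col
        self.parts = []
        self.col = 0

    def write(self, text):
        """Append text and update the column counter."""
        self.parts.append(text)
        if "\n" in text:
            self.col = len(text) - text.rfind("\n") - 1
        else:
            self.col += len(text)

    def node_entry(self, is_char, is_string):
        """Apply the entry-time formatting rules for one node."""
        if is_char:
            return                       # never break inside a word
        if self.col > self.max_col:
            self.write("\n")             # line length limitation
        elif is_string:
            self.write(" ")              # word separation

    def tag(self, text, newline_before=False, newline_after=False):
        """Write a tag, optionally surrounded by line breaks (pretty printing)."""
        if newline_before:
            self.write("\n")
        self.write(text)
        if newline_after:
            self.write("\n")

    def result(self):
        return "".join(self.parts)


writer = LineWriter(max_col=10)          # small threshold to make the effect visible
for word in ["alpha", "beta", "gamma", "delta"]:
    writer.node_entry(is_char=False, is_string=True)
    for ch in word:
        writer.node_entry(is_char=True, is_string=False)
        writer.write(ch)
print(writer.result())
```

Because the threshold is only tested on node entry, a line may run past it by the length of one word, which matches the approximate limit of 80-100 characters mentioned above.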

   
4. Documentation of the SGML Text Generator

4.1 User Documentation

4.1.1 Syntax and Semantics of the Translation Table

The translation table is an ASCII file that can be edited with standard text editors. It is interpreted line by line and distinguishes four line types:

Comment line:
The first character of the line is a sharp sign ``#'' followed by an arbitrary number of characters. The line will be ignored entirely.
Empty line:
The line contains only space or tabulator signs or no character at all. The line will be ignored entirely.
Element translation:
Used to translate a document description identifier to an SGML element tag. The first string is interpreted as the document description identifier, the second as the SGML identifier. Formatting instructions may optionally follow (see below).
Entity translation:
Used to translate a document description identifier to an SGML entity tag. The first string is interpreted as the document description identifier, the second as the SGML identifier. Entity translations are distinguished from element translations by the first character of the SGML identifier -- it must be an ampersand character ``&'' in the case of an entity translation and any letter otherwise. Formatting instructions may optionally follow (see below).

A formatting instruction is one of the keywords bs, as, be or ae. They are used to insert line breaks in the resulting SGML text before an opening tag (bs), after an opening tag (as), before a closing tag (be) or after a closing tag (ae). As SGML entities are only generated on node entry, the instructions be and ae are ignored for them. Formatting instructions may be listed in any order.

Note that all identifiers are interpreted in a case-sensitive way. The complete syntax of the translation table is listed in Figure 7. An example can be found in Appendix A.4.


  
Figure 7: Syntax of translation tables.


TransTab ::= {TransLine <CR>}
TransLine ::= Comment | Empty | ElementTrans | EntityTrans
Comment ::= '#'{Char}
Empty ::= {<TAB> | ' '}
ElementTrans ::= Ident Ident FormInstr
EntityTrans ::= Ident '&'Ident FormInstr
FormInstr ::= {'bs' | 'as' | 'be' | 'ae'}
Ident ::= Letter {Letter | Digit}
Letter ::= 'a' | .. | 'z' | 'A' | .. | 'Z'
Digit ::= '0' | .. | '9'
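
A reader for this syntax can be sketched in a few lines of Python. The code below is an illustration, not the read_transl_params routine of the generator. It assumes that the fields of a translation line are separated by spaces or tabulators, which the grammar leaves implicit. The function name, the result layout and the sample table are hypothetical; the sample is not the table of the appendix.

```python
import re

# Ident, optional '&', Ident, then any number of formatting instructions.
LINE = re.compile(
    r"([A-Za-z][A-Za-z0-9]*)\s+(&?)([A-Za-z][A-Za-z0-9]*)"
    r"((?:\s+(?:bs|as|be|ae))*)\s*"
)


def parse_translation_table(text):
    """Return {doc_ident: (sgml_ident, is_entity, set_of_format_instructions)}."""
    table = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("#") or not line.strip():
            continue                                  # comment line or empty line
        match = LINE.fullmatch(line)
        if match is None:
            raise ValueError(f"line {number}: not a valid translation line")
        ident, amp, sgml, instr = match.groups()
        if ident in table:
            raise ValueError(f"line {number}: {ident} is translated twice")
        table[ident] = (sgml, amp == "&", set(instr.split()))
    return table


# Hypothetical table for the document class of Figure 4.
SAMPLE = """\
# document description identifier, SGML identifier, formatting instructions
Document   doc      as ae
Title      doctit   ae

Keyword    &kwdoc
Paragraph  parag    bs ae
"""

table = parse_translation_table(SAMPLE)
for ident, entry in table.items():
    print(ident, entry)
```

The sketch rejects a document description identifier that occurs in two translation lines, since each identifier can be mapped to only one SGML identifier.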



   
4.1.2 Construction of the Translation Table

It should be easy to edit a translation table for a DTD and document description when the restrictions discussed in Section 3.2 have been taken into account.

The first step is to list all document description identifiers (all nonterminal symbols of the document description). Then each document description identifier is associated with the semantically corresponding SGML element or entity. Several document description identifiers may be associated with one SGML identifier, but not vice versa. For this reason a document description identifier must not appear in more than one translation line. If a document description identifier need not be translated, the corresponding translation line can simply be omitted.
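
The two steps can be supported by a small script. The following Python sketch is an assumption-laden illustration and not part of the OSCAR-II prototype. It takes the nonterminal symbols to be the identifiers on the left-hand side of the rules (types DOC, PRT, FRG and STR, as in the document description of the appendix). The function names are hypothetical, and the SGML identifiers in the example lines are chosen only for illustration.

```python
import re

# Left-hand side of a rule: "Name: TYPE =>".
RULE_HEAD = re.compile(r"^\s*([A-Za-z]\w*)\s*:\s*(DOC|PRT|FRG|STR)\s*=>", re.MULTILINE)


def list_identifiers(description):
    """Return the rule-defined identifiers of a document description, in order."""
    seen = []
    for name, _kind in RULE_HEAD.findall(description):
        if name not in seen:
            seen.append(name)
    return seen


def check_table(identifiers, table_lines):
    """Report identifiers translated more than once and those left untranslated."""
    counts = {}
    for line in table_lines:
        if line.startswith("#") or not line.strip():
            continue                              # comment line or empty line
        name = line.split()[0]
        counts[name] = counts.get(name, 0) + 1
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    untranslated = [name for name in identifiers if name not in counts]
    return duplicates, untranslated


description = """
Hipsix: DOC => Chapter;
Chapter: PRT => ChapTitle {SectOne};
ChapTitle: FRG => LevelOneNum {Word};
"""
identifiers = list_identifiers(description)

# Hypothetical translation lines; the last one repeats an identifier on purpose.
lines = ["# demo", "Chapter chap", "ChapTitle chapti", "Chapter chap"]
duplicates, untranslated = check_table(identifiers, lines)
print("translated twice:", duplicates)
print("not translated:", untranslated)
```

An untranslated identifier is not an error, since its translation line may be omitted on purpose; the list merely helps to spot identifiers that were forgotten.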

For an example refer to the translation table in Appendix A.4 which corresponds to the DTD in Appendix A.2 and the document description in Appendix A.3.

4.1.3 Generating SGML Text from within the OSCAR-II Prototype

The SGML text generator has been integrated into the OSCAR-II prototype, which was realized by Hu [2]. Its usage is documented in detail in [1]. All the examples in the appendix have been created with this prototype.

4.2 System Documentation

The SGML text generator has been implemented in Ada and consists of one package. The interface (Appendix B) exports two routines: read_transl_params, which reads a translation table from an ASCII file, and write_sgml, which traverses the document tree, translates the identifiers and prints the SGML result to the output file.

Bibliography

[1] R. Brugger. Manual of the OSCAR-II prototype. Rapport interne, IIUF-Université de Fribourg, October 1994.

[2] T. Hu. New methods for robust and efficient recognition of the logical structures in documents. PhD thesis, IIUF-Université de Fribourg, 1994. Thesis no. 1076.

[3] R. Ingold. A document description language to drive document analysis. In G. Lorette, editor, Proceedings of the First International Conference on Document Analysis and Recognition, pages 294-301. AFCET-IRISA, Saint-Malo, September 1991.

[4] E. Van Herwijnen. Practical SGML. Kluwer Academic Publishers, 1990.

[5] M. Wendtland, R. Ingold, C. Vanoirbeek, and E. Forte. Hipocampe: Towards learner sensitive, context optimized interactive CAI. In E. N. Forte, editor, Proceedings of the International Conference on Computer Aided Learning and Instruction in Science and Engineering, pages 233-240. EPFL, Lausanne, September 1991.

  
5. An Example: First page of chapter ``Thermodynamique''

  
5.1 Document to be recognized

[Figure: image of the document page to be recognized (fig_hipsix.eps).]

  
5.2 The SGML Document Type Definition DTD

Version of 4 September 1992. By Daniel Wagner (LITH).

<!DOCTYPE  Book
 [<!ENTITY title           ""    --                                                   -->
  <!ENTITY digit           ""    --0,1,2,3,4,5,6,7,8,9                                -->
  <!ENTITY period          ""    --                                                   -->
  <!ENTITY word            ""    --                                                   -->
  <!ENTITY chemelmt        ""    --exemple: C4H12O6; NaCl,4(H2O); Na(+); Cl(-); Cu(2+)-->
  <!ENTITY chemform        ""    --        2(NaCl)  + 8(H2O) --> 2(NaCl,4(H2O))       -->
  <!ENTITY mathequ         ""    --                         bitmap                    -->
  <!ENTITY chapnum         ""    --            (%digit)+,  %period, (%digit)+         -->
  <!ENTITY superscript     ""    --        reference a une note en bas de page        -->
  <!ENTITY bigperiod       ""    --  .                                                -->
  <!ENTITY openquot        ""    --  ``                                               -->
  <!ENTITY punct           ""    --  toute autre ponctuation                          -->
  <!ENTITY closquot        ""    --  ''                                               -->
  <!ENTITY kwdexp          "Experience"                                            -- -->
  <!ENTITY bitmap          ""                                                      -- -->
  <!ENTITY kwdtabl         "Table"                                                 -- -->
  <!ENTITY kwdfig          "Fig."                                                  -- -->

  <!ELEMENT  chap     - -  (chapti,   (maintext)?,  sect+)                              >
  <!ELEMENT  chapti   - -  (chapnu,   %title)                                           >
  <!ELEMENT  chapnu   - -  ((%digit)+,   %period)                                       >

  <!ELEMENT  maintext - -  ((parabeg | parafol | exp | tabl | fig | topic)+)            >

  <!ELEMENT  parabeg  - -  ((%word | %punct | ref  | %chemelmt  | list | %chemform |    >
  <!ELEMENT  ref      - -  (%chapnum |  %superscript)                                   >
  <!ELEMENT  list     - -  (%bigperiod, listitem)                                       >
  <!ELEMENT  listitem - -  ((%word | %punct | ref  | %chemelmt  | %chemform |           >
  <!ELEMENT  keyexpr  - -  ((%word)+)                                                   >
  <!ELEMENT  keysent  - -  ((%word | %punct | ref  | %chemelmt | %mathequ)+)            >
  <!ELEMENT  quotat   - -  (%openquot, (%word | %punct)+, %closquot)                    >

  <!ELEMENT  parafol  - -  ((%word | %punct | ref | %chemelmt | list | %chemform |
                            %mathequ | keyexpr | keysent | quotation)+)                 >

  <!ELEMENT  exp      - -  (expti, artwork?, expdesc)                                   >
  <!ELEMENT  expti    - -  (%kwdexp, expnu, %title)                                     >
  <!ELEMENT  expnu    - -  ((%digit)+, %period, (%digit)+)                              >
  <!ELEMENT  artwork  - -  (%bitmap)                                                    >
  <!ELEMENT  expdesc  - -  ((parabeg, parafol)+)                                        >

  <!ELEMENT  tabl     - -  (tablti, tablcont)                                           >
  <!ELEMENT  tablti   - -  (%kwdtabl, tablnu, %title)                                   >
  <!ELEMENT  tablnu   - -  ((%digit)+, %period, (%digit)+)                              >
  <!ELEMENT  tablcont - -  (%bitmap)                                                    >

  <!ELEMENT  fig      - -  (artwork, figti)                                             >
  <!ELEMENT  figti    - -  (%kwdfig, fignu, %title)                                     >
  <!ELEMENT  fignu    - -  ((%digit)+, %period, (%digit)+)                              >

  <!ELEMENT  topic    - -  (topicti, (maintext)+)                                       >
  <!ELEMENT  topicti  - -  (%title)                                                     >

  <!ELEMENT  sect     - -  (sectti, (maintext)*, susect*)                               >
  <!ELEMENT  sectti   - -  (sectnu, %title)                                             >
  <!ELEMENT  sectnu   - -  ((%digit)+, %period, (%digit)+)                              >

  <!ELEMENT  susect   - -  (susectti, (maintext)*)                                      >
  <!ELEMENT  susectti - -  (susectnu, %title)                                           >
  <!ELEMENT  susectnu - -  ((%digit)+, %period, (%digit)+, %period, (%digit)+)          >
]>

  
5.3 The Document Description

Hipsix: DOC => Chapter;

Chapter: PRT => ChapTitle {SectOne};

  ChapTitle.zone = main;
  ChapTitle.alignment = (Allowed, Leftadjusted, [ -3 pt, 3 pt],
		      [-3 pt, 3 pt], [-3 pt, 3 pt]);
  ChapTitle.lineHeight = [18pt, 18pt];
  ChapTitle.spaceBefore = (Obligatory, [0 pt, 100pt]);
  ChapTitle.interSpace = (Forbidden, [5 pt, 25 pt]);
  ChapTitle.font = (Times, 18pt, Bold, Roman);

  SectOne.zone = main; 
  SectOne.alignment = (Allowed, Justified, [-3 pt, 3 pt],
		      [-3 pt, 3 pt], [-3 pt, 3 pt]);
  SectOne.lineHeight = [10 pt, 10pt];
  SectOne.spaceBefore = (Allowed, [0 pt, 50pt]);
  SectOne.interSpace = (Allowed, [0 pt, 4 pt]);
  SectOne.font = (Times, 10pt, Normal, Roman);

ChapTitle: FRG => LevelOneNum {Word};
  ChapTitle.separBefore = (Allowed, [7 pt, 30 pt]);

LevelOneNum: STR => FDigit Period;

  FDigit.cand = {"0" | "1"| "2"| "3"| "4"| "5"| "6"| "7"| "8"| "9"};
  FDigit.font = (@, @, @, @);
  FDigit.separBefore = (@, [@, @]);

  Period.cand = {"."};
  Period.font = (@, @, @, @);
  Period.separBefore = (Forbidden, [0 pt, 2 pt]);

RefOne: STR => FDigit Period;

Word: STR => {Letter};

  Letter.cand = {"A"|"B"|"C"|"D"|"E"|"F"|"G"|"H"|"I"|"J"|"K"|"L"|
	         "M"|"N"|"O"|"P"|"Q"|"R"|"S"|"T"|"U"|"V"|"W"|"X"|"Y"|"Z"|
	         "a"|"b"|"c"|"d"|"e"|"f"|"g"|"h"|"i"|"j"|"k"|"l"|
	         "m"|"n"|"o"|"p"|"q"|"r"|"s"|"t"|"u"|"v"|"w"|"x"|"y"|"z"|
		 "a^"|"e^"|"i^"|"o^"|"u^"|"e/"|"e:"|"i:"|"a\"|"e\"|"u\"|"c,"};
  Letter.font = (@, @, @, @);
  Letter.separBefore = <FST: (@, [@, @]), 
                        NXT: (Forbidden, [0 pt, 2 pt])>;
  
GWord: STR => {GLetter};

  GLetter.cand = {"A"|"B"|"C"|"D"|"E"|"F"|"G"|"H"|"I"|"J"|"K"|"L"|
	         "M"|"N"|"O"|"P"|"Q"|"R"|"S"|"T"|"U"|"V"|"W"|"X"|"Y"|"Z"|
	         "a"|"b"|"c"|"d"|"e"|"f"|"g"|"h"|"i"|"j"|"k"|"l"|
	         "m"|"n"|"o"|"p"|"q"|"r"|"s"|"t"|"u"|"v"|"w"|"x"|"y"|"z"|
		 "a^"|"e^"|"i^"|"o^"|"u^"|"e/"|"e:"|"i:"|"a\"|"e\"|"u\"|"c,"};
  GLetter.font = (@, @, @, @);
  GLetter.separBefore = <FST: (Forbidden, [0 pt, 2 pt]),
			 NXT: (Forbidden, [0 pt, 2 pt])>;
  
ComWord: STR => {Letter} ((Connection {GLetter})|
			(Break BLetter {GLetter}));
  
  Connection.cand = {"'"|"/"};
  Connection.font = (@, @, @, @);
  Connection.separBefore = (Forbidden, [0pt, 2pt]);

  Break.cand = {"-"};
  Break.font = (@, @, @, @);
  Break.separBefore = (Forbidden, [0pt, 2pt]);

  BLetter.cand = {"A"|"B"|"C"|"D"|"E"|"F"|"G"|"H"|"I"|"J"|"K"|"L"|
	         "M"|"N"|"O"|"P"|"Q"|"R"|"S"|"T"|"U"|"V"|"W"|"X"|"Y"|"Z"|
	         "a"|"b"|"c"|"d"|"e"|"f"|"g"|"h"|"i"|"j"|"k"|"l"|
	         "m"|"n"|"o"|"p"|"q"|"r"|"s"|"t"|"u"|"v"|"w"|"x"|"y"|"z"|
		 "a^"|"e^"|"i^"|"o^"|"u^"|"e/"|"e:"|"i:"|"a\"|"e\"|"u\"|"c,"};
  BLetter.font = (@, @, @, @);
  BLetter.separBefore = (Obligatory, [1 pt, 30 pt]);
  
MainText: PRT => {(ParaSta [{(Experience | NonText| Table | 
                  ItemFST | ItemFol) [ParaSpe]}]) | 
                  (NonText ItemFol) | TopicTitle};

  Experience.alignment = (Allowed, Justified, [11 pt, 17 pt],
		       [11 pt, 17 pt], [-3 pt, 3 pt]);
  Experience.font = (Times, 8 pt, Normal, Roman);

  ParaSta.zone = @;
  ParaSta.alignment = (@, @, [@, @], [@, @], [11 pt, 17 pt]);
  ParaSta.lineHeight = [10pt, 10pt];
  ParaSta.spaceBefore = (Allowed, [1 pt, 50pt]);
  ParaSta.interSpace = (Allowed, [1 pt, 4 pt]);
  ParaSta.font = (Times, 10pt, Normal, Roman);

  ParaSpe.zone = @;
  ParaSpe.alignment = (@, @, [@, @], [@, @], [@, @]);
  ParaSpe.lineHeight = [10pt, 10pt];
  ParaSpe.spaceBefore = (Allowed, [1 pt, 50 pt]);
  ParaSpe.interSpace = (Allowed, [1 pt, 4 pt]);
  ParaSpe.font = (Times, 10pt, Normal, Roman);

  ItemFST.zone = main;
  ItemFST.alignment = (Allowed, Justified, [22 pt, 28 pt], 
			   [-3 pt, 3 pt], [-6pt, -12pt]);
  ItemFST.lineHeight = [10 pt, 10 pt];
  ItemFST.spaceBefore = (Allowed, [2 pt, 20pt]);
  ItemFST.interSpace = (Allowed, [1 pt, 4 pt]);

  ItemFST.font = (Times, 10 pt, Normal, Roman);

  ItemFol.zone = main;
  ItemFol.alignment = (Allowed, Justified, [22 pt, 28 pt], 
                       [-3 pt, 3 pt], [-3 pt, 3 pt]);
  ItemFol.lineHeight = [10 pt, 10 pt];
  ItemFol.spaceBefore = (Allowed, [20 pt, 50pt]);
  ItemFol.interSpace = (Allowed, [1 pt, 4 pt]);

  ItemFol.font = (Times, 10 pt, Normal, Roman);

  NonText.zone = @;
  NonText.alignment = (Obligatory, @, [100 pt, 100 pt], 
                       [100 pt, 100 pt], [100 pt, 100 pt]);
  NonText.lineHeight = [1000 pt, 1000 pt];
  NonText.spaceBefore = (Obligatory, [100 pt, 100 pt]);
  NonText.interSpace = (Obligatory, [100 pt, 100 pt]);
  NonText.font = (Times, 10pt, Normal, Roman);

NonText: FRG => {Letter} ;
  NonText.separBefore = (Obligatory, [100 pt, 100pt]);  

ParaSta: FRG => ParaData ;
  ParaSta.separBefore = (Allowed, [3pt, 20pt]);  

ParaSpe: FRG => ParaData;
  ParaSpe.separBefore = (Allowed, [3pt, 20pt]);  

ParaData: STR => { Word | ComWord | Number | ComNumber | RefOne | 
		   RefTwo | RefThree | Punction | KeyExpr | KeySent | 
		   MathOpe | MixWord | Phrase };

  Punction.cand = {","|"."|"?"|":"|";"};
  Punction.font = (@, @, @, @);
  Punction.separBefore = <FST: (Forbidden, [0pt, 2pt]),
			  NXT: (Forbidden, [0pt, 2pt])>;

  MathOpe.cand = {"+"|"-"|"*"|"/"|"="};
  MathOpe.font = (@, @, @, @);
  MathOpe.separBefore = <FST: (Forbidden, [2 pt, 15 pt]), 
			 NXT: (Forbidden, [2 pt, 15 pt])>;

Number: STR => {Digit};

  Digit.cand = {"0" | "1"| "2"| "3"| "4"| "5"| "6"| "7"| "8"| "9"};
  Digit.font = (@, @, @, @);
  Digit.separBefore = <FST: (@, [@, @]), 
                       NXT: (Forbidden, [0 pt, 2 pt])>;

GNumber: STR => {GDigit};

  GDigit.cand = {"0" | "1"| "2"| "3"| "4"| "5"| "6"| "7"| "8"| "9"};
  GDigit.font = (@, @, @, @);
  GDigit.separBefore = <FST: (Forbidden, [0 pt, 2 pt]), 
			NXT: (Forbidden, [0 pt, 2 pt])>;

ComNumber: STR => {Digit} Comma {GDigit};

  Comma.cand = {","};
  Comma.font = (@, @, @, @);
  Comma.separBefore = (Forbidden, [0 pt, 2 pt]);

LevelTwoNum: STR => FDigit Period GDigit;

RefTwo: STR => FDigit Period GDigit;

LevelThreeNum: STR => FDigit Period GDigit Period GDigit;

RefThree: STR => FDigit Period GDigit Period GDigit;

KeyExpr: STR => {Word};

  KeyExpr.font = (@, @, Bold , Italic);

KeySent: STR => {Word};

  KeySent.font = (@, @, Normal , Italic);

MixWord: STR => {Letter} (LeftPar|GDigit) 
		[{GLetter|LeftPar|GDigit|RightPar|Plus}];

  LeftPar.cand = {"("};
  LeftPar.font = (@, @, @, @);
  LeftPar.separBefore = <FST: (Forbidden, [0pt, 2pt]),
			 NXT: (Forbidden, [0pt, 2pt])>;

  RightPar.cand = {")"};
  RightPar.font = (@, @, @, @);
  RightPar.separBefore = <FST: (Forbidden, [0pt, 2pt]),
			  NXT: (Forbidden, [0pt, 2pt])>;

  Plus.cand = {"+"};
  Plus.font = (@, @, @, @);
  Plus.separBefore = <FST: (Forbidden, [0pt, 2pt]),
			NXT: (Forbidden, [0pt, 2pt])>;

Phrase: STR => PhraseBegin {Word | Punction | Number | MixWord} 
	       PhraseEnd;

  PhraseBegin.cand = {"("|"-"};
  PhraseBegin.font = (@, @, @, @);
  PhraseBegin.separBefore = (@, [@, @]); 
  \ Allowed, 2, 25 \

  PhraseEnd.cand = {")"|"-"};
  PhraseEnd.font = (@, @, @, @);
  PhraseEnd.separBefore = (Forbidden, [0pt, 2pt]);

Experience: PRT => ExpeTitle [NonText] ExpeDesc;

  ExpeTitle.zone = @;
  ExpeTitle.alignment = (@, Centered, [11 pt, 17 pt], 
                         [11 pt, 17 pt], [-3 pt, 3 pt]);
  ExpeTitle.lineHeight = [8 pt, 8 pt];
  ExpeTitle.spaceBefore = (Allowed, [5 pt, 50pt]);
  ExpeTitle.interSpace = (Forbidden, [0 pt, 3 pt]);
  ExpeTitle.font = (Times, 8pt, Normal, Roman);

  ExpeTitle.separBefore = (Allowed, [2pt, 15 pt]);  

ExpeTitle: FRG => ExpeTitleKey ExpeTitleNum ExpeTitleText;
  ExpeTitleKey.font = (Times, 8 pt, Bold, Roman);
  ExpeTitleNum.font = (Times, 8 pt, Bold, Roman);

ExpeTitleKey: STR => CharE Charx Charp {CharEArie} 
                     Charn Charc Chare;
  CharE.cand = {"E"};
  CharE.font = (@, @, @, @);
  CharE.separBefore = (Obligatory, [@, @]);  

  Charx.cand = {"x"};
  Charx.font = (@, @, @, @);
  Charx.separBefore = (Forbidden, [0pt, 1pt]); 

  Charp.cand = {"p"};
  Charp.font = (@, @, @, @);
  Charp.separBefore = (Forbidden, [0pt, 1pt]);  

  CharEArie.cand = { "e/"|"r"|"i"|"e"};
  CharEArie.font = (@, @, @, @);
  CharEArie.separBefore = (Forbidden, [0pt, 1pt]);  

  Charn.cand = {"n"};
  Charn.font = (@, @, @, @);
  Charn.separBefore = (Forbidden, [0pt, 1pt]);  

  Charc.cand = {"c"};
  Charc.font = (@, @, @, @);
  Charc.separBefore = (Forbidden, [0pt, 1pt]);  

  Chare.cand = {"e"};
  Chare.font = (@, @, @, @);
  Chare.separBefore = (Forbidden, [0pt, 1pt]);  

ExpeTitleNum: STR => FDigit Period GDigit;
  ExpeTitleNum.separBefore = (@, [@, @]);

TableTitleNum: STR => FDigit Period GDigit;
  TableTitleNum.separBefore = (@, [@, @]);

ExpeTitleText: STR => {Word | Punction | MixWord | ComWord};

ExpeDesc: PRT => ExpeParaSta [{{ExpeParaSta} | {(NonText 
                 [ExpeParaSpe])} | {ExpeItem}}];

  ExpeParaSta.zone = @;
  ExpeParaSta.alignment = (@, @, [11 pt, 17 pt], [11 pt, 17 pt],
			 [11 pt, 17 pt]);
  ExpeParaSta.lineHeight = [8 pt, 8 pt];
  ExpeParaSta.spaceBefore = (Allowed, [1 pt, 300pt]);
  ExpeParaSta.interSpace = (Allowed, [0 pt, 3 pt]);
  ExpeParaSta.font = (Times, 8pt, Normal, Roman);

  ExpeParaSpe.zone = @;
  ExpeParaSpe.alignment = (@, @, [11 pt, 17 pt], [11 pt, 17 pt],
			 [-3 pt, 3 pt]);
  ExpeParaSpe.lineHeight = [8 pt, 8 pt];
  ExpeParaSpe.spaceBefore = (Allowed, [1 pt, 300pt]);
  ExpeParaSpe.interSpace = (Allowed, [0 pt, 3 pt]);
  ExpeParaSpe.font = (Times, 8pt, Normal, Roman);

  ExpeItem.zone = @;
  ExpeItem.alignment = (@, @, [39 pt, 45 pt], [11 pt, 17 pt], 
                       [-11pt, -17pt]);
  ExpeItem.lineHeight = [8 pt, 8 pt];
  ExpeItem.spaceBefore = (Allowed, [0 pt, 10pt]);
  ExpeItem.interSpace = (Allowed, [0 pt, 3 pt]);
  ExpeItem.font = (Times, 8pt, Normal, Roman);

ExpeParaSta: FRG => ParaData;
  ExpeParaSta.separBefore = (Allowed, [2pt, 15pt]);  

ExpeParaSpe: FRG => ParaData;
  ExpeParaSpe.separBefore = (Allowed, [2pt, 15pt]);  

ExpeItem: FRG => BigPeriod ParaData;

  ExpeItem.separBefore = (Allowed, [2pt, 15pt]);  
	
  BigPeriod.cand = {"."};
  BigPeriod.font = (@, @, Bold, @);
  BigPeriod.separBefore = (Obligatory, [2pt, 8pt]);

Table: PRT => TableTitle NonText;
  TableTitle.zone = @;
  TableTitle.alignment = (Allowed, Centered, [11 pt, 17 pt],
		      [11 pt, 17 pt], [-3 pt, 3 pt]);
  TableTitle.lineHeight = [8 pt, 8 pt];
  TableTitle.spaceBefore = (Allowed, [10 pt, 30pt]);
  TableTitle.interSpace = (Forbidden, [0 pt, 3 pt]);
  TableTitle.font = (Times, 8 pt, Normal, Roman);

  TableTitle.separBefore = (Allowed, [3pt, 15 pt]);  

TableTitle: FRG => TableTitleKey TableTitleNum TableTitleText;
  TableTitleKey.font = (Times, 8 pt, Bold, Roman);

TableTitleKey: STR => CharT Chara {Charble} Chara Charu 
                      TableTitleNum;
  CharT.cand = {"T"};
  CharT.font = (@, @, @, @);
  CharT.separBefore = (Obligatory, [@, @]);  

  Chara.cand = {"a"};
  Chara.font = (@, @, @, @);
  Chara.separBefore = (Forbidden, [0pt, 2pt]); 

  Charble.cand = {"b"|"l"|"e"};
  Charble.font = (@, @, @, @);
  Charble.separBefore = (Forbidden, [0pt, 2pt]);  

  Charu.cand = {"u"};
  Charu.font = (@, @, @, @);
  Charu.separBefore = (Forbidden, [0pt, 2pt]);  

TableTitleText: STR => {Word | Punction};

ItemFST: FRG => BigPeriod ParaData;
  ItemFST.separBefore = (Allowed, [3pt, 20 pt]);  

ItemFol: FRG => ParaData;
  ItemFol.separBefore = (Allowed, [3pt, 20 pt]);  

Topic: PRT => TopicTitle ParaSta [ParaSpe];
  TopicTitle.zone = main;
  TopicTitle.alignment = (Allowed, Leftadjusted, [-3 pt, 3 pt],
		      [-3 pt, 3 pt], [-3 pt, 3 pt]);
  TopicTitle.lineHeight = [10 pt, 10 pt];
  TopicTitle.spaceBefore = (Allowed, [10 pt, 25pt]);
  TopicTitle.interSpace = (Forbidden, [1 pt, 6 pt]);

  TopicTitle.font = (Times, 10pt, Bold, Roman);

TopicTitle: FRG => { Word };
  TopicTitle.separBefore = (Allowed, [5pt, 20 pt]);  

SectOne: PRT => SectOneTitle SectOneCont;
  SectOneTitle.zone = main;
  SectOneTitle.alignment = (Allowed, Leftadjusted, [-3 pt, 3 pt],
		      [-3 pt, 3 pt], [-3 pt, 3 pt]);
  SectOneTitle.lineHeight = [12 pt, 12 pt];
  SectOneTitle.spaceBefore = (Allowed, [15 pt, 40pt]);
  SectOneTitle.interSpace = (Forbidden, [4 pt, 10 pt]);

  SectOneTitle.font = (Times, 12pt, Bold, Roman);

SectOneTitle: FRG => LevelTwoNum  {Word};
  SectOneTitle.separBefore = (Allowed, [6pt, 25 pt]);  

SectOneCont: PRT => MainText | {SectTwo};

SectTwo: PRT => SectTwoTitle ( MainText| {Topic});
  SectTwoTitle.zone = main;
  SectTwoTitle.alignment = (Allowed, Leftadjusted, [-3 pt, 3 pt],
		      [-3 pt, 3 pt], [-3 pt, 3 pt]);
  SectTwoTitle.lineHeight = [10 pt, 10 pt];
  SectTwoTitle.spaceBefore = (Allowed, [10 pt, 25pt]);
  SectTwoTitle.interSpace = (Forbidden, [1 pt, 6 pt]);

  SectTwoTitle.font = (Times, 10pt, Bold, Roman);

SectTwoTitle: FRG => LevelThreeNum {Word};
  SectTwoTitle.separBefore = (Allowed, [5pt, 20 pt]);

  
5.4 The Translation Table

# This is a test translation table
# Translation from logi-docu-tags to sgml-tags.
#
# ld-tag        sgml-tag        formatting instructions
#--------------------------------------------------------
Chapter         chap            bs as be ae
ChapTitle       chapti          bs ae
LevelOneNum     chapnu

SectOne         sect            bs as be ae
SectOneTitle    sectti          bs ae
LevelTwoNum     sectnu       

MainText        maintext

ParaSta         parabeg         bs as be ae
ParaSpe         parabeg         bs as be ae
RefOne          ref
RefTwo          ref
RefThree        ref

Experience      exp             bs as be ae
ExpeTitle       expti           bs ae
ExpeTitleKey    &kwdexp
ExpeTitleNum    expnu

NonText         artwork

ExpeDesc        expdesc         bs as be ae
ExpeParaSta     parabeg         bs as be ae
ExpeParaSpe     parabeg         bs as be ae

SectTwo         susect          bs as be ae
SectTwoTitle    susectti        bs ae
LevelThreeNum   susectnu

Table           tabl            bs as be ae
TableTitle      tablti          bs ae
TableTitleKey   &kwdtabl
TableTitleNum   tablnum

#?              list
ItemFST         listitem        bs ae
ItemFol         listitem        bs ae

Topic           topic           bs as be ae
TopicTitle      topicti         bs ae

KeyExpr         keyexpr
KeySent         keysent

Phrase          quotat
PhraseBegin     &openquot
PhraseEnd       &closquot
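
To make the table format concrete, the following Python sketch reads such a table into a dictionary. It is an illustration only, not the Ada procedure read_transl_params. It assumes whitespace-separated columns and comment lines starting with #, and it assumes that the flags bs, as, be and ae correspond to the fields nl_bs, nl_as, nl_be and nl_ae of the Ada record tab_entry_type shown in section 6. Rejecting unknown flags is a choice made for this sketch; how the original handles them is not documented here.

```python
from dataclasses import dataclass


@dataclass
class TabEntry:
    """Mirrors the fields of the Ada record tab_entry_type."""
    tagname: str          # SGML generic identifier (or entity such as &kwdexp)
    nl_bs: bool = False   # newline before start-tag
    nl_as: bool = False   # newline after start-tag
    nl_be: bool = False   # newline before end-tag
    nl_ae: bool = False   # newline after end-tag


FLAGS = ("bs", "as", "be", "ae")


def parse_transl_table(text: str) -> dict[str, TabEntry]:
    """Parse translation-table text into {ld-tag: TabEntry}."""
    table: dict[str, TabEntry] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"missing sgml-tag in line: {line!r}")
        ld_tag, sgml_tag, *flags = fields
        entry = TabEntry(sgml_tag)
        for flag in flags:
            if flag not in FLAGS:
                # Assumption of this sketch: unknown flags are an error.
                raise ValueError(f"unknown formatting flag: {flag!r}")
            setattr(entry, "nl_" + flag, True)
        table[ld_tag] = entry
    return table


SAMPLE = """\
# ld-tag        sgml-tag        formatting instructions
Chapter         chap            bs as be ae
ChapTitle       chapti          bs ae
LevelOneNum     chapnu
#?              list
ExpeTitleKey    &kwdexp
"""
```

Calling parse_transl_table(SAMPLE) yields four entries; the line beginning with #? is treated as a comment and skipped.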

  
5.5 The Specific Logical Structure: SGML Text

The following is the unmodified output of the SGML generator. Each tilde character (~) stands for a character that was not recognized.

<chap>
<chapti> <chapnu>6.</chapnu> Thermodynamique</chapti>
<sect>
<sectti> <sectnu>6.1</sectnu> Introducti~n</sectti>
<maintext>
<parabeg>
  Les nombreu~ exemples de r act~ons vues jusqu'ici dans ce~ 
ouvrage on~ mon~r  que l'on peu~ ais men~ e~ u~~lemen~ d cr~re 
ce qu~ se d roule lors d'une r ac~~on c~m~que au moyen d'une 
 qua~on. Il es~ cependan~ un ph nom ne qui n'es~ pas d cri~ 
par les  qua~ons ~elles que nous les avons  cri~es, c'es~ le 
d gage-men~ ou l'absorp~~on d' nerg~e. Les e~p r~ences <ref>6.1</ref> 
e~ <ref>6.2</ref> d mon~ren~ que les r ac~~ons c~~m~ques son~ 
le plus souven~ accompagn es de ph nom nes ~er-~ques.
</parabeg>
<exp>
<expti> &kwdexp <expnu>6.1</expnu>  R ~ct~on ~u cu~vre ~vec 
l'~c~de ~~~que, d~g~~ement de ch~leur.</expti>
<expdesc>
<parabeg>
  En f~i~~nt couler d~ l'~c~d~ ~~~qu~ conc~n~  ~u~ d~~ tou~nur~s 
d~ cu~vr~, on con~t~t~ qu'~l ~~ d~roul~ un~ v~o~~nt~ r~~ct~on: 
~e cu~vre ~~ d~s~out ~n donn~nt un~ so~u~on v~~~, ~~ s~ d g~g~ 
un g~z brun. D'~u~e p~r~, l~ ~~rmom~~~ p~~c~ ~~ns ~e b~~~on 
~d~que une bru~que ~u~en~-t~on de te~p~~~tu~e.
</parabeg>
<parabeg>
  L~ r ~ct~on est ~epr ~ent e p~ l' qu~~on:
</parabeg>
</expdesc>
</exp>
</maintext>
</sect>
</chap>
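
Since the tilde marks unrecognized characters, the share of tildes in the text gives a rough impression of the recognition quality. The Python sketch below computes this share. It is a hypothetical helper, not part of the system, and it makes two simplifying assumptions: tags are removed with a simple pattern, and entity references such as &kwdexp are counted as ordinary text.

```python
import re


def unrecognized_ratio(sgml: str) -> float:
    """Fraction of non-blank text characters that are tildes."""
    text = re.sub(r"<[^>]*>", "", sgml)          # drop start- and end-tags
    chars = [c for c in text if not c.isspace()]  # ignore blanks and newlines
    if not chars:
        return 0.0
    return chars.count("~") / len(chars)
```

For the section title above, unrecognized_ratio("<sectti>Introducti~n</sectti>") counts one tilde among twelve characters.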

  
6. Ada Interface of the SGML Text Generator

--------------------------------------------------------------------------------
---                         FRIBOURG UNIVERSITY                              ---
---                     COMPUTER SCIENCE LABORATORY                          ---
---            Chemin du Musee 3, CH-1700 FRIBOURG, SWITZERLAND              ---
--------------------------------------------------------------------------------
--+ TITLE:     sgml_output 
--+ SUPPORT:   Rolf Brugger
--+ CREATION:  August 1994
--+ VERSION:   of 30.9.94
--------------------------------------------------------------------------------


WITH logical_document_manager;        USE logical_document_manager;
WITH long_string;                     USE long_string;
WITH TABLE_OF_STATIC_KEYS_AND_STATIC_VALUES_G;

PACKAGE sgml_output IS

TYPE tab_entry_type IS RECORD
  tagname:     v_string;              -- sgml: element's generic identifier
  nl_bs:       BOOLEAN:=FALSE;        -- print newline before start-Tag
  nl_as:       BOOLEAN:=FALSE;        -- print newline after  start-Tag
  nl_be:       BOOLEAN:=FALSE;        -- print newline before end-Tag
  nl_ae:       BOOLEAN:=FALSE;        -- print newline after  end-Tag
END RECORD;

PACKAGE table IS NEW TABLE_OF_STATIC_KEYS_AND_STATIC_VALUES_G(
              KEY_TYPE =>       v_string,
              LESS =>           "<",
              EQUALS =>         "=",
              VALUE_TYPE =>     tab_entry_type);
TYPE transl_table_type IS NEW table.table_type;

--------------------------------------------------------------------------------


PROCEDURE read_transl_params(paramfile:IN          STRING;
                             transl_table: IN OUT  transl_table_type);
-- Reads the translation table from the file 'paramfile' and transfers the 
-- data to 'transl_table'.
-- The translation table is used to translate logical document tags 
-- (-identifiers) to SGML-identifiers.
-- If the file 'paramfile' doesn't exist, an error message will be printed to
-- standard output.


PROCEDURE write_sgml(doc: IN                logical_entity_type;
                     transl_table: IN       transl_table_type;
                     output_file: IN        STRING);
-- Traverses the tree 'doc' depth-first.
-- The nodes of the tree are interpreted and translated to SGML output,
-- which is written to the file 'output_file'.
-- The node tags are translated to SGML tags according to the contents of
-- the translation table 'transl_table'.

--------------------------------------------------------------------------------

END sgml_output;
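
The depth-first translation described for write_sgml can be sketched in Python as follows. This is not the Ada implementation, whose body is not shown here; the types Node and Entry and the function to_sgml are hypothetical. Three behaviours are assumptions inferred from the sample output in section 5.5: nodes without a table entry contribute only their content, an sgml-tag starting with & is emitted as an entity reference in place of the element, and a requested newline is suppressed when the output is already at the start of a line.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    tagname: str
    nl_bs: bool = False  # newline before start-tag
    nl_as: bool = False  # newline after start-tag
    nl_be: bool = False  # newline before end-tag
    nl_ae: bool = False  # newline after end-tag


@dataclass
class Node:
    tag: str                      # logical document tag
    text: str = ""                # character content of a leaf
    children: list["Node"] = field(default_factory=list)


def to_sgml(doc: Node, table: dict[str, Entry]) -> str:
    parts: list[str] = []

    def newline() -> None:
        # Assumption: never produce an empty line.
        if parts and not parts[-1].endswith("\n"):
            parts.append("\n")

    def visit(node: Node) -> None:
        entry = table.get(node.tag)
        if entry is not None and entry.tagname.startswith("&"):
            parts.append(entry.tagname)  # entity replaces the element
            return
        if entry is not None:
            if entry.nl_bs:
                newline()
            parts.append(f"<{entry.tagname}>")
            if entry.nl_as:
                newline()
        if node.text:
            parts.append(node.text)
        for child in node.children:      # depth-first, left to right
            visit(child)
        if entry is not None:
            if entry.nl_be:
                newline()
            parts.append(f"</{entry.tagname}>")
            if entry.nl_ae:
                newline()

    visit(doc)
    return "".join(parts)


TABLE = {
    "Chapter": Entry("chap", True, True, True, True),
    "ChapTitle": Entry("chapti", nl_bs=True, nl_ae=True),
    "LevelOneNum": Entry("chapnu"),
    "ExpeTitleKey": Entry("&kwdexp"),
}

DOC = Node("Chapter", children=[
    Node("ChapTitle", children=[
        Node("LevelOneNum", text="6."),
        Node("Word", text=" Thermodynamique"),  # no table entry: text only
    ]),
])

print(to_sgml(DOC, TABLE))
```

Keeping the newline decisions in the table entries, as the Ada record does, means the layout of the generated text can be changed without touching the traversal itself.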


Rolf Brugger
1999-09-15