Open Source EXPRESS Parser Documentation

Stephane Lardet

Josh Lubell

$Date: 2001/12/18 21:52:10 $


Table of Contents

1. User Documentation (For Application Developers)
1.1. Presentation
1.2. Parser's Setup
1.2.1. Easy Parser
1.2.2. Less Easy Parser
1.2.3. Tree Structure
1.2.4. Walking/Modifying the Tree
1.2.4.1. Using ANTLR's Tree Walker Generator
1.2.4.1.1. Walk the Tree
1.2.4.1.2. Modify the Tree
1.2.4.2. Using Custom Code
1.3. Using the EXPRESS Indented Code Generator
2. Technical Documentation (For EXPRESS Parser Developers)
2.1. Some ANTLR Facts
2.2. Lexer Setup for EXPRESS Parsing
2.3. Parser
2.3.1. Two passes
2.3.2. Tokens
2.3.3. So, Basically, How Does It Work ?
2.3.3.1. Scoping Principles
2.3.3.2. A Scope Tree
2.3.3.3. Retrieve Scopes in the Second Pass
2.3.3.4. Example
2.3.3.5. Id Recording During First Pass
2.3.3.6. Id Type Retrieval During Second Pass
2.3.4. But It's Not That Simple, Is It ?
2.3.4.1. Parser's Tokens Buffering
2.3.4.2. Enumeration Ids
2.3.4.3. Attributes Inheritance in Entities
2.3.4.4. Attribute Qualifiers and Inverse_clause Issues
2.3.4.5. The 10.2.2 Visibility Rules
2.3.4.6. Lookahead to Break Ambiguous Cases
2.3.4.7. REFERENCE and USE Clauses
2.4. Tree construction
2.4.1. General
2.4.2. Special Cases
2.5. Comments and Token Stream Splitting
2.6. Walking the Tree

1. User Documentation (For Application Developers)

Read this if you want to build an application.

1.1. Presentation

The Java Express Parser is a parser written in Java using the ANTLR parser generator; it is able to parse the EXPRESS (ISO 10303-11) language. In fact, it is just a bunch of code intended to be used by developers who want to include an Express parser in their Java application. It is basically able to read a single file, parse it, and build an AST (Abstract Syntax Tree) from it. What a developer typically wants to do is run the parser code and then use the output AST by walking it and/or modifying it. In addition, code to produce indented Express text from an AST is provided with the parser (it is called a pretty-printer). This document is just a brief overview of how to set up the parser, run it and walk the AST. Since the AST is implemented with ANTLR's tree structures, we strongly recommend reading ANTLR's documentation about ASTs.

First of all, you will need the ANTLR runtime to run the parser and the ANTLR tools to generate code, so install ANTLR on your platform (http://www.antlr.org) and take a look at its documentation. If you don't have access to the generated Java classes of the parser, you will have to build them first. It is easy; just type:


java antlr.Tool express.g

      

in the src directory.

To generate a walker from a .g file (like template.g, pretty.g or your own), just do the same:


java antlr.Tool whatever.g


      

Your CLASSPATH should point to ANTLR's .jar file. Otherwise, the parser won't be able to run.

1.2. Parser's Setup

1.2.1. Easy Parser

EasyParser is a class that makes it easy to set up and run the parser. To use it, just create an EasyParser instance with a String (filename) or a BufferedInputStream as a parameter, then call the parse() function. parse() returns an antlr.CommonAST object, which is the root of the AST.

So typical calling code is:


antlr.CommonAST ast;

try {
    EasyParser parser = new EasyParser("file.exp");
    ast = parser.parse();
}
catch (Exception e) { [...] }
/* then you can use ast */

        

EasyParser assumes you want to discard comments, and it does not display status messages of any kind.

1.2.2. Less Easy Parser

If you want more control of how parsing is performed, you might want to directly use ANTLR's generated classes.

First of all, you have to read the file to be parsed before parsing it. Since two passes are necessary, it is better to read the file once and reuse the stored data for both passes. In the sample files provided, a ByteArrayInputStream is used.

Then, after the file has been read, an ExpressLexer object and an ExpressParser object should be created. The lexer is instantiated with the ByteArrayInputStream as a parameter, and the parser is instantiated with the lexer object as a parameter. Then, for technical reasons, the lexer needs a reference to the parser, so the setParser() function of ExpressLexer should be used.

The first pass is initiated by a call to the parser's syntax() function. After the first pass is done, the parser needs to do some linking in case there are multiple schemas in the file, so processExternals() has to be called. During the first pass, the parser collects a lot of information that will be needed during the second pass. This information has a tree structure; it is actually a tree of Scope objects. We just need a reference to the root of that tree so that we can pass it to the second-pass parser. The root scope is stored in the parser's public rootScope field, so we just assign parser.rootScope to a local variable of type Scope.

After that, we have to set up the second pass. First of all, the input stream used should be able to go back to the beginning of the file. If a ByteArrayInputStream was used, a call to reset() will do that without reading the file again (provided a call to mark() has been done before). Another kind of input stream may require creating a new instance and reading the file again. A new lexer and a new parser object are then created, and a filter may be added between the two of them if needed. The filter is able to discard or hide tokens coming from the lexer depending on their type. Since comment tokens are not skipped during lexing, a filter is needed here to at least discard these tokens. In the future, one might just want to hide them and then be able to retrieve them during the last step, tree walking. To be able to use hiding, a TokenStreamHiddenTokenFilter object should be created with the lexer object as a parameter, and the new parser object should be created with the filter as a parameter. The lexer's token object class (the class used to represent tokens) should be set to antlr.CommonHiddenStreamToken with the setTokenObjectClass() function. This class is able to store hidden tokens: the parser does not see them, but they are there anyway, and later they could be attached to tree nodes. To specify which tokens should be hidden or discarded, the hide() or discard() functions may be used:


    filter.hide(ExpressParser.COMMENT);
    filter.discard(ExpressParser.LINECOMMENT);

Then again, as in the first pass, the lexer needs a reference to the parser, so the setParser() function is used. Finally, the parser needs the information collected during the first pass, i.e. the tree of Scope objects, so the parser's setRootScope() function is called with the previously extracted rootScope as a parameter.

The second pass is then started with a call to syntax(), as in the first pass. The AST that we will need later (if the application is doing something) is created there. The getAST() function returns a reference to this AST.

The Parser.java file is an example of how to set up and run the two passes.
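
For reference, here is a minimal sketch of the whole two-pass setup described above, using the class and function names introduced in this section (ExpressLexer, ExpressParser, Scope, setParser(), setRootScope(), processExternals()). Parser.java remains the authoritative example, so details may differ.

import java.io.*;
import antlr.TokenStreamHiddenTokenFilter;

public class TwoPassExample {
    public static void main(String[] args) throws Exception {
        // read the file once so that both passes can reuse the bytes
        File f = new File(args[0]);
        byte[] data = new byte[(int) f.length()];
        DataInputStream dis = new DataInputStream(new FileInputStream(f));
        dis.readFully(data);
        dis.close();
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        in.mark(data.length);

        // first pass: record ids and build the scope tree
        ExpressLexer lexer1 = new ExpressLexer(in);
        ExpressParser parser1 = new ExpressParser(lexer1);
        lexer1.setParser(parser1);
        parser1.syntax();
        parser1.processExternals();
        Scope root = parser1.rootScope;

        // second pass: build the AST, filtering out comment tokens
        in.reset();
        ExpressLexer lexer2 = new ExpressLexer(in);
        lexer2.setTokenObjectClass("antlr.CommonHiddenStreamToken");
        TokenStreamHiddenTokenFilter filter = new TokenStreamHiddenTokenFilter(lexer2);
        filter.hide(ExpressParser.COMMENT);
        filter.discard(ExpressParser.LINECOMMENT);
        ExpressParser parser2 = new ExpressParser(filter);
        lexer2.setParser(parser2);
        parser2.setRootScope(root);
        parser2.syntax();

        antlr.CommonAST ast = (antlr.CommonAST) parser2.getAST();
        /* then you can use ast */
    }
}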

1.2.3. Tree Structure

Before walking the tree, it is useful to know its structure. The tree structure was designed to correspond exactly to the EXPRESS grammar as defined in ISO 10303-11:1994. Thus every rule in the grammar produces a node in the tree. The node contains a token whose name is the name of the rule.

Let's parse this sample EXPRESS code:


ENTITY entity1;
  a : INTEGER;
  b : INTEGER;
END_ENTITY;

The corresponding grammar rules are:


entity_decl = entity_head entity_body END_ENTITY ';'

entity_head = ENTITY entity_id [ subsuper ] ';'

entity_id = simple_id

entity_body = { explicit_attr } [ derive_clause ] [ inverse_clause ] [
                unique_clause ] [ where_clause ]

explicit_attr = attribute_decl { ',' attribute_decl } ':' [ OPTIONAL ]
                base_type ';'


attribute_decl = attribute_id | qualified_attribute

attribute_id = simple_id

base_type = aggregation_types | simple_types | named_types

simple_types = binary_type | boolean_type | integer_type |
               logical_type | number_type | real_type | string_type


integer_type = INTEGER

Note

simple_id is a lexical element (token).

Now, let's see what the tree produced looks like:

A node may look like this:

(NODE_TOKEN elt1 elt2 ... eltn)

So using this representation, the result tree for the little piece of code above is:


(ENTITY_DECL 
  (ENTITY_HEAD 
    (ENTITY_ID "entity1")) 
  (ENTITY_BODY 
    (EXPLICIT_ATTR 
      (ATTRIBUTE_DECL 
        (ATTRIBUTE_ID "a")) 
      (BASE_TYPE 
        (SIMPLE_TYPES 
          (INTEGER_TYPE))))
    (EXPLICIT_ATTR 
      (ATTRIBUTE_DECL 
        (ATTRIBUTE_ID "b")) 
      (BASE_TYPE 
        (SIMPLE_TYPES 
          (INTEGER_TYPE))))))

Two remarks:

  1. simple_ids like "entity1" or "a" are actually mapped as IDENT tokens in the tree. The text of the IDENT token is the actual id.

  2. (INTEGER_TYPE) is a node without any elements.

Sometimes strings are attached to the tree when they are needed. For example, parsing a built_in_function call would look like this:

Code:

ACOS(...)

Tree:

(BUILT_IN_FUNCTION "acos")

Since there is not a rule for each built_in_function, we need to add the string (the name of the function) in the tree.

In other cases, we don't need to add a string. For example, the integer_type could look like (INTEGER_TYPE "integer"), but we don't need to add "integer" since there is no other choice. Other examples: strings like "optional" or "unique" are added because there is no rule attached to them, but "otherwise" or "enumeration of" are not.

The two things that one should remember: the tree structure is intended to match the grammar structure, and a string is added in the tree where it makes sense (where it is needed to extract the information).

Lexer tokens used in the tree:

  • IDENT (matches lexical element simple_id)

  • INT (matches lexical element integer_literal)

  • FLOAT (matches lexical element real_literal)

  • STRING (matches lexical element simple_string_literal)

The exact structure of the AST, represented as rules, is provided in the file template.g.

1.2.4. Walking/Modifying the Tree

1.2.4.1. Using ANTLR's Tree Walker Generator
1.2.4.1.1. Walk the Tree

ANTLR provides a convenient way to create code to walk through the AST. The structure of the tree may be written as a grammar-like representation with a rule defined for each node. Actions may be added, and custom information may be returned from a rule.

The basic structure of such a rule looks like this:

rule_name : #(NODE_NAME elt1 elt2 ... eltn)

Since we decided that the node tokens have the names of the Express grammar's rules, it will usually look like this:

rule_name : #(RULE_NAME elt1 elt2 ... eltn)

but not necessarily, since one can choose one's own rule names and representation. The resulting set of rules just has to be able to match the tree structure.

For example:

rule1 : #(RULE1 a b)
rule2 : #(RULE2 rule1 c)

could also be written this way:

rule2 : #(RULE2 #(RULE1 a b) c)

The template.g file provided with the parser's code is an example. If you run the generator (antlr.Tool) on it, it will produce code able to walk the AST. This code won't do anything though, since no action was added. This template file uses the same rules as in the Express grammar, but it can be modified as well, as long as the tree structure is respected.

For simple applications, the template.g file might be used directly; one would just need to fill the rules with actions.

An action is Java code inserted between '{' and '}' characters. It can be inserted anywhere in the rule. Example:

local_variable
             { int a=0;
               String or=null; }
             : #(LOCAL_VARIABLE { action(); } variable_id { a =
               anotherone; }  ( variable_id { why_not_there(); } )*
               parameter_type { or=there; }  ( expression  )? ) 

The location of the action may be important depending on the way local variables are used. The block just after the name of the rule is usually used for local variable declarations/initializations.

Warning

Every variable declared in a rule should be initialized at declaration. Since ANTLR-generated code makes heavy use of exception handling, a non-initialized variable will trigger a "variable might not have been initialized" error at compile time.

A rule may return a result of any type, and actions are able to use the information returned by other rules. One might want to return a Vector object, for example:

local_variable returns [Vector v]
    { v=null; }
    ...

Declaring a return type for a rule implicitly declares the variable (v here), but it is not initialized; that is why v=null has to be added here. One might also want to instantiate the variable at this point, in which case v = new Vector(); would be used rather than v=null;. The name of the returned variable is user defined (it could be vec, cat, express_is_good or whatever). Obviously, the returned object may be modified at other locations in the rules.

To get the returned object of a rule in another rule, use a variable of the same type:

rule1
 { Vector v=null; }
 : whatever v=local_variable whatever { /* you can use 'v' here */ }
 ;

Sometimes, you might just want to return the same object directly:

rule1 returns [Vector abc]
 { abc=null; }
 : whatever abc=local_variable whatever

Sometimes a rule may contain a token (lexical element). The way of getting the information is different:

attribute_id
    : id:IDENT { System.out.println(id.getText()); }
    ;

There is no need to declare the name (id in this example) of the token identifier; it is implicitly declared in the generated code. The identifier is then a reference to a token object, so one can get the text attached to it (getText()) and set it (setText()) if one wants to modify the tree.

pretty.g (the Express indented text generator aka pretty-printer) is a good example of how to create a tree walker with ANTLR's facilities.

1.2.4.1.2. Modify the Tree

This is called a tree transformation. The buildAST option should be set to true in the walker file (template.g for example).

Then, a new AST will be generated by walking the first AST. If no transformation is performed, the new tree and the former one will be identical twins. To modify the structure of the node produced in a rule, just put the '!' character after the rule name. Auto transform will be turned off for that rule, and your code will have to build the produced node manually. See the AST transformations chapter of the ANTLR documentation for further explanation.

1.2.4.2. Using Custom Code

The default class used to represent nodes is antlr.CommonAST. It extends the abstract class antlr.BaseAST. The AST is a child-sibling tree (see BaseAST.java): a node has only one child, and the other children are linked between them, so a node actually contains a linked list of children. Walking/modifying an AST with custom code requires that you know what you are doing, so we'll just introduce the interfaces of the functions you will need. BaseAST's interface provides some useful functions to walk/check the tree and use pattern matching:

    /** Is node t equal to this in terms of token type and text? */
    public boolean equals(AST t)

    /** Is t an exact structural and equals() match of this tree.  The
    *  'this' reference is considered the start of a sibling list.
    */
    public boolean equalsList(AST t)

    /** Is 'sub' a subtree of this list?
     *  The siblings of the root are NOT ignored.
     */
    public boolean equalsListPartial(AST sub)

    /** Is tree rooted at 'this' equal to 't'?  The siblings
     *  of 'this' are ignored.
     */
    public boolean equalsTree(AST t)

    /** Is 't' a subtree of the tree rooted at 'this'?  The siblings
     *  of 'this' are ignored. 
     */
    public boolean equalsTreePartial(AST sub)

    /** Walk the tree looking for all exact subtree matches.  Return
     *  an ASTEnumerator that lets the caller walk the list
     *  of subtree roots found herein.
     */
    public ASTEnumeration findAll(AST target)

    /** Walk the tree looking for all subtrees.  Return
     *  an ASTEnumerator that lets the caller walk the list
     *  of subtree roots found herein.
     */
    public ASTEnumeration findAllPartial(AST sub)

    /** Get the first child of this node; null if not children */
    public AST getFirstChild()

    /** Get the next sibling in line after this one */
    public AST getNextSibling() 

    /** Get the token text for this node */
    public String getText()

    /** Get the token type for this node */
    public int getType()

and other ones to modify the tree:

    /**Add a node to the end of the child list for this node */
    public void addChild(AST node)

    public void setFirstChild(AST c)

    public void setNextSibling(AST n)

    /** Set the token text for this node */
    public void setText(String text)

    /** Set the token type for this node */
    public void setType(int ttype)

To create a node:

    public CommonAST()
    
    public CommonAST(Token tok)

    public void initialize(Token tok)

    public void initialize(int t, String txt)

    public void initialize(AST t)

NOTE: token types are represented with integers. You might want to implement the interface ExpressParserTokenTypes to use token names in your code.
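
As an illustration, here is a small custom walker over the child-sibling structure using only the accessors listed above. It is just a sketch: the node layout is the one produced by the parser, but this class is not part of the distribution.

import antlr.collections.AST;

public class AstDump {
    /** Depth-first walk: print a node, then its children, then its siblings. */
    public static void dump(AST node, int depth) {
        for (AST n = node; n != null; n = n.getNextSibling()) {
            for (int i = 0; i < depth; i++) System.out.print("  ");
            System.out.println(n.getText());       // token text of the node
            dump(n.getFirstChild(), depth + 1);    // the children form a sibling list
        }
    }
}

Calling AstDump.dump(parser.getAST(), 0) would print an indented outline of the tree.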

In the future, CommonASTWithHiddenTokens might be used instead of CommonAST. It would allow hidden tokens (see Section 1.2 about filtering) to be stored in the AST nodes. See the ANTLR documentation about token stream splitting.

1.3. Using the EXPRESS Indented Code Generator

The pretty-printer is able to output well-indented Express text from an AST produced by the parser.

It is very easy to use: the generated file is PrettyWalker.java; you just need to instantiate a PrettyWalker and then run the syntax() function with the AST resulting from the parse as a parameter.

Looks like this:

 PrettyWalker walker = new PrettyWalker();
 String result = walker.syntax(parser.getAST());

syntax() returns a String.

The file PrettyPrinter.java is a full example of how to build a basic translation tool using the parser and ANTLR's generated tree walker.
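
Putting Section 1.2.1 and this section together, a minimal round-trip tool might look like the sketch below (PrettyPrinter.java remains the complete, authoritative example).

import antlr.CommonAST;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        EasyParser parser = new EasyParser(args[0]);   // parse the Express file
        CommonAST ast = parser.parse();
        PrettyWalker walker = new PrettyWalker();
        System.out.println(walker.syntax(ast));        // regenerate indented Express text
    }
}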

2. Technical Documentation (For EXPRESS Parser Developers)

Read this if you want to change the behavior of the parser.

2.1. Some ANTLR Facts

ANTLR is a lexer/parser/walker generator, i.e. it combines a lex-like generator and a yacc-like generator, and it is able to generate code to build an AST (Abstract Syntax Tree) as well as code to walk through the AST.

You can put the definitions of all parts in a single file, which is convenient. However, we decided to keep the walker code separate, since it is likely to be modified by developers; it is the customizable part of the parser.

So the lexer and the parser (and the AST producer) are defined in a file named express.g.

The lexer part generates a class that extends the ANTLR class Lexer and the parser part generates a class that extends Parser.

ANTLR also provides a concept called token stream splitting. It can generate code to filter the token stream sent to the parser, so that some tokens (depending on their type) are hidden or discarded. Hidden means that they are attached to the previous or the next token; the parser can't access them during reduction, but they may be retrieved later by custom code.

See the ANTLR documentation for further information: http://www.antlr.org.

2.2. Lexer Setup for EXPRESS Parsing

Since there are tokens that need up to 4 characters to be recognized, a lookahead of 4 (k=4) is used. It means that every time the lexer looks at a character from the character stream, it is able to access the three next characters to make a decision.

The literals are case insensitive (although most people write Express literals with uppercase characters, ISO 10303-11:1994 defines them as case insensitive), so caseSensitiveLiterals is set to false. The character vocabulary used is the extended ASCII character set, so the charVocabulary option is used.

2.3. Parser

The generated parser is LL(1) (k=1), and AST code generation is activated (buildAST is set).

2.3.1. Two passes

The Express grammar is defined in such a way that we can't parse Express code in a single pass: the parser sometimes needs to know the type of an identifier to reduce a rule. For example, if we want to reduce the rule:

      named_types = entity_ref | type_ref

we need to know whether the identifier is a type_id or an entity_id.

So, there is a first pass with a modified grammar where ids are recorded with their types. Then the second pass is able to run because the lexer is able to identify a type_id or an entity_id. The modified grammar is able to handle the token stream without knowing the type of the identifiers, but some rules are modified in such a way that the tree cannot be built from them; only the true grammar (used in the second pass) is able to produce a full parse tree of the input text.

2.3.2. Tokens

Special tokens are defined to represent typed ids. During the first pass, all ids are passed to the parser as IDENT tokens; during the second pass, the token type used to pass an id depends on its type.

Here are the special tokens defined in the header of the express.g file:


      CONSTANT_IDENT;
      ENTITY_IDENT;
      FUNCTION_IDENT;
      PROCEDURE_IDENT;
      PARAMETER_IDENT;
      SCHEMA_IDENT;
      TYPE_IDENT;
      VARIABLE_IDENT;
      ENUMERATION_IDENT;
      ATTRIBUTE_IDENT;
      ENTITY_ATTR_IDENT;
      TYPE_ATTR_IDENT;
      ENTITY_VAR_IDENT;
      TYPE_VAR_IDENT;
      ENTITY_PARAM_IDENT;
      TYPE_PARAM_IDENT;

The ENTITY_ATTR_IDENT, TYPE_ATTR_IDENT, ENTITY_VAR_IDENT, TYPE_VAR_IDENT, ENTITY_PARAM_IDENT and TYPE_PARAM_IDENT tokens will be explained later.

2.3.3. So, Basically, How Does It Work ?

2.3.3.1. Scoping Principles

First of all, we use semantic predicates to force the parser to use modified rules during the first pass. So, lines beginning with { isFirst }? are the modified parts.

During the first pass, we want to record ids with their types. Sometimes you may have ids with the same name, because Express uses scoping. It means that you can declare an id only once in a block (scope), but you can have ids with the same name in different scopes. It's basically the same as in C; for example you can write:

int a;

{
    int a;
}

but not:

int a;
int a;

So, we need to be able to record different ids with the same name, and attach information about the scopes they are declared in. We also need to be able to detect multiple definition errors (when ids with the same name are declared in the same scope).

2.3.3.2. A Scope Tree

To achieve that, we decided to use a scope tree. It means that a tree of objects representing scopes is built during the first pass. Each scope object contains a hashtable (with the id name as the key, and its type as the value) to record the ids. During the first pass, at any time there is a current scope (currentScope variable) that is updated every time we enter a new scope or every time we leave a scope (then the current scope is the scope containing the one we left, its parent). When an id is found, it is recorded in the current scope's hashtable. The type of the id is stored as an integer, according to the definition of the tokentypes in the ExpressParserTokenTypes.java generated interface. So, in the express.g file, every rule where a new scope starts contains a call to newScope() and every rule (sometimes it is the same) where a scope ends contains a call to upScope(). According to section 10 of the ISO 10303-11:1994 document, Express items defining a scope are: alias statement, entity, function, procedure, query expression, repeat statement, rule, schema and type.
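
A very simplified sketch of what such a Scope object might contain is shown below. The real Scope.java has more fields and functions (for entity inheritance and for the REFERENCE/USE handling described later), so this is only meant to illustrate the idea.

import java.util.Hashtable;

/* Simplified sketch of a scope-tree node; see Scope.java for the real code. */
class Scope {
    Scope parent;                       // enclosing scope (null for rootScope)
    Scope next;                         // next scope in creation order (used by the second pass)
    Hashtable ids = new Hashtable();    // id name -> Integer token type

    Scope(Scope parent) { this.parent = parent; }

    void setNext(Scope s) { next = s; }

    /* Record an id in this scope; warn on double definition. */
    void addId(String name, int type) {
        if (ids.containsKey(name))
            System.err.println("warning: " + name + " declared twice in the same scope");
        ids.put(name, new Integer(type));
    }

    /* Search this scope, then the enclosing scopes, for an id.
       Returns 0 if the id is not found anywhere (the lexer then
       falls back to a plain IDENT token, see Section 2.3.3.6). */
    int findId(String name) {
        for (Scope s = this; s != null; s = s.parent) {
            Integer t = (Integer) s.ids.get(name);
            if (t != null) return t.intValue();
        }
        return 0;
    }
}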

2.3.3.3. Retrieve Scopes in the Second Pass

So how are we able to retrieve such information during the second pass? First, when entering a new scope we need to get the scope object that was created when entering the same scope during the first pass. In order to do so, additional information is recorded during the first pass: the order of scope object creation. Every scope object has a link to the next scope object created. So we use another variable, lastCreatedScope, which is not equivalent to currentScope. Every time a new scope is entered during the second pass, currentScope is updated with the scope that was created just after lastCreatedScope, and then lastCreatedScope is updated (with the new current scope). When leaving a scope, currentScope becomes its own parent scope; but that parent scope was created earlier, so currentScope becomes different from lastCreatedScope. It may sound a little unclear, but the example below should make the mechanism clearer.

2.3.3.4. Example

Consider the rules function_decl and function_head:


function_decl
             :  function_head ( algorithm_head)?  stmt  (  stmt  )*
                "end_function"!  SEMI! { upScope(); }

function_head
             : "function"!  function_id { newScope(); } (  LPAREN!
               formal_parameter  (  SEMI!  formal_parameter  )*
               RPAREN!  )?  COLON! parameter_type  SEMI! 

During the first pass, newScope() creates a new Scope object whose parent is currentScope, and then attaches the newly created scope to the last created scope:


public void newScope1() {
                /* creates a new Scope when entering a rule defining
                a scope in the grammar. */
                Scope ns;

                ns = new Scope(currentScope);
                currentScope=ns;
                lastCreatedScope.setNext(ns);
                lastCreatedScope=ns;
        }

And during the second pass, it retrieves this scope from the last created scope:


public void newScope2() {
                /* retrieve the scope created in the first pass when
                entering the same rule. 
                See comments in the lexer's IDENT rule definition */

                currentScope=lastCreatedScope.next;
                lastCreatedScope=currentScope;
        }

In order for this to work, the rootScope (the scope containing every other scope) has to be the same in the first and the second pass, so after the first pass is completed, rootScope is passed to the second pass parser thanks to the setRootScope() method.

Note that currentScope is not always the same as lastCreatedScope, because when you leave a scope, as in the function_decl rule, upScope() is called:


 public void upScope() {
                /* when exiting a scope */
                currentScope=currentScope.parent;
        }

currentScope.parent was created before currentScope, so currentScope changes but not lastCreatedScope.

2.3.3.5. Id Recording During First Pass

The recording is very simple, as shown in the function_id rule restricted to its first pass part:


function_id
             : { isFirst }? id:IDENT { addId(id.getText(),FUNCTION_IDENT); }

addId just adds the id name (id.getText()) and its type (FUNCTION_IDENT) to the id table of currentScope. If the id already exists, a double definition warning is displayed (see the Scope.java source code).

2.3.3.6. Id Type Retrieval During Second Pass

The retrieval part (second pass) occurs in the lexer. Instead of just producing an IDENT token for each identifier, the lexer runs a search through the scope tree and produces a token with a specific type (FUNCTION_IDENT for example). The algorithm searches for the id in the current scope, then, if it is not found there, in the parent scope, and so on. If nothing is found in the end (in rootScope), a plain IDENT token is returned.

2.3.4. But It's Not That Simple, Is It ?

No, it is not. In fact, there are 7 particular issues that we have to deal with.

2.3.4.1. Parser's Tokens Buffering

The ANTLR parser code seems to buffer tokens, which can be a problem because, during the second pass, id search is performed by the lexer. The lexer uses currentScope to perform the search, and currentScope is updated depending on the tokens read by the parser. Let's take an example to show what goes wrong:


alias_stmt
    : { newScope(); } "alias"! variable_id "for"! general_ref (
      qualifier )* SEMI! stmt ( stmt )* "end_alias"! { upScope(); } SEMI!

Let's parse this piece of code:


FUNCTION a(b : c) : d;
    ALIAS x FOR b;
      [...]
    END_ALIAS;
END_FUNCTION;

We are doing the second pass. Suppose that the parser always has a buffer containing the four next tokens. Then, when dealing with the last colon character of the function declaration, the d id, the semicolon and the ALIAS literal have already been read and tokenized by the lexer. So, when dealing with the d id, the x id has been read by the lexer. But we are not in the alias_stmt rule yet, so the newScope() call has not occurred, and the current scope is still the scope of the function. When the lexer looks up the id x, it believes the current scope is the function's one, but x was recorded in the alias' scope, so the lexer is not able to find x and an error occurs.

This problem occurs with rules that define a scope and where an id is declared in this same scope. These rules are: repeat_stmt, alias_stmt, query_expression, and rule_head. For example, the name of a function is recorded in the outer scope of the function, so there is no problem there, but the variable name of a query expression is recorded in the inner scope of the query expression, so it has the same problem as in the alias example. To fix the problem, we update currentScope in the lexer instead of doing it in the parser's rule. So, when dealing with an identifier (literals are identifiers for the lexer), the string is compared to the ALIAS, REPEAT, QUERY and RULE keywords, and if it is one of those, a newScope() call is performed.

That's why there is no newScope() call in these rules in the parser.
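
A hypothetical sketch of that check, as it could appear in the lexer's identifier rule during the second pass (the helper name is made up):

/* Hypothetical helper: true for the keywords whose scope must be opened
   from the lexer, before the parser gets a chance to reduce the rule. */
static boolean opensScopeEarly(String word) {
    String w = word.toLowerCase();
    return w.equals("alias") || w.equals("repeat")
        || w.equals("query") || w.equals("rule");
}

/* In the lexer's identifier rule (second pass only):
   if (opensScopeEarly(text)) parser.newScope();     */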

2.3.4.2. Enumeration Ids

Enumeration ids are different from other ids in two ways. First, they can be duplicated, that is to say you can have several enumeration ids with the same name in the same scope. That is because you can select them by placing the name of their enumeration type before them. Second, enumeration ids are not recorded in the scope they are declared in, but in its outer scope. Example:

TYPE a = ENUMERATION OF (u, v, w, x);
END_TYPE;

TYPE b = ENUMERATION OF (u, f, g, h);
END_TYPE;

For enumeration id u, u is recorded twice in the same scope, because the scope where u, v, w, x are visible is not the scope of type a, but the outer scope, i.e. the scope where types a and b are declared.

In order to do that, when dealing with an enumeration id, the addId function of Scope.java calls a special function of the parent scope. This function does not check for double definitions.
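
Continuing the Scope sketch from Section 2.3.3.2, the special case could be handled roughly like this (the actual function names in Scope.java may differ):

/* Sketch: enumeration ids go to the parent scope and duplicates are accepted. */
void addId(String name, int type) {
    if (type == ExpressParserTokenTypes.ENUMERATION_IDENT && parent != null) {
        parent.addEnumId(name, type);
        return;
    }
    if (ids.containsKey(name))
        System.err.println("warning: " + name + " declared twice in the same scope");
    ids.put(name, new Integer(type));
}

void addEnumId(String name, int type) {
    ids.put(name, new Integer(type));   // no double-definition check
}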

2.3.4.3. Attributes Inheritance in Entities

Entities can have supertypes, so in an entity declaration (in the where clause for example) you can use references to attributes of supertypes of the current entity. You can also use attributes defined in the supertype of the supertype, and so on. So, during the second pass, when resolving an id in an entity declaration we also have to search the inheritance tree in addition to searching the scope tree. In order to search the inheritance tree, we have to build it. We reuse the scope tree structure by recording some additional information when parsing an entity declaration. Three pieces of information are recorded. First, when parsing an entity declaration, we set a value to specify that the current scope is a scope defined by an entity. Then we tell the outer scope that it contains a scope defined by an entity, and the name of the entity is given to the outer scope. Finally, in a subtype_declaration, we record the names of the supertypes of the entity in its scope.

see newEntityScope() and addSuper() functions in express.g.

How is the information stored? Every scope has a boolean value telling whether it is an entity scope. The scopes where an entity can be declared (schema, function, procedure, rule) contain a hashtable for the entity scopes they enclose: the key is the name of the entity declared inside the scope, and the value is a reference to the scope defined by that entity. A scope defined by an entity contains a vector holding the names of the supertypes of the entity.

How do we use this information? When looking for an id inside an entity declaration, we first search the inheritance tree. We look in the id table of the current entity, then we have to search the supertype entities. We use the vector containing the names of the supertypes: for each name, we search the scope tree (the supertype must be declared in a scope higher than the current scope) until we find an entity with that name, using the entities hashtable described earlier. When the entity is found, we get its scope, and we can look for our id in this scope's id table. We do that for every supertype entity, and for every supertype of the supertype entities, and so on until we find the id. If the id was not found, we perform the usual scope tree search algorithm.

So, as a conclusion, we first search the whole inheritance tree (the entities are resolved on the fly) and then we search the scope tree.

It is surely the most complicated part of the parser.
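
To make the lookup order concrete, here is a rough sketch, written as if the Scope sketch from Section 2.3.3.2 were extended with the two fields described above (a java.util.Vector of supertype names in entity scopes, and a java.util.Hashtable of enclosed entity scopes in schema/function/procedure/rule scopes). The real code in Scope.java is organized differently.

    Vector supertypes = new Vector();      // names of this entity's supertypes
    Hashtable entities = new Hashtable();  // entity name -> Scope (held by enclosing scopes)

    /* Look for an id inside an entity scope: first this entity's own ids,
       then (recursively) its supertypes. Returns 0 if nothing was found,
       in which case the normal scope-tree search takes over. */
    int findInEntity(String name) {
        Integer t = (Integer) ids.get(name);
        if (t != null) return t.intValue();
        for (int i = 0; i < supertypes.size(); i++) {
            String superName = (String) supertypes.elementAt(i);
            Scope superScope = null;
            // resolve the supertype by walking up the scope tree
            for (Scope s = parent; s != null && superScope == null; s = s.parent)
                superScope = (Scope) s.entities.get(superName);
            if (superScope != null) {
                int r = superScope.findInEntity(name);
                if (r != 0) return r;
            }
        }
        return 0;
    }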

2.3.4.4. Attribute Qualifiers and Inverse_clause Issues

In the attribute_qualifier rule, we have to identify an attribute_ref, i.e. an attribute id (ATTRIBUTE_IDENT). In order to find this id, we would have to search from the target of the qualifier, but the target could be anything: an entity reference, a select type, an array with an index_qualifier. It would be far too hard to try to get the actual type of the target and then get its scope. But why would we need to, since thanks to the dot we are sure that we are in an attribute_qualifier rule? So we do know the identifier is an ATTRIBUTE_IDENT (if it is not, the code is not correct, but typed ids are not used to check correctness, they are used to help rule reduction), and we don't need to search for its type. That's why the global_ident pseudo rule was introduced. We use it to say: whatever the id type, we don't care. We use it the same way in the inverse_attr rule.

2.3.4.5. The 10.2.2 Visibility Rules

When an id x is declared in a scope A and then an id with the same name is declared in a scope B included in A, the second x hides the first: in scope B, only the x declared in B is visible. This is not always true; there is an exception where the x declared in A is visible in B even if a new x was declared in B. This exception is described in section 10.2.2 of the ISO 10303-11:1994 document. In our example, if x is an entity or a defined data type identifier, and if:

  • B is defined by an entity declaration and x is declared in B as an attribute

or

  • B is defined by a function, procedure or rule declaration and x is declared in B as a formal parameter or variable

Then both x declared in A and x declared in B are visible in B.

So this text is valid Express:

ENTITY e1;
END_ENTITY;

ENTITY e2;
    e1 : e1;
END_ENTITY;

In this case, the id search algorithm in e2 returns an ENTITY_ATTR_IDENT token. It means that e1 is visible both as an attribute and as an entity. So the grammar uses these tokens where they may appear. TYPE_ATTR_IDENT works the same way for type ids. For the second case (parameters and variables) we use ENTITY_VAR_IDENT, TYPE_VAR_IDENT, ENTITY_PARAM_IDENT and TYPE_PARAM_IDENT. Sometimes, using such tokens leads to ambiguous situations: for example, when reading a TYPE_ATTR_IDENT token, we don't know whether it is a TYPE_IDENT or an ATTRIBUTE_IDENT, but sometimes we need to know. The solution is described in Section 2.3.4.6.

2.3.4.6. Lookahead to Break Ambiguous Cases

Sometimes, even with typed id tokens, it is still not possible to resolve an ambiguity between two rules. In these cases, we use the parser's lookahead capability and semantic predicates. Lookahead means we look at the tokens following the current token to make a decision. Example in simple_factor:

 |  { LA(2)==LPAREN }? entity_constructor
 |  { LA(1)==ENUMERATION_IDENT||LA(3)==ENUMERATION_IDENT }? enumeration_reference
 |  (  (  unary_op  )?  (  LPAREN!  expression  RPAREN!  |  primary )  )
 ;

entity_constructor develops as entity_ref '(' ( expression ( expression )* )? ')', primary develops as qualifiable_factor ( qualifier )*, and then qualifiable_factor can develop as a population, which is an entity_ref.
So, when reading an ENTITY_IDENT, how do we know whether it is an entity_constructor or a population? For tree construction reasons, we cannot factor these rules, so we just check whether there is a '(' character just after the ENTITY_IDENT. Thus, we are able to decide between entity_constructor and population.

It works the same way for enumeration_reference, but the reason is a little more complicated: because of the 10.2.2 exception (see Section 2.3.4.5) we may read a TYPE_ATTR_IDENT token. Since two identifiers with the same name are visible, it may be a TYPE_IDENT or an ATTRIBUTE_IDENT, but we don't know which one. In case it is a TYPE_IDENT, it may be an enumeration_reference:

enumeration_reference -> ( type_ref '.' )? enumeration_ref

In case it is an ATTRIBUTE_IDENT, it may be a primary -> qualifiable_factor ( qualifier )*, with qualifiable_factor -> attribute_ref and qualifier -> attribute_qualifier -> attribute_ref.

So, is it type_ref.enumeration_ref or attribute_ref.attribute_ref? In this case, the only way to know is to check the token after the dot, i.e. the third token, LA(3). If LA(3) is an ENUMERATION_IDENT then the answer is enumeration_reference; if it is not, the answer is primary.

If the semantic predicate is not true, the rule is skipped. So, the obvious case where an ENUMERATION_IDENT leads to the enumeration_reference rule also needs to be covered by the semantic predicate; that's why there is a LA(1) condition.

Conclusion:

  • TYPE_IDENT leads to enumeration_reference

  • ATTRIBUTE_IDENT leads to primary

  • TYPE_ATTR_IDENT '.' ENUMERATION_IDENT leads to enumeration_reference

  • TYPE_ATTR_IDENT '.' (not ENUMERATION_IDENT) leads to primary

  • ENUMERATION_IDENT leads to enumeration_reference

Such mechanisms are also used in domain_rule.

2.3.4.7. REFERENCE and USE Clauses

This parser is not able to read multiple schemas spread over multiple files at this time, but since the syntax allows multiple schemas to be declared in a single file, it supports REFERENCE and USE clauses in that case.

Basically, when an id is referenced or used in a schema, it is added to the root scope of the referencing schema. When the id is renamed, then the new name is added but not the former one. If no id is specified in a REFERENCE clause, all constant, entity, function, procedure and type ids of the referenced schema(s) are added to the referencing schema's root scope. If no id is specified in a USE clause, then all entity and type ids of the used schema(s) are added to the using schema's root scope.

In order to achieve that, we once again record new information in the scope tree, and we use a new object, ExternalId, to store it. The information is stored in the Scope object representing a scope defined by a schema. What is stored is:

  • a Vector of ExternalId objects. These objects represent the single ids that are used or referenced by a schema. Each ExternalId object contains the name of the id, the name of the schema where it is originally declared, its type and the new name if it has been renamed in the REFERENCE/USE clause.

  • a Vector of Strings for the REFERENCE clause. Each string is the name of a schema. These schemas are those that are entirely referenced (no id is specified).

  • a Vector of Strings for the USE clause. Same thing.

  • a boolean (extdone) to tell whether the referenced or used ids have been resolved.

Between the first and the second pass, schema names are resolved and the ids are added. See the processExternals() function in Scope.java.

Scopes defined by schemas also feature functions to get all referenceable or usable ids in the current schema; see the getReferencedExternals() and getUsedExternals() functions. The ids are returned as a Vector of ExternalId objects (here just id/type pairs). These functions are used when an entire schema is referenced or used (without specifying the ids).

NOTE: if a stronger mechanism to handle REFERENCE or USE clauses is implemented (such as a mechanism capable of accessing a schema database or of parsing other files to get the ids) then the current mechanism should be removed.
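
For illustration, the ExternalId record described above is essentially a small structure like the following sketch (field names are guesses; see the actual ExternalId source for the real ones):

/* Sketch of the per-id record used for REFERENCE/USE handling. */
class ExternalId {
    String name;      // id name as declared in its original schema
    String schema;    // name of the schema where the id is originally declared
    int type;         // token type (ENTITY_IDENT, TYPE_IDENT, ...)
    String newName;   // new name if renamed in the REFERENCE/USE clause, or null
}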

2.4. Tree construction

2.4.1. General

The tree structure is simple: every rule of the grammar produces a node. The common syntax is:

 #rule_name = #([NODE_NAME, "NODE_NAME"], #rule_name);

Actually, we decided that the node would have the name of the rule, so NODE_NAME is the same as rule_name. In ANTLR's ASTs, a token is attached to a node to identify it. So NODE_NAME is a token, although it is not produced by the lexer, which is why we have to declare it in the parser's header. Basically, #rule_name contains a list of tokens or tree nodes. So we create a tree node, attach the current #rule_name to it, and then assign the newly created node to #rule_name.

We don't need typed ids anymore in the AST: the special id tokens (*_IDENT) are useful during the second pass only. That is why every id token has its type set back to IDENT when building the tree.

For example, function_id:

|  nid:FUNCTION_IDENT
             { #nid.setType(IDENT); #function_id = #([FUNCTION_ID,
             "FUNCTION_ID"], #function_id); } 
             ;

The node is a FUNCTION_ID, and it contains an IDENT. So when walking the tree, only one kind of identifier exists (IDENT) as in the Express syntax (simple_id).

Since global_ident (see Section 2.3.4.4) appears only in inverse_attr and attribute_qualifier, in place of an attribute_ref, an ATTRIBUTE_REF node is created in global_ident.

2.4.2. Special Cases

Sometimes the grammar has been modified because of parsing issues, but the tree must fit the original Express grammar. So, instead of just attaching the set of items of the rule to the tree node, we need to customize the node. To do that we use the ! option after the rule name. It means that the common node creation functions will not be used; instead we have to build the node ourselves.

We use it with literals: the type of the node depends on the kind of literal it is. In addition, the FLOAT token in the parser does not represent a float number, but just its decimal part (after the '.'). So we customize it so that a FLOAT really becomes a floating point number in the tree.

We also use it in rename_id, just to fit the Express grammar.

2.5. Comments and Token Stream Splitting

So far, comments are discarded, but they are not discarded during lexing. We use a filter (TokenStreamHiddenTokenFilter) to discard two types of tokens: COMMENT and LINE_COMMENT.

COMMENT is the typical block comment (* [...] *), and LINE_COMMENT is a one-line comment starting with --.

We use a filter to discard these tokens because it might be possible to just hide them in the future, that is to say attach them to other tokens without the parser managing them. Then, when building the AST, the hidden tokens could be attached to the tree nodes, making it possible to retrieve them during tree walking.

It means that it is possible to 'not parse' comments, but to be able to access them during the last phase of the translation process. For example, a code generator would be able to include comments at the same place they were in the parsed Express code.

This feature is not implemented yet.

2.6. Walking the Tree

See the parser's general-purpose documentation (Section 1.2.4, Walking/Modifying the Tree).