The whole OPL-to-HAL++ compiler was implemented using Mercury. Mercury is a declarative logic programming language, which unlike most other logic programming language, requires the user to provide type, mode and determinism declarations similar to those required by HAL. The Mercury compiler then uses all these information to generate efficient and robust code. The reasons for choosing Mercury are its maturity, the availability of many useful libraries such as Lex and Moose, the robustness and efficiency of the generated code, and its closeness to the HAL language. (Mercury can be considered a subset of HAL, and HAL is compiled to Mercury).
As mentioned before, Lex(a lexical analyser generator) and Moose(a parser generator) are provided as part of the Mercury library. Lex creates the lexer that recognises reserved words, identifiers, punctuation and constants, which are collectively known as tokens. Lex uses regular expressions to identify tokens because certain tokens like identifier names are not known until compile time. For instance, in OPL all valid Identifiers start with a lower case or upper case letter followed by zero or more occurrences of letters, digits or underscores. Thus, the regular expression for an identifier is as follows
alpha ++ *(ident)
In order to build our lexer, we first obtained the OPL syntax from the Optimization Programming Language book[2], and then used it to create the regular expressions that recognised all OPL tokens. After creating the lexer, we begin to work on the parser. The first step in the process was to erradicate all grammar errors and ambiguities. Then the grammar rules were transformed to Moose format. The result was 180 tokens with 132 grammar rules. Once the lexer and the parser were created, we proceed to integrate them. Because both Lex and Moose uses Mercury code, the lexer can be invoked through a standard interface making the integration easier. During the actual parsing, the parser invokes the lexer continuously to obtain tokens until the lexer indicates the end of file or an error occurred during lexing. For each tokens the parser receives, it is matched against the grammar rules for the purpose of constructing a parse tree. A parse tree represents a subset of the possible syntax specified by the grammar rules. Failure to create a parse tree indicates syntax error on the input tokens.
To illustrate the parse tree idea, consider the following Figure 11, which is a simplified version of the grammar rules used in the actual parser. The interesting thing about the grammar rules is that all operators in the rule multiplicative has a higher precedence over the operators in the rule exp. The operator precedence is built into the grammar rules, where operators of higher level precedence appeared in the lower level of grammar rules. Also, all operators within the same grammar rules have the same level of precedence. Such construction ensures the operator precedence is maintained at all times. However, parentheses can overwrite the order of precedence.
Both the exp and multiplicative rules have two arguments of type relexpr and list(string), respectively. The first argument stores expressions meaningfully and without ambiguities. The second argument stores the actual list of variable names used in an expression. This given the expression "2 + 3 * 4", the equivalent parse tree can be created as
+
/ \
2 *
/ \
3 4
and therefore is a valid expression. However, given the expression "2 * 4
+", the parser will report a syntax error since the expression is incomplete
and the parse tree generated is also incomplete based on the above grammar
rules.
Another thing worth mentioning is the propagation of attribute values along the
grammar rules. When the second part of the expression "3 * 4" is matched in the
rule multiplicative, constants 3 and 4 are stored in the
attribute as [CVar
Avar]. This value is passed to the rule exp
where the rest of the expression is matched, and it is then used to form the
complete expression "2 + 3 * 4". Without attributes, the second part of the
expression "3 * 4" is lost without a trace. The use of an attribute grammar
enables code generation to be performed since we can then ensure that the whole
expression is seen.
As demonstrated earlier, during actual parsing it is essential to collect certain information needed later for code generation. Part of these information is available from attributes associated with each grammar rules. However, such information has limited capability as it is not available to all grammar rules. To remedy this problem, we would like to use data structures such as global variables where the information is accessible anywhere. In Moose, the closest thing to a global variable is the data structures declared in the parser_state: a collection of data structures recording the state of the parser during the actual parsing, which is accessible to any grammar rules. We have therefore created two data structures, in the parser_state, that assist in code generation: the typetable and the identifier.
A typetable stores all the information regarding user-defined types such as the type name, functor, type (i.e. enum, set, range or struct) and the type value. Figure 2 shows how different data type declarations are stored in the typetable. With the availability of typetable, the compiler can implement limited type checking to ensure the consistent use of variables. For example, given the OPL declarations:
range Boolean 0..1; [1]
var Boolean success; [2]
Line 1 declares Boolean as type range with a lower bound of 0 and an
upper bound of 1. Line 2 declares a variable name success of type
Boolean. When the first statement is read in, an entry will be created
in the typetable storing typename Boolean together with the type
(range) and the values 0 and 1. Later on, when the second statement appears, the
compiler will try to look up type Boolean in the typetable. Since the
entry already exists in the typetable, the statement passes type checking and
all relevant code will be generated. However, if the type Boolean is
not created before its use the compiler(as requested by OPL) will raise an error
and stop parsing. An identifier table stores all the variables declared in a model. An identifier is either a data variable or a decision variable with a type supported by OPL. In essence, each variable declaration has a corresponding entry in the identifier table. Each entries are composed of the following information, namely identifier name, type and values. All variable names in HAL start off with 'V_' followed by the actual variable name and a number. For instance, the variable name 'colour' in OPL has the corresponding variable name 'V_color_0' in HAL.
There are several advantages when using an identifier table. First, it ensures all variable names are declared before they are used. And second, it allows the compiler to replace variable names with the actual values. Figure 3 shows how different types of variables are stored in the identifier table.
|