4.2 Natural language translation

Rather than writing my own parser for SQ-HAL, I have decided to use Parse::RecDescent module to implement the natural language parser. Parse::RecDescent is a top-down parser, which gives all the advantages of top-down parsing. There are other bottom-up parsers such as Parse::Yapp and perl-byacc [10 pg 29-30], which I could have used. But the Parse::RecDescent has some special features that are more suited for this application. Most importantly, the RecDescent module can,

  1. generate run-time parsers - so we can embed the database structure during the run-time. This is important as we do not know the database structure until we connect to the database during run-time.

  2. modify or extend the parser during run-time - useful for SQ-HALs learning functionality as we need to extend the parser to lean new grammar.

  3. save the parser object to a file and reload it in future - this is another important feature as this enables the parser object to be created only once and new extensions to the grammar can be saved in the same file as the parser rather than a separate file for the new grammar learned. Also loading the parser from a file is significantly faster than creating the parser.

There are many other useful features of this module [9 pg 49], but is not much of an importance for SQ-HAL purposes. The only significant drawback of the RecDescent module is the slow speed in creating the parser object. This is expected from a top-down parser and for the RecDescent module slow speed is more evident due to run-time creation of the parser. Given the advantages of the Parse::RecDescent module, we can afford to trade of the slow speed to other useful features..

The use of Parse::RecDescent module in SQ-HAL can be explained using examples. I will use the same sample database used in design section (see Figure 7, pg 26) throughout this section.

Lets start with some simple grammar as explained in the design section.

Grammar    : [show|list|display} (me) (all) (our) <table_name>

SQL           : SELECT * FROM <table_name>

The above grammar can be translated to perl code that the parser can understand.

my $grammar = q{ 
    translate : select

    select    : ask /(me)?/ /(all)?/ /(our)?/ table_name 
                { "SELECT * FROM $item[5]" }

    ask       : /show|list|display/ 

    table_name:    TABLES
};

Since we do not know database structure at compile time, we cannot include the table names. Therefore during run-time the word TABLES needs to be replaced with actual table names. This can easily be done with the help of regular expressions.

my $table_names = get_table_names();

$grammar =~ s/TABLES/$table_names/;

The words within the curly brackets ("{}") are the actions to execute if the rule is successfully matched. For this case, words within the brackets are returned on a successful match. If the rule does not contain any action, then the matched token is returned. For example, the subrule 'ask may return the word 'show' or 'list' or 'display' if it matches it. Otherwise the subrule will return undef and this will cause that branch of parsing to be invalid.

The variable $item[5] is the 5th token matched for that rule. For the 'select' rule it is the table name.

Once we defined the grammar, parser object needs to be created using this grammar. Following is the perl code used to create the parser.

my $parser = new Parse::RecDescent($grammar);

Once the parser object is created, we can call any of the rules defined in the grammar as a function call. Since this is a top-down parser, only the most general rule is called. Therefore we can translate some natural language query to SQL as below.

my $input = "show me our suppliers";

my $sql = $parser->translate("\L$input");

print $sql;

This block of code will translate the input to SQL and output the results. Since the parsing is case sensitive, the input has be converted to lower case ("\L") before calling the translate function.

Now we can add more grammar rules to the $grammar variable, but do not need to do changes to any other code.

Grammar    : show (the) <column_name> of (our) <table_name>

SQL           : SELECT <column_name> FROM <table_name>

The parser grammar is,

my $grammar = q{
    translate : select

    select :  ask /(me)?/ /(all)?/ /(our)?/ table_name
                   { "SELECT * FROM $item[5]" }

                | ask /(he)?/ column_name /of/ /(our)?/ table_name check[$item[3],$item[5]]

                  { "SELECT $item[3[ FROM $item[6]" }

    ask : /show|list|display/

 table_name : TABLES

 column_name : COLUMNS

    check : <reject !check_column($arg[0], $arg[1]) >
};

The check rule checks whether the given column (argument 0) belongs to a particular table (argument 1) and if no reject the current production.

Example -

show the names of our suppliers - accepted
show the costs of our suppliers
- rejected

Since the grammar itself does not know which column belongs to which table, the check_column function has to know this information. Therefore SQ_HAL has to store the database structure in a variable so that it does have to read the database structure every time.

To add new grammar we only need to modify the $grammar variable. So even at run-time, we can modify the grammar. This form the basis of the learning functionality in SQ-HAL.

Example - lets say SQ-HAL encountered following new grammar during run-time.

Grammar    : how may <table_name> are there

SQL           : SELECT COUNT(*) FROM <table_name>

Following code is use to extend the select rule in the grammar.

$rule = "select";
$production = "/how many/ table_name /are there/";
$action = "{ qq(SELCT COUNT(*) FROM $item[2]) }";

### define the new grammar to be learn ###
my $new_grammar = qq{ $rule : $production $action };

### extend the parser with new grammar ##
$parser->Extend($new_grammar);

This code will append the new grammar to the end of the select rule as a new production. (I.e. joined by "|")

Grammar rules for the conditions are slightly complicated than other grammar. Complication rises where we need to determine the type the value. These conditions can be explained used the following sample code for the 'greater than' condition.

my $grammar = q{
    .
    .
    .

    condition : gt | lt | like | between | equal

    gt        : column_name gt_word value
                { "WHERE $item[1] > " . format_val($item[3]) }

    gt_word   : /(greater|more|expensive) than/

    value     : number | word | date

    number    : /(\$?)(-?)\d+(\.?)\d*/ ### any valid number including dollar values ###

    word      : /\w+|\"[\S\s]*\"/ { qq{ $item[1] } } ### one word or words within quotes ###

    date      : DD date_sep MM date_sep (YYYY|YY)
                { parse_date("$item[3]/$item[1]/$item[5]" }

    DD        : /\d{1,2} ### date - either one or two digits

    MM        : /\d{1,2} ### month - either one or two digits

    YYYY      : /\d{4} ### 4 digit year

    YY        : /\d{2} ### 2 digit year }

Notes:

1. format_val function will format the value into appropriate SQL format. That is, it will put quotation marks around sting values and hashes (#) around date values.

Example -

format_val("20.92")       -> 20.92
format_val("ABC")         -> "ABC"
format_val("22-Oct-1976") -> #22-Oct-1976#

2. The date parsing shown above grammar is only one way of representing the date. There are may other ways of expressing date including dates strings such as "today", "yesterday", etc. SQL does not interpret these date stings and therefore need to convert it into US-standard date string (MM/DD/YYYY). This is done within the parse_date function with the help of Date::Manip module. Even though 20-20-2000 will parse as a date string, the parse_date function will reject it and therefore the parser will not accept it as a valid date. (it even rejects 2/29/1900 but accepts 2/29/2000 - well, SQ-HAL is Y2K compliant)

The complete set of grammar can be found in the appendix 2. This grammar still needs more expansion to translate more general cases, which I leave as further work. Some of the test results of the SQ-HAL parsing can be found in the appendix

Implementation - User Interface