In order to translate English statements we need to use grammar to define various rules. To explain these concepts, I will use a sample database with the following structure.
Figure 7: Sample database structure
![[Sample database structure]](images/database2.jpg)
Now if we want to see all the data in the 'suppliers' table or 'product' table, we can write natural language queries as,
show me all our suppliers
show me all our products
and these English statements are translated into SQL as,
SELECT * FROM suppliers
and
SELECT * FROM products
respectively. When we analyse the above two English sentences, we can see that these are very much similar in there structure. The only difference between them is the table name. This can the be generalise to form the grammar,
show me all our <table_name>
and the translated SQL statement is,
SELECT * FROM <table_name>
The words within angle brackets can be any table name, and for the above examples they are suppliers and products.
The same statements can also be expressed in slightly different ways. Following are eight different variations of the first English query.
show me all our suppliers show me our suppliers show me all suppliers show me suppliers show all our suppliers show our suppliers show all suppliers show suppliers
These eight queries can be generalised to one single grammar as below.
show (me) (all) (our) <table_names>
The words in brackets are optional words, which can be omitted in the queries.
Furthermore the first word of the query may also be changed to create more queries.
show me all our suppliers
list me all our suppliers
display me all our suppliers
For the above three queries, the first word is substituted with three similar words. This can also be included in our grammar to extend it even further, which leads to the new grammar:
[show|list|display] (me) (all) (our) <table_names>
The one word from the above list (with the square brackets) must be appeared in the query. The above grammar can produce 24 different English statements for one table name, but will translate to only one SQL statement.
SELECT * FROM <table_name>
To increase more translating possibilities, we can define more grammar. However there is a slight problem with this kind of grammar definitions. For example if we define grammar similar to
[what|who] are our <table_name>
This will not only accept queries like,
who are our suppliers
what are our products
, the grammar will also accept queries like,
what are our suppliers
who are our products
which are grammatically incorrect. However correcting this problem is quite hard as we need to compare the table name against the first word. Since this grammar accepts the correct grammatical queries, we can ignore this problem as we can assume that the users will always write the correct English.
More grammar rules can be defined to retrieve information from only one column of a table. For example;
show the names of our suppliers
show the descriptions of our products
statements can be generalise to create the grammar,
show (the) <column_name> of (our) <table_name>
and the translated SQL will look like,
SELECT DISTINCT <column_name> FROM <table_name>
Note that the words names and descriptions are not the exact
column names, but are the plurals of columns name and description.
It may also be possible to have similar words to column names or table names in
the queries instead of the exact words. Therefore we not only need to keep
track of table and column names, but also need to keep track of similar
words. Except for plurals, users need to manually enter other
synonyms. If I was to overcome this problem, I need to include a thesaurus
with SQ-HAL. But due to time limitations, this is not possible. The
word DISTINCT is used in the SQL statement to get rid of any
repeated values as we are looking at only one column.
To retrieve two columns from the database, queries can be written as,
show me names and addresses of our suppliers
show me descriptions and cost of our products
and the grammar for this is,
show (me) <column_name1_> and <column_name_2> of (our)
<table_name>
and the resulting SQL statement will be,
SELECT <column_name_1>,<column_name_2>
FROM <table_name>
More grammar rules can be defined to create SQL statements that contains COUNT and SUM functions. SQ-HAL will only implement these two functions, as they are the most commonly used SQL functions.
The queries with conditions can also be recognised and generalised to form the grammar. For example,
show me the names of our products supplied by ABC
show me the address of our supplier whose name is XYZ
queries will have the grammar,
show (me) (the) <column_name_1> of (our) <table_name> (whose)
<column_name_2> [by|is] <value>
which can translate to the SQL statement,
SELECT <column_name_1> FROM <table_name>
WHERE <column_name_2> = <value>
Many of these conditions can be generalised by separate grammar. Some of these grammar rules and corresponding SQL statements are summarised in the following table.
Table 4: Grammars and SQL for various conditions
|
Sample query |
Grammar |
SQL |
|
… supplied by ABC |
… <column_name> [by|with|is|are|…] <value> |
|
|
… cost is between 10 and 20 |
… <column_name> ([is|are]) between <value_1> and <value_2> |
|
|
… cost is higher than 10 |
… <column_name> ([is|are]) [greater|larger|higher|…] than <value> |
|
|
… cost is less than 10 |
… <column_name> ([is|are]) [less|smaller|lower|…] than <value> |
|
|
… name starts with abc |
… <column_name> [starts|start] with <value> |
|
|
… name ends in abc |
… <column_name> [ends|end|ended] in <value> |
|
|
… name contains abc |
… <column_name> [contains|contains] <value> |
|
If the value consists of more than single word, it must be quoted to recognise as a single value.
Two table queries are another complicated queries that the SQ-HAL should be able to translate. If the query looks like,
show me our suppliers and their products
The grammar for this is,
show (me) (our) <table_name_1> and (their) <table_name_2>
and the translated SQL statement is,
SELECT * FROM <table_name_1>, <table_name_2>
WHERE <relationship_between_tables_1_and_2>
To create this SQL statement, SQ-HAL needs to know the relationships between tables. It is quite hard for the program to determine these relationships automatically. Therefore user assistance is required to determine these. The user has to manually enter these relationships, which can then be stored for future use. However if the user writes a natural language query similar to,
show me our suppliers and orders
but there is no direct relationship between supplier and orders table. Even though this can be translated into a valid SQL as,
SELECT * FROM suppliers, order
It will not produce the expected results and the results may contain thousands of unrelated records. Therefore if the relationship between the two tables is not defined, SQ-HAL should not translate the English query to SQL.
To translate natural language queries, SQ-HAL needs to have a parser, which contains all these grammar rules. The parser also has to know the table names, column names, any relationships between tables and related words for the table and column names
It is not possible to have all possible combinations of grammar included in SQ-HAL parser. Only the most common grammar is included in the parser. For all the other possibilities, a learning facility can be design and implemented to the parser. For example, if the SQ-HAL does not understand the query,
show me our suppliers names
but when rephrased it can understand the query,
show me names of our suppliers
Then the program can analyse the first statement to find table names, column names or conditions and then create a new grammar. By analysing the first English statement, SQ-HAL can create the grammar as,
show me our <table_name> <column_name>
The corresponding SQL statement is the same for the second English query. That is,
SELECT <column_name> FROM <table_name>
Now this grammar and the corresponding SQL statement can be append to the parser so that it can learn the new grammar for other similar statements. Once this grammar is added to the parser, SQ-HAL can translate queries such as,
show me our suppliers address
show me our product descriptions
show me our order dates
Next challenge is storing this new grammar and database structure for future use. There are two options available. These information could either be stored in a file and load every time the user run the program, or save the complete parser every time a change occurs. The first option could be a very slow process if there are lots of new grammars to be learnt. Therefore second option of saving the parser to a file is implemented in SQ-HAL.