Contents
1.1 Monolingual and Multilingual Dictionaries
1.2 Aims
1.3 Thesis overview
Chapter 2 Past work
2.1 Dictionaries
2.2 Electronic dictionaries
3.2 Limitations with multilingual dictionaries
3.3 Operations, requirements and tradeoffs
Chapter 4 Internal representation of the XML document
4.1 XML
4.2 XML information access
4.3 DOM (Document Object Model)
4.4 SAX (Simple API for XML)
4.5 Object orienting the XML file
4.6 Other XML tools
4.7 Future XML tools
5.5 Additional considerations
Chapter 6 Interface to an electronic dictionary
6.1 Query interface options
6.2 Viewing search results
6.3 XSL
Chapter 7 Conclusion
Appendices
Bibliography
The purpose of the research was to investigate and analyse the creation of a multilingual dictionary application using JMdict as a study target. The JMdict file is an XML document containing multilingual (Japanese-English-German) dictionary entries. A dictionary application is made up of several major programming components: user input of query information, query search and retrieval, and display of information. These aspects of multilingual dictionary application creation were investigated, with a major focus on the internal representation of dictionary entries in the application. Current XML parsers were found to be inappropriate as a form of data representation and retrieval on their own. The SAX parser could possibly be used along with the most favourable method, the indexing technique. Multilingual dictionary files, the XML format, the handling of foreign characters using Unicode and the input of foreign characters were all investigated. Furthermore, information retrieval processing such as stemming techniques and spelling suggestions was also investigated. This research can serve as a guide for programmers wishing to find out about creating multilingual dictionary software.
Acknowledgements
First and foremost, my biggest thanks go to my supervisor, Associate Professor Jim Breen. His advice and inspiration have been invaluable throughout the thesis. In addition, without his hard work on JMdict, this project would not have been possible. Thanks for all your help.
Thanks to my friends (they know who they are) for their encouragement and criticisms, and also for being there almost every step of the way with me on this journey.
Last but not least, I would like to give sincere thanks to my family and relatives, especially my Mother Maria for helping with proofreading and providing moral support.
Chapter 1
Introduction
1.1 Monolingual and Multilingual Dictionaries
Dictionaries are an essential part of learning and education and are a centuries old teaching tool used all over the world in a multitude of different languages. The Oxford English Dictionary defines a dictionary as “A book dealing with the individual words of a language (or certain specified classes of them), so as to set forth their orthography, synonyms, derivation, and history…”[1] Dictionaries began life as topic specific dictionaries and also as bilingual dictionaries (Roman-Latin and also Chinese dictionaries) as early as the 16th century, so the use of bilingual dictionaries is not a new concept. Over the years, dictionaries also broadened their scope, becoming more field-specific: for example, music, medical and scientific dictionaries. Soon, dictionaries translating between several different languages emerged, resulting in the creation of multilingual or interlingual dictionaries. Indeed, multilingual dictionaries provide a powerful tool for students to enhance their learning of another language.
Despite the range of dictionary functionalities, the structure of the print dictionary has always been constant, closely following the definition of a dictionary. Being able to search through one dictionary would enable readers to search through all dictionaries in the same way. English dictionaries are all laid out from A-Z (or in some cases A-K and L-Z). To find a word in these dictionaries simply requires the user to search alphabetically to find the entry. Asian language dictionaries, indexed by ideographic characters, may have a slightly different format in that they may order the dictionary via stroke or radical order. In any case, the reader has to manually sift through pages in a set pattern to find their definition or translation. There was no other known way to search for information within a dictionary [Atkins, 1985].
As computers have grown more powerful and widespread, the opportunity for automated processes and digital storage has drawn closer. Programmers, developers and lexicographers are no longer constrained by the conventional format of the dictionary, and students and teachers are no longer constrained in their dictionary searching techniques. Applications that once required paper or tape storage can now be held on digital storage media a fraction of the size required before. Dictionaries, which require relatively large amounts of physical storage space and mass, can be stored efficiently in digital form. Ideas previously thought unfeasible, such as an electronic dictionary, suddenly became a possibility. However, this opened up a whole new set of questions to be answered in terms of storage and information. Problems such as dictionary data storage, multimedia and pictures are all fields that need research in the context of dictionary file construction. Different file structures have been used to create electronic dictionary files. As new storage formats are used, they open up a whole host of different, new and sometimes unknown methods of information retrieval and representation. Today, there are many incarnations of electronic dictionaries, from portable handheld devices to stand-alone PC-based programs and even on-line World-Wide-Web dictionary systems.
1.2 Aims
Despite the rapid improvement of electronic dictionaries and the emergence of palm-top bilingual dictionaries, past research into electronic dictionaries and information retrieval has not focused on the multilingual dictionary in the electronic medium. This area of study incorporates research into conventional dictionaries, but many issues are left unanswered. Little work has been done on ways of representing, storing and displaying the multilingual dictionary, or on understanding what its users require. The aim of the research undertaken in this thesis was to explore the applicability and suitability of the JMdict XML-based Japanese-English-German dictionary file as a study target.
The problem to be investigated does not come under one distinct category. Instead, it is made up of several major aspects which, when combined, result in an analysis of a complete system; this is what makes the research original. The thesis attempts to be a guide for programmers interested in creating multilingual applications. The problem statement can be described as follows:
“To investigate and compare old and new
storage techniques, data formats, representation of information and methods of
information retrieval. In addition to these aspects, to investigate the
applicability of these techniques in a multilingual dictionary environment.
Finally, to explore the ways in which a system can be presented to a user and
understand user needs.”
1.3 Thesis overview
The major aspects to be
investigated are as follows: data storage format, internal representations of
the document and the interface with the electronic dictionary. The following
chapter headings mark the discussion of these topics in further detail:
Chapter 2 reviews the past work carried out in the large field
of lexicography, electronic dictionaries and in particular, the area of
information retrieval. With information retrieval, various fields of research
into cross language information retrieval and text retrieval have been
reviewed, in particular, research into query translation and machine
translation.
Chapter 3 discusses the multilingual data available at hand.
It also describes the types of encoding methods available for different
languages. More importantly, Chapter 3 identifies the major (and minor)
operations required from any dictionary and from a multilingual dictionary in
particular. The limitations of multilingual dictionaries and in particular, the
JMdict file, will be examined.
Chapter 4 introduces the concept of XML as a data storage
format. The internal representation of dictionary data is extremely important
as it determines the ease with which data can be retrieved for the user. The merits
and shortcomings of various XML based tools are discussed along with the issue
of object orientation and its suitability in the dictionary setting.
Chapter 5 provides a more in depth look at other techniques
for efficient organization of data for retrieval. The format of the data
storage, ease of data extraction and flexibility in the types of data being
extracted will be investigated.
Chapter 6 provides insight into what the user seeks from an electronic dictionary. It is important in all software to make sure the user can operate it successfully and easily. In the same vein, the program must also be able to provide the information that the user wants. Chapter 6 will identify what kinds of information different types of users want and provide possible solutions for displaying information.
Chapter 7 will complete the discussion on multilingual
dictionaries and provide some suggestions for further work and the direction in
which research is heading in this topic.
Chapter 2
Past work
Because of the nature of this thesis report, extensive research into both the technological component and language aspects of dictionaries was carried out. Without a full understanding of the whole picture, further investigations cannot occur.
2.1 Dictionaries
‘Dictionary’ and ‘lexicon’ are synonymous terms; the term ‘lexicography’ is defined as the process of compiling dictionaries. A bilingual dictionary is usually produced to take users from one source language (L1) to a target language (L2) (for monolingual dictionaries, L1 and L2 are the same language). There is sometimes also confusion over the differences between dictionaries and encyclopaedias. Some believe they are closely linked and consider them interchangeable, but they are different kinds of reference books with different purposes [2]. Dictionaries produced can be either:
Multilingual and bilingual dictionaries
Before the construction of a multilingual dictionary, the creator must consider these issues:
- The users of the dictionary
- The purpose of the dictionary. There are four purposes towards which a dictionary can be aimed:
o Technical dictionary
o Learning dictionary
o Reference dictionary
o Comprehension dictionary
- Whether the dictionary is used to understand a foreign language or to produce text in a foreign language
- The native language of the user: whether the user is a native speaker of the source or the target language.
There are 4 types of multilingual dictionaries
Many experts agree that the trouble with most dictionaries is that they try to cater for the needs of speakers of both the source and target languages together. Some believe that it is impossible to pay equal attention to both in the same volume. Because of the limited size of normal paperback dictionaries, the editor usually has to select the entries to go into the dictionary on the basis of the purpose of the dictionary [Al-Kasimi, 1977] [Hartmann, 1983].
There are several criteria for distinguishing multilingual dictionary types, such as whether they include encyclopaedic information and whether or not they take account of changes in the language. There are three major types of multilingual dictionaries:
Machine translation requires detailed grammars of the source and target languages, an interlingual grammar, a comprehensive bilingual dictionary, and complex computer programming systems to store, process and retrieve the data. An ordinary dictionary is expected to provide only the information which the user needs, but a dictionary for machine translation contains all the grammatical information about both languages [Al-Kasimi, 1977]. The major focus of this thesis is to investigate the criteria for dictionaries for human users.
Comparison between bilingual and monolingual dictionaries
Monolingual and multilingual dictionaries differ in several respects, according to the users they intend to serve, the needs they cater for, and the methods of their compilation. Bilingual and monolingual dictionaries show systematic variations in their approach to the wordlist (the stock of vocabulary to be listed). A bilingual dictionary’s wordlist can range from a few thousand of the most frequent items to an extensive list as large as that of a monolingual dictionary. Furthermore, bilingual dictionaries contain not one but two discrete wordlists (L1 and L2). Another fundamental difference is that an entry in a monolingual dictionary takes the form of a definition, whilst a bilingual dictionary attempts to provide an equivalent, or a series of equivalents, in the target language [Beryl, 1985]. The layout of the entries and the search techniques may also differ. Some foreign languages can be searched according to phonetics, character construction or even character codes.
Dictionary users
There are many different types of users who use dictionaries. It is very important to understand what kind of people use the dictionary; otherwise there will be no defined audience, resulting in a dictionary lacking focus. Users have different requirements, and these will be reflected in the purpose of the dictionary via the type, number and structure of its entries. Hartmann [1983] categorises the uses and the users of dictionaries:
Factors in dictionary use
| Information | Meanings/synonyms; Pronunciation/syntax; Spelling/etymology; Names/facts |
| Operations | Finding meanings; Finding words; Translating |
Situations in dictionary use
| Users | Child; Pupil/trainee; Teacher/critic; Scientist/secretary |
| Purposes | Extending knowledge of the mother tongue; Learning foreign language; Playing word games; Composing a report; Reading/decoding foreign language texts |
Table 1 Uses and users of dictionaries
2.2 Electronic dictionaries
There are many potential advantages of an electronic dictionary over a conventional dictionary. The amount of information accessible to a user is now huge. Cost is another advantage: storing a dictionary on an electronic medium such as a disk or CD is far cheaper than printing and distributing it. Furthermore, updates to the dictionary no longer mean republishing a complete new edition; an update patch with the new data can be distributed very easily. Most conventional dictionaries cannot contain all the information for all their entries, and are instead selective in their coverage of aspects such as pronunciation, spelling, etymology and idioms. An electronic dictionary is free of this constraint: the amount of data it holds depends only on how much a lexicographer is willing to put into the file [Hartmann, 1983].
Depending on the data format the dictionary is stored in, it can be much easier to insert new entries into the dictionary. For example, introducing additional attributes to a file such as JMdict is very easy. This leads to the question of size. A 20 volume dictionary no longer needs to belong in a state library; it is possible to have it on your lap, in your laptop. Near unlimited amounts of free space allow the incorporation of features that were impossible in print dictionaries. Some larger print dictionaries were able to incorporate several pictures per page. Pictures enhance the presentation of the dictionary, but they also serve as a valuable resource for word-picture recognition. Multimedia is another feature that can be incorporated into electronic dictionaries. A user can not only understand a word and read its phonetics, but now also hear it; the best way to learn to say a word is to hear it first. Portability is another reason for the superiority of the electronic dictionary. A set of dictionary files can easily fit on portable storage media such as a CD, so it can be taken anywhere and used whenever needed, with no need to go to a library to find the most comprehensive dictionary. Accessibility is also a significant issue. Large dictionaries are not produced in large quantities because of lack of demand and high production costs. However, dictionary files, even large ones, are easily transmitted over a medium like the internet, which means that people all around the world can benefit from a most valuable resource.
One of the most important advantages that electronic dictionaries offer is transparency: an electronic dictionary allows access to the whole of its contents via formalized categories, so interesting features of the vocabulary can be investigated directly. Some dictionaries used to be created from scratch specifically for beginners and infants. Now a single dictionary is able to satisfy the needs of all users. How can this be achieved? Electronic dictionaries provide different ways of presenting data to the user. The format of entries no longer has to be fixed as in conventional dictionaries. Computerised dictionaries have the potential to be customized according to the needs of the user [Hartmann, 1983].
The speed of information retrieval that electronic dictionaries deliver may come at the cost of the memory retention benefits that manually searching through a paper dictionary provides. Using a paper dictionary usually gives the user the chance to look at other entries surrounding the word on the same page. This is not to say that an electronic dictionary cannot do this, provided the correct design principles are applied.
There are many different types of electronic dictionaries available. To complete the analysis of electronic dictionaries, some of these files should be studied and acknowledged. Breen’s [2000a] EDICT file, KEBI and EDR are studied. The structure and contents of these files are quite different. They are discussed in detail in Appendix A.
2.3 Information retrieval
There has been limited documented research into dictionary files or electronic dictionary representation. Instead, the majority of research related to electronic dictionaries has been in methods of cross language information retrieval. The following are the basic information retrieval processes that are carried out:
- Representing the information needed – Query formulation
- Representing documents – Indexing the text
- Comparing these representations – Retrieval
- Evaluating the retrieved documents, and if necessary, return the query and entry [Croft et al.]
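The first three processes above can be sketched in a few lines of Python. This is only an illustrative sketch, not a description of any system reviewed here: the entries, the whitespace tokenizer and the function names (`tokenize`, `build_index`, `retrieve`) are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Query formulation / document representation: a deliberately naive tokenizer."""
    return text.lower().split()


def build_index(entries):
    """Indexing: map each token to the set of entry ids whose text contains it."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for token in tokenize(text):
            index[token].add(entry_id)
    return index


def retrieve(index, query):
    """Retrieval: return the ids of entries containing every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical entries, invented for illustration only.
ENTRIES = {
    1: "dictionary reference book of words",
    2: "electronic dictionary on a computer",
    3: "computer program",
}
INDEX = build_index(ENTRIES)
```

The fourth process, evaluation, is left to the user in this sketch: the caller inspects the returned ids and, if necessary, issues a refined query.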
Various
aspects of these processes may be relevant and useful when investigating
electronic dictionaries. Information retrieval has become an important issue
because there is now emphasis on the efficiency of the retrieval and also the
accuracy and amount of information that can be returned from a search. There have
been several articles published by various researchers who have attempted to
carry out cross-language information retrieval (CLIR).
Of the issues raised by information retrieval, some of the problems encountered are relevant for investigations of electronic dictionaries and are listed as follows:
- Character encoding - Documents written in different languages have to be stored and subsequently displayed in a non-ASCII environment. This problem has been tackled and one solution has been provided by JMdict.
- Segmentation of Chinese and Japanese - Unlike English, these Asian languages are usually written without separated lexical elements. Tokens need to be identified as strings of characters so that searches can be carried out on them.
Text retrieval systems, such as the INQUERY system (based on probabilistic retrieval via a Bayesian net framework) [Croft et al.] [Callan], appear to be too complex and large for a task such as searching through a database of entries. Other attempts have also been made at multilingual text retrieval on a larger medium such as the World-Wide-Web. Projects such as MULINEX have been set up to develop tools for cross language retrieval, using retrieval systems such as Fulcrum SearchServer and SurfBoard [Erbach 1997].
Machine translation
Machine translation (MT) is an application related to electronic dictionaries, but considerably more complex. Electronic dictionaries set out to translate a single word, phrase or character into another language, whereas MT attempts to translate whole sentences or documents. MT does use electronic dictionary files to carry out the translations, but this takes a lot more effort than simply looking up a dictionary file. For example, MT performs a linguistic analysis so that the most suitable Japanese word can be matched to the English query request [Jones, 1999]. In further studies, Chinese queries were also used to evaluate the effectiveness of different MT techniques [Kwok, 1997]. At the moment, the state of development of MT is such that translation requires human aid to complete. This is referred to as machine-aided translation. Some current commercial MT software (such as Logovista E to J) enables some form of translation but still requires pre and post editing of the translation for acceptable conversion from English to Japanese [Eichmann et al., 1998].
Dictionary-based retrieval can be carried out by breaking the query up into the root forms of its words, searching the dictionary file for equivalents of each word and substituting the translated words with the highest precision [Ballesteros, 1996]. One problem with this type of translation is that an incomplete dictionary will produce inconsistent results. Furthermore, ambiguities in translation can introduce substantial error [Eichmann et al., 1998]. On another research front in the same area, Fujii [1993] investigated the effects of character based versus word based indexing techniques for text retrieval. It was found that character based indexing and retrieval was the more efficient of the two techniques.
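The word-by-word substitution described above can be illustrated with a minimal Python sketch. It is not the method of any of the cited papers: the glossary is invented, lower-casing stands in for reduction to a root form, and taking the first listed equivalent stands in for choosing the translation "with the highest precision". The second return value makes the incomplete-dictionary problem visible.

```python
# Hypothetical English-to-German glossary; the entries are illustrative only.
GLOSSARY = {
    "dictionary": ["Wörterbuch", "Lexikon"],
    "word": ["Wort"],
    "book": ["Buch"],
}


def translate_query(query, glossary, root_of=str.lower):
    """Translate a query word by word.

    `root_of` is a placeholder for a real stemmer. Words with no entry in
    the glossary are passed through unchanged and reported separately.
    """
    translated = []
    untranslated = []
    for word in query.split():
        equivalents = glossary.get(root_of(word))
        if equivalents:
            # Assumption: the first listed equivalent is the preferred one.
            translated.append(equivalents[0])
        else:
            translated.append(word)
            untranslated.append(word)
    return " ".join(translated), untranslated
```

For instance, `translate_query("rare word", GLOSSARY)` translates only the second word and reports `"rare"` as untranslated, which is the kind of inconsistency an incomplete dictionary introduces.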
A different method of cross-language information retrieval uses a multilingual thesaurus [Eichmann et al., 1998]. This bears a resemblance to the specialised Japanese-English dictionary files that Breen has created (such as COMPDIC) in that controlled vocabularies are stored in the thesaurus. The medical thesaurus studied, called a metathesaurus, is multilingual and supports a range of European languages. Conventional thesaurus-based retrieval requires the queries to be matched in the thesaurus by their representation, and CLIR follows the same method of retrieval. From their tests, Eichmann et al. discovered that the best results came from choosing words that contained only query words. The performance of metathesaurus based retrieval did not exceed that of dictionary-based retrieval [Eichmann et al., 1998]. This evidence indicates that using a thesaurus to translate can be effective, but that a thesaurus is not as flexible as a dictionary.
Jones [1999] tested several different translation methods in one study. Of these, the interesting ones were: ‘using a bilingual dictionary to return all the definitions of the corresponding English query’ and ‘using the bilingual dictionary to return a single default translation for the matching English word’. Jones claimed that there was literature arguing that full machine translation was unsuitable, but did not back up this claim with any evidence. Despite this, the results of the study indicated that full machine translation produced favourable results compared to dictionary term lookup. Again, Jones did not disclose any references describing the method of full machine translation. Hull and Grefenstette [1996] mentioned in their paper that the ‘performance of MT systems in the setting of general language translation is dismal enough to make this option less than entirely satisfactory’. This claim has also been backed up by Pirkola [1998], Oard and Dorr [1996] and Yamabana et al. [1996].
Hull and Grefenstette [1996] discussed five different definitions of multilingual information retrieval (MLIR). None of the definitions presented in the paper took into account the use of a pure multilingual dictionary to carry out MLIR. A multilingual dictionary can play a powerful role in the development of information retrieval across multiple languages. The definitions mentioned seemed to revolve around a collection of different dictionaries working either in parallel or in combination to produce multilingual translations. This narrow focus could have been a result of the authors’ objectives, which were centred around the development of a ‘query translation module’ that could be easily built on top of an already existing information retrieval system.
To
further demonstrate the diversity in methods and ideas being exercised, Pirkola
[1998] used a special dictionary and a general dictionary in query translation.
It was found that this technique was highly efficient [Pirkola, 1998].
Several CLIR techniques appear to have arisen from a wish to avoid constructing a specific dictionary file for research. Creating a dictionary would indeed be time-consuming; however, it is believed that once a ‘standard’ multilingual dictionary is produced, it will relieve some of the problems faced by researchers and open new doors in CLIR.
From the review of the cross language retrieval papers, it seems that the scope and depth of CLIR is much too broad for most of its aspects to be relevant to multilingual dictionary research. CLIR attempts to tackle complete indexes of documents in many different languages. The research that has been uncovered tends to be focused on either efficient retrieval of the documents most relevant to a search string query, or the complete translation of particular documents from one language to another. An example of this would be a Greek speaker who needs documents on a certain topic. The user would enter a search query for keywords in the documents. If a Finnish document containing the keywords was located, the document would then be machine translated into Greek for display [Hlava et al., 1997]. Techniques and ideas raised by these studies may be applicable to the current research.
Stemming
The morphology of a language may mean that words in their plural or past tense forms cannot be queried effectively by the search engine, because they sometimes differ in construction from the root word. The words ‘run’, ‘ran’ and ‘running’ are all related, and this should be reflected in the search by retrieving the root word ‘run’. This is not usually a problem when searching through a paper dictionary: whilst flicking through the pages, users may come across the root word of their query because most related words are found quite close together. The stemmer widely believed to be relatively reliable is the suffix-stripping algorithm created by Porter [1980]. The algorithm contains a set of rules, or steps; the query word is compared with each of the rules and is modified if its suffix matches. Code for the Porter suffix stripper is relatively easy to construct using the guidelines outlined in Porter’s [1980] paper. Using a stemmer for query modification can help to increase the chance that a matching entry is found.
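The idea of comparing a query word against an ordered list of suffix rules can be sketched as follows. This is emphatically not Porter's algorithm: the handful of rules, the minimum stem length and the function name `strip_suffix` are assumptions chosen only to illustrate the technique.

```python
# Illustrative (suffix, replacement) rules, tried in order.
# This is NOT Porter's rule set, only a sketch of the same idea.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("ss", "ss"),
    ("s", ""),
]
MIN_STEM = 3  # assumed minimum stem length, so that "sing" is not cut to "s"


def strip_suffix(word):
    """Return a crude stem of `word` using the first matching rule."""
    word = word.lower()
    for suffix, replacement in RULES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= MIN_STEM:
            stem = word[:stem_length] + replacement
            # Undo consonant doubling, e.g. "runn" -> "run" (but keep "fall").
            if (suffix in ("ing", "ed") and stem[-1] == stem[-2]
                    and stem[-1] not in "aeioulsz"):
                stem = stem[:-1]
            return stem
    return word
```

With these rules ‘running’ and ‘runs’ both reduce to ‘run’, but the irregular form ‘ran’ is returned unchanged; suffix stripping alone cannot relate it to its root, so irregular forms would need a separate lookup.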
Spelling
Users may appreciate options in the application that suggest possible words if they have entered a word incorrectly, or if they do not know how to spell it. Words can be similar in two ways: they can sound alike, or they can be structurally similar. Two techniques could be implemented to handle misspellings: Soundex and the Levenshtein Distance Algorithm.
The Levenshtein Distance Algorithm (LDA) provides a simple metric for testing the similarity of words. It can therefore be used as a form of spelling checker when entries cannot be found in the dictionary. If two similar words are compared, they can be ‘aligned’ so that sequences of letters in each of them match each other. For example, the words ‘word’ and ‘bird’ are quite similar. They can be aligned in this way:
B I R D
    | |
W O R D
This example is rather simple, because there is no other sensible way to align the words. However, aligning the two words ‘WORD’ and ‘BEARD’ is more difficult: although there are similar sequences in each of them, there are several possible ways to align the rest of the letters. Some of the solutions are:
B E A R D    B E A R D    B E A R D
      | |          | |          | |
  W O R D    W . O R D    W O . R D
To convert ‘word’ to ‘beard’, letter insertions, deletions or replacements can be carried out. For example:
W O R D  ->  W O A R D  ->  W E A R D  ->  B E A R D
  [Letter insertion]  [Letter replacement]  [Letter replacement]
The Levenshtein distance between two words is defined as the minimum number of operations required to transform word 1 into word 2. There are also phonetic comparisons that compare the pronunciation of words to determine similarity; however, these techniques are very costly and complex. The Levenshtein Distance Algorithm assigns a score to each change, deletion or addition required to make the strings equal. The final distance is the sum of these scores. A threshold is set which determines whether a string is considered similar or dissimilar. In the event that the user enters an incorrect query string, this algorithm can suggest words with low Levenshtein distances from it.
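A minimal Python sketch of the distance calculation and the threshold-based suggestion step is given below. It assumes a score of 1 for every insertion, deletion and replacement; the function names, the default threshold of 2 and the candidate word lists are illustrative assumptions rather than part of the original study.

```python
def levenshtein(a, b):
    """Minimum number of single-letter insertions, deletions and
    replacements needed to turn string `a` into string `b`."""
    # previous[j] holds the distance between the first i-1 letters of a
    # and the first j letters of b.
    previous = list(range(len(b) + 1))
    for i, letter_a in enumerate(a, start=1):
        current = [i]
        for j, letter_b in enumerate(b, start=1):
            cost = 0 if letter_a == letter_b else 1
            current.append(min(
                previous[j] + 1,         # delete letter_a
                current[j - 1] + 1,      # insert letter_b
                previous[j - 1] + cost,  # replace (or keep) the letter
            ))
        previous = current
    return previous[-1]


def suggest(query, words, threshold=2):
    """Return the words within `threshold` edits of `query`, closest first."""
    scored = sorted((levenshtein(query, word), word) for word in words)
    return [word for distance, word in scored if distance <= threshold]
```

Under these assumptions the distance from ‘word’ to ‘bird’ is 2 and from ‘word’ to ‘beard’ is 3, matching the one insertion and two replacements shown in the example above. The dynamic-programming table costs time proportional to the product of the two word lengths, so a real application would probably restrict the candidate list before scoring it.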
The Soundex code is an indexing system that translates names into a 4 character code consisting of 1 letter and 3 digits. Its most familiar application has been by the US Bureau of the Census to create an index for individuals using their surnames.
The advantage of Soundex is its ability to group names by sound rather than by exact spelling. All the words that have a similar sound are grouped together by having the same Soundex code.
There are several rules when creating Soundex codes:
- All Soundex codes have 4 alphanumeric characters
o 1 Letter
o 3 Digits
- The first letter of the name is the first character of the Soundex code.
- The 3 digits are defined sequentially from the name using the Soundex Key chart
- Adjacent letters in the name belonging to the same Soundex Key code number (such as a double ‘r’) are assigned a single digit.
| Code | Letters |
| 1 | b p f v |
| 2 | c s k g j q x z |
| 3 | d t |
| 4 | l |
| 5 | m n |
| 6 | r |
| No code | a e h i o u y w |
Table 2. Soundex Key table
If users misspell the first
letter of the word, the Soundex system is unable to retrieve the correct
Soundex code. Instead a list of words starting with the same letter will be
retrieved.
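The rules above can be sketched in Python as follows. The sketch uses the standard Soundex key, in which ‘v’ belongs to group 1, and pads short codes with zeros so that every code has 4 characters. It is a simplification: the census variant also treats letters separated only by ‘h’ or ‘w’ as adjacent, which is not implemented here, so a few names (for example ‘Ashcraft’) are coded differently. The names `SOUNDEX_KEY` and `soundex` are assumptions for the example.

```python
# Standard Soundex key: letter -> digit. Letters not listed have no code.
SOUNDEX_KEY = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(name):
    """Return a 4-character code: the first letter plus up to 3 digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_KEY.get(letters[0], "")
    for letter in letters[1:]:
        digit = SOUNDEX_KEY.get(letter, "")
        # Adjacent letters with the same digit are coded only once.
        if digit and digit != previous:
            code += digit
        previous = digit
    # Pad with zeros, then truncate to 1 letter + 3 digits.
    return (code + "000")[:4]
```

With this sketch ‘word’ and ‘ward’ share the code W630, so either spelling retrieves the other, whereas ‘bird’ is coded B630 and would never be suggested for a query starting with ‘w’, which illustrates the first-letter limitation.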
Chapter 3
Data, concepts and tasks
The encoding of characters, especially of an international character set, is important when considering a multilingual dictionary. In the past, different encoding schemes have been used; standardization of codes was non-existent. Studying the options available for encoding a multilingual file is therefore a required point of discussion. Japanese, Chinese and Korean text representation is discussed in detail in Appendix B. The current method of international standardization of encoding is called Unicode.
3.1 Unicode
The Unicode Standard is a
superset of all characters in widespread use today. It unifies character sets
from around the world, making multilingual software easier to write,
information systems easier to manage and information exchange around the globe
more accessible. It contains the characters from major international and
national standards as well as prominent industry character sets. For example,
Unicode incorporates the ISO/IEC 6937 and ISO/IEC 8859 families of standards,
the SGML standard ISO/IEC 8879, and bibliographic standards such as ISO 5426.
Important national standards are included within Unicode: ANSI Z39.64, KS C
5601, JIS X 0208, JIS X 0212, GB 2312, and CNS 11643. The primary goal of the development effort for the Unicode Standard was to remedy serious problems common to most multilingual computer programs: the overloading of character encodings; multiple, inconsistent character codes caused by conflicting national and industry character standards; and the inadequacy of 7 and 8 bits (a maximum of 256 characters) for representing the global character set. In Western European software environments, for example, there is often confusion between the Windows Latin 1 code page 1252 and ISO/IEC 8859-1 [The Unicode Consortium, 1996].
The Unicode project began in 1988, when the inconsistent groups of international character sets were affecting publishers of scientific and mathematical software, newspapers, book publishers, bibliographic information services, and academic researchers. In 1991, the ISO Working Group responsible
for ISO/IEC 10646 (JTC 1/SC 2/WG 2) and the Unicode Consortium decided to
create one universal standard for coding multilingual text. Since then, the ISO
10646 Working Group (SC 2/WG 2) and the Unicode Consortium have worked together
very closely to extend the standard and to keep their respective versions
synchronized.
Although the character codes
are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard
imposes additional constraints on implementations. Unicode 2.1 has the same
character repertoire as ISO/IEC 10646-1:1993 and Unicode 3.0 has the same
character repertoire as ISO/IEC 10646-1:2000. Unicode uses a fixed-length 16-bit representation called UCS-2. The full 16-bit code space (65,536 code positions) is available to represent characters. For compatibility with other environments, there are two transformation formats that convert Unicode for 8-bit or 7-bit environments: UTF-8 (Universal Character Set Transformation Format, 8-bit) and UTF-7. Furthermore, Unicode stands out from other standards
as it only deals with character codes, leaving the glyph shape and construction
to font vendors [The Unicode Consortium, 1996].
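The difference between the 16-bit form and the 8-bit transformation format can be seen by counting bytes. The following is a small sketch using the standard java.nio.charset.StandardCharsets class; the class name EncodingSizes and its two helper methods are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes needed to store the text in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed in 16-bit form (big-endian, no byte order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Three Kanji characters spelling "nihongo" (the Japanese language).
        String nihongo = "\u65E5\u672C\u8A9E";
        System.out.println("abc:     " + utf8Length("abc") + " / " + utf16Length("abc"));
        System.out.println("nihongo: " + utf8Length(nihongo) + " / " + utf16Length(nihongo));
    }
}
```

ASCII text takes one byte per character in UTF-8 and two in the 16-bit form, while each of these Kanji takes three bytes in UTF-8 and two in the 16-bit form. Which encoding produces the smaller file therefore depends on the mix of languages in the dictionary.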
When considering Japanese character representation, there is no longer any need to convert between the different character set encoding standards; all Kanji, Katakana and Hiragana characters are supported in Unicode [The Unicode Consortium, 1996]. Since a standard for Japanese representation is available, it was logical that the JMdict file be encoded in UTF-8. As for Kanji, characters that also occur in the Chinese and Korean character sets share the same character codes, further decreasing the number of character codes required to represent Chinese characters.
Because each Unicode character is a 16-bit value, it cannot be handled like an ordinary ASCII character value. Programming in Unicode may be more troublesome if the programming language cannot handle 16-bit characters easily. C++ is a very popular programming language; however, support for international languages was not taken into account when the language was designed, so its support for representing Asian characters is rather weak. Java has several advantages over C++ when it comes to international language processing because it was designed to handle Unicode as standard. It was among the first widely used programming languages to have built-in support for Unicode. Java is therefore a useful language for internationalization and is the recommended programming language for creating a dictionary application. When converting between UCS-2 and a locale-specific encoding, mapping tables are required because no code conversion algorithm exists. This conversion between different codes is carried out via table-driven conversion.
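Table-driven conversion can be sketched as a simple lookup. The example below is an illustration only: the class JisToUnicode and its method convert are hypothetical, and the table holds just three JIS X 0208 entries, whereas a real mapping table lists thousands of code pairs.

```java
import java.util.HashMap;
import java.util.Map;

class JisToUnicode {
    // Tiny excerpt of a mapping table: JIS X 0208 code -> Unicode character.
    private static final Map<Integer, Character> TABLE = new HashMap<>();
    static {
        TABLE.put(0x2422, '\u3042'); // Hiragana A
        TABLE.put(0x2522, '\u30A2'); // Katakana A
        TABLE.put(0x3021, '\u4E9C'); // first Kanji of JIS X 0208
    }

    // Converts a sequence of JIS codes; codes missing from the table become
    // the Unicode replacement character.
    static String convert(int[] jisCodes) {
        StringBuilder out = new StringBuilder();
        for (int code : jisCodes) {
            Character mapped = TABLE.get(code);
            out.append(mapped != null ? mapped.charValue() : '\uFFFD');
        }
        return out.toString();
    }
}
```

Because there is no arithmetic relationship between the two code spaces, every character needs its own table entry. In practice a Java program would normally rely on the conversion tables already built into the platform rather than maintain its own.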
3.2 Limitations with multilingual dictionaries
A multilingual dictionary is an excellent idea on first thought. It
would be ideal to have a dictionary that could translate English into many
different languages. However, depending on the data format of the dictionary,
there can be limitations when it comes to building the dictionary file. These limitations range from the structure of the data file to clashes between the languages themselves.
Although the use of XML is a very important step towards making the dictionary file flexible and extensible, there are other considerations in the design, content and lexicographical aspects of the dictionary file. Although not such an important issue, the size of a single dictionary file may become a problem as more entries are added and multiple languages are included in the file. Physically storing a large file with millions of records, or processing a file tens or even hundreds of megabytes in size, may become a problem if CPU-intensive tasks need to be carried out on the data file. The handling of a multilingual dictionary file by the computer is discussed further in this report.
One of the most important limitations when working with multilingual dictionaries is attempting to match senses and glosses across different languages. There are words in some languages that are not defined in another language. There are also languages in which no word exists for a specific meaning, and a generalized term is used instead. Furthermore, one word in one language may be used to describe more than one thing in another language. This non-parallelism between languages is called ‘anisomorphism’ [Landau, 1984]. For bilingual dictionaries such as the EDR (see Appendix A), there are 190,000 headwords in the English to Japanese part compared with 230,000 in the Japanese to English part. The smaller number of headwords reflects the fewer headwords required in the generation of the language. Different languages are bound to have differing types of grammar, so there are simply words in one language that do not exist in another. For example, the Australian word ‘bloke’ may only be definable as ‘guy’ or ‘man’ in another language, so its real meaning cannot be conveyed to the user. These differences are obstacles that need to be overcome when combining several dictionaries [Al-Kasimi, 1977]. There is no real solution to this problem. How the situation is handled can depend on the type of implementation used to hold the information.
A file like JMdict uses Japanese headwords with English definitions. The English definitions could possibly be used as headwords too. Usually the language of the headword is the one the user is confident with, and the target language is the user’s second language. The language of the headword may therefore be important in the structure and direction of the multilingual dictionary file. Furthermore, the additional information included within an entry, such as part of speech and examples, needs to be stored in a particular language. A smaller limitation is the question of which language these additional details should be stored in. Entry information such as part of speech and synonyms may be stored in one language, so a problem may arise if the information is in English and a user wishes to search from German to Japanese. The information stored in English may require translation into Japanese or German in order for the user to understand the entry.
This brings the issue to the
next point: the target audience for the multilingual dictionary. The dictionary
may be for English speakers who require translation from English to multiple
languages, or a student who wishes to translate an article. It was mentioned
earlier that bilingual dictionaries could not cater for both L1 and L2 users.
Doing so would mean that effort on creating one dictionary would be cut in half
by trying to create two distinct dictionaries, each for a different target
audience. A possible outcome from this limitation could be to create a
multilingual dictionary file that supports native speakers from only Japan,
Australia and Germany. The dictionary file could be created for users who
specifically wanted to translate between these three languages.
One of the problems with
creating interlingual dictionaries is that of pronunciation. If a user wants to
produce the foreign language from a bilingual dictionary, then they would want
to know the appropriate word and how to pronounce it. For example, the pronunciation of a Chinese character varies. If the user does not know the PinYin phonetic system, it is hard for them to say the character. Furthermore, there are many dialects of Chinese throughout China and other Asian countries, so including pronunciation in an entry can be a difficult task [Al-Kasimi, 1977].
Although not a major issue concerning the actual multilingual dictionary file, the difficulty and complexity of linking every definition when coding the dictionary file can be a headache for lexicographers. Definitions have to be linked not only to each entry, but also to a set of identifiers (such as usage information, parts of speech, and examples) [Landau, 1984]. Multilingual dictionaries increase the complexity of entry input because there are more languages to update. In addition, adding synonyms and other cross-references
can become tedious, especially if the references are in multiple languages that
the dictionary file represents. The process of entry insertion does not
necessarily have to be a problem and could be solved. Depending on the format
of the file, software could be produced, similar to a dictionary search engine,
allowing insertion, deletion and modification of the dictionary file with
little problem.
3.3 Operations, requirements and tradeoffs
Creating an electronic
dictionary can be considered a difficult task. A dictionary application
consists of many different operations; each operation often constituting an
area of study in computer science. It is therefore important to set out the
operations and basic processes that a dictionary application carries out.
Figure 1 describes the operations executed by typical users of a dictionary
program.

Figure 1 Diagram of user actions and corresponding programming sections
The features of user options and possible variants in a dictionary search are discussed later in the paper. From a software developer’s perspective, there are three major sections to the software:
1. Input of entries for users: the program has to be able to read in the user’s query in any of the languages specified. The system also needs to cater for the input of non-English characters into the query space.
2. Retrieval of entry information: the retrieval system is the heart of the application. A weak system will result in fewer users due to lack of efficiency. The retrieval system is required to search through the dictionary file, find matches in the entries, compile relevant information and send the data to the application.
3. Output of entries for users: data is received from the retrieval system, and the next major operation is to format the information for the user to view. This task is a user interface and data representation problem.
The following is a diagram describing an overview of the possible data paths for the application:

Figure 2 Diagram describing the possible data paths for dictionary data
The operation of information retrieval presents various issues, or ‘tradeoffs’, when choosing a technique to implement. A technique usually has advantages over the alternatives, but there are also costs to using it. Deciding which implementation to use is generally a weighing up of priorities: important features that must be optimal against other features that are not as important. The factors and costs that need to be weighed when choosing implementations are:
- Efficiency of the searching system
- Storage space required to store the file
- Storage space required when processing the file
- Frequency of access of entries
- Frequency of access of certain entries
- Amount of data to be extracted per search
- Accessibility to multiple users
These costs will be analysed as different
information retrieval techniques are discussed.
An information retrieval system may typically consist of phases for input, storage, processing, editing, output and transfer. Input is a serious problem and can become a bottleneck. In a non-ASCII context, there is the problem of text input. More than 500 encoding schemes have been devised for Chinese character input alone. In many cases a traditional western keyboard, which will undoubtedly be the main input source for the multilingual dictionary, is insufficient for text input [Lunde, 1999].
Japanese is a language in which each kana or compound character can be represented by a unique phonetic spelling. By creating a conversion file or table listing the compound characters and their Romaji equivalents, it is possible to type a Romaji query and have it converted immediately into Kana for the user to see before the query is sent to the search engine. This is a well-established technique and can be applied quite effectively for Japanese character input.

Figure 3 Excerpts from a Romaji to Kana conversion file. Pairing together a combination of phonetics will result in a character, and combining these together will result in a compound word.
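The Romaji to Kana conversion described above can be sketched as a table lookup that always tries the longest spelling first. The sketch below is an assumption-laden illustration, not the conversion file shown in Figure 3: the class RomajiToKana, the method convert and the handful of table entries are hypothetical, and a complete table would list every syllable.

```java
import java.util.HashMap;
import java.util.Map;

class RomajiToKana {
    // Small excerpt of a Romaji -> Hiragana table (values are Unicode escapes).
    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("a", "\u3042");
        TABLE.put("i", "\u3044");
        TABLE.put("ka", "\u304B");
        TABLE.put("ki", "\u304D");
        TABLE.put("ke", "\u3051");
        TABLE.put("sa", "\u3055");
        TABLE.put("shi", "\u3057");
        TABLE.put("su", "\u3059");
        TABLE.put("kya", "\u304D\u3083"); // compound: ki + small ya
    }

    // Converts a Romaji query to Kana using greedy longest-match lookup.
    static String convert(String romaji) {
        String input = romaji.toLowerCase();
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            // Try three letters first, then two, then one.
            for (int len = Math.min(3, input.length() - pos); len >= 1; len--) {
                String kana = TABLE.get(input.substring(pos, pos + len));
                if (kana != null) {
                    out.append(kana);
                    pos += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(input.charAt(pos)); // leave unknown letters unchanged
                pos++;
            }
        }
        return out.toString();
    }
}
```

Trying the longest spelling first matters because "shi" must be read as one syllable rather than as "s" followed by "hi". Letters with no table entry are passed through unchanged, so the user can see what has not yet been converted.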
There are two general types of ‘fast’, or configurative, input for characters: those which require memorization of character code numbers and those which depend on the internalization of key positions directly related to character codes. An example is the input of Chinese characters via the decomposition of each character into radicals or strokes that are subsequently transformed into letters or numerals on the keyboard. ‘Easy’, or phonetic, methods seek to exploit the user’s existing knowledge of Chinese language and script to minimize the need to acquire new skills. On the keyboard, we need to specify, not produce, the character we want. One form of specification is by pronunciation, the most widely used in ‘easy’ input [Mair and Liu, 1991].
There is a general class of input of foreign characters called IM or
IME (Input Method Editors). Microsoft has developed a free IME they have called
Microsoft Global IME 5.02. It has been developed to enhance East Asian
character input. Global IME allows users to input Asian characters without any
special keyboard or equipment. Because Microsoft operating systems are used so
widely, this free input system is fast becoming a ‘de facto’ standard for
character input [IME, 2000].

Figure 4 Microsoft Global IME in action
A common and well-known example of an IM is an IM for word processors.
The following is a diagram of the Chinese Star IM that can be used to input
Chinese characters into Microsoft Applications.

Figure 5 Chinese Star word processor Chinese character input system. Involves typing PinYin into a translation box. A list of characters that correspond to the PinYin is displayed. Outputting the correct characters is a process of choosing the correct character.
However, some of these techniques are much too complicated and only
simple versions of them may be needed for short text input queries. The
emphasis for inputting foreign characters into a multilingual dictionary is not
speed, because so few letters will be put in at one time, but more of usability
and ease of use for the user. The investigation and implementation of various
input methods for different languages is out of the scope of this project. As
more languages are added to the multilingual dictionary, more and more input
techniques may be required to support the languages. Users could find it
difficult to adapt to all the different kinds of input techniques available.
In the past, a major obstacle in foreign text processing was the output of the characters. There are several ways of representing the same font information. The different representations came from the different font houses (such as Adobe) creating their own standards. Fonts are either bitmapped or outlined (scalable). Bitmapped fonts represent each character as a rectangular grid of pixels. There are a number of disadvantages to this approach, but the most important one is the difficulty of changing the size, shape, and resolution of a bitmapped character without loss of quality, because the bitmap is defined at a certain size and resolution. Outline fonts represent each character mathematically as a series of lines and curves; before display, the font must be 'rasterized' into a bitmap. LaserJet .SFP and .SFL files, TeX PK, PXL, and GF files, Macintosh Screen Fonts, and GEM .GFX files are all examples of bitmapped font formats. PostScript Type 1, Type 3, and Type 5 fonts, Nimbus Q fonts and TrueType fonts are all examples of outline font formats.
In addition to these two types of font format, there are further issues. Identical formats on different platforms are not necessarily the same. For example, Type 1 fonts on the Macintosh are not directly usable under MS-DOS or Unix, and vice versa. There are also many different font formats. Two major font formats are discussed:
PostScript Type 1 Fonts: PostScript Type 1 fonts (also called ATM (Adobe Type Manager) fonts, Type 1 fonts, and outline fonts) contain information, in outline form, that allows a PostScript printer, or ATM, to generate fonts of any size.
TrueType Fonts: TrueType fonts are a font format developed by Apple and adopted by Microsoft. The rendering engine for this format is built into MS Windows v3.1 and subsequent versions. Like PostScript Type 1 and Type 3 fonts, it is an outline font format that allows both screens and printers to scale fonts to display them at any size.
The following is a table
that describes some of the font extensions and their platform usage. Despite
all these difficulties, there are now standard libraries (such as those listed
above) that define the font types and font information, so the problem of
displaying foreign characters is not such a major problem today.
| Extension | Usage |
| *.fon | An MS-Windows bitmapped font |
| *.pfa | Adobe Type 1 PostScript font in ASCII format (PC/Unix) |
| *.pfb | Adobe Type 1 PostScript font in "binary" format (PC/Unix) |
| *.ps | Any PostScript file (Type 3 font) |
| *.pxl | TeX pixel bitmap font file |
| *.ttf | MS-Windows TrueType font |
Table 3. Font extensions and their platform usage
Chapter 4
Internal representation of the XML document
The ‘heart’ of the application lies in the internal representation of the multilingual dictionary. Failure to create an efficient system will result in a disappointing application. This chapter explores the concepts of XML and its role in the JMdict file. In addition, XML-specific tools are examined, outlining their properties and their applicability to a multilingual dictionary application.
4.1 XML
XML stands for ‘eXtensible Markup Language’. It is a standard system for defining the content and format of an electronic document. HTML tells how the data should look, but XML tells you what it means [Goldfarb, 1998]. The differences go further than that. HTML has a fixed set of markup tags, for example the <b> and <a> tags. Additional tags cannot be defined, which restricts the applications of HTML. From the computer’s perspective, the information supplied in an HTML file has no structure. XML differs from similar markup languages in that it is designed with the idea that the document format should be specific to the type of document being created. This allows XML to be used as a genuine storage method, as JMdict has shown. It allows the programmer to separate data from display. The following are some of the strengths of XML:
- Extensibility – allows users to define their own tags (or attributes) to suit the data being represented.
- Structure – allows nested structures of any depth, and is hence suited to dictionary-type entries.
- Validation – allows the document to be validated before use by applications.
Another important feature of
XML is that it enables the definition of tags for each individual document. The
formal definition that describes each type of tag is called a document type
definition (DTD). The following is a small sample of what a tag definition
looks like:
<!DOCTYPE label [
<!ELEMENT label (name, street, city, state)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
]>
This DTD defines a
label. Each label will contain an element name, street, city and state. By allowing user-defined tags, XML
improves functionality and increases the appeal to developers, enabling them to
create any type of data document [XML.com, 2000].
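To illustrate what validation against such a DTD means in practice, the following sketch checks a label document with the standard Java XML parser (javax.xml.parsers). It is an illustration under stated assumptions: the class LabelValidator and the method isValid are hypothetical names, and the DTD is simply prepended to the document as an internal subset.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class LabelValidator {
    // The label DTD from the text, written as an internal subset.
    static final String DTD =
        "<!DOCTYPE label [\n"
        + "<!ELEMENT label (name, street, city, state)>\n"
        + "<!ELEMENT name (#PCDATA)>\n"
        + "<!ELEMENT street (#PCDATA)>\n"
        + "<!ELEMENT city (#PCDATA)>\n"
        + "<!ELEMENT state (#PCDATA)>\n"
        + "]>\n";

    // Returns true when the label element conforms to the DTD above.
    static boolean isValid(String labelXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXParseException {
                    throw e; // treat validity errors as failures
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + labelXml)));
            return true;
        } catch (Exception e) {
            return false; // malformed or invalid document
        }
    }
}
```

A label containing name, street, city and state in that order is accepted, while a label with a missing or misplaced element is rejected before the application ever sees its data.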
Elements can have attributes, which are properties of the element. An attribute is a name and value pair given in the element’s start tag. An example of this in the JMdict file is the <gloss g_lang="de"> tag. It is an element of type gloss with an attribute named g_lang, whose value “de” stands for German. By using this feature of XML, different languages can easily be entered into the dictionary file without the need to design additional tags.
XML documents have a hierarchical organization, which is referred to as the tree structure of the document. The structure of the entry ‘New Year’s sake’ is displayed in the following diagram.

Figure 6 Tree structure view of an entry in JMdict
Structuring of a document or storage file is an
extremely important factor, especially with one as large as the EDICT file (See
Appendix) containing over 100,000 entries. A plain, unstructured dictionary
file format such as EDICT allows the linguist a great deal of freedom with the
structure of the dictionary. The disadvantage is that the file must be accessed
from top to bottom, from beginning to end. It is also application independent.
This is a good method of storage if the amount of data is relatively small and the
combined delays of searching are not an issue. Even word processor searches for
particular words in a relatively long file can output a result in good time.
The dictionary file has to be more structured to allow for more efficient
access as the data set gradually increases in size, and when extra data
attributes are added. In addition to these problems, the limited structure of
flat file dictionaries like EDICT means that for each entry, the amount of
includable information is restricted as information extraction from the file
would become extremely difficult. XML was chosen as the data format for the new dictionary file because it allowed for additional attributes and, most importantly, extensibility. Taking advantage of this extensibility, German glosses were added to dictionary entries. Breen called this new file the ‘Japanese Multilingual
Dictionary’ or JMdict. Its superiority over other electronic dictionary files
of this type meant that the JMdict file would be the primary data source, and
the dictionary application would be built around this file.
According to Breen[2000 (b)], the aims of the JMdict
project were as follows:
The following figure shows a sample of JMdict as displayed in Internet Explorer 5, which is capable of displaying Unicode fonts:

Figure 7 Sample of the JMdict file displayed on a Unicode capable browser
XML is a simplified subset of SGML and is not
really optimized for the Web environment. However, it does mean that it is data
processing-oriented (compared to browser-oriented HTML) and should not be seen
by the end-user. This is not to say that XML cannot be presentable. Data stored
in XML documents still need to be presented to users in an attractive format.
A set of rules for formatting an XML document is called a stylesheet and can be used to transform the raw XML data into an HTML document complete with hyperlink markup. The language used to create the stylesheets is called XSL, or Extensible Stylesheet Language [Goldfarb, 1998].
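The idea of transforming raw XML into HTML with a stylesheet can be sketched with the standard javax.xml.transform API. This is an illustration only: the class EntryToHtml, the method transform and the very small stylesheet are hypothetical, and the input is assumed to be a simplified entry containing gloss elements rather than a full JMdict entry.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class EntryToHtml {
    // A minimal stylesheet: every gloss in the input becomes an HTML list item.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='/'>"
        + "<ul><xsl:for-each select='//gloss'>"
        + "<li><xsl:value-of select='.'/></li>"
        + "</xsl:for-each></ul>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML text and returns the HTML.
    static String transform(String entryXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter html = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader(entryXml)), new StreamResult(html));
        return html.toString();
    }
}
```

The dictionary data and its presentation stay separate: changing how entries look only requires a different stylesheet, not a change to the XML file or to the search code.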
4.2 XML information access
Because a structured storage format is being used to hold the dictionary entries, more sophisticated methods of searching and entry handling can be carried out. To use the XML storage structure, the entries need to be parsed by an XML parser. There are two major types of API available for processing XML documents:
- Tree-based APIs
- Event-based APIs
Firstly, there is the SAX (Simple API for XML) implementation of XML parsing. This is an event-based API that uses a simple technique in which the parser reads through the XML document sequentially and reports each item it encounters, so that matching entries can be found. On the other hand, there is a tree-based API called DOM (Document Object Model). This is slightly more sophisticated in that it treats the entries individually as objects [Laddad, 2000]. Both of these APIs were created to serve the same purpose: to provide access to the information stored in an XML file. The advantages and disadvantages of both methods are studied below to determine whether they can process the JMdict file efficiently. Of particular interest is how the XML document is broken down by the parser. Additional features that can be implemented in electronic dictionaries, such as cross-referencing, will be compared when using these two different XML parsing techniques.
When deciding what programming language to use to implement the multilingual dictionary, several requirements need to be fulfilled:
- Ability to handle Unicode
- Compatibility with XML tools, in particular SAX and DOM
- Ability to integrate with the internet
- Efficient and re-usable code
Java appears to be the best language in which to code the application. One of its greatest strengths is that its internal character representation is in fact Unicode, immediately reducing the complexity of converting between foreign character standards and internal character representations. In comparison with C or C++, string handling in Java is far superior. Many XML tools are being developed for Java, and XML is beginning to be synonymous with Java. For example, IBM’s implementation of the XML parser has been released as XML4J (XML for Java). Standards such as DOM (and JDOM, see section XX) are being put forward to be included in the next Java release.
4.3 DOM (Document Object Model)
DOM addresses the object model not only of XML but also of HTML, since an HTML file is also a structured document. Currently, DOM is specified up to Level 1 core. The DOM Level 2 specification is still in working draft form and is expected to change before it officially becomes a standard.
DOM represents a document
tree fully held in memory. It is a large API designed to perform almost every
conceivable XML task. It also must have the same API across multiple languages.
Because of those constraints, DOM does not always come naturally to Java
developers who expect typical Java capabilities such as method overloading, the
use of standard Java object types, and simple set and get methods. DOM also
requires lots of processing power and memory, making it intractable for many
lightweight Web applications and programs. However, we will still investigate
the features of DOM and its suitability.
The DOM class hierarchy is divided into several layers:
- Document – the Document node is the master node; only one of these can exist for an XML document. It represents the XML document as a whole.
- NodeList – holds a collection of child nodes. It basically allows access to the children.
- NamedNodeMap – provides additional functionality in relation to NodeList in that it is able to access the children by their names.
- Element – contains an element from an XML document. This can be thought of as the name of the tag used in an XML document, e.g. <gloss>.
- Text – represents text contained within the element tags, e.g. the text in <gloss>to conspire</gloss>.
- Attr (attribute) – represents the attributes declared within the scope of an element. An example of this in XML format is <gloss g_lang="de">.
- CDATASection – similar to the Text node; however, it can contain characters that would otherwise be read as markup. This allows the user to specify text containing XML control characters such as ‘<’ and ‘&’.
- DocumentType – a node that represents the tags used in the Document Type Definition.
- Entity, EntityReference and Notation – used to describe nodes used in the DTD.
The relationship and hierarchy between the different node types are shown in Figure 8. Various interface methods can be carried out on these nodes to manipulate and access them. Examples of the functions available are:
- Child modifiers
- Node creation, grabbing, moving and deletion
- Element methods
- Element usage such as child iteration
Thus, DOM provides the mechanisms needed to interact dynamically with the elements and content in an XML document. With DOM, handlers, or hooks, are created to encapsulate behaviors to associate with elements in the DOM tree. DOM sets out to be able to model every possible well-formed XML document. Therefore, DOM classes contain features that many XML applications never use. An electronic dictionary requires few of these classes. In almost all procedures, the operation to be carried out would be the grabbing of nodes and attributes. The file is not required to be, and should not be, modified by different user sessions, so the interfaces provided for dynamic modification of the XML document structure are not required.
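The "grabbing of nodes and attributes" that a dictionary needs can be sketched with the DOM interfaces in the standard Java library (org.w3c.dom and javax.xml.parsers). The sketch is illustrative only: the class DomGlossLookup and the method glosses are hypothetical, the entry used is a simplified stand-in for a JMdict entry, and the API shown is the current standard one rather than the parser packages available when this report was written.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomGlossLookup {
    // Returns the text of every <gloss> whose g_lang attribute equals lang.
    static List<String> glosses(String xml, String lang) throws Exception {
        // The whole document is parsed into an in-memory tree first.
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("gloss");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element gloss = (Element) nodes.item(i);
            if (lang.equals(gloss.getAttribute("g_lang"))) {
                result.add(gloss.getTextContent());
            }
        }
        return result;
    }
}
```

The lookup itself is only a few lines, but the parse call builds the complete tree before any node can be read, which is where the memory cost of DOM comes from on a large file.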
In summary, DOM is most useful for business applications that require the dynamic manipulation of elements and content in XML files. These features, important in other applications, are not a major factor in the context of creating a dictionary application. The only concern in the usage of JMdict is efficient access to and searching of the XML file. DOM can provide this, but at a huge cost that unfortunately makes it unsuitable for this application. An initial experiment using the DOM parser to create the DOM tree for JMdict resulted in a program crash because the system ran out of memory. With an ever-growing dictionary file, DOM alone is not a feasible option.

Figure 8 Hierarchical relationship between different node types
4.4 SAX (Simple API for XML)
SAX is a public domain API developed cooperatively by the members of the XML-DEV mailing list. It has now become a ‘de facto’ standard for event-based parsers and is one of the most popular XML APIs available. It provides an event-driven (sometimes referred to as callback-style) interface to the process of parsing an XML document. As mentioned previously, XML is a hierarchical language, which means entries can be nested and have parents and children. Although XML provides this kind of functionality, a document does not have to be handled as a tree. SAX presents a view of the document as a sequence of events. For example, it reports every time it encounters a begin tag and an end tag. That approach makes it a lightweight API that is good for fast reading. However, SAX supports neither modifying the document nor random access to it.
In some cases, the event-based API can be more efficient than a tree-based
API. It generally provides lower-level access to an XML document. An advantage
of SAX is portability between SAX parsers: code written against one parser can
be ported very easily to another parser package. Because SAX does not build an
internal representation of the document, it is able to handle large documents
much better than DOM; the memory overhead of storing the XML data is avoided.
SAX is called the simple API because it is just that. All SAX does is carry out
actions in response to events occurring in the XML file. SAX, knowing nothing
of the rules that govern a particular XML document's structure, must be
prepared for anything. It must watch for and, if directed, generate events for
every XML feature that a document may contain. The application supplies the
code that responds to these events according to the type of data being read.
This allows for great flexibility. For example, the SAX parser will report
every <entry> tag, so if time and efficiency were not a factor, a query could
be answered using SAX alone by comparing the content of each entry with the
query string.
To make SAX handle events properly, a SAX document handler needs to be created
to interpret all the SAX events. In addition, the behaviour of the handler,
which responds to the data received by the parser, needs to be coded as well
(which can be a lot of work). The DocumentHandler interface is the most
important part of the SAX interface. It is responsible for capturing specific
document events:
- startDocument and endDocument – Indicate the start and the end of the XML document
- startElement and endElement – Indicate the start and the end of a new element. This can be <entry> or even <gloss>
- characters – Indicates that there is character data. This event can be used to retrieve data such as the definition of an entry.
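The handler events listed above can be sketched as follows. This is only an illustration: it uses Python's standard xml.sax module (whose ContentHandler offers the same startElement, endElement and characters callbacks) rather than the Java SAX API, and the SAMPLE fragment, the GlossSearchHandler class and the search function are hypothetical names introduced here. Real JMdict entries contain many more elements.

```python
import io
import xml.sax

# Hypothetical, heavily simplified JMdict-like fragment.
SAMPLE = b"""<JMdict>
<entry><ent_seq>1000001</ent_seq><gloss>hot water</gloss></entry>
<entry><ent_seq>1000002</ent_seq><gloss>to run</gloss></entry>
</JMdict>"""


class GlossSearchHandler(xml.sax.ContentHandler):
    """Collects the ent_seq of every entry whose gloss contains the query."""

    def __init__(self, query):
        super().__init__()
        self.query = query
        self.matches = []   # sequence numbers of matching entries
        self._tag = None    # element whose text is currently being collected
        self._text = []     # pieces of character data for that element
        self._seq = None    # ent_seq of the entry being read

    def startElement(self, name, attrs):
        # Fired for every start tag; only two elements are of interest here.
        if name in ("ent_seq", "gloss"):
            self._tag = name
            self._text = []

    def characters(self, content):
        # May fire several times per element, so the pieces are accumulated.
        if self._tag is not None:
            self._text.append(content)

    def endElement(self, name):
        if name != self._tag:
            return
        text = "".join(self._text)
        if name == "ent_seq":
            self._seq = text
        elif self.query in text:
            self.matches.append(self._seq)
        self._tag = None


def search(xml_bytes, query):
    """Linear search: the whole document is read from start to end."""
    handler = GlossSearchHandler(query)
    xml.sax.parse(io.BytesIO(xml_bytes), handler)
    return handler.matches
```

Every query in this sketch triggers a complete pass over the document, which is the inefficiency that makes SAX on its own unattractive as a search engine.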
The following diagram outlines the basic functions of the SAX parser and how it
may be used in a dictionary application:

Figure 9 SAX parser operation and a possible application in a multilingual dictionary
There are several XML parsers that have built-in SAX support:
Microstar’s Aelfred
James Clark’s XP
IBM’s XML for Java
Using the SAX parser alone as the search engine is going to be impractical.
Although SAX is the better parser to use for large files, it suffers from the
fact that it is strictly sequential. The parser has to start at the beginning
of the document and finishes its parse only when it reaches the end of the
document, which is not efficient enough for answering individual queries.
However, the event-based nature of SAX could have its advantages. One
suggestion is to use its start-to-end pass to create an index file. The parser
calls the startElement function whenever it encounters a tag, whether or not
the application is interested in that tag. By selectively processing elements
such as <ent_seq> (the sequence number of the entry) and <gloss> (the English
definition), an index file could be created.
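A minimal sketch of this idea is given below, again using Python's xml.sax as a stand-in for a Java SAX parser. The names IndexBuilder, build_index and DEMO are hypothetical, and the index is kept as an in-memory sorted list of (gloss, ent_seq) pairs rather than being written to a file.

```python
import io
import xml.sax

# Hypothetical, simplified JMdict-like input; entries deliberately unsorted.
DEMO = (b"<JMdict>"
        b"<entry><ent_seq>1000002</ent_seq>"
        b"<gloss>to run</gloss><gloss>to dash</gloss></entry>"
        b"<entry><ent_seq>1000001</ent_seq><gloss>hot water</gloss></entry>"
        b"</JMdict>")


class IndexBuilder(xml.sax.ContentHandler):
    """Gathers (gloss, ent_seq) pairs in a single start-to-end pass."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._tag = None
        self._buf = []
        self._seq = None

    def startElement(self, name, attrs):
        # Called for every tag; all but <ent_seq> and <gloss> are ignored.
        if name in ("ent_seq", "gloss"):
            self._tag, self._buf = name, []

    def characters(self, content):
        if self._tag is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == self._tag:
            text = "".join(self._buf).strip()
            if name == "ent_seq":
                self._seq = int(text)
            else:
                self.pairs.append((text.lower(), self._seq))
            self._tag = None


def build_index(xml_bytes):
    """Return the index entries sorted into word order."""
    builder = IndexBuilder()
    xml.sax.parse(io.BytesIO(xml_bytes), builder)
    return sorted(builder.pairs)
```

The expensive sequential pass is paid once, when the index is built, rather than on every query.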
Another implementation may be possible using the SAX parser. If the application can use an alternative technique (such as indexing or hashing) to find the exact location of the queried entry in JMdict, the SAX parser could be used to parse the entries from that point onwards. As mentioned before, the SAX DocumentHandler methods can be overridden so that they exhibit certain behaviour when a specific tag is found. This feature allows the application to exhibit special behaviour when, for example, a bibliographic attribute is found in a particular entry. SAX parsers are built to parse a complete file at one time, but it should be possible to make minor modifications so that the parser is able to start at a certain point in the file without having to validate the whole file.
The following table compares SAX and DOM:
| | SAX | DOM |
| Information access | Sequential | Random |
| Setup cost | None | High (document parsed into memory) |
| Memory cost | Low | Very high |
| Applicability to JMdict | Possible | Impractical |
Table 4. SAX and DOM comparison
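The second SAX-based approach described before the table, in which an entry is located by other means and only that entry is parsed, might look like the sketch below. It assumes the byte offset of the <entry> tag is already known (for example from an index), uses Python's xml.sax in place of a Java parser, and works on an in-memory byte string named DATA for brevity. The names GlossCollector and parse_entry_at are hypothetical. One caveat: entity references declared in the JMdict DTD would not resolve in an isolated fragment, and this sketch ignores that problem.

```python
import xml.sax

# Hypothetical, simplified JMdict-like data held in memory.
DATA = (b"<JMdict>\n"
        b"<entry><ent_seq>1</ent_seq><gloss>hot water</gloss></entry>\n"
        b"<entry><ent_seq>2</ent_seq><gloss>to run</gloss></entry>\n"
        b"</JMdict>\n")


class GlossCollector(xml.sax.ContentHandler):
    """Collects the text of every <gloss> element it is shown."""

    def __init__(self):
        super().__init__()
        self.glosses = []
        self._in_gloss = False

    def startElement(self, name, attrs):
        if name == "gloss":
            self._in_gloss = True
            self.glosses.append("")

    def characters(self, content):
        if self._in_gloss:
            self.glosses[-1] += content

    def endElement(self, name):
        if name == "gloss":
            self._in_gloss = False


def parse_entry_at(data, offset):
    """Parse only the single <entry> element that starts at byte `offset`."""
    end = data.index(b"</entry>", offset) + len(b"</entry>")
    handler = GlossCollector()
    # The slice is a well-formed fragment, so the parser never sees
    # the rest of the file.
    xml.sax.parseString(data[offset:end], handler)
    return handler.glosses
```

Here the parser itself is unmodified; the application simply hands it the slice of the file that contains the wanted entry.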
4.5 Object orienting the XML file
A paper dictionary can at times be considered a file. The entries are entered
in alphabetical order. Another feature of this file is that each entry can be
considered an object. Electronic dictionaries face the dilemma of how best to
represent the data of a dictionary internally. Treating each entry as an
object is a fair consideration: it is a system that has worked before, and it
is possible that it will work with electronic dictionaries. However, this is
not a situation where ‘if it ain’t broke, don’t fix it’ applies. By using
objects to represent entries, many of the advantages of using an electronic
medium to store, search and display information will be lost. If entries are
modelled as separate objects, it becomes quite difficult to link entries
together. Headwords in a dictionary are invariably tied to many other words in
the dictionary; no word is unrelated. Some relationships include parts of
speech, usage, synonym, antonym and word derivation, to name a few. It would be
advantageous to base the representation of the dictionary entries on the
relationships between various attributes of an entry. It would definitely make
for a more diverse and functional dictionary system. An object-oriented
approach to entry storage simply cannot cope with so many links. It would be
expensive to construct and quite complex to traverse.

Figure 10 It can become quite complex and difficult to traverse with each entry containing so many different links, incoming and outgoing
From the analysis of the object-based DOM API, it seemed that the
implementation of an object-oriented tree structure was not feasible,
memory-wise. The concept of object orienting a dictionary also seems a far from
adequate solution for visualizing a dictionary entry. The following chapter
will discuss other methods with which to represent JMdict.
4.6 Other XML tools
There are other tools for XML that are considered irrelevant to creating the dictionary application. They are briefly summarized:
XML Base (XBase) allows a document to specify its base URI (Uniform Resource
Identifier), against which all relative URI references in the document can be
resolved. This includes references to images, stylesheets, applets, etc. It is
anticipated that this specification will endorse the 1.0 version of the XBase
specification.
XML Pointer Language (XPointer) is a language that can be used as a fragment
identifier for any URI that locates an XML resource. It is based on the XML
Path Language (XPath). It supports addressing internal
structures of XML documents, traversals of a document tree, and the selection
of internal parts of an XML document based on various properties. It is
anticipated that this specification will endorse the 1.0 version of the
XPointer specification.
XPath is a language for addressing parts of an XML
document, designed to be used by both XSLT and XPointer. It is anticipated that
this specification will endorse the 1.0 version of the XPath Recommendation.
4.7 Future XML tools
Both DOM and SAX are general-purpose solutions to XML document processing. Each of these currently seems unsuitable, on its own, for processing the JMdict XML file. Developers have also picked up on some of the points where both SAX and DOM fall short and have begun creating new APIs for use with Java and other programming languages to fill the gap. Although some of these APIs are very new, they offer hope for using XML-specific tools to create XML-based applications more efficiently.
JDOM: a Java representation of an XML document. It is a new API for reading,
writing, and manipulating XML from within Java code. JDOM attempts to
incorporate the best of DOM and SAX and create a new set of classes and
interfaces. It can read from existing DOM and SAX sources, and output to DOM
and SAX receiving components. That ability enables JDOM to interoperate
seamlessly with existing program components built against SAX or DOM. It
provides a way to:
- Represent the document
- Read the file easily and efficiently
- Manipulate the data
- Write the data back to file
It is an alternative to DOM and SAX, although it integrates well with both DOM
and SAX.
JDOM documents can be built
from XML files, DOM trees, SAX events, or any other source. JDOM documents can
be converted to XML files, DOM trees, SAX events, or any other destination.
This ability proves useful, for example, when integrating with a program that
expects SAX events. JDOM can parse an XML file, let the programmer easily and
efficiently manipulate the document, then fire SAX events to the second program
directly - no conversion to a serialized format is necessary.
The developers of the JDOM standard have described the development of this API
as:
- “Straightforward for Java programmers.
- Supporting easy and efficient document modification.
- Ability to hide the complexities of XML wherever possible
- Able to integrate with DOM and SAX.
- Being lightweight and fast.
- Able to solve 80% (or more) of Java/XML problems with 20% (or less) of the effort”
[JDOM, 2000]
According to its developers, loading and manipulating documents should be
quick, and memory requirements should be low. JDOM is intended to provide a
full document view with random access without requiring the entire document to
be in memory.
One of the drawbacks of attempting to implement JDOM at this point in time is
its early stage of development. The JDOM API has not yet been released as a
beta. This means that its interfaces and classes will change, so the
implementation of access to information in the dictionary file may have to
change as the standard develops. Coupled with the constant growth of the
multilingual dictionary, there may be too many changes happening around the
application. However, because of its ability to output data to existing DOM and
SAX components, JDOM could be used to replace older implementations of a system
that used SAX or DOM. If the developers’ claims about improvements in
functionality and efficiency over current standards are correct, then the
application may run faster.
Adelard is an alternative to existing technologies like SAX and DOM. These
technologies, while useful, operate at the level of individual elements and
attributes. The application code in effect implements a ‘bridge’ from these
entity-level components to user-level dictionary entries. Adelard aims to
provide mapping of XML document structures directly to high-level objects. This
API is being developed for mapping XML documents to business-level objects,
which in this context can be seen as mapping XML to dictionary entry objects.
Adelard comprises two integrated parts: a binding framework and a schema
compiler. The data binding framework supports the transformation of XML
documents to and from Java objects. Both the source schema (otherwise known as
the DTD) and the binding schema become inputs to the schema compiler. The
source schema describes the structure of the XML file. The binding schema
describes the program-specific information that drives the generation of the
Java classes from the source schema. Because Adelard knows about the schema,
it can optimize the generated classes to support only those features necessary
for the schema in question. It can drop support for unused XML document
features.

Figure 11 Diagram of the schema used to create objects
Projected performance is one of the most important pieces of information.
Current benchmarking indicates that Adelard is both faster than SAX and easier
on resources than DOM. This is clearly a great advantage for improving the
efficiency of the application. The public release of the specification and an
early-access implementation are projected for the end of 2000.
The JAXP specification focuses on the parsing aspect of XML programming,
providing an API for creating, configuring, and manipulating XML parsers. JAXP
supports SAX and DOM, which are the most common interfaces to XML. JAXP's main
goal is to provide an interface that lets programmers create, manipulate, and
use standard XML parsers. In addition, JAXP sets out to allow programmers to
write parser-neutral code, deferring parser selection until later in the
process. Sun is the creator of JAXP and is currently starting on version 1.1.
Sun hopes to make JAXP a core Java extension.
XML Query Engine is a
JavaBean component that lets you search your XML documents for element,
attribute, and full-text content. It can index multiple documents using a SAX
parser of your choice. The index, once built, can be queried using XQL, a ‘de
facto’ standard for searching XML like a database.
XML Query Engine uses an index system to track every element,
attribute, and the words contained in each for every document. Any document to
be queried needs to be indexed first. Before you can index though, you have to
tell the query engine what sorts of things to index or ignore. By default, XML
Query Engine indexes everything it encounters. That might not be what users
want, so restrictions can be put in place to customize the index file.
Although it is incomplete and unpolished (it is not yet a beta), it does
provide an attractive alternative to the techniques described earlier. However,
it seems to be focused on business applications. This means that some of the
functions included in the API would again be irrelevant and thus untouched in
the implementation of the multilingual dictionary program. Currently, there is
no implementation of a persistent store for the index. This means that the API
creates the index each time the application starts up. It seems unlikely that
indexing the 13 MB file would be fast enough to satisfy users. The resultant
delay may be a price users are unwilling to pay on a regular basis.
When JMdict is parsed into memory using basic data structures, the number of
entries in the file (13 MB) can result in excessive memory usage by the system.
For a dictionary application to use up the majority of a computer’s memory is
not very practical. This means that other methods of storage need to be sought
in order to improve the speed and memory efficiency of the application. Object
orientation of the entries has been explored but unfortunately does not provide
the flexibility that will allow different types of access to the entries. Some
of these difficulties can be attributed to the design of the JMdict file.
Despite this, there are still several possibilities for internally
representing the dictionary file.
5.1 Limitations with JMdict
JMdict is a revolutionary dictionary file in comparison with its predecessor,
EDICT. It provides many structural changes and introduces features and a format
that will allow the file to grow further in the future. However, there are a
few aspects of its design that can potentially affect the way a dictionary file
is represented. There are many issues associated with multilingual dictionary
files in general, which have been discussed previously. Because the headwords
in the JMdict file are in Japanese, the type of information available is
restricted. English definitions not covered by Japanese headwords may be left
out of the dictionary. In addition, this multilingual dictionary provides
translations from one language to another, not definitions within the same
language. These could be added in further updates. Furthermore, the English
glosses are sometimes quite lengthy, or provide a deep definition. When it
comes to translating from English to Japanese, the only information that a user
can extract in Japanese is the headword; there is no indication of its usage.
Glosses range from, for example, “to run” to, in one case, “in the blink of an
eye (lit: in the time it takes to say "Ah!")”, as one English definition puts
it. The XML file is already quite large, and will undoubtedly grow bigger. Many
of the tools available for XML are business-oriented tools which rely on
smaller files, which are sometimes dynamically created.
In the end, there are some design features that are more desirable than
others. As with most things, they come at the cost of something else. Breen has
set up a mailing list for developers or computer scientists who are interested
in JMdict to participate in discussions that could lead to changes in the
makeup of JMdict. JMdict is a good design, but some changes could possibly be
made to improve the file.
5.2 Indexing techniques
Indexing is the process of
pre-building the internal data structures needed to enable subsequent fast
retrieval from the indexed documents. The use of an index file is well suited
to dictionary applications because, depending on how the index file is set out,
advanced searches on alphabetically or logically ordered data are possible.
The index file is an efficient method of entry retrieval. It can be used to
create a fast dictionary system. The idea of setting up an index file is not
new and has been implemented numerous times. The reason for investigating this
technique is to identify whether it can be used just as well in the XML
document context. JMdict is a data store like any other file; it differs by
containing additional structure. This added structure may open up new methods
of data representation, but older techniques should also be investigated.
One such technique is used by Breen (Jdic, Xjdic) to facilitate rapid searching
of the EDICT file. An indexing utility identifies the byte offsets of English,
Kanji and kana tokens. These offsets are sorted into word value order to
produce an index file. To find the correct entries, the search engine searches
through the index file, checking the EDICT file to find the correct entry. Once
it is found, it is passed back to the application for display. Once a mapping
has been found, it is a relatively easy task to open the dictionary file and
travel to the specific byte offset in the file to access the information
required.
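The byte-offset technique can be illustrated with the sketch below. It is not Breen's actual utility: the function names are hypothetical, the sample lines are an invented EDICT-style layout (headword, reading, then glosses between slashes), only English tokens are indexed, and UTF-8 is assumed as the file encoding.

```python
import os
import tempfile

# Hypothetical EDICT-style lines: headword [reading] /gloss/gloss/
LINES = [
    "お湯 [おゆ] /hot water/\n",
    "走る [はしる] /to run/\n",
]


def make_demo_file():
    """Write the sample lines to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        f.write("".join(LINES).encode("utf-8"))
    return path


def build_offset_index(path):
    """Return a sorted list of (english_token, byte_offset_of_line)."""
    index = []
    offset = 0
    with open(path, "rb") as f:
        for raw in f:
            glosses = raw.decode("utf-8").strip().split("/")[1:-1]
            for gloss in glosses:
                for token in gloss.lower().split():
                    index.append((token, offset))
            offset += len(raw)  # byte length, not character count
    index.sort()
    return index


def fetch(path, offset):
    """Jump straight to a byte offset and read the one line found there."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline().decode("utf-8").rstrip("\n")
```

Offsets are counted in bytes rather than characters because the Japanese headwords occupy several bytes each, and seek operates on bytes.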
Breen’s index only had one index file for English, Kanji and kana and allowed
searching for English and kana tokens. One of the restrictions of index files
is that a single index file really only allows searching of the file in one
dimension. If an English to Japanese dictionary were implemented with an
English index file containing the English headwords as index entries, the
user’s search would have to be based around an English headword. A solution to
this problem is to have not just one index for the multilingual file, but
multiple indexes holding different headwords and index entries. This is a very
viable solution. Not only does it give the user the flexibility to search in
different languages, but separate indexes for, say, verbs, nouns and synonyms
will improve the functionality of the dictionary system by providing superior
and interesting searches for users. This increased functionality can lead to
many improvements, such as more useful display methods. Figure 12 shows a
possible arrangement of a dictionary application using multiple indexes.
The advantage of using an index file is that it is very suitable for running
the application remotely. The remote application only needs to parse certain
sections of the XML file to find the information required, instead of reading
in a complete dictionary file, which would be impractical over low-bandwidth
connections. In addition, an index file is usually much smaller and simpler
than the dictionary file. Because it is much smaller than the dictionary
itself, it should fit into memory very easily, which makes it quick to
reference. It is also relatively easy to compile using an indexing program that
can be supplied with the dictionary application, so when updates to the files
arrive, the user can re-create the index.

Figure 12 Outline of possible index filing of the multilingual dictionary coupled with the SAX parser
The index files are stored in alphabetical order. In some cases they are
ordered by another key, for example part of speech or bibliographic entry. A
very efficient way to search ordered data is a simple binary search. Even on
large files, binary searching is a fast and effective way of locating an entry.
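As a small illustration of binary searching a sorted index, the sketch below uses Python's standard bisect module. The INDEX contents and the prefix_search function are hypothetical. Each index entry is assumed to be a (word, entry number) pair, and the upper bound of a prefix range is found by appending the highest Unicode code point to the prefix.

```python
from bisect import bisect_left

# Hypothetical sorted index of (word, entry number) pairs.
INDEX = sorted([
    ("hot", 1), ("run", 2), ("runner", 3), ("running", 4), ("water", 1),
])


def prefix_search(index, prefix):
    """Return every index entry whose word starts with `prefix`.

    Two binary searches locate the first and last positions of the
    contiguous block of matching words in the sorted index.
    """
    keys = [word for word, _ in index]  # would be built once in practice
    lo = bisect_left(keys, prefix)
    hi = bisect_left(keys, prefix + "\U0010ffff")
    return index[lo:hi]
```

Because matching words are adjacent in a sorted index, the same mechanism serves both exact lookups and "begins with" searches.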
5.3 Databases
There has been debate as to whether XML is in fact a database. Strictly speaking, the answer is no. Although an XML document contains data, without additional software to help process that data it is no more a database than any other text file. However, if the XML document is considered along with all the surrounding XML tools and technologies, then the answer may be yes, because together they can provide some of the features found in databases.
A database management system (DBMS) is a software system for defining the
structure of, entering, retrieving, storing, maintaining and validating a large
collection of formatted data. There are many advantages to using a DBMS with a
large amount of data, and it is a natural progression to consider storing a
large and expanding information source such as a multilingual dictionary file
in a commercial DBMS. Sharing between users, data managed as a resource, data
independence from applications and portability are all reminders of why a DBMS
is a popular form of data storage.
In order to transfer data
between an XML document and a database, it is necessary to map document
structure to database structure and vice versa. Such mappings fall into two
general categories: template-driven and model-driven. In a template-driven mapping,
there is no predefined mapping between document structure and database
structure. Instead, you embed commands in a template that is processed by the
data transfer middleware. In a model-driven mapping, a data model is imposed on
the structure of the XML document, and that model maps to the structures in the
database. What is lost in flexibility is gained in simplicity in the design
model.
Object-oriented mappings are already established techniques by which
object-oriented data stored as an XML document can be transferred into
relational databases. However, there are issues with viewing dictionary entries
as objects. As discussed in the previous chapter, treating an entry as an
object does not necessarily mean that the system can get the maximum
functionality out of the dictionary file.

Figure 13 Database approach to data retrieval
A DBMS is great for storing data that fits into set categories. Although a
dictionary is made up entirely of entries, the type of data stored within them
varies greatly. Consider the problem of a bilingual dictionary, a Japanese to
English dictionary. The bilingual dictionary can be defined using data modeling
via an entity relationship diagram:

Figure 14 Possible entity relationship diagram for a Japanese to English bilingual dictionary
Choosing the key for the database is not very difficult. The only attributes
that uniquely identify an entry are the Kanji headword and the Kana headword.
It is a harder task to decide how the attributes should be grouped. The
relationships between attributes are interconnected, and trying to separate
them into relations results in confusion. Establishing additional relational
tables for further languages is likely to be an arduous task. It is difficult
to assign the different attributes to tables because a database management
system is not well suited to a dictionary file, for the following reasons:
- Not all potential users have access to such systems.
- Not all entries contain the same types of entities: one may be a cross-reference and another a complete analysis. Some entries may have multiple definitions and also multiple word types (for example, a noun that is also an adjective when used in a different context).
- It is unclear what constitutes a key and which attributes map to which tables. If someone wants to retrieve a list of nouns from the dictionary, it may be impossible to do so under some configurations of the tables in the DBMS.
- A DBMS stifles the variety of relationships that can be represented between lexical entries. If these are limited, it imposes barriers similar to those of object orienting the dictionary.
- The ER diagram has difficulty representing hierarchical information and cannot deal with data types that have variable structures.
- Space can be wasted if entries with variable structure are stored in tables, because even if an attribute is left blank in the table, it still takes up space in the database. Add up the different combinations of attributes in one entry and there is the potential to lose a lot of space just to keep empty fields.
- Entries in dictionaries can be very complex, requiring storage of different attributes in different tables if a DBMS were to be implemented. When the components of an entry are scattered through different tables, the number of operations (such as restrictions, projections and joins) increases to a point where retrieval of a single entry can become time consuming.
- In the above example of the ER diagram, the attributes shown are not the only data types; they may contain subentries that in turn contain their own subentries. This kind of nesting structure can result in an unusable database.
- Once the relational database is set up for a dictionary system, its structure is almost permanent. Addition or rearrangement of attribute groups is extremely difficult. For example, if it is decided that another language is going to be added to the database, it is extremely difficult to map it over the current setup. A major structural overhaul of a bilingual dictionary database would be needed in order to incorporate the extra language.
Despite all these negative comments about DBMS, there are several positive
aspects of using a DBMS in the context of lexicographical information
representation.
- Query handling is carried out by the system. Relational databases can also connect to systems that allow editing of the tables, and can be linked to applications that provide interfaces for querying the database.
- Headwords that are synonyms can share the same definition; they no longer need to be stored separately. One key can have a one-to-many relationship with entities in different tables, such as multiple definitions for an entry.
- Access is carried out through a query language such as SQL, making access relatively easy.
- The most important advantage of a DBMS is that it does not require the user to search the file from head to toe to find an entry. Compared to a technique such as SAX, retrieval times are far superior.
- Given the low price and ease of use of databases like dBASE and Access, a DBMS may seem like a good option.
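The one-to-many relationship and SQL access described in the points above can be sketched with Python's built-in sqlite3 module. The two-table schema, the column names and the sample row are assumptions made purely for illustration. They do not correspond to the ER diagram in Figure 14 or to any schema proposed for JMdict.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one entry and two glosses."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entry (
            ent_seq INTEGER PRIMARY KEY,
            kanji   TEXT,
            kana    TEXT
        );
        CREATE TABLE gloss (
            ent_seq    INTEGER REFERENCES entry(ent_seq),
            lang       TEXT,
            gloss_text TEXT
        );
    """)
    conn.execute("INSERT INTO entry VALUES (1, '走る', 'はしる')")
    conn.executemany(
        "INSERT INTO gloss VALUES (?, ?, ?)",
        [(1, "eng", "to run"), (1, "ger", "laufen")],
    )
    return conn


def headwords_for_gloss(conn, text):
    """Find headwords by gloss; the join is evaluated by the DBMS."""
    return conn.execute(
        "SELECT e.kanji, e.kana FROM entry e "
        "JOIN gloss g ON g.ent_seq = e.ent_seq "
        "WHERE g.gloss_text = ?",
        (text,),
    ).fetchall()


def glosses_for(conn, ent_seq):
    """One entry key maps to many gloss rows (one-to-many)."""
    return conn.execute(
        "SELECT lang, gloss_text FROM gloss WHERE ent_seq = ? ORDER BY lang",
        (ent_seq,),
    ).fetchall()
```

Even in this tiny example, retrieving a complete entry already requires a join across two tables, which hints at the retrieval cost noted in the list of drawbacks.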
There are also implementations of database-like systems that utilize the DOM form of document object. The biggest problem with the DOM structure is that it is entirely resident in memory. Others have recognised this problem, and in an effort to bypass it a group of developers, GMD-IPSI, have created software called PDOM that avoids the need to store the complete document structure in memory.
PDOM provides an
implementation of the DOM over indexed, binary files. These are created by
converting existing XML documents (a one-time operation), as well as when PDOM
is used to create a new DOM Document. PDOM includes a cache, which swaps DOM
nodes to disk when handling large DOM trees, defragmentation and garbage
collection facilities, commit points (for writing the in-memory tree to disk),
file compression with gzip, and thread-safe operation. This application has not
been tried or tested, but it is an alternative worth considering when deciding
to implement a dictionary system.
5.4 Hash tables
A hash table is a form of data structure that can offer very fast insertion
and searching. No matter how many data items there are, insertion and searching
can take close to constant time. Hash tables rely on the concept of a range of
keys being transformed into a range of array index values. In some cases, the
key can map directly onto an array position, removing the need for a
transformation. Despite these strong positive points, hash tables fall short of
being a good internal representation because they cannot cope with sequential
access. A detailed description of hash tables, and the reasons why they are not
a good technique for representing a multilingual file, are given in Appendix C.
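The key-to-array-index transformation, and the reason sequential access suffers, can be seen in the toy hash table below. The hash function (a simple multiply-by-31 string hash), the table size and all names are illustrative assumptions and are not taken from Appendix C.

```python
def bucket_of(key, size):
    """Transform a string key into an array index in the range 0..size-1."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % size
    return h


class ChainedHashTable:
    """Toy hash table that resolves collisions by chaining."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def insert(self, key, value):
        self.buckets[bucket_of(key, len(self.buckets))].append((key, value))

    def lookup(self, key):
        # Exact-match search inspects a single bucket.
        bucket = self.buckets[bucket_of(key, len(self.buckets))]
        return [v for k, v in bucket if k == key]

    def keys_with_prefix(self, prefix):
        # Alphabetical neighbours hash to unrelated buckets, so a
        # "begins with" search has to scan every bucket in the table.
        return sorted(
            k for bucket in self.buckets for k, _ in bucket
            if k.startswith(prefix)
        )
```

An exact lookup touches one bucket, whereas listing words in alphabetical order or by prefix degenerates into a scan of the whole table.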
5.5 Additional considerations
The preceding discussion describes the methods by which the dictionary file
can be stored and structured in a way that allows fast and flexible access.
This is but one link in the chain to a successful application. The search
string queries sent to the file structure to find entries must be studied
carefully. The type of information to be stored in a file such as an index file
also needs examination.
Indexing issues when creating the index file
There are some minor problems when indexing the JMdict file. The issue arises
because of the nature of the English entries. An entry may consist of a word,
two words, or a phrase. An entry such as ‘to run’ cannot simply be indexed
under its first word. If a user typed ‘to’ they would get an avalanche of
definitions. On the other hand, if the user typed ‘run’, they might not find
what they are looking for, since the index is searched alphabetically for
matching words. The word the index really needs from this entry is ‘run’. This
may be a difficulty not only for the English definitions but also for the
German definitions. One solution could be to have an exclusion list of words
that are to be omitted when indexing. Examples could be ‘to’, ‘the’, ‘or’ and
‘and’. The exclusion list would be applied when creating the index file. This
alone would reduce the size of the index file, because there are a lot of these
words in the dictionary. More importantly, it would boost the effectiveness of
the searches. An additional technique to ensure that the user can retrieve the
correct headword would be to store every word contained in the English
definition in the index file. For example, an entry ‘hot water’ would be stored
under two separate index words, ‘hot’ and ‘water’. This would allow a user
searching for ‘water’ to retrieve derivatives or compound words containing the
word ‘water’.
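Both ideas, the exclusion list and the indexing of every word in a definition, can be sketched as follows. The contents of EXCLUDED are the four examples given in the text, and the function names and sample entries are hypothetical.

```python
# Words omitted when the index is built (the examples given in the text).
EXCLUDED = {"to", "the", "or", "and"}


def index_terms(gloss):
    """Split a gloss into index words, dropping excluded words."""
    return [w for w in gloss.lower().split() if w not in EXCLUDED]


def build_word_index(entries):
    """Map each remaining word to the set of entries whose gloss contains it.

    `entries` is an iterable of (entry_number, gloss) pairs.
    """
    index = {}
    for ent_seq, gloss in entries:
        for term in index_terms(gloss):
            index.setdefault(term, set()).add(ent_seq)
    return index
```

With this arrangement a search for ‘water’ reaches both a ‘water’ entry and a ‘hot water’ entry, while ‘to’ never appears as an index word.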
Dictionaries are constantly growing in size. Take the Oxford dictionary for
example. This English dictionary undergoes updates and additions every year as
new words are created. A multilingual dictionary will always continue to grow
in size. JMdict currently supports only 3 languages, and even this support is
limited. So when choosing the internal representation of the dictionary file,
scalability is a significant issue. The system must be able to:
- Accommodate large additions in data to all aspects of the file
- Be flexible enough to allow addition of further fields and entry types
- Still maintain efficiency with much larger files
- Cope with the addition of another language
As discussed in a previous section, a DBMS is able to grow along with the
dictionary file. Its Achilles heel is its inability to cope with the inclusion
of additional fields and entry types. Hash tables are poor in providing
scalability and sequential access. The indexing system should be able to cope
reasonably well with most of these factors.
The problem of visualization is a challenging one. Providing a translation of
a word from the user’s native language to a second language is of little help
without giving advice on its usage and its relationship to other words. This
limitation does not come only from the format of the dictionary; failure to
deliver the information to the user can also be the fault of the interface to
the dictionary. Furthermore, an electronic interface can deliver much more than
a print dictionary, and it should reflect this fact. The success of an
application also depends on whether a user finds the interface attractive and,
more importantly, intuitive and easy to navigate. Unless the software is
outstanding, some users will not accept a steep learning curve.
The primary focus of the interface of the electronic dictionary is the user.
The amount and format of information presented to the user can be varied far
more than in print dictionaries. A user’s most basic requirements of a
dictionary are:
- To enter a query
- To view the entry in the dictionary that corresponds to the search
Therefore, the dictionary interface should be built around these two
aspects of a user’s needs.
Print dictionaries only ever supported a search for a specific entry.
As mentioned previously, the electronic storage of dictionaries allows
increased search functionality, provided adequate data representation
techniques are used. The following are some of the different input and
searching methods that can be employed:
- Complex searching options. This is one of the major features that electronic
  dictionaries have over print dictionaries: being able to search not only in
  alphabetical order, but in other interesting ways:
  - Searching for verbs, nouns, adjectives
  - Searching for words beginning or ending with certain letter combinations
  - Searching for synonyms and antonyms
  - Searching for ‘n’ letter words
  - Listing the dictionary in alphabetical (traditional print) format
  - Combination searches (search for nouns that begin with “trace”)
  - Restriction of a search; re-search the list of entries with a different search
  - Returning words that sound like “grass”
  These complex searching techniques can be made possible by a data
  representation scheme such as indexing. Multiple index files would be
  created, each providing a different index. This allows unique searches to be
  carried out, as well as combination searches that utilize several of the
  index files at one time.
- Follow up cross references in an instant. Cross references displayed after a
  search should be able to be followed up. It should be possible to submit a
  cross-reference as a search query, either by clicking on the text or by some
  other form of requesting the cross-reference entry.
- Choice of different input schemes for users of different languages. Input
  schemes need to be provided for languages that cannot be entered through a
  western keyboard. Research into this area is extremely broad and diverse
  depending on the language, and for this reason it was not investigated in
  this project.
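As a rough illustration of how several indexes might be combined for the search options listed above, the following is a minimal in-memory sketch in Python. The sample entries, the function names and the choice of one index per part of speech, one per word length and one sorted headword list are all assumptions made for this example; they are not taken from JMdict or from any existing dictionary application, and a real system would store the indexes in files.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical sample entries: (headword, part of speech).
ENTRIES = [
    ("trace", "noun"),
    ("trace", "verb"),
    ("tracer", "noun"),
    ("track", "noun"),
    ("tract", "noun"),
    ("trade", "verb"),
    ("grass", "noun"),
]


def build_indexes(entries):
    """Build one index per kind of search; each maps a key to entry numbers."""
    by_pos = defaultdict(set)
    by_length = defaultdict(set)
    for number, (word, pos) in enumerate(entries):
        by_pos[pos].add(number)
        by_length[len(word)].add(number)
    # A sorted list of (headword, entry number) pairs supports prefix and
    # alphabetical searches.
    alphabetical = sorted((word, number) for number, (word, _) in enumerate(entries))
    return by_pos, by_length, alphabetical


def with_prefix(alphabetical, prefix):
    """Return the entry numbers whose headword starts with `prefix`."""
    start = bisect_left(alphabetical, (prefix, -1))
    found = set()
    for word, number in alphabetical[start:]:
        if not word.startswith(prefix):
            break
        found.add(number)
    return found


by_pos, by_length, alphabetical = build_indexes(ENTRIES)

# Combination search: nouns that begin with "trace".
hits = by_pos["noun"] & with_prefix(alphabetical, "trace")
print(sorted(ENTRIES[n][0] for n in hits))  # ['trace', 'tracer']
```

Each index answers one kind of question, and a combination search is simply the intersection of the entry numbers returned by two or more indexes. Restricting an earlier search works the same way, by intersecting the previous result set with a new one.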
Because the JMdict file is stored in Unicode and the programming language is capable of handling Unicode, the next character-encoding issue is displaying the Unicode information on the screen. MS Windows 9x operating systems do not use Unicode fonts as their primary fonts. If Unicode is required on these systems, specialized Unicode fonts need to be installed or downloaded via the internet. Because the dictionary is a stand-alone application, Unicode support for the display of foreign characters must be incorporated into the software. The investigation of ways to display Unicode was outside the scope of the project.
The display of entries to the user can be creative, yet practical. Many things
can be displayed to the user at one time. The following is a list of
features and displays that can be provided for the user:
- Pictures and sounds for the user. Sounds can give the user an audio
  experience that aids in learning the language from the speech perspective.
  Dictionaries have usually been most useful for writing and reading
  languages, but the inclusion of audio adds an extra dimension of language
  learning to the dictionary experience. Illustrative examples in bilingual
  dictionaries can contribute to a user’s interest by showing the word in a
  live context, and can enhance understanding of the grammatical and semantic
  rules governing the usage of the word by showing these rules in action.
  Pictorial illustrations can serve 2 purposes in a bilingual dictionary.
  Firstly, they can cue and reinforce verbal equivalents, especially when the
  user can identify with the picture. Secondly, they serve as generalizing
  examples when several different but relevant pictures are given in order to
  establish concepts.
- Customization to user needs: the ability to choose what type of information
  is presented. Various users may use this dictionary. Electronic dictionaries
  should be able to cater for young students and also for specialists. To
  allow such flexibility in one program, the display should be customizable,
  for example by hiding or displaying the pronunciation, etymologies, variant
  spellings, part of speech, bibliography, cross references, examples and
  pictures.
- Tree or graph displays. To demonstrate the relationship between words in
  the dictionary, the user can be given the option of displaying
  cross-references or special words such as synonyms in the form of a graph.
  The following is an example of a possible synonym graph.

Figure 15 Example graph layout
of the synonyms for ‘gruff’
- Use of colours: Various colours can be used to denote different attributes
  of an entry and can also provide links for cross-references.
- Switching between languages: Because the application is a multilingual
  environment, the user should be able to switch between different language
  displays at any time. An entry that has been displayed in Japanese should
  allow switching to a German display. This can be useful in finding language
  equivalents for a particular word or phrase.
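To make the tree or graph display option above more concrete, the following Python sketch collects the links that a synonym graph would draw around a starting word. It is only an illustration: the synonym links are invented for the example, and the function name `graph_edges` and the depth limit are assumptions rather than part of any existing tool.

```python
from collections import deque

# Hypothetical synonym links; a real application would read these from the
# cross-reference fields of the dictionary entries.
SYNONYMS = {
    "gruff": ["hoarse", "brusque"],
    "hoarse": ["gruff", "husky"],
    "brusque": ["gruff", "curt"],
    "husky": ["hoarse"],
    "curt": ["brusque", "blunt"],
    "blunt": ["curt"],
}


def graph_edges(start, max_depth):
    """Collect the links that first reach each word within `max_depth` steps."""
    seen = {start}
    edges = []
    queue = deque([(start, 0)])
    while queue:
        word, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand words at the depth limit
        for neighbour in SYNONYMS.get(word, []):
            if neighbour not in seen:
                seen.add(neighbour)
                edges.append((word, neighbour))
                queue.append((neighbour, depth + 1))
    return edges


for source, target in graph_edges("gruff", 2):
    print(f"{source} -- {target}")
```

The breadth-first traversal keeps the display centred on the word the user searched for, and the depth limit stops the graph from growing to cover the whole dictionary.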
A sample mock-up of a
possible interface for users has been created. The interface demonstrates some
of the aspects highlighted earlier. A sidebar has been included to allow users
to view other entries located around the target entry, in addition to providing
multimedia in the form of pictures. Buttons or hyperlinks can be used to filter
different displays for the user.

Figure 16 Sample interface for a
multilingual dictionary
6.3 XSL
XML is a customizable markup language. In the case of JMdict, it is optimized to handle pure data. This data needs to be processed so that formatted articles can be displayed to the user. A stylesheet is required to automatically convert the document from the abstraction (XML) to a formatted rendition. The stylesheet language used for this is called XSL.
XSL (eXtensible Stylesheet Language) is a stylesheet language. A stylesheet is
a template that describes how to present documents. It enables a document to
be presented in a variety of ways, such as on a monitor, on paper or even as
speech. XSL is used to apply style to XML documents. XSL is a declarative
language. Each ‘rule’ element must have a target-element and an action. Each
element in a document matches a single rule.
The XSL stylesheet will look at each element and apply the correct rule to it.
XSL consists of two parts: 1) a language for transforming XML documents into other
XML documents (XSLT), and 2) an XML vocabulary for expressing formatting
semantics.
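The idea of rules that pair a target element with an action can be imitated in a few lines of Python. The sketch below is not XSL and does not use an XSL processor; it only mimics the rule-matching idea with the standard `xml.etree.ElementTree` module. The entry is a simplified, hypothetical one (a real JMdict entry nests these elements more deeply and carries more fields), and the HTML produced by each rule is an arbitrary choice made for the example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical entry; a real JMdict entry has more structure.
ENTRY_XML = """
<entry>
  <keb>湯</keb>
  <reb>ゆ</reb>
  <gloss>hot water</gloss>
  <gloss>hot bath</gloss>
</entry>
""".strip()

# Each rule pairs a target element name with an action that formats its text.
RULES = {
    "keb": lambda text: f"<h1>{text}</h1>",
    "reb": lambda text: f"<p class='reading'>{text}</p>",
    "gloss": lambda text: f"<li>{text}</li>",
}


def apply_rules(root):
    """Visit every element once and apply the single rule that matches it."""
    output = []
    for element in root.iter():
        action = RULES.get(element.tag)
        if action is not None:
            output.append(action(element.text.strip()))
    return output


for line in apply_rules(ET.fromstring(ENTRY_XML)):
    print(line)
```

Swapping in a different `RULES` table changes the presentation without touching the data, which is the separation that a stylesheet is meant to provide.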
The Extensible Stylesheet
Language Transformations (XSLT) W3C recommendation describes a transformation
vocabulary used to specify how to create new structured information from
existing XML documents. XSLT implements transformation by example, not by
program logic. Templates are created that tell the XSLT engine how to transform
the XML document. There are instructions included in the template file for the
XSLT engine to find the information in the XML file. XSLT may be useful if
temporary XML files need to be created containing a certain list of entries for
display to the user. This new XML file may be passed to the XSL processor to
produce the desired output.
An XSL stylesheet consists
of a set of construction rules that are defined for the conversion of an XML
source tree into a new XML document that is expressed in a hierarchy of
formatting objects called flow objects. This means that XSL stylesheets are XML
documents. These flow objects describe exactly how the document should be
printed. When using the XSL formatting objects to present information, the
objective of the stylesheet is to transform the XML information into a
hierarchy exclusively comprised of the XSL formatting object vocabulary. A
rendering engine then takes this resultant hierarchy and interprets the
semantics of the XSL vocabulary to produce the desired format.
XSL increases the usefulness of XML. It provides the ability to reuse data and
to provide a standard style of presentation, while also allowing customized
presentations for different users. For example, with a product database, a
user may want to see which of a company’s products are in stock. Instead of
drawing up a whole new database for them, the existing factory database can
be used together with a stylesheet that only allows viewing of the
database.

Figure 17
Diagram outlining the processes that take place to process an XML document for
viewing.
The motivation behind the research was to combine
all the different research and information associated with creating a
multilingual dictionary into one report. It has focused primarily on the
handling of the XML-based JMdict file and its role in the multilingual
dictionary context.
Review
This thesis has provided a platform for the analysis
of a broad range of issues relating to electronic dictionaries, with the major
focus on multilingual dictionaries. It has covered the operations required of a
dictionary: the input of a query, searching for the entry and display of
results. It is quite rare for the analysis of multilingual problems and their
possible solutions to be combined into one thesis report.
Many XML tools have been constructed
without a dictionary file being taken into consideration, so the
investigation of their applicability will allow others to benefit from the
analysis. This thesis has been undertaken with the hope that future developers
of a multilingual dictionary application will have a better understanding of
the difficulties and possible solutions to some of the questions raised.
Based on the investigations carried out on the
different storage and retrieval techniques, indexing the JMdict
file currently appears to be the most effective option. Indexing allows multiple keys
to be stored and also provides for quick access that can be carried out in a
sequential order. The two XML APIs, DOM and SAX, are on their own unable to support
an XML structure as large and complex as the JMdict. DOM simply cannot
provide memory-efficient ways of representing entries as objects. SAX is
event-driven, restricting the range of searching and retrieval methods.
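One way to picture SAX and indexing working together is to let the event-driven parser stream through the file once and keep only an index, rather than the entries themselves. The following Python sketch uses the standard `xml.sax` module for this. The sample data is a simplified, hypothetical stand-in for JMdict, and the class name `IndexBuilder` and the choice of indexing gloss words against entry sequence numbers are assumptions made for illustration only.

```python
import xml.sax

# Simplified, hypothetical JMdict-like data, encoded as UTF-8 bytes.
SAMPLE = """<JMdict>
<entry><ent_seq>1</ent_seq><keb>湯</keb><gloss>hot water</gloss></entry>
<entry><ent_seq>2</ent_seq><keb>水</keb><gloss>water</gloss></entry>
</JMdict>""".encode("utf-8")


class IndexBuilder(xml.sax.ContentHandler):
    """Stream through the file once, keeping only a gloss-word index."""

    def __init__(self):
        super().__init__()
        self.index = {}    # gloss word -> set of entry sequence numbers
        self._text = []    # character data of the element being read
        self._seq = None   # sequence number of the current entry

    def startElement(self, name, attrs):
        self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        text = "".join(self._text).strip()
        if name == "ent_seq":
            self._seq = int(text)
        elif name == "gloss":
            # Index every word of the gloss, so that 'hot water' is also
            # found by a search for 'water'.
            for word in text.lower().split():
                self.index.setdefault(word, set()).add(self._seq)
        self._text = []


handler = IndexBuilder()
xml.sax.parseString(SAMPLE, handler)
print(sorted(handler.index["water"]))  # [1, 2]
```

Only the index is held in memory after parsing. Retrieving the full text of a matching entry would need a second step, such as storing file offsets alongside the sequence numbers, which this sketch leaves out.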
Restrictions and Future work
While the thesis has been a thorough review, there
are many aspects that can be investigated further and implemented. Because the
topic is so broad, the time frame has restricted the research. It
would have been advantageous to have begun an implementation of a multilingual
dictionary. However, the lack of information and tools specifically for
multilingual dictionaries meant that the time was instead invested in combining all
the information into a single document. Despite this, there are still several
gaps in the research and subsequently in the thesis report.
A proper design and subsequent prototype can be
created because the major issues have been identified and in some situations,
suggestions are given. An important extension to this thesis would be to
produce a proof-of-concept application. Such an application would combine the
techniques outlined for each of the programming concepts required for each
stage. A demonstration of the application is often a good way of proving that
the suggested techniques work well in the environment.
Input techniques such as those described in Chapter
3 are so diverse amongst different languages that investigation into the most
appropriate techniques to use could result in a completely different thesis
report. Japanese Kanji and Kana input would need analysis along with the
traditional input methods. As more languages are added to the
XML JMdict, further research would be required to establish input techniques.
Studying the JMdict file itself was touched upon
only briefly in this paper. However, a more in-depth look could be taken. If
the JMdict were to be altered structurally, the impact on internal
representation of the file could be examined, identifying the advantages and
disadvantages of different changes. An example of a change could be the impact
of converting some of the data into attributes. In addition, further discussion
on the drawbacks of XML and the JMdict could be carried out along with possible
solutions.
Investigations into user learning ability when given
different user interfaces would be an interesting future research topic. From
this form of investigation, it may be possible to determine the best ways to
present information to the user and identify which types of interfaces are best
for users to navigate through.
Appendices
A variety of different types of
electronic dictionaries have been developed around the world. A quick search with a
search engine will bring up a host of electronic dictionary web sites. Of
notable interest is the Oxford English Dictionary [OED, 2000]. This dictionary
is the largest English dictionary in the world, containing 20 volumes in its
paperback version. The OED is stored in flat files with SGML (Standard
Generalised Markup Language) markup. The following table describes the type of
data available per entry that a user can access. The OED contains the most
information per entry pertaining to a single language.
| Headword section | Sense section | Special types of main entries | Cross-reference entries |
| status symbol | status symbol | letters of the alphabet | |
| headword | sense number | initialisms, acronyms, abbreviations | |
| pronunciation | label | affixes and combining forms | |
| part of speech | definition and definition note | proper names | |
| homonym number | quotation paragraph | erroneous, spurious, or ghost words | |
| label | date of publication | lengthy entries | |
| variant forms | author | | |
| etymology and etymological note | title | | |
| | text of quotation | | |
| | compounds | | |
| | derivatives | | |
Table A1. Illustration of the structure and data contained in each entry [OED, 2000]
EDICT
EDICT is an electronic dictionary
file containing Japanese entries with translations to English created by Breen
[2000 (a)]. Initially created as a file for a piece of software called Moke
[Breen, 2000 (1)], the dictionary size has grown to over 100,000 entries. The
entries are very simple (lacking structure) and as the following sample of
EDICT entries demonstrates, only a headword and definition are included per
entry:

Figure A18 Sample of the EDICT
dictionary file
The format of the entries is as follows:
Kanji entry (if any) [kana entry] /English definition/
KEBI
KEBI stands for Kamus Elektronik
Bahasa Indonesia, which means the Indonesian Electronic Dictionary. KEBI was
developed as part of the Multilingual Machine Translation System Project.
It contains 22,500 root word entries, from which a total of 43,500 derivation
words are formed. The dictionary structure consists of the following information:
- Morphological information: consists of the suffixes and prefixes that are
  attached to root words to form derivation words
- Syntactic information: consists of parts of speech
- Semantic information: consists of the semantic category of a word
- Concept description: contains the word meaning, described in English
The following figure illustrates the results of a search in Indonesian.

Figure A19 Output display of KEBI
system.
Format of entries:
Head Word Information
Morphological information
Syntactic Information
Semantic Information
Concept Information
EDR
The Japanese EDR (Electronic Dictionary Research) dictionary was a project funded
by 8 Japanese electronics companies: Fujitsu, NEC, Hitachi, Sharp, Toshiba, Oki,
Mitsubishi and Matsushita. It was developed for advanced processing of natural
language by computers. It is made up of eleven sub-dictionaries. The
sub-dictionaries include a concept dictionary, word dictionaries and bilingual
dictionaries. Altogether there are tens of thousands of entries. Of great
interest is the bilingual dictionary. This dictionary consists of an English to
Japanese and a Japanese to English file. The files list the correspondences
between headwords in the different languages. The Japanese-English bilingual
dictionary contains 230,000 words, and the English-Japanese bilingual dictionary
contains 190,000 words. The following figure shows the structure of an entry
in the bilingual dictionary.
Each record consists of:
- Entry information
- Grammatical information
- Semantic information
- Bilingual information: consists of correspondence word information
  (equivalent, paraphrase, direct translation, romanization/katakana, and
  explanation)
- Part of speech

Figure A20 Diagram displaying the structure of a Japanese/English entry in EDR [EDR, 2000]
Appendix B
Text representation
Asian character representation on computers, in particular Japanese
representation, is rather unique in that the characters do not resemble or fit
into the traditional European displays.
In addition to the great differences in representation, they require special
handling when being displayed on Western computers. There used to be no
character set for Japanese that was universally recognized. In addition to the
representation standards, there are many differences between Western and Asian
languages relating to the actual display and use of the language.
The Japanese language is not composed of just a single character set;
it is made up of three different types of characters: Hiragana (native Japanese
words), Katakana (Japanese representation of non-Japanese words) and Kanji
(Chinese characters). The need for Japanese representation resulted in the
creation of a number of different character representation standards.
Half-width katakana was the very first attempt at Japanese
character representation on computers. It could be displayed relatively easily
on Western computers because its characters occupy the same space as ASCII
characters.
JIS X
0208:1997 has been the most widely used of the Japanese electronic character
sets. It was created in 1978 and was to become the very first Japanese coded
character set to include Chinese characters.
After
the release of JIS X 0208-1990 (an extension to JIS X 0208:1997 character set),
Ken Lunde released a document called ‘Japan.inf, Electronic handling of
Japanese Text’. This article informed users on how to handle Japanese on a
variety of platforms.
Encoding is the mapping of a
numeric value to a character. Throughout the different Asian character sets,
the main encoding methods are EUC (Extended Unix Code) and ISO-2022-JP. There is
also a Japan-specific encoding called Shift-JIS. These encoding
standards support one or several of the different types of Japanese text representation.
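The difference between these encodings can be seen by encoding the same short piece of Japanese text with each of them. The sketch below uses Python's built-in codecs `euc_jp`, `iso2022_jp` and `shift_jis`, with UTF-8 included for comparison; the sample text and the helper name `encode_all` are chosen only for this illustration.

```python
CODECS = ("euc_jp", "iso2022_jp", "shift_jis", "utf-8")


def encode_all(text):
    """Encode `text` with each codec and return a mapping of codec to bytes."""
    return {codec: text.encode(codec) for codec in CODECS}


sample = "日本語"  # three kanji meaning 'Japanese language'
for codec, data in encode_all(sample).items():
    # Decoding with the same codec gives back the original characters.
    assert data.decode(codec) == sample
    print(f"{codec:12} {len(data):2} bytes  {data.hex()}")
```

The same three characters become different byte sequences under each encoding, which is why software must know which encoding a file uses before it can display the text correctly. ISO-2022-JP additionally wraps the text in escape sequences that switch between character sets.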
Chinese and Korean text representation
Chinese characters,
otherwise known as Hanzi, are the most complex type of Asian character set. In
comparison with the English language, which has 26 unique letters, the Chinese
character spectrum is huge, containing over ten thousand characters from
different regions. Therefore, the encoding and representation standards for
Chinese can become rather complex. To complicate matters even further, there
are traditional and simplified representations of the same characters that must
be handled.
In China, the Chinese
character set standard is GB or ‘guo biao’ which stands for national
standard. This standard enumerates several thousand Chinese characters.
Numerous extensions and corrections to this standard have been carried out
since its introduction in 1981. In Taiwan, a country that does not use
simplified characters, the character set standard used is called Big Five. It
is an unofficial standard, but has been adopted and widely used by the
Taiwanese people. Established in 1984, it has the capacity to store fourteen
thousand characters. CNS, another character set standard, is seen to be a
corrected and updated version of Big Five. It has the largest capacity,
enumerating 48,000 Hanzi. [Lunde, 1999]
Korean is represented using
Hangul characters. These are a totally different set of characters in
comparison with Kanji and Han characters. The character enumeration standard KS
X was established in 1992 by South Korea. This standard contains almost five thousand
entries. North Korea also developed their own character standard in 1997 called
KPS that enumerates over 8,000 characters. [Lunde, 1999]
Appendix C
A hash table is a form of data
structure that can offer very fast insertion and searching. No matter how many
data items there are, insertion and searching can take close to constant time.
Hash tables rely on the concept of having a range of keys that are transformed
into a range of array index values. In some cases, the key can map directly
onto an array index, removing the need for a transformation.
It may seem a straightforward
idea to simply map each character to a unique number and create an index file
in this manner. However, after a short consideration, this plan has many flaws.
Many of the words will have the same index, and there are not enough distinct
numbers to cover all of the words in a dictionary. Another possible way of mapping
each entry to a unique number could be to give each character of a
word a weight 10 times as big as the position to its right. For example:
If ‘a’ = 1, ‘b’ = 2, ‘c’ = 3, ‘d’ = 4, ‘e’ = 5
Then the word ‘ace’ would be indexed as the number:
$1 \times 10^2 + 3 \times 10^1 + 5 \times 10^0 = 135$
This poses another set of
problems in that a long word such as ‘encyclopaedia’, a 13 letter
word, would result in an extremely large index number, of the order of $10^{12}$,
which needs far more space in an array than is required. Therefore a method is required to
map this range of numbers into a reasonably sized array. A hash function
is required to convert numbers in a large range into numbers in a smaller
range. A simple hash function is to take the remainder after dividing the large
number by the array size, which reduces the range of numbers to fit the size of
the array.
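The positional weighting and the remainder step described above can be written out as a short Python sketch. The function names and the assumption that words contain only the lowercase letters ‘a’ to ‘z’ (numbered 1 to 26) are choices made for this illustration.

```python
def word_to_number(word):
    """Weight each letter ('a' = 1 ... 'z' = 26) by a power of 10."""
    number = 0
    for letter in word:
        # Shifting the running total by 10 gives each letter a weight ten
        # times that of the letter to its right.
        number = number * 10 + (ord(letter) - ord("a") + 1)
    return number


def hash_index(word, array_size):
    """Fold the large number into the array by taking the remainder."""
    return word_to_number(word) % array_size


print(word_to_number("ace"))                  # 135
print(word_to_number("encyclopaedia"))        # a number above 10**12
print(hash_index("encyclopaedia", 1000))      # always between 0 and 999
```

Because letters from ‘j’ onward have values above 9, this weighting does not even guarantee unique numbers before the remainder is taken: ‘ak’ and ‘ba’ both give 21.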
When this form of hashing is
used, different words such as ‘bat’ and ‘tab’ can hash to the same
index. With far more possible words than array cells, several different
words will inevitably hash to the same array location. This is a collision. There are
solutions to this kind of problem. Open addressing is the term used to describe
searching the array in a sequential manner to find the next empty
cell and placing the colliding item in that cell. A problem with open
addressing is that when there are many entries that share the same index number
(which there undoubtedly will be for an English dictionary), clustering will
occur: filled sequences of array cells grow longer and longer. An
alternative is to create a linked list at each index entry. This way, the
items with matching index numbers are linked together, anchored at the front to the
correct index slot in the array. This technique is called separate chaining. An
issue with separate chaining is that when there are many entries that
have the same index number, searching will take longer. The following are
graphs comparing successful and unsuccessful searches using both open
addressing and separate chaining techniques.

Figure C1 Graph of the performance of searching using open
addressing

Figure C2 Graph of the performance of separate chaining.
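The two collision strategies can be compared in a small Python sketch. It assumes a deliberately simple hash function (the sum of the character codes, modulo the table size) under which ‘bat’ and ‘tab’ always collide; the class and function names are hypothetical, and the open addressing shown is the plain sequential (linear probing) variant.

```python
def simple_hash(word, size):
    """Hypothetical hash: sum of the character codes, modulo the table size."""
    return sum(ord(ch) for ch in word) % size


class OpenAddressingTable:
    """Linear probing: on a collision, step forward to the next empty cell."""

    def __init__(self, size):
        self.cells = [None] * size

    def insert(self, word):
        size = len(self.cells)
        start = simple_hash(word, size)
        for step in range(size):
            slot = (start + step) % size
            if self.cells[slot] is None or self.cells[slot] == word:
                self.cells[slot] = word
                return slot
        raise OverflowError("table is full")

    def contains(self, word):
        size = len(self.cells)
        start = simple_hash(word, size)
        for step in range(size):
            slot = (start + step) % size
            if self.cells[slot] is None:
                return False  # an empty cell ends the probe
            if self.cells[slot] == word:
                return True
        return False  # full table: every cell was examined


class ChainedTable:
    """Separate chaining: each cell holds a list of the words that hash there."""

    def __init__(self, size):
        self.cells = [[] for _ in range(size)]

    def insert(self, word):
        chain = self.cells[simple_hash(word, len(self.cells))]
        if word not in chain:
            chain.append(word)

    def contains(self, word):
        return word in self.cells[simple_hash(word, len(self.cells))]


probing = OpenAddressingTable(11)
print(probing.insert("bat"), probing.insert("tab"))  # 3 4

chained = ChainedTable(11)
chained.insert("bat")
chained.insert("tab")
print(chained.cells[3])  # ['bat', 'tab']
```

With open addressing the second word spills into the neighbouring cell, which is how clusters start to form. With separate chaining both words stay in the same cell, at the cost of a longer list to search.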
In the case of the
JMdict, the dictionary keys are not well behaved; they are words of different
lengths. A hash table, provided the dictionary fits into computer memory, is
attractive solely because it can be accessed quickly. There are currently
approximately 500,000 entries in JMdict. To access a word
from the hash table, the word must be converted into an index
number using the hash function. The array would need to hold not just 500,000
cells but at least twice that number, because a completely full hash table
decreases the efficiency of searches.
There are several disadvantages. Hash tables are based on arrays, and arrays are difficult to expand once they have been created. For some hash tables, the performance of a search can decrease rapidly when the table becomes too full. If a search were carried out on a full hash table, an unsuccessful search would result in a complete linear search of the table. In addition, hash tables are not very suitable for growing data sets. The JMdict is sure to grow in the future, which would mean reworking the hash function and increasing the size of the hash table.
The hash table would be indexed by one value only. This restricts the flexibility of the system because it allows only one specific type of search to be carried out. Multiple hash tables would be required for different languages, and they would not all be able to fit into memory.
There is no convenient way to visit the items in a hash table in any kind of order (such as in alphabetical order, or words ending with ‘less’). This is quite an important drawback. Many of the searches will come in the form of finding a certain entry, but a user may also want to view a list of words in alphabetical order (as in the traditional format of viewing a dictionary). When this case arises, it will be very inconvenient to go through the hash table to retrieve the entries in alphabetical order. As suggested by Lafore, if the capability of ordered searching is required, a different data structure should be sought out [Lafore, 1998]. Therefore the optimum conditions for a hash table implementation would be no requirement for ordered item visitation and an accurate prediction of the size of the hash table, attributes that the multilingual dictionary does not possess.
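For contrast, the ordered access that a hash table lacks is straightforward with a sorted index. The following Python sketch uses the standard `bisect` module on a small, invented word list; the function names and the idea of keeping a second index of reversed headwords for ending searches are assumptions made for the example, not a description of an existing system.

```python
from bisect import bisect_left

# Hypothetical headwords.
WORDS = ["careless", "harm", "harmful", "harmless", "heart", "heartless", "water"]

forward = sorted(WORDS)                           # alphabetical index
backward = sorted(word[::-1] for word in WORDS)   # index of reversed headwords


def neighbours(word, count):
    """Return up to `count` headwords on each side of `word`, in order."""
    position = bisect_left(forward, word)
    return forward[max(0, position - count):position + count + 1]


def ending_with(suffix):
    """Find words with the given ending via the reversed-headword index."""
    key = suffix[::-1]
    start = bisect_left(backward, key)
    found = []
    for reversed_word in backward[start:]:
        if not reversed_word.startswith(key):
            break
        found.append(reversed_word[::-1])
    return sorted(found)


print(neighbours("harmless", 1))  # ['harmful', 'harmless', 'heart']
print(ending_with("less"))        # ['careless', 'harmless', 'heartless']
```

Both lookups rely on the entries being kept in sorted order, which is exactly the property a hash table gives up in exchange for its fast single-key access.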
Bibliography
[Al-Kasimi, 1977] Al-Kasimi, A. (1977) Linguistics
and Bilingual Dictionaries. E.J. Brill, Leiden.
[Atkins, 1985] Atkins, B. ‘Monolingual and Bilingual
Learners’ Dictionaries: A Comparison’ Dictionaries, Lexicography and Language
Learning, Pergamon Press, 1985. Pages 15-24
[Ballesteros, 1996] Ballesteros, L and Croft, W.B.
(1996) Dictionary methods for cross-lingual information retrieval. Proceedings
of the 7th International DEXA Conference on Database and Expert
Systems. Pages 791-801
[Brajnik et al., 1996] Brajnik, G. and Mizzaro, S.
and Tasso, C. (1996) Evaluating User Interfaces to Information Retrieval
Systems: A Case Study on User Support. SIGIR ’96 Zurich, Switzerland.
Pages 128-136
[Breen, 2000 (b)]
Breen, J.W. (2000) A WWW Japanese Dictionary. Japanese Studies Centre
Symposium, Melbourne, Australia.
[Breen, 1995] Breen, J.W.
(1995) Building an Electronic Japanese-English Dictionary. Japanese
Studies Association of Australia Conference, July 1995, Brisbane, Australia.
[Callan] Callan, J., Croft, B. and Harding, S. The INQUERY Retrieval System. Proceedings of the 3rd International Conference on Database and Expert Systems
[Ceponkus & Hoodbhoy, 1999] Ceponkus, A.,
Hoodbhoy, F. Applied XML. A toolkit for programmers. Wiley
Publishing, 1999
[Croft et. al] Croft, B., Broglio, J. and Fujii, H. Applications
of Multilingual Text Retrieval, Proceedings of the 29th Annual Hawaii
International Conference on System Sciences, Pages 98 - 107
[Croft, 1995] Croft, W. and Xu, J. (1995) Corpus-Specific
Stemming using Word Form Co-occurrence. Proceedings for the Fourth
Annual Symposium on Document Analysis and Information Retrieval, Las
Vegas, Nevada Pages 147-159
[Davis, 1998] Davis, M. (1998) Free resources and
advanced alignment for cross-language text retrieval. Proceedings of the
sixth text retrieval conference (TREC-6), Gaithersburg, MD: National Institute
of Standards Technology (NIST).
[EDR,
2000] Japanese Electronic Dictionary Research Institute: http://www.iijnet.or.jp/edr/ (2000)
[Eichmann et al., 1998] Eichmann, D. and Ruiz, M.E.
and Srinivasan P. (1998) Cross-Language Information Retrieval with the UMLS
Metathesaurus. SIGIR ’98 Melbourne Australia. Pages 72-80.
[Erbach,
1997] Erbach, G., Neumann, G., Uszkoreit, H. MULINEX, Multilingual Indexing,
Navigation and Editing Extensions for the World Wide Web. Project Note
at AAAI Symposium on Cross-Language Text and Speech Retrieval, Stanford, 1997
[Fujii,
1993] Fujii, H., Croft, W. (1993) A Comparison of Indexing Techniques for
Japanese Text Retrieval. Annual International ACM/SIGIR Conference on
Research and Development in Information Retrieval. Pittsburgh, USA
[Goldfarb,
1998] Goldfarb, C. F. and Prescod, P (1998) The XML Handbook. Prentice
Hall
[Han
et. al, 1994] Han, C., Fujii, H. and Croft, W. (1994) Automatic Query
Expansion for Japanese Text Retrieval. Technical report, Department of
Computer Science, University of Massachusetts, Amherst
[Hartmann,
1983] Hartmann, R. (1983) Lexicography: Principles and Practice. Academic
Press
[Hlava
et al., 1997] Hlava, M., Belonogov, G., Kuznetsov, B., Hainebach, R. (1997) Cross
Language Retrieval – English/Russian/French. American Association for
Artificial Intelligence, Spring Symposium Series, 1997
[Hull and Grefensette, 1996] Hull, D.A. and
Grefenstette, G. (1996) Querying Across Languages: A Dictionary-Based
Approach to Multilingual Information Retrieval. Annual International
ACM/SIGIR Conference on Research and Development in Information Retrieval.
1996, Zurich, Switzerland.
[Hunter
& McLaughlin, 2000] Hunter, J and McLaughlin, B, JDOM Introduction:
http://javaworld.com/javaworld/jw-05-2000/jw-0518-jdom.html
(2000)
[IME, 2000] Microsoft Global IME: http://www.microsoft.com/Windows/ie/Features/ime.asp
(2000)
[JDOM, 2000] JDOM: www.jdom.org (2000)
[Jones, 1999] Jones, G et al. (1999) A Comparison
of Query Translation Methods for English-Japanese Cross-Language Information
Retrieval. SIGIR ’99 Berkeley, CA, USA. Pages 269 – 270
[KEBI,
2000] KEBI Online - The Indonesian Electronic Dictionary Online: http://nlp.aia.bppt.go.id/kebi/ (2000)
[Knuth,
1993] Knuth, D. The Art of Computer Programming – Volume/Sorting and
Searching, Addison-Wesley, 1993
[Kwok,
1997] Kwok, K.L (1997) Evaluation of an English-Chinese Cross-Lingual
Retrieval Experiment. AAAI Spring Symposium on Cross-Language Text and
Speech Retrieval, 1997
[Laddad, 2000] Laddad, R. (2000). XML APIs for
databases.
At: http://www-4.ibm.com/software/developer/library/jw-xmlapis
(2000)
[Lafore,
1998] Lafore, R. Data structures and algorithms in Java, Waite Group
Press, 1998
[Levenstein,
1966] Levenstein V.I. Binary codes capable of correcting deletions,
insertions and reversals. Cybernet. Control Theor. 1966 Pages:
707-710
[Landau,
1984] Landau, S. (1984) Dictionaries: The Art and Craft of Lexicography.
The Scribner Press, Charles Scribner’s Sons, New York
[Leventhal,
1998] Leventhal, M. Lewis, D. Fuchs, M. Designing XML Internet Applications,
Prentice Hall PTR
[Lunde,
1999] Lunde, K (1999). CJKV
Information Processing. O’Reilly Publishing
[Mair and Liu, 1991] Mair, V.H. and Liu, Y. (1991)
Characters and Computers. IOS Press
[Maruyama et. al, 1999] Maruyama, H., Tamura, K.,
Uramoto, N. (1999) XML and Java. Developing Web Applications. Addison-Wesley
[OED, 2000] The Oxford English Dictionary On-line:
http://www.oed.com (2000)
[Oard and Dorr, 1996] Oard, D and Dorr, B. (1996) A
Survey of Multilingual Text Retrieval, Technical Report. UMIACS-TR-96-19,
University of Maryland, Institute for Advanced Computer Studies.
[Pirkola, 1998] Pirkola, A. (1998) The Effects of
Query Structure and Dictionary Setups in Dictionary-Based Cross-language
Information Retrieval. Annual International ACM/SIGIR Conference on
Research and Development in Information Retrieval. 1998, Melbourne, Australia.
Pages 55-63
[PlumbDesign, 2000] PlumbDesign, ThinkMap Visual
Thesaurus.
At: http://www.plumbdesign.com/thesaurus
(2000)
[Porter, 1980] Porter, M.F. (1980). An
algorithm for suffix stripping. Program, Vol. 14, no. 3, July 1980
Pages 130-137
[Sebrechts et. al, 1999] Sebrechts, M., Vasilakis,
J., Miller, M., Cugini, J., Laskowski, S. (1999) Visualisation of Search
Results: A Comparative Evaluation of Text, 2D and 3D Interfaces. Annual
International ACM/SIGIR Conference on Research and Development in Information
Retrieval. 1999, Berkeley, CA, USA
[St. Laurent & Cerami, 1999] St. Laurent, S. and
Cerami, E. Building XML Applications, McGraw Hill
[Sundsted,
2000] Sundsted, T. Adelard, one year later : http://www.javaworld.com (2000)
[The Unicode Consortium, 1996] The Unicode
Consortium. (1996) The Unicode
Standard Version 2.0. Addison-Wesley Developers Press
[Veerasamy and Belkin 1996] Veerasamy, A., Belkin,
N. Evaluation of a Tool for Visualisation of Information Retrieval Results. Annual
International ACM/SIGIR Conference on Research and Development in Information
Retrieval. 1996, Zurich Switzerland. Pages 85-92
[XSLT,
2000] http://www.xslt.com/what_is.htm
(2000)
[XML.com, 2000] Technical Introduction to XML,
At: http://www.XML.com (2000)
[Yamabana et al. 1996] Yamabana, K. and Muraki, K. and Doi, S. and Kamei, S. (1996).
A Language Conversion Front-end for Cross-Linguistic Information Retrieval.
Working notes of the Workshop on Cross-Linguistic Information Retrieval, ACM
SIGIR, Zurich, Switzerland.
[1] The Oxford English Dictionary, Second Edition, Volume IV
[2] A dictionary is a book that lists words in alphabetical order and describes their meanings. Dictionaries include information such as spelling, syllabication, pronunciation and etymology (word derivation). An encyclopedia is a collection of articles about every branch of knowledge. Encyclopedias often include definitions, and go far beyond the information given by a dictionary [Landau, 1984]. This division probably arose because of the inability to store the diverse information about a topic in a dictionary, where only definitions were required. But with the electronic medium, this border is likely to be blurred.