Contents

Abstract

Acknowledgements

Chapter 1 Introduction

1.1 Monolingual and Multilingual Dictionaries

1.2 Aims

1.3 Thesis overview

Chapter 2 Past work

2.1 Dictionaries

2.2 Electronic dictionaries

2.3 Information retrieval

Chapter 3 Data, concepts and tasks

3.1 Unicode

3.2 Limitations with multilingual dictionaries

3.3 Operations, requirements and tradeoffs

Chapter 4 Internal representation of the XML document

4.1 XML

4.2 XML information access

4.3 DOM (Document Object Model)

4.4 SAX (Simple API for XML)

4.5 Object orienting the XML file

4.6 Other XML tools

4.7 Future XML tools

Chapter 5 Indexing and entry retrieval

5.1 Limitations with JMdict

5.2 Indexing techniques

5.3 Databases

5.4 Hashing

5.5 Additional considerations

Chapter 6 Interface to an electronic dictionary

6.1 Query interface options

6.2 Viewing search results

6.3 XSL

Chapter 7 Conclusion

Appendices

Bibliography


Abstract

Multilingual Electronic Dictionary Project

 

 

The purpose of this research was to investigate and analyse the creation of a multilingual dictionary application, using JMdict as a study target. The JMdict file is an XML document containing multilingual (Japanese-English-German) dictionary entries. A dictionary application is made up of several major programming components: user input of query information, query search and retrieval, and display of information. These aspects of multilingual dictionary application creation were investigated, with a major focus on the internal representation of dictionary entries in the application. Current XML parsers were found to be inappropriate as a form of data representation and retrieval on their own. The SAX parser could possibly be used along with the most favorable method, the indexing technique. Investigations into multilingual dictionary files, the XML format, the handling of foreign characters using Unicode and the input of foreign characters have been carried out. Furthermore, information retrieval processes such as stemming techniques and spelling suggestions have also been investigated. This research can serve as a guide for programmers wishing to find out about creating multilingual dictionary software.

 

 


Acknowledgements

 

First and foremost, my biggest thanks go to my supervisor, Associate Professor Jim Breen. His advice and inspiration have been invaluable throughout the thesis. In addition, without his hard work on JMdict, this project would not have been possible. Thanks for all your help.

 

Thanks to my friends, they know who they are, for their encouragement, criticisms and also for being there almost every step of the way with me on this journey.

 

Last but not least, I would like to give my sincere thanks to my family and relatives, especially my mother Maria, for helping with proofreading and providing moral support.


Chapter 1

Introduction

 

 

 

1.1 Monolingual and Multilingual Dictionaries

 

Dictionaries are an essential part of learning and education and are a centuries-old teaching tool used all over the world in a multitude of different languages. The Oxford English Dictionary defines a dictionary as “A book dealing with the individual words of a language (or certain specified classes of them), so as to set forth their orthography, synonyms, derivation, and history…”[1] Dictionaries began life as topic-specific dictionaries and as bilingual dictionaries (Roman-Latin and also Chinese dictionaries) as early as the 16th century, so the use of bilingual dictionaries is not a new concept. Over the years, dictionaries also broadened their scope, becoming more field-specific: for example, music, medical and scientific dictionaries. Soon, translations from several different languages into another language emerged, resulting in the creation of multilingual or interlingual dictionaries. Indeed, multilingual dictionaries provide a powerful tool for students to enhance their learning of another language.

 

Despite the range of dictionary functionalities, the structure of the print dictionary has always been constant, closely following the definition of a dictionary. Being able to search through one dictionary would enable readers to search through all dictionaries in the same way. English dictionaries are all laid out from A-Z (or in some cases A-K and L-Z). To find a word in these dictionaries simply requires the user to search alphabetically to find the entry. Asian language dictionaries, indexed by ideographic characters, may have a slightly different format in that they may order the dictionary via stroke or radical order. In any case, the reader has to manually sift through pages in a set pattern to find their definition or translation. There was no other known way to search for information within a dictionary [Atkins, 1985].

 

As computers have grown more powerful and widespread, the opportunity for automated processes and digital storage has drawn closer. Programmers, developers and lexicographers are no longer constrained by the conventional format of the dictionary. Students and teachers are also no longer constrained in their dictionary searching techniques. Applications that once required paper or tape storage can now be held on a digital storage medium only a fraction of the size required before. Dictionaries, which require relatively large amounts of physical storage area and mass, can be stored efficiently in digital form. Ideas previously thought unfeasible, such as an electronic dictionary, suddenly became a possibility. However, this opened up a whole new set of questions to be answered in terms of storage and information. Problems such as dictionary data storage, multimedia and pictures are all fields that need research in the context of dictionary file construction. Different file structures have been used to create electronic dictionary files. As new storage formats are used, they open up a whole host of different, new and sometimes unknown methods of information retrieval and representation. Today, there are many incarnations of electronic dictionaries, from portable handheld devices to stand-alone PC-based programs and even on-line dictionary systems on the World Wide Web.

 

 

 

 

 

 

1.2 Aims

 

Despite the rapid improvement of electronic dictionaries and the emergence of palm-top bilingual dictionaries, past research into electronic dictionaries and information retrieval has not focused on the study of the multilingual dictionary in the electronic medium. This area of study incorporates research into conventional dictionaries, but many issues are left unanswered. Little work has been done on ways of representing, storing and displaying the multilingual dictionary, or on understanding what users require of it. The aim of the research undertaken in this thesis was to explore the applicability and suitability of these techniques, using the JMdict XML-based Japanese-English-German dictionary file as a study target.

The problem to be investigated does not come under one distinct category. Instead, it is made up of several major aspects which, when combined, result in an analysis of a complete system; this is what makes the research original. The thesis attempts to be a guide for programmers interested in creating multilingual applications. The problem statement can be described as follows:

 

“To investigate and compare old and new storage techniques, data formats, representation of information and methods of information retrieval. In addition to these aspects, to investigate the applicability of these techniques in a multilingual dictionary environment. Finally, to explore the ways in which a system can be presented to a user and understand user needs.”

 

1.3 Thesis overview

 

The major aspects to be investigated are as follows: data storage format, internal representations of the document and the interface with the electronic dictionary. The following chapter headings mark the discussion of these topics in further detail:

 

Chapter 2 reviews the past work carried out in the large field of lexicography, electronic dictionaries and in particular, the area of information retrieval. With information retrieval, various fields of research into cross language information retrieval and text retrieval have been reviewed, in particular, research into query translation and machine translation.

 

Chapter 3 discusses the multilingual data available at hand. It also describes the types of encoding methods available for different languages. More importantly, Chapter 3 identifies the major (and minor) operations required from any dictionary and from a multilingual dictionary in particular. The limitations of multilingual dictionaries and in particular, the JMdict file, will be examined.

 

Chapter 4 introduces the concept of XML as a data storage format. The internal representation of dictionary data is extremely important as it determines the ease with which data can be retrieved for the user. The merits and shortcomings of various XML based tools are discussed along with the issue of object orientation and its suitability in the dictionary setting.

 

Chapter 5 provides a more in depth look at other techniques for efficient organization of data for retrieval. The format of the data storage, ease of data extraction and flexibility in the types of data being extracted will be investigated.

 

Chapter 6 provides insight into what the user seeks from an electronic dictionary. It is important in all software to make sure the user can operate it successfully and easily. In the same vein, the program must also be able to provide the information that the user wants. Chapter 6 will identify what kinds of information different types of users want and provide possible solutions for displaying information.

 

Chapter 7 will complete the discussion on multilingual dictionaries and provide some suggestions for further work and the direction in which research is heading in this topic.


Chapter 2

Past work

 

 

 

Because of the nature of this thesis report, extensive research into both the technological component and language aspects of dictionaries was carried out. Without a full understanding of the whole picture, further investigations cannot occur.

 

2.1 Dictionaries

 

‘Dictionary’ and ‘lexicon’ are synonymous terms; ‘lexicography’ is defined as the process of compiling dictionaries. A bilingual dictionary is usually produced to take users from one source language (L1) to a target language (L2) (for monolingual dictionaries, L1 and L2 are the same language). There is sometimes confusion over the differences between dictionaries and encyclopaedias. Some believe they are closely linked and consider them interchangeable, but they are different kinds of reference books with different purposes [2]. Dictionaries produced can be either:

  1. A dictionary of comprehension (allowing the student to understand L2)
  2. A dictionary of communication (comprehension and production of L2)

A dictionary can consist of some of the components outlined by the extensive entry record of the Online Oxford English Dictionary [Beryl, 1985].

 

Multilingual and bilingual dictionaries

 

Before the construction of a multilingual dictionary, the creator must consider these issues:

-          The users of the dictionary

-          The purpose of the dictionary. A dictionary can be aimed at one of four purposes:

o        Technical dictionary

o        Learning dictionary

o        Reference dictionary

o        Comprehension dictionary

-          Whether the dictionary is used to understand a foreign language or to produce text in a foreign language

-          Native language of the user; either the user is a native speaker of the source or the target.

 

There are four types of multilingual dictionaries:

  1. For speakers of L1 to understand target language text
  2. For speakers of L1 to produce target language text
  3. For speakers of L2 to comprehend source language text
  4. For speakers of L2 to produce source language text

 

Many experts agree that the trouble with most dictionaries is that they try to cater for the needs of speakers of both the source and the target language together. Some believe that it is impossible to pay equal attention to both in the same volume. Because of the limited size of normal paperback dictionaries, the editor usually has to select the entries to go into the dictionary on the basis of its purpose [Al-Kasimi, 1977] [Hartmann, 1983].

 

There are several criteria for distinguishing multilingual dictionary types, such as whether they include encyclopaedic information and whether or not they take account of changes in the language. There are three major types of multilingual dictionaries:

  1. Dictionaries for machine translation
  2. Dictionaries for terminological data banks
  3. Dictionaries for the human user.

 

Machine translation requires detailed grammars of the source and target languages, an interlingual grammar, a comprehensive bilingual dictionary, and complex computer programming systems to store, process and retrieve the data. An ordinary dictionary is expected to provide only the information which the user needs, but a dictionary for machine translation contains all the grammatical information about both languages [Al-Kasimi, 1977]. The major focus of this thesis is to investigate the criteria for dictionaries for human users.

 

Comparison between bilingual and monolingual dictionaries

 

Monolingual and multilingual dictionaries differ in several respects, according to the users they intend to serve, the needs they cater for, and the methods of their compilation. Bilingual and monolingual dictionaries show systematic variations in their approach to the wordlist (the stock of vocabulary to be listed). A bilingual dictionary may contain anything from a few thousand of the most frequent items to an extensive list as large as that of a monolingual dictionary. Furthermore, bilingual dictionaries contain not one but two discrete wordlists (L1 and L2). Another fundamental difference is that an entry in a monolingual dictionary takes the form of a definition, whilst a bilingual dictionary attempts to provide an equivalent, or a series of equivalents, in the target language [Beryl, 1985]. The layout of the entries and the search techniques may also be different. Some foreign languages can be searched differently, according to phonetics, character construction or even character codes.

 

Dictionary users

 

There are many different types of users who use dictionaries. It is very important to understand what kind of people use the dictionary; otherwise there will be no defined audience, and this will result in a dictionary lacking focus. Users have different requirements, and these will be reflected in the purpose of the dictionary via the type, number and structure of its entries. Hartmann [1983] categorises the uses and the users of dictionaries:

 

Factors in dictionary use:

-          Information
o        Meanings/synonyms
o        Pronunciation/syntax
o        Spelling/etymology
o        Names/facts

-          Operations
o        Finding meanings
o        Finding words
o        Translating

Situations in dictionary use:

-          Users
o        Child
o        Pupil/trainee
o        Teacher/critic
o        Scientist/secretary

-          Purposes
o        Extending knowledge of the mother tongue
o        Learning foreign language
o        Playing word games
o        Composing a report
o        Reading/decoding foreign language texts

Table 1 Uses and users of dictionaries

 

2.2 Electronic dictionaries

 

An electronic dictionary has many potential advantages over a conventional dictionary. The amount of information accessible to a user is now far greater. Cost is another advantage: storing a dictionary on an electronic medium such as a disk or CD is far cheaper than printing and distributing it. Furthermore, updates to the dictionary no longer require the complete republishing of a new edition; an update patch with the new data can be distributed very easily. Most conventional dictionaries cannot contain all the information for all the entries, and are instead selective in their coverage of aspects such as pronunciation, spelling, etymology and idioms. Electronic dictionaries are free of this constraint: the amount of data in a dictionary now depends on how much a lexicographer is willing to put into the file [Hartmann, 1983].

 

Depending on the data format in which the dictionary is stored, it can be much easier to insert new entries into the dictionary. For example, introducing additional attributes to a file such as JMdict is very easy. Physical size is no longer a limitation either. A 20-volume dictionary no longer needs to live in a state library; it can sit on your lap, in your laptop. Near-unlimited storage space allows the incorporation of features that were impossible in print dictionaries. Some larger print dictionaries were able to incorporate several pictures per page. Pictures enhance the presentation of the dictionary, but they also serve as a valuable resource for word-picture recognition. Multimedia is another feature that can be incorporated into electronic dictionaries. A user can read the definition and phonetics of a word, and can now also hear it; the best way to learn to say a word is to hear it first.

Portability is another reason for the superiority of electronic dictionaries. A set of dictionary files can easily fit on portable storage media such as a CD, so it can be taken anywhere and used whenever needed, with no need to go to a library to find the most comprehensive dictionary. Accessibility is also improved. Large print dictionaries are not produced in large quantities because of lack of demand and high production costs. Dictionary files, however, even large ones, are easily transmitted over a medium like the internet, which means that people all around the world can benefit from a most valuable resource.

 

One of the most important advantages that electronic dictionaries offer is their transparency: they allow access to the whole of their contents via formalized categories, so interesting features of the vocabulary can be investigated directly. In the past, some dictionaries were created from scratch specifically for beginners and infants. With an electronic dictionary, one dictionary is able to satisfy the needs of all users. How can this be achieved? Electronic dictionaries provide different ways of presenting data to the user. The format of entries no longer has to be fixed, as it is in conventional dictionaries. Computerised dictionaries have the potential to be customized according to the needs of the user [Hartmann, 1983].

 

The speed of information retrieval that electronic dictionaries deliver may come at the cost of the memory retention benefits that manually searching through a paper dictionary provides. Using a paper dictionary usually gives the user the chance to look at other entries surrounding the word on the same page. This is not to say that an electronic dictionary cannot do this, provided the correct design principles are applied.

 

There are many different types of electronic dictionaries available. To complete the analysis of electronic dictionaries, some of these files were studied: Breen’s [2000a] EDICT file, KEBI and EDR. The structure and contents of these files are quite different. They are discussed in detail in Appendix A.

 

 

 

 

2.3 Information retrieval

 

There has been limited documented research into dictionary files or electronic dictionary representation. Instead, the majority of research related to electronic dictionaries has been into methods of cross-language information retrieval. The basic information retrieval processes are as follows:

-          Representing the information needed – Query formulation

-          Representing documents – Indexing the text

-          Comparing these representations – Retrieval

-          Evaluating the retrieved documents and, if necessary, returning the query and entry [Croft et al.]
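
The processes listed above can be illustrated with a toy dictionary. The following is only a minimal sketch and not the method of any system cited here: the entries are invented for illustration (they are not taken from JMdict), and the names `normalise`, `build_index` and `search` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy entries: (headword, gloss). Not taken from JMdict.
ENTRIES = [
    ("inu", "dog"),
    ("neko", "cat"),
    ("yamaneko", "wild cat"),
]


def normalise(text):
    """Query formulation: lower-case the text and split it into tokens."""
    return text.lower().split()


def build_index(entries):
    """Indexing: map each gloss token to the positions of entries containing it."""
    index = defaultdict(set)
    for position, (_, gloss) in enumerate(entries):
        for token in normalise(gloss):
            index[token].add(position)
    return index


def search(index, entries, query):
    """Retrieval: return the entries whose gloss contains every query token."""
    tokens = normalise(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return [entries[i] for i in sorted(hits)]


INDEX = build_index(ENTRIES)
print(search(INDEX, ENTRIES, "Cat"))
```

The final process, evaluation, is left to the caller in this sketch: an empty result list is the signal that the query may need to be reformulated.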

 

Various aspects of these processes may be relevant and useful when investigating electronic dictionaries. Information retrieval has become an important issue because there is now emphasis on the efficiency of the retrieval and also the accuracy and amount of information that can be returned from a search. There have been several articles published by various researchers who have attempted to carry out cross-language information retrieval (CLIR).

 

Of the issues raised by information retrieval, some of the problems encountered are relevant for investigations of electronic dictionaries and are listed as follows:

-          Character encoding - Documents written in different languages not only have to be stored, but subsequently also need to be displayed in a non-ASCII environment. This problem has been tackled and one solution has been provided by JMdict.

-          Segmentation of Chinese and Japanese - Unlike English, Chinese and Japanese are usually written without separators between lexical elements. Tokens need to be identified as strings of characters so that searches can be carried out on them.
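
To make the segmentation problem concrete, the sketch below splits an unspaced Japanese string into tokens using greedy longest-match lookup against a word list. This technique is not taken from the work reviewed here; it is only one simple way of identifying tokens, and the lexicon and the name `segment` are hypothetical.

```python
# Hypothetical toy lexicon; a real system would use a full dictionary word list.
LEXICON = {"日本", "日本語", "辞書", "の"}


def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation.

    At each position, take the longest lexicon word that starts there.
    If no word matches, emit the single character as its own token.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


print(segment("日本語の辞書", LEXICON))
```

Because Python strings are sequences of Unicode code points, each Japanese character is handled as one unit here regardless of how many bytes it occupies on disk.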

 

Text retrieval systems, such as the INQUERY system (based on probabilistic retrieval via a Bayesian net framework) [Croft et al.] [Callan], appear to be too complex and large for a task such as searching through a database of entries. Other attempts have also been made at multilingual text retrieval on a larger medium such as the World Wide Web. Projects such as MULINEX have been set up to develop tools for cross-language retrieval, using retrieval systems such as Fulcrum SearchServer and SurfBoard [Erbach 1997].

 

Machine translation - Machine translation (MT) is an application related to electronic dictionaries, but considerably more complex. Electronic dictionaries set out to translate a single word, phrase or character into another language, whereas MT attempts to translate whole sentences or documents. MT uses electronic dictionary files to carry out its translations, but this takes a lot more effort than simply looking up a dictionary file. For example, MT performs a linguistic analysis so that the most suitable Japanese word can be matched to the English query request [Jones, 1999]. In further studies, Chinese queries were also used to evaluate the effectiveness of different MT techniques [Kwok, 1997]. At the moment, the state of development of MT is such that translation requires human aid to complete. This is referred to as machine-aided translation. Some current commercial MT software (such as Logovista E to J) enables some form of translation but still requires pre- and post-editing of the translation for acceptable conversion from English to Japanese [Eichmann et al., 1998].

 

Query Translation - As highlighted by Eichmann et al. [1998], there have been several methods of handling CLIR. The first method is to translate the query, thereby turning the problem into a monolingual one [Ballesteros, 1996] [Davis, 1998]. Another method is to translate the document. The final approach is to automatically establish associations between queries, independent of language differences [Eichmann et al., 1998]. Several other techniques are also used, such as machine translation systems and bilingual dictionaries.

 

Dictionary-based retrieval can be carried out by breaking up the query into the root forms of its words, searching the dictionary file for equivalents of each word and substituting the translated words with the highest precision [Ballesteros, 1996]. One problem with this type of translation is that an incomplete dictionary will produce inconsistent results. Furthermore, there may be ambiguities in translation that can also introduce substantial error [Eichmann et al., 1998]. On another research front in the same area, Fujii [1993] investigated the effects of character-based versus word-based indexing techniques for text retrieval. It was found that character-based indexing and retrieval was the more efficient of the two techniques.
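
The dictionary-based approach described above can be sketched as follows. This is an illustration only, under stated assumptions: the bilingual word list is invented, `to_root` is a deliberately crude stand-in for proper root-form reduction, and all names are hypothetical. Unlike the method described by Ballesteros, the sketch keeps every candidate translation rather than choosing the most precise one.

```python
# Hypothetical English-to-Japanese (romanised) word list, for illustration only.
BILINGUAL = {
    "dog": ["inu"],
    "run": ["hashiru", "nigeru"],
}


def to_root(word):
    """Crude root-form reduction: lower-case and drop a plural 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def translate_query(query, dictionary=BILINGUAL):
    """Look up each query word; return (candidate translations, untranslated roots)."""
    translated, missing = [], []
    for word in query.split():
        root = to_root(word)
        equivalents = dictionary.get(root)
        if equivalents:
            translated.append(equivalents)
        else:
            missing.append(root)
    return translated, missing


print(translate_query("Dogs run fast"))
```

Both problems noted in the text show up in this small example: `run` has more than one candidate translation (ambiguity), and `fast` is not in the word list at all (an incomplete dictionary).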

 

A different method of cross-language information retrieval uses a multilingual thesaurus [Eichmann et al., 1998]. This bears a resemblance to the specialised Japanese-English dictionary files that Breen has created (such as COMPDIC), in that controlled vocabularies are stored in the thesaurus. The medical thesaurus, called a metathesaurus, is multilingual and supports a range of European languages. Conventional thesaurus-based retrieval requires queries to be matched in the thesaurus by their representation, and CLIR follows the same method of retrieval. In their tests, Eichmann et al. found that the best results came from choosing terms that contained only query words. The performance of metathesaurus-based retrieval did not exceed that of dictionary-based retrieval [Eichmann et al., 1998]. This evidence indicates that using a thesaurus to translate can be effective, but that a thesaurus is not as flexible as a dictionary.

 

Jones [1999] tested several different translation methods in one study. Of these, the interesting ones were: ‘using a bilingual dictionary to return all the definitions of the corresponding English query’ and ‘using the bilingual dictionary to return a single default translation for the matching English word’. Jones claimed that there was literature arguing that full machine translation was unsuitable, but did not back up this claim with any evidence. Despite this unfounded claim, the results produced indicated that full machine translation performed favourably compared to dictionary term lookup. Again, Jones failed to disclose any references describing the method of full machine translation. Hull and Grefenstette [1996] mentioned in their paper that the ‘performance of MT systems in the setting of general language translation is dismal enough to make this option less than entirely satisfactory’. This claim has also been backed up by Pirkola [1998], Oard and Dorr [1996] and Yamabana et al. [1996].

 

Hull and Grefenstette [1996] discussed five different definitions of multilingual information retrieval (MLIR). None of the definitions given in the paper took into account the use of a pure multilingual dictionary to carry out MLIR. A multilingual dictionary can play a powerful role in the development of information retrieval across multiple languages. The definitions mentioned seemed to revolve around a collection of different dictionaries, working either in parallel or in combination, to produce multilingual translations. This narrow scope could have been a result of the authors’ objectives, which were centred around the development of a ‘query translation module’ that could be easily built on top of an already existing information retrieval system.

To further demonstrate the diversity of methods and ideas being exercised, Pirkola [1998] used a special dictionary and a general dictionary in query translation. It was found that this technique was highly efficient [Pirkola, 1998].

Several CLIR techniques appear to be the result of avoiding the construction of a specific dictionary file for research. Creating a dictionary would indeed be time-consuming; however, it is believed that once a ‘standard’ multilingual dictionary were produced, it would relieve some of the problems faced by researchers and open new doors in CLIR.

 

From the review of the cross-language retrieval papers, it seems that the scope and depth of CLIR are much too broad for most of its aspects to be relevant to multilingual dictionary research. CLIR attempts to tackle complete indexes of documents in many different languages. The research that has been uncovered tends to focus either on efficient retrieval of the documents with the greatest relevance to a search string query, or on the complete translation of particular documents from one language to another. An example of this would be a Greek speaker who needs documents on a certain topic. The user would enter a search query for keywords in the documents. If a Finnish document containing the keywords was located, the document would then be machine translated into Greek for display [Hlava et al., 1997]. Techniques and ideas raised by these studies may be applicable to the current research.

 

Stemming

 

The morphology of a language may mean that words in their plural or past tense form are not queried effectively by the search engine, because they are sometimes structurally different from the root word. The words ‘run’, ‘ran’ and ‘running’ are all related, and this should be reflected in the search by retrieving the root word ‘run’. This is not usually a problem when searching through a paper dictionary: whilst flicking through the pages, the user may come across the root word of the query, because most related words are found quite close together. The stemmer widely believed to be relatively reliable is the suffix-stripping algorithm created by Porter [1980]. The algorithm contains a set of rules, or steps. The query word is compared against each of the rules and is modified whenever its suffix matches a rule. Code for the Porter suffix stripper is relatively easy to construct using the guidelines outlined in Porter’s [1980] paper. Using a stemmer for query modification can increase the chance that a matching entry is found.
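As an illustration of the rule-matching idea, the following is a minimal sketch of only the first step (Step 1a) of Porter's algorithm, which handles plural endings. It is not a complete stemmer: the remaining steps, and the measure conditions they rely on, are omitted. The class and method names are assumptions made for this example.

```java
import java.util.Locale;

// Sketch of Step 1a of Porter's suffix stripper only (plural endings).
// The class name and method name are hypothetical, chosen for this example.
final class SuffixStripper {

    // Rules are tried in order; the first matching suffix wins:
    //   SSES -> SS,  IES -> I,  SS -> SS,  S -> (nothing)
    static String step1a(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("sses")) {
            return w.substring(0, w.length() - 2);
        }
        if (w.endsWith("ies")) {
            return w.substring(0, w.length() - 2);
        }
        if (w.endsWith("ss")) {
            return w;
        }
        if (w.endsWith("s")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        for (String query : new String[] {"caresses", "ponies", "caress", "cats"}) {
            System.out.println(query + " -> " + step1a(query));
        }
    }
}
```

Note that a suffix stripper of this kind cannot relate an irregular form such as ‘ran’ to ‘run’; forms like that would need a separate lookup table.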

 

Spelling

 

Users may appreciate options in the application that suggest possible words if they have entered a word incorrectly, or if they do not know how to spell it. Words can be similar in two ways: they can sound alike or they can be structurally similar. Two techniques could be implemented to handle misspellings: Soundex and the Levenshtein Distance Algorithm.

 

The Levenshtein Distance Algorithm (LDA) provides a simple metric for testing the similarity of words. It can therefore be used as a form of spelling checker when entries cannot be found in the dictionary. If two similar words are compared, they can be ‘aligned’ so that sequences of letters in each of them match each other. For example, the words ‘word’ and ‘bird’ are quite similar. They can be aligned in such a way:

 

B I R D

    | |

W O R D

 

This example is rather simple, because there is no other sensible way to align the words. However, aligning the two words ‘WORD’ and ‘BEARD’ is more difficult: although there are similar sequences in each of them, there are several possible ways to align the rest of the letters. Some of the solutions are:

 

B E A R D    B E A R D    B E A R D

      | |          | |          | |

  W O R D    W . O R D    W O . R D

 

To convert ‘word’ to ‘beard’, insertions, deletions or letter replacements can be carried out. For example:

 

B E A R D    B E A R D    B E A R D    B E A R D
      | |        | | |      | | | |    | | | | |
  W O R D    W O A R D    W E A R D    B E A R D
             [insert A]   [replace O]  [replace W]

 

The Levenshtein distance of the two words is defined as the minimum number of operations required to transform word 1 into word 2. There are also phonetic comparisons that compare the pronunciation of words to determine similarity. However, these techniques are very costly and complex. The Levenshtein Distance algorithm assigns a score to each change, deletion or addition required to make the strings equal. The final distance is the sum of these scores. A threshold is set which determines whether a string is considered similar or dissimilar. In the event that the user enters an incorrect query string, the algorithm can suggest words that have a low Levenshtein distance from the query.
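The distance described above can be computed with a standard dynamic programming table. The following is a minimal sketch, assuming a score of 1 for every insertion, deletion and replacement. The class name, the `suggest` helper and the threshold value used in the example are assumptions for illustration, not part of any existing dictionary application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the Levenshtein distance with unit costs for each operation.
// Names are hypothetical; only two rows of the table are kept in memory.
final class Levenshtein {

    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Transforming the empty string into the first j letters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i letters of a
            for (int j = 1; j <= b.length(); j++) {
                int replaceCost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int replacement = prev[j - 1] + replaceCost;
                curr[j] = Math.min(Math.min(insertion, deletion), replacement);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    // Returns the headwords whose distance from the query is within the threshold.
    static List<String> suggest(String query, List<String> headwords, int threshold) {
        List<String> suggestions = new ArrayList<>();
        for (String headword : headwords) {
            if (distance(query, headword) <= threshold) {
                suggestions.add(headword);
            }
        }
        return suggestions;
    }

    public static void main(String[] args) {
        System.out.println(distance("word", "beard"));
        System.out.println(suggest("wrd", Arrays.asList("word", "bird", "beard"), 1));
    }
}
```

For the example in the text, `distance("word", "beard")` evaluates to 3, matching the one insertion and two replacements shown in the alignment.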

 

The Soundex code is an indexing system that translates names into a four-character code consisting of 1 letter and 3 digits. Its most familiar application has been by the US Bureau of the Census to create an index for individuals using their surnames.

The advantage of Soundex is its ability to group names by sound rather than the exact spelling. All the words that have a similar sound are grouped together by having the same Soundex number.

There are several rules when creating Soundex codes:

-          All Soundex codes have 4 alphanumeric characters: 1 letter followed by 3 digits.

-          The first letter of the name is the first character of the Soundex code.

-          The 3 digits are defined sequentially from the name using the Soundex Key chart.

-          Adjacent letters in the name belonging to the same Soundex Key code number (such as a double ‘r’) are assigned a single digit.

 

Code       Letters
1          b f p v
2          c g j k q s x z
3          d t
4          l
5          m n
6          r
No code    a e h i o u w y

Table 2.  Soundex Key table

 

If users misspell the first letter of the word, the Soundex system is unable to retrieve the correct Soundex code. Instead a list of words starting with the same letter will be retrieved.
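The rules above can be sketched in a few lines of Java. This sketch follows the commonly used American Soundex conventions, two of which are assumptions not stated in the rules above: short codes are padded with zeros, and the letters ‘h’ and ‘w’ do not separate two letters that share a code number. The class name is hypothetical.

```java
// Sketch of Soundex encoding. Assumptions not taken from the rules in the text:
// codes shorter than 4 characters are padded with '0', and 'H'/'W' are skipped
// without separating letters that share a code (American Soundex convention).
final class Soundex {

    // Code digit for each letter A..Z; '0' means the letter has no code.
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        char previous = '0';
        for (int i = 0; i < name.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(name.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not a basic Latin letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as a letter
                previous = code;
                continue;
            }
            if (c == 'H' || c == 'W') {
                continue; // does not reset the previous code
            }
            if (code != '0' && code != previous) {
                out.append(code);
            }
            previous = code; // a vowel resets the previous code to '0'
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Robert", "Rupert", "Ashcraft", "Tymczak"}) {
            System.out.println(name + " -> " + encode(name));
        }
    }
}
```

Because ‘Robert’ and ‘Rupert’ receive the same code, a misspelling of one can retrieve the other, provided the first letter is correct.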

 

 


Chapter 3

Data, concepts and tasks

 

 

 

            The encoding of characters, especially of an international character set, is important when considering a multilingual dictionary. In the past, different encoding schemes were used and standardization of codes was non-existent. The options available for encoding a multilingual file are therefore a required point of discussion. Japanese, Chinese and Korean text representation is discussed in detail in Appendix B. The current method of international standardization of encoding is called Unicode.

 

3.1 Unicode

 

The Unicode Standard is a superset of all characters in widespread use today. It unifies character sets from around the world, making multilingual software easier to write, information systems easier to manage and information exchange around the globe more accessible. It contains the characters from major international and national standards as well as prominent industry character sets. For example, Unicode incorporates the ISO/IEC 6937 and ISO/IEC 8859 families of standards, the SGML standard ISO/IEC 8879, and bibliographic standards such as ISO 5426. Important national standards are included within Unicode: ANSI Z39.64, KS C 5601, JIS X 0208, JIS X 0212, GB 2312, and CNS 11643. The primary goal of the development effort for the Unicode Standard was to remedy serious problems common to most multilingual computer programs: the overloading of character encodings, the multiple, inconsistent character codes caused by conflicting national and industry character standards, and the inadequacy of using 7 or 8 bits (a maximum of 256 characters) to represent the global character set. In Western European software environments, for example, there is often confusion between the Windows Latin 1 code page 1252 and ISO/IEC 8859-1 [The Unicode Consortium, 1996].

 

The Unicode project began in 1988. At the time, the inconsistent groups of international character sets affected publishers of scientific and mathematical software, newspapers, book publishers, bibliographic information services, and academic researchers. In 1991, the ISO Working Group responsible for ISO/IEC 10646 (JTC 1/SC 2/WG 2) and the Unicode Consortium decided to create one universal standard for coding multilingual text. Since then, the ISO 10646 Working Group (SC 2/WG 2) and the Unicode Consortium have worked together very closely to extend the standard and to keep their respective versions synchronized.

 

Although the character codes are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard imposes additional constraints on implementations. Unicode 2.1 has the same character repertoire as ISO/IEC 10646-1:1993 and Unicode 3.0 has the same character repertoire as ISO/IEC 10646-1:2000. Unicode uses a fixed-length 16 bit representation called UCS-2. The full 16 bit code space (that is, 65,536 code positions) is available to represent characters. For compatibility with other environments, there are two transformations of Unicode for use in 8 or 7 bit environments: UTF 8 (Universal Character Set Transformation Format) and UTF 7. Furthermore, Unicode stands out from other standards in that it deals only with character codes, leaving the glyph shape and construction to font vendors [The Unicode Consortium, 1996].

 

When considering Japanese character representation, there is no longer any need to inter-convert between the different character set encoding standards; all Kanji, Katakana and Hiragana characters are supported in Unicode. Since a standard for Japanese representation is available, it was logical that the JMdict file be encoded in UTF 8 [The Unicode Consortium, 1996]. As for Kanji, characters that also occur in the Chinese and Korean character sets share the same character codes, further decreasing the number of character codes required to represent Chinese characters.

 

Because each Unicode character is a 16 bit value, it cannot be handled like an ordinary ASCII character value. Programming in Unicode may be more troublesome if the programming language cannot handle 16 bit characters easily. C++ is a very popular programming language; however, support for international languages was not taken into account when the language was designed, so its support for representing Asian characters is rather weak. Java has several advantages over C++ when it comes to international language processing because it has been designed to handle Unicode as standard input/output. It is the first programming language to have built in support for Unicode. Clearly, Java is a useful language when it comes to internationalization and is the recommended programming language to use when creating a dictionary application. When converting between UCS-2 and a locale-specific encoding, mapping tables are required because no code conversion algorithm exists. This conversion between different codes is carried out via table-driven conversion.
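The difference between the 16 bit internal representation and the UTF 8 transformation can be seen with a short Java sketch. The sample word (the three Kanji for ‘Japanese language’) and the class name are chosen purely for illustration. The characters are written as Unicode escapes so that the example does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

// Sketch comparing the 16 bit and UTF 8 sizes of the same text.
// The class name and the sample word are illustrative assumptions.
final class EncodingDemo {

    // The three Kanji U+65E5, U+672C and U+8A9E, written as escapes.
    static final String SAMPLE = "\u65E5\u672C\u8A9E";

    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Encodes to UTF 8 bytes and decodes again; the text should be unchanged.
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("characters:   " + SAMPLE.length());
        System.out.println("16 bit bytes: " + utf16Length(SAMPLE));
        System.out.println("UTF 8 bytes:  " + utf8Length(SAMPLE));
        System.out.println("round trip ok: " + SAMPLE.equals(roundTrip(SAMPLE)));
    }
}
```

Each of these Kanji occupies two bytes in the 16 bit form and three bytes in UTF 8, whereas a plain ASCII letter occupies a single byte in UTF 8. This is one reason the size of a UTF 8 dictionary file depends on the mix of languages it contains.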

 

3.2 Limitations with multilingual dictionaries

 

A multilingual dictionary is an excellent idea on first thought. It would be ideal to have a dictionary that could translate English into many different languages. However, depending on the data format of the dictionary, there can be limitations when it comes to building the dictionary file. These limitations range from the structure of the data file to clashes between the languages themselves.

 

Although the use of XML is a very important step in solving the problem of flexibility and extensibility of the dictionary file, there are other considerations in the design, content and lexicographical aspects of the dictionary file. Although not such an important issue, the actual size of one single dictionary file may become a problem as more entries are added and multiple languages are included in the file. Physically storing a large file with millions of records, and processing a file tens or even hundreds of megabytes in size, may become a problem if CPU-intensive tasks must be carried out on the data file. The issue of handling a multilingual dictionary file by the computer is discussed further in this report.

 

One of the most important limitations when working with multilingual dictionaries is attempting to match senses and glosses across different languages. There are words in some languages that are not defined in another language. There are also languages in which no word exists for a specific meaning and a generalized term is used instead. Furthermore, one word in one language may be used to describe more than one thing in another language. This non-parallelism between languages is called ‘anisomorphism’ [Landau, 1984]. For bilingual dictionaries such as the EDR (see Appendix A), there are 190,000 headwords for the English to Japanese part compared with 230,000 headwords in the Japanese to English part. The smaller number of headwords reflects the fewer headwords required in the generation of the language. Different languages are bound to have differing types of grammar, so there are simply words in one language that do not exist in another. For example, the Australian word ‘bloke’ may only be definable as ‘guy’ or ‘man’ in another language, so the real meaning cannot be conveyed to the user. These differences are obstacles that need to be overcome when combining several dictionaries [Al-Kasimi, 1977]. There is no real solution to this problem. How the situation is handled can depend on the type of implementation used to hold the information.

 

In a file like JMdict, the headwords are Japanese and the definitions are English. The English definitions could possibly be used as headwords too. Usually the headword is in the language the user is confident with, and the target language is the user’s second language. The language of the headword may be important in the structure and direction of the multilingual dictionary file. Furthermore, the additional information included within an entry, such as part of speech and examples, needs to be stored in a particular language, which raises the smaller question of which language that should be. Entry information such as part of speech and synonym may be stored in one language. A problem may therefore arise if the information is in English and a user wishes to search from German to Japanese. The information stored in English may require translation to Japanese or German in order for a person to understand the entry.

 

This brings the issue to the next point: the target audience for the multilingual dictionary. The dictionary may be for English speakers who require translation from English to multiple languages, or a student who wishes to translate an article. It was mentioned earlier that bilingual dictionaries could not cater for both L1 and L2 users. Doing so would mean that effort on creating one dictionary would be cut in half by trying to create two distinct dictionaries, each for a different target audience. A possible outcome from this limitation could be to create a multilingual dictionary file that supports native speakers from only Japan, Australia and Germany. The dictionary file could be created for users who specifically wanted to translate between these three languages.

 

One of the problems with creating interlingual dictionaries is that of pronunciation. A user who wants to produce the foreign language from a bilingual dictionary would want to know the appropriate word and how to pronounce it. For example, the pronunciation of a Chinese character varies. If the user does not know the phonetics of PinYin pronunciation, it is hard for them to say the character. Furthermore, there are many dialects of Chinese all around China and in other Asian countries, so the inclusion of pronunciation in an entry can be a difficult task [Al-Kasimi, 1977].

 

Although not a major issue concerning the actual multilingual dictionary file, the difficulty and complexity of linking every definition when coding the dictionary file can be a headache for lexicographers. Definitions not only have to be linked to each entry, but also to a set of identifiers (such as usage information, parts of speech, and examples) [Landau, 1984]. Multilingual dictionaries increase the complexity of entry input because of the number of languages to update. In addition, adding synonyms and other cross-references can become tedious, especially if the references are in several of the languages that the dictionary file represents. The process of entry insertion does not necessarily have to be a problem. Depending on the format of the file, software similar to a dictionary search engine could be produced, allowing insertion, deletion and modification of entries in the dictionary file with little difficulty.

 

3.3 Operations, requirements and tradeoffs

 

Creating an electronic dictionary can be considered a difficult task. A dictionary application consists of many different operations; each operation often constituting an area of study in computer science. It is therefore important to set out the operations and basic processes that a dictionary application carries out. Figure 1 describes the operations executed by typical users of a dictionary program.

 

Figure 1 Diagram of user actions and corresponding programming sections

 

 

The features of user options and possible variants in a dictionary search are discussed later in the paper. From a software developer’s perspective, there are three major sections to the software:

1.       Input of entries for users: program has to be able to read in the query of the user, in any of the languages specified. The system also needs to be able to cater for the input of non-English characters into the query space.

2.       Retrieval of entry information: an important aspect, the retrieval system of the dictionary is the heart of the application. A weak system will result in fewer users due to lack of efficiency. The retrieval system is required to search through the dictionary file, find matches in the entries, compile relevant information and send the data to the application.

3.       Output of entries for users: data is received from the retrieval system and the next major operation is to format the information for the user to view. This task is a user interface and data representation problem.

 

The following is a diagram describing an overview of the possible data paths for the application:

 

Figure 2 Diagram describing the possible data paths for dictionary data

 

The operation of information retrieval presents various issues, or ‘tradeoffs’, when choosing a technique to implement. A technique usually has advantages over the alternatives, but it also carries costs. Deciding which implementation to use is generally a weighing up of priorities: which features must be optimal and which are less important. The factors and costs that need to be weighed when choosing implementations are:

-          Efficiency of the searching system

-          Storage space required to store the file

-          Storage space required when processing the file

-          Frequency of access of entries

-          Frequency of access of certain entries

-          Amount of data to be extracted per search

-          Accessibility to multiple users

These costs will be analysed as different information retrieval techniques are discussed.

 

Multilingual text input

 

An information retrieval system may typically consist of phases for input, storage, processing, editing, output and transfer. Input is a serious problem, resulting in a bottleneck. In a non-ASCII context, there is the problem of text input. More than 500 encoding schemes have been devised for Chinese character input alone. In many cases, a traditional western keyboard, which will undoubtedly be the main input source for the multilingual dictionary, is insufficient for text input [Lunde, 1999].

 

Japanese is a language in which each kana or compound character can be represented by a unique phonetic spelling. By creating a conversion file or table listing the compound characters and their Romaji equivalents, it is possible to type a Romaji query and have it converted immediately into Kana for the user to see before the query is sent to the search engine. This is a well-established technique and can be applied quite effectively for Japanese character input.

 

Figure 3 Excerpts from a Romaji to Kana conversion file. Pairing together a combination of phonetics will result in a character, and combining these together will result in a compound word.
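A table-driven conversion of this kind might be sketched as follows. The table here holds only a handful of hypothetical entries written as Unicode escapes; a real conversion file would list every Kana and compound. The class name and the longest-match strategy are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of table-driven Romaji to Hiragana conversion using longest match.
// The table is a tiny illustrative excerpt, not a complete conversion file.
final class RomajiToKana {

    private static final Map<String, String> TABLE = new HashMap<>();
    static {
        TABLE.put("a", "\u3042");
        TABLE.put("i", "\u3044");
        TABLE.put("u", "\u3046");
        TABLE.put("ka", "\u304B");
        TABLE.put("ku", "\u304F");
        TABLE.put("go", "\u3054");
        TABLE.put("sa", "\u3055");
        TABLE.put("shi", "\u3057");
        TABLE.put("ni", "\u306B");
        TABLE.put("ho", "\u307B");
        TABLE.put("ra", "\u3089");
        TABLE.put("n", "\u3093");
    }

    private static final int LONGEST_KEY = 3;

    static String convert(String romaji) {
        StringBuilder kana = new StringBuilder();
        int position = 0;
        while (position < romaji.length()) {
            boolean matched = false;
            int longest = Math.min(LONGEST_KEY, romaji.length() - position);
            // Try the longest candidate first so that "shi" wins over shorter keys.
            for (int length = longest; length >= 1 && !matched; length--) {
                String candidate = romaji.substring(position, position + length);
                String replacement = TABLE.get(candidate);
                if (replacement != null) {
                    kana.append(replacement);
                    position += length;
                    matched = true;
                }
            }
            if (!matched) {
                // Unknown input is passed through unchanged.
                kana.append(romaji.charAt(position));
                position++;
            }
        }
        return kana.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("sakura"));
        System.out.println(convert("nihongo"));
    }
}
```

Trying the longest key first matters because the syllabic ‘n’ is also a prefix of syllables such as ‘ni’. A complete table would still need a convention for the remaining ambiguous cases.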

 

There are two general types of ‘fast’, or configurative, input for characters: those which require memorization of character code numbers and those which depend on the internalization of key positions directly related to character codes. An example is the input of Chinese characters via the decomposition of each character into radicals or strokes that are subsequently transformed into letters or numerals on the keyboard. ‘Easy’, or phonetic, methods seek to exploit the user’s existing knowledge of the Chinese language and script to minimize the need to acquire new skills. On the keyboard, we need to specify, not produce, the character we want. One form of specification is by pronunciation, the most widely used in ‘easy’ input [Mair and Liu, 1991].

 

There is a general class of input of foreign characters called IM or IME (Input Method Editors). Microsoft has developed a free IME they have called Microsoft Global IME 5.02. It has been developed to enhance East Asian character input. Global IME allows users to input Asian characters without any special keyboard or equipment. Because Microsoft operating systems are used so widely, this free input system is fast becoming a ‘de facto’ standard for character input [IME, 2000].

 

Figure 4 Microsoft Global IME in action

 

A common and well-known example of an IM is an IM for word processors. The following is a diagram of the Chinese Star IM that can be used to input Chinese characters into Microsoft Applications.

 

Figure 5 Chinese Star word processor Chinese character input system. Involves typing PinYin into a translation box. A list of characters that correspond to the PinYin are displayed. Outputting the correct characters is a process of choosing the correct character

 

However, some of these techniques are much too complicated and only simple versions of them may be needed for short text input queries. The emphasis for inputting foreign characters into a multilingual dictionary is not speed, because so few letters will be put in at one time, but more of usability and ease of use for the user. The investigation and implementation of various input methods for different languages is out of the scope of this project. As more languages are added to the multilingual dictionary, more and more input techniques may be required to support the languages. Users could find it difficult to adapt to all the different kinds of input techniques available.

 

Multilingual text output

 

            In the past, a major obstacle in foreign text processing was the output of the characters. There are several ways of representing the same font information. The different representations came from the different font houses (such as Adobe) creating their own standards. Fonts are either bitmapped or outlined (scalable). Bitmapped fonts represent each character as a rectangular grid of pixels. There are a number of disadvantages to this approach, but the most important one is the difficulty of changing the size, shape, and resolution of a bitmapped character without loss of quality, because the bitmap is defined at a certain size and resolution. Outline fonts represent each character mathematically as a series of lines and curves. The font must be 'rasterized' into a bitmap before it can be displayed. LaserJet .SFP and .SFL files, TeX PK, PXL, and GF files, Macintosh Screen Fonts, and GEM .GFX files are all examples of bitmapped font formats. PostScript Type 1, Type 3, and Type 5 fonts, Nimbus Q fonts and TrueType fonts are all examples of outline font formats.

 

            In addition to these two types of font format, there are further issues. Identical formats on different platforms are not necessarily the same. For example, Type 1 fonts on the Macintosh are not directly usable under MS-DOS or Unix, and vice versa. There are also many different font formats. Two major font formats are discussed:

PostScript Type 1 Fonts: PostScript Type 1 fonts (also called ATM (Adobe Type Manager) fonts, Type 1, and outline fonts) contain information, in outline form, that allows a PostScript printer, or ATM, to generate fonts of any size.

TrueType Fonts: TrueType fonts are a newer font format developed by Apple and adopted by Microsoft. The rendering engine for this font is built into MS Windows v3.1 and subsequent versions. Like PostScript Type 1 and Type 3 fonts, it is also an outline font format that allows both the screen and printers to scale fonts to display them in any size.

 

The following is a table that describes some of the font extensions and their platform usage. Despite all these difficulties, there are now standard libraries (such as those listed above) that define the font types and font information, so the problem of displaying foreign characters is not such a major problem today.

 

Extension    Usage
*.fon        An MS-Windows bitmapped font
*.pfa        Adobe Type 1 PostScript font in ASCII format (PC/Unix)
*.pfb        Adobe Type 1 PostScript font in "binary" format (PC/Unix)
*.ps         Any PostScript file (Type 3 font)
*.pxl        TeX pixel bitmap font file
*.ttf        MS-Windows TrueType font

Table 3. Font extensions and their platform usage


Chapter 4

Internal representation of the XML document

 

 

 

            The ‘heart’ of the application lies in the internal representation of the multilingual dictionary. Failure to create an efficient system will result in a disappointing application. This chapter explores the concepts of XML and its role in the JMdict file. In addition, XML specific tools are examined, outlining their properties and their applicability for use in a multilingual dictionary application.

 

4.1 XML

 

XML stands for ‘eXtensible Markup Language’. It is a standard system for defining the content and format of an electronic document. HTML tells how the data should look, but XML tells you what it means [Goldfarb, 1998]. The differences go further than that. HTML has a fixed set of markup tags, for example the <b> or <a> tags. Additional tags cannot be defined, which restricts the applications of HTML. From the computer’s perspective, the information supplied in an HTML file has no structure. XML differs from similar markup languages in that it is designed with the idea that the document format should be specific to the type of document being created. This allows XML to be used as a genuine storage method, as JMdict has shown, and allows the programmer to separate data from display. The following are some of the strengths of XML:

 

Extensibility – allows users to define their own tags (or attributes) to suit the data being represented.

Structure – allows nested structures of any depth, hence being suited to dictionary-type entries.

Validation – allows the document to be validated before use by applications

 

Another important feature of XML is that it enables the definition of tags for each individual document. The formal definition that describes each type of tag is called a document type definition (DTD). The following is a small sample of what a tag definition looks like:

 

<!DOCTYPE label [
     <!ELEMENT label (name, street, city, state)>
     <!ELEMENT name (#PCDATA)>
     <!ELEMENT street (#PCDATA)>
     <!ELEMENT city (#PCDATA)>
     <!ELEMENT state (#PCDATA)>
]>

 

This DTD defines a label. Each label will contain an element name, street, city and  state. By allowing user-defined tags, XML improves functionality and increases the appeal to developers, enabling them to create any type of data document [XML.com, 2000].
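To illustrate the validation strength listed earlier, the following sketch checks a document against the label DTD using the standard Java XML parsing interfaces (javax.xml.parsers). The class name, the helper method and the sample address values are hypothetical and are used only for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Sketch: validating a document against the label DTD from the text.
// The class name and sample values are illustrative assumptions.
final class LabelValidator {

    static final String DTD =
          "<!DOCTYPE label [\n"
        + "  <!ELEMENT label (name, street, city, state)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT street (#PCDATA)>\n"
        + "  <!ELEMENT city (#PCDATA)>\n"
        + "  <!ELEMENT state (#PCDATA)>\n"
        + "]>\n";

    // Returns true if the body is well formed and valid against the DTD.
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // By default validity errors are only reported; rethrow them instead.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXParseException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String complete = "<label><name>A. Reader</name><street>1 Main St</street>"
                        + "<city>Melbourne</city><state>VIC</state></label>";
        String missingState = "<label><name>A. Reader</name><street>1 Main St</street>"
                            + "<city>Melbourne</city></label>";
        System.out.println(isValid(complete));
        System.out.println(isValid(missingState));
    }
}
```

The second document is well formed but omits the state element, so a validating parser rejects it before the application ever sees the data.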

 

Elements can carry attributes, which are properties of the element. An attribute supplies a named value for the element in which it appears. An example of this in the JMdict file is the <gloss g_lang="de"> tag. It is an element of type gloss with an attribute named g_lang. The attribute is given the value “de”, which stands for German. By using this feature of XML, different languages can easily be entered into the dictionary file without the need to design additional tags.

 

 

 

 

XML takes a hierarchical view of documents, which is referred to as the tree structure of the document. The structure of the entry ‘New Year’s sake’ is displayed in the following diagram.

 

Figure 6 Tree structure view of an entry in JMdict

 

 

JMdict

 

The structuring of a document or storage file is an extremely important factor, especially with one as large as the EDICT file (See Appendix), which contains over 100,000 entries. A plain, unstructured dictionary file format such as EDICT allows the linguist a great deal of freedom with the structure of the dictionary. The disadvantage is that the file must be accessed from top to bottom, from beginning to end. It is also application independent. This is a good method of storage if the amount of data is relatively small and the combined delays of searching are not an issue. Even a word processor search for a particular word in a relatively long file can output a result in good time. However, the dictionary file has to be more structured to allow for more efficient access as the data set gradually increases in size and as extra data attributes are added. In addition, the limited structure of flat file dictionaries like EDICT restricts the amount of information that can be included in each entry, because extracting information from the file would otherwise become extremely difficult. XML was chosen as the data format for the new dictionary file because it allowed for additional attributes and, most importantly, extensibility. Taking advantage of this extensibility, German glosses were added to dictionary entries. Breen called this new file the ‘Japanese Multilingual Dictionary’ or JMdict. Its superiority over other electronic dictionary files of this type meant that the JMdict file would be the primary data source, and the dictionary application would be built around this file.

 

 

 

According to Breen[2000 (b)], the aims of the JMdict project were as follows:

 

  1. “To convert the EDICT file to a new dictionary structure which overcomes the deficiencies in the current structure. With regard to this goal, the particular structural and content aspects to be addressed include, but are not limited to:
    1. The handling of orthographical variation (for example, in Kanji usage, Okurigana usage, readings) within the single entry;
    2. Additional and more appropriately associated tagging of grammatical and other information;
    3. Provision for separation of different senses (polysemy) in the translations;
    4. Provision for the inclusion of translational equivalents from several languages;
    5. Provision for inclusion of examples of the usage of words;
    6. Provision for cross-references to related entries.
  2. To publish the dictionary in a standard format which is accessible by a wide range of software tools; it is proposed that this goal be addressed by developing the structure so that it can be released as an XML document, with an associated XML DTD.
  3. To retain backward compatibility with the original EDICT structure in order to enable legacy software systems to use later versions of the EDICT files”

 

The following figure shows a sample of JMdict as displayed in Internet Explorer 5, which is capable of displaying Unicode fonts:

Figure 7 Sample of the JMdict file displayed on a Unicode capable browser

 

 

XML is a simplified subset of SGML and is not really optimized for the Web environment. However, it does mean that it is data processing-oriented (compared to browser-oriented HTML) and should not be seen by the end-user. This is not to say that XML cannot be presentable. Data stored in XML documents still needs to be presented to users in an attractive format. The rules for formatting an XML document are collected in a stylesheet, which can be used to transform the raw XML data into an HTML document complete with hyperlink markup. The language used to create the stylesheets is called XSL, or Extensible Stylesheet Language [Goldfarb, 1998].

 

 

 

4.2 XML information access

 

XML parsing systems

 

            Because such a structured storage format is being used to hold the dictionary entries, more sophisticated methods of searching and entry handling can be carried out. Because the entries are stored as XML, they need to be read by an XML parser. There are two major types of implementation of XML parsers available for processing XML documents:

-          Tree-based APIs

-          Event-based APIs

 

Firstly, there is SAX (Simple API for XML). This is an event-based API that uses a simple technique: the parser reads through the XML document from start to finish and reports each piece of markup as it is encountered, so that the application can pick out matching entries. On the other hand, there is a tree-based API called DOM (Document Object Model). This is slightly more sophisticated in that it builds a tree in which each part of the document is treated as an individual object. [Laddad, 2000] Both of these APIs were created to serve the same purpose: to provide access to the information stored in an XML file. The advantages and disadvantages of both methods are studied in the following sections to determine whether they can process JMdict efficiently. Of particular interest is how the XML document is broken down by the parser. Additional features that can be implemented in electronic dictionaries, such as cross-referencing, will also be compared under these two parsing techniques.

 

Using Java

 

When deciding what programming language to use to implement the multilingual dictionary, several requirements need to be fulfilled.

-          Ability to handle Unicode

-          Compatibility with XML tools, in particular SAX and DOM

-          Ability to integrate with the internet

-          Efficient and re-usable code

 

Java appears to be the best language in which to code the application. One of its greatest strengths is that its internal character representation is Unicode, which immediately removes the complexity of converting between foreign character standards and internal character representations. In comparison with C or C++, string handling in Java is far superior. Many XML tools are being developed for Java, and XML is beginning to be closely associated with Java. For example, IBM’s implementation of the XML parser has been released as XML4J (XML for Java). Standards such as DOM (and JDOM, see section 4.7) are being put forward for inclusion in the next Java release.
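As a small illustration of the Unicode requirement, the sketch below counts the characters in a Japanese word and the bytes it occupies once encoded as UTF-8. The class name UnicodeDemo, its two helper methods and the sample word are chosen only for illustration, and the code uses present-day standard library classes rather than anything specific to the Java release discussed above.

```java
import java.nio.charset.StandardCharsets;

// Minimal sketch: Java strings hold Unicode text directly, so Japanese
// headwords need no special handling inside the program.
final class UnicodeDemo {

    // Number of Unicode characters (code points) in the text.
    static int characterCount(String text) {
        return text.codePointCount(0, text.length());
    }

    // Size of the text in bytes once encoded as UTF-8 for storage.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The two kanji of "jisho" (dictionary), written as Unicode escapes
        // so that the source file itself stays plain ASCII.
        String headword = "\u8F9E\u66F8";
        System.out.println(characterCount(headword) + " characters, "
                + utf8Length(headword) + " bytes in UTF-8");
    }
}
```

The conversion to a byte encoding happens only at the boundary where text is read or written; inside the program the headword is an ordinary string.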

 

4.3 DOM (Document Object Model)

 

DOM addresses the object model not only of XML but also of HTML, since an HTML page is also a structured document. Currently, DOM is specified up to Level 1 Core. The DOM Level 2 specification is still a working draft and is likely to change before it officially becomes a standard.

 

DOM represents a document tree fully held in memory. It is a large API designed to perform almost every conceivable XML task. It also must have the same API across multiple languages. Because of those constraints, DOM does not always come naturally to Java developers who expect typical Java capabilities such as method overloading, the use of standard Java object types, and simple set and get methods. DOM also requires a great deal of processing power and memory, making it impractical for many lightweight Web applications and programs. However, we will still investigate the features of DOM and its suitability.

 

The DOM class hierarchy is divided into several node types.

-          Document – the Document node is the master node – only one of these can exist for an XML document. It represents the XML document as a whole.

-          NodeList – This type is used to hold a collection of child nodes. It allows access to the children by position.

-          NamedNodeMap – Contains additional functionality in relation to NodeList in that it is able to access the children by their names

-          Element – Represents an element of an XML document, identified by the name of its tag, for example <gloss>

-          Text – Represents the text contained between element tags, for example the words in <gloss>to conspire</gloss>

-          Attr (attribute) – This node represents an attribute declared within the start tag of an element. An example of this in XML format is <gloss g_lang="de">

-          CDATASection – Similar to the Text node, except that its content is not interpreted as markup. This allows the user to include text containing XML markup characters such as '<' and '&'

-          DocumentType – A node that represents the Document Type Definition declared for the document.

-          Entity, EntityReference and Notation – Used to describe entities and notations declared in the DTD.

 

The relationship and hierarchy between the different node types are shown in Figure 8. Various interface methods can be called on these nodes to manipulate and access them. Examples of the functions available are:

-          Child modifiers

-          Node creation, grabbing, moving and deletion

-          Element methods

-          Element usage such as child iteration 

 

Thus, DOM provides the mechanisms needed to interact dynamically with the elements and content in an XML document. With DOM, handlers, or hooks, are created to encapsulate behaviors associated with elements in the DOM tree. DOM sets out to be able to model every possible well-formed XML document. Therefore, the DOM classes contain features that many XML applications never use. An electronic dictionary requires few of these classes. In almost all procedures, the only operation to be carried out would be the retrieval of nodes and attributes. The file does not need to be, and should not be, modified by different user sessions, so the interfaces provided for dynamic modification of the XML document structure are not required.
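The read-only access described above can be sketched as follows. This is a minimal example under stated assumptions, not the implementation used for the dictionary: the class name DomGlossReader, the method glosses and the two-gloss sample document are invented for illustration, the entry structure is heavily simplified compared with JMdict, and the parser is obtained through the standard javax.xml.parsers classes of present-day Java.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class DomGlossReader {

    // Parses the whole document into a DOM tree, then collects the text of
    // every <gloss> element whose g_lang attribute equals the given code.
    static List<String> glosses(String xml, String lang) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The complete tree is built in memory by this call.
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("gloss");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element gloss = (Element) nodes.item(i);
                // getAttribute returns an empty string when the attribute is absent.
                if (lang.equals(gloss.getAttribute("g_lang"))) {
                    result.add(gloss.getTextContent());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, simplified sample; not taken from the real file.
        String xml = "<JMdict><entry><ent_seq>1</ent_seq>"
                + "<gloss g_lang=\"en\">to conspire</gloss>"
                + "<gloss g_lang=\"de\">konspirieren</gloss>"
                + "</entry></JMdict>";
        System.out.println(glosses(xml, "en"));
    }
}
```

Only node and attribute retrieval is used here, yet the parse call still materialises every node of the document, which is where the memory cost discussed in this section comes from.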

 

In summary, DOM is most useful for business applications that require the dynamic manipulation of elements and content in XML files. These features, important in other applications, are not a major factor in creating a dictionary application. The only concern in the usage of JMdict is efficient access to and searching of the XML file. DOM can provide this, but at a high cost in memory, which unfortunately makes it unsuitable for this application. An initial experiment using the DOM parser to build the DOM tree for JMdict resulted in a program crash because the system ran out of memory. With an ever-growing dictionary file, DOM alone is not a feasible option.


Figure 8 Hierarchical relationship between different node types

 

 

4.4 SAX (Simple API for XML)

 

SAX is a public domain API developed cooperatively by the members of the XML-DEV mailing list. It has now become a ‘de facto’ standard for event-based parsers and is one of the most popular XML APIs available. It provides an event-driven (sometimes referred to as a callback-style) interface to the process of parsing an XML document. As mentioned previously, XML is a hierarchical language, which means that elements can be nested and have parents and children. Although XML provides this kind of structure, an application does not need to view a document as a tree. SAX presents a view of the document as a sequence of events. For example, it reports every time it encounters a start tag and an end tag. That approach makes it a lightweight API that is good for fast reading. SAX does not support modifying the document, nor does it allow random access to the document.

 

In some cases, the event based API can be more efficient than a tree-based API. It generally provides a lower-level access to an XML document. An advantage of using SAX is its portability between other SAX parsers. Code created using one parser can be ported very easily to another parser package. Because SAX does not involve the generation of internal structures, it is able to handle large documents much better than DOM; there are no memory overheads associated with storing the XML data.

 

SAX is called the simple API because it is just that. All SAX does is trigger actions depending on events occurring in the XML file. SAX, knowing nothing of the rules that govern an XML document's structure, must be prepared for anything. It must watch for and, if directed, generate events for every possible XML feature that an XML document may contain. The application code is programmed to respond to these events according to the type of data being read. This allows for great flexibility. For example, the SAX parser reports every <entry> tag, so, if time and efficiency were not a factor, a query could be answered using SAX alone simply by comparing each entry in turn with the query string.

 

To make SAX handle events properly, a SAX document handler needs to be created to interpret all the SAX events. In addition, the behaviour of the handler in response to the data received from the parser needs to be coded as well (which can be a lot of work). The DocumentHandler interface is the most important part of the SAX interface. It is responsible for capturing specific document events.

 

-          startDocument and endDocument – Indicate the start and the end of the XML document

-          startElement and endElement – Indicate the start and the end of an element. This can be <entry> or even <gloss>

-          characters – Indicates that there is character data. This method can be used to retrieve data such as the definition of an entry.
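The callbacks listed above can be combined into the simple sequential search mentioned earlier. The following is a sketch under stated assumptions: the class and method names are invented for illustration, the sample document is a simplified stand-in for JMdict, and the handler extends DefaultHandler from the later SAX 2 API (the successor to the DocumentHandler interface), which offers callbacks of the same names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxGlossSearch {

    // Remembers the sequence number of every entry that has a gloss equal
    // to the query string.
    private static final class SearchHandler extends DefaultHandler {
        private final String query;
        private final List<String> matches = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private String currentSeq;

        SearchHandler(String query) {
            this.query = query;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            text.setLength(0); // a new element starts: discard collected text
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one element, so append.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            if (qName.equals("ent_seq")) {
                currentSeq = value;
            } else if (qName.equals("gloss") && value.equals(query)) {
                matches.add(currentSeq);
            }
            text.setLength(0);
        }
    }

    // Reads the document from start to end and returns the matching entries.
    static List<String> search(String xml, String query) {
        SearchHandler handler = new SearchHandler(query);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler.matches;
    }

    public static void main(String[] args) {
        // Hypothetical, simplified sample; not taken from the real file.
        String xml = "<JMdict>"
                + "<entry><ent_seq>1</ent_seq><gloss>to conspire</gloss></entry>"
                + "<entry><ent_seq>2</ent_seq><gloss>to run</gloss></entry>"
                + "</JMdict>";
        System.out.println(search(xml, "to run"));
    }
}
```

Nothing is stored apart from the current element's text and the list of matches, so memory use stays small, but every query costs a full pass over the document.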

 

The following diagram outlines the basic functions of the SAX parser and how it may be used in a dictionary application:

 

Figure 9 SAX parser operation and a possible application in a multilingual dictionary

 

There are several XML parsers that have built-in SAX support:

-          Microstar’s Aelfred

-          James Clark’s XP

-          IBM’s XML for Java

 

            Using the SAX parser alone as the search engine would be impractical. Although SAX is the better parser for large files, it suffers from the fact that it is event-driven: the parser has to start at the beginning of the document and only finishes once it reaches the end, which is not efficient enough for answering individual queries. However, the event-based nature of SAX also has its advantages. One suggestion is to use its start-to-end pass to create an index file. The parser calls the startElement function whenever it encounters a start tag, whether or not the application is interested in that tag. By selectively processing only elements such as <ent_seq> (the sequence number of the entry) and <gloss> (the English definition), an index file could be created.
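A single SAX pass that builds such an index might look like the sketch below. It is only an illustration under stated assumptions: the class name SaxIndexBuilder and the in-memory sorted map standing in for the index file are invented here, the sample document is simplified, and the handler is again based on DefaultHandler from the later SAX 2 API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxIndexBuilder {

    // Collects text only inside <ent_seq> and <gloss>; all other elements
    // still trigger the callbacks but are ignored.
    private static final class IndexHandler extends DefaultHandler {
        final SortedMap<String, List<String>> index = new TreeMap<>();
        private final StringBuilder text = new StringBuilder();
        private String currentSeq;
        private boolean wanted; // true while inside <ent_seq> or <gloss>

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            wanted = qName.equals("ent_seq") || qName.equals("gloss");
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (wanted) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("ent_seq")) {
                currentSeq = text.toString().trim();
            } else if (qName.equals("gloss")) {
                // One index line per gloss: gloss text -> entry sequence numbers.
                index.computeIfAbsent(text.toString().trim(),
                        k -> new ArrayList<>()).add(currentSeq);
            }
            wanted = false;
        }
    }

    // Returns the index sorted by gloss, ready to be written out line by line.
    static SortedMap<String, List<String>> build(String xml) {
        IndexHandler handler = new IndexHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler.index;
    }

    public static void main(String[] args) {
        // Hypothetical, simplified sample; not taken from the real file.
        String xml = "<JMdict>"
                + "<entry><ent_seq>1</ent_seq><gloss>to run</gloss></entry>"
                + "<entry><ent_seq>2</ent_seq><gloss>to conspire</gloss></entry>"
                + "</JMdict>";
        build(xml).forEach((gloss, seqs) -> System.out.println(gloss + "\t" + seqs));
    }
}
```

A sorted map is used so that the entries come out in the order a later binary search would need; writing them to a file is left out to keep the sketch short.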

 

Another implementation may be possible using the SAX parser. If the application can use an alternative technique (such as indexing or hashing) to find the exact location of the queried entry in JMdict, the SAX parser could be used to parse the entry found there and those that follow it. As mentioned before, the methods of the SAX DocumentHandler interface can be implemented so that they exhibit particular behaviour when a specific tag is found. This feature can be used to allow the application to respond specially when, for example, a bibliographic attribute is found in a particular entry. SAX parsers are built to parse a complete file at one time, but it should be possible to make minor modifications so that the parser is able to start at a certain point in the file without having to validate the whole file. The following table compares SAX and DOM.

 

 

                             SAX            DOM

Information access           Sequential     Random

Setup cost                   None           High (document parsed into memory)

Memory cost                  Low            Very high

Applicability to JMdict      Possible       Impractical

Table 4. SAX and DOM comparison
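The idea raised before Table 4, of locating an entry by other means and then handing it to SAX, can be sketched without modifying the parser. Note that this swaps in a different technique from the one suggested in the text: instead of changing the parser to start mid-file, the program seeks to a known byte offset, copies the bytes up to the next </entry> tag, and parses that fragment as a small document of its own. All names (EntryAtOffset, readEntry, glossesOf, writeSample) and the sample data are hypothetical. A fragment that relies on entity references declared in the DTD would not parse on its own, so the sketch assumes there are none.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class EntryAtOffset {

    // Reads bytes from the given offset up to and including the next </entry>.
    // Byte-at-a-time reading keeps the sketch short; a real program would buffer.
    static String readEntry(Path file, long offset) {
        byte[] endTag = "</entry>".getBytes(StandardCharsets.UTF_8);
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            int matched = 0;
            int b;
            while (matched < endTag.length && (b = raf.read()) != -1) {
                buf.write(b);
                if (b == endTag[matched]) {
                    matched++;
                } else {
                    matched = (b == endTag[0]) ? 1 : 0;
                }
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses a single <entry> fragment with SAX and returns its glosses.
    static List<String> glossesOf(String entryXml) {
        List<String> glosses = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("gloss")) {
                    glosses.add(text.toString().trim());
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(entryXml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse entry", e);
        }
        return glosses;
    }

    // Writes the sample text to a temporary file so that seeking can be shown.
    static Path writeSample(String xml) {
        try {
            Path file = Files.createTempFile("jmdict-sample", ".xml");
            file.toFile().deleteOnExit();
            Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical ASCII-only sample, so character and byte positions agree.
        String xml = "<JMdict>"
                + "<entry><ent_seq>1</ent_seq><gloss>to conspire</gloss></entry>"
                + "<entry><ent_seq>2</ent_seq><gloss>to run</gloss></entry>"
                + "</JMdict>";
        long offset = xml.lastIndexOf("<entry>"); // would come from an index
        System.out.println(glossesOf(readEntry(writeSample(xml), offset)));
    }
}
```

Only the bytes of one entry are read and parsed, so the cost of a lookup no longer depends on the size of the dictionary file, provided the offset is already known.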

 

 

4.5 Object orienting the XML file

 

A paper dictionary can in some respects be considered a file in which the entries are arranged in alphabetical order, and each entry can be considered an object. Electronic dictionaries face the dilemma of how best to represent the data of a dictionary internally. Treating each entry as an object is a fair starting point: it is a system that has worked before, and it could work for electronic dictionaries. However, this is not a situation where ‘if it ain’t broke, don’t fix it’ applies. By using objects to represent entries, many of the advantages of using an electronic medium to store, search and display information would be lost. If each entry is a self-contained object, it becomes quite difficult to link entries together. Headwords in a dictionary are invariably tied to many other words in the dictionary; no word is unrelated. The relationships include part of speech, usage, synonymy, antonymy and word derivation, to name a few. It would be advantageous to base the representation of the dictionary entries on the relationships between the various attributes of an entry, which would make for a more diverse and functional dictionary system. An object-oriented approach to entry storage cannot easily cope with so many links: the structure would be expensive to construct and complex to traverse.

 

Figure 10 With each entry containing so many incoming and outgoing links, the structure can become quite complex and difficult to traverse

 

From the analysis of the object-based DOM API, it seemed that an object-oriented tree structure was not feasible in terms of memory. The concept of object orienting a dictionary also seems far from adequate as a way of modelling a dictionary entry. The following chapter will discuss other methods with which to represent JMdict.

 

4.6 Other XML tools

 

There are other XML tools that are not relevant to creating the dictionary application. They are briefly summarized here:

 

XML Base (XBase) allows a document to specify a base URI (Uniform Resource Identifier) against which all relative URI references in the document can be resolved. This includes references to images, stylesheets, applets, etc. It is anticipated that this specification will endorse the 1.0 version of the XBase specification.

 

XML Pointer Language (XPointer) is a language that can be used as a fragment identifier for any URI that locates an XML resource. It is based on the XML Path Language (XPath). It supports addressing internal structures of XML documents, traversals of a document tree, and the selection of internal parts of an XML document based on various properties. It is anticipated that this specification will endorse the 1.0 version of the XPointer specification.

 

XPath is a language for addressing parts of an XML document, designed to be used by both XSLT and XPointer. It is anticipated that this specification will endorse the 1.0 version of the XPath Recommendation.

 

 

 

 

4.7 Future XML tools

 

Both DOM and SAX are general-purpose solutions to XML document processing. Neither currently seems suitable, on its own, for processing the JMdict file. Developers have also picked up on some of the points where both SAX and DOM fall short and have begun creating new APIs for use with Java and other programming languages to fill the gap. Although some of these APIs are very new, they offer hope for using XML-specific tools to create XML-based applications more efficiently.

 

JDOM

 

JDOM is a Java representation of an XML document. It is a new API for reading, writing, and manipulating XML from within Java code. JDOM attempts to incorporate the best of DOM and SAX and create a new set of classes and interfaces. It can read from existing DOM and SAX sources, and output to DOM and SAX receiving components. That ability enables JDOM to interoperate seamlessly with existing program components built against SAX or DOM. It provides a way to:

-          Represent the document

-          Read the file easily and efficiently

-          Manipulate the data

-          Write the data back to file

It is an alternative to DOM and SAX, although it integrates well with both DOM and SAX.

 

JDOM documents can be built from XML files, DOM trees, SAX events, or any other source. JDOM documents can be converted to XML files, DOM trees, SAX events, or any other destination. This ability proves useful, for example, when integrating with a program that expects SAX events. JDOM can parse an XML file, let the programmer easily and efficiently manipulate the document, then fire SAX events to the second program directly - no conversion to a serialized format is necessary.

 

The developers of the JDOM standard have described the development of this API as:

-          “Straightforward for Java programmers.

-          Supporting easy and efficient document modification.

-          Ability to hide the complexities of XML wherever possible

-          Able to integrate with DOM and SAX.

-          Being lightweight and fast.

-          Able to solve 80% (or more) of Java/XML problems with 20% (or less) of the effort” [JDOM, 2000]

Loading and manipulating documents should be quick, and memory requirements should be low. It provides a full document view with random access but it does not require the entire document to be in memory.

 

One of the drawbacks of attempting to use JDOM at this point in time is its early stage of development. The JDOM API has not yet been released as a beta. This means that there will be changes to interfaces and classes, so the code that accesses information in the dictionary file may have to change as the standard develops. Coupled with the constant growth of the multilingual dictionary, there may be too many changes happening around the application. However, because of its ability to output data to existing DOM and SAX components, JDOM could be used to replace older parts of a system that used SAX or DOM. If the developers’ claims about improvements in functionality and efficiency over current standards are correct, the application may run faster.

 

 

 

Adelard

 

Adelard is an alternative to existing technologies like SAX and DOM. These technologies, while useful, operate at the level of individual elements and attributes. The application code in effect has to implement a ‘bridge’ from these entity-level components to user-level dictionary entries. Adelard aims to map XML document structures directly to high-level objects. The API is being developed for mapping XML documents to business-level objects, which in this context can be seen as a mapping from XML to dictionary-entry objects.

 

Adelard comprises two integrated parts: a binding framework and a schema compiler. The data binding framework supports the transformation of XML documents to and from Java objects. Both the source schema (otherwise known as the DTD) and the binding schema are inputs to the schema compiler. The source schema describes the structure of the XML file, and the binding schema describes the program-specific information that drives the generation of the Java classes from the source schema. Because Adelard knows about the schema, it can optimize the generated classes to support only those features necessary for the schema in question. It can omit support for unused XML document features.

 

Figure 11 Diagram of the schema used to create objects

 

Projected performance is one of the most important pieces of information. Current benchmarking indicates that Adelard is both faster than SAX and easier on resources than DOM. This is clearly a great advantage for improving the efficiency of the application. The public release of the specification and an early-access implementation is projected for the end of 2000.

 

 

JAXP

 

The JAXP specification focuses on the parser itself, providing an API for creating, configuring, and manipulating XML parsers. JAXP supports SAX and DOM, which are the most common interfaces to XML. JAXP's main goal is to provide an interface that lets programmers create, manipulate, and use standard XML parsers. In addition, JAXP sets out to allow programmers to write parser-neutral code, deferring parser selection until later in the process. Sun is the creator of JAXP and is currently starting work on version 1.1. Sun hopes to make JAXP a core Java extension.

 

 

 

 

 

XML Query Engine

 

XML Query Engine is a JavaBean component that lets you search your XML documents for element, attribute, and full-text content. It can index multiple documents using a SAX parser of your choice. The index, once built, can be queried using XQL, a ‘de facto’ standard for searching XML like a database.

 

XML Query Engine uses an index system to track every element, attribute, and the words contained in each, for every document. Any document to be queried needs to be indexed first. Before indexing, though, the query engine has to be told what sorts of things to index or ignore. By default, XML Query Engine indexes everything it encounters. That might not be what users want, so restrictions can be put in place to customize the index.

 

In its current incomplete, unpolished form it is not yet a beta, but it does provide an attractive alternative to the techniques described earlier. However, it seems to be focused on business applications. This means that some of the functions included in the API would again be irrelevant, and thus unused, in the implementation of the multilingual dictionary program. Currently, there is no implementation of a persistent store for the index, which means that the API creates the index each time the application starts up. It seems unlikely that indexing the 13 MB file would be fast enough to satisfy users. The resulting delay may be a price users are unwilling to pay on a regular basis.


Chapter 5

Indexing and entry retrieval

 

 

 

When JMdict is parsed into memory using basic data structures, the number of entries in the file (13 MB) can result in excessive memory usage by the system. It is not practical for a dictionary application to use up the majority of a computer’s memory. This means that other methods of storage need to be sought in order to improve the speed and memory efficiency of the application. Object orientation of the entries has been explored, but unfortunately it does not provide the flexibility needed to allow different types of access to the entries. Some of these difficulties can be attributed to the design of the JMdict file. Despite this, there are still several possibilities for internally representing the dictionary file.

 

5.1 Limitations with JMdict

 

            JMdict is a revolutionary dictionary file in comparison with its predecessor, EDICT. It makes many structural changes and introduces features and a format that will allow the file to grow further in the future. However, there are a few aspects of its design that can potentially affect the way a dictionary file is represented. There are many issues associated with multilingual dictionary files in general, which have been discussed previously. Because the headwords in the JMdict file are in Japanese, the type of information available is restricted: English words that are not covered by the gloss of some Japanese headword may be left out of the dictionary. In addition, this multilingual dictionary provides translations from one language to another, but no definitions within the same language. These could be added in further updates. Furthermore, the English glosses are sometimes quite lengthy or provide a deep definition; one English gloss reads "in the blink of an eye" (lit: in the time it takes to say "Ah!"). When it comes to translating from English to Japanese, the only information that a user can extract in Japanese is the headword; for a query such as "to run" there is no indication of its usage. The XML file is already quite large, and will undoubtedly grow bigger. Many of the tools available for XML are business-oriented tools designed for smaller files, which are sometimes dynamically created.

 

In the end, some design features are more desirable than others and, as with most things, they come at the cost of something else. Breen has set up a mailing list for developers and computer scientists interested in JMdict to take part in discussions that could lead to changes in its makeup. JMdict is a good design, but some changes could possibly be made to improve the file.

 

5.2 Indexing techniques

 

Indexing is the process of pre-building the internal data structures needed to enable subsequent fast retrieval from the indexed documents. The use of an index file is well suited to dictionary applications because, depending on how the index file is set out, advanced searches on alphabetically or logically ordered data are possible.

 

The index file is an efficient method of entry retrieval. It can be used to create a fast dictionary system. The idea of setting up an index file is not new and has been implemented numerous times. The reason for investigating this technique is to establish whether it can be used just as well in the context of an XML document. JMdict is a data store like any other file; it differs by containing additional structure. This added structure may open up new methods of data representation, but older techniques should also be investigated.

 

One such technique is used by Breen (Jdic, Xjdic) to facilitate rapid searching of the EDICT file. An indexing utility identifies the byte offsets of English, Kanji and kana tokens. These entries are sorted into word value order to produce an index file. To find the correct entries, the search engine searches through the index file, checking EDICT to confirm the match. Once the entry is found, it is passed back to the application for display. Once a mapping has been found, it is a relatively easy task to open the dictionary file and move to the specific byte offset to access the information required.
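The byte-offset idea can be carried over to an XML file along the following lines. This is a sketch of the general technique only, not a description of Breen's indexing utility: the class OffsetIndex, its Posting type and the sample data are invented for illustration, each gloss is treated as a single token, and only plain <gloss> tags without attributes are recognised.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class OffsetIndex {

    // One index line: a search token and the byte offset of its entry.
    static final class Posting {
        final String token;
        final long offset;

        Posting(String token, long offset) {
            this.token = token;
            this.offset = offset;
        }
    }

    // Scans the raw bytes of the dictionary and returns postings sorted by token.
    static List<Posting> build(byte[] dictionary) {
        // ISO-8859-1 maps each byte to exactly one char, so positions in this
        // string are byte offsets; the ASCII tags searched for are unaffected.
        String raw = new String(dictionary, StandardCharsets.ISO_8859_1);
        List<Posting> postings = new ArrayList<>();
        int entry = raw.indexOf("<entry>");
        while (entry >= 0) {
            int end = raw.indexOf("</entry>", entry);
            if (end < 0) {
                break; // truncated final entry: stop scanning
            }
            int gloss = raw.indexOf("<gloss>", entry);
            while (gloss >= 0 && gloss < end) {
                int from = gloss + "<gloss>".length();
                int to = raw.indexOf("</gloss>", from);
                if (to < 0 || to > end) {
                    break; // unterminated gloss: skip the rest of this entry
                }
                // Decode the token itself as UTF-8 from the original bytes.
                String token = new String(dictionary, from, to - from,
                        StandardCharsets.UTF_8);
                postings.add(new Posting(token, entry));
                gloss = raw.indexOf("<gloss>", to);
            }
            entry = raw.indexOf("<entry>", end);
        }
        postings.sort(Comparator.comparing((Posting p) -> p.token));
        return postings;
    }

    public static void main(String[] args) {
        // Hypothetical, simplified sample; not taken from the real file.
        String xml = "<JMdict>"
                + "<entry><ent_seq>1</ent_seq><gloss>to run</gloss></entry>"
                + "<entry><ent_seq>2</ent_seq><gloss>to conspire</gloss></entry>"
                + "</JMdict>";
        for (Posting p : build(xml.getBytes(StandardCharsets.UTF_8))) {
            System.out.println(p.token + "\t" + p.offset);
        }
    }
}
```

Each posting is small compared with the entry it points to, which is why the resulting index is much smaller than the dictionary file itself.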

 

Breen’s scheme had only one index file for English, Kanji and kana, and allowed searching for English and kana tokens. One of the restrictions of index files is that a single index really only allows the file to be searched in one dimension. If an English to Japanese dictionary were implemented with an English index file containing the English headwords as index entries, the user’s search would have to be based on an English headword. A solution to this problem is to have not one index for the multilingual file but multiple indexes holding different headwords and index entries. This is a very viable solution. Not only does it give the user the flexibility to search in different languages, but separate indexes for, say, verbs, nouns and synonyms would improve the functionality of the dictionary system by providing richer and more interesting searches. This increased functionality can lead to further improvements, such as more useful display methods. Figure 12 shows a possible arrangement of a dictionary application using multiple indexes.

 

A further advantage of using an index file is that it is very suitable for running the application remotely. The user in a remote location only needs to parse certain sections of the XML file to find the information needed, instead of reading in a complete dictionary file, which would be impossible with low bandwidth connections. In addition, an index file is usually much smaller and simpler than the dictionary file. Because it is much smaller than the dictionary itself, it should fit into memory very easily, which makes it quick to reference. It is also relatively easy to compile using an indexing program that can be supplied with the dictionary application, so when new updates to the files arrive, the user can re-create the index.

 

Figure 12 Outline of possible index filing of the multilingual dictionary coupled with the SAX parser

 

The index files are stored in alphabetical order. In some cases they are ordered by another key, for example part of speech or bibliographic entry. A very efficient way of searching ordered data is a simple binary search. Even on large files, binary searching is a fast and effective way of locating an entry.
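A binary search over a sorted index can be sketched as follows. The class name IndexSearch, the three sample tokens and their byte offsets are hypothetical and serve only to show the lookup; the index is assumed to be already loaded into a sorted array.

```java
final class IndexSearch {

    // Returns the position of key in the sorted array, or -1 if it is absent.
    static int find(String[] sortedTokens, String key) {
        int low = 0;
        int high = sortedTokens.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // midpoint without int overflow
            int cmp = sortedTokens[mid].compareTo(key);
            if (cmp < 0) {
                low = mid + 1;   // key sorts after the midpoint
            } else if (cmp > 0) {
                high = mid - 1;  // key sorts before the midpoint
            } else {
                return mid;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical index: tokens in sorted order with parallel byte offsets.
        String[] tokens = {"dictionary", "to conspire", "to run"};
        long[] offsets = {120L, 8L, 64L};
        int pos = find(tokens, "to run");
        System.out.println(pos >= 0 ? "offset " + offsets[pos] : "not found");
    }
}
```

Each comparison halves the remaining range, so an index of 100,000 tokens (a figure chosen only for illustration) needs at most 17 comparisons per lookup.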

 

 

5.3 Databases

 

DBMS

 

There has been debate as to whether XML is in fact a database. Strictly, the answer is no: although an XML document contains data, without additional software to help process that data it is no more a database than any other text file. However, if the XML document is considered along with all the surrounding XML tools and technologies, then the answer may be yes, because together they can provide some of the features found in databases.

 

A database management system (DBMS) is a software system for defining the structure of, entering, retrieving, storing, maintaining and validating a large collection of formatted data. There are many advantages to using a DBMS with a large amount of data, and it is a natural progression to consider storing a large and expanding information source such as a multilingual dictionary file in a commercial DBMS. Sharing between users, data managed as a resource, data independence from applications, and portability are all reasons why a DBMS is a popular form of data storage.

 

In order to transfer data between an XML document and a database, it is necessary to map document structure to database structure and vice versa. Such mappings fall into two general categories: template-driven and model-driven. In a template-driven mapping, there is no predefined mapping between document structure and database structure. Instead, commands are embedded in a template that is processed by the data transfer middleware. In a model-driven mapping, a data model is imposed on the structure of the XML document, and this model maps to the structures in the database. What is lost in flexibility is gained in simplicity.

Object-oriented mappings are established techniques by which object-oriented data stored as an XML document can be transferred into relational databases. However, there are issues with viewing dictionary entries as objects. As discussed in the previous chapter, seeing an entry as an object does not necessarily mean that the system can get the maximum functionality out of the dictionary file.

 

 

Figure 13 Database approach to data retrieval

 

A DBMS is well suited to storing data that fits into set categories. Although a dictionary is made up entirely of entries, the types of data stored within them vary greatly. Consider the problem of a bilingual dictionary, for example a Japanese to English dictionary. It can be defined using data modelling via an entity relationship diagram:

 

Figure 14 Possible entity relationship diagram for a Japanese to English bilingual dictionary

 

Choosing the key for the database is not very difficult: the only attributes that uniquely identify an entry are the Kanji headword and the Kana headword. It is a harder task to decide how the remaining attributes should be grouped into tables. The attributes are interconnected, and trying to separate them into relations results in confusion. Establishing additional relational tables for further languages is likely to be an arduous task. The reason it is so difficult to assign the different attributes to tables is that a relational database management system is not well suited to a dictionary file.

 

Problems with DBMS

 

-          Not all potential users can have access to such systems.

-          Not all entries contain the same types of entities: one may be a cross-reference and another a complete analysis. Some entries may have multiple definitions and multiple parts of speech (for example, a word that is a noun in one context and an adjective in another).

-          It is unclear what constitutes a key and which attributes map to which tables. If someone wants to extract a list of nouns from the dictionary, this may be impossible under some configurations of the tables in the DBMS.

-          A DBMS restricts the variety of relationships that can be represented between lexical entries. If these are limited, it imposes barriers similar to those imposed by object orienting the dictionary.

-          The ER model has difficulty representing hierarchical information and cannot deal with data types that have variable structures.

-          Space can be wasted if entries with variable structure are stored in tables, because an attribute that is left blank still takes up space in the database. Across all the different combinations of attributes in an entry, a lot of space can be lost just to hold empty fields.

-          Entries in dictionaries can be very complex, requiring different attributes to be stored in different tables if a DBMS were used. When the components of an entry are scattered through different tables, the number of transformations (such as restrictions, projections and joins) increases to a point where retrieval of a single entry can become time consuming.

-          In the above ER diagram, the attributes shown are not the only data types; they may contain subentries that in turn contain their own subentries. This kind of nesting can result in an unusable database.

-          Once the relational database is set up for a dictionary system, its structure is almost permanent. Addition or rearrangement of attribute groups is extremely difficult. For example, if it is decided that another language is going to be added to the database, it is extremely difficult to map it onto the current setup. A major structural overhaul of a bilingual dictionary database would be needed in order to incorporate the extra language.
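The point about scattered components and joins can be made concrete with a small sketch. It uses Python's built-in sqlite3 module; the table names, column names and the sample entry are illustrative assumptions and are not taken from Figure 14.

```python
import sqlite3

# Hypothetical normalised schema for a Japanese to English/German dictionary.
# Table and column names are assumptions made for this sketch only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (entry_id INTEGER PRIMARY KEY, kanji TEXT, kana TEXT);
    CREATE TABLE sense (sense_id INTEGER PRIMARY KEY, entry_id INTEGER, pos TEXT);
    CREATE TABLE gloss (sense_id INTEGER, lang TEXT, text TEXT);
""")

# Made-up sample data for one entry with glosses in two languages.
conn.execute("INSERT INTO entry VALUES (1, '走る', 'はしる')")
conn.execute("INSERT INTO sense VALUES (10, 1, 'verb')")
conn.executemany(
    "INSERT INTO gloss VALUES (?, ?, ?)",
    [(10, "eng", "to run"), (10, "ger", "laufen")],
)


def fetch_entry(kana):
    """Reassemble one entry; each additional table costs another join."""
    return conn.execute(
        """
        SELECT e.kanji, e.kana, s.pos, g.lang, g.text
        FROM entry e
        JOIN sense s ON s.entry_id = e.entry_id
        JOIN gloss g ON g.sense_id = s.sense_id
        WHERE e.kana = ?
        ORDER BY g.lang
        """,
        (kana,),
    ).fetchall()


rows = fetch_entry("はしる")
```

Even this minimal schema needs two joins to rebuild a single entry. Adding examples, cross-references or a further language would each add a table and a join, or force a change to the existing tables.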

 

Despite all these negative comments about DBMS, there are several positive aspects from using DBMS in the context of lexicographical information representation.

 

-          Query handling is carried out by the system. Relational databases can also connect to systems that allow editing of the tables, and can be linked to applications that provide interfaces for querying the database.

-          Headwords that are synonyms can share the same definition – they no longer need to be stored differently. One key can have a one to many relationship between different entities in different tables, such as multiple definitions for an entry.

-          Access is carried out through a query language such as SQL, making access relatively easy

-          The most important advantage of a DBMS is that it does not require the user to search the file from beginning to end to find an entry. Compared to a technique such as SAX, retrieval is far faster and more efficient.

-          Given the low price and ease of use of databases like dBASE and Access, it may seem like a good option

 

There are also database-style implementations that use the DOM form of document object. The biggest problem with the DOM structure is that it is entirely resident in memory. To bypass this problem, a group of developers at GMD-IPSI have created software called PDOM that avoids the need to store the complete document structure in memory.

 

PDOM provides an implementation of the DOM over indexed, binary files. These are created by converting existing XML documents (a one-time operation), as well as when PDOM is used to create a new DOM Document. PDOM includes a cache, which swaps DOM nodes to disk when handling large DOM trees, defragmentation and garbage collection facilities, commit points (for writing the in-memory tree to disk), file compression with gzip, and thread-safe operation. This application has not been tried or tested, but it is an alternative worth considering when deciding to implement a dictionary system.

 

5.4 Hash tables

 

A hash table is a form of data structure that can offer very fast insertion and searching. No matter how many data items there are, insertion and searching can take close to constant time. Hash tables rely on the concept of having a range of keys that are transformed into a range of array index values. In some cases, the index number can map directly onto an array, removing the need to transform index to array number. Despite these strong positive points, hash tables fall short of being a good internal representation because they cannot cope with sequential access. A detailed description of hash tables, and the reasons why they are not a good technique for representing a multilingual file, are given in Appendix C.

 

 

5.5 Additional considerations

 

The preceding discussion describes the methods in which the dictionary file can be stored and structured in a way that allows fast and flexible access. This is but one link in the chain to a successful application. The search string queries sent to the file structure to find entries must be studied carefully. The type of information to be stored in a file such as an index file needs examination.

 

Indexing issues when creating index file

 

Some minor problems arise when indexing the JMdict file, owing to the nature of the English definitions. A definition may consist of a single word, two words, or a phrase. Consider the definition ‘to run’. It should not be indexed under ‘to’: a user who typed ‘to’ would get an avalanche of definitions. On the other hand, a user who typed ‘run’ might not find what they are looking for, since the index is searched alphabetically for matching words and ‘to run’ sorts under ‘t’. The word the index really needs from this definition is ‘run’. This difficulty is not limited to the English definitions; it may also affect the German definitions.

One solution is an exclusion list: a list of words that are omitted when the index file is created, such as ‘to’, ‘the’, ‘or’ and ‘and’. This alone would reduce the size of the index file, because these words occur very frequently in the dictionary. More importantly, it would improve the effectiveness of searches. A further technique to ensure that the user can retrieve the correct headword is to store every word contained in an English definition in the index file. For example, the definition ‘hot water’ would be stored under two separate index words, ‘hot’ and ‘water’. A user searching for ‘water’ would then retrieve the derivatives and compound words that contain ‘water’.
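A minimal sketch of these two indexing ideas follows, assuming entries are available as (entry id, English definition) pairs. The function name, the entry ids and the in-memory dictionary standing in for the index file are assumptions made for illustration; only the four excluded words come from the text.

```python
from collections import defaultdict

# Exclusion list using the example words given in the text.
EXCLUSION_LIST = {"to", "the", "or", "and"}


def build_english_index(entries):
    """Map every indexable word of a definition to the ids of the entries using it."""
    index = defaultdict(set)
    for entry_id, definition in entries:
        for word in definition.lower().split():
            if word not in EXCLUSION_LIST:
                index[word].add(entry_id)
    return index


# Made-up sample definitions.
sample_entries = [(1, "to run"), (2, "hot water"), (3, "water")]
index = build_english_index(sample_entries)
```

With this sample, ‘run’ leads to entry 1, ‘water’ leads to entries 2 and 3, and ‘to’ does not appear in the index at all.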

 

Scalability of system

 

Dictionaries are constantly growing in size. Take the Oxford dictionary for example: this English dictionary undergoes updates and additions every year as new words are created. A multilingual dictionary will likewise continue to grow. JMdict currently supports only 3 languages, and even this support is limited. When choosing the internal representation of the dictionary file, scalability is therefore a major issue. The system must be able to:

-          Accommodate large additions in data to all aspects of the file

-          Be flexible enough to allow addition of further fields and entry types

-          Still maintain efficiency with much larger files

-          Cope with the addition of another language

 

As discussed in a previous section, a DBMS is able to grow along with the dictionary file. Its Achilles heel is its inability to cope with the addition of further fields and entry types. Hash tables are poor at providing scalability and sequential access. The indexing system should be able to cope reasonably well with most of these factors.


Chapter 6

Interface to an Electronic Dictionary

 

 

 

The problem of visualization is a challenging one. Providing a translation of a word from the user’s native language to a second language is of little help without advice on its usage and its relationship to other words. This limitation does not come only from the format of the dictionary; failure to deliver the information to the user can also be the fault of the interface to the dictionary. Furthermore, an electronic interface can deliver much more than a print dictionary, and it should reflect this fact. The success of an application also depends on whether a user finds the interface attractive and, more importantly, intuitive and easy to navigate. Unless the software is outstanding, some users will not tolerate a steep learning curve.

 

The primary focus of the interface of the electronic dictionary is the user. The amount and format of information presented can vary far more than in print dictionaries. A user’s most basic requirements of a dictionary are:

-          To enter a query

-          To view the entry in the dictionary that corresponds to the search

Therefore, the dictionary interface should be built around these two aspects of a user’s needs.

 

6.1 Interface query options

 

Print dictionaries only support looking up a specific entry. As mentioned previously, the electronic storage of dictionaries allows increased search functionality, provided adequate data representation techniques are used. The following are some of the different input and searching methods that can be employed:

-          Complex searching options. This is one of the major features that electronic dictionaries have over print dictionaries – being able to search not only in alphabetical order, but in other interesting ways:

o        Searching for verbs, nouns, adjectives

o        Searching for words beginning or ending with a certain letter combination

o        Search for synonyms and antonyms

o        Search for ‘n’ letter words

o        List the dictionary in alphabetical (traditional print) format

o        Combination searches (search for nouns that begin with “trace”)

o        Restriction of a search; re-search the list of entries with a different search

o        Return words that sound like “grass”

-          These complex searching techniques can be made possible by a data representation scheme such as indexing. Multiple index files would be created, each providing a different index. Each index allows a unique kind of search to be carried out, and combination searches can use several of the index files at one time.

-          Follow up cross references in an instant. Cross references displayed after a search should be able to be followed up. It should be possible to submit a cross-reference as a search query by clicking on its text or by some other form of request.

-          Choice of different input schemes for users of different languages. The provision of input schemes for languages that cannot be entered through a western keyboard needs to be implemented. Research into this area is extremely broad and diverse depending on the language and for this reason was not investigated in this project.
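The combination and restriction searches listed above can be sketched with two small in-memory indexes. The index layout, the function names and the sample headwords are assumptions for illustration; a real system would read the indexes from index files.

```python
from collections import defaultdict


def build_indexes(entries):
    """Build one index per search type from (headword, part of speech) pairs."""
    pos_index = defaultdict(set)     # part of speech -> headwords
    length_index = defaultdict(set)  # number of letters -> headwords
    for headword, part_of_speech in entries:
        pos_index[part_of_speech].add(headword)
        length_index[len(headword)].add(headword)
    return pos_index, length_index


def restrict(results, prefix):
    """Re-search an existing result set, keeping words that begin with prefix."""
    return {word for word in results if word.startswith(prefix)}


# Made-up sample data.
sample = [
    ("trace", "noun"), ("trace", "verb"), ("tracer", "noun"),
    ("traceable", "adjective"), ("track", "noun"), ("grass", "noun"),
]
pos_index, length_index = build_indexes(sample)

# Combination search: nouns that are also 5-letter words (two indexes at once).
five_letter_nouns = pos_index["noun"] & length_index[5]

# Restriction of a search: nouns that begin with "trace".
nouns_beginning_trace = restrict(pos_index["noun"], "trace")
```

Because each index returns a set of headwords, a combination search reduces to a set intersection, and a restriction is a filter applied to the previous result.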

 

6.2 Viewing search results

 

Because the JMdict file is stored in Unicode and the programming language is capable of handling Unicode, the next character encoding issue is displaying the Unicode information on the screen. MS Windows 9x operating systems do not use Unicode fonts by default. If Unicode is required on these systems, specialized Unicode fonts need to be installed or downloaded via the internet. Because the application is stand-alone, Unicode support for the display of foreign characters must be incorporated into the software. The investigation of ways to display Unicode was outside the scope of the project.

 

The display of entries to the user can be creative, yet practical for the user. There are many things that can be displayed to the user at one time. The following is a list of features and displays that can be provided for the user:

-          Pictures and sounds for the user. Sounds can give the user an audio experience that aids learning of the spoken language. Dictionaries have usually been most useful for reading and writing a language, but the inclusion of audio adds an extra dimension of language learning to the dictionary experience. Illustrative examples in bilingual dictionaries can contribute to a user’s interest by showing the word in a live context, and enhance understanding of the grammatical and semantic rules governing the usage of the word by showing these rules in action. Pictorial illustrations can serve 2 purposes in a bilingual dictionary. Firstly, they can cue and reinforce verbal equivalents, especially when the user can identify with the picture. Secondly, they serve as generalizing examples when several different but relevant pictures are given in order to establish concepts.

-          Customization of user needs– the ability to choose what type of information is presented. Various users may use this dictionary. Electronic dictionaries should be able to cater for young students and also for specialists. To allow such compatibility in one program, the display should have the option of customization.  For example, hiding or displaying the pronunciation, etymologies, variant spellings, part of speech, bibliography, cross reference, examples and pictures.

-          Tree or graph displays. To demonstrate the relationship between words in the dictionary, the user can be given the option of displaying cross-references or special words such as synonyms in the form of a graph. The following is an example of a possible synonym graph.

 

 

Figure 15 Example graph layout of the synonyms for ‘gruff’

 

 

-          Use of colours. Various colours can be used to denote different attributes of an entry and can also provide links for cross-references.

-          Switching between languages. Because the application is a multilingual environment, the user should be able to switch between different language displays at any time. An entry that has been displayed in Japanese should allow switching to a German display. This can be useful in finding language equivalents for a particular word or phrase.

 

A sample mock-up of a possible interface for users has been created. The interface demonstrates some of the aspects highlighted earlier. A sidebar has been included to allow users to view other entries located around the target entry, in addition to providing multimedia in the form of pictures. Buttons or hyperlinks can be used to filter different displays for the user.

 

Figure 16 Sample interface for a multilingual dictionary

 

 

6.3 XSL

 

XML is a customizable markup language. In the case of JMdict, it is optimized to handle pure data. This data needs to be processed so that articles that are formatted can be displayed to the user. A style sheet is required to automatically convert the document from the abstraction (XML) to a formatted rendition. This stylesheet language is called XSL.

 

XSL (eXtensible Stylesheet Language) is a stylesheet language; a stylesheet is a template that describes how to present documents. It enables a document to be presented in a variety of ways, such as on a monitor, on paper or even as speech. It is used to apply style to XML documents. XSL is a declarative language. Each ‘rule’ element must have a target-element and an action. Each element in a document matches a single rule. The XSL processor looks at each element and applies the correct rule to it. XSL consists of two parts: 1) a language for transforming XML documents to other XML documents (XSLT), and 2) an XML vocabulary for expressing formatting semantics.
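The idea of rules that pair a target element with an action can be illustrated without an XSL processor. The sketch below is not XSL; it is a plain Python stand-in that uses the standard xml.etree.ElementTree module. The element names keb, reb and gloss, the simplified entry layout and the output formats are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Each rule pairs a target element name with an action that formats it.
# The names and formats here are assumptions for illustration only.
RULES = {
    "keb": lambda element: f"[{element.text}]",
    "reb": lambda element: f"({element.text})",
    "gloss": lambda element: f"- {element.text}",
}


def render(element, output):
    """Visit every element and apply the single rule that matches its name."""
    action = RULES.get(element.tag)
    if action is not None:
        output.append(action(element))
    for child in element:
        render(child, output)
    return output


# Simplified, made-up entry; a real JMdict entry has more nesting.
SAMPLE = "<entry><keb>水</keb><reb>みず</reb><sense><gloss>water</gloss></sense></entry>"
lines = render(ET.fromstring(SAMPLE), [])
```

Changing the presentation means editing only the rule table, which is the same separation of data from style that a stylesheet provides.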

 

The Extensible Stylesheet Language Transformations (XSLT) W3C recommendation describes a transformation vocabulary used to specify how to create new structured information from existing XML documents. XSLT implements transformation by example, not by program logic. Templates are created that tell the XSLT engine how to transform the XML document. There are instructions included in the template file for the XSLT engine to find the information in the XML file. XSLT may be useful if temporary XML files need to be created containing a certain list of entries for display to the user. This new XML file may be passed to the XSL processor to produce the desired output.

 

An XSL stylesheet consists of a set of construction rules that are defined for the conversion of an XML source tree into a new XML document that is expressed in a hierarchy of formatting objects called flow objects. This means that XSL stylesheets are XML documents. These flow objects describe exactly how the document should be printed. When using the XSL formatting objects to present information, the objective of the stylesheet is to transform the XML information into a hierarchy exclusively comprised of the XSL formatting object vocabulary. A rendering engine then takes this resultant hierarchy and interprets the semantics of the XSL vocabulary to produce the desired format.

 

XSL increases the usefulness of XML. It allows data to be reused, providing a standard style of presentation while also allowing customized presentations for different users. For example, given a product database, a user may wish to see the products that a company has in stock. Instead of building a whole new database for them, the existing factory database can be used with a stylesheet that permits only viewing of the database.

 

Figure 17 Diagram outlining the processes that take place to process an XML document for viewing.


 

Chapter 7

Conclusion

 

 

 

The motivation behind the research was to combine all the different research and information associated with creating a multilingual dictionary into one report. It has focused primarily on the handling of the XML-based JMdict file and its role in the multilingual dictionary context.

 

Review

 

This thesis has provided a platform for the analysis of a broad range of issues relating to electronic dictionaries, with the major focus on multilingual dictionaries. It has covered the operations required from a dictionary: the input of a query, searching for the entry and display of results. It is quite rare for the analysis of the multilingual problems and their possible solutions to be combined into one thesis report.

            Many XML tools have been constructed without a dictionary file taken into consideration, so the investigation of their applicability should allow others to benefit from the analysis. This thesis has been undertaken with the hope that future developers of a multilingual dictionary application will have a better understanding of the difficulties and possible solutions to some of the questions raised.

 

Based on the investigations carried out on the different storage and retrieval techniques, indexing the JMdict file currently appears to be the most effective option. Indexing allows multiple keys to be stored and also provides quick access that can be carried out in sequential order. The two XML APIs, DOM and SAX, are unable on their own to support large and complex XML structures such as JMdict. DOM simply cannot provide memory-efficient ways of representing entries as objects. SAX is event-driven, which restricts the range of searching and retrieval methods.

 

Restrictions and Future work

           

While the thesis has been a thorough review, there are many aspects that can be investigated further and implemented. Because the topic is so huge, a restriction on the research has been the time frame. It would have been advantageous to have begun an implementation of a multilingual dictionary. However, lack of information and tools specifically for multilingual dictionaries resulted in the investing of time in combining all the information into a single document. Despite this, there are still several gaps in the research and subsequently gaps in the thesis report.

           

A proper design and subsequent prototype can be created because the major issues have been identified and in some situations, suggestions are given. An important extension to this thesis would be to produce a proof-of-concept application. Such an application would combine the techniques outlined for each of the programming concepts required for each stage. A demonstration of the application is often a good way of proving that the suggested techniques work well in the environment.

           

Input techniques such as those described in Chapter 3 are so diverse amongst different languages that investigation into the most appropriate techniques to use could result in a completely different thesis report. Japanese Kanji and Kana input would need analysis along with the traditional input methods. As more different languages are included into the XML JMdict, further research would be required to establish input techniques.

           

Studying the JMdict file itself was touched upon only briefly in this paper. However, a further in depth look could be taken. If the JMdict were to be altered structurally, the impact on internal representation of the file could be examined, identifying the advantages and disadvantages of different changes. An example of a change could be the impact of converting some of the data into attributes. In addition, further discussion on the drawbacks of XML and the JMdict could be carried out along with possible solutions.

           

Investigating how well users learn when given different user interfaces would be an interesting future research topic. From this form of investigation, it may be possible to determine the best ways to present information to the user and to identify which types of interface are easiest for users to navigate.

 


Appendices

 

Appendix A

 

 

Methods of storage for electronic dictionary information

 

There have been a variety of different types of electronic dictionaries developed around the world. A quick search from a search engine will bring up a host of electronic dictionary web sites. Of notable interest is the Oxford English Dictionary [OED, 2000]. This dictionary is the largest English dictionary in the world, containing 20 volumes in its paperback version. The OED is stored in flat files with SGML markup (Standard Generalised Markup Language). The following is a table describing the type of data available per entry that a user can access. The OED contains the most information per entry pertaining to a single language.

 

Headword section: status symbol; headword; pronunciation; part of speech; homonym number; label; variant forms; etymology and etymological note

Sense section: status symbol; sense number; label; definition and definition note; quotation paragraph; date of publication; author; title; text of quotation; compounds; derivatives

Special types of main entries: letters of the alphabet; initialisms, acronyms, abbreviations; affixes and combining forms; proper names; erroneous, spurious, or ghost words; lengthy entries

Cross-reference entries

 

 

Table A1. Illustration of the structure and data contained in each entry [OED, 2000]

 

EDICT

 

            EDICT is an electronic dictionary file containing Japanese entries with translations to English created by Breen [2000 (a)]. Initially created as a file for a piece of software called Moke [Breen, 2000 (1)], the dictionary size has grown to over 100,000 entries. The entries are very simple (lacking structure) and as the following sample of EDICT entries demonstrates, only a headword and definition are included per entry:

 

Figure A18 Sample of the EDICT dictionary file

 

The format of the entries is as follows:

Kanji entry (if any) [kana entry] / English definition/
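A minimal sketch of reading one line in this format is given below. The regular expression, the function name and the returned dictionary layout are assumptions for illustration. The sketch also assumes that an entry without kanji is written as a bare kana headword with no square brackets, and that several definitions may follow one another separated by ‘/’.

```python
import re

# headword, optional "[kana]", then one or more "/"-delimited definitions
EDICT_LINE = re.compile(
    r"^(?P<head>\S+)(?:\s+\[(?P<kana>[^\]]+)\])?\s*/(?P<defs>.*)/\s*$"
)


def parse_edict_line(line):
    """Split one EDICT-style line into kanji, kana and a list of definitions."""
    match = EDICT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an EDICT-style line: {line!r}")
    head = match.group("head")
    kana = match.group("kana")
    definitions = [d.strip() for d in match.group("defs").split("/") if d.strip()]
    if kana is None:
        # No bracketed reading: the headword itself is the kana form.
        return {"kanji": None, "kana": head, "definitions": definitions}
    return {"kanji": head, "kana": kana, "definitions": definitions}


# Made-up sample lines written to match the format above.
with_kanji = parse_edict_line("水 [みず] /water/")
kana_only = parse_edict_line("はい /yes/")
```

The flat result shows how little structure an EDICT entry carries: there is no place for part of speech, cross-references or a second target language without overloading the definition text.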

 

 

KEBI

 

KEBI stands for Kamus Elektronik Bahasa Indonesia, which means the Indonesian Electronic Dictionary. KEBI is developed as part of the Multilingual Machine Translation System Project. It contains 22,500 root word entries, covering a total of 43,500 derivation words. The dictionary structure consists of the following information:

-          Morphological information: consisting of suffixes and prefixes that are tagged onto root words to form derivation words

-          Syntactic information: consists of parts of speech

-          Semantic information: consists of semantic category of a word.

-          Concept description: contains word meaning which is described in English.

 

The following figure illustrates the results of a search in Indonesian

Figure A19 Output display of KEBI system.

 

Format of entries:

Head Word Information

Morphological information

Syntactic Information

Semantic Information

Concept Information

 

EDR

 

The Japanese EDR (Electronic Dictionary Research) dictionary was a project funded by 8 Japanese electronics companies: Fujitsu, NEC, Hitachi, Sharp, Toshiba, Oki, Mitsubishi and Matsushita. It was developed for advanced processing of natural language by computers. It is made up of eleven sub-dictionaries. The sub-dictionaries include a concept dictionary, word dictionaries and bilingual dictionaries. Altogether there are tens of thousands of entries. Of great interest is the bilingual dictionary. This dictionary consists of an English to Japanese and a Japanese to English file. The files list the correspondences between headwords in the different languages. The Japanese-English bilingual dictionary contains 230,000 words, and the English-Japanese bilingual dictionary contains 190,000 words. The following figure provides the structure of an entry in the bilingual dictionary.

 

Each record consists of

-          Entry information

-          Grammatical information

-          Semantic information

-          Bilingual information: consists of correspondence word information (equivalent, paraphrase, direct translation, romanization/katakana, and explanation)

-          Part of speech

 

 

Figure A20 Diagram displaying the structure of Japanese/English entry structure in EDR [EDR, 2000]

 


Appendix B

 

 

Text representation

 

Japanese text representation

 

Asian character representation on computers, Japanese in particular, is unusual in that the characters do not resemble or fit into the traditional European displays. In addition to the great differences in representation, they require special handling when displayed on Western computers. There used to be no universally recognized character set for Japanese. Beyond the representation standards, there are many differences between Western and Asian languages in the actual display and use of the language.

 

The Japanese language is not composed of a single character set; it is made up of three different types of characters: Hiragana (native Japanese words), Katakana (Japanese representation of non-Japanese words) and Kanji (Chinese characters). The need for Japanese representation resulted in the creation of a number of different character representation standards.

           

Half-width katakana was the very first attempt at Japanese character representation on computers. It could be displayed relatively easily on Western computers because its characters occupy the same space as ASCII characters.

 

JIS X 0208:1997 has been the most widely used of the Japanese electronic character sets. It was created in 1978 and was to become the very first Japanese coded character set to include Chinese characters.

           

After the release of JIS X 0208-1990 (an extension to JIS X 0208:1997 character set), Ken Lunde released a document called ‘Japan.inf, Electronic handling of Japanese Text’. This article informed users on how to handle Japanese on a variety of platforms.

 

Encoding is the mapping of a numeric value to a character. Across the different Asian character sets, the main encoding methods are EUC (Extended Unix Code) and ISO-2022-JP. There is also a Japan-specific encoding called Shift-JIS. Each of these encoding standards supports one or several of the different types of Japanese text representation.
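The effect of choosing one encoding over another can be seen by encoding the same text several ways. The sketch below uses Python's built-in codecs euc_jp, iso2022_jp, shift_jis and utf-8; the sample string and the function name are assumptions for illustration.

```python
# Made-up sample text: three kanji characters.
text = "日本語"


def encoded_sizes(value, encodings=("euc_jp", "iso2022_jp", "shift_jis", "utf-8")):
    """Return the number of bytes the text occupies under each encoding."""
    return {name: len(value.encode(name)) for name in encodings}


sizes = encoded_sizes(text)

# Decoding with the same codec must give back the original characters.
round_trips = all(text.encode(name).decode(name) == text for name in sizes)

# ISO-2022-JP switches character sets with escape sequences (ESC = 0x1B).
uses_escape = text.encode("iso2022_jp").startswith(b"\x1b")
```

The same three characters produce different byte sequences and lengths under each encoding, which is why a file must be read with the encoding it was written in.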

 

Chinese and Korean text representation

 

Chinese characters, otherwise known as Hanzi, are the most complex type of Asian character set. In comparison with the English language that has 26 unique letters, the Chinese character spectrum is huge, containing over ten thousand characters from different regions. Therefore, the encoding and representation standards for Chinese can become rather difficult. To complicate matters even further, there are traditional and simplified representations of the same characters that must be handled.

           

In China, the Chinese character set standard is GB or ‘guo biao’ which stands for national standard. This standard enumerates several thousand Chinese characters. Numerous extensions and corrections to this standard have been carried out since its introduction in 1981. In Taiwan, a country that does not use simplified characters, the character set standard used is called Big Five. It is an unofficial standard, but has been adopted and widely used by the Taiwanese people. Established in 1984, it has the capacity to store fourteen thousand characters. CNS, another character set standard, is seen to be a corrected and updated version of Big Five. It has the largest capacity, enumerating 48,000 Hanzi. [Lunde, 1999]

 

Korean is represented using Hangul characters. These are a totally different set of characters in comparison with Kanji and Han characters. The character enumeration standard KS X was established in 1992 by South Korea. This standard contains almost five thousand entries. North Korea also developed their own character standard in 1997 called KPS that enumerates over 8,000 characters. [Lunde, 1999]


Appendix C

 

Hash Tables

 

A hash table is a form of data structure that can offer very fast insertion and searching. No matter how many data items there are, insertion and searching can take close to constant time. Hash tables rely on the concept of having a range of keys that are transformed into a range of array index values. In some cases, the index number can map directly onto an array, removing the need to transform index to array number.

 

It may seem straightforward to simply map each character to a unique number and create an index file in this manner. However, after a short consideration, this plan has many flaws: many words would share the same index, because the available combinations of numbers cannot cover the number of words in a dictionary. Another possible way of giving each entry a unique index is to give each character of a word a weight 10 times as big as the position to its right. For example:

If ‘a’ = 1, ‘b’ = 2, ‘c’ = 3, ‘d’ = 4, ‘e’ = 5

Then the word ‘ace’ would be indexed as the number: 1*10^2 + 3*10^1 + 5*10^0 = 135

This poses another set of problems: a long word such as ‘encyclopaedia’, which has 13 letters, would result in an extremely large index number, of the order of 10^12, requiring far more space in an array than is needed. A method is therefore required to store this range of numbers in a reasonably sized array. A hash function converts a number in a large range into a number in a smaller range. A simple hash function takes the remainder after dividing the large number by the array size, which reduces the range of numbers to fit the size of the array.
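The positional scheme and the remainder-based hash function can be sketched as follows. The function names and the table size of 100 are assumptions for illustration; the letter values follow the ‘a’ = 1, ‘b’ = 2 convention of the example.

```python
def letter_value(ch):
    """'a' -> 1, 'b' -> 2, ... following the convention in the example."""
    return ord(ch) - ord("a") + 1


def word_number(word):
    """Weight each letter 10 times more than the letter to its right."""
    number = 0
    for ch in word:
        number = number * 10 + letter_value(ch)
    return number


def hash_index(word, table_size):
    """Fold the large number into the array range with the remainder."""
    return word_number(word) % table_size


ace_number = word_number("ace")      # 1*10^2 + 3*10^1 + 5*10^0
ace_slot = hash_index("ace", 100)    # remainder after dividing by the array size
long_number = word_number("encyclopaedia")
```

The number for a 13-letter word is far larger than any practical array, while the remainder always falls between 0 and the array size minus one. The price is that different words can now share a slot.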

 

When this form of hashing is used, different words, such as ‘bat’ and ‘tab’, can hash to the same index. Several different words will therefore inevitably be hashed into the same array location. This is called a collision. There are techniques for resolving collisions. Open addressing searches the array sequentially for the next empty cell and places the colliding item there. A problem with open addressing is that when many entries share the same index number (which there undoubtedly will be for an English dictionary), clustering occurs: filled sequences of array cells grow ever longer. An alternative is to create a linked list at each index entry, so that items with matching index numbers are linked together, anchored at the correct slot in the array. This technique is called separate chaining. An issue with separate chaining is that when many entries share the same index number, searching takes longer. The following graphs compare successful and unsuccessful searches using the open addressing and separate chaining techniques.

Figure C1 Graph of the performance of searching using open addressing

 

Figure C2 Graph of the performance of separate chaining.
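The two collision strategies discussed above can be sketched side by side in Java. This is a minimal illustration rather than the implementation used in the project: the class names, the table size of 11 and the toy hash function (the sum of the letter values modulo the table size, for lower-case words) are all assumptions chosen so that ‘bat’ and ‘tab’ collide.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class CollisionDemo {
    // Hypothetical toy hash: sum of letter values ('a' = 1 ... 'z' = 26) modulo the table size.
    static int hash(String word, int size) {
        int sum = 0;
        for (char c : word.toCharArray()) sum += c - 'a' + 1;
        return sum % size;
    }

    // Open addressing: on a collision, step sequentially to the next empty cell.
    static class ProbingTable {
        final String[] slots;
        ProbingTable(int size) { slots = new String[size]; }

        // Returns the cell actually used, or -1 if the table is full.
        int insert(String word) {
            int start = hash(word, slots.length);
            for (int step = 0; step < slots.length; step++) {
                int i = (start + step) % slots.length;
                if (slots[i] == null || slots[i].equals(word)) {
                    slots[i] = word;
                    return i;
                }
            }
            return -1;
        }

        boolean contains(String word) {
            int start = hash(word, slots.length);
            for (int step = 0; step < slots.length; step++) {
                int i = (start + step) % slots.length;
                if (slots[i] == null) return false;     // an empty cell ends the probe sequence
                if (slots[i].equals(word)) return true;
            }
            return false;                               // full table: every cell was examined
        }
    }

    // Separate chaining: each array position anchors a linked list of colliding words.
    static class ChainedTable {
        final List<LinkedList<String>> chains = new ArrayList<>();
        ChainedTable(int size) {
            for (int i = 0; i < size; i++) chains.add(new LinkedList<>());
        }
        void insert(String word) {
            LinkedList<String> chain = chains.get(hash(word, chains.size()));
            if (!chain.contains(word)) chain.add(word);
        }
        boolean contains(String word) {
            return chains.get(hash(word, chains.size())).contains(word);
        }
        int chainLength(int index) { return chains.get(index).size(); }
    }

    public static void main(String[] args) {
        ProbingTable probing = new ProbingTable(11);
        System.out.println(probing.insert("bat"));      // 1
        System.out.println(probing.insert("tab"));      // 2: cell 1 is taken, so the next cell is used

        ChainedTable chained = new ChainedTable(11);
        chained.insert("bat");
        chained.insert("tab");
        System.out.println(chained.chainLength(1));     // 2: both words hang off the same cell
    }
}
```

With open addressing the second word spills into a neighbouring cell, which is how clusters start to form. With separate chaining the array stays sparse, but the list at the shared cell grows.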

 

In the case of JMdict, the dictionary keys are not well behaved; they are words of different lengths. A hash table, if the dictionary can fit into computer memory, is a good choice solely because it can be accessed quickly. There are currently approximately 500,000 entries in JMdict. To access a word through the hash table, the word must first be converted into an index number by the hash function. The array would need not just 500,000 cells but at least twice that number, because a hash table that is completely full decreases the efficiency of searches.
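The effect of table size on search cost can be estimated with a short calculation. The sketch below assumes open addressing with linear probing and uses the standard approximations for the expected number of probes, usually attributed to Knuth: $\frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)$ for a successful search and $\frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)$ for an unsuccessful one, where $\alpha$ is the load factor (entries divided by array cells). The class name and the second table size are illustrative assumptions.

```java
public class LoadFactor {
    // Expected probes for a successful search with linear probing at load factor alpha (alpha < 1).
    static double probingSuccessful(double alpha) {
        return 0.5 * (1 + 1 / (1 - alpha));
    }

    // Expected probes for an unsuccessful search with linear probing at load factor alpha (alpha < 1).
    static double probingUnsuccessful(double alpha) {
        return 0.5 * (1 + 1 / ((1 - alpha) * (1 - alpha)));
    }

    public static void main(String[] args) {
        int entries = 500_000;                   // approximate JMdict size quoted in the text
        int[] tableSizes = {1_000_000, 550_000}; // twice the entries, and a nearly full table
        for (int size : tableSizes) {
            double alpha = (double) entries / size;
            System.out.printf("cells=%d load=%.2f successful=%.2f unsuccessful=%.2f%n",
                    size, alpha, probingSuccessful(alpha), probingUnsuccessful(alpha));
        }
    }
}
```

At a load factor of 0.5 (an array twice the number of entries) these approximations give 1.5 probes for a successful search and 2.5 for an unsuccessful one. As the load factor approaches 1 the estimates grow without bound, which is the reason for leaving the table half empty.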

 

Hash tables have several disadvantages. They are based on arrays, and arrays are difficult to expand once they have been created. For some hash tables, search performance degrades rapidly as the table becomes too full; with open addressing, an unsuccessful search of a completely full table results in a linear search of the whole table. Hash tables are therefore poorly suited to growing data sets. JMdict is sure to grow in the future, which would mean reworking the hash function and increasing the size of the hash table. The hash table would also be indexed on one value only. This restricts the flexibility of the system because it allows only one specific type of search to be carried out. Multiple hash tables would be required for different languages, and they would not all fit into memory.

There is no convenient way to visit the items in a hash table in any kind of order (such as alphabetical order, or all words ending with ‘less’). This is an important drawback. Many searches will look for one particular entry, but a user may also want to view a list of words in alphabetical order (as in the traditional format of a dictionary). When this happens, it is very inconvenient to go through the hash table to retrieve the entries in alphabetical order. As Lafore suggests, if ordered searching is required, a different data structure should be sought [Lafore, 1998]. A hash table implementation is therefore best suited to data that needs no ordered visitation of items and whose size can be predicted accurately; the multilingual dictionary has neither attribute.
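The ordering drawback can be illustrated with the Java collections library. In the sketch below, which is an illustration and not part of the system described in this thesis, a sorted map (`java.util.TreeMap`) returns an alphabetical range of headwords directly, whereas the hash table has to be scanned in full and the matches sorted afterwards. The class name, the sample headwords and the placeholder glosses are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OrderedLookup {
    // Hypothetical sample entries; the values stand in for real dictionary glosses.
    static TreeMap<String, String> sample() {
        TreeMap<String, String> dict = new TreeMap<>();
        for (String w : new String[] {"harmless", "cat", "careless", "care", "careful"}) {
            dict.put(w, "(gloss for " + w + ")");
        }
        return dict;
    }

    // Sorted map: all headwords starting with the prefix, already in alphabetical order.
    // prefix + '\uffff' serves as an upper bound for keys that begin with the prefix.
    static List<String> withPrefix(TreeMap<String, String> dict, String prefix) {
        return new ArrayList<>(
                dict.subMap(prefix, true, prefix + Character.MAX_VALUE, true).keySet());
    }

    // Hash table: every entry must be examined, and the matches sorted afterwards.
    static List<String> withSuffix(Map<String, String> dict, String suffix) {
        List<String> matches = new ArrayList<>();
        for (String word : dict.keySet()) {
            if (word.endsWith(suffix)) matches.add(word);
        }
        Collections.sort(matches);
        return matches;
    }

    public static void main(String[] args) {
        TreeMap<String, String> sorted = sample();
        Map<String, String> hashed = new HashMap<>(sorted);
        System.out.println(withPrefix(sorted, "care"));   // [care, careful, careless]
        System.out.println(withSuffix(hashed, "less"));   // [careless, harmless]
    }
}
```

The range query touches only the matching part of the sorted map, while the suffix search over the hash table visits every entry. A suffix search would be just as slow on the sorted map unless the words were also indexed in reversed form.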

 


Bibliography

 

 

 

 

[Al-Kasimi, 1977] Al-Kasimi, A. (1977) Linguistics and Bilingual Dictionaries. E.J. Brill, Leiden.

 

[Atkins, 1985] Atkins, B. ‘Monolingual and Bilingual Learners’ Dictionaries: A Comparison’ Dictionaries, Lexicography and Language Learning, Pergamon Press, 1985. Pages 15-24

 

[Ballesteros, 1996] Ballesteros, L and Croft, W.B. (1996) Dictionary methods for cross-lingual information retrieval. Proceedings of the 7th International DEXA Conference on Database and Expert Systems. Pages 791-801

 

[Brajnik et al., 1996] Brajnik, G. and Mizzaro, S. and Tasso, C. (1996) Evaluating User Interfaces to Information Retrieval Systems: A Case Study on User Support. SIGIR ’96 Zurich, Switzerland. Pages 128-136

 

[Beryl, 1985] Beryl T. A. ‘Monolingual and bilingual learners’ dictionaries: a comparison’, Collins Publishers. Brumfit, C. Dictionaries, Lexicography and Language Learning. Pergamon Institute of English (Oxford), 1985

 

[Breen, 2000 (a)] Breen, J.W. (2000)  E D I C T JAPANESE/ENGLISH DICTIONARY FILE

 

[Breen, 2000 (b)] Breen, J.W. (2000) A WWW Japanese Dictionary. Japanese Studies Centre Symposium, Melbourne, Australia.

 

[Breen, 1995] Breen, J.W. (1995) Building an Electronic Japanese-English Dictionary. Japanese Studies Association of Australia Conference, July 1995, Brisbane, Australia.

 

[Callan] Callan, J., Croft, B. and Harding, S. The INQUERY Retrieval System. Proceedings of the 3rd International Conference on Database and Expert Systems

 

[Ceponkus & Hoodbhoy, 1999] Ceponkus, A., Hoodbhoy, F. Applied XML. A toolkit for programmers. Wiley Publishing, 1999

 

[Croft et. al] Croft, B., Broglio, J. and Fujii, H. Applications of Multilingual Text Retrieval, Proceedings of the 29th Annual Hawaii International Conference on System Sciences, Pages 98 - 107

 

[Croft, 1995] Croft, W. and Xu, J. (1995) Corpus-Specific Stemming using Word Form Co-occurrence. Proceedings for the Fourth Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada Pages 147-159

 

[Davis, 1998] Davis, M. (1998) Free resources and advanced alignment for cross-language text retrieval. Proceedings of the Sixth Text Retrieval Conference (TREC-6), Gaithersburg, MD: National Institute of Standards and Technology (NIST).

 

[EDR, 2000] Japanese Electronic Dictionary Research Institute: http://www.iijnet.or.jp/edr/ (2000)

 

[Eichmann et al., 1998] Eichmann, D. and Ruiz, M.E. and Srinivasan P. (1998) Cross-Language Information Retrieval with the UMLS Metathesaurus. SIGIR ’98 Melbourne Australia. Pages 72-80.

 

[Erbach, 1997] Erbach, G., Neumann, G., Uszkoreit, H. MULINEX, Multilingual Indexing, Navigation and Editing Extensions for the World Wide Web. Project Note at AAAI Symposium on Cross-Language Text and Speech Retrieval, Stanford, 1997

 

[Fujii, 1993] Fujii, H., Croft, W. (1993) A Comparison of Indexing Techniques for Japanese Text Retrieval. Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval. Pittsburgh, USA

 

[Goldfarb, 1998] Goldfarb, C. F. and Prescod, P (1998) The XML Handbook. Prentice Hall

 

[Han et. al, 1994] Han, C., Fujii, H. and Croft, W. (1994) Automatic Query Expansion for Japanese Text Retrieval. Technical report, Department of Computer Science, University of Massachusetts, Amherst

 

[Hartmann, 1983] Hartmann, R. (1983) Lexicography: Principles and Practice. Academic Press

 

[Hlava et al., 1997] Hlava, M., Belonogov, G., Kuznetsov, B., Hainebach, R. (1997) Cross Language Retrieval – English/Russian/French. American Association for Artificial Intelligence, Spring Symposium Series, 1997

 

[Hull and Grefensette, 1996] Hull, D.A. and Grefenstette, G. (1996) Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval. Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval. 1996, Zurich, Switzerland.

 

[Hunter & McLaughlin, 2000] Hunter, J and McLaughlin, B, JDOM Introduction: http://javaworld.com/javaworld/jw-05-2000/jw-0518-jdom.html (2000)

 

[IME, 2000] Microsoft Global IME: http://www.microsoft.com/Windows/ie/Features/ime.asp (2000)

 

[JDOM, 2000] JDOM: www.jdom.org (2000)

 

[Jones, 1999] Jones, G et al. (1999) A Comparison of Query Translation Methods for English-Japanese Cross-Language Information Retrieval. SIGIR ’99 Berkeley, CA, USA. Pages 269-270

 

[KEBI, 2000] KEBI Online - The Indonesian Electronic Dictionary Online: http://nlp.aia.bppt.go.id/kebi/ (2000)

 

[Knuth, 1993] Knuth, D. The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley, 1993

 

[Kwok, 1997] Kwok, K.L (1997) Evaluation of an English-Chinese Cross-Lingual Retrieval Experiment. AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, 1997

 

[Laddad, 2000] Laddad, R. (2000). XML APIs for databases.

At: http://www-4.ibm.com/software/developer/library/jw-xmlapis (2000)

 

[Lafore, 1998] Lafore, R. Data structures and algorithms in Java, Waite Group Press, 1998

 

[Levenstein, 1966] Levenstein, V.I. (1966) Binary codes capable of correcting deletions, insertions and reversals. Cybernet. Control Theor. Pages 707-710

 

[Landau, 1984] Landau, S. (1984) Dictionaries: The Art and Craft of Lexicography. The Scribner Press, Charles Scribner’s Sons, New York

 

[Leventhal, 1998] Leventhal, M. Lewis, D. Fuchs, M. Designing XML Internet Applications, Prentice Hall PTR

 

[Lunde, 1999]  Lunde, K (1999). CJKV Information Processing. O’Reilly Publishing

 

[Mair and Liu, 1991] Mair, V.H. and Liu, Y. (1991) Characters and Computers. IOS Press

 

[Maruyama et. al, 1999] Maruyama, H., Tamura, K., Uramoto, N. (1999) XML and Java. Developing Web Applications. Addison-Wesley

 

[OED, 2000] The Oxford English Dictionary On-line: http://www.oed.com (2000)

 

[Oard and Dorr, 1996] Oard, D and Dorr, B. (1996) A Survey of Multilingual Text Retrieval, Technical Report. UMIACS-TR-96-19, University of Maryland, Institute for Advanced Computer Studies.

 

[Pirkola, 1998] Pirkola, A. (1998) The Effects of Query Structure and Dictionary Setups in Dictionary-Based Cross-language Information Retrieval. Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval. 1998, Melbourne, Australia. Pages 55-63

 

[PlumbDesign, 2000] PlumbDesign, ThinkMap Visual Thesaurus.

At: http://www.plumbdesign.com/thesaurus (2000)

 

[Porter, 1980] Porter, M.F. (1980) An algorithm for suffix stripping. Program, Vol. 14, no. 3, July 1980. Pages 130-137

 

[Sebrechts et. al, 1999] Sebrechts, M., Vasilakis, J., Miller, M., Cugini, J., Laskowski, S. (1999) Visualisation of Search Results: A Comparative Evaluation of Text, 2D and 3D Interfaces. Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval. 1999, Berkeley, CA, USA

 

[St. Laurent & Cerami, 1999] St. Laurent, S. and Cerami, E. Building XML Applications, McGraw Hill

 

[Sundsted, 2000] Sundsted, T. Adelard, one year later: http://www.javaworld.com (2000)

 

[The Unicode Consortium, 1996] The Unicode Consortium. (1996) The Unicode Standard Version 2.0. Addison-Wesley Developers Press

 

[Veerasamy and Belkin 1996] Veerasamy, A., Belkin, N. Evaluation of a Tool for Visualisation of Information Retrieval Results. Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval. 1996, Zurich Switzerland. Pages 85-92

 

[XSLT, 2000] http://www.xslt.com/what_is.htm (2000)

 

[XML.com, 2000] Technical Introduction to XML, At: http://www.XML.com (2000)

 

[XML QE, 2000] XML Query Engine: http://www.fatdog.com/#30000 (2000)

 

[Yamabana et al. 1996] Yamabana, K. and Muraki, K. and Doi, S. and Kamei, S. (1996). A Language Conversion Front-end for Cross-Linguistic Information Retrieval. Working notes of the Workshop on Cross-Linguistic Information Retrieval, ACM SIGIR, Zurich, Switzerland.

 

 

 

 

 

 

 

 



[1] The Oxford English Dictionary, Second Edition, Volume IV

[2] A dictionary is a book that lists words in alphabetical order and describes their meanings. Dictionaries include information such as spelling, syllabication, pronunciation and etymology (word derivation). An encyclopedia is a collection of articles about every branch of knowledge. Encyclopedias often include definitions, and go far beyond the information given by a dictionary [Landau, 1984]. This division probably arose because a dictionary, where only definitions were required, could not store such diverse information about a topic. With the electronic medium, however, this border is likely to be blurred.