Identification of Neologisms in Japanese by Corpus Analysis

This paper describes recent work to extend some techniques reported earlier to identify and extract neologisms from Japanese texts [2,3,7]. The purpose of the work is to extend the recorded lexicon of Japanese, in both free and commercial dictionaries.

Despite having a rich lexicon, the Japanese language has a noted tendency to adopt and create new words [4,6]. While the reasons for adopting new words are varied, there are a number of processes associated with the Japanese language which tend to encourage neologism creation:

  1. the readiness to accept loanwords. Unlike some countries, which attempt to restrict loanword usage, Japan has placed no formal restriction on their use. Estimates of the number of loanwords used in Japanese range as high as 80,000. Most of these words have been borrowed directly from English; however, a significant number, known as wasei eigo (Japanese-made English), have been assembled from English words or word fragments.
  2. the accepted morphological process of creating words by combining two or more kanji (Chinese characters) chosen for their semantic properties. This process was used extensively in the mid-19th century when Japan re-engaged with the rest of the world and needed an expanded lexicon to handle the technological, cultural and other information flowing into the country, and it has continued since. A broadly similar process is used to create compound verbs.
  3. the tendency to create abbreviations, particularly from compound nouns and long loanwords. For example, the formal term for "student discount" in Japanese is gakusei waribiki (学生割引), but the common term is gakuwari (学割), formed from the first kanji in each of the two constituent nouns. A similar process is applied to loanwords, resulting in words such as sekuhara (セクハラ) for "sexual harassment" (a contraction of sekushuaru harasumento).
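
The abbreviation process in item 3 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names are hypothetical, the kanji rule takes the first character of each constituent, and the loanword rule takes the first two morae of each constituent. Real abbreviations do not always follow these rules, so the output should be treated as a candidate rather than a confirmed word.

```python
# Small katakana that attach to the preceding character to form one mora.
SMALL_KANA = set("ァィゥェォャュョヮ")


def split_morae(kana):
    """Split a katakana string into morae (simplified: small kana attach left)."""
    morae = []
    for ch in kana:
        if ch in SMALL_KANA and morae:
            morae[-1] += ch
        else:
            morae.append(ch)
    return morae


def kanji_abbreviation(components):
    """Candidate abbreviation: first character of each constituent noun."""
    return "".join(c[0] for c in components)


def loanword_abbreviation(components, morae_per_part=2):
    """Candidate abbreviation: leading morae of each constituent loanword."""
    return "".join("".join(split_morae(c)[:morae_per_part]) for c in components)


print(kanji_abbreviation(["学生", "割引"]))
print(loanword_abbreviation(["セクシュアル", "ハラスメント"]))
```

Applied to the two examples in the text, the sketch yields 学割 and セクハラ respectively.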

Many neologisms eventually find their way into published dictionaries, and there are several special neologism dictionaries (shingo jiten, gendaiyôgo jiten). However, many abbreviations, compound verbs and loanwords are less well lexicalized, as native speakers can usually recognize them as such and recognize their pronunciation and meaning.

Traditional techniques for identifying neologisms involve extracting lexemes and comparing them with a lexical database. This process can have problems in Japanese as the orthography does not use any separators between words. Segmentation software for Japanese typically uses extensive lexicons to enable word segments to be identified, and usually outputs unassociated strings of characters when words are encountered which are not in its lexicon. Some work has been carried out on reconstructing these "unknown words", but usually in the context of part-of-speech tagging and dependency analysis [1,9,11].

In the work described in this paper several broad techniques are used:

  1. drawing on the fact that loanwords in Japanese are written in the katakana syllabary, thus enabling relatively straightforward extraction and comparison. Some processing is needed to separate out other classes of words, such as the scientific names of flora and fauna, which are also traditionally written using katakana [5].
  2. mimicking the morphological process for forming abbreviations to construct potential abbreviations from known compound nouns. The candidate is then checked in a WWW-based corpus[8] or via a WWW search engine to determine whether the potential abbreviation is used enough to warrant closer inspection.
  3. a similar mimicking of the morphological process for forming compound verbs, also examining a WWW corpus to determine whether the potential verb is in regular use.
  4. conducting post-analysis of the output of a morphological analyzer to detect when unknown words have been encountered. In such cases the analyzers usually just produce a string of kanji until they can resynchronize. These strings need careful analysis as Japanese is an agglutinative language which makes considerable use of single-character affixes. Potential new words identified automatically by this method are then checked in a WWW corpus to determine whether they are used elsewhere, and sample passages are collected to extract the meanings.
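
As an illustration of the first technique, the following Python sketch extracts runs of katakana from a text and keeps those not found in a known lexicon. It is a minimal sketch, not the implementation used in this work: the function name, the `known_words` set and the `min_count` threshold are assumptions made for the example, and no attempt is made to filter out flora and fauna names.

```python
import re
from collections import Counter

# Katakana letters (U+30A1 to U+30FA) plus the prolonged sound mark (U+30FC).
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")


def katakana_candidates(text, known_words, min_count=1):
    """Return katakana runs absent from known_words, with their frequencies."""
    counts = Counter(KATAKANA_RUN.findall(text))
    return {
        word: n
        for word, n in counts.items()
        if word not in known_words and n >= min_count
    }


sample = "新しいスマホとパソコンを買った。スマホは便利だ。"
print(katakana_candidates(sample, known_words={"パソコン"}))
```

In practice the frequency threshold would be applied to counts from a large corpus rather than a single passage, so that only candidates in regular use are passed on for closer inspection.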

Two other aspects of Japanese neologisms also need to be determined: the pronunciation and the meaning.

In the case of loanwords written in katakana the pronunciation is clear from the syllabic text. It is also clear in compound verbs, where the pronunciation of the component verb roots is unchanged. For the kanji compounds the pronunciation is less clear, as many characters can have multiple pronunciations, and the voicing may change on some non-initial consonants (rendaku). As writers of newspaper articles and similar texts will often follow new or rare words with the pronunciation in parentheses, candidate pronunciations are generated and the texts examined for possible confirmation.
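
The generation and confirmation of candidate pronunciations might be sketched as follows. The per-character reading table is toy data supplied for the example, the function names are hypothetical, and rendaku is modelled crudely by optionally voicing the first kana of every non-initial reading. The sketch then looks for the word followed by a candidate reading in parentheses.

```python
import re
from itertools import product

# Unvoiced kana and their voiced counterparts, used to model rendaku.
RENDAKU = str.maketrans(
    "かきくけこさしすせそたちつてとはひふへほ",
    "がぎぐげござじずぜぞだぢづでどばびぶべぼ",
)


def voiced(reading):
    """Voice the first kana of a reading, if it has a voiced form."""
    return reading[0].translate(RENDAKU) + reading[1:]


def candidate_readings(word, readings):
    """Combine per-character readings, adding voiced forms for non-initial ones."""
    per_char = []
    for i, ch in enumerate(word):
        options = list(readings[ch])
        if i > 0:
            options += [voiced(r) for r in readings[ch] if voiced(r) != r]
        per_char.append(options)
    return ["".join(parts) for parts in product(*per_char)]


def confirmed_readings(word, readings, text):
    """Keep candidates that appear in parentheses directly after the word."""
    found = []
    for cand in candidate_readings(word, readings):
        pattern = re.escape(word) + "[((]" + re.escape(cand) + "[))]"
        if re.search(pattern, text):
            found.append(cand)
    return found


toy_readings = {"手": ["て", "しゅ"], "紙": ["かみ", "し"]}  # illustrative only
print(confirmed_readings("手紙", toy_readings, "彼は手紙(てがみ)を書いた。"))
```

Here the voiced candidate てがみ is the one confirmed by the parenthesized reading in the sample sentence.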

The meanings of neologisms which are loanwords or abbreviations can usually be reliably derived from the source words or compounds; however, care is needed in the case of loanwords, as a high proportion have nuances which differ from the original. In the case of other neologisms, an initial presumption about the meaning can usually be made based on the meanings of the kanji used, but it is important to examine the text passages in which the word appears to verify or determine the meaning [10]. Often the arrival of a neologism will result in discussion about it in online forums and articles, and by searching for the language patterns used in such discussions it is often possible to isolate definitions. For example, in discussion of the meaning of a word, the word in question is often followed by the particle pair towa (とは) (as for this word/passage/etc.), thus providing an identifiable text pattern to use in searches. There is scope for training machine-learning systems with such explanatory passages.
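
A search for the towa pattern can be sketched with a simple regular expression. The function name is hypothetical, and the sketch assumes that a definition runs from the word plus とは to the next full stop (。), which will not hold for every passage.

```python
import re


def find_definition_passages(word, text):
    """Return sentences in which the word is directly followed by とは."""
    pattern = re.compile(re.escape(word) + r"とは[^。]*。")
    return pattern.findall(text)


sample = "今日は晴れ。学割とは、学生割引の略である。学割を使った。"
print(find_definition_passages("学割", sample))
```

Passages retrieved in this way still need manual or automated inspection, since とは also occurs in contexts that are not definitions.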

References

  1. Masayuki Asahara and Yuji Matsumoto, Japanese Unknown Word Identification by Character-based Chunking, COLING 2004, Geneva.

  2. James Breen, Expanding the Lexicon: Harvesting Neologisms in Japanese, Papillon (Multi-lingual Dictionary) Project Workshop, Chiang Rai, Thailand, 2005.

  3. James Breen, Expanding the Lexicon: the Search for Abbreviations, Papillon (Multi-lingual Dictionary) Project Workshop, Grenoble, 2004.

  4. Lee Shiu Chen, Lexical Neologisms in Japanese, Australian Association for Research in Education Conference, Brisbane, 2002.

  5. Toshiaki Nakazawa, Daisuke Kawahara and Sadao Kurohashi, Automatic acquisition of basic katakana lexicon from a given corpus, IJCNLP, 2005.

  6. Natsuko Tsujimura, An Introduction to Japanese Linguistics, Blackwell, 1996.

  7. Nobuhiro Kaji, Ryoko Uno and Masaru Kitsuregawa, Mining Neologisms from a Large Diachronic Web Archive for Supporting Linguistic Research, Technical Report, Institute of Industrial Science/Graduate School of Arts and Sciences, University of Tokyo (in Japanese).

  8. Taku Kudo and Hideto Kazawa, Japanese Web N-gram Corpus Version 1, Google/Linguistic Data Consortium, http://www.ldc.upenn.edu/

  9. Kiyotaka Uchimoto, Satoshi Sekine and Hitoshi Isahara, The unknown word problem: a morphological analysis of Japanese using maximum entropy aided by a dictionary, EMNLP 2001.

  10. Kiyoko Uchiyama, Timothy Baldwin and Shun Ishizaki, Disambiguating Japanese Compound Verbs, Computer Speech & Language, Volume 19, Issue 4, October 2005 (Special issue on Multiword Expression).

  11. Takehito Utsuro, Takao Shime, Masatoshi Tsuchiya, Suguru Matsuyoshi and Satoshi Sato, Chunking and Dependency Analysis of Japanese Compound Functional Expressions by Machine Learning, Text, Speech and Dialogue: 10th International Conference, TSD 2007, Plzen, Czech Republic.