Identification of Neologisms in Japanese by Corpus Analysis
This paper describes recent work extending techniques reported earlier [2,3,7]
for identifying and extracting neologisms from Japanese texts. The purpose of the
work is to extend the recorded lexicon of Japanese, both in free and commercial
dictionaries.
Despite having a rich lexicon, the Japanese language has a noted tendency
to adopt and create new words[4,6]. While the reasons for adopting new words are
varied, there are a number of processes associated with the Japanese language
which tend to encourage neologism creation:
- the readiness to accept loanwords. Unlike some countries, which attempt
to restrict loanword usage, Japan has placed no formal restriction on their use.
Estimates of the number of loanwords used in Japanese range as high as 80,000. Most of
these words have been borrowed directly from English; however, a significant
number, known as wasei eigo (Japanese-made English), have been assembled from
English words or word fragments.
- the accepted morphological process of creating words by combining two or
more kanji (Chinese characters) chosen for their semantic properties. This
process was used extensively in the mid-19th century when Japan re-engaged
with the rest of the world and needed an expanded lexicon to handle the
technological, cultural and other information flowing into the country. This
process has continued. A broadly similar process is used to create compound verbs.
- the tendency to create abbreviations, particularly from compound nouns and
long loanwords. For example, the formal term for "student discount" in Japanese is
gakusei waribiki (学生割引); however, the common term is gakuwari (学割), formed
from the first kanji in each of the two constituent nouns. A similar
process is applied to loanwords, resulting in words such as sekuhara (セクハラ)
for "sexual harassment" (a contraction of sekushuaru harasumento).
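The abbreviation process described in the last item can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the work: the function names are hypothetical, the constituents are assumed to be already segmented, kanji are approximated by the basic CJK Unified Ideographs block, and the loanword rule simply takes the first two characters of each constituent, which only approximates two morae.

```python
import re

# Assumption: the basic CJK Unified Ideographs block is a good enough test for kanji.
KANJI = re.compile(r"[\u4e00-\u9fff]")


def kanji_abbreviation(constituents):
    """Join the first character of each constituent, provided each one is a kanji."""
    heads = [word[0] for word in constituents if word]
    if len(heads) != len(constituents):
        return None  # an empty constituent: no candidate can be formed
    if not all(KANJI.fullmatch(head) for head in heads):
        return None  # at least one constituent does not start with a kanji
    return "".join(heads)


def kana_abbreviation(constituents, length=2):
    """Join the first `length` characters of each loanword constituent.

    Simplification: characters stand in for morae, so small kana such as
    ュ are counted as a full unit.
    """
    return "".join(word[:length] for word in constituents)


print(kanji_abbreviation(["学生", "割引"]))              # 学割
print(kana_abbreviation(["セクシュアル", "ハラスメント"]))  # セクハラ
```

Candidates produced this way are only potential abbreviations; whether they are actually in use has to be checked against a corpus.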
Many neologisms eventually find their way into published dictionaries, and
there are several special neologism dictionaries (shingo jiten, gendaiyôgo jiten).
However, many abbreviations, compound verbs and loanwords are less well
lexicalized, as native speakers can usually recognize them as such, along with
their pronunciation and meaning.
Traditional techniques for identifying neologisms involve extracting lexemes
and comparing them with a lexical database. This process is problematic for Japanese,
as the orthography does not use any separators between words. Segmentation
software for Japanese typically uses extensive lexicons to enable word segments to
be identified, and usually outputs unassociated strings of characters when it
encounters words which are not in its lexicon. Some work has been carried out on
reconstructing these "unknown words", but usually in the context of part-of-speech
tagging and dependency analysis.[1,9,11]
In the work described in this paper several broad techniques are used:
- drawing on the fact that loanwords in Japanese are written in the katakana
syllabary, thus enabling relatively straightforward extraction and comparison. Some processing
is needed to separate out other classes of words, such as the scientific names of
flora and fauna, which are also traditionally written using katakana.[5]
- mimicking the morphological process for forming abbreviations to construct
potential abbreviations from known compound nouns. The candidate is then
checked in a WWW-based corpus[8] or via
a WWW search engine to determine whether the potential abbreviation
is used enough to warrant closer inspection.
- a similar mimicking of the morphological process for forming compound
verbs, also examining a WWW corpus to determine whether the potential verb
is in regular use.
- conducting post-analysis of the output of a morphological analyzer to detect
when unknown words have been encountered. In such cases the analyzers usually
just produce a string of kanji until they can resynchronize. These strings need
careful analysis, as Japanese is an agglutinative language which makes considerable
use of single-character affixes. Potential new words identified automatically
by this method are then checked in a WWW corpus to determine
whether they are used elsewhere, and sample passages are collected to extract the
meanings.
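As an illustration of the first technique in the list above, the following minimal sketch extracts runs of katakana from a text and keeps those absent from a known lexicon. It is an assumption-laden outline rather than the method actually used: the function name, the `min_count` threshold and the sample lexicon are hypothetical, and no attempt is made to filter out flora and fauna names.

```python
import re
from collections import Counter

# Katakana letters (U+30A1 to U+30FA) plus the prolonged sound mark (U+30FC).
# The middle dot (U+30FB) is left out so that it acts as a word separator.
KATAKANA_RUN = re.compile(r"[\u30a1-\u30fa\u30fc]{2,}")


def katakana_candidates(text, lexicon, min_count=1):
    """Return katakana strings not in `lexicon`, with their frequencies."""
    counts = Counter(KATAKANA_RUN.findall(text))
    return {
        word: n
        for word, n in counts.items()
        if word not in lexicon and n >= min_count
    }


# Illustrative data only: a tiny "known" lexicon and a made-up passage.
known = {"コンピュータ", "セクハラ"}
sample = "セクハラとパワハラはコンピュータの問題ではない。パワハラ対策"
print(katakana_candidates(sample, known))  # {'パワハラ': 2}
```

Because katakana runs are delimited by the surrounding hiragana and kanji, no segmentation software is needed for this step, which is what makes the loanword case comparatively straightforward.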
Two other aspects of Japanese neologisms also need to be determined: the pronunciation
and the meaning.
In the case of loanwords written in katakana the pronunciation is clear from
the syllabic text. It is also clear in compound
verbs, where the pronunciation of the component verb roots is unchanged.
For the kanji compounds the pronunciation is less clear, as many characters can
have multiple pronunciations, and the voicing may change on some
non-initial consonants (rendaku).
As writers of newspaper articles and similar texts will often follow new
or rare words with the pronunciation in parentheses, candidate pronunciations are
generated and the texts examined for possible confirmation.
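The generate-and-confirm step for kanji compounds might look roughly like the sketch below. Everything here is an assumption made for illustration: the per-character reading table is a hypothetical input, rendaku is modelled only as voicing the first kana of each non-initial element, and confirmation is a simple search for the word followed by a hiragana reading in parentheses.

```python
import re
from itertools import product

# Rendaku modelled as a kana substitution: unvoiced -> voiced (hiragana).
RENDAKU = dict(zip("かきくけこさしすせそたちつてとはひふへほ",
                   "がぎぐげござじずぜぞだぢづでどばびぶべぼ"))


def voiced(reading):
    """Voice the first kana of a reading where a voiced counterpart exists."""
    head = reading[0]
    return RENDAKU.get(head, head) + reading[1:]


def candidate_readings(word, readings):
    """Combine the known readings of each kanji, adding voiced variants
    for every non-initial character. `readings` is a hypothetical table
    mapping each kanji to a list of hiragana readings."""
    options = []
    for i, ch in enumerate(word):
        forms = set(readings[ch])
        if i > 0:
            forms |= {voiced(r) for r in readings[ch]}
        options.append(sorted(forms))
    return {"".join(parts) for parts in product(*options)}


def confirmed_readings(word, readings, text):
    """Keep the candidates that appear in parentheses right after the word."""
    pattern = re.compile(re.escape(word) + r"[（(]([ぁ-ゖー]+)[）)]")
    glossed = set(pattern.findall(text))
    return candidate_readings(word, readings) & glossed


table = {"手": ["て", "しゅ"], "紙": ["かみ", "し"]}  # illustrative readings
print(confirmed_readings("手紙", table, "手紙（てがみ）を書く"))  # {'てがみ'}
```

The candidate set grows multiplicatively with the number of characters, so intersecting it with readings actually attested in text is what keeps the approach practical.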
The meanings of neologisms which are loanwords or abbreviations can usually be
reliably derived from the source words or compounds; however, care is needed in the case
of loanwords, as a high proportion have nuances which differ from the original.
In the case of other neologisms, an initial presumption about the meaning can usually
be made based on the meanings of the kanji used; however, it is important
to examine the text passages in which the word appears to verify or determine the meaning.[10]
Often the arrival of a neologism
will result in discussion about it in online forums and articles, and by searching for
the language patterns used in such discussions it is often possible to isolate
definitions. For example, in discussion of the meaning of a word, the word in question
is often followed by the particle pair towa (とは) ("as for this word/passage/etc."),
thus providing an identifiable text pattern to use in searches. There is scope
for training machine-learning systems on such explanatory passages.
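A search for the とは pattern can be sketched as follows. This is a minimal illustration, assuming the candidate passages are already available as plain strings; the function name and the sample sentences are hypothetical, and a sentence is taken to end at the first 。 after the pattern.

```python
import re


def definition_passages(word, texts):
    """Return sentences in which `word` is immediately followed by とは."""
    # Optional comma after とは, then everything up to the end of the sentence.
    pattern = re.compile(re.escape(word) + r"とは[、,]?[^。]+。")
    hits = []
    for text in texts:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits


# Illustrative passages only.
passages = ["学割とは、学生割引の略である。便利だ。", "学割を使った。"]
print(definition_passages("学割", passages))  # ['学割とは、学生割引の略である。']
```

Passages collected in this way still need to be read or classified, since とは also occurs in contexts other than definitions.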
References
[1] Masayuki Asahara and Yuji Matsumoto,
Japanese Unknown Word Identification by Character-based Chunking,
COLING 2004, Geneva.
[2] James Breen,
Expanding the Lexicon: Harvesting Neologisms in Japanese,
Papillon (Multi-lingual Dictionary) Project Workshop, Chiang Rai,
Thailand, 2005.
[3] James Breen,
Expanding the Lexicon: the Search for Abbreviations,
Papillon (Multi-lingual Dictionary) Project Workshop, Grenoble, 2004.
[4] Lee Shiu Chen,
Lexical Neologisms in Japanese,
Australian Association for Research in Education Conference, Brisbane,
2002.
[5] Toshiaki Nakazawa, Daisuke Kawahara and Sadao Kurohashi,
Automatic acquisition of basic katakana lexicon from a given corpus,
IJCNLP, 2005.
[6] Natsuko Tsujimura,
An Introduction to Japanese Linguistics,
Blackwell, 1996.
[7] Nobuhiro Kaji, Ryoko Uno and Masaru Kitsuregawa,
Mining Neologisms from a Large Diachronic Web Archive for Supporting Linguistic Research,
Technical Report - Institute of Industrial Science/Graduate School of Arts and Sciences, University of Tokyo. (in Japanese)
[8] Taku Kudo and Hideto Kazawa,
Japanese Web N-gram Corpus Version 1,
Google/Linguistic Data Consortium, http://www.ldc.upenn.edu/
[9] Kiyotaka Uchimoto, Satoshi Sekine and Hitoshi Isahara,
The unknown word problem: a morphological analysis of Japanese using maximum entropy aided by a dictionary,
EMNLP 2001.
[10] Kiyoko Uchiyama, Timothy Baldwin and Shun Ishizaki,
Disambiguating Japanese Compound Verbs,
Computer Speech & Language, Volume 19, Issue 4, October 2005 (Special issue on Multiword Expression).
[11] Takehito Utsuro, Takao Shime, Masatoshi Tsuchiya, Suguru Matsuyoshi and Satoshi Sato,
Chunking and Dependency Analysis of Japanese Compound Functional Expressions by Machine Learning,
Text, Speech and Dialogue: 10th International Conference, TSD 2007, Plzen, Czech Republic.