EDICT/JMdict
Future Directions
Introduction
This page contains some thoughts I have been having about the future (if
any) of the EDICT and JMdict project/files, and some possible courses it
might take.
I welcome discussion, feedback, etc. It can be posted on the
sci.lang.japan newsgroup or emailed to me here.
Wiktionary-oriented discussion is probably best carried out on the
discussion page there.
Why Am I Raising This Topic?
Well, editing and coordinating these files has been largely a
one-man-band. Me. Sure there has been a lot of advice from others, and
certainly masses of input, but I have set the standards, done the
updates, released new versions, etc. etc.
I think the files are useful enough and of a good enough quality that
they deserve to live on, expand, be maintained, etc. They are the only
freely-available and parseable source of Japanese-English lexical
material.
I would like to see a future where:
- the project is continuing and to a major extent self-sustaining,
with the edit/update processes spread over a larger group of people;
- it is not at all dependent on my continued involvement. I won't be
around forever, and in fact after 14 years of EDICT I am seriously
thinking about the rest of my life (of which there is not an awful lot
left.)
- it is not dependent on support from Monash University. My honorary
appointment there will not continue for ever, and internal changes at
Monash may well result in withdrawal of server support.
My Vision, Hopes, Whatever.
What I would like to see is something like this:
- the underlying database from which the EDICT and JMdict files are
generated migrates from its present form (a large text file on my PC)
to an on-line database where it can be seen, edited, expanded,
etc. by a community of users;
- the edits to the database undergo some form of moderation/oversight,
either prior to their commitment (e.g. a moderation panel) or a more
passive after-the-event fixing of mistakes (more the Wiki model);
- a regular, automatic extraction from the on-line database generates
the distributed forms of the files (currently EDICT and JMdict, but there
may be other formats in the future);
- a more "open" copyright and usage licence arrangement. I am
considering moving to a Creative Commons licence. The one I have in mind
is the Attribution licence, which is very similar to the current one.
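As a rough illustration of the automatic extraction step in the list
above, here is a minimal Python sketch. Everything about the database is
an assumption made for the example: the on-line database does not exist
yet, so the `entry` and `gloss` tables, their columns, and the function
names are hypothetical. The output layout is a simplified form of the
EDICT line (`KANJI [KANA] /gloss/gloss/`) and leaves out part-of-speech
tags and other markers.

```python
import sqlite3


def edict_line(kanji, kana, glosses):
    """Format one entry in a simplified EDICT-style layout."""
    # Entries without a kanji form are written with the kana as headword.
    head = f"{kanji} [{kana}]" if kanji else kana
    return f"{head} /" + "/".join(glosses) + "/"


def export_edict(conn):
    """Return one EDICT-style line per entry in the (hypothetical) tables."""
    lines = []
    rows = conn.execute("SELECT id, kanji, kana FROM entry ORDER BY id").fetchall()
    for entry_id, kanji, kana in rows:
        glosses = [
            text
            for (text,) in conn.execute(
                "SELECT text FROM gloss WHERE entry_id = ? ORDER BY position",
                (entry_id,),
            )
        ]
        lines.append(edict_line(kanji, kana, glosses))
    return lines


def demo_database():
    """Build a tiny in-memory database using the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entry (id INTEGER PRIMARY KEY, kanji TEXT, kana TEXT NOT NULL);
        CREATE TABLE gloss (entry_id INTEGER, position INTEGER, text TEXT);
        """
    )
    conn.execute("INSERT INTO entry VALUES (1, '辞書', 'じしょ')")
    conn.execute("INSERT INTO entry VALUES (2, NULL, 'ひらがな')")
    conn.executemany(
        "INSERT INTO gloss VALUES (?, ?, ?)",
        [(1, 1, "dictionary"), (1, 2, "lexicon"), (2, 1, "hiragana")],
    )
    return conn


if __name__ == "__main__":
    for line in export_edict(demo_database()):
        print(line)
```

A scheduled job could run something like `export_edict` against the live
database and publish the result. A JMdict (XML) exporter would read the
same tables, so the two distributed files would stay in step.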
Options
Well, as I see it there are two main ways this vision could be
achieved:
- A Special System
Outline.
A dedicated server would be developed for updating the EDICT/JMdict
database.
Pros:
- could be made to match exactly the desired update model.
- new releases of EDICT, JMdict, etc. would be relatively straightforward.
- existing copyright, etc. arrangements continue.
Cons:
- a significant software development/debug effort needed.
- maintenance, future location of the server become issues.
Problems:
- actually making it happen. I have been exploring PHP/MySQL (all my WWW
server work in the past has been in C and has not used databases). I'm
feeling a bit tired and old for yet another major software development
effort.
- maybe some other software could be modified to do the task, e.g.
TWiki.
- Move Into An Established Environment, e.g. Wiktionary
Outline.
The entire dictionary database is uploaded into an established "wiki"
environment, e.g. Wiktionary. Edits would happen in that environment,
i.e. anyone could edit any entry.
Pros:
- a well-established environment, with excellent prospects for ongoing
support and availability;
- high and growing visibility, with the prospects of gathering a
large(r) support community.
Cons:
- it may be very difficult to maintain a tight entry format.
Wiktionary is more oriented to what you see on the screen and less to
generation of an accurately marked-up data set.
- the Wiktionary editing process is rather clumsy, and more oriented
to free-form text than tightly defined fields.
- Copyright. Wiktionary uses the GNU Free Documentation License, which
is more oriented to things like software manuals. It has no provision
for mandating attribution of source in followup applications of the
files, for example.
Problems:
- actually getting such a move accepted within Wiktionary. Wiktionary
is currently a set of monolingual subprojects. It currently has no formal
structure for bilingual dictionaries.
- getting the EDICT/JMdict data structure accepted. It is very much
tailored to the issues associated with Japanese lexicography, and may
not find acceptance in a wider framework.
- download of the files and conversion into the EDICT, JMdict, etc.
forms may be a major task.