|
XRCE developed
several methods for exploiting multiple resources in bilingual
lexicon extraction, from either parallel or comparable corpora.
In particular, the methods combine information provided by three
different probabilistic lexical translation models: the
first one mainly based on the corpus, the second one mainly
based on a multilingual thesaurus, and the third one derived
from a bilingual dictionary. Special attention was paid
to the use of multilingual thesauri, and three search strategies
were developed: complete search, Viterbi search and subtree
search, which exploit, at different levels, the information
contained in the thesaurus.
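
To make the idea of combining several translation models concrete, here is a minimal sketch. It assumes a simple linear interpolation of the three models, which is an illustrative choice and not necessarily the combination scheme used in this work. The function names, the mixture weights and the toy probabilities are all hypothetical.

```python
from collections import defaultdict


def combine_models(models, weights):
    """Linearly interpolate translation models P_k(target | source).

    models:  list of dicts mapping source word -> {target word: probability}
    weights: mixture weights, one per model, summing to 1
    NOTE: linear interpolation is an assumption made for illustration.
    """
    if len(models) != len(weights):
        raise ValueError("one weight per model is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    combined = defaultdict(lambda: defaultdict(float))
    for model, weight in zip(models, weights):
        for source, targets in model.items():
            for target, prob in targets.items():
                combined[source][target] += weight * prob
    # Convert nested defaultdicts to plain dicts for the caller.
    return {s: dict(t) for s, t in combined.items()}


def best_translations(combined, source, n=3):
    """Return the n highest-scoring candidate translations of a source word."""
    candidates = combined.get(source, {})
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Toy, made-up probabilities for a single German word.
corpus_model = {"Herz": {"heart": 0.6, "cardiac": 0.4}}
thesaurus_model = {"Herz": {"heart": 1.0}}
dictionary_model = {"Herz": {"heart": 0.5, "core": 0.5}}

combined = combine_models(
    [corpus_model, thesaurus_model, dictionary_model],
    [0.5, 0.3, 0.2],  # hypothetical weights
)
print(best_translations(combined, "Herz"))
```

In this toy setting a candidate supported by all three resources outranks candidates proposed by only one of them, which is the intuition behind exploiting multiple resources.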
The corpus used in multilingual term extraction is composed
of 5500 medical abstracts from the MuchMore Springer corpus.
These abstracts are “partial” translations of
each other, because in some cases the English writer directly
summarizes the articles in English, rather than translating
the German abstracts. The set of abstracts is used both
as a parallel corpus and as a comparable corpus; in the latter
case we do not make use of alignment information. There
is a continuum from parallel corpora to fully unrelated
texts, going through comparable corpora. The comparable
corpus we use is in a way “ideal”: it is biased
in the sense that the translation of a word from the
German corpus is almost certainly present in
the English corpus. However, this bias, already present
in previous work, does not affect the comparison of the
methods we are interested in, all methods being equally
affected. As a general bilingual resource, we use the German/English
ELRA dictionary, which contains about 50,000 bilingual
entries. The medical thesaurus used is MeSH (Medical Subject
Headings), together with its German version, DMD, provided by DIMDI.
Through UMLS the MeSH English entries and the DMD German
entries are aligned, so we can extract a bilingual thesaurus.
Since DMD is smaller than MeSH, the resulting bilingual
thesaurus only contains 15,000 bilingual entries, whereas MeSH
contains 200,000 entries.
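
The alignment step described above can be sketched as a join on a shared concept identifier. This is a minimal illustration that assumes each thesaurus has already been loaded as a mapping from concept identifier to terms. The identifiers `C1`, `C2`, `C3` and the function name are made-up placeholders, not real UMLS identifiers or an actual UMLS API.

```python
def build_bilingual_thesaurus(english_entries, german_entries):
    """Pair English and German terms that share a concept identifier.

    english_entries, german_entries: dicts mapping concept id -> list of terms.
    Only concepts present on both sides yield a bilingual entry.
    """
    bilingual = {}
    for concept_id, german_terms in german_entries.items():
        english_terms = english_entries.get(concept_id)
        if english_terms:
            bilingual[concept_id] = {
                "en": list(english_terms),
                "de": list(german_terms),
            }
    return bilingual


# Toy data: the German side covers fewer concepts than the English side.
mesh_en = {"C1": ["Heart"], "C2": ["Liver"], "C3": ["Kidney"]}
dmd_de = {"C1": ["Herz"], "C2": ["Leber"]}

thesaurus = build_bilingual_thesaurus(mesh_en, dmd_de)
print(len(thesaurus), thesaurus["C1"])
```

Because only concepts present in both thesauri produce an entry, the size of the result is bounded by the smaller resource, which matches the observation that the bilingual thesaurus is much smaller than MeSH.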
Based on this work, a bilingual lexicon was delivered (D7.1)
that includes 1,400 new German terms from 700 medical abstracts
for direct enrichment of existing concept classes in the
German MeSH. |
|