
WP7.1: Multilingual Term Extraction: State of the Art

 


XRCE developed several methods for exploiting multiple resources in bilingual lexicon extraction, from either parallel or comparable corpora. In particular, these methods combine information provided by three probabilistic lexical translation models: the first based mainly on the corpus, the second based mainly on a multilingual thesaurus, and the third derived from a bilingual dictionary. Special attention was paid to the use of multilingual thesauri, and three search strategies were developed (complete search, Viterbi search and subtree search), which exploit, at different levels, the information contained in the thesaurus.
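The text above does not say how the three translation models are combined. As a minimal sketch, assuming a simple linear mixture (one common choice, not necessarily the one used by XRCE), the combination could look as follows. The function name `combine_models`, the mixture weights and the toy probabilities are all hypothetical.

```python
from collections import defaultdict


def combine_models(models, weights):
    """Linearly interpolate translation distributions p(target | source).

    models  : list of dicts mapping source word -> {target word: probability}
    weights : list of non-negative mixture weights that sum to 1
    """
    if len(models) != len(weights):
        raise ValueError("one weight per model is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    combined = defaultdict(lambda: defaultdict(float))
    for model, weight in zip(models, weights):
        for source, targets in model.items():
            for target, prob in targets.items():
                # Each model contributes its probability scaled by its weight.
                combined[source][target] += weight * prob
    return {source: dict(targets) for source, targets in combined.items()}


# Toy, made-up distributions for one German word (not project data).
corpus_model = {"Herz": {"heart": 0.7, "cardiac": 0.3}}
thesaurus_model = {"Herz": {"heart": 1.0}}
dictionary_model = {"Herz": {"heart": 0.5, "core": 0.5}}

lexicon = combine_models(
    [corpus_model, thesaurus_model, dictionary_model],
    [0.5, 0.25, 0.25],  # assumed weights, chosen only for illustration
)

# Pick the highest-scoring translation candidate for the source word.
best = max(lexicon["Herz"], key=lexicon["Herz"].get)
```

Because each input is a probability distribution and the weights sum to 1, the mixture is again a distribution, so candidates from different resources can be ranked on a single scale.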

The corpus used in multilingual term extraction is composed of 5,500 medical abstracts from the MuchMore Springer corpus. These abstracts are “partial” translations of each other, because in some cases the English writer summarizes the article directly in English rather than translating the German abstract. The set of abstracts is used both as a parallel corpus and as a comparable corpus, in which case we do not make use of alignment information. There is a continuum from parallel corpora to fully unrelated texts, with comparable corpora in between. The comparable corpus we use is in a way “ideal”, and it is biased in the sense that we know the translation of a German word from the German corpus is almost certainly present in the English corpus. However, this bias, already present in previous work, does not affect the comparison of the methods we are interested in, since all methods are equally affected.

As a general bilingual resource, we use the German/English ELRA dictionary, which contains about 50,000 bilingual entries. The medical thesaurus used is MeSH (Medical Subject Headings), together with its German version, DMD, provided by DIMDI. The MeSH English entries and the DMD German entries are aligned through UMLS, so we can extract a bilingual thesaurus. Since DMD is smaller than MeSH, the resulting bilingual thesaurus contains only 15,000 bilingual entries, whereas MeSH contains 200,000 entries.
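The extraction of a bilingual thesaurus from two aligned monolingual ones can be sketched as a join on a shared concept identifier. This is only an illustration, assuming each entry carries such an identifier; the function name `build_bilingual_thesaurus`, the identifiers and the terms are invented, and no actual UMLS file format is shown.

```python
def build_bilingual_thesaurus(english_entries, german_entries):
    """Pair English and German terms that share a concept identifier.

    Both arguments map a concept identifier to a list of terms.
    Concepts missing from either side are dropped, so the result can
    never be larger than the smaller of the two inputs.
    """
    shared = english_entries.keys() & german_entries.keys()
    return {
        concept_id: (english_entries[concept_id], german_entries[concept_id])
        for concept_id in sorted(shared)
    }


# Placeholder identifiers and terms, invented for illustration only.
mesh_en = {"C001": ["Heart"], "C002": ["Liver"], "C003": ["Kidney"]}
dmd_de = {"C001": ["Herz"], "C003": ["Niere"]}

bilingual = build_bilingual_thesaurus(mesh_en, dmd_de)
```

Only concepts present on both sides survive the join, which mirrors why the bilingual thesaurus is bounded by the size of the smaller German resource.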

Based on this work, a bilingual lexicon was delivered (D7.1) that includes 1,400 new German terms from 700 medical abstracts for direct enrichment of existing concept classes in the German MeSH.


 
 
Last modified: July 2003