much.more

WP5: Sense Disambiguation State of the Art

The work on WP5 is divided into four tasks: 1. construction of evaluation corpora (based on the Springer corpus of medical scientific abstracts); 2. development of various unsupervised disambiguation methods and their evaluation on these corpora; 3. integration of these methods into a semantic tagging system; 4. development of sense discovery methods.

Evaluation Corpora: Test sets for WSD evaluation are scarce, especially for languages other than English and even more so for specific domains such as medicine. Because our work focuses on German as well as English text in the medical domain, we had to develop our own evaluation corpora in order to test our disambiguation methods. We decided to construct a set of lexical sample corpora drawn from the MuchMore Springer corpus of medical scientific abstracts. Because the German part of EuroWordNet is rather small, we decided to use a more recent, larger version of GermaNet instead. GermaNet is a lexical semantic resource for German with a structure similar to that of WordNet and EuroWordNet. In parallel, we developed two evaluation corpora for UMLS (or rather MeSH, for English and German). Annotation is manual, using the KIC tool (based on ANNOTATE). Gold standards are defined by resolving the cases in which annotators disagree. The GermaNet evaluation corpus consists of 40 terms, each with up to 100 instances. The UMLS evaluation corpora consist of 70 terms for English (token frequencies of at least 28, with 41 terms having token frequencies over 100) and 24 terms for German (token frequencies of at least 11, with 7 terms having token frequencies over 100), as the German part of UMLS is rather small.
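The gold-standard step described above can be illustrated with a minimal Python sketch. It is not the project's actual procedure or the KIC tool's output format: the instance identifiers, sense labels, two-annotator setup and function names are all hypothetical, chosen only to show how agreed instances can be accepted directly while disagreement cases are set aside for manual resolution.

```python
def build_gold_standard(annotations):
    """Split annotated instances into agreed and disputed ones.

    `annotations` maps an instance id to the list of sense labels
    assigned by the annotators (hypothetical format, not KIC output).
    """
    gold = {}      # instances on which all annotators agree
    disputed = []  # instances that need manual resolution
    for instance_id, labels in annotations.items():
        if len(set(labels)) == 1:
            gold[instance_id] = labels[0]
        else:
            disputed.append(instance_id)
    return gold, disputed


def observed_agreement(annotations):
    """Fraction of instances on which all annotators chose the same sense."""
    if not annotations:
        return 0.0
    agreed = sum(1 for labels in annotations.values() if len(set(labels)) == 1)
    return agreed / len(annotations)


def resolve(gold, disputed, adjudicated):
    """Merge manually adjudicated senses into the gold standard.

    Raises KeyError if a disputed instance has no adjudicated sense,
    so that no disagreement case is silently dropped.
    """
    resolved = dict(gold)
    for instance_id in disputed:
        resolved[instance_id] = adjudicated[instance_id]
    return resolved


# Hypothetical annotations for one ambiguous term (two annotators).
annotations = {
    "abs001": ["sense_1", "sense_1"],
    "abs002": ["sense_1", "sense_2"],
    "abs003": ["sense_2", "sense_2"],
    "abs004": ["sense_2", "sense_1"],
}

gold, disputed = build_gold_standard(annotations)
agreement = observed_agreement(annotations)

# Senses chosen after discussing the disagreement cases (made up here).
final = resolve(gold, disputed, {"abs002": "sense_1", "abs004": "sense_2"})
```

Keeping the disputed instances in a separate list, rather than taking a majority vote, mirrors the text's statement that disagreements are resolved explicitly. With only two annotators a majority vote would not be defined in any case.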

Annual Report 2002

last modified, July 2003