The work on WP5 is divided into four parts: 1. construction of
evaluation corpora (based on the Springer corpus of medical
scientific abstracts); 2. development of various unsupervised
disambiguation methods and evaluation of these methods with the
constructed evaluation corpora; 3. integration of these methods
into a semantic tagging system; 4. development of sense discovery
methods.
Evaluation Corpora: Unfortunately, there is a lack of test sets
for WSD evaluation, especially for languages other than English,
and even more so for specific
domains like medicine. Given that our work focuses on German
as well as English text in the medical domain, we had to
develop our own evaluation corpora in order to test our
disambiguation methods. We decided to construct a set of
lexical sample corpora taken from the MuchMore Springer
corpus of medical scientific abstracts. Given that the size
of the German part in EuroWordNet is rather small, we decided
to use a more recent, larger version of GermaNet instead.
GermaNet is a lexical semantic resource for German with a structure
similar to that of WordNet and EuroWordNet. In parallel,
we developed two evaluation corpora for UMLS (or, more precisely,
for MeSH), one for English and one for German. Annotation is done
manually with the KIC tool (based on ANNOTATE).
Gold standards are established by resolving the cases of disagreement
between different annotators. The GermaNet evaluation corpus
consists of 40 terms, each with up to 100 instances. The
UMLS evaluation corpora consist of 70 terms for English
(each with a token frequency of at least 28; 41 terms with a token
frequency over 100) and 24 terms for German (each with a token
frequency of at least 11; 7 terms with a token frequency over 100).
The German set is smaller because the German part of UMLS is
rather small.
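
As an illustration only, the two corpus-construction steps described above (collecting a capped number of instances per term, and separating annotator agreement from the disagreement cases that still need to be resolved) could be sketched as follows. The function names, the whitespace tokenization, and the two-annotator setup are assumptions made for this sketch; the actual procedure and the KIC tool may work differently.

```python
from collections import defaultdict


def build_lexical_sample(sentences, target_terms, max_instances=100):
    """Collect up to max_instances sentences per target term (hypothetical helper)."""
    targets = {t.lower() for t in target_terms}
    sample = defaultdict(list)
    for sentence in sentences:
        seen = set()  # count each sentence at most once per term
        for token in sentence.split():  # naive whitespace tokenization (assumption)
            term = token.strip(".,;:()").lower()
            if term in targets and term not in seen and len(sample[term]) < max_instances:
                sample[term].append(sentence)
                seen.add(term)
    return dict(sample)


def merge_annotations(labels_a, labels_b):
    """Split instances into agreed labels and disagreement cases (hypothetical helper).

    labels_a and labels_b map instance ids to the sense chosen by each annotator.
    Disputed instances are returned separately so they can be resolved manually.
    """
    gold, disputed = {}, []
    for instance_id, label in labels_a.items():
        if labels_b.get(instance_id) == label:
            gold[instance_id] = label
        else:
            disputed.append(instance_id)
    return gold, disputed
```

In this sketch, the gold standard is complete only once every instance in the disputed list has been assigned a sense by hand and added to the agreed labels.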