Deliverable
D4.1 on the MuchMore annotation format has been completed. The
report describes an XML-based format, defined by a DTD, that
covers several levels of linguistic and semantic annotation in
a flexible and extensible way. The format allows for efficient
processing, e.g. in indexing with the EIT search engine, in
the extraction of novel terms and relations, and in building
statistical data models for use in disambiguation.
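
As a rough illustration of how a multi-level XML annotation of this kind might be consumed for indexing, the sketch below parses a small annotated sentence and builds a concept-to-sentence index. The element names, attribute names, and concept codes are invented for illustration and are not taken from the D4.1 DTD; only Python's standard library is used.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical annotated document. Element names, attribute names and
# concept codes are illustrative only, NOT the actual D4.1 format.
SAMPLE = """
<document id="doc1" lang="en">
  <sentence id="s1">
    <token id="s1.w1" pos="NNS" lemma="fracture">Fractures</token>
    <token id="s1.w2" pos="IN" lemma="of">of</token>
    <token id="s1.w3" pos="DT" lemma="the">the</token>
    <token id="s1.w4" pos="NN" lemma="femur">femur</token>
    <term id="s1.t1" from="s1.w1" to="s1.w1" concept="C-EXAMPLE-1"/>
    <term id="s1.t2" from="s1.w4" to="s1.w4" concept="C-EXAMPLE-2"/>
  </sentence>
</document>
"""


def index_by_concept(xml_text):
    """Map each (hypothetical) concept code to the sentence ids containing it."""
    root = ET.fromstring(xml_text.strip())
    index = defaultdict(list)
    for sentence in root.iter("sentence"):
        for term in sentence.iter("term"):
            index[term.get("concept")].append(sentence.get("id"))
    return dict(index)


def lemmas(xml_text):
    """Return the lemma attribute of every token, in document order."""
    root = ET.fromstring(xml_text.strip())
    return [token.get("lemma") for token in root.iter("token")]


if __name__ == "__main__":
    print(index_by_concept(SAMPLE))
    print(lemmas(SAMPLE))
```

Keeping the linguistic level (tokens) and the semantic level (terms) as separate elements within the same sentence is one way such a format can stay extensible: a new annotation level could be added without touching the existing ones.
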
Work continued on the adaptation of shallow processing
tools for linguistic annotation (specifically part-of-speech
tagging and morphological analysis) to the medical domain,
using the UMLS medical specialist lexicon for English and
a preliminary version of the German specialist lexicon (DSL)
as provided by ZInfo. After including these, a manual evaluation
of a sample of 10 Springer abstracts (~1,500 tokens) in
both German and English showed a remaining error rate of
about 1.5% on part-of-speech tagging. To evaluate automatic
morphological processing, medical experts at ZInfo produced
a test list. The evaluation showed a recall of roughly 69%,
with an error rate varying between 6% and 12% across two
different Springer sub-corpora.
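
To make the reported figures concrete, the following minimal sketch shows how such rates are typically computed. It assumes that the error rate is the share of wrong items among those evaluated, and that recall is the share of test-list items for which the analyser returned any analysis; the report does not spell out these definitions. All counts in the usage section are hypothetical and chosen only to reproduce the approximate percentages.

```python
def error_rate(num_errors, num_evaluated):
    """Fraction of evaluated items (e.g. tokens) that were wrong."""
    if num_evaluated <= 0:
        raise ValueError("num_evaluated must be positive")
    return num_errors / num_evaluated


def recall(num_analysed, num_in_test_list):
    """Fraction of test-list items for which an analysis was produced.

    This definition is an assumption, not taken from the report.
    """
    if num_in_test_list <= 0:
        raise ValueError("num_in_test_list must be positive")
    return num_analysed / num_in_test_list


if __name__ == "__main__":
    # Hypothetical counts: about 22 tagging errors in ~1,500 tokens
    # corresponds to the reported rate of about 1.5%.
    print(f"POS error rate: {error_rate(22, 1500):.1%}")
    # Hypothetical test list of 100 items, 69 of which were analysed.
    print(f"Morphology recall: {recall(69, 100):.0%}")
```

At this sample size, an error rate of about 1.5% corresponds to only some 22 tokens, so a handful of additional errors would shift the figure noticeably.
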
Semantic annotation is also updated continuously as the
requirements and results of the project evolve.
This resulted in several more versions of the annotated
Springer corpus for both German and English that were made
available to all partners and also to additional interested
parties.
An interactive demo of the DFKI MuchMore
system for linguistic and semantic annotation is available.
For annotation development purposes, a GUI (MMV tool) is
available to check the validity of extracted features and
to inspect statistics for selected features.