much.more

WP4: Corpus Annotation: State of the Art

Deliverable D4.1 on the MuchMore annotation format is completed. The report describes an XML-based DTD that covers several levels of linguistic and semantic annotation in a flexible and extendable way. The format allows for efficient processing, e.g. in indexing with the EIT search engine, for work on the extraction of novel terms and relations, and for building statistical data models for use in disambiguation.
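As an illustration of what a multi-level XML annotation can look like to a downstream tool, the following minimal sketch reads a small annotated sentence with Python's standard library. The element and attribute names (`document`, `sentence`, `token`, `concept`, `pos`, `lemma`, `tokens`, `code`) and the concept code are hypothetical placeholders chosen for this example; they are not taken from the D4.1 DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: tag and attribute names are illustrative only,
# NOT the actual MuchMore DTD. The concept code is a placeholder.
SAMPLE = """<document id="doc1" lang="en">
  <sentence id="s1">
    <token id="t1" pos="NN" lemma="aspirin">Aspirin</token>
    <token id="t2" pos="VBZ" lemma="reduce">reduces</token>
    <token id="t3" pos="NN" lemma="fever">fever</token>
    <concept id="c1" tokens="t3" code="X0000001"/>
  </sentence>
</document>"""


def read_tokens(xml_text):
    """Return (surface form, part-of-speech tag, lemma) for every token."""
    root = ET.fromstring(xml_text)
    return [(t.text, t.get("pos"), t.get("lemma")) for t in root.iter("token")]


def concept_spans(xml_text):
    """Return (concept code, covered words) for every semantic annotation."""
    root = ET.fromstring(xml_text)
    # Map token ids to surface forms so concept references can be resolved.
    words = {t.get("id"): t.text for t in root.iter("token")}
    spans = []
    for c in root.iter("concept"):
        token_ids = c.get("tokens", "").split()
        spans.append((c.get("code"), [words[i] for i in token_ids]))
    return spans


if __name__ == "__main__":
    print(read_tokens(SAMPLE))
    print(concept_spans(SAMPLE))
```

Keeping the semantic layer as references to token ids, rather than nesting it around the tokens, is one common way to let annotation levels be added or revised independently; whether the MuchMore format does this is not stated here.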

Work continued on adapting shallow processing tools for linguistic annotation (specifically part-of-speech tagging and morphological analysis) to the medical domain, using the UMLS medical specialist lexicon for English and a preliminary version of the German specialist lexicon (DSL) provided by ZInfo. With these lexicons included, a manual evaluation of a sample of 10 Springer abstracts (~1,500 tokens) in both German and English showed a remaining part-of-speech tagging error rate of about 1.5%. To evaluate automatic morphological processing, medical experts at ZInfo produced a test list. The evaluation showed a recall of roughly 69%, with an error rate varying between 6% and 12% on two different Springer sub-corpora.
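The figures above can be read as simple ratios. The sketch below shows one plausible way to compute them, assuming that the tagging error rate is the share of tokens whose automatic tag differs from the manually checked tag, and that recall is the share of test-list items for which the analyser returned a result. Both definitions, the function names and the toy data are assumptions made for illustration; the report does not spell out the exact evaluation procedure.

```python
def error_rate(gold_tags, predicted_tags):
    """Share of tokens whose predicted tag differs from the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must align")
    if not gold_tags:
        raise ValueError("cannot evaluate an empty sample")
    errors = sum(1 for g, p in zip(gold_tags, predicted_tags) if g != p)
    return errors / len(gold_tags)


def recall(test_items, analysed_items):
    """Share of test-list items that the analyser produced a result for."""
    test_set = set(test_items)
    if not test_set:
        raise ValueError("cannot evaluate an empty test list")
    return len(test_set & set(analysed_items)) / len(test_set)


if __name__ == "__main__":
    # Toy data, not taken from the Springer corpus.
    gold = ["NN", "VBZ", "DT", "NN"]
    pred = ["NN", "VBZ", "DT", "JJ"]
    print(error_rate(gold, pred))

    test_list = ["Herzinfarkt", "Blutdruck", "Nierenversagen", "Leberzirrhose"]
    analysed = ["Herzinfarkt", "Blutdruck"]
    print(recall(test_list, analysed))
```

At the scale reported, an error rate of about 1.5% on roughly 1,500 tokens corresponds to about 22 or 23 mis-tagged tokens.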

Semantic annotation is also updated continuously as the requirements and results of the project evolve. This has resulted in several further versions of the annotated Springer corpus for both German and English, which were made available to all partners and to additional interested parties.

An interactive demo of the DFKI MuchMore system for linguistic and semantic annotation is available. For annotation development purposes, a GUI (MMV tool) is available to check the validity of extracted features and to inspect statistics for selected features.

Annual Report 2002

Last modified: July 2003