much.more

WP4: Corpus Annotation

In WP4.1 (Corpus Preparation and Annotation Scheme Specification), the Springer corpus was prepared for further processing through extensive manual clean-up. This included, among other steps, removing HTML tags, removing English segments from German abstracts and vice versa, and removing or converting symbols and other non-ASCII elements. The corpus of German autopsy reports considered earlier was dropped from the project for two reasons: (1) an English equivalent was hard to obtain, and (2) the semi-controlled language of the corpus was deemed not useful for the project. Before this decision was taken, however, some work had been done on automatically converting about half of that corpus from all-capitals to normal mixed-case text.
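Although the clean-up was carried out manually, steps of this kind can be approximated in code. The following is a minimal Python sketch, not the procedure used in the project. The function name `clean_abstract` and the transliteration table `GERMAN_MAP` are hypothetical, and the choice to transliterate umlauts (rather than drop them) is an assumption.

```python
import html
import re
import unicodedata

# Hypothetical transliteration table for German characters (an assumption,
# not the mapping used in the project).
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

TAG_RE = re.compile(r"<[^>]+>")


def clean_abstract(text: str) -> str:
    """Strip HTML tags and reduce the text to plain ASCII."""
    text = TAG_RE.sub(" ", text)                # replace each HTML tag with a space
    text = html.unescape(text)                  # decode entities such as &auml;
    text = text.translate(GERMAN_MAP)           # transliterate German characters
    text = unicodedata.normalize("NFKD", text)  # split accents from base letters
    text = text.encode("ascii", "ignore").decode("ascii")  # drop what remains
    return " ".join(text.split())               # collapse runs of whitespace


print(clean_abstract("<p>Die <b>Prüfung</b> der Gef&auml;&szlig;e &plusmn; 5&nbsp;mm</p>"))
```

Symbols with no ASCII decomposition (such as the plus-minus sign here) are silently dropped, and replacing tags with spaces splits text around inline markup such as subscripts. Both are reasons a real pipeline would still need manual review.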
Also in WP4.1, an annotation format (XML DTD) was defined that covers several levels of linguistic and semantic annotation in a flexible, extensible way while still allowing efficient processing, e.g. indexing for information retrieval purposes.
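To illustrate what a multi-level annotation of this kind might look like, the sketch below builds one annotated sentence with Python's standard `xml.etree.ElementTree`. It does not reproduce the project's DTD. All element and attribute names (`sentence`, `token`, `chunk`, `concept`, `from`, `to`, `code`) are hypothetical, and the concept code is a placeholder rather than a real UMLS identifier. The assumed design is that each higher level refers to tokens by ID, so that new levels can be added without touching existing ones.

```python
import xml.etree.ElementTree as ET


def annotate_sentence(tokens, chunks, concepts):
    """Build a hypothetical multi-level annotation for one sentence.

    tokens:   list of (form, pos, lemma)
    chunks:   list of (chunk_type, first_token_no, last_token_no)
    concepts: list of (concept_code, first_token_no, last_token_no)
    """
    sent = ET.Element("sentence")

    # Level 1: tokens with part-of-speech and lemma attributes.
    text = ET.SubElement(sent, "text")
    for i, (form, pos, lemma) in enumerate(tokens, start=1):
        tok = ET.SubElement(text, "token", {"id": f"w{i}", "pos": pos, "lemma": lemma})
        tok.text = form

    # Level 2: chunks, pointing at token IDs instead of wrapping tokens.
    chunk_layer = ET.SubElement(sent, "chunks")
    for ctype, first, last in chunks:
        ET.SubElement(chunk_layer, "chunk",
                      {"type": ctype, "from": f"w{first}", "to": f"w{last}"})

    # Level 3: semantic concepts, again as references to token spans.
    concept_layer = ET.SubElement(sent, "concepts")
    for code, first, last in concepts:
        ET.SubElement(concept_layer, "concept",
                      {"code": code, "from": f"w{first}", "to": f"w{last}"})

    return sent


sentence = annotate_sentence(
    tokens=[("chronic", "JJ", "chronic"), ("pain", "NN", "pain")],
    chunks=[("NP", 1, 2)],
    concepts=[("CODE-PLACEHOLDER", 1, 2)],  # placeholder, not a real UMLS code
)
print(ET.tostring(sentence, encoding="unicode"))
```

Keeping each level in its own element means an indexer can read only the levels it needs, for example tokens and concepts, and ignore the rest.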
In WP4.2 (Annotation Tool), existing shallow processing tools for linguistic annotation (part-of-speech tagging, morphological analysis, chunking) were further integrated, a server was installed, and adaptation to the medical domain was started using medical lexicons for English and German. Development of an integrated tool for linguistic and semantic annotation, based on the UMLS knowledge base of medical terms, concepts and semantic relations, was also initiated.
In WP4.3 (Corpus Annotation), a first cycle of automatic linguistic and semantic annotation of the Springer corpus was started for both German and English. A manual evaluation of samples of the annotated corpus is underway.

Annual Report 2001

Last modified: December 2001