In WP4.1 (Corpus Preparation and Annotation
Scheme Specification), the Springer corpus was prepared
for further processing through an extensive manual clean-up.
This included, among other steps, removing HTML tags, removing
English segments from German abstracts and vice versa, and
removing or converting symbols and other non-ASCII elements.
The corpus of German autopsy reports that had previously been
considered was dropped from the project for two reasons:
(1) an English equivalent was hard to obtain, and (2) the
semi-controlled language of the corpus was deemed not useful
for the project. Before this decision was taken, however,
some work was done on automatically converting about half
of the corpus from all-capitals to normal text.
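
The clean-up described above was largely manual, and the method
used for the all-capitals conversion is not detailed here. Purely
as an illustration, the following is a minimal sketch of how such
steps could be automated. All function names, the umlaut
transliteration table and the naive sentence-casing rule are
assumptions made for this example, not the project's actual
procedure.

```python
import html
import re
import unicodedata

# Hypothetical helper names; the report does not describe any scripts.
TAG_RE = re.compile(r"<[^>]+>")

# Assumed transliteration for German characters (one possible convention).
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def strip_html(text: str) -> str:
    """Replace HTML tags with a space, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", text))


def to_ascii(text: str) -> str:
    """Transliterate German characters, strip other diacritics, drop the rest."""
    text = text.translate(GERMAN_MAP)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def clean(text: str) -> str:
    """Strip tags, convert to ASCII and collapse runs of whitespace."""
    return " ".join(to_ascii(strip_html(text)).split())


def sentence_case(text: str) -> str:
    """Naive all-capitals conversion: lowercase everything, then
    capitalise the first letter of each sentence. German noun
    capitalisation is NOT restored; that would need a lexicon or tagger."""
    lowered = text.lower()
    return re.sub(
        r"(^|[.!?]\s+)([a-zäöü])",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
    )
```

For example, clean("<p>Gef&auml;&szlig;e &amp; Herz</p>") yields
"Gefaesse & Herz". Detecting English segments inside German
abstracts would require a language-identification step and is
deliberately left out of this sketch.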
Also in WP4.1, an annotation format (an XML DTD) was defined
that covers several levels of linguistic and semantic annotation
in a flexible, extensible way while still allowing efficient
processing, e.g. indexing for information retrieval.
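
The DTD itself is not reproduced in this report. To make the idea
of several annotation levels on one token concrete, the following
is a hypothetical sketch only: the element and attribute names
(sentence, token, pos, lemma, concept) and the concept identifier
are placeholders invented for this example and should not be read
as the project's actual format.

```python
import xml.etree.ElementTree as ET


def annotate_token(parent, text, pos, lemma, concept=None):
    """Append a token carrying linguistic levels (pos, lemma) and,
    optionally, a semantic level (concept). All names are hypothetical."""
    token = ET.SubElement(parent, "token", {"pos": pos, "lemma": lemma})
    token.text = text
    if concept is not None:
        token.set("concept", concept)
    return token


def index_concepts(sentence):
    """Build a simple inverted index from concept identifier to token
    texts, as a retrieval system might do when indexing annotations."""
    index = {}
    for token in sentence.iter("token"):
        concept = token.get("concept")
        if concept is not None:
            index.setdefault(concept, []).append(token.text)
    return index


sentence = ET.Element("sentence", {"lang": "en"})
annotate_token(sentence, "The", "DT", "the")
# Placeholder identifier, not a real UMLS concept identifier.
annotate_token(sentence, "heart", "NN", "heart", concept="CONCEPT-ID-PLACEHOLDER")

xml_string = ET.tostring(sentence, encoding="unicode")
concept_index = index_concepts(sentence)
```

Keeping each level in its own attribute means a new level can be
added without disturbing existing ones, and an indexer can read
only the levels it needs.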
In WP4.2 (Annotation Tool), existing shallow-processing
tools for linguistic annotation (part-of-speech tagging,
morphological analysis, chunking) were further integrated,
a server was installed, and adaptation to the medical domain
was started (using medical lexicons for English and German).
In addition, development of an integrated tool for linguistic
and semantic annotation was initiated (based on the UMLS
knowledge base of medical terms, concepts and semantic relations).
In WP4.3 (Corpus Annotation), a first cycle of automatic
linguistic and semantic annotation of the Springer corpus
was started for both German and English. A manual evaluation
of samples of the annotated corpus is underway.