The corpus used in the MuchMore project
is a parallel corpus of English-German scientific
medical abstracts obtained from the Springer
Link web site. The corpus consists of approximately
1 million tokens for each language. Abstracts are
from 41 medical journals, each of which constitutes
a relatively homogeneous medical sub-domain (e.g.
Neurology, Radiology). The downloaded HTML documents
were normalized in various ways to produce a clean,
plain-text version of each abstract, consisting
of a title, abstract and keywords. Additionally, the
corpus was aligned at the sentence level.
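
The project's actual normalization steps are not described here. As a rough illustration only, the sketch below shows one generic way to reduce an HTML page to plain text with the Python standard library. The names TextExtractor and html_to_plain_text are hypothetical and are not part of the MuchMore tools.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> bodies.

    Hypothetical helper for illustration; not the MuchMore pipeline.
    """

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_plain_text(html: str) -> str:
    """Strip tags and collapse all whitespace runs to single spaces.

    Text chunks are joined with a space, so a word split by inline
    markup (e.g. <b>Neuro</b>logy) would come out as two tokens.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
```

A real pipeline would additionally need to separate title, abstract and keywords, which depends on the page layout of the source site and is not attempted here.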
Automatic (!) annotation includes: Part-of-Speech;
Morphology (inflection and decomposition); Chunks;
Semantic Classes (UMLS: Unified Medical Language System,
MeSH: Medical Subject Headings, EuroWordNet); Semantic
Relations from UMLS.
MuchMore Springer Corpus German: Plain Version, Annotated Version
MuchMore Springer Corpus English: Plain Version, Annotated Version
Please note: annotated versions are in the MuchMore
XML annotation format (see deliverable D4.1).