MuchMore Springer Bilingual Corpus


The corpus used in the MuchMore project is a parallel corpus of English-German scientific medical abstracts obtained from the Springer Link web site. The corpus consists of approximately 1 million tokens for each language. The abstracts come from 41 medical journals, each of which constitutes a relatively homogeneous medical sub-domain (e.g. Neurology, Radiology). The downloaded HTML documents were normalized in various ways to produce a clean, plain text version of each document, consisting of a title, an abstract and keywords. Additionally, the corpus was aligned at the sentence level.
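
As a rough illustration of the normalization step described above, the following Python sketch reduces an HTML document to whitespace-normalized plain text. It is not the MuchMore project's actual pipeline: the names `TextExtractor`, `BLOCK_TAGS` and `html_to_plain_text` are hypothetical, and the structure of the original Springer Link pages (for example, how title, abstract and keywords were marked up) is not documented here, so the sketch only strips markup in a generic way.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags mark block boundaries where a space should be inserted.
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "li", "td", "tr", "title"}


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the content of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &ouml;
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._parts.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace (including newlines) into single spaces.
        return re.sub(r"\s+", " ", "".join(self._parts)).strip()


def html_to_plain_text(html):
    """Return the visible text of an HTML string as one normalized line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `html_to_plain_text(page_source)` on a downloaded page returns a single line of text; splitting that text into title, abstract and keywords would need page-specific rules that are not covered by this sketch.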

The annotation is fully automatic and includes: Part-of-Speech; Morphology (inflection and decomposition); Chunks; Semantic Classes (UMLS: Unified Medical Language System, MeSH: Medical Subject Headings, EuroWordNet); Semantic Relations from UMLS.

MuchMore Springer Corpus German

Plain Version

Annotated Version

MuchMore Springer Corpus English

Plain Version

Annotated Version

Please note: the annotated versions are in the MuchMore XML annotation format (see deliverable D4.1).


