|   The corpus used in the MuchMore project 
                            is a parallel corpus of English-German scientific 
                            medical abstracts obtained from the Springer 
                            Link web site. The corpus consists of approximately 
                            1 million tokens for each language. Abstracts are 
                            from 41 medical journals, each of which represents 
                            a relatively homogeneous medical sub-domain (e.g. 
                            Neurology, Radiology). The downloaded HTML documents 
                            were normalized in various ways to produce a clean, 
                            plain-text version consisting of a title, abstract, 
                            and keywords. Additionally, the corpus was aligned 
                            at the sentence level. 
                          Automatic (!) annotation includes: Part-of-Speech; 
                            Morphology (inflection and decomposition); Chunks; 
                            Semantic Classes (UMLS: Unified Medical Language System, 
                            MeSH: Medical Subject Headings, EuroWordNet); Semantic 
                            Relations from UMLS. 
                          MuchMore Springer Corpus German: Plain Version, Annotated Version 
                          MuchMore Springer Corpus English: Plain Version, Annotated Version 
                          
                          Please note: the annotated versions are in the MuchMore 
                            XML annotation format (see deliverable D4.1).  |