
WP5: Sense Disambiguation State of the Art

 

Disambiguation Methods and Evaluation: Four methods were developed within the MuchMore project: 1. the bilingual method takes advantage of having a translated corpus, because knowing the translation of an ambiguous word can be enough to determine its sense; 2. the dictionary-based method uses relations between terms, as deduced from UMLS, to determine which sense is being used in a particular instance; 3. the domain-specific method uses the fact that certain meanings of general terms are more significant than others in specific domains (for example, in the medical domain, "operation" is far more likely to refer to a surgical operation than to a military operation); 4. the instance-based learning method uses a machine-learning technique that we applied to unsupervised training in word-sense disambiguation. Evaluation of these methods showed that high-precision, broad-coverage disambiguation of medical documents can be achieved without the costly annotation of many training examples. The best results for precision ranged from 74% (English) to 79% (German), achieved by the UMLS related terms method on the UMLS evaluation corpus, and from 77% to 99%, achieved by the domain-specific sense method on the GermaNet evaluation corpus (although with low coverage). The best results for coverage ranged from 67%, achieved by the instance-based learning method on the GermaNet evaluation corpus, to 83% (English) and 87% (German), achieved by the UMLS related terms method on the whole Springer corpus.
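The precision and coverage figures above can be read with the following minimal sketch in mind. It assumes the usual word-sense disambiguation definitions, which this page does not spell out: coverage is the share of ambiguous instances for which a method assigns any sense, and precision is the share of assigned senses that match the gold annotation. The function name, the sense labels, and the toy data are hypothetical and are not taken from the MuchMore evaluation.

```python
from typing import Optional, Sequence, Tuple


def precision_and_coverage(
    predicted: Sequence[Optional[str]],
    gold: Sequence[str],
) -> Tuple[float, float]:
    """Return (precision, coverage) for one disambiguation run.

    predicted[i] is the sense chosen for instance i, or None when the
    method abstained. gold[i] is the annotated sense. These definitions
    are an assumption; the original evaluation may differ in detail.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")

    # Keep only the instances where the method committed to a sense.
    attempted = [(p, g) for p, g in zip(predicted, gold) if p is not None]

    coverage = len(attempted) / len(gold) if gold else 0.0
    correct = sum(1 for p, g in attempted if p == g)
    precision = correct / len(attempted) if attempted else 0.0
    return precision, coverage


# Toy example: four occurrences of "operation", one left untagged.
gold = ["surgery", "surgery", "military", "surgery"]
predicted = ["surgery", None, "surgery", "surgery"]

p, c = precision_and_coverage(predicted, gold)
print(f"precision={p:.2f} coverage={c:.2f}")  # precision=0.67 coverage=0.75
```

Under these definitions, a method that abstains on hard cases can reach very high precision with low coverage, which is the trade-off noted for the domain-specific sense method.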

Semantic Tagging System: Development of an integrated semantic tagging and disambiguation system as part of the MuchMore tools for linguistic and semantic annotation. The system integrates two DFKI methods (domain-specific sense; instance-based learning) for EuroWordNet disambiguation and two CSLI methods (bilingual; collocation) for UMLS disambiguation.

Sense Discovery: Development of tools for clustering and cross-lingual visualisation of word distributions, used to analyse which clusters carry meaningful information in specific domains and may be interpreted as domain-specific word senses.

 



 
Last modified: July 2003