Automatically compiling bilingual dictionaries of technical terms from comparable corpora is a challenging problem with many potential applications. In this paper, we exploit two independent observations about term translations: (a) terms are often formed from corresponding sub-lexical units across languages and (b) a term and its translation tend to appear in similar lexical contexts. Based on the first observation, we develop a new character n-gram compositional method, a logistic regression classifier, that learns a string similarity measure for term translations. Based on the second observation, we use an existing context-based approach. For evaluation, we investigate the performance of the compositional and context-based methods on: (a) similar and unrelated languages, (b) corpora of different degrees of comparability and (c) the translation of frequent and rare terms. Finally, we combine the two translation clues, namely string and contextual similarity, in a linear model and show substantial improvements over either signal used alone.
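The final step, combining string and contextual similarity in a linear model, can be illustrated with a minimal sketch. It is not the paper's implementation: cosine similarity over character n-gram counts stands in for the learned logistic regression measure (so it only makes sense for languages sharing a script), the contextual scores are assumed to be precomputed, and the names `char_ngrams`, `cosine`, `combined_score`, `rank_candidates` and the parameter `weight` are all hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(term, n_min=2, n_max=3):
    """Count the character n-grams of a term, padded with '#' at both ends."""
    padded = f"#{term.lower()}#"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def combined_score(string_sim, context_sim, weight=0.5):
    """Linear interpolation of the two clues; `weight` is an assumed parameter."""
    return weight * string_sim + (1.0 - weight) * context_sim


def rank_candidates(source, candidates, context_sims, weight=0.5):
    """Rank candidate translations of `source` by the combined score.

    `context_sims` maps each candidate to a precomputed contextual
    similarity; candidates missing from it are scored 0.0 (an assumption).
    """
    src = char_ngrams(source)
    scored = [
        (combined_score(cosine(src, char_ngrams(c)), context_sims.get(c, 0.0), weight), c)
        for c in candidates
    ]
    return [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

For instance, `rank_candidates("hepatitis", ["cardiologie", "hépatite"], {"hépatite": 0.6, "cardiologie": 0.4})` places `"hépatite"` first, since it shares character n-grams with the source term and also has the higher contextual score. In practice the interpolation weight would be tuned on held-out term pairs rather than fixed at 0.5.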
|Publication status||Published - Oct 2014|
|Event||Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) - Doha, Qatar|
Duration: 25 Oct 2014 → 29 Oct 2014