Combining String and Context Similarity for Bilingual Term Alignment from Comparable Corpora

Georgios Kontonatsios, Ioannis Korkontzelos, Jun'ichi Tsujii, Sophia Ananiadou

Research output: Contribution to conference › Paper

Abstract

Automatically compiling bilingual dictionaries of technical terms from comparable corpora is a challenging problem with many potential applications. In this paper, we exploit two independent observations about term translations: (a) terms are often formed by corresponding sub-lexical units across languages and (b) a term and its translation tend to appear in similar lexical contexts. Based on the first observation, we develop a new character n-gram compositional method, a logistic regression classifier, for learning a string similarity measure of term translations. Based on the second observation, we use an existing context-based approach. For evaluation, we investigate the performance of the compositional and context-based methods on: (a) similar and unrelated languages, (b) corpora of different degrees of comparability and (c) the translation of frequent and rare terms. Finally, we combine the two translation clues, namely string and contextual similarity, in a linear model and show substantial improvements over each of the two translation signals used alone.
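
The combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper learns string similarity with a logistic regression classifier over character n-grams, whereas this sketch swaps in a plain cosine overlap of character n-grams as a stand-in, which is only plausible for language pairs that share a script. The function names, the interpolation parameter `weight`, the boundary markers, and the assumption that context vectors have already been mapped into a shared feature space are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(term, n_min=2, n_max=3):
    """Count character n-grams of a term (boundary markers are an assumption)."""
    padded = f"^{term.lower()}$"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def combined_score(string_sim, context_sim, weight=0.5):
    """Linear interpolation of the two signals; `weight` is a hypothetical parameter."""
    return weight * string_sim + (1.0 - weight) * context_sim


def rank_candidates(source_term, source_context, candidates, weight=0.5):
    """Rank target candidates (term -> context vector) by combined score, best first."""
    source_grams = char_ngrams(source_term)
    scored = []
    for target_term, target_context in candidates.items():
        # Stand-in string similarity: n-gram cosine, not a learned classifier.
        s_sim = cosine(source_grams, char_ngrams(target_term))
        # Context vectors are assumed to share one feature space already.
        c_sim = cosine(source_context, target_context)
        scored.append((target_term, combined_score(s_sim, c_sim, weight)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data, invented purely for illustration.
ranked = rank_candidates(
    "cardiomyopathy",
    {"heart": 2, "muscle": 1},
    {
        "cardiomyopathie": {"heart": 1, "muscle": 1},
        "hépatite": {"liver": 3},
    },
)
```

In this toy setup `ranked` lists the candidates from highest to lowest combined score. In practice the weight would be tuned on held-out term pairs rather than fixed at 0.5.
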
Original language: English
Pages: 1701-1712
Publication status: Published - Oct 2014
Event: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) - Doha, Qatar
Duration: 25 Oct 2014 → 29 Oct 2014

Conference

Conference: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Country: Qatar
City: Doha
Period: 25/10/14 → 29/10/14

  • Cite this

    Kontonatsios, G., Korkontzelos, I., Tsujii, J., & Ananiadou, S. (2014). Combining String and Context Similarity for Bilingual Term Alignment from Comparable Corpora. 1701-1712. Paper presented at Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar. http://www.aclweb.org/anthology/D14-1177