Combining String and Context Similarity for Bilingual Term Alignment from Comparable Corpora

Georgios Kontonatsios, Ioannis Korkontzelos, Jun'ichi Tsujii, Sophia Ananiadou

Research output: Contribution to conference (Paper)

Abstract

Automatically compiling bilingual dictionaries of technical terms from comparable corpora is a challenging problem with many potential applications. In this paper, we exploit two independent observations about term translations: (a) terms are often formed by corresponding sub-lexical units across languages, and (b) a term and its translation tend to appear in similar lexical contexts. Based on the first observation, we develop a new character n-gram compositional method, a logistic regression classifier, for learning a string similarity measure of term translations. Based on the second observation, we use an existing context-based approach. For evaluation, we investigate the performance of the compositional and context-based methods on: (a) similar and unrelated languages, (b) corpora of different degrees of comparability, and (c) the translation of frequent and rare terms. Finally, we combine the two translation clues, namely string and contextual similarity, in a linear model and show substantial improvements over either translation signal alone.
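
The abstract does not give implementation details, so the following is only a minimal sketch of how the two signals could be computed and combined, not the authors' method. It assumes that string similarity is a logistic function over cross-lingual character n-gram pair features, that context vectors have already been mapped into a shared vocabulary (for example with a seed dictionary), and that the combination is a simple interpolation. All function names, the n-gram range, the boundary markers, the toy weights and the `alpha` parameter are hypothetical.

```python
import math
from itertools import product


def char_ngrams(term, n_min=2, n_max=3):
    """Return the set of character n-grams of `term` for n in [n_min, n_max]."""
    padded = f"^{term.lower()}$"  # boundary markers: an illustrative choice
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def pair_features(source_term, target_term):
    """Cross-lingual features: every (source n-gram, target n-gram) pair."""
    return set(product(char_ngrams(source_term), char_ngrams(target_term)))


def string_similarity(source_term, target_term, weights, bias=0.0):
    """Logistic score in (0, 1); `weights` stands in for trained parameters."""
    features = pair_features(source_term, target_term)
    z = bias + sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-z))


def context_similarity(source_context, target_context):
    """Cosine similarity of two bag-of-words context vectors (dicts).

    Assumes both vectors already share one vocabulary.
    """
    dot = sum(v * target_context.get(w, 0.0) for w, v in source_context.items())
    norm_s = math.sqrt(sum(v * v for v in source_context.values()))
    norm_t = math.sqrt(sum(v * v for v in target_context.values()))
    return dot / (norm_s * norm_t) if norm_s and norm_t else 0.0


def combined_score(string_sim, context_sim, alpha=0.5):
    """Linear interpolation of the two signals; alpha is a hypothetical weight."""
    return alpha * string_sim + (1.0 - alpha) * context_sim


if __name__ == "__main__":
    # Toy weights: reward every n-gram pair of one known translation pair.
    toy_weights = {f: 0.05 for f in pair_features("cardiology", "cardiologie")}
    s = string_similarity("cardiology", "cardiologie", toy_weights)
    c = context_similarity({"heart": 3, "disease": 1}, {"heart": 2, "patient": 1})
    print(combined_score(s, c, alpha=0.5))
```

In practice the weights would be learned from a seed dictionary of known term translations, and `alpha` would be tuned on held-out data; both are left as placeholders here.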
Original language: English
Pages: 1701-1712
Publication status: Published - Oct 2014
Event: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) - Doha, Qatar
Duration: 25 Oct 2014 - 29 Oct 2014

Conference

Conference: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Country: Qatar
City: Doha
Period: 25/10/14 - 29/10/14

Cite this

Kontonatsios, G., Korkontzelos, I., Tsujii, J., & Ananiadou, S. (2014). Combining String and Context Similarity for Bilingual Term Alignment from Comparable Corpora. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1701-1712). Doha, Qatar.