Approximate Top-K Answering under Uncertain Schema Mappings

Longzhuang Li, Feng Tian, Yonghuai Liu, Shanxian Mao

Research output: Contribution to journal › Article

Abstract

Data integration techniques bridge isolated sources and offer a platform for information exchange. When the schemas of heterogeneous data sources map to the centralized schema in a mediated data integration system, or a source schema maps to a target schema in a peer-to-peer system, multiple schema mappings may exist because of ambiguities in attribute matching. These ambiguous schema mappings lead to uncertainty in query answering, and users are frequently interested only in the k answers with the highest probabilities (top-k). Retrieving the top-k answers efficiently has therefore become a research issue. For uncertain queries, two semantics, by-table and by-tuple, have been developed to capture top-k answers based on the schema mapping probabilities. Although the existing algorithms include features that capture the accurate top-k answers and avoid accessing all data from the sources, in most cases they cannot effectively reduce the number of processed tuples. In this paper, new algorithms based on histogram approximation and heuristics are proposed to efficiently identify the top-k answers for data integration systems under uncertain schema mappings. In the experiments, the Histogram algorithm in the by-table semantics and the expected approach in the by-tuple semantics are shown to significantly reduce the number of processed tuples while maintaining high accuracy with the estimated probabilistic confidence.
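
As an illustration of the by-table semantics mentioned in the abstract, the following is a minimal sketch of the exhaustive baseline: each candidate mapping is applied to the whole source table, and an answer's probability is the sum of the probabilities of the mappings that produce it. It is not the Histogram algorithm or the expected approach proposed in the paper, which aim to avoid processing every tuple. The function name `top_k_by_table`, the attribute names, and the mapping probabilities are all hypothetical and chosen only for the example.

```python
from collections import defaultdict


def top_k_by_table(rows, mappings, target_attr, k):
    """Naive by-table top-k: scan every tuple under every candidate mapping.

    rows        -- source tuples, each a dict of source attribute -> value
    mappings    -- list of (mapping, probability) pairs, where mapping is a
                   dict of target attribute -> source attribute
    target_attr -- the mediated-schema attribute the query projects
    k           -- number of answers to return
    """
    scores = defaultdict(float)
    for mapping, prob in mappings:
        source_attr = mapping[target_attr]
        # By-table semantics: one mapping holds for the entire table, so each
        # distinct value it yields is an answer supported with weight `prob`.
        answers = {row[source_attr] for row in rows}
        for value in answers:
            scores[value] += prob
    # Sort by descending probability, breaking ties on the value itself.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], str(item[0])))
    return ranked[:k]


# Hypothetical source table and two candidate mappings for the attribute "phone".
rows = [
    {"home_phone": "111", "office_phone": "222"},
    {"home_phone": "333", "office_phone": "111"},
]
mappings = [
    ({"phone": "home_phone"}, 0.75),
    ({"phone": "office_phone"}, 0.25),
]

result = top_k_by_table(rows, mappings, "phone", 2)
print(result)
```

In this toy input the value "111" is produced by both mappings, so it accumulates the full probability mass and ranks first. The cost of the sketch grows with the number of tuples times the number of mappings, which is the kind of overhead an approximate method would try to reduce.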
Original language: English
Pages (from-to): 71-91
Journal: Data & Knowledge Engineering
Volume: 118
Early online date: 17 Oct 2018
DOIs: https://doi.org/10.1016/j.datak.2018.09.004
Publication status: E-pub ahead of print - 17 Oct 2018

Fingerprint

Data integration
Semantics
Communication
Experiments

Cite this

Li, Longzhuang ; Tian, Feng ; Liu, Yonghuai ; Mao, Shanxian. / Approximate Top-K Answering under Uncertain Schema Mappings. In: Data & Knowledge Engineering. 2018 ; Vol. 118. pp. 71-91.
@article{e9f72121c4fd49aa93b83f9cb69949be,
title = "Approximate Top-K Answering under Uncertain Schema Mappings",
author = "Longzhuang Li and Feng Tian and Yonghuai Liu and Shanxian Mao",
year = "2018",
month = "10",
day = "17",
doi = "10.1016/j.datak.2018.09.004",
language = "English",
volume = "118",
pages = "71--91",
journal = "Data and Knowledge Engineering",
issn = "0169-023X",
publisher = "Elsevier",
}

Approximate Top-K Answering under Uncertain Schema Mappings. / Li, Longzhuang; Tian, Feng; Liu, Yonghuai; Mao, Shanxian.

In: Data & Knowledge Engineering, Vol. 118, 17.10.2018, p. 71-91.

TY - JOUR

T1 - Approximate Top-K Answering under Uncertain Schema Mappings

AU - Li, Longzhuang

AU - Tian, Feng

AU - Liu, Yonghuai

AU - Mao, Shanxian

PY - 2018/10/17

Y1 - 2018/10/17

U2 - https://doi.org/10.1016/j.datak.2018.09.004

DO - https://doi.org/10.1016/j.datak.2018.09.004

M3 - Article

VL - 118

SP - 71

EP - 91

JO - Data and Knowledge Engineering

JF - Data and Knowledge Engineering

SN - 0169-023X

ER -