TY - JOUR
T1 - Descriptive document clustering via discriminant learning in a co-embedded space of multilevel similarities
AU - Mu, Tingting
AU - Goulermas, John Y.
AU - Korkontzelos, Ioannis
AU - Ananiadou, Sophia
PY - 2016/1/1
Y1 - 2016/1/1
N2 - Descriptive document clustering aims at discovering clusters of semantically interrelated documents together with meaningful labels to summarize the content of each document cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of 2 types of heterogeneous objects, which correspond to documents and candidate phrases, using multilevel similarity information. CEDL is composed of 5 main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co-embedded space that preserves higher-order, neighbor-based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model where documents along with their cluster memberships are used as training instances. Subsequently, the final cluster labels are selected from the topic phrases with a ranking scheme that uses multiple scores based on the extracted co-embedding information and the discriminant output. The final stage polishes the initial clusters to reduce noise and accommodate the multitopic nature of documents. The effectiveness and competitiveness of CEDL are demonstrated qualitatively and quantitatively with experiments using document databases from different application fields.
AB - Descriptive document clustering aims at discovering clusters of semantically interrelated documents together with meaningful labels to summarize the content of each document cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of 2 types of heterogeneous objects, which correspond to documents and candidate phrases, using multilevel similarity information. CEDL is composed of 5 main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co-embedded space that preserves higher-order, neighbor-based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model where documents along with their cluster memberships are used as training instances. Subsequently, the final cluster labels are selected from the topic phrases with a ranking scheme that uses multiple scores based on the extracted co-embedding information and the discriminant output. The final stage polishes the initial clusters to reduce noise and accommodate the multitopic nature of documents. The effectiveness and competitiveness of CEDL are demonstrated qualitatively and quantitatively with experiments using document databases from different application fields.
KW - machine learning
KW - unsupervised clustering
KW - natural language processing
KW - text mining
KW - information retrieval
UR - http://www.scopus.com/inward/record.url?scp=84975089751&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84975089751&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/0a344cc9-d994-32e4-bd20-924e426346bd/
U2 - 10.1002/asi.23374
DO - 10.1002/asi.23374
M3 - Article (journal)
SN - 2330-1635
VL - 67
SP - 106
EP - 133
JO - Journal of the Association for Information Science and Technology
JF - Journal of the Association for Information Science and Technology
IS - 1
ER -