Descriptive document clustering aims to automatically discover groups of semantically related documents and to assign a meaningful label that characterises the content of each cluster. In this paper, we present a descriptive clustering approach that employs a distributed representation model, namely the paragraph vector model, to capture semantic similarities between documents and phrases. The proposed method uses a joint representation of phrases and documents (i.e., a co-embedding) to automatically select a descriptive phrase that best represents each document cluster. We evaluate our method by comparing its performance to an existing state-of-the-art descriptive clustering method that also uses co-embedding but relies on a bag-of-words representation. Results obtained on benchmark datasets demonstrate that the paragraph vector-based method outperforms the existing approach in both identifying clusters and assigning appropriate descriptive labels to them.
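The label-selection idea in the abstract can be illustrated with a minimal sketch. It assumes documents and phrases have already been embedded in a shared vector space (for example by a paragraph vector model) and that cluster assignments are given. The toy vectors, the phrase names, the helper `label_clusters`, and the choice of picking the phrase closest to the cluster centroid by cosine similarity are all illustrative assumptions, not the selection criterion used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def label_clusters(doc_vecs, assignments, phrase_vecs):
    """Pick, for each cluster, the phrase nearest to the cluster centroid.

    Assumption: nearest-to-centroid is a stand-in for the paper's
    phrase-selection step, which the abstract does not specify.
    """
    clusters = {}
    for vec, cluster_id in zip(doc_vecs, assignments):
        clusters.setdefault(cluster_id, []).append(vec)

    labels = {}
    for cluster_id, vecs in clusters.items():
        center = centroid(vecs)
        labels[cluster_id] = max(
            phrase_vecs, key=lambda phrase: cosine(phrase_vecs[phrase], center)
        )
    return labels


# Hypothetical 2-D co-embedding: documents and phrases share one space.
doc_vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
assignments = [0, 0, 1, 1]
phrase_vecs = {
    "protein folding": [1.0, 0.0],
    "stock markets": [0.0, 1.0],
    "general topics": [0.7, 0.7],
}

labels = label_clusters(doc_vecs, assignments, phrase_vecs)
print(labels)
```

Because phrases and documents live in the same space, a single similarity measure can compare a phrase directly with a group of documents, which is what makes a co-embedding convenient for labelling.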
|Publication status||Accepted/In press - 3 Dec 2016|
|Event||15th Conference of the European Chapter of the Association for Computational Linguistics - Valencia, Spain|
|Duration||3 Apr 2017 → 7 Apr 2017|
Sato, M., Brockmeier, A. J., Kontonatsios, G., Mu, T., Goulermas, J., Tsujii, J., & Ananiadou, S. (Accepted/In press). Distributed Document and Phrase Co-embeddings for Descriptive Clustering. Paper presented at 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain. https://doi.org/10.18653/v1/e17-1093