Two-layer classification and distinguished representations of users and documents for grouping and authorship identification

Haytham Mohtasseb*, Amr Ahmed

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference proceeding (ISBN); peer-review

4 Citations (Scopus)

Abstract

Most studies on authorship identification report a drop in identification performance when the number of authors exceeds 20-25. In this paper, we introduce a new user representation to address this problem and split classification across two layers. The paper offers at least three novel contributions. First, the two-layer approach allows authorship identification to be applied to a larger number of authors (tested over 100 authors), and it is extensible. The authors are divided into groups, each containing a smaller number of authors. Given an anonymous document, the primary layer detects the group to which the document belongs. The secondary layer then determines the particular author within the selected group. To extract groups that link similar authors, clustering is applied over users rather than documents. Hence, the second contribution is a new user representation that differs from the document representation. Without the proposed user representation, clustering over documents would scatter each author's documents across several clusters instead of giving each author a single cluster membership. Third, the extracted clusters are descriptive and meaningful for their users, because the dimensions have psychological backgrounds. For authorship identification, the documents are labelled with the extracted groups and fed into machine learning algorithms to build classification models that predict the group and the author of a given document. The results show that the documents are highly correlated with their corresponding extracted groups, and that the proposed model can be accurately trained to determine the group and the author identity.
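
The two-layer flow described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify the features, the clustering algorithm, or the classifiers. The sketch assumes that each user is represented by the mean of their document vectors, that users are grouped with a plain k-means, and that both layers are nearest-centroid classifiers. All names (`train`, `predict`, `cluster_users`, the toy authors and feature values) are hypothetical.

```python
from collections import defaultdict
from math import dist


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def nearest(point, centroids):
    """Return the label whose centroid is closest to point."""
    return min(centroids, key=lambda label: dist(point, centroids[label]))


def cluster_users(user_vecs, seeds, rounds=10):
    """Plain k-means over user vectors; seeds are user ids used as initial centres."""
    centres = {g: user_vecs[u] for g, u in enumerate(seeds)}
    for _ in range(rounds):
        assign = {u: nearest(v, centres) for u, v in user_vecs.items()}
        members = defaultdict(list)
        for u, g in assign.items():
            members[g].append(user_vecs[u])
        centres = {g: centroid(vs) for g, vs in members.items()}
    return assign


def train(docs, seeds):
    """docs is a list of (author, feature_vector) pairs."""
    by_author = defaultdict(list)
    for author, vec in docs:
        by_author[author].append(vec)
    # Assumed user representation: mean of the author's document vectors.
    user_vecs = {a: centroid(vs) for a, vs in by_author.items()}
    # Clustering runs over users, so each author lands in exactly one group.
    group_of = cluster_users(user_vecs, seeds)
    # Primary layer: documents are labelled with their author's group.
    by_group = defaultdict(list)
    for author, vec in docs:
        by_group[group_of[author]].append(vec)
    primary = {g: centroid(vs) for g, vs in by_group.items()}
    # Secondary layer: one small author classifier per group.
    secondary = defaultdict(dict)
    for author, g in group_of.items():
        secondary[g][author] = user_vecs[author]
    return primary, secondary


def predict(model, vec):
    """Pick the group first, then the author inside that group."""
    primary, secondary = model
    group = nearest(vec, primary)
    return group, nearest(vec, secondary[group])


# Toy data: four hypothetical authors with two-dimensional features.
docs = [
    ("ann", (0.0, 0.0)), ("ann", (0.0, 1.0)),
    ("bob", (2.0, 0.0)), ("bob", (2.0, 1.0)),
    ("cat", (10.0, 10.0)), ("cat", (10.0, 11.0)),
    ("dan", (12.0, 10.0)), ("dan", (12.0, 11.0)),
]
model = train(docs, seeds=["ann", "cat"])
print(predict(model, (1.8, 0.4)))    # (0, 'bob')
print(predict(model, (10.2, 10.6)))  # (1, 'cat')
```

Because grouping is done on one vector per user, every author belongs to a single group, and the secondary classifier only has to separate the few authors inside that group.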

Original language: English
Title of host publication: Proceedings - 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems, ICIS 2009
Pages: 651-657
Number of pages: 7
Publication status: Published - 22 Nov 2009
Event: 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems, ICIS 2009 - Shanghai, China
Duration: 20 Nov 2009 to 22 Nov 2009

Publication series

Name: Proceedings - 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems, ICIS 2009
Volume: 1

Conference

Conference: 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems, ICIS 2009
Country/Territory: China
City: Shanghai
Period: 20/11/09 to 22/11/09

Keywords

  • Authorship identification
  • Keywords extraction
  • Personal blogs
  • Similarity detection
  • Users lexicon and representation
