Abstract
Social media are well-established means of online communication, generating vast amounts of data. In this paper, we focus on Twitter and investigate behavioural differences between male and female users on social media. Using Natural Language Processing and Machine Learning approaches, we propose a user gender identification method that considers both the tweets and the Twitter profile description of a user. For experimentation and evaluation, we enriched and used an existing Twitter User Gender Classification dataset, which is freely available on Kaggle. We considered a variety of methods and components, such as the Bag of Words model, pre-trained word embeddings (GloVe, BERT, GPT-2 and Word2Vec) and machine learners, e.g., Naïve Bayes, Support Vector Machines and Random Forests. Evaluation results show that including the Twitter profile description of a user significantly improves gender classification accuracy, by approximately 10%. Stanford’s GloVe embedding model, pre-trained on 2 billion tweets (27 billion tokens, with a vocabulary of 1.2 million words), achieved the highest gender prediction accuracy when both the tweets and the profile description of a user were considered. Statistical significance was assessed using McNemar’s two-tailed test.
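As an illustration of the significance testing mentioned in the abstract, the following is a minimal sketch of a two-tailed McNemar test on paired classifier predictions. It assumes the exact (binomial) variant; the abstract does not state whether the exact or the chi-squared form was used. The function name `mcnemar_exact` and the toy labels are hypothetical and are not taken from the paper or its dataset.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-tailed exact McNemar test for two classifiers on the same items.

    Assumption: exact binomial variant; the paper may use another form.
    Returns (b, c, p_value), where b counts items only classifier A got
    right and c counts items only classifier B got right.
    """
    rows = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, p in rows if a == t and p != t)
    c = sum(1 for t, a, p in rows if a != t and p == t)
    n = b + c
    if n == 0:
        # The classifiers never disagree in correctness: no evidence either way.
        return b, c, 1.0
    # Under the null hypothesis the discordant pairs split 50/50, so the
    # smaller count follows Binomial(n, 0.5); double the lower tail.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy data: 10 users, classifier A correct on all of them,
# classifier B correct on the last two only.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [0] * 8 + [1] * 2

b, c, p = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p)  # 8 0 0.0078125
```

Only the discordant pairs (items where exactly one classifier is correct) enter the test, which is why it suits comparing two feature sets, such as tweets alone against tweets plus profile description, evaluated on the same users.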
Original language | English |
---|---|
Article number | 100018 |
Pages (from-to) | 1-9 |
Journal | Natural Language Processing Journal |
Volume | 4 |
Early online date | 9 Jun 2023 |
Publication status | Published - 30 Sept 2023 |
Keywords
- Gender prediction
- Gender classification
- Machine Learning
- Natural Language Processing
- Pre-trained word embeddings
Research Centres
- Data and Complex Systems Research Centre
Research Groups
- SustainNET