Gender prediction with descriptive textual data using a Machine Learning approach

Research output: Contribution to journal › Article (journal) › peer-review



Social media are well-established means of online communication and generate vast amounts of data. In this paper, we focus on Twitter and investigate behavioural differences between male and female users. Using Natural Language Processing and Machine Learning approaches, we propose a user gender identification method that considers both the tweets and the Twitter profile description of a user. For experimentation and evaluation, we enriched and used an existing Twitter User Gender Classification dataset, which is freely available on Kaggle. We considered a variety of methods and components, such as the Bag of Words model, pre-trained word embeddings (GLOVE, BERT, GPT2 and Word2Vec) and machine learners, e.g., Naïve Bayes, Support Vector Machines and Random Forests. Evaluation results show that including the Twitter profile description of a user significantly improves gender classification accuracy, by approximately 10%. Stanford’s GLOVE embedding model, pre-trained on 2 billion tweets (27 billion tokens, with a vocabulary of 1.2 million words), achieved the highest gender prediction accuracy when both the tweets and the profile description of a user were considered. Statistical significance was assessed using McNemar’s two-tailed test.
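As an illustration of the significance testing mentioned above, the following is a minimal sketch of a two-tailed McNemar test on paired classifier predictions. The abstract does not say which variant of the test was used, so the exact (binomial) form here is an assumption, and the function name `mcnemar_exact` and its arguments are hypothetical rather than taken from the paper.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-tailed McNemar test for two classifiers on the same items.

    Returns (b, c, p_value), where b counts items that only classifier A
    labelled correctly and c counts items that only classifier B labelled
    correctly. Items where both agree in correctness do not affect the test.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all three sequences must have the same length")

    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)

    n = b + c
    if n == 0:
        # No discordant pairs: the data give no evidence of a difference.
        return b, c, 1.0

    # Under the null hypothesis, b follows Binomial(n, 0.5).
    # Two-tailed p-value: twice the smaller tail, capped at 1.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)
```

In a setting like the one described, `pred_a` could hold the predictions of a model trained on tweets only and `pred_b` those of a model that also uses the profile description, both evaluated on the same test users. A small p-value would suggest that the difference in accuracy is unlikely to be due to chance.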
Original language: English
Article number: 100018
Pages (from-to): 1-9
Journal: Natural Language Processing Journal
Early online date: 9 Jun 2023
Publication status: Published - 30 Sept 2023


Keywords

  • Gender prediction
  • Gender classification
  • Machine Learning
  • Twitter
  • Natural Language Processing
  • Pre-trained word embeddings

Research Centres

  • Data and Complex Systems Research Centre

Research Groups

  • SustainNET

