The HeCz corpus: A large, richly annotated reading corpus of newspaper headlines in Czech

  • Jan Chromý*
  • Markéta Ceháková
  • James Brand
  • *Corresponding author for this work

Research output: Contribution to journal › Article (journal) › peer-review

Abstract

Large behavioral datasets that provide detailed data on reading processes are valuable resources for researchers working in linguistics, psychology and cognitive science. This paper presents the HeCz corpus, which comprises self-paced reading data for 1919 newspaper headlines (23,634 words) in Czech. Each headline is accompanied by a yes–no comprehension question, resulting in a rich dataset of reading times for each individual word and comprehension accuracy. The corpus is novel in terms of the sheer scale of data collection: 1872 native Czech speakers each read approximately 120 headlines, and 1162 of those participants also completed the experiment again in a re-testing session using the same stimuli approximately 1 month later. Participant-level metadata are also available, covering basic demographic information, reading habits and a profile of mood state prior to completing the experiment. Beyond the behavioral and demographic data, we also include linguistic annotations for several variables, e.g., frequency, surprisal, and morphological tagging. To better understand how these variables might impact processing, we present exploratory analyses in which we predicted the reading times for words, with the results indicating important roles for linguistic, demographic, and methodological variables. Given the range of multidisciplinary applications of the HeCz corpus, we hope that it will provide a valuable and unprecedented resource for research on reading processes.
Original language: English
Pages (from-to): 1-18
Number of pages: 18
Journal: Behavior Research Methods
Volume: 57
Issue number: 12
Early online date: 14 Nov 2025
Publication status: E-pub ahead of print - 14 Nov 2025

Keywords

  • Reading
  • Sentence processing
  • Mood
  • Morphology
  • Czech
  • Corpus
  • Humans
  • Male
  • Comprehension
  • Young Adult
  • Newspapers as Topic
  • Czech Republic
  • Linguistics
  • Adult
  • Female
  • Psycholinguistics
