Abstract
Large behavioral datasets that provide detailed data on reading processes are valuable resources for researchers in linguistics, psychology, and cognitive science. This paper presents the HeCz corpus, which comprises self-paced reading data for 1919 Czech newspaper headlines (23,634 words). Each headline is accompanied by a yes–no comprehension question, yielding a rich dataset of word-by-word reading times and comprehension accuracy. The corpus is novel in the scale of its data collection: 1872 native Czech speakers each read approximately 120 headlines, and 1162 of those participants completed the experiment again in a re-testing session with the same stimuli approximately 1 month later. Participant-level metadata are also available, covering basic demographic information, reading habits, and a profile of each participant's mood state prior to completing the experiment. Beyond the behavioral and demographic data, we include linguistic annotations for several variables, e.g., frequency, surprisal, and morphological tags. To better understand how these variables might affect processing, we present exploratory analyses predicting word reading times; the results indicate important roles for linguistic, demographic, and methodological variables. Given the range of multidisciplinary applications of the HeCz corpus, we hope that it will provide a valuable and unprecedented resource for research on reading processes.
| Original language | English |
|---|---|
| Pages (from-to) | 1-18 |
| Number of pages | 18 |
| Journal | Behavior Research Methods |
| Volume | 57 |
| Issue number | 12 |
| Early online date | 14 Nov 2025 |
| Publication status | E-pub ahead of print - 14 Nov 2025 |
Keywords
- Reading
- Sentence processing
- Mood
- Morphology
- Czech
- Corpus
- Humans
- Male
- Comprehension
- Young Adult
- Newspapers as Topic
- Czech Republic
- Linguistics
- Adult
- Female
- Psycholinguistics
Fingerprint
Dive into the research topics of 'The HeCz corpus: A large, richly annotated reading corpus of newspaper headlines in Czech'. Together they form a unique fingerprint.