The Lancaster Corpus of Mandarin Chinese: A corpus for monolingual and contrastive language study

T. McEnery, R. Xiao

Research output: Contribution to conference (Paper)

50 Citations (Scopus)

Abstract

This paper presents the newly released Lancaster Corpus of Mandarin Chinese (LCMC), a Chinese match for the FLOB and Frown corpora of British and American English. LCMC is a one-million-word balanced corpus of written Mandarin Chinese. The corpus contains five hundred 2,000-word samples of written Chinese texts sampled from fifteen text categories published in Mainland China around 1991, totalling one million words. LCMC is XML-compliant and conforms to CES, with each document containing a corpus header giving general information about the corpus and a body of text. The corpus is segmented and POS tagged with a tagging precision rate of over 98%. The corpus is a useful resource for research into modern Chinese as well as the cross-linguistic contrast between English and Chinese.
Original language: English
Publication status: Published - 2004
Event: 4th International Conference on Language Resources and Evaluation - Lisbon, Portugal
Duration: 26 May 2004 to 28 May 2004

Cite this

McEnery, T., & Xiao, R. (2004). The Lancaster Corpus of Mandarin Chinese: A corpus for monolingual and contrastive language study. Paper presented at the 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal.