The Lancaster Corpus of Mandarin Chinese: A corpus for monolingual and contrastive language study

T. McEnery, R. Xiao

Research output: Contribution to conference (Paper)

46 Citations (Scopus)

Abstract

This paper presents the newly released Lancaster Corpus of Mandarin Chinese (LCMC), a Chinese match for the FLOB and Frown corpora of British and American English. LCMC is a one-million-word balanced corpus of written Mandarin Chinese: it contains five hundred 2,000-word samples of written Chinese texts, drawn from fifteen text categories and published in Mainland China around 1991. LCMC is XML-compliant and conforms to the Corpus Encoding Standard (CES), with each document containing a corpus header giving general information about the corpus and a body of text. The corpus is segmented and POS-tagged, with a tagging precision rate of over 98%. It is a useful resource for research into modern Chinese as well as for cross-linguistic contrast between English and Chinese.
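
As a rough illustration of the layout the abstract describes (a header plus a body of segmented, POS-tagged text), the following Python sketch parses a small made-up XML fragment and checks the stated size arithmetic. The element names (text, header, body, s, w), the pos attribute, and the tag values are assumptions chosen for illustration only; they are not taken from the LCMC documentation, and the sample sentence is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names, the "pos" attribute, and the tag
# values are placeholders, NOT the documented LCMC schema or tagset.
SAMPLE = """<text id="A">
  <header><title>Illustrative sample</title></header>
  <body>
    <s n="1"><w pos="r">我</w><w pos="v">喜欢</w><w pos="n">语言学</w></s>
  </body>
</text>"""


def tagged_tokens(xml_string):
    """Return (word, pos) pairs for every <w> element in the fragment."""
    root = ET.fromstring(xml_string)
    return [(w.text, w.get("pos")) for w in root.iter("w")]


# Size arithmetic stated in the abstract: 500 samples of 2,000 words each.
SAMPLES = 500
WORDS_PER_SAMPLE = 2000
total_words = SAMPLES * WORDS_PER_SAMPLE

tokens = tagged_tokens(SAMPLE)

if __name__ == "__main__":
    print(total_words)
    for word, pos in tokens:
        print(word, pos)
```

Because segmentation is already encoded in the markup, a reader of such a file would not need a separate word segmenter; the actual element and attribute names should be taken from the corpus header and accompanying documentation.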
Original language: English
Publication status: Published - 2004
Event: 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal
Duration: 26 May 2004 to 28 May 2004

Conference

Conference: 4th International Conference on Language Resources and Evaluation
Country: Portugal
City: Lisbon
Period: 26/05/04 to 28/05/04

Fingerprint

Language Studies
Contrastive
Mandarin Chinese
British English
Mainland China
Tagging
Resources
American English

Cite this

McEnery, T., & Xiao, R. (2004). The Lancaster Corpus of Mandarin Chinese: A corpus for monolingual and contrastive language study. Paper presented at 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal.