Keyness: Appropriate metrics and practical issues

Costas Gabrielatos, Anna Marchi

Research output: Contribution to conference (Paper)

Abstract

In this paper we examine the definitions of two widely used, interrelated constructs in corpus linguistics, keyness and keywords, as presented in the literature and in corpus software manuals. In particular, we focus on (a) the consistency of definitions given in different sources; (b) the metrics used to calculate the level of keyness; and (c) the compatibility between definitions and metrics. Our survey of studies employing keyword analysis indicates that the vast majority examine only a subset of keywords, almost always the top X keywords as ranked by the metric used. This makes the choice of an appropriate metric central to any study using keyword analysis. We first argue that an appropriate, and therefore useful, metric for keyness needs to be fully consistent with the definition of keyword. We then use four sets of comparisons between corpora of different types and sizes to test whether, and to what extent, the use of different metrics affects the ranking of keywords. More precisely, we look at the extent of overlap in the keyword rankings resulting from different metrics, and we discuss the implications of basing a ranking-based analysis on one metric or another. Finally, we propose a new metric for keyness and demonstrate a simple way to calculate it, which supplements the keyword extraction in existing corpus software.
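The contrast the abstract draws between statistical significance and effect size can be illustrated with a small sketch. It compares two rankings of the same candidate keywords: one by log-likelihood (a significance statistic, here in the two-term form commonly used in corpus tools) and one by the percentage difference of normalised frequencies (one common effect-size measure). The abstract does not name the metric the authors propose, so nothing here should be read as their method. The word counts, corpus sizes, and all function names are invented for illustration.

```python
import math


def log_likelihood(freq_study, freq_ref, size_study, size_ref):
    """Two-term log-likelihood (G2) for one word: a significance statistic."""
    total = freq_study + freq_ref
    combined_size = size_study + size_ref
    expected_study = size_study * total / combined_size
    expected_ref = size_ref * total / combined_size
    g2 = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def percent_diff(freq_study, freq_ref, size_study, size_ref):
    """Percentage difference of normalised frequencies: an effect-size measure."""
    norm_study = freq_study / size_study
    norm_ref = freq_ref / size_ref
    if norm_ref == 0:
        return math.inf  # assumption: treat "absent from reference corpus" as unbounded
    return (norm_study - norm_ref) * 100 / norm_ref


def top_n(scores, n):
    """Return the n highest-scoring words, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def ranking_overlap(list_a, list_b):
    """Share of items in list_a that also appear in list_b."""
    return len(set(list_a) & set(list_b)) / len(list_a)


# Invented toy data: word -> (frequency in study corpus, frequency in reference corpus)
SIZE_STUDY, SIZE_REF = 100_000, 1_000_000
counts = {
    "the": (6000, 58000),
    "crisis": (120, 300),
    "keyness": (15, 5),
    "said": (400, 4200),
}

ll_scores = {w: log_likelihood(fs, fr, SIZE_STUDY, SIZE_REF) for w, (fs, fr) in counts.items()}
pd_scores = {w: percent_diff(fs, fr, SIZE_STUDY, SIZE_REF) for w, (fs, fr) in counts.items()}

for n in (1, 2):
    by_ll, by_pd = top_n(ll_scores, n), top_n(pd_scores, n)
    print(n, by_ll, by_pd, ranking_overlap(by_ll, by_pd))
```

With these invented counts the two metrics disagree about the top-ranked word: log-likelihood favours the more frequent word, while the percentage difference favours the word with the larger relative difference. This is the kind of ranking divergence that matters when only the top X keywords are analysed.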
Original language: English
Publication status: Published - 2012
Event: Corpus-assisted Discourse Studies International Conference - University of Bologna, Italy
Duration: 13 Sep 2012 to 14 Sep 2012


Keywords

  • corpus linguistics
  • keyness
  • keywords
  • metrics
  • frequency difference
  • effect size
  • statistical significance

Cite this

Gabrielatos, C., & Marchi, A. (2012). Keyness: Appropriate metrics and practical issues. Paper presented at Corpus-assisted Discourse Studies International Conference, Italy.
@conference{f553d2621f30471aadcc0eb4356014f8,
title = "Keyness: Appropriate metrics and practical issues",
keywords = "corpus linguistics, keyness, keywords, metrics, frequency difference, effect size, statistical significance",
author = "Costas Gabrielatos and Anna Marchi",
note = "This is a revised version of: Gabrielatos, C. & Marchi, A. (2011). Keyness: Matching metrics to definitions. Invited presentation. Corpus Linguistics in the South: Theoretical-methodological challenges in corpus approaches to discourse studies - and some ways of addressing them. University of Portsmouth, 5 November 2011. [http://repository.edgehill.ac.uk/4100/]; Corpus-assisted Discourse Studies International Conference; Conference date: 13-09-2012 through 14-09-2012",
year = "2012",
language = "English",
}


