Title: Predicting the 'Chunkiness' of Sequences in Spoken English. An Application of CART Trees and Random Forests in Psycholinguistics

Author: Ulrike Schneider

Affiliation: Johannes-Gutenberg-Universität Mainz

Abstract:

Usage-based theories in linguistics postulate that, through frequent repetition, sequences of sounds, words and constructions may develop into routine behaviours. Thus frequently used good morning and how are you? are routines which can be uttered faster and more fluently than novel or rarely used combinations of words. Such routinised or entrenched sequences are also referred to as 'chunks'. How such chunks are mentally represented and what causes them to form are hotly debated topics in current psycholinguistics.

My presentation focusses mainly on one particular theory of chunking, namely Bybee's (2010) Linear Fusion Hypothesis, which postulates that chunking strength gradually increases the more frequently a sequence is used and that "more frequent chunks [...] are easily accessible as wholes" (Bybee 2010:36), which means that they are best modelled as single words with a single, combined entry in the mental lexicon. Bybee crucially rejects approaches such as Collostructional Analysis (Stefanowitsch and Gries 2003), which rely on relative chances of co-occurrence. She maintains that the absolute co-occurrence frequency of the words is the "most important factor" (Bybee 2010:97). However, based on absolute usage frequency alone, we would expect relatively meaningless sequences like and I or in the to be among the most strongly chunked combinations. Only relative measures of co-occurrence, like transitional probabilities or the mutual information score, predict the chunking of good morning and how are you?.

I test these conflicting claims in an analysis of the placement of over 11,000 hesitations in spontaneous conversations. Speakers should not need to hesitate within chunks and, in fact, avoid doing so; hesitations should therefore be shifted to less 'chunky' locations in the vicinity. This allows me to compare whether hesitation placement is best predicted by absolute co-occurrence frequency, by transitional probabilities, by the mutual information score (MI) or by lexical gravity G (cf. Daudaravicius and Marcinkeviciene 2004; Gries and Mukherjee 2010).

I show that for these types of analyses, CART trees and random forests (more precisely ctree and cforest as implemented in the party package for R; Hothorn et al. 2006; Strobl et al. 2007, 2008, 2009) are an ideal tool for linguists (cf. also Tagliamonte and Baayen 2012), not only because they can handle multinomial outcomes and necessarily correlated predictors, but also because the clusters created by the individual trees provide objectively justified groupings, which can be examined further in subsequent steps of the analysis. In this way, the analyst does not lose sight of the linguistic constructions 'behind' the frequencies.
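By way of illustration, the following is a minimal R sketch of how the competing bigram measures could be computed: absolute co-occurrence frequency, the forward transitional probability P(w2 | w1), and the (pointwise) mutual information score log2(P(w1 w2) / (P(w1) P(w2))). The toy token vector and object names are invented for illustration and are not the study's data; lexical gravity G, which additionally takes the number of collocate types into account, is omitted here for brevity.

    ## toy corpus as a flat token vector (hypothetical, for illustration only)
    tokens <- c("good", "morning", "how", "are", "you", "and", "i",
                "in", "the", "good", "morning", "and", "i", "in", "the")

    ## build the bigram list
    bigrams <- data.frame(w1 = head(tokens, -1), w2 = tail(tokens, -1))
    N   <- nrow(bigrams)                          # number of bigram tokens
    f12 <- table(paste(bigrams$w1, bigrams$w2))   # absolute co-occurrence frequency
    f1  <- table(bigrams$w1)                      # frequency of the first word
    f2  <- table(bigrams$w2)                      # frequency of the second word

    measures <- data.frame(bigram = names(f12), freq = as.vector(f12))
    w1 <- sub(" .*", "", measures$bigram)         # first word of each bigram
    w2 <- sub(".* ", "", measures$bigram)         # second word of each bigram

    ## forward transitional probability P(w2 | w1)
    measures$tp <- measures$freq / as.vector(f1[w1])

    ## pointwise mutual information: log2( P(w1 w2) / (P(w1) * P(w2)) )
    measures$mi <- log2((measures$freq / N) /
                        ((as.vector(f1[w1]) / N) * (as.vector(f2[w2]) / N)))

    measures[order(-measures$mi), ]

On such counts, high-frequency but semantically light pairs like in the score well on raw frequency, whereas relative measures such as TP and MI reward pairs like good morning; this is exactly the contrast the analysis puts to the test.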
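The tree-based modelling itself can be sketched as follows. This is only a schematic example with simulated data and invented variable names (hesitation, freq, tp, mi, gravity), not the actual analysis, but it shows the ctree/cforest workflow from the party package, including the conditional variable importance that remains interpretable when predictors are correlated.

    library(party)

    ## simulated stand-in data: placement of a hesitation relative to a bigram,
    ## plus the four competing co-occurrence measures (all values are random)
    set.seed(42)
    d <- data.frame(
      hesitation = factor(sample(c("before", "within", "after"), 500, replace = TRUE)),
      freq       = rpois(500, 50),     # absolute co-occurrence frequency
      tp         = runif(500),         # transitional probability
      mi         = rnorm(500, 3, 2),   # mutual information score
      gravity    = rnorm(500, 5, 2)    # lexical gravity G
    )

    ## single conditional inference tree: the terminal nodes provide
    ## objectively justified groupings that can be inspected further
    tree <- ctree(hesitation ~ freq + tp + mi + gravity, data = d)
    plot(tree)

    ## conditional random forest with conditional variable importance,
    ## which is designed to cope with correlated predictors
    forest <- cforest(hesitation ~ freq + tp + mi + gravity, data = d,
                      controls = cforest_unbiased(ntree = 200, mtry = 2))
    varimp(forest, conditional = TRUE)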
References:

bwGRiD, member of the German D-Grid initiative, funded by the Ministry for Education and Research (Bundesministerium für Bildung und Forschung) and the Ministry for Science, Research and Arts Baden-Wuerttemberg (Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg). http://www.bw-grid.de.

Bybee, Joan (2010): Language, Usage, and Cognition. Cambridge: CUP.

Daudaravicius, Vidas and Ruta Marcinkeviciene (2004): "Gravity Counts for the boundaries of collocations." International Journal of Corpus Linguistics 9 (2). 321-48.

Gries, Stefan Th. and Joybrato Mukherjee (2010): "Lexical gravity across varieties of English: An ICE-based study of n-grams in Asian Englishes." International Journal of Corpus Linguistics 15 (4). 520-48.

Hothorn, Torsten, Kurt Hornik and Achim Zeileis (2006): "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15 (3). 651-74.

R Development Core Team (2009): R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. http://www.R-project.org.

Stefanowitsch, Anatol and Stefan Th. Gries (2003): "Collostructions: Investigating the interaction of words and constructions." International Journal of Corpus Linguistics 8 (2). 209-43.

Strobl, Carolin, Anne-Laure Boulesteix, Achim Zeileis and Torsten Hothorn (2007): "Bias in random forest variable importance measures: Illustrations, sources and a solution." BMC Bioinformatics 8 (25).

Strobl, Carolin, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin and Achim Zeileis (2008): "Conditional variable importance for random forests." BMC Bioinformatics 9 (307).

Strobl, Carolin, James Malley and Gerhard Tutz (2009): "An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests." Psychological Methods 14 (4). 323-48.

Tagliamonte, Sali A. and R. Harald Baayen (2012): "Models, forests, and trees of York English: Was/were variation as a case study for statistical practice." Language Variation and Change 24. 135-78.