
# Challenges in detecting evolutionary forces in language change using diachronic corpora

## Abstract

Newberry et al. (Detecting evolutionary forces in language change, Nature 551, 2017) tackle an important but difficult problem in linguistics, the testing of selective theories of language change against a null model of drift. Having applied a test from population genetics (the Frequency Increment Test) to a number of relevant examples, they suggest stochasticity has a previously under-appreciated role in language evolution. We replicate their results and find that while the overall observation holds, results produced by this approach on individual time series can be sensitive to how the corpus is organized into temporal segments (binning). Furthermore, we use a large set of simulations in conjunction with binning to systematically explore the range of applicability of the Frequency Increment Test. We conclude that care should be exercised with interpreting results of tests like the Frequency Increment Test on individual series, given the researcher degrees of freedom available when applying the test to corpus data, and fundamental differences between genetic and linguistic data. Our findings have implications for selection testing and temporal binning in general, as well as demonstrating the usefulness of simulations for evaluating methods newly introduced to the field.
How to Cite: Karjus, A., Blythe, R. A., Kirby, S., & Smith, K. (2020). Challenges in detecting evolutionary forces in language change using diachronic corpora. Glossa: A Journal of General Linguistics, 5(1), 45. DOI: http://doi.org/10.5334/gjgl.909
Published on 07 May 2020
Submitted on 31 Jan 2019. Accepted on 31 Oct 2019.

## 1 Introduction

All natural languages change over time. The way each new generation of speakers pronounces their words is subtly different from that of their parents, new words replace old ones, marginal grammatical paradigms become the norm, and norms dissolve. Many authors have suggested that language change, like other evolutionary processes, involves both directed selection and stochastic drift (Sapir 1921; Jespersen 1922; Andersen 1990; McMahon 1994; Croft 2000; Baxter et al. 2006; Van de Velde 2014; Steels & Szathmáry 2018). Systematically quantifying the relative contribution of these two processes, particularly with reference to individual time series, is an open problem.

There are a number of ways in which selective biases may influence language change. For example various cognitive biases have been postulated as important in the evolution of language (Haspelmath 1999; Croft 2000; Kirby, Cornish & Smith 2008; Fay et al. 2010; Smith, Tamariz & Kirby 2013; Enfield 2014; Tamariz et al. 2014) and one might therefore expect to see manifestations of these in instances of language change. Selective advantage stemming from sociolinguistic prestige of (the users of) competing variants has been shown to play a considerable role in change, both via competition between forms within the language community as well as borrowing from other languages (Labov 2011; Hernández-Campoy & Conde-Silvestre 2012). A foreign or novel variant may also be selected for by virtue of filling a lexical or morphosyntactic gap (McMahon 1994; Trask 1996). The form of a variant alone may convey a selective advantage. For example, it has been observed that, all other things being equal, speakers prefer shorter forms that take less effort to utter (Zipf 1949; Kanwal et al. 2017) and limited iconicity can be advantageous (Dingemanse et al. 2015). Various usage and acquisition properties have been shown to be predictors of success (Kershaw, Rowe & Stacey 2016; Calude, Miller & Pagel 2017; Grieve, Nini & Guo 2018; Monaghan & Roberts 2019). There is also evidence that certain phonetic changes are more likely than others, due to the articulatory and acoustic properties of human speech sounds (Ohala 1983; Baxter et al. 2006). In certain circumstances there may be even qualitative evidence of directed selection, such as knowledge of previous activities of some authoritative language planning body, prescriptive grammars, or other exogenous forces (Rubin et al. 1977; Anderwald 2012; Ghanbarnejad et al. 2014; Daoust 2017).

It is a reasonable hypothesis that, given adequately large and representative samples of language use over time (i.e., corpora), signatures of selection should be inferable from the usage data alone. This idea has recently been explored in a number of works (Hahn & Bentley 2003; Bentley 2008; Reali & Griffiths 2010; Blythe 2012; Sindi & Dale 2016; Amato et al. 2018), and has also been applied to domains of cumulative cultural evolution beyond language (Kandler, Wilder & Fortunato 2017; Kandler & Crema 2019). One of the more ambitious attempts is that of Newberry et al. (2017), who employ a standard method borrowed from the field of population genetics, which also deals with the inference of selection in a population and the assessment of drift in evolution. We will henceforth refer to this work as “Newberry et al.” (an earlier version of the paper is Ahern et al. 2016). They use the Frequency Increment Test (Feder, Kryazhimskiy & Plotkin 2014), or FIT for short, and make an explicit connection with the Wright-Fisher model (Wright 1931; Ewens 2004) of neutral stochastic drift (not unlike a previous similar contribution, Sindi & Dale 2016).

Newberry et al. consider three grammatical changes in the English language. Their main focus is the (ir)regularization of past-tense verbs (e.g. the change from irregular snuck to regular sneaked), a topic that has been of some interest (Lieberman et al. 2007; Cuskley et al. 2014; Gray et al. 2018). They also investigate the change in periphrastic do (say not that! becoming don’t say that!), the evolution of verbal negation (from the Old English pre-verbal to the Early Modern English post-verbal), and possible phonological neighborhood effects (which we will not discuss here). They use data from the Corpus of Historical American English (Davies 2010) and the Penn Parsed Corpora of Historical English (Kroch & Taylor 2000). Their method consists of calculating the relative frequencies of alternative forms in a corpus (e.g., the relative frequency of the irregular past tense form snuck against that of the regular sneaked), placing the count data into variable-length temporal bins, and running the FIT on the resulting time series. Ultimately, the test yields a p-value under the null hypothesis of change by drift alone. They also infer the “effective population size” of the verbs and show that the strength of drift (in a subset of verbs with a FIT p > 0.2) correlates inversely with corpus frequencies, echoing the analogous observation about small populations in genetics.

The FIT points towards selection being operative in some cases, while labelling others (in fact, most changes in past-tense forms) as changes stemming from drift. In this work, we replicate this analysis (using Newberry et al.’s original code; see the Data Availability section at the end). We highlight an important methodological issue that arises when applying the FIT to linguistic data and which should be taken into account in future applications of the FIT (and similar tests) to identify cases of selection from linguistic corpora. The key issue lies in the construction of the time series via binning counts (e.g. from a corpus), and the application of the test in question to such time series, but we also draw attention to issues more specific to diachronic language data. While the FIT may be an appropriate test in some cases, we show that an incautious application of the FIT to linguistic data can end up incorrectly identifying cases of drift as cases of selection, and missing subjectively clear cases of selection.

While the approach of applying a test of selection to corpus-based time series shows promise as a method of linguistic analysis, we believe these issues deserve further investigation. We briefly explain the technical aspects of temporal binning and the FIT in the next subsections.

### 1.1 Linguistic corpora and data binning

In quantitative research on language dynamics, words and grammatical constructions are often equated with alleles (Reali & Griffiths 2010). This analogy is motivated by the observation that a given “underlying form” may have two or more (near-)synonymous actualizations or “surface forms” (e.g. the sneaked/snuck case, both of which are actualizations of sneak.PAST). Word variants are not quite like alleles though. Organisms inherit genetic material from their parents, and one can (in principle) test for the presence of a particular allele in each individual in the population over time. In the context of language use, the notions of parents, offspring and generations are more diffuse than they are in genetics. What is done in practice when analysing time series is to construct an artificial “generation” by collecting together all instances of the word variants under consideration that fall within a specific time window (or “bin”). Particularly troublesome is the fact that a given lexeme may not occur in a given corpus in a particular period of time, which means having to widen the bin to obtain a meaningful frequency. Such absences may occur simply because of the finite size of the sample: any corpus is in the end just a sample from a population of utterances. The smaller the corpus, the smaller the chance a lexeme has to occur. An absence may also arise because people talked and wrote about other topics in that time window, which did not require the use of this particular sense. A corpus may be large, but not well balanced, in the sense that it does not cover all the relevant genres or topics of the time. Incidentally, this is a point of critique directed by Pechenick, Danforth & Dodds (2015) at another widely used diachronic corpus, the Google Books N-grams dataset.

To understand the issue of binning (or temporal segmentation) in more detail, let us consider for a moment a fictional corpus of a daily newspaper, spanning two centuries. Our goal is to count the occurrences of two competing spelling forms of a word and operationalise these as relative frequencies in a time series. The smallest possible temporal sample would consist of the text that makes up one daily issue of the paper (yielding a fine grained time series of about n = 73000 data points). One could also aggregate (bin) all the texts from one month (n = 2400), year (n = 200), decade (n = 20) or century (n = 2). However, there is no single ideal way to bin the data. A century, with only two data points, may be too large a chunk, as it may miss processes taking place in between — and it is difficult to infer anything about the dynamics of the change from two data points. A day is likely too small a sample, since the word (in either spelling) might not occur every day, unless it is a particularly commonly used one.
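The data-point counts quoted above follow from simple arithmetic. A minimal sketch, assuming a 200-year span with 365 issues per year (leap days and missing issues ignored); the variable names are illustrative only:

```python
# Number of time points obtained when a 200-year daily corpus is binned
# at different granularities (assumption: 365 issues per year, no gaps).
SPAN_YEARS = 200

points_per_bin_width = {
    "day": SPAN_YEARS * 365,
    "month": SPAN_YEARS * 12,
    "year": SPAN_YEARS,
    "decade": SPAN_YEARS // 10,
    "century": SPAN_YEARS // 100,
}

for width, n_points in points_per_bin_width.items():
    # A series of n points yields n - 1 changes between consecutive points.
    print(f"{width:>8}: n = {n_points:>6}, consecutive changes = {n_points - 1}")
```

With century-wide bins only a single change between two points remains, which is why so little can be said about the dynamics at that granularity.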

In corpus-based language research either years or decades therefore seem the most commonly used bins. Regardless, a decision has to be made regarding how to bin corpus data; our point here is to show that this decision (which potentially constitutes an additional researcher degree of freedom, since different binning decisions may yield different results) influences the outcome of analyses which use tests like the FIT to identify selection.

### 1.2 The Frequency Increment Test

The FIT (Feder, Kryazhimskiy & Plotkin 2014) belongs to a family of methods conceived to detect selection in time series genetic data, with intended application to population genetics experiments and historic DNA samples. All of them boil down to looking for certain patterns in time series of allele frequencies (Nishino 2013; Terhorst, Schlötterer & Song 2015; Schraiber, Evans & Slatkin 2016; Iranmehr et al. 2017; Taus, Futschik & Schlötterer 2017; Vlachos & Kofler 2018; see Malaspinas 2016 and Vlachos et al. 2019 for reviews). Such approaches rely on the presumption that a change driven by selection would look different, or leave different “signatures”, from a change happening due to stochastic drift.

The FIT works as follows. Relative frequencies in the range (0, 1) are transformed into frequency increments Y according to

(1)
${Y}_{i}=\left({v}_{i}-{v}_{i-1}\right)/\sqrt{2{v}_{i-1}\left(1-{v}_{i-1}\right)\left({t}_{i}-{t}_{i-1}\right)}$

where $v_i$ is the relative frequency of a variant at a measurement time $t_i$. The rationale behind this rescaling is that, under neutral evolution, the mean increment $v_i - v_{i-1}$ (i.e. the change in frequency from time $t_{i-1}$ to time $t_i$) is zero, and its variance is proportional to

(2)
$v_{i-1}\left(1-v_{i-1}\right)\left(t_i-t_{i-1}\right),$

i.e. the expected variance under drift is large when we are looking at the changes in frequency between two widely separated time points (i.e. $t_{i-1}$ and $t_i$ are far apart) or when values of $v_i$ are close to 0.5 (i.e. changes in frequency driven by drift will tend to be small when the variant is very rare and $v_i$ is close to 0, or very common and $v_i$ is close to 1).
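As a brief worked illustration of expression (2), with frequencies chosen here purely for illustration and a unit time step, the variance factor at an intermediate and at a near-boundary frequency is

$0.5\,(1-0.5)\cdot 1 = 0.25 \quad \text{versus} \quad 0.05\,(1-0.05)\cdot 1 = 0.0475,$

so the same raw change of 0.05 is rescaled by Equation (1) to

$Y = 0.05/\sqrt{2\cdot 0.25} \approx 0.071 \quad \text{versus} \quad Y = 0.05/\sqrt{2\cdot 0.0475} \approx 0.162.$

A given change near a boundary thus counts for more, after rescaling, than the same change at intermediate frequencies.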

The FIT relies on the Gaussian approximation of the Wright-Fisher diffusion process. When the variant frequency $v_i$ is not too close to either of the boundary values 0 or 1 and the time between successive measurements is sufficiently small, the random variables $Y_i$ can be approximated as having a normal distribution with a mean of zero and a variance that is inversely proportional to an effective population size (which is taken to be constant over time). Thus a test under the null hypothesis of drift amounts to a test of how likely the transformed increments $Y_i$ are under the assumption that they are drawn from a normal distribution with a mean of zero, as would be the case under drift: this can be evaluated using a one-sample t-test under the assumption of normally distributed increments with a zero mean and equal variance.

In this context, a failure to reject the null indicates a failure to reject the hypothesis of drift. On the other hand, if the null hypothesis is rejected, then the changes may be due to some non-neutral process. In this work, we check for the normality assumption using the Shapiro-Wilk test. Homoscedasticity (the assumption that the underlying distributions have equal variances) is less straightforward; we explore its relevance in the Supplementary Appendix.
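The procedure described in this subsection can be sketched in a few lines. This is a minimal illustration, not the code of Feder, Kryazhimskiy & Plotkin or of Newberry et al.: the function names are hypothetical, the example series is invented, and the two-sided p-value is obtained by numerically integrating the Student t density so that only the standard library is needed. In practice a library routine such as `scipy.stats.ttest_1samp` would be used, together with `scipy.stats.shapiro` for the normality check, which is omitted here.

```python
import math
import statistics

def rescaled_increments(freqs, times):
    """Equation (1): Y_i = (v_i - v_{i-1}) / sqrt(2 v_{i-1} (1 - v_{i-1}) (t_i - t_{i-1}))."""
    ys = []
    for i in range(1, len(freqs)):
        v_prev, v = freqs[i - 1], freqs[i]
        dt = times[i] - times[i - 1]
        if not 0.0 < v_prev < 1.0:
            raise ValueError("frequencies must lie strictly inside (0, 1)")
        ys.append((v - v_prev) / math.sqrt(2.0 * v_prev * (1.0 - v_prev) * dt))
    return ys

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value of Student's t, via Simpson's rule on the density over [0, |t|]."""
    a = abs(t)
    if a == 0.0:
        return 1.0
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: norm * (1.0 + x * x / df) ** (-(df + 1) / 2)
    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

def frequency_increment_test(freqs, times):
    """One-sample t-test of the rescaled increments against a mean of zero."""
    ys = rescaled_increments(freqs, times)
    if len(ys) < 2:
        raise ValueError("at least three time points are needed")
    n = len(ys)
    t = statistics.mean(ys) / (statistics.stdev(ys) / math.sqrt(n))
    return t, t_two_sided_p(t, n - 1)

# Invented example: a steadily rising variant sampled at unit time steps.
t_stat, p_value = frequency_increment_test(
    [0.1, 0.2, 0.35, 0.5, 0.65, 0.8, 0.9], list(range(7))
)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value here only says that the mean rescaled increment is unlikely to be zero; whether the p-value may be interpreted at all still depends on the normality and equal-variance assumptions discussed above.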

The authors of the Frequency Increment Test (Feder, Kryazhimskiy & Plotkin 2014) note that its power increases with the number of sampled time points, but also that it has low power in cases of both very weak (near-drift) and very strong selection coefficients. The latter leads to a situation where fixation to a variant happens swiftly within the sampling interval (the range of the time series), making the rest of the time series uninformative. The frequencies should also be far from absorbing boundaries (i.e., situations where one variant is at (or near) 0% and the other at 100% of the population), which might pose a particular problem in corpus-based time series analysis: since linguistic change is (classically) believed to follow an S-shaped trajectory (Blythe & Croft 2012), a change which takes place near the start or end of a given corpus would throw off the test, since most of the length of the given time series would be (near-)stationary. Similarly, if a corpus (equivalent to the “sampling period” in a genetics experiment) is too “short”, it might only chronicle a segment of a longer change process.

## 2 The FIT and binning decisions in linguistic corpora: A reanalysis of English past tense verb regularization

We focus here on the main result of Newberry et al.: the application of the FIT to time series of verb form frequencies in order to determine whether the observed patterns of change for 36 English verbs result from stochastic drift or selection. Technical data processing details described in this section are based on the Supplementary Information of Newberry et al., their code, and M. Newberry, p.c.

They construct a time series for each of 36 pre-selected verbs using 200 years of data in the Corpus of Historical American English (COHA), by counting how many times the regular past tense form occurs relative to the total number of instances of either the regular or irregular form. The yearly verb count series are then binned (grouped) into a number of variable-width quantile bins $n(b) = \lceil \ln(n(v)) \rceil$, where n(v) is the sum of both (regular and irregular) past tense form tokens of the verb counted across the entire corpus. For example, light.PAST occurs n(v) = 8869 times in the corpus, resulting in $\lceil \ln(n(v)) \rceil = 10$ bins to group the years where the verb occurs. The first bin covers the years 1810–1863 (897 tokens), the second 1864–1886 (890 tokens) and so on, up to the tenth (1994–2009, 884 tokens). Since the grouping is by years (years being the time resolution of the corpus), the exact number of tokens falling into each bin varies slightly. More frequent verbs thus get more bins (up to 13), whereas less frequent verbs get fewer bins (down to 6). For each verb in each bin, the relative frequency of its regular past tense form in [0, 1] is calculated. Since the FIT assumes relative frequencies in (0, 1), Laplace +1 smoothing is applied to count values in bins where one of the variants has no occurrences at all in this section of the corpus.
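The bin-count rule and the grouping of years into roughly equal-sized bins can be sketched as follows. This is a simple greedy approximation written for illustration, not Newberry et al.’s code; the function names and the toy yearly counts are assumptions.

```python
import math

def number_of_bins(total_tokens):
    """n(b) = ceil(ln(n(v))), with n(v) the total token count of both forms."""
    return math.ceil(math.log(total_tokens))

def quantile_bins(yearly_totals, k):
    """Greedily group consecutive years into at most k bins of roughly equal token count.

    yearly_totals: list of (year, token_count) pairs, sorted by year.
    Returns a list of bins, each a list of years.
    """
    total = sum(count for _, count in yearly_totals)
    bins, current, cumulative = [], [], 0
    for year, count in yearly_totals:
        current.append(year)
        cumulative += count
        # Close the bin once the running total reaches the next quantile boundary.
        if len(bins) < k - 1 and cumulative >= total * (len(bins) + 1) / k:
            bins.append(current)
            current = []
    if current:
        bins.append(current)
    return bins

print(number_of_bins(8869))  # the light.PAST example: ln(8869) is about 9.09

# Toy corpus that grows over time: the early bin spans more years than the late one.
growing = list(zip(range(1810, 1818), [1, 1, 2, 2, 4, 4, 8, 8]))
print(quantile_bins(growing, 2))
```

Because years are indivisible, the bins are only approximately equal in tokens, and in a corpus that grows over time the early bins span more years than the late ones.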

As discussed above in the section on corpus binning, some temporal segmentation process is necessary. The binning procedure applied by Newberry et al. is somewhat different from the more common strategy of using fixed length bins such as years or decades. The advantage of their approach is that there is guaranteed to be data in every bin (whereas a low frequency lexeme might be entirely absent in a fixed-width bin), the bins are roughly the same size in terms of tokens, and the resulting increments tend (although are not guaranteed) to be normally distributed with equal variance and less sampling noise, all of which is beneficial for the FIT (Feder, Kryazhimskiy & Plotkin 2014). It should be noted though that the resulting bins differ quite widely in their temporal granularity: in the example above, the longest bin covers the earliest 53 years of the corpus, the shortest covers the most recent 15 years, and different verbs will use different time windows depending on their frequency in the corpus. Since the COHA is smaller on the early end (fewer tokens per year) and bigger on the more recent end, variable-width bins of the verb data are systematically longer in the early 1800s compared to the 20th century ones (cf. the Supplementary Appendix for more discussion).

The series of relative frequencies based on the resulting bins are fed into the Frequency Increment Test to assess whether one may reject the null hypothesis of drift and assert that a given trajectory is therefore probably a product of selection. Newberry et al. set the FIT α = 0.05 but also report results for α = 0.2. They conduct the Shapiro-Wilk normality test on the transformed frequency increments, as the FIT assumes the increments to be normally distributed.

We replicate their original results, using their code, and furthermore explore the consequences of manipulating the size of the bins under two binning strategies: variable-width bins, $n(b) = c \ln(n(v))$, where c is an additional arbitrary constant and c = 1 recovers the Newberry et al. procedure; and fixed-width bins, each set to a fixed duration in years. We present results for both strategies.

Importantly, the fixed-width binning approach necessitates the introduction of an additional parameter: since some bins may end up with no or few occurrences of either form of a verb, we set a threshold of minimum 10 total occurrences for a relative value to be calculated in a bin; otherwise the bin is excluded before applying the FIT (hence also reducing the number of bins that make up the time series). As the FIT assumes values in (0, 1), smoothing of boundary values is required. But if there is only a single occurrence of a verb in a bin (meaning the single present form would be at 100%, the other at 0), then the +1 smoothing would force the relative value to be 50–50, which is undesirable. Similar distortions would happen with small frequency values, hence the threshold of 10. See the Supplementary Appendix for more discussion on the differences between these approaches and how different minimal frequency thresholds affect the results. A more conservative threshold (such as 100) would yield more reliable bins (and less noisy time points), but given the size of COHA, most verbs don’t have 100 occurrences per year (or some even in 5 years), which would preclude testing in shorter fixed bins.
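The fixed-width procedure with its minimum-count threshold can be sketched as below. The function names and the toy counts are hypothetical, and the smoothing rule is an assumption: only a zero count is raised to 1, which is the reading that reproduces the 50–50 outcome for a single-token bin described above.

```python
def smooth_zero_counts(regular, irregular):
    """Raise a zero count to 1 so that the relative frequency stays inside (0, 1).

    Assumption: only the absent variant is incremented.
    """
    return max(regular, 1), max(irregular, 1)

def fixed_width_series(yearly_counts, width, min_total=10):
    """Bin yearly (regular, irregular) counts into fixed-width bins.

    yearly_counts: dict mapping year -> (regular, irregular) token counts.
    Returns a list of (bin_start_year, relative_frequency_of_regular_form).
    """
    start = min(yearly_counts)
    sums = {}
    for year, (reg, irr) in yearly_counts.items():
        idx = (year - start) // width
        r, i = sums.get(idx, (0, 0))
        sums[idx] = (r + reg, i + irr)
    series = []
    for idx in sorted(sums):
        reg, irr = sums[idx]
        if reg + irr < min_total:
            continue  # sparse bin: excluded before the test is applied
        reg, irr = smooth_zero_counts(reg, irr)
        series.append((start + idx * width, reg / (reg + irr)))
    return series

# Toy counts, invented for illustration.
toy = {1810: (3, 9), 1811: (0, 2), 1815: (6, 4), 1821: (1, 0), 1825: (12, 0)}
print(fixed_width_series(toy, width=10))
print(fixed_width_series(toy, width=5))
```

With 5-year bins the single token from 1821 falls below the threshold and its bin is dropped, so the series loses a time point instead of gaining a spurious value of 0.5.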

Figure 1 shows the results of these various analyses, in terms of how many verbs (out of the 36) allow us to reject the null hypothesis of drift, given the thresholds mentioned in the original work, as well as taking into account the normality assumption of the FIT (see above). We use the Shapiro-Wilk normality test, following Newberry et al. (this test is of course subject to low power in small samples as well). Out of the 466 time series analyses summarised in Figure 1 (36 verbs times 13 binning choices, minus two series with not enough data points), 63% of the FIT p-values are eligible to be interpreted at Shapiro-Wilk α = 0.1.

Figure 1

Results of applying the FIT to time series constructed based on 200 years of COHA frequency data. The verbs are ordered by overall frequency (low on the left). The constant c determines the number of variable length bins via $n(b) = c \ln(n(v))$; c = 1 corresponds to Newberry et al.’s original results. “10 years” corresponds to a fixed bin length of 10 years, etc.; “no bin” refers to no additional binning on top of the default yearly bins in the corpus. The colour of each point corresponds to the result of the FIT on a verb time series in each binning (orange: p < 0.05, gold: 0.05 ≤ p < 0.2, light blue: p ≥ 0.2). The shape corresponds to the Shapiro-Wilk test result (filled circle: p ≥ 0.1, hollow square: p < 0.1, likely not normal), with cases of selection meeting the normality assumption highlighted by a larger circle. The column of numbers on the left displays the (rounded) median of the bins to years ratio in the given binning strategy. Only years where the verb occurs are counted (exclusion of sparse bins also leads the median in the no-binning version to be below 1). The listed variable (panel a) and fixed-width strategies (b) yield comparable binning ratios, e.g. the “c = 1” version is comparable to 20-year fixed-width. In summary, the results presented here demonstrate that the FIT is sensitive to the strategy used for binning.

We find that binning strategy does have an effect on the results, both in variable and fixed binning. Importantly, in broad strokes, the picture presented by Newberry et al. holds. They found 6 out of 36 verbs to be undergoing selection; since the majority of verbs do not give a positive signal for selection, they interpret this as indicating that language change is often primarily stochastic. Looking at a wider range of binnings, we find that in most cases, there are indeed 5 ± 2 verbs that get flagged as undergoing selection at FIT α = 0.05, consistent with their conclusion. However, the specific verbs that are flagged as undergoing selection vary depending on the binning strategy. There are 4 verbs for which selection is detected in most binning choices: light, smell, sneak and wake (incidentally the ones with the strongest inferred selection coefficient, given the original binning, cf. EDT1 in Newberry et al.). There are also between 9 and 11 verbs (in variable-width binning; depending on how stringently the normality assumption is observed) which provide a robust absence of significant indications of selection, where the FIT p-value never drops below 0.2 regardless of binning. However, for the remaining verbs the decision as to whether or not they are undergoing selection depends on the binning choices. That being said, Newberry et al. do draw attention to the fact that results of applying the FIT come with a certain margin of error and report their false discovery estimates (30% for verbs with a FIT α = 0.05, 45% at 0.2).

Given that binning leads to different sample sizes of increments for the underlying t-test, those in turn being based on differing distributions of the tokens, some variance in the p-values is to be expected (not unlike in a replication of an experiment). The interpretation of our results and the appropriate conclusion regarding the sensitivity of the FIT to binning strategy ultimately depend on one’s intention in carrying out a test of selection in the first place. If the goal is to test a large set of series to determine general tendencies, as is the case for Newberry et al., then this approach may well be good enough: the qualitative result of Newberry et al. does broadly apply in most binning strategies.

However, most individual time series seem rather sensitive to binning, in the sense that the p-values fluctuate across conventional α levels between binnings. No verbs show an unambiguous signal of selection. For example, drift is not rejected in the time series of wed using the Newberry et al. binning, while it is when the number of variable-width bins is multiplied by 2. The verb sneak is significant at α = 0.05 in almost all the variable-width binnings, but in none of the fixed length ones; awake is significant in only a single explored binning strategy (variable-width with c = 0.5) and there are 4 more such verbs particularly sensitive to binning (the 1-year bins notwithstanding).

The no-binning results (i.e., using the default 1-year bins of COHA without further binning) differ visibly from the rest, but the normality assumption is also mostly violated. Given the small and variable bin sizes (tokens per bin), the same is likely true for the homoscedasticity assumption (although how much that matters and how to set a threshold is not clear, cf. the Supplementary Appendix). Most importantly, using “default” 1-year bins leads to testing on series where the increments are often based on very small samples, which is not desirable for any statistical test.

These evaluations obviously depend on the choice of α thresholds for the FIT and the supporting normality test — for example, a more stringent FIT α would lead to more verbs being classified as unambiguous cases of drift. In any case, if the intention is to test a particular example of linguistic change for selection (something a linguist may well be interested in), things become difficult. The issue diminishes if there is sufficient data on the variants, but that does not seem to be the case for many of the verbs tested here, given the size of COHA.

All in all, these findings merit a further investigation into the inner workings of the Frequency Increment Test and its applicability to corpus-based time series, which we will conduct in the following two sections.

## 3 The behaviour of the Frequency Increment Test in artificial time series

We construct a number of artificial examples (Figure 2) to probe the behaviour of the FIT on time series of length and character similar to those investigated in the original paper (which contained between 6 and 13 time points). The FIT can be shown to yield robust results for a certain range of series (as already shown by the subset of binning-insensitive verbs in the previous section). Yet we also observe a number of scenarios — time series that could be plausibly derived from linguistic corpora — where the results of the FIT are perhaps not what one might expect, from a language science point of view. To put it another way, this is the section where we push the FIT and see if it breaks. The next section demonstrates scenarios where the results of the FIT remain robust.

Figure 2

Artificially constructed time series of fictional variant relative frequencies (thick black lines, in (0, 1)); time on the x-axis. The rescaled increments (after adjusting for absorption) are shown as dotted grey lines with dash points, and their distribution is shown on the left side as a violin plot. Points of interest discussed in this section are highlighted with red on some panels. The FIT and Shapiro-Wilk test p-values are reported in the corners. This figure depicts a number of realistic scenarios where applying the FIT would yield unexpected results, due to either the range of the time series derived from the corpus (a, b), a difference in the number of data points (c), the sensitivity of FIT to near-zero values (d, e), and how stringently the assumption of the normality of the distribution of increments is being observed (e). This figure illustrates reasons to exercise caution when applying a test like the FIT to linguistic time series.

Each series in Figure 2 may be interpreted as the percentage of a variant of some fictional linguistic element over time (after binning). We calculate the FIT p-value of each series, as well as the Shapiro-Wilk test p-values. Figure 2.a draws attention to how the temporal range of the time series (or that of the coverage of the corpus) can lead to quite different conclusions. Both 2.a.1 and 2.a.2 are different ends of the same series (the overlap highlighted with the red circle). The series, if analysed as a whole, would yield a pFIT = 0.02, but neither end on its own holds sufficient data to reject drift (nor is the FIT technically applicable, if the assumption of normality is observed). This perspective may explain the case of the purportedly drift-driven regularization of the verbs spill and burn, which are brought up in Newberry et al. as examples where drift alone is sufficient to explain the change, but which are problematic because the regular forms were already highly frequent by the early 19th century where the COHA coverage starts. spill starts out with a share of 55% regular forms in the first bin given the variable-width binning strategy; burn is at 86% regular. Under fixed decade binning, burn is 36% regular in the first bin, increasing to 62% and then to 82%, indicating a sharp increase characteristic of strong selection rather than drift (but obscured by the variable binning approach).

This example also points to a case where different evolutionary domains (genetics, language) might have different expectations about what a reasonable time-series characteristic of selection should look like. The FIT assumes the Wright-Fisher as the underlying model (reasonably so in population genetics). The long tail of near-zero values followed by a sudden increase in 2.a.1 is something that is unlikely to be observed in a Wright-Fisher model with constant selection strength parameter. However, from a linguistic point of view, this is a very natural series: a recent innovation or borrowing will be represented in the corpus as an increase preceded by a period of zero frequencies as far back as the corpus goes; this pattern could be explained as a recent change in fitness (e.g. a change in the subjective sociolinguistic prestige of a word).

A similar case is presented in Figure 2.b.1: if the time series chronicles both strong selection for one variant, and subsequent selection for the competing variant, then a blind application of the FIT will invariably indicate drift. Using only (either) half of the series as input to the test would yield a p-value indicating selection. knit is a verb undergoing a somewhat similar process, with usage spiking towards the regular (observable under finer binnings), followed by mostly irregular usage. Figure 2.b.2 is an example of the behaviour of FIT if the corpus coverage is too wide. The S-curve in the middle would yield a FIT p-value of 0.02 — in fact, it is the exact same curve as in Figure 2.c.2 (highlighted by the red dots). Yet the S being surrounded by (near-)absorption values, the FIT would indicate drift (were the test to be used despite the possible non-normality of the distribution).
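The rise-and-fall scenario can be reproduced with a toy calculation. The sketch below uses an invented symmetric series (not the one plotted in Figure 2.b.1), assumes unit time steps, and computes only the t statistic of the rescaled increments; the function names are illustrative.

```python
import math
import statistics

def rescaled_increments(freqs):
    """Equation (1) with unit time steps between consecutive points."""
    return [
        (v - v_prev) / math.sqrt(2.0 * v_prev * (1.0 - v_prev))
        for v_prev, v in zip(freqs, freqs[1:])
    ]

def t_statistic(values):
    """One-sample t statistic against a mean of zero."""
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(len(values)))

# Invented series: a variant rises under selection, then declines symmetrically.
rise = [0.1, 0.3, 0.5, 0.7, 0.9]
rise_and_fall = rise + rise[-2::-1]  # append 0.7, 0.5, 0.3, 0.1

t_full = t_statistic(rescaled_increments(rise_and_fall))
t_half = t_statistic(rescaled_increments(rise))
print(f"full series: t = {t_full:.3f}; rising half only: t = {t_half:.3f}")
```

In this symmetric toy case the positive and negative increments cancel, so the mean increment of the full series is zero up to rounding and the test cannot reject drift, whereas the rising half on its own gives a large t statistic.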

In the case of real data, the part of the time series depicting the long period of no change could in principle be clipped away. This is straightforward if the “tail” consists of zeroes, but less so given small near-boundary values. Similarly, only the part of the time series far enough from the boundaries could be analysed (keeping in mind the specifics of the FIT, see above). However, any such solutions would introduce yet another researcher degree of freedom (what part of the series to include in the analysis) (cf. Simmons, Nelson & Simonsohn 2011).

Figure 2.c further illustrates how the FIT result is affected by a change in the way the time series is operationalised (e.g., using a different number of bins). 2.c.1 and 2.c.2 are S-curves with identical parameters, differing only in length (by 2 data points). Yet their FIT p-values are notably different (see the next section for more on sensitivity to binning differences). As expected, the FIT is sensitive to small changes if the sample is small (being based on the t-test). This may explain to some extent the changes in FIT p-values of short time series, between similar binnings differing only by a few points in length (cf. Figure 1). However, fewer bins can also lead to a lower p, if it results in a less jagged time series (likely the case for e.g. burn; cf. Section 4 for the effects of binning on drift series).

The examples so far, however, have had more to do with particularities of pre-test data manipulation. Figure 2.d illustrates a property of the FIT, its sensitivity to changes near the boundaries. 2.d.1 and 2.d.2 differ only by the value of the fourth data point, but the resulting FIT p-value is quite different (and furthermore the Shapiro-Wilk test indicates departure from normality in the increment distribution due to the outlier). The issue of applicability of the FIT to series with increments departing from normality is further illustrated with the last pair of series. 2.e.1 is a typical S-curve often observed in language change, but the non-normal distribution of its increments would disallow the interpretation of the FIT p-value (that would otherwise indicate a clear case of selection).
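The near-boundary sensitivity can be made concrete with a minimal sketch, assuming the usual formulation of the FIT: frequency increments rescaled by $\sqrt{2v(1-v)\Delta t}$, followed by a one-sample t-test against a mean of zero. The function names below are illustrative and not taken from the replication code.

```python
import math

def fit_increments(freqs, times=None):
    """Rescaled frequency increments; freqs must lie strictly between 0 and 1."""
    if times is None:
        times = list(range(len(freqs)))  # assumption: equally spaced bins
    increments = []
    for i in range(1, len(freqs)):
        v_prev = freqs[i - 1]
        dt = times[i] - times[i - 1]
        # The denominator shrinks towards zero as v_prev approaches 0 or 1
        increments.append((freqs[i] - v_prev) / math.sqrt(2 * v_prev * (1 - v_prev) * dt))
    return increments

def fit_t_statistic(freqs, times=None):
    """One-sample t statistic of the rescaled increments against a mean of zero."""
    y = fit_increments(freqs, times)
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# The same absolute change of 0.01, mid-range versus near the lower boundary
mid = fit_increments([0.50, 0.51])[0]
edge = fit_increments([0.01, 0.02])[0]
print(round(edge / mid, 2))
```

Under this rescaling, an identical absolute change counts for about five times as much at a frequency of 0.01 as at 0.5, which is one way to see why a single near-boundary data point can move the test result. The p-value would then be read off a t distribution with n − 1 degrees of freedom, n being the number of increments.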

We observe that in general, for longer series exhibiting monotonic increase (characteristic of strong selection), the distribution of the increments quickly veers into the non-normal (as indicated by the Shapiro-Wilk p-value; other normality tests behave similarly; see also the Supplementary appendix). Time series composed of random values drawn from a uniform or normal distribution (or log-normal with small σ), i.e., the kind of series that should exhibit no selection, tend to have increments distributed approximately normally, as long as the series is away from the boundary values. However, the increments of S-shaped curves tend towards a bimodal distribution. Increment distributions are severely skewed when a series is shaped like an S-curve but with a sharp “bend”, when it is a straight line (linear increase or decrease), and when it includes long periods of no change.

The assumption of normality could of course be relaxed. However, we observe that this would lead to at least one additional issue, in the form of false positives stemming from the sensitivity of the FIT to small near-boundary changes, illustrated by 2.e.2. Given a long enough series of random values (here sampled from a normal distribution) with a near-zero mean and small standard deviation, the FIT often yields a small p-value (the same applies to samples from the uniform and log-normal distributions; this effect is not observed when the mean is away from the boundaries). Such series would however usefully get flagged as having non-normal increment distributions.

This is also likely why the otherwise flat-lining series for tell in Newberry et al. ends up being included in the discussion as a possible case of selection (at FIT p = 0.12, with a red flag of Shapiro-Wilk p = 0.001). Among the 12 bins of its series (under the original variable-width quantile binning procedure), it has only a few once-per-bin occurrences of regular telled after the initial three bins — a total of 4 singleton occurrences spread out over the span of a century. The +1 absorption adjustment forces the zeroes for telled in the rest of the bins to be ones as well. The observed fluctuations (and resulting FIT p-value) in the series only reflect the slightly fluctuating token frequency of tell, which ranges between 9189 and 11940 in the variable-width bins. Keeping the relative frequency value constant after the third bin instead (at the value equal to the third bin to avoid bias) would result in a FIT p = 0.21.
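The arithmetic behind this can be sketched as follows. Only the two endpoint totals (9189 and 11940) come from the text; the other totals are invented for illustration, and the adjustment is assumed to simply turn zero counts into ones.

```python
# Hypothetical total token counts of "tell" per bin; only the values
# 9189 and 11940 are from the text, the rest are made up for illustration.
totals = [9189, 10450, 11940, 9800, 11020]
telled = [0, 0, 0, 0, 0]  # no regular forms observed in these bins

# Assumed form of the +1 absorption adjustment: zero counts become ones
adjusted = [max(count, 1) for count in telled]

# Relative frequency of the regular form per bin
rel_freq = [a / n for a, n in zip(adjusted, totals)]

for n, f in zip(totals, rel_freq):
    print(n, f)
```

Although the adjusted count is constant, the relative frequency still moves from bin to bin, by a factor of up to 11940/9189 (about 1.3), purely because the denominator changes.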

These last four usages of the regular past form telled in COHA all occur in the fiction part of the corpus, all appearing to reflect the intention of the author to convey a particular kind of character (not used randomly as per a drift model). This would be an example of how an archaic variant can re-surface — quite possible in a language with a long written record, where speakers need not necessarily even directly “inherit” a variant from the previous generation. In that case, telled could be said to have been selected for, due to having increased fitness in a specific (stylistic) niche, and its usage is not due to random variation in the utterances of the speakers (or drift). However, as shown above, this possible (occasional) selection is not what the FIT is picking up on in this case, but rather simply the fluctuating frequency of tell.

Meaning change can also give rise to apparent re-emergence of variants. The occurrence of a form does not guarantee that it is being used in the same meaning or function that it had in another period or context (an implicit assumption in Newberry et al.). For example, the aforementioned spill in COHA quickly converges to the regular past tense spilled, but occasional usages of the irregular spilt still occur, yielding what appears to be a randomly fluctuating time series. On closer inspection, the latter appear to be mostly adjectival usages, not actual past tense verbs, and often turn up in the lexicalized (or “fossilized”) phrase of cry over spilt milk. Examples like that of the time series of telled and spilt, or the series in Figure 2.a.2 and e.2. may possibly be seen as edge cases from the perspective of population genetics — the original domain of the Frequency Increment Test and related approaches. However, as highlighted here, they are examples of not particularly uncommon processes (lexicalization, stylistic usage of unusual variants) in the domain of language.

Finally, one might argue the examples in Figure 2 are not really counterexamples to the utility of the FIT, being representative of cases where the FIT is, strictly speaking, not designed to apply in the first place, such as series with not-quite-normal increments, long flat segments, and values near the boundaries. Excluding these however would mean excluding a fair share of language change scenarios easily observable in corpora, such as changes starting at zero as in cases of linguistic innovations, ongoing changes stretching beyond the bounds of a corpus, and many S-curves typical of language change (and series in general where the underlying selection coefficient is likely not constant). Yet dismissing these as invalid points of concern would also mean dismissing the FIT as a broadly applicable test of selection for the domain of language change.

In the next section, we turn to simulations to explore the behaviour of the FIT beyond that of a few specific series (Section 4), before finally trying to reconcile these conflicting viewpoints (Section 5).

## 4 The effect of binning frequency data for time series: A simulated example

Here we attempt to further explore the “parameter space” of applying the FIT to simulated data with known properties of selection strength and binning (for code to replicate these results, see the Data availability section at the end). We use the Wright-Fisher model (Ewens 2004) to simulate a large number of time series using the following parameters: population size N = 1000 (N here does not refer to the “population” of speakers, but is analogous to the sum of parallel variants in a corpus bin, e.g. the sum of the counts of lit and lighted in a given year); selection coefficients s in [0, 5]; 200 generations (the latter emulating COHA, where the minimal time resolution is 1 year, and there are 200 years of data). The update rule for this model is as follows. Given $n_t$ “mutants” (e.g., regular past tense forms) in generation t, each individual in the next generation is a mutant individual with probability

(3)
$q=\frac{{n}_{t}\left(1+s\right)}{{n}_{t}\left(1+s\right)+\left(N-{n}_{t}\right)}$

Otherwise, it is the wild type (e.g., irregular past tense forms). Where s = 0 we have random drift; higher values of s give an increasingly strong selective advantage to the mutant variant.
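The update rule in Equation 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the replication code: the function names are hypothetical, the start generation is assumed to count as one of the 200 data points, and the binomial draw is written as N Bernoulli trials only to avoid external dependencies.

```python
import random

def next_mutant_probability(n_t, N, s):
    """Probability q that an individual in the next generation is a mutant (Equation 3)."""
    return n_t * (1 + s) / (n_t * (1 + s) + (N - n_t))

def simulate_wright_fisher(N=1000, s=0.0, generations=200, start=0.5, seed=None):
    """Return the mutant count in each generation, beginning with the start count."""
    rng = random.Random(seed)
    n_t = round(N * start)
    counts = [n_t]
    for _ in range(generations - 1):
        q = next_mutant_probability(n_t, N, s)
        # Binomial(N, q) draw, written out as N independent Bernoulli trials
        n_t = sum(rng.random() < q for _ in range(N))
        counts.append(n_t)
    return counts

# One series with weak selection, mutant variant starting at 50%
series = simulate_wright_fisher(N=1000, s=0.01, generations=200, start=0.5, seed=1)
freqs = [n / 1000 for n in series]
```

With s = 0 the probability q reduces to $n_t / N$, i.e. pure resampling of the previous generation; once the count reaches 0 or N it stays there, which is the absorption discussed above.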

Each series (200 data points) is binned into a decreasing number of bins (i.e., [200, 4], of length [1, 50]), and the FIT is applied to every binned version. The simulation for each combination of selection strength and bin length is replicated 1000 times. In summary, in this section we vary the selection strength s and binning, while keeping N and the number of generations constant.

Importantly, we also apply binning to the series post-simulation the same way one would apply binning to corpus counts, as discussed above. The obvious difference from corpus-based time series is that the latter usually do not come from a population with a stable size (total lexeme frequency usually varies in addition to variation in its variants), and are often not continuous (gaps where a lexeme might be completely absent). Since our artificial series do not suffer from these problems, variable-width and fixed-length binning yield identical results, and we can simply use the latter.
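A minimal sketch of fixed-length binning, assuming that counts are pooled within each bin before the relative frequency is computed (as one would do with corpus counts) and that a shorter trailing bin is kept when the series length is not a multiple of the bin length. The function name is hypothetical.

```python
def bin_counts(variant_counts, totals, bin_length):
    """Pool consecutive time points into bins and return one relative frequency per bin."""
    freqs = []
    for i in range(0, len(variant_counts), bin_length):
        pooled_variant = sum(variant_counts[i:i + bin_length])
        pooled_total = sum(totals[i:i + bin_length])
        freqs.append(pooled_variant / pooled_total)
    return freqs

# Toy example: 200 yearly counts with a constant total of N = 1000 per year
N = 1000
yearly = [500 + (year % 7) for year in range(200)]  # arbitrary illustrative counts
binned = bin_counts(yearly, [N] * 200, bin_length=50)
print(len(binned))
```

With a constant total per time point, pooling counts gives the same values as averaging the per-point frequencies, so the distinction only matters for real corpus data where the totals vary.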

We explore two scenarios, where the competing “mutant” variant starts out at 50% of the population and where it starts out at 5%. The former is useful for exploring the effects of binning at low s and false positive rates, the latter for exploring high s and false negatives. Obviously, any specific s thresholds and ranges discussed in this section apply to this specific experiment and would likely be somewhat different given series of different length and N (cf. the Supplementary Appendix for some further exploration).

### 4.1 Drift and low selection

Figure 3 depicts how the results of the FIT change depending on binning, given a time series with low selection (s = 0.01, bottom row) and no selection (s = 0, top row; corresponds to the leftmost column of pixels on the panels in Figure 4). At zero selection, the FIT has a reasonable false positive rate of around 5% at α = 0.05. Binning such series into a smaller number of bins causes an increase in the share of p-values below 0.05 (presumably because noise is smoothed out). Binning appears to affect the s = 0.01 range even more (bottom row).

Figure 3

The distribution of FIT p-values given 1000 series from the Wright-Fisher model (200 generations, starting at 50%). The panels are arranged from left to right reflecting increased binning. The small inset panels display how binning affects a single example series. p-values below 0.05 are coloured red (left of the dashed line), above 0.05 in blue. Note the log10 x-axis. This figure illustrates that the false positive rate is susceptible to increasing when the series are binned (top row). At non-zero but low s, differences between binning and no binning can be more pronounced (bottom row). See Figure 4 for the full exploration of the parameter space.

Figure 4

FIT p-values of time series generated using the Wright-Fisher model (with the “mutant” variant starting at 50%), across a range of selection coefficients (x-axis, note the log scale), binned into a decreasing number of bins (y-axis). Left in pink and green (a): % of time series with FIT p < 0.05, in 1000 replicates. Right in red and blue (b): mean FIT p-value. The bottom pair (a.2, b.2): the same data, but series with a Shapiro-Wilk p < 0.1 have been removed before calculating the percentages and means. The white rectangle: the range of s and binning explored in Newberry et al. The vertical black line highlights the s explored in Figure 3. A consistent colour across a column of pixels indicates robustness to binning choices under the corresponding s, while variable colouring indicates sensitivity to binning.

Figure 4 represents the entire parameter space explored in this experiment for the 50% start condition. Each pixel on the heat maps corresponds to a parameter combination of selection strength (horizontal axis) and number of bins (vertical axis). The vertical axis starts at 200 bins (i.e. no binning, bin length 1) and runs up to 4 bins (bin length 50, the result of 200 data points being squeezed into 4 bins). Minimal binning, i.e. compressing 200 generations into 100 bins of length 2, appears to make the clearest immediate difference: the share of p < 0.05 is consistently about 10% higher in the binned than in the non-binned series when s is low (observe the bottom two “shifted” looking pixel rows in Figure 4.a.2).

The 50% start is suitable for exploring low selection, as in the case of lower starting values, many such series hit absorption or “run into the ground”, and the resulting mostly-zero series would violate the normality assumption (of its underlying Gaussian approximation of the diffusion process). However, the higher s range in Figure 4.a.2 could be interpreted as a model of the situation where a change is only partially chronicled by a corpus, e.g. Figure 2.a.2 in Section 3. Selection becomes understandably difficult to detect in very short series regardless of the underlying selection coefficient.

### 4.2 High selection

The 5% start is suitable for exploring high selection, as with higher starting values, many high-selection time series reach absorption fast, yielding series not meeting the increment normality assumption. Figure 5 depicts distributions of FIT p-values under different binnings, given time series with a moderately high s of 0.04, and the incoming variant starting out at 5%. This appears to be the subset of series where the FIT works very well and is most insensitive to binning choices.

Figure 5

The distribution of FIT p-values given 1000 Wright-Fisher series with strong selection (200 generations, starting at 5%). The panels are arranged from left to right reflecting increased binning. The small inset panels display how binning affects a single example series. p-values below 0.05 are coloured red (left of the dashed line), above 0.05 in blue. Note the log10 x-axis. The red value in the bottom left corner shows the percentage of p-values below 0.05. This figure illustrates the s range where the FIT is most robust to binning, retaining a small and stable false negative rate (i.e. 100% minus the percentage value in the corner).

Beyond that, things become more complicated. Our reanalysis of the 36 verb time series in Section 2 indicated that it is series exhibiting the strongest selection that would remain consistent in terms of their FIT result across the different binnings. However, as illustrated in Figure 6, it seems too high selection can have the inverse effect, as this is where false negatives begin to crop up under too much binning (e.g. with 10 bins, >10% at s = 0.07, >90% at s = 0.1). That holds if the increment normality assumption is not strictly observed; if it is, then the results of the test are not valid any more at this range (cf. white area in Figure 6.a.2). This illustrates that the FIT has a maximum selection strength for which it is effective. At higher selection strengths, i.e. above 0.06–0.1 in our toy model, sensitivity to binning and violations of the normality assumption both become problematic, yielding results with a high false negative rate (if the assumption is relaxed; cf. Figure 6.a.1) or results which are invalid (if it is observed; 6.a.2). Incidentally, this also is the s range where S-curves characteristic of language change begin to form (cf. the Supplementary appendix).

Figure 6

FIT p-values of time series generated using the Wright-Fisher model (with the “mutant” variant starting at 5%), across a range of selection coefficients (x-axis, note the log scale), binned into a decreasing number of bins (y-axis). Left in pink and green (a): % of time series with FIT p < 0.05, in 1000 replicates. Right in red and blue (b): mean FIT p-value. The bottom pair (a.2, b.2): the same data, but series with a Shapiro-Wilk p < 0.1 have been removed before calculating the percentages and means. The white rectangle: the range of s and binning explored in Newberry et al. The vertical black line highlights the s explored in Figure 5. A consistent colour across a column of pixels indicates robustness to binning choices under the corresponding s, while variable colouring indicates sensitivity to binning.

In summary, these results indicate that if one is to take the same ensemble of language changes, with known selection strength, and apply different binning protocols, one could easily end up drawing very different conclusions depending on the bin length and the normality assumption threshold, if the conclusions are based solely on applying a test such as the FIT. However, if awareness of these limits is maintained, then the FIT works well on time series with moderately strong selection, and reasonably well (with the caveat of somewhat increased false positives rate under binning) on time series generated by a zero or low selection coefficient.

## 5 Discussion

We started out by focussing on the study of the (ir)regularisation of the past tense of 36 English verbs in Newberry et al., specifically their finding that drift cannot be rejected in most cases, leading to the claim of “an underappreciated role for stochasticity in language evolution” (Newberry et al. 2017: 223). The conclusion of our reanalysis section (that their broad conclusion stands but that the FIT is sensitive in specific instances to the chosen binning strategy) prompted further investigation of the properties and range of potential applicability of the FIT. In the following sections, we demonstrated that the FIT yields reasonable results in a certain subset of possible time series, yet perhaps less expected results in others, when applied to a variety of series with different lengths, shapes and underlying selection coefficients.

The fundamental issue is that corpus data has to be operationalised one way or another if one is to apply a time series analysis that is based on variant frequencies. There is as yet no single best method to do so, and the additional researcher degree of freedom is practically unavoidable. Also, unlike microbial experimental data (for which the FIT was designed originally), the beginning and end of a corpus in terms of temporal coverage may not necessarily overlap with the beginning and end of a language change trajectory. The implications of these scenarios on the FIT approach were explored in Figures 2 and 4. Any test based on increment signatures is likely to miss a significant change, if it is recorded by very few data points. This could be either due to data sparsity or low number of bins, very high underlying selection, or the change happening in the middle of an otherwise long series. This could be remedied to an extent by only considering the bins of a corpus or the segments of a time series where a change “looks like” it is taking place, but that introduces yet another parameter or researcher degree of freedom.

In what follows, we attempt to summarize our findings and distil them into actionable guidelines for applying tests of selection to linguistic corpus-derived time series.

### 5.1 Limitations for linguistic selection testing

Besides the fact that caution should be exercised when its statistical assumptions are not met (as with any statistical test), the following should be taken into account when applying the FIT or a similar test of selection to corpus data. s continues to refer to the selection coefficient driving the process of change (assuming an underlying Wright-Fisher like process; see Section 1.2 for related discussion). Obviously, a test of selection being carried out implies that s is actually unknown to the tester — the guidelines sketched here are meant to draw attention to situations where it might be beneficial to inspect the results more carefully. In terms of the input data quality, the results of a test can be misleading if the time series:

• chronicle only a part of a change (beginning or end);
• are too short (too few data points or bins);
• are too long (if covering multiple events, variable s);
• are based on greatly variable bin sizes (avoidable with variable-width binning, which leads to variable bin lengths).

In terms of the types and shapes of possible series, binning can lead to unpredictable results in the case of FIT (and its assumption of increment normality is likely violated) in time series:

• which are S-curves (non-normal increments);
• where s may be suspected to vary over time (e.g. S-curves with long tails);
• where s = 0 (binning increases false positives);
• with a very high s (sharp changes, quick fixation);
• with tiny near-boundary fluctuations;
• where such values are introduced by smoothing (absorption adjustment).

The high s and absorption issue can be avoided by either excluding any series with a long span of zeroes or by making a choice to clip the post-absorption part of the series. That may leave a variable number of very few data points, and of course requires some consistent method of choosing the clipping point. The tiny fluctuations issue is typically caused by occasional occurrences of the less popular variant of a pair or set with a very high underlying total token frequency. Such series can be avoided by checking for the normality of increments.

As exemplified in this contribution, the way data is handled can in some cases drive the results of a test of selection. An application of such a test, particularly if it is borrowed from a different domain, should thus take into account the nature of the data. In the case of time series derived from diachronic corpora, a number of issues require attention. These include corpus size and normalisation (Gries 2010), quality of corpus tagging (cf. Supplementary appendix), genre (Szmrecsanyi 2016) and topic (Karjus et al. 2020) dynamics, representativeness and composition (Lijffijt, Säily & Nevalainen 2012; Pechenick, Danforth & Dodds 2015; Koplenig 2017). For example, imbalances in genre or register can easily lead to a drifty-looking series, if the usage of a variant differs between them. It is also not clear how the interplay of multiple, possibly opposing sources of selection (inherent properties of the variant, sociolinguistic prestige, top-down language planning, etc.) could be captured by a single test. Properties inherent to language can make a difference, such as the aforementioned re-use of archaic variants from the written record (Section 3), or meaning change, which may reasonably resolve competition between variants as they go on to inhabit different niches (automatic methods exist to detect the latter, cf. Dubossarsky et al. 2019). This relates to the issue of determining which variants do and which do not actually compete with one another for the same meaning or function, often referred to in sociolinguistics as the problem of the envelope of variation (cf. Walker 2010).

### 5.2 Opportunities for linguistic selection testing

On the bright side, despite these concerns, the Frequency Increment Test and presumably similar tests are likely reliably applicable to time series derived from linguistic corpus data when:

• the series covers the entire change (yet if possible also excludes near-boundary values);
• the assumptions of the test are checked for;
• the underlying s can be assumed to be constant;
• the interplay of s ranges and binning is taken into account (simulations help);
• the corpus is large, representative and consistently balanced over time for genre, style and topics;
• the target token count for each time bin is large (≳100, cf. the Appendix);
• the semantics of the pair (or set) of variants remain the same;
• the set of variants yielding the relative frequencies can be assumed to be competing, and the set contains all the competitors for a meaning or function.

Besides these rules of thumb, it would be beneficial in most cases to have some principled mechanisms to:

• evaluate multiple possible binning choices for the robustness of the test results;
• deal with the “leftover” flat part of the series before and after the change being analysed;
• distinguish drift and the effects of variable s over time.

Possible use cases in linguistics involving the FIT (or a similar test) presumably fall on a spectrum where on the one end the subject of a study would be a single change in the history of a language, and the aim would be to determine if that change has occurred due to drift or due to individuals consistently selecting for one of the variants, owing to its perceived higher fitness. On the other end would be the evaluation of a very large set of linguistic time series derived from a corpus, with the aim to reveal general patterns and dynamics of language change processes. The study of 36 English verbs by Newberry et al. falls closer towards this end of the spectrum.

When the subject of a study is a single change (or a few), and the result hinges on a single test result, then we would naturally advise to take the preceding concerns into careful consideration, from data sampling and preparation to the specifics of a given selection test, while being mindful of the involved researcher degrees of freedom. If a study veers toward the other end of the spectrum, involving a large set of series, then its design would largely come down to a choice between two approaches.

One could either take a “big-data” approach, feeding the test with a very large set of time series to explore the role of selection and drift in language change, checking for only the minimal statistical assumptions of the test. The upside is that, hopefully, despite the concerns specific to corpora and language, true patterns would emerge, given enough data. The downside is of course the danger of garbage in, garbage out.

Or alternatively, one could take the approach of also trying to check for the various linguistic assumptions in addition to the statistical ones, filtering out unsuitable series. This would hopefully lead to better language science. On the downside, this requires the meticulous introduction of a number of extra parameters, or researcher degrees of freedom. Furthermore, the results might not be representative of general language change dynamics in the end, if based on testing only a niche subset of series “suitable” for a given test — of which there might not be that many either. In other words, no free lunch.

### 5.3 Future prospects

The multitude of points listed above might sound like a lot of limitations. However, we would not by any means conclude that efforts to detect selection in linguistic data should be abandoned. The idea of detecting selection in diachronic linguistic data based on shapes or signatures is not new and remains an open challenge (Bentley 2008; Reali & Griffiths 2010; Blythe 2012; Sindi & Dale 2016; Amato et al. 2018). At the same time, methods for detecting selection continue being improved in the field of population genetics (Nishino 2013; Terhorst, Schlötterer & Song 2015; Schraiber, Evans & Slatkin 2016; Iranmehr et al. 2017; Taus, Futschik & Schlötterer 2017; Vlachos & Kofler 2018).

Perhaps it would be useful to draw a distinction between exploratory and confirmatory findings. In essence, this strand of research (including Newberry et al.) has remained exploratory. Simulations with controlled properties allow for an evaluation of the performance of a test or model under various conditions and suspected confounds (cf. also Kauhanen 2017). However, to the best of our knowledge, there is currently no objective way to evaluate such methods or compare their accuracy against one another, in terms of how well they reflect the actual selection biases operating on the level of the speaker, that may eventually give rise to a change in the consensus on the population level (a sample of which is the only thing that is eventually observable in a diachronic corpus). It would therefore be useful to distinguish between approaches that test for selection, and those that more accurately generate (albeit potentially interesting and worthwhile) hypotheses. The latter may be useful e.g. when positing causes of language change, be they linguistic, social, or cognitive in nature. If drift cannot be rejected, then theorising about possible “causes” of the change is unnecessary.

The difficulties with binning suggest that trying to manipulate the data to make it look more like the underlying Wright-Fisher model — i.e., coarse-graining individual instances of use to construct the continuously-varying variant frequencies that the model predicts — is not the way to go. An alternative procedure would be to include the process of sampling these instances of use to build the corpus as part of the model. For example, given some time series x(t) generated by the Wright-Fisher model, then at an instant t this model says that we should expect to encounter one of the two word variants with probability x(t). In an ideal world, one would then maximise the likelihood of the observed sequence of tokens with respect to the parameters of the Wright-Fisher model (i.e., the selection strength and effective population size). This procedure looks to be somewhat computationally demanding, and may prove intractable for large corpora. However, such a procedure could in principle be applied to token counts as they appear in a corpus, without the need for pre-processing (such as binning) and the researcher freedom associated with it.
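As a sketch of what such a likelihood could look like (the notation $y_i$, $t_i$ and $M$ is introduced here for illustration only): suppose the corpus contains $M$ tokens, the $i$-th observed at time $t_i$, with $y_i = 1$ if it is the variant of interest and $y_i = 0$ otherwise. Conditional on a trajectory $x(t)$, the log-likelihood of the token sequence would then be

$\log L = \sum_{i=1}^{M} \left[ y_i \log x(t_i) + (1 - y_i) \log\left(1 - x(t_i)\right) \right]$

Since $x(t)$ is itself unobserved, the quantity to maximise over the selection strength and effective population size would presumably be the expectation of $L$ over trajectories generated by the Wright-Fisher model with those parameters, which is where the computational cost arises.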

Another domain besides language which has attracted similar genetics-inspired modelling approaches is that of archaeology, particularly datasets of (pre-)historical artefacts (Bentley & Shennan 2003). Similar concerns have followed: “time-averaged assemblages” of variants in cumulative cultural evolution (essentially binned data) can easily introduce bias in various tests (Premo 2014; Crema, Kandler & Shennan 2016). Diachronic datasets (e.g. those based on the archaeological record, but similarly, corpora) only provide sparse, aggregated frequency information, which may be the reflection of a variety of neutral or selective transmission processes at the individual level (Premo 2014; Crema, Kandler & Shennan 2016; Kandler, Wilder & Fortunato 2017; Kandler & Crema 2019). Since these underlying processes cannot be directly observed (particularly in prehistoric data), Kandler, Wilder & Fortunato (2017) suggest shifting the focus from identifying the single individual-level process that likely produced the observed data — to excluding those that likely did not. A corpus being a sample of individual utterances, this suggestion is worth consideration. Although the written record tends to have more metadata than the archaeological, the author of an utterance, along with their selective biases, is often unknown.

Detecting signatures of selection and drift in the evolution of language (and other domains of cumulative culture) remains an interesting prospect. It would be informative to see a comparison of the FIT-like selection detection methods that have been developed in population genetics or archaeology, applied to linguistic data, and systematically evaluated. If the issues listed in the sections above could be solved, then this would certainly improve possibilities for exciting linguistic inquiry, inviting answers to questions such as, do lexemes experience stronger drift than syntactic constructions? What is the relationship of selection and niche (Laland, Odling-Smee & Feldman 2001; Altmann, Pierrehumbert & Motter 2011) in language change? Are some parts of speech more susceptible to change via selection than others? (M. Newberry, p.c.) What is the role of drift in creole evolution? (Strimling, Jansson & Parkvall 2015) In semantic change? (Hamilton, Leskovec & Jurafsky 2016) Are some languages changing more due to drift than others? (and if that relates to community size; Atkinson, Kirby & Smith 2015; Reali, Chater & Christiansen 2018) Can different types of selection be distinguished, e.g. top-down planning, grassroots (Amato et al. 2018), momentum-driven (Stadler et al. 2016)?

## 6 Conclusions

We find ourselves witnessing an exciting time for linguistic research, where more and more data on actual language usage is becoming available, encompassing different languages, dialects, registers, modalities, but also centuries. At the same time computational means for analysing big data have become readily accessible, hand in hand with the development of methods providing new insight into how languages function, change and evolve over time. Alongside and perhaps interlinked with these developments, language as a domain of scientific investigation has attracted interest in recent decades from fields traditionally not engaged in linguistic research, such as physics and biology.

We evaluated the proposal of Newberry et al. (2017), consisting of the application of the Frequency Increment Test as a method for determining whether any time series constructed from corpus frequencies of competing variants is a case of selection or a case of change stemming from stochastic drift. We found that while some of the original results remain robust to binning choices, others do not. Based on constructed and simulated examples, we found that while the results of the FIT can be robust given a subset of suitable series, there are scenarios where they are affected by the way the diachronic corpus data are binned.

We advocate that, in the interest of reproducibility, binning, like any other data manipulation and operationalisation procedure, should be explicitly described in a contribution (as it is by Newberry et al.). Additionally, if the results change under different choices, this should also be reported. Beyond data operationalisation, we drew attention to issues specific to linguistic data that should be taken into account to ensure the quality of testing results, as well as to work in cultural evolution showing that the inference of individual transmission processes from population-level frequency aggregates is susceptible to error and should be handled with care.
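A sensitivity check of this kind can be sketched in a few lines of Python. The example below is a minimal illustration, not the R code used for the analyses in this paper: the function names `bin_counts` and `fit_statistic` are hypothetical, the data are a made-up simulated series, and the binning is the simplest fixed-width variant. The increment rescaling assumes the definition of the FIT given by Feder, Kryazhimskiy & Plotkin (2014), in which each frequency change is divided by sqrt(2ν(1 − ν)Δt) and the rescaled increments are submitted to a one-sample t-test against a mean of zero.

```python
import math
import random
import statistics


def bin_counts(variant_counts, total_counts, width):
    """Pool consecutive time points into fixed-width bins.

    Returns (bin_midpoints, relative_frequencies). A trailing partial
    bin is dropped and bins without any tokens are skipped.
    """
    times, freqs = [], []
    for b in range(len(variant_counts) // width):
        lo, hi = b * width, (b + 1) * width
        total = sum(total_counts[lo:hi])
        if total == 0:
            continue
        times.append((lo + hi - 1) / 2)
        freqs.append(sum(variant_counts[lo:hi]) / total)
    return times, freqs


def fit_statistic(times, freqs):
    """Return (t, df) of a one-sample t-test on rescaled frequency increments."""
    increments = []
    for i in range(1, len(freqs)):
        prev = freqs[i - 1]
        if not 0.0 < prev < 1.0:
            raise ValueError("frequencies must lie strictly between 0 and 1")
        dt = times[i] - times[i - 1]
        increments.append((freqs[i] - prev) / math.sqrt(2.0 * prev * (1.0 - prev) * dt))
    n = len(increments)
    if n < 2:
        raise ValueError("need at least three time points")
    sd = statistics.stdev(increments)
    if sd == 0:
        raise ValueError("increments have zero variance")
    return statistics.fmean(increments) / (sd / math.sqrt(n)), n - 1


# Illustrative data only: a simulated variant rising along a logistic curve,
# sampled with 500 tokens per year over 200 years (all values are assumptions).
rng = random.Random(1)
years, tokens_per_year = 200, 500
totals = [tokens_per_year] * years
probs = [1 / (1 + math.exp(-0.03 * (y - 100))) for y in range(years)]
counts = [sum(rng.random() < p for _ in range(tokens_per_year)) for p in probs]

# Report the test statistic under every binning choice, not just one.
for width in (5, 10, 20, 40):
    bin_times, bin_freqs = bin_counts(counts, totals, width)
    t_value, df = fit_statistic(bin_times, bin_freqs)
    print(f"bin width {width:>2}: {len(bin_freqs):>2} bins, t = {t_value:.2f}, df = {df}")
```

The sketch stops at the t statistic and its degrees of freedom so that it needs only the standard library. A p-value would follow from a Student t distribution with `df` degrees of freedom, and a fuller implementation would also check the normality of the increments, which is an assumption of the t-test.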

To conclude, identifying the role and prevalence of stochastic drift in language change is an important goal, but our results suggest that great care should be exercised when applying such tests to linguistic data, so that the results are not biased by issues specific to the domain or by properties of a particular test.

## Data Accessibility Statement

The R code we used to replicate the results of the original paper is available at https://github.com/mnewberry/ldrift, and the corpus at https://corpus.byu.edu/coha. The code to run the simulations described in this paper is available at https://github.com/andreskarjus/wfsim_fit.

Supplementary file 1

Supplementary appendix to “Challenges in detecting evolutionary forces in language change using diachronic corpora”. DOI: https://doi.org/10.5334/gjgl.909.s1

## Acknowledgements

The authors would like to thank Mitchell Newberry for discussion and comments on the paper that led to significant improvements and revisions, the anonymous reviewers for useful comments, and Alison Feder for providing an implementation of the FIT earlier when one was not yet publicly available.

## Funding Information

The first author of this research was supported by the scholarship program Kristjan Jaak, funded and managed by the Archimedes Foundation in collaboration with the Ministry of Education and Research of Estonia.

## Competing Interests

The authors have no competing interests to declare.

## Author Contributions

Andres Karjus carried out the research and wrote the paper. Richard A. Blythe, Simon Kirby and Kenny Smith provided revisions, comments and feedback on the design of the research and the paper.

## References

1. Ahern, Christopher A., Mitchell G. Newberry, Robin Clark & Joshua B. Plotkin. 2016. Evolutionary forces in language change. ArXiv e-prints.

2. Altmann, Eduardo G., Janet B. Pierrehumbert & Adilson E. Motter. 2011. Niche as a determinant of word fate in online groups. PLOS ONE 6(5). 1–12. DOI: https://doi.org/10.1371/journal.pone.0019009

3. Amato, Roberta, Lucas Lacasa, Albert Díaz-Guilera & Andrea Baronchelli. 2018. The dynamics of norm change in the cultural evolution of language. Proceedings of the National Academy of Sciences 115(33). 8260–8265. DOI: https://doi.org/10.1073/pnas.1721059115

4. Andersen, Henning. 1990. The structure of drift. In Henning Andersen & Konrad Koerner (eds.), Historical Linguistics 1987. Papers from the 8th International Conference on Historical Linguistics, 1–20. Amsterdam: Benjamins.

5. Anderwald, Lieselotte. 2012. Variable past-tense forms in nineteenth-century American English: Linking normative grammars and language change. American Speech 87(3). 257–293. DOI: https://doi.org/10.1215/00031283-1958327

6. Atkinson, Mark, Simon Kirby & Kenny Smith. 2015. Speaker input variability does not explain why larger populations have simpler languages. PLOS ONE 10(6). 1–20. DOI: https://doi.org/10.1371/journal.pone.0129463

7. Baxter, G. J., R. A. Blythe, W. Croft & A. J. McKane. 2006. Utterance selection model of language change. Physical Review E 73(4). 046118. DOI: https://doi.org/10.1103/PhysRevE.73.046118

8. Bentley, R. Alexander. 2008. Random drift versus selection in academic vocabulary: an evolutionary analysis of published keywords. PLOS ONE 3(8). 1–7. DOI: https://doi.org/10.1371/journal.pone.0003057

9. Bentley, R. Alexander & Stephen J. Shennan. 2003. Cultural transmission and stochastic network growth. American Antiquity 68(3). 459–485. DOI: https://doi.org/10.2307/3557104

10. Blythe, Richard A. 2012. Neutral evolution: A null model for language dynamics. Advances in complex systems 15(3–4). DOI: https://doi.org/10.1142/S0219525911003414

11. Blythe, Richard A. & William Croft. 2012. S-curves and the mechanisms of propagation in language change. Language 88(2). 269–304. DOI: https://doi.org/10.1353/lan.2012.0027

12. Calude, Andreea S., Steven D. Miller & Mark Pagel. 2017. Modelling loanword success: A sociolinguistic quantitative study of Māori loanwords in New Zealand English. Corpus Linguistics and Linguistic Theory, 1–38. DOI: https://doi.org/10.1515/cllt-2017-0010

13. Crema, Enrico R., Anne Kandler & Stephen Shennan. 2016. Revealing patterns of cultural transmission from frequency data: Equilibrium and nonequilibrium assumptions. Scientific reports 6. 39122. DOI: https://doi.org/10.1038/srep39122

14. Croft, W. 2000. Explaining language change: An evolutionary approach. Longman.

15. Cuskley, Christine F., Martina Pugliese, Claudio Castellano, Francesca Colaiori, Vittorio Loreto & Francesca Tria. 2014. Internal and external dynamics in language: Evidence from verb regularity in a historical corpus of English. PLOS ONE 9(8). 1–7. DOI: https://doi.org/10.1371/journal.pone.0102882

16. Daoust, Denise. 2017. Language planning and language reform. In The handbook of sociolinguistics, 436–452. Wiley-Blackwell. DOI: https://doi.org/10.1002/9781405166256.ch27

17. Davies, Mark. 2010. The Corpus of Historical American English (COHA): 400 million words, 1810–2009. Available online at https://www.english-corpora.org/coha.

18. Dingemanse, Mark, Damián E. Blasi, Gary Lupyan, Morten H. Christiansen & Padraic Monaghan. 2015. Arbitrariness, iconicity, and systematicity in language. Trends in Cognitive Sciences 19(10). 603–615. DOI: https://doi.org/10.1016/j.tics.2015.07.013

19. Dubossarsky, Haim, Simon Hengchen, Nina Tahmasebi & Dominik Schlechtweg. 2019. Time-out: Temporal referencing for robust modeling of lexical semantic change. In Proceedings of the 57th annual meeting of the association for computational linguistics, 457–470. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1044

20. Enfield, N. J. 2014. Transmission biases in the cultural evolution of language: towards an explanatory framework. In Daniel Dor, Chris Knight & Jerome Lewis (eds.), The social origins of language. Oxford: Oxford University Press. DOI: https://doi.org/10.1093/acprof:oso/9780199665327.003.0023

21. Ewens, Warren J. 2004. Mathematical population genetics 1: Theoretical introduction (Interdisciplinary Applied Mathematics). New York: Springer. DOI: https://doi.org/10.1007/978-0-387-21822-9

22. Fay, Nicolas, Simon Garrod, Leo Roberts & Nik Swoboda. 2010. The interactive evolution of human communication systems. Cognitive science 34(3). 351–386. DOI: https://doi.org/10.1111/j.1551-6709.2009.01090.x

23. Feder, Alison F., Sergey Kryazhimskiy & Joshua B. Plotkin. 2014. Identifying signatures of selection in genetic time series. Genetics 196(2). 509–522. DOI: https://doi.org/10.1534/genetics.113.158220

24. Ghanbarnejad, Fakhteh, Martin Gerlach, José M. Miotto & Eduardo G. Altmann. 2014. Extracting information from S-Curves of language change. Journal of The Royal Society Interface 11(101). DOI: https://doi.org/10.1098/rsif.2014.1044

25. Gray, Tyler J., Andrew J. Reagan, Peter Sheridan Dodds & Christopher M. Danforth. 2018. English verb regularization in books and tweets. ArXiv e-prints. DOI: https://doi.org/10.1371/journal.pone.0209651

26. Gries, Stefan Th. 2010. Useful statistics for corpus linguistics. A mosaic of corpus linguistics: Selected approaches 66. 269–291.

27. Grieve, Jack, Andrea Nini & Diansheng Guo. 2018. Mapping lexical innovation on American social media. Journal of English Linguistics 46(4). 293–319. DOI: https://doi.org/10.1177/0075424218793191

28. Hahn, Matthew W. & R. Alexander Bentley. 2003. Drift as a mechanism for cultural change: An example from baby names. Proceedings of the Royal Society of London B: Biological Sciences 270. S120–S123. DOI: https://doi.org/10.1098/rsbl.2003.0045

29. Hamilton, William L., Jure Leskovec & Dan Jurafsky. 2016. Cultural shift or linguistic drift? Comparing two computational measures of semantic change. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2116–2121. DOI: https://doi.org/10.18653/v1/D16-1229

30. Haspelmath, Martin. 1999. Optimality and diachronic adaptation. Zeitschrift für Sprachwissenschaft 18(2). 180–205. DOI: https://doi.org/10.1515/zfsw.1999.18.2.180

31. Hernández-Campoy, Juan Manuel & Juan Camilo Conde-Silvestre. 2012. The handbook of historical sociolinguistics. Wiley-Blackwell. DOI: https://doi.org/10.1002/9781118257227

32. Iranmehr, Arya, Ali Akbari, Christian Schlötterer & Vineet Bafna. 2017. CLEAR: Composition of likelihoods for evolve and resequence experiments. Genetics 206(2). 1011–1023. DOI: https://doi.org/10.1534/genetics.116.197566

33. Jespersen, Otto. 1922. Language, its nature, development, and origin. H. Holt.

34. Kandler, Anne & Enrico R. Crema. 2019. Analysing cultural frequency data: Neutral theory and beyond. In Anna Marie Prentiss (ed.), Handbook of evolutionary research in archaeology, 83–108. Cham: Springer International Publishing. DOI: https://doi.org/10.1007/978-3-030-11117-5

35. Kandler, Anne, Bryan Wilder & Laura Fortunato. 2017. Inferring individual-level processes from population-level patterns in cultural evolution. Royal Society Open Science 4(9). DOI: https://doi.org/10.1098/rsos.170949

36. Kanwal, Jasmeen, Kenny Smith, Jennifer Culbertson & Simon Kirby. 2017. Zipf’s Law of Abbreviation and the Principle of Least Effort: Language users optimise a miniature lexicon for efficient communication. Cognition 165. 45–52. DOI: https://doi.org/10.1016/j.cognition.2017.05.001

37. Karjus, Andres, Richard A. Blythe, Simon Kirby & Kenny Smith. 2020. Quantifying the dynamics of topical fluctuations in language. Language Dynamics and Change, 1–40. DOI: https://doi.org/10.1163/22105832-01001200

38. Kauhanen, Henri. 2017. Neutral change. Journal of Linguistics 53(2). 327–358. DOI: https://doi.org/10.1017/S0022226716000141

39. Kershaw, Daniel, Matthew Rowe & Patrick Stacey. 2016. Towards modelling language innovation acceptance in online social networks. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM ‘16). 553–562. ACM. DOI: https://doi.org/10.1145/2835776.2835784

40. Kirby, Simon, Hannah Cornish & Kenny Smith. 2008. Cumulative cultural evolution in the laboratory: An experimental approach to the origins of structure in human language. Proceedings of the National Academy of Sciences 105(31). 10681–10686. DOI: https://doi.org/10.1073/pnas.0707835105

41. Koplenig, Alexander. 2017. The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets: Reconstructing the composition of the German corpus in times of WWII. Digital Scholarship in the Humanities 32(1). 169–188. DOI: https://doi.org/10.1093/llc/fqv037

42. Kroch, Anthony & Ann Taylor. 2000. The Penn-Helsinki Parsed Corpus of Middle English (PPCME2). Department of Linguistics, University of Pennsylvania.

43. Labov, W. 2011. Principles of linguistic change, volume 3: Cognitive and cultural factors (Language in Society). Wiley-Blackwell. DOI: https://doi.org/10.1002/9781444327496

44. Laland, K. N., J. Odling-Smee & M. W. Feldman. 2001. Cultural niche construction and human evolution. Journal of Evolutionary Biology 14(1). 22–33. DOI: https://doi.org/10.1046/j.1420-9101.2001.00262.x

45. Lieberman, Erez, Jean-Baptiste Michel, Joe Jackson, Tina Tang & Martin A. Nowak. 2007. Quantifying the evolutionary dynamics of language. Nature 449(7163). 713–716. DOI: https://doi.org/10.1038/nature06137

46. Lijffijt, Jefrey, Tanja Säily & Terttu Nevalainen. 2012. CEECing the baseline: Lexical stability and significant change in a historical corpus. In Jukka Tyrkkö, Matti Kilpiö, Terttu Nevalainen & Matti Rissanen (eds.), Outposts of historical corpus linguistics: From the Helsinki Corpus to a proliferation of resources (Studies in Variation, Contacts and Change in English 10). Helsinki: Research Unit for Variation, Contacts and Change in English (VARIENG).

47. Malaspinas, Anna-Sapfo. 2016. Methods to characterize selective sweeps using time serial samples: An ancient DNA perspective. Molecular Ecology 25(1). 24–41. DOI: https://doi.org/10.1111/mec.13492

48. McMahon, April M. S. 1994. Understanding language change. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9781139166591

49. Monaghan, Padraic & Seán G. Roberts. 2019. Cognitive influences in language evolution: Psycholinguistic predictors of loan word borrowing. Cognition 186. 147–158. DOI: https://doi.org/10.1016/j.cognition.2019.02.007

50. Newberry, Mitchell G., Christopher A. Ahern, Robin Clark & Joshua B. Plotkin. 2017. Detecting evolutionary forces in language change. Nature 551(7679). 223–226. DOI: https://doi.org/10.1038/nature24455

51. Nishino, Jo. 2013. Detecting selection using time-series data of allele frequencies with multiple independent reference loci. G3: Genes, Genomes, Genetics 3(12). 2151–2161. DOI: https://doi.org/10.1534/g3.113.008276

52. Ohala, John J. 1983. The origin of sound patterns in vocal tract constraints. In The production of speech, 189–216. New York, NY: Springer. DOI: https://doi.org/10.1007/978-1-4613-8202-7_9

53. Pechenick, Eitan Adam, Christopher M. Danforth & Peter Sheridan Dodds. 2015. Characterizing the Google Books Corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PLoS ONE 10(10). e0137041. DOI: https://doi.org/10.1371/journal.pone.0137041

54. Premo, L. S. 2014. Cultural transmission and diversity in time-averaged assemblages. Current Anthropology 55(1). 105–114. DOI: https://doi.org/10.1086/674873

55. Reali, Florencia, Nick Chater & Morten H. Christiansen. 2018. Simpler grammar, larger vocabulary: How population size affects language. Proceedings of the Royal Society of London B: Biological Sciences 285(1871). DOI: https://doi.org/10.1098/rspb.2017.2586

56. Reali, Florencia & Thomas L. Griffiths. 2010. Words as alleles: Connecting language evolution with Bayesian learners to models of genetic drift. Proceedings of the Royal Society B: Biological Sciences 277(1680). 429–436. DOI: https://doi.org/10.1098/rspb.2009.1513

57. Rubin, Joan, Björn H. Jernudd, Jyotirindra DasGupta, Joshua A. Fishman & Charles A. Ferguson. 1977. Language planning processes (Contributions to the Sociology of Language). Mouton. DOI: https://doi.org/10.1515/9783110806199

58. Sapir, Edward. 1921. Language. An introduction to the study of speech. New York: Harcourt, Brace and Company.

59. Schraiber, Joshua G., Steven N. Evans & Montgomery Slatkin. 2016. Bayesian inference of natural selection from allele frequency time series. Genetics. DOI: https://doi.org/10.1534/genetics.116.187278

60. Simmons, Joseph P., Leif D. Nelson & Uri Simonsohn. 2011. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22(11). 1359–1366. DOI: https://doi.org/10.1177/0956797611417632

61. Sindi, Suzanne S. & Rick Dale. 2016. Culturomics as a data playground for tests of selection: mathematical approaches to detecting selection in word use. Journal of Theoretical Biology 405. 140–149. DOI: https://doi.org/10.1016/j.jtbi.2015.12.012

62. Smith, Kenny, Monica Tamariz & Simon Kirby. 2013. Linguistic structure is an evolutionary trade-off between simplicity and expressivity. In Markus Knauff, Michael Pauen, Natalie Sebanz & Ipke Wachsmuth (eds.), Proceedings of the 35th Annual Conference of the Cognitive Science Society, 1348–1353. Cognitive Science Society.

63. Stadler, Kevin, Richard A. Blythe, Kenny Smith & Simon Kirby. 2016. Momentum in language change: A model of self-actuating S-shaped curves. Language Dynamics and Change 6(2). 171–198. DOI: https://doi.org/10.1163/22105832-00602005

64. Steels, Luc & Eörs Szathmáry. 2018. The evolutionary dynamics of language. Biosystems 164. 128–137. DOI: https://doi.org/10.1016/j.biosystems.2017.11.003

65. Strimling, Pontus, Fredrik Jansson & Mikael Parkvall. 2015. Modeling the evolution of creoles. Language Dynamics and Change 5(1). 1–51. DOI: https://doi.org/10.1163/22105832-00501005

66. Szmrecsanyi, Benedikt. 2016. About text frequencies in historical linguistics: Disentangling environmental and grammatical change. Corpus Linguistics and Linguistic Theory 12(1). 153–171. DOI: https://doi.org/10.1515/cllt-2015-0068

67. Tamariz, Monica, T. Mark Ellison, Dale J. Barr & Nicolas Fay. 2014. Cultural selection drives the evolution of human communication systems. Proceedings of the Royal Society B: Biological Sciences 281(1788). 20140488. DOI: https://doi.org/10.1098/rspb.2014.0488

68. Taus, Thomas, Andreas Futschik & Christian Schlötterer. 2017. Quantifying Selection with Pool-Seq Time Series Data. Molecular Biology and Evolution 34(11). 3023–3034. DOI: https://doi.org/10.1093/molbev/msx225

69. Terhorst, Jonathan, Christian Schlötterer & Yun S. Song. 2015. Multi-locus analysis of genomic time series data from experimental evolution. PLoS genetics 11(4). e1005069. DOI: https://doi.org/10.1371/journal.pgen.1005069

70. Trask, Robert Lawrence. 1996. Historical linguistics. London: Arnold.

71. Van de Velde, Freek. 2014. Degeneracy: The maintenance of constructional networks. In The extending scope of construction grammar 54. 141–179. Berlin/Boston: Walter De Gruyter GmbH.

72. Vlachos, Christos, Claire Burny, Marta Pelizzola, Rui Borges, Andreas Futschik, Robert Kofler & Christian Schlötterer. 2019. Benchmarking software tools for detecting and quantifying selection in evolve and resequencing studies. Genome Biology 20(1). 169. DOI: https://doi.org/10.1186/s13059-019-1770-8

73. Vlachos, Christos & Robert Kofler. 2018. MimicrEE2: Genome-wide forward simulations of Evolve and Resequencing studies. PLOS Computational Biology 14(8). 1–10. DOI: https://doi.org/10.1371/journal.pcbi.1006413

74. Walker, James A. 2010. Variation in linguistic systems. New York: Routledge.

75. Wright, Sewall. 1931. Evolution in Mendelian populations. Genetics 16(2). 97–159.

76. Zipf, George Kingsley. 1949. Human behavior and the principle of least effort: An introduction to human ecology. Reading, MA: Addison-Wesley Press.