Estimating child linguistic experience from historical corpora

Child language acquisition is often identified as one of the primary drivers of language change, but the lack of historical child data presents a challenge for empirically investigating its effect. In this work, I observe the relationship between lexicons extracted from modern child-directed speech and those drawn from modern and historical literary corpora in order to better understand when language acquisition can be modeled over historical and non-child corpora as it is over child corpora. Morphophonological and syntactic-semantic patterns occur at similar type frequencies in these corpora among high token frequency items, and furthermore, when a learning algorithm is applied to lexicons sampled from these sources, it consistently achieves the same learning outcomes in each. With appropriate care and pre-processing, modern and historical text corpora are effectively interchangeable with child-directed speech corpora for the purpose of estimating child lexical experience, opening a path for modeling language acquisition where child-directed corpora are not available.


Introduction
The advent of child-directed speech (CDS) corpora in recent decades containing years' worth of early linguistic input (e.g., CHILDES; MacWhinney 2000) has facilitated significant progress in the field of native language acquisition. That said, no CDS corpora exist for the overwhelming majority of the world's languages, and none that do exist date back before the mid-20th century. Without such corpora, the insights that child language acquisition researchers gain from modern methodologies cannot be extended to most of today's world, let alone to past eras. The contribution of this paper is methodological: I establish that, despite the differences that intuitively exist, CDS and modern and historical non-CDS corpora are fundamentally similar along dimensions relevant for native language acquisition. This stands to facilitate acquisition research for a more diverse range of languages, and, as discussed here, research into child language acquisition in the past.
Four aspects of language learning which are reflected in CDS motivate this work. First, the relative uniformity of language acquisition: learners exhibit remarkable synchronic uniformity despite the variability of the input they receive (Labov 1972). Second, the crucial role of type frequency: convergent results from a wide variety of research programs connect grammar learning to the number of types over which linguistic patterns are expressed in the input rather than the attestation of any particular lexical items (Aronoff 1976; MacWhinney 1978; Bybee 1985; Baayen 1993; Elman 1998; Pierrehumbert 2003; Yang 2016). Third, token frequency and availability: the relative age at which learners acquire vocabulary items is correlated with their token frequencies in the input (Goodman et al. 2008). And fourth, small early vocabularies: the typical learner knows only a few hundred to a thousand words by around age three (Hart & Risley 1995; Szagun et al. 2006). Since children acquire most properties of their native grammars by that age, the bulk of grammar acquisition is undertaken on the basis of relatively few, mostly high-frequency items rather than large adult-like lexicons.
Lexical variability between CDS corpora reflects the real-world variation in early linguistic experience that leads to precociousness or delays among learners (Maratsos 2000; Yang 2002). It also reflects realistic assumptions about learner knowledge. Since higher token frequency items tend to be acquired earlier, young learners' lexicons may be estimated by trimming off the less frequent items from CDS (Nagy & Anderson 1984; Yang 2016). Doing so yields approximations of "typical" children's lexicons which are the right size and consist primarily of high-frequency items. It is these properties that make corpora of child-directed speech such useful resources for studying grammar learning. If the field can establish whether historical and other non-CDS corpora share these properties as well, researchers can apply models of language acquisition to historical data to work out how, when, and whether the process of native language acquisition effects change.
To that end, I conduct three studies which elaborate on the similarities between modern and historical non-CDS corpora on one hand and CDS on the other for the purpose of modeling productivity. I begin in Section 2 by illustrating the effect that trimming low token frequency items has on CDS and adult corpora in Modern English. This is extended to historical corpora in Section 3, where I compare semantic overlap between cross-linguistic modern CDS and historical lexicons. Finally, Section 4 demonstrates that applying a type-based threshold learning algorithm to morphological problems yields the same acquisition outcomes in Modern English lexicons taken from CDS and modern non-CDS corpora and in Icelandic lexicons drawn from historical and modern non-CDS corpora.

Verbal lexicons derived from child-directed speech and adult corpora
This study establishes the similarity between lexicons derived from adult literary corpora and those derived from corpora of child-directed speech. I begin by demonstrating the effect of trimming low-frequency vocabulary from the extracted lexicons, and following that, I compare the attested type frequency of various linguistic properties between the adult and CDS-derived lexicons. Type frequencies of these properties are quantitatively similar in these corpora despite superficial differences in specific lexical content.
Adult corpus lexicons are drawn from the Corpus of Contemporary American English (COCA; Davies 2009), which contains millions of lemmatized and POS-tagged words of text drawn from five genres: spoken, popular magazine, fiction, newspaper, and academic. Each genre contains individual subcorpora for each year, and each subcorpus contains between 2.5 and 5.5 million tokens and between 4,200 and 10,200 verb lemmas when those tagged as auxiliaries or modals are excluded. 1 Child input lexicons are drawn from three lemmatized POS-tagged corpora within CHILDES (MacWhinney 2000), each containing roughly 1,000 unique verb lemmas, again with auxiliaries and modals excluded: Brent (n = 984; Brent & Siskind 2001), Brown (n = 916; Brown 1973), and MacWhinney (n = 1042; MacWhinney 1991). 2 These were chosen for their large size relative to other CDS corpora, each containing about a year's worth of child-directed speech. I focus on verbs here for consistency across studies and because they show more interesting inflectional patterns in English than other syntactic categories do. That said, the Zipfian statistical corpus distributions of verb lemmas, inflectional categories, and so on, are the same as those obeyed by other categories (Chan 2008; Finley 2018), which is demonstrated in practice by learning behavior in computational morphology learners (e.g., Lignos et al. 2010). The results can therefore be extended to other syntactic categories.
The most frequent verb lemmas are tabulated for each CHILDES corpus and COCA subcorpus, and four estimates are made from each with the following frequency cutoffs: n = all, 1,042 (all types in the largest of the CHILDES corpora), 500, and 100. The three frequency-trimmed conditions represent the lexicons of late, middle, and early learners respectively. Given the total vocabulary size estimates of Hart & Risley (2003) and Szagun et al. (2006), a learner who only knows about 100 verbs is certainly less than three years old, while one who knows 500 is perhaps closer to school age.
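The trimming step can be sketched in a few lines. This is a minimal illustration, assuming the input is a flat list of verb lemma tokens with auxiliaries and modals already removed; the function name `trim_lexicon` and the toy token list are hypothetical and not taken from the study.

```python
from collections import Counter

def trim_lexicon(lemma_tokens, n=None):
    """Return the set of the n most frequent lemmas.

    lemma_tokens: iterable of lemma strings, one per corpus token.
    n: frequency cutoff; None corresponds to the "n = all" condition.
    """
    counts = Counter(lemma_tokens)
    if n is None:
        return set(counts)
    return {lemma for lemma, _ in counts.most_common(n)}

# Toy token stream, invented for illustration only.
tokens = ["go"] * 5 + ["see"] * 4 + ["want"] * 3 + ["encapsulate"]
early_lexicon = trim_lexicon(tokens, n=3)   # keeps the three most frequent lemmas
full_lexicon = trim_lexicon(tokens)         # keeps every attested lemma
```

Ties at the cutoff are broken here by order of first occurrence, which is how `Counter.most_common` behaves; a real study would need to state its own tie-breaking rule.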

Raw lexical similarity
Measuring lexical overlap between extracted lexicons illustrates the effect of trimming infrequent vocabulary. Jaccard similarity |A ∩ B|/|A ∪ B| is employed to measure the set overlap between each pair of lexicons (self-similarity excluded). The metric has a range [0,1] where higher is more similar. Figure 1 shows the range of Jaccard similarities between CDS and COCA corpora on the left and between COCA corpora of different genres on the right. Two observations stand out. First, similarities are much higher for all frequency-trimmed conditions than for n = all, which suggests that items which are not shared between corpora are predominately low-frequency. Second, though CDS-COCA similarities are lower than COCA-COCA similarities, their ranges overlap once frequency trimming has been applied, which means that some CDS corpora are more similar to some COCA corpora than some COCA corpora are to one another. Specific lexical items are not necessarily well-shared across corpora regardless of genre, but frequency trimming improves the situation significantly.
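For concreteness, the pairwise comparison can be sketched as follows. The lexicons below are toy sets invented for illustration, and the helper names are hypothetical; returning 0.0 for two empty lexicons is a convention chosen here, not something the study specifies.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two lexicons."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty lexicons
    return len(a & b) / len(union)

def pairwise_jaccard(lexicons):
    """Score every unordered pair of distinct lexicons (self-similarity excluded)."""
    return {
        (x, y): jaccard(lexicons[x], lexicons[y])
        for x, y in combinations(sorted(lexicons), 2)
    }

# Toy lexicons, invented for illustration only.
lexicons = {
    "cds": {"go", "see", "want", "eat"},
    "fiction": {"go", "see", "want", "say"},
    "academic": {"go", "see", "suggest", "indicate"},
}
scores = pairwise_jaccard(lexicons)
```

In this toy case the CDS and fiction sets share three of five distinct items, while each shares only two of six with the academic set.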

Lexical property similarity
But what matters for learning is often not the individual linguistic items in the input so much as the properties of those items. As discussed in the introduction, the type frequency of some property, that is, the number of items in the lexicon exhibiting it rather than which specific items those are, is what drives productivity learning. This time, I compare the same adult COCA and CHILDES-derived lexicons in terms of the type frequencies of three linguistic properties. These were chosen for coverage. First, Latinate verbs are a morphophonological class which is acquired relatively late, around the start of school (Tyler & Nagy 1989; Jarmulowicz 2002). Second, irregular verbs are morphological, learned much earlier, and factor into the classic Past Tense Debates about productivity in the acquisition literature (Rumelhart & McClelland 1986; Pinker & Prince 1988; Pinker & Ullman 2002: inter alia). Third, double object alternator verbs are syntactic and semantic in nature (Rappaport Hovav & Levin 2008), and their acquisition is one of the classic case studies in argument structure learning (Baker 1979; Pinker et al. 1987; Gropen et al. 1989; Yang 2016). The results show that there is less variation between corpora in terms of type frequencies than in terms of lexical identity, and that the CDS-derived lexicons are in general quantitatively similar to the adult lexicons in the frequency-trimmed conditions.

Irregular verbs
So-called irregular verbs in English are those that undergo stem changes or suppletion when forming the past tense and past participle, e.g., sing ∼ sang ∼ sung, go ∼ went ∼ gone, or tell ∼ told ∼ told. A learner acquiring English verbal morphology must work out which of these verbs are inflected according to some generalizable pattern and which are truly one-off "irregulars" that must be listed or memorized (Berko 1958; Pinker & Prince 1994). Figure 2 shows the mean number of strong verb lemmas by genre for each frequency cutoff n. It is plain from visual inspection alone that CDS and the COCA genres become much more alike when the rare items are trimmed from COCA. It is also striking that academic writing rather than CDS appears to be the greatest outlier for each trimmed condition.
At n = all, the adult lexicons contain far more irregular verbs than the CDS-derived lexicons simply because they are taken from larger corpus samples, but when trimmed to n = 1042 and 500, CDS falls within the range of the adult lexicons, while at n = 100, CDS overlaps with fiction. A regression predicting the number of strong verbs by CDS/adult status finds no significant difference between CDS and adult lexicons in any of the frequency-trimmed conditions. If one were presented with the box plots in Figure 2 with the genre labels and colors removed, it would not be possible to identify which box corresponded to CDS in the trimmed conditions.

Double object/to-dative alternator verbs
The acquisition of DO/to-dative verbs (e.g., give, send and tell) (Levin 1993: §2.1) is one of the classic problems in argument structure acquisition. Their attestation in these corpora reveals the same kind of pattern as the irregular verbs: again, trimming the low token frequency items from the COCA-derived lexicons brings them in line with the CDS lexicons.
The results are shown in Figure 3. There is no significant difference between CDS and adult lexicons at n = 500 or 100, and while CDS is statistically different from adult at n = 1042, it is not different from academic, and the difference between CDS and adult means shrinks from about 200% to near 10%.

Latinate verbs
Unlike the previous properties, Latinate verbs are saliently associated with genre (Levin et al. 1981; Levin & Novak 1991), and many, but not all, are high-register (COCA contains encapsulate, irradiate, reconstitute, but also confuse, offer, and remember). Additionally, the morphophonological generalizations associated with English Latinate vocabulary are acquired late, typically not until children enter school. As such, we expect there to exist significant quantitative differences between the rate of Latinate verbs in CDS-derived and adult-derived lexicons as shown in Figure 4.
This prediction is borne out: every test shows a significant difference except at n = 100. Nevertheless, frequency trimming brings the type frequencies of CDS and non-CDS much closer together since Latinate vocabulary is disproportionately present among low-frequency items in every COCA genre. Notably, academic lexicons once again differ from all other genres.

Interim conclusions
These studies show that type frequencies in corpora derived from child-directed speech are statistically similar to frequency-trimmed corpora derived from adult literary genres even though they differ in their specific lexical contents. In every instance, frequency trimming brings CDS-derived and non-CDS-derived type counts much closer together, and in most cases there is no statistically significant difference between the two trimmed lexicon categories. Adult corpora may be reasonably substituted for CDS corpora for the purpose of modeling grammar learning in child language acquisition, since it is these type frequencies that are directly relevant and frequency trimming is just a normal step for approximating child vocabulary size and composition when analyzing CDS for productivity.

Verbal lexicons derived from child-directed speech and historical corpora
Child language acquisition is often implicated as a driving force in language change (Sweet 1899; Halle 1962; Kiparsky 1965; Andersen 1973; Baron 1977; Lightfoot 1979; Niyogi & Berwick 1996; Kroch 2001; Yang 2002; Cournane 2017: inter alia), and some programs which do not privilege child language acquisition still acknowledge a special role for children (Labov 1989), though there are also prominent dissenters (Croft 2000; Meisel 2011; Diessel 2012: inter alia). Children of the past must have acquired language in a way similar to modern children, a straightforward consequence of linguistic uniformitarianism (Labov 1972; Walkden 2019), so the obvious obstacle to investigating the relationship between acquisition and change, whether or not the position is empirically supported, is more practical than theoretical: it is hampered by the lack of access to children of the past.
This study extends the previous analysis back through time to compare the contents of modern CDS-derived and (frequency-trimmed when applicable) historical lexicons. Since linguistic properties like the presence of "irregular" inflection are not conserved across languages, this study compares the meanings contained in each lexicon instead. Items are matched between two lexicons if there is a shared translation between them. For example, English slide is matched with Spanish resbalar 'slip,' Latin lābī 'slip, glide,' and Proto-Germanic *slīdaną 'slide.' 3 Since correspondences between the lexicons are no longer one-to-one, Jaccard similarity does not make sense here. Instead, raw percent overlap is calculated as |A ∩ B|/min(|A|,|B|). Overlaps are systematically higher than Jaccard similarities because the denominator is smaller.
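One way to operationalize this overlap measure is sketched below. It rests on assumptions that the text does not spell out: each lexical item is represented as a set of English glosses, an item counts as matched if it shares at least one gloss with any item in the other lexicon, and |A ∩ B| is counted over the items of the smaller lexicon. The function name and the toy glosses are hypothetical.

```python
def percent_overlap(lex_a, lex_b):
    """Raw overlap |A ∩ B| / min(|A|, |B|) under translation matching.

    lex_a, lex_b: dicts mapping each lemma to a set of English glosses.
    Assumption: an item of the smaller lexicon is "in the intersection"
    if any of its glosses also glosses some item of the larger lexicon.
    """
    small, large = (lex_a, lex_b) if len(lex_a) <= len(lex_b) else (lex_b, lex_a)
    if not small:
        return 0.0  # convention chosen here for an empty lexicon
    large_glosses = set().union(*large.values())
    matched = sum(1 for glosses in small.values() if glosses & large_glosses)
    return matched / len(small)

# Toy lexicons with invented gloss sets, for illustration only.
english = {
    "slide": {"slide", "slip"},
    "sing": {"sing"},
    "phone": {"phone"},
}
latin = {
    "lābī": {"slip", "glide"},
    "canere": {"sing"},
    "arāre": {"plow"},
    "serere": {"sow"},
}
score = percent_overlap(english, latin)
```

Here two of the three English items find a Latin match, so the toy score is two thirds. Other reasonable operationalizations (for instance, counting matches from both sides) would give different numbers, which is why the matching rule is flagged as an assumption.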
English CDS (Brown) and Spanish CDS from CHILDES (FernAguado, Hess, OreaPine, Remedi, Romero, and SerraSole (Romero et al. 1992; Hess Zimmermann 2003; MacWhinney 2000; Aguado-Orea & Pine 2015)) are compared to two pre-modern lexicons: Latin from all Old and Classical texts in the Perseus online edition (Smith et al. 2000), and Proto-Germanic (PGmc) taken from all securely reconstructable strong verbs in Seebold (1970). 4 The Proto-Germanic strong verbs are chosen because they are not semantically coherent and provide a sufficiently large set for comparison, and frequency cutoffs are established for each corpus-derived lexicon to bring them in line with the size of PGmc. To establish a within-language CDS baseline, the overlap procedure was performed between the Brown and Brent corpora with the same frequency cutoff applied to both, and Brown and Spanish were compared as a cross-language CDS baseline. Table 1 reveals a spread of about 15 points between lowest and highest raw percent overlap scores. The within-language English-English baseline is the highest at about 82%, while the cross-language CDS baseline is somewhat lower at 73%. The Latin comparisons are higher than the CDS baseline, while the Proto-Germanic numbers are a few points lower. The high overlap between the reconstructed and modern lexicons is likely due to the fact that words are securely reconstructable only if they are retained in multiple daughter branches, and that the words that are likely to be retained tend to be frequent everyday terms, the same kind that we expect to find in CDS. For example, the Proto-Germanic words for 'bite,' 'wait,' 'fall,' 'pull,' 'sing,' and 'help' are reconstructable because they were retained in its daughters, and their equivalents are all present in both of the modern English CDS corpora since they are common everyday terms. 5 It seems that cultural differences account for the extra discrepancy between Proto-Germanic and CDS.
The PGmc lexicon contains many terms for farming ('sow,' 'plant,' 'thresh'), household chores ('weave,' 'knead,' 'bank a fire') and other aspects of culture ('cast lots,' 'be a retainer') that modern urban children are unlikely to know, but which children growing up in Iron Age agricultural societies must have known. These cultural terms account for roughly 3.1 points of overlap, which when added in would put the PGmc comparisons in line with the English-Spanish overlap.
All in all, lexical overlap is conserved between CDS, adult historical corpora, and reconstructed lexicons about as well as between CDS lexicons. They contain largely the same kinds of meanings despite their varied origins, and differences between lexicons can be partially accounted for by cultural differences rather than corpus differences. The lists collated in the supplementary material show that higher frequency items are more likely to match than low-frequency items, even among different CDS corpora. This reiterates the point from Section 2 that low token frequency items are more likely to be corpus-specific than high-frequency items.
4 I thank Donald Ringe for his help in sorting through Seebold.
5 The supplementary material contains a full list of examples.

Deploying an acquisition model
This study compares outcomes when a learning algorithm is applied to CDS, modern non-CDS and historical corpora. First, I compare the acquisition of Modern English productive past -ed on lexicons sourced from CDS and adult corpora. Following that, I apply the same algorithm to a past tense generalization in Old and Modern Icelandic to draw conclusions about child development in the past. In both cases, I apply the Tolerance Principle (TP) following Yang (2016: Chapter 4.1), though any type-based acquisition model could be used here. The TP stated in (1) is a model of productivity learning that defines a threshold θ for how many exceptional types a hypothesized grammatical pattern can tolerate before it becomes untenable and the learner resorts to memorization and listing instead. The threshold is derived such that it lies at the point where it becomes more economical for a language user to learn a pattern plus exceptions rather than no pattern (Yang 2016: 48-51).
(1) Tolerance Principle: If R is a productive rule applicable to N candidates, then the following relation holds between N and e, the number of exceptions that could but do not follow R:
e ≤ θ_N, where θ_N = N / ln N
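As a worked illustration, the check can be written directly, assuming the threshold takes the form given in Yang (2016), θ_N = N / ln N. The function names are hypothetical. The example values are the average irregular counts reported later in this section for N = 800 (117 for adult-derived and 127 for CDS-derived lexicons).

```python
import math

def tolerance_threshold(n):
    """Tolerance threshold θ_N = N / ln N (form assumed from Yang 2016)."""
    if n < 2:
        raise ValueError("the threshold is only meaningful for N >= 2")
    return n / math.log(n)

def is_productive(n, e):
    """True if a rule over n types tolerates e exceptions."""
    return e <= tolerance_threshold(n)

theta_800 = tolerance_threshold(800)       # roughly 119.7
adult_ok = is_productive(800, 117)         # 117 exceptions fall under the threshold
cds_ok = is_productive(800, 127)           # 127 exceptions exceed it
```

Under this form of the threshold, 117 exceptions at N = 800 are tolerated while 127 are not, which is consistent with N = 800 being the marginal point.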

Modern English Past -ed
To investigate whether CDS-derived and adult-derived lexicons yield similar learning outcomes, I model the acquisition of the English productive past-forming -ed pattern. The acquisition of English past tense is a complex and classic problem in morphological learning which has triggered decades of debate (Berko 1958; Rumelhart & McClelland 1986; Pinker & Prince 1988, 1994; Ramscar 2002; Kirov & Cotterell 2018: inter alia), and the acquisition of a default past -ed pattern is one piece of the challenge. In terms of the Tolerance Principle, the pattern being acquired is one that applies -ed (with the appropriate morphophonology) to a verb to produce its past tense form. All verb types learned up to a given point in development count towards the N in the formula, while those verb types learned with irregular pasts by that point make up e. Specific lexical items do not matter in the calculation, nor do the values of e and N past establishing whether or not e lies below the tolerance threshold. Yang (2016) finds that the English lexicon is such that early learners who know 500 or fewer verbs know too many irregulars relative to regularly derived past verbs to learn -ed productively. The situation is marginal at 800, and learners can finally reliably acquire the productive past once they know 1,000 verbs. I reproduce these results. First, 1,000 CDS-derived lexicons with 1,000 items each are sampled from the 1,515 unique lemmas attested in the English Brent, Brown, or MacWhinney corpora, weighted by their token frequencies across those corpora. The same sampling is then performed on the 1,500 most common COCA lemmas to create 1,000 sample adult-derived lexicons. The TP is calculated on each lexicon for the top N = 100, 150, and 200 through 1,000 items.
For all CDS-derived and adult-derived lexicons, the results at N = 100, 500, and 1,000 are identical to what Yang (2016) reports for both sample types: every lexicon fails to generalize past -ed at low N but succeeds by N = 1,000, as shown in Table 2. On its own, this TP calculation would imply that a learner would not acquire a productive past -ed until knowing near 1,000 unique verbs' past forms, but see Yang (2016: Chapter 4.1.2).
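The sampling procedure can be sketched as follows. Several details are assumptions rather than the study's stated method: sampling is done without replacement using Efraimidis-Spirakis weighted keys, the "top N" items of a sampled lexicon are taken to be its N highest token frequency members, and the threshold is N / ln N as in Yang (2016). The verb list, frequencies, and sample sizes are small synthetic stand-ins, not the corpus data.

```python
import math
import random

def sample_lexicon(freqs, k, rng):
    """Draw k distinct lemmas, weighted by token frequency.

    Weighted sampling without replacement via Efraimidis-Spirakis keys:
    each lemma gets the key u ** (1 / weight) and the k largest keys win.
    """
    keyed = sorted(freqs, key=lambda w: rng.random() ** (1.0 / freqs[w]), reverse=True)
    return keyed[:k]

def productive_share(freqs, irregular, k, sizes, n_samples, seed=0):
    """Proportion of sampled lexicons for which the rule is productive at each N."""
    rng = random.Random(seed)
    hits = {n: 0 for n in sizes}
    for _ in range(n_samples):
        lexicon = sorted(sample_lexicon(freqs, k, rng), key=freqs.get, reverse=True)
        for n in sizes:
            e = sum(1 for verb in lexicon[:n] if verb in irregular)
            if e <= n / math.log(n):  # Tolerance Principle check (assumed form)
                hits[n] += 1
    return {n: hits[n] / n_samples for n in sizes}

# Synthetic Zipf-like lexicon: 300 verbs, irregulars are the 30 most frequent.
freqs = {f"v{rank}": 1000 // (rank + 1) + 1 for rank in range(300)}
irregular = {f"v{rank}" for rank in range(30)}
shares = productive_share(freqs, irregular, k=200, sizes=(50, 100, 200), n_samples=200)
```

In this synthetic setup the rule is necessarily productive at N = 200, because the 30 irregulars can never exceed the threshold of about 37.7, whereas at smaller N the outcome depends on how many high-frequency irregulars each sample happens to contain. This mirrors the qualitative shape of the learning curve described here, but the numbers themselves carry no empirical weight.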
What differences do exist cluster around the N = 800 point that Yang reports as marginally non-productive. When plotted in Figure 5, we see that the adult corpus learning curve is shifted somewhat to the left, which reflects the slightly lower average number of irregular verbs in the adult-derived lexicons at that point (117 vs. 127). This is effectively a sample-dependent relative developmental delay of the kind reported in Maratsos (2000) and Yang (2002). Regardless, the final learning state is identical for every single adult and CDS sample.

Icelandic Strong Verbs
Finally, I apply the Tolerance Principle to a problem in both Old and Modern Icelandic to compare modern and historical learning trajectories. The remarkable diachronic stability of Icelandic morphology renders it uniquely suitable for this study since it allows us to set up a null hypothesis: patterns that emerge among the highest frequency items in an Old Icelandic text should be apparent in a Modern Icelandic text as well. We could run the same test on Old English or Latin, but we would have no hypothesis to test since their modern descendants are so different.
Icelandic, like English, has a significant number of verbs that express past tense by stem mutation (so-called strong verbs, e.g., syngja ∼ söng 'sing'), and a much larger number which express the past through suffixation (multiple classes of weak verbs, e.g., dvelja ∼ dvaldi 'dwell, reside,' svara ∼ svaraði 'answer, respond'). It is up to child learners to sort out whether any patterns exist over these verbs that indicate which type of inflection to productively employ. This turns out to be quite challenging: even eight-year-old Icelandic children still make a non-trivial number of errors in which they substitute one class for another (Ragnarsdóttir et al. 1999).
Figure 5: Proportion of learners acquiring productive -ed past by vocabulary size.
I consider one such generalization that illustrates this pattern of learning: the relationship between monosyllabic verbs (e.g., dá 'adore, worship,' ná 'get, obtain,' sjá 'see, perceive') and strong inflection. Most verbs in this set are weak, but a few are strong as well, so a learner has to determine which ones belong to the productive pattern, if any, and which should be learned as exceptions. To investigate this quantitatively, I extract all verbs which are attested at least once in the past tense from the Old Icelandic and Modern Icelandic texts in the POS-tagged and lemmatized Icelandic Parsed Historical Corpus (Wallenberg et al. 2011), which results in two sets of 735 and 921 unique verb types respectively. Next, I apply the same sampling procedure as in the above section to generate 1,000 sample lexicons from each era to model the learning trajectories of "typical" learners exposed to these verbs in their input. The resulting developmental trajectories are presented in Figure 6. There are two takeaways here. First, the average learning trajectories are closely matched between the Old Icelandic and Modern Icelandic learners, which confirms our expectations of morphological conservatism in Icelandic and once again demonstrates the insignificance of genre differences when it comes to the type expression of linguistic properties. Second, the figure shows that all early learners with small vocabularies can productively apply strong verb inflection to monosyllabic verbs, but that they gradually lose this option as their vocabularies grow and monosyllabic strong verbs are revealed to be the true exceptions. This pattern of early spurious productivity is consistent with the widely observed tendency for "irregulars" (here, strong verbs) to cluster among high token frequency items (Bybee 1985; Baayen 1993; Yang 2016: inter alia).
It drives modern learners to tenable but ultimately incorrect hypotheses about their languages (e.g., Xu & Pinker 1995;Ragnarsdóttir et al. 1999), and now we can say that it did so for Icelandic learners of the past too.

Conclusions
The studies presented here identify substantial similarities between corpora of child-directed speech and both modern and historical adult corpora as well as a reconstructed lexicon. When lexicons derived from non-CDS corpora are trimmed by token frequency in order to approximate child lexicon sizes, they express type frequencies that are statistically similar to those in CDS corpora. Since it is these type frequencies that are critical for the acquisition of linguistic generalizations, non-CDS corpora can be used to model aspects of child language learning. These results open up a path for researchers who wish to empirically evaluate the relationship between acquisition and change, and they give reason to investigate what other relationships may hold between child-directed and non-child-directed corpora.