<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<!--<?xml-stylesheet type="text/xsl" href="article.xsl"?>-->
<article article-type="research-article" dtd-version="1.2" xml:lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id journal-id-type="issn">2397-1835</journal-id>
<journal-title-group>
<journal-title>Glossa: a journal of general linguistics</journal-title>
</journal-title-group>
<issn pub-type="epub">2397-1835</issn>
<publisher>
<publisher-name>Open Library of Humanities</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.16995/glossa.17737</article-id>
<article-categories>
<subj-group>
<subject>Research article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A large-scale investigation of vowel co-occurrence patterns in the world&#8217;s lexicons</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<contrib-id contrib-id-type="orcid">https://orcid.org/0009-0009-8967-9200</contrib-id>
<name>
<surname>&#352;egedin</surname>
<given-names>Bruno Ferenc</given-names>
</name>
<email>bruno_ferenc_segedin@brown.edu</email>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-5689-1186</contrib-id>
<name>
<surname>Priva</surname>
<given-names>Uriel Cohen</given-names>
</name>
<email>Uriel_Cohen_Priva@brown.edu</email>
<xref ref-type="aff" rid="aff-2">2</xref>
</contrib>
</contrib-group>
<aff id="aff-1"><label>1</label>Department of Cognitive and Psychological Sciences, Program in Linguistics, Brown University, Providence, RI, USA</aff>
<aff id="aff-2"><label>2</label>Program in Linguistics, Brown University, Providence, RI, USA</aff>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2026-04-17">
<day>17</day>
<month>04</month>
<year>2026</year>
</pub-date>
<pub-date pub-type="collection">
<year>2026</year>
</pub-date>
<volume>11</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>41</lpage>
<permissions>
<copyright-statement>Copyright: &#x00A9; 2026 The Author(s)</copyright-statement>
<copyright-year>2026</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See <uri xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</uri>.</license-p>
</license>
</permissions>
<self-uri xlink:href="https://www.glossa-journal.org/articles/10.16995/glossa.17737/"/>
<abstract>
<p>This paper explores whether there are universal trends for vowels that co-occur to share featural properties. The existence of various productive featural vowel harmony systems across the world&#8217;s languages suggests that the factors underlying harmony may be universal. An empirical prediction that follows from this proposal is that vowel co-occurrence in the world&#8217;s lexicons should be featurally organized. In corpus analyses of two cross-linguistic collections of phonologically transcribed lexicons (92 lexicons in the XPF corpus and 107 lexicons in the NorthEuraLex corpus), we find a cross-linguistic tendency for languages to over-represent pairs of identical vowels but no universal preference for height or backness harmony. We do, however, find some evidence that identity affects some vowels systematically more than others across languages, which indicates that vowel co-occurrence is sensitive to the phonetic properties of vowel categories in cross-linguistically generalizable ways. Ultimately, the lack of featural harmony and the over-representation of identity are consistent with the notion that the phonological organization of lexicons is subject to factors beyond local assimilation or phonetic and phonological well-formedness.</p>
</abstract>
</article-meta>
</front>
<body>
<sec>
<title>1 Introduction</title>
<sec>
<title>1.1 Overview</title>
<p>Researchers have argued that languages&#8217; lexicons over-represent phonologically licit sound sequences that are consistent with cognitive, phonological and phonetic constraints and under-represent sequences that violate these constraints (e.g., <xref ref-type="bibr" rid="B33">Frisch et al. 2004</xref>; <xref ref-type="bibr" rid="B78">Wilson 2006</xref>; <xref ref-type="bibr" rid="B75">Walter 2010</xref>). This paper investigates the extent to which vowel co-occurrence patterns universally reflect such pressures. Recent corpus research has found evidence for an identity bias<xref ref-type="fn" rid="n1">1</xref> in vowel co-occurrence patterns, but also that gradient vowel similarity does not have a strong or cross-linguistically consistent effect on lexical attestation (<xref ref-type="bibr" rid="B27">Doucette et al. 2024</xref>). The current corpus analysis thus tests whether the frequency of vowel sequences is universally sensitive to featural dimensions.</p>
<p>The existence of vowel harmony, a phonological constraint requiring vowels within a word to share certain features, suggests that vowels may be subject to universal pressures toward featural alignment (e.g. <xref ref-type="bibr" rid="B64">Rose &amp; Walker 2011</xref>; <xref ref-type="bibr" rid="B72">van der Hulst 2016</xref>). A diverse range of explanations for the emergence of productive vowel harmony rules posit the existence of universal and phonetically motivated pressures that favor harmony. These include accounts proposing that assimilatory processes like vowel-to-vowel coarticulation are a universal precursor to productive vowel harmony (e.g. <xref ref-type="bibr" rid="B58">Ohala 1994</xref>; <xref ref-type="bibr" rid="B21">Cole 2009</xref>), as well as formal grammatical accounts which posit the existence of universal harmony-inducing and featurally specific constraints on segmental co-occurrence (e.g. <xref ref-type="bibr" rid="B62">Prince &amp; Smolensky 1997</xref>; <xref ref-type="bibr" rid="B78">Wilson 2006</xref>; <xref ref-type="bibr" rid="B35">Goldsmith 1985</xref>). However, the phonemic organization of lexicons has also been argued to be subject to lexical pressures, including communicatively motivated pressures to avoid ambiguity in the lexicon (e.g. <xref ref-type="bibr" rid="B32">Flemming 2004</xref>; <xref ref-type="bibr" rid="B10">Blevins &amp; Wedel 2009</xref>; <xref ref-type="bibr" rid="B50">Mahowald et al. 2018</xref>). Such pressures may prevent the phonological organization of lexicons from being shaped by assimilatory pressures like those argued to underlie vowel harmony.</p>
<p>In parallel analyses of two large-scale corpora, 92 lexicons in the XPF corpus (<xref ref-type="bibr" rid="B19">Cohen Priva et al. 2021</xref>) and 107 Northern Eurasian languages in the NorthEuraLex corpus (<xref ref-type="bibr" rid="B26">Dellert &amp; J&#228;ger 2020</xref>), the current paper investigates whether the featural dimensions of height or backness constrain vowel co-occurrence in cross-linguistically systematic ways. We find evidence that vowel identity is over-represented in the world&#8217;s languages but no evidence for a backness- or height-harmony bias. We also find some evidence that the extent of vowel identity varies across vowel categories in a cross-linguistically systematic way. The findings call into question the notion that assimilatory pressures like coarticulation or feature-specific alignment constraints can alone account for the phonological organization of lexicons.</p>
</sec>
<sec>
<title>1.2 Vowel harmony and its cognitive and phonetic grounding</title>
<p>The term &#8216;vowel harmony&#8217; refers to a phonological constraint in some languages whereby all vowels within a word must align with respect to some phonological feature or property (e.g., <xref ref-type="bibr" rid="B4">Aoki 1968</xref>; <xref ref-type="bibr" rid="B35">Goldsmith 1985</xref>; <xref ref-type="bibr" rid="B63">Ringen &amp; Vago 1998</xref>; <xref ref-type="bibr" rid="B40">Hayes &amp; Londe 2006</xref>; <xref ref-type="bibr" rid="B64">Rose &amp; Walker 2011</xref>; <xref ref-type="bibr" rid="B72">van der Hulst 2016</xref>). Hungarian, for example, exhibits productive backness harmony: roots are highly unlikely to have vowels with opposite places of articulation along the dimension of backness (e.g. <xref ref-type="bibr" rid="B40">Hayes &amp; Londe 2006</xref>). While there are lexicalized exceptions, the productivity of Hungarian backness harmony is evident in its morpho-phonological alternations: when a suffix is appended to a stem, the vowel of the suffix adopts the backness feature of that stem; the dative morpheme /-n&#603;k/, for example, surfaces as [-n&#603;k] after stems with front vowels, and as [-nak] after stems with back vowels (<xref ref-type="bibr" rid="B40">Hayes &amp; Londe 2006</xref>). Another frequently attested form of vowel harmony is Advanced Tongue Root harmony (ATR), common in African languages, whereby vowels within a word must agree in the position of the tongue root&#8212;either advanced ([+ATR]) or retracted ([-ATR])&#8212;typically affecting vowel height and contributing to systematic alternations across morphological paradigms (e.g. <xref ref-type="bibr" rid="B15">Casali 2008</xref>). Other forms of productive vowel harmony include height harmony, found in some Bantu languages (e.g. <xref ref-type="bibr" rid="B43">Hyman 2003</xref>), and a combination of rounding harmony and backness harmony, found in Turkish (e.g. 
<xref ref-type="bibr" rid="B58">Ohala 1994</xref>), among others (see <xref ref-type="bibr" rid="B4">Aoki 1968</xref> and <xref ref-type="bibr" rid="B64">Rose &amp; Walker 2011</xref> for comprehensive typological overviews of harmony systems). While phonological harmony systems are typically not exceptionless, with some languages having lexicalized violations (<xref ref-type="bibr" rid="B29">Finley 2010</xref>; <xref ref-type="bibr" rid="B5">Archangeli &amp; Pulleyblank 2007</xref>) or loanwords that deviate from the harmony pattern (<xref ref-type="bibr" rid="B38">Harrison et al. 2002</xref>), they are generally synchronically productive (e.g. <xref ref-type="bibr" rid="B4">Aoki 1968</xref>; <xref ref-type="bibr" rid="B40">Hayes &amp; Londe 2006</xref>), and all have in common some non-adjacent dependency between a surface property of one vowel and that of another vowel within a particular domain like a word (<xref ref-type="bibr" rid="B4">Aoki 1968</xref>; <xref ref-type="bibr" rid="B64">Rose &amp; Walker 2011</xref>).</p>
<p>While languages with productive vowel harmony comprise a minority of the world&#8217;s languages, they are nevertheless frequent and typologically diverse (e.g. <xref ref-type="bibr" rid="B64">Rose &amp; Walker 2011</xref>). Productive vowel disharmony, in contrast, is rare (e.g. <xref ref-type="bibr" rid="B52">Martin &amp; White 2021</xref>). The fact that languages frequently exhibit vowel harmony rules suggests that the existence of these rules is not an accident of historical change but rather that there exist at least some universal pressures or constraints that underlie harmony.</p>
<p>The claim that vowel harmony is a manifestation of universal pressures is consistent with research that describes harmony as phonetically grounded (e.g. <xref ref-type="bibr" rid="B58">Ohala 1994</xref>; <xref ref-type="bibr" rid="B28">Fagyal et al. 2003</xref>; <xref ref-type="bibr" rid="B48">Linebaugh 2007</xref>; <xref ref-type="bibr" rid="B21">Cole 2009</xref>). Specifically, vowel harmony has been argued to have a phonetic underpinning in the form of vowel-to-vowel coarticulation, an inevitable (and thus universal) result of articulatory overlap between adjacent or nearly adjacent sounds that renders those sounds more acoustically similar than they would be if they occurred separately (e.g. <xref ref-type="bibr" rid="B58">Ohala 1994</xref>; <xref ref-type="bibr" rid="B49">Magen 1997</xref>; <xref ref-type="bibr" rid="B31">Flego &amp; Forrest 2021</xref>; <xref ref-type="bibr" rid="B18">Cohen Priva &amp; Strand 2023</xref>). To account for the phonologization and spread of vowel harmony across the lexicon, some researchers have proposed that, given some degree of phonetic overlap between a pair of vowels, a listener may reinterpret an ambiguous vowel token as belonging to a different vowel category, one that is in harmony with the other vowel (e.g. <xref ref-type="bibr" rid="B58">Ohala 1994</xref>; <xref ref-type="bibr" rid="B6">Beddor 2009</xref>; <xref ref-type="bibr" rid="B10">Blevins &amp; Wedel 2009</xref>). Such word-specific reanalysis is argued to spread to other words or phonological contexts incrementally (e.g. <xref ref-type="bibr" rid="B58">Ohala 1994</xref>) or rapidly via analogy (e.g. <xref ref-type="bibr" rid="B43">Hyman 2003</xref>).</p>
<p>Some studies have also posited inductive learning biases in favor of harmony relative to disharmony (e.g. <xref ref-type="bibr" rid="B51">Martin &amp; Peperkamp 2020</xref>; <xref ref-type="bibr" rid="B30">Finley &amp; Badecker 2008</xref>). Martin &amp; Peperkamp (<xref ref-type="bibr" rid="B51">2020</xref>), for example, found that when exposed to an ambiguous artificial grammar input, speakers of English are more likely to infer a harmony rule than a disharmony rule. The putative asymmetry in learning between harmony and disharmony has been attributed both to the claim that learners&#8217; knowledge of harmony&#8217;s phonetic naturalness constitutes a substantive bias (e.g. <xref ref-type="bibr" rid="B51">Martin &amp; Peperkamp 2020</xref>) and to the claim that, as an abstract rule, harmony is formally more concise than disharmony (e.g. <xref ref-type="bibr" rid="B30">Finley &amp; Badecker 2008</xref>). Moreover, infants have been shown to exhibit biases in favor of detecting vowel harmony patterns in acoustic inputs (e.g. <xref ref-type="bibr" rid="B59">Omane et al. 2024</xref>; <xref ref-type="bibr" rid="B66">Sol&#225;-Llonch &amp; Sundara 2025</xref>). Sol&#225;-Llonch &amp; Sundara (<xref ref-type="bibr" rid="B66">2025</xref>) in fact show that even 4-month-old infants can detect vowel harmony in acoustic inputs, meaning that harmony patterns are perceptually accessible even to naive learners. However, the ostensible learning advantage bestowed by harmony does not manifest in all learning conditions. For example, Huang &amp; Do (<xref ref-type="bibr" rid="B42">2023</xref>) find a bias in favor of harmony only under conditions of high variability in the training input, which suggests that disharmony is not intrinsically difficult to learn.</p>
<p>There is also evidence that the within-word, inter-segmental redundancy that results from harmony helps learners of an unfamiliar language segment speech (e.g. <xref ref-type="bibr" rid="B68">Suomi 1983</xref>; <xref ref-type="bibr" rid="B74">Vroomen et al. 1998</xref>; <xref ref-type="bibr" rid="B69">Suomi et al. 1997</xref>; <xref ref-type="bibr" rid="B55">Mintz et al. 2018</xref>; <xref ref-type="bibr" rid="B54">Mersad &amp; Nazzi 2011</xref>). Specifically, the fact that vowel transitions within words are more predictable than those between words serves as a robust cue for word boundaries. Beyond harmony, there is evidence that sounds are distributed in words in a way that facilitates spoken word recognition and lexical disambiguation in perception (e.g. <xref ref-type="bibr" rid="B24">Dautriche et al. 2017</xref>; <xref ref-type="bibr" rid="B45">King &amp; Wedel 2020</xref>). The mutually informative word-internal cues between vowels bestowed by vowel harmony systems or harmony biases might make online lexical access easier, since they reduce the listener&#8217;s reliance on any particular sound in the word to retrieve the word form from the speech signal. In short, as a form of inter-segmental redundancy, harmony may facilitate robust lexical retrieval in a noisy channel.</p>
</sec>
<sec>
<title>1.3 Soft constraints on vowel co-occurrence</title>
<p>Given the evidence that harmonious vowel sequences are both phonetically and perceptually motivated at the word level, it appears plausible that vowel harmony is a universal &#8220;soft constraint&#8221;: a constraint that manifests not merely as a near-exceptionless or productive rule but as a statistical over-representation in the lexicon (e.g. <xref ref-type="bibr" rid="B33">Frisch et al. 2004</xref>) or as a graded bias in speakers&#8217; acceptability judgments of phonological patterns in perceived speech (e.g. <xref ref-type="bibr" rid="B23">Coleman &amp; Pierrehumbert 1997</xref>; <xref ref-type="bibr" rid="B39">Hay et al. 2004</xref>; <xref ref-type="bibr" rid="B12">Breiss &amp; Albright 2022</xref>). An extensive body of work has argued that phonological grammars should shape gradient statistical patterns in the lexicon (e.g. <xref ref-type="bibr" rid="B11">Boersma &amp; Hayes 2001</xref>; <xref ref-type="bibr" rid="B33">Frisch et al. 2004</xref>; <xref ref-type="bibr" rid="B78">Wilson 2006</xref>; <xref ref-type="bibr" rid="B1">Albright 2009</xref>), or that the same universal constraints that produce near-exceptionless segmental dependencies in some languages are likely to appear as exceptions or statistical trends in others (e.g. <xref ref-type="bibr" rid="B82">Zuraw 2000</xref>; <xref ref-type="bibr" rid="B41">Hayes &amp; Wilson 2008</xref>).</p>
<p>In the current study, we test whether vowel co-occurrence patterns in the world&#8217;s lexicons universally reflect featural vowel harmony. This study fits within an extensive body of work that uses frequency distributions of sound sequences in languages&#8217; lexicons to infer languages&#8217; phonological biases or constraints (e.g. <xref ref-type="bibr" rid="B33">Frisch et al. 2004</xref>; <xref ref-type="bibr" rid="B61">Pozdniakov &amp; Segerer 2007</xref>; <xref ref-type="bibr" rid="B75">Walter 2010</xref>; <xref ref-type="bibr" rid="B67">Stanton 2021</xref>). Broadly, the logic of this approach is that if a sound or sequence relevant to a particular property occurs more often than would be expected relative to some baseline (in which the effect of a potential bias is absent by design while other aspects of the language are preserved or controlled for), this over-representation can be taken as evidence that the language in question has a soft constraint or latent statistical bias in favor of that property. Frisch et al. (<xref ref-type="bibr" rid="B33">2004</xref>), for example, found that, compared to a baseline where sounds can combine freely, Arabic roots over-represent dissimilar non-adjacent consonant pairs. They also found that the degree of pairs&#8217; over- or under-representation is a function of consonants&#8217; gradient similarity. Subsequent studies have also found evidence that languages&#8217; lexicons exhibit statistical avoidance of similar consonants (e.g. <xref ref-type="bibr" rid="B75">Walter 2010</xref>; <xref ref-type="bibr" rid="B61">Pozdniakov &amp; Segerer 2007</xref>; <xref ref-type="bibr" rid="B27">Doucette et al. 2024</xref>).</p>
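<p>One common way to make this baseline logic concrete in corpus phonology is an observed/expected (O/E) ratio. The following is a minimal sketch under a free-combination baseline, not the statistical model used in the present study, and the symbols are introduced here for illustration only. Let <italic>O<sub>ij</sub></italic> be the observed number of vowel pairs in which vowel <italic>i</italic> is followed by vowel <italic>j</italic>, let <italic>n<sub>i&#183;</sub></italic> and <italic>n<sub>&#183;j</sub></italic> be the number of pairs with <italic>i</italic> in first position and with <italic>j</italic> in second position, respectively, and let <italic>N</italic> be the total number of vowel pairs in the lexicon. If vowels combined freely, the expected count and the resulting ratio would be
<disp-formula><tex-math>E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{N}, \qquad \mathrm{O/E}_{ij} = \frac{O_{ij}}{E_{ij}}</tex-math></disp-formula>
so that a ratio above 1 indicates over-representation of the pair relative to free combination, and a ratio below 1 indicates under-representation.</p>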
<p>Research on vowel co-occurrence patterns, however, has not reported clear cross-linguistic evidence of featurally constrained co-occurrence biases. In fact, evidence from recent studies suggests that pressures that restrict vowel co-occurrence are qualitatively distinct from those that restrict consonant co-occurrence, and that vowel co-occurrence does not appear to be driven by similarity avoidance of the kind associated with the Obligatory Contour Principle (OCP) (e.g. <xref ref-type="bibr" rid="B75">Walter 2010</xref>; <xref ref-type="bibr" rid="B27">Doucette et al. 2024</xref>). Walter (<xref ref-type="bibr" rid="B75">2010</xref>), for example, compared vowels&#8217; and consonants&#8217; co-occurrence biases in Spanish and Croatian lexicons and found no evidence of a harmony bias for either consonants or vowels, but did find that consonants were uniquely subject to a similarity avoidance constraint. Among studies focusing exclusively on vowel co-occurrence patterns, Alderete &amp; Finley (<xref ref-type="bibr" rid="B3">2016</xref>) found that sequences of identical vowels were over-represented while sequences of similar but non-identical vowels were under-represented in four Polynesian languages. Stanton (<xref ref-type="bibr" rid="B67">2021</xref>) similarly found that for Ngbaka, sequences of identical vowels were over-represented, but also that backness and height harmony were significant predictors of pair counts, independent of identity. The most cross-linguistically comprehensive analysis of vowel co-occurrence patterns and their difference from consonants, to our knowledge, is an analysis of 107 Northern Eurasian languages by Doucette et al. (<xref ref-type="bibr" rid="B27">2024</xref>) that compares vowel and consonant co-occurrence restrictions.
Consistent with prior work on fewer languages like Walter (<xref ref-type="bibr" rid="B75">2010</xref>), these authors find evidence for a cross-linguistically consistent anti-similarity bias for consonants, while they find no effect of vowel similarity on vowel co-occurrence. They do, however, also report some evidence for a universal bias in favor of vowel identity.</p>
<p>The finding that vowel identity is cross-linguistically over-represented while similarity has no effect on vowel co-occurrence is consistent with work suggesting that copying operations are distinct from the kinds of partial assimilatory processes that produce featural harmony (e.g. <xref ref-type="bibr" rid="B34">Gallagher &amp; Coon 2009</xref>; <xref ref-type="bibr" rid="B44">Kawahara 2007</xref>). In some analyses, identical segments arise primarily through morphological processes such as reduplication (<xref ref-type="bibr" rid="B53">McCarthy 1995</xref>), rather than through gradient assimilation of features like [back] or [high]. Other work argues for a phonological preference for identity itself (<xref ref-type="bibr" rid="B34">Gallagher &amp; Coon 2009</xref>; <xref ref-type="bibr" rid="B67">Stanton 2021</xref>), or proposes copying mechanisms that do not require reference to fine-grained featural structure (<xref ref-type="bibr" rid="B44">Kawahara 2007</xref>). That is, copying-based identity creates an exact repetition of a vowel (e.g., /i/ &#8594; [i&#8230;i]) without requiring that the repeated vowel share just one feature such as [+high] or [+front], whereas assimilatory harmony operates by aligning individual features (e.g., a suffix vowel becoming [+back] to match the stem). Thus, if copying processes contribute to lexical structure, identity may surface independently of any pressure for featural similarity. A separate possibility is that assimilatory influences are themselves nonlinear, such that identical vowels surpass a similarity threshold that triggers stronger alignment than partially similar vowels (e.g. <xref ref-type="bibr" rid="B34">Gallagher &amp; Coon 2009</xref>; <xref ref-type="bibr" rid="B76">Wayment 2009</xref>). 
In any case, the widespread over-representation of identical vowel pairs suggests that identity may constitute a distinct organizing principle in phonological systems, rather than simply being the endpoint of featural harmony processes.</p>
<p>Results like those reported in Doucette et al. (<xref ref-type="bibr" rid="B27">2024</xref>) and Walter (<xref ref-type="bibr" rid="B75">2010</xref>) might indicate that substantive properties of vowels do not measurably restrict vowel co-occurrence in a cross-linguistically consistent way. However, in Doucette et al. (<xref ref-type="bibr" rid="B27">2024</xref>), vowel similarity was operationalized as a continuous value aggregating featural similarity across a variety of featural dimensions or natural classes. Under this approach, a pair of vowels is considered to be more similar if the vowels share more feature specifications or natural classes. It is possible that such measures of similarity, which aggregate over many featural dimensions, obscure featurally constrained variability in vowel co-occurrence patterns that might be consistent across languages. Unlike Doucette et al. (<xref ref-type="bibr" rid="B27">2024</xref>), the current study tests whether there are systematic biases constraining vowel co-occurrence along the featural dimensions of backness or height. In addition, we explore whether identity biases vary by vowel category in a cross-linguistically consistent manner, in order to examine whether there is any evidence that substantive properties of vowels might mediate their co-occurrence patterns in universal ways. In a supplementary analysis, we also test whether any detected co-occurrence constraints apply to local vowel pairs or more generally to vowel pairs separated by one or more syllables. Across all analyses we examine two large-scale cross-linguistic datasets: the NorthEuraLex corpus previously used by Doucette et al. (<xref ref-type="bibr" rid="B27">2024</xref>) as well as 92 languages&#8217; lexicons from the XPF corpus.</p>
</sec>
<sec>
<title>1.4 Global pressures on the lexicon</title>
<p>The frequency distribution of sound sequences in a lexicon not only provides evidence for latent phonetic pressures or phonological constraints, but it can itself reflect global pressures on the lexicon to preserve its communicative efficacy (e.g. <xref ref-type="bibr" rid="B60">Piantadosi et al. 2012</xref>; <xref ref-type="bibr" rid="B45">King &amp; Wedel 2020</xref>; <xref ref-type="bibr" rid="B71">Trott &amp; Bergen 2020</xref>). Some researchers have argued that lexicons and sound inventories contain measurable evidence of ambiguity reduction. Trott &amp; Bergen (<xref ref-type="bibr" rid="B71">2020</xref>), for example, argue that languages exhibit fewer homophones than would be predicted by a phonotactic baseline alone. Flemming (<xref ref-type="bibr" rid="B32">2004</xref>) further suggests that vowel inventories are themselves organized to preserve distinctiveness between vowel categories. There is also evidence that natural phonological changes are sensitive to the structure of the lexicon. For example, several accounts argue that homophony avoidance is a factor in language change (e.g. <xref ref-type="bibr" rid="B25">De Smet &amp; Rosseel 2023</xref>) and that languages are more likely to undergo the loss of phonological contrasts between sounds whose merger would collapse fewer lexical distinctions (e.g. <xref ref-type="bibr" rid="B10">Blevins &amp; Wedel 2009</xref>; <xref ref-type="bibr" rid="B77">Wedel et al. 2013</xref>; <xref ref-type="bibr" rid="B37">Gurevich 2013</xref>; <xref ref-type="bibr" rid="B80">Yin &amp; White 2018</xref>). Some work has also found that (language-specific) informativity of particular sound classes itself predicts which of these sounds will undergo natural sound changes like lenition (e.g. <xref ref-type="bibr" rid="B16">Cohen Priva 2017</xref>). 
Because vowel assimilation is an information-reducing process (<xref ref-type="bibr" rid="B36">Goldsmith &amp; Riggle 2012</xref>), it could be at odds with putative pressures to preserve distinctiveness among lexical items.</p>
<p>The claim that lexicons are shaped by factors beyond local phonetic and phonotactic pressures is not new. Zipf&#8217;s Law of Abbreviation, for example, famously states that frequent word forms are likely to be shorter in length (<xref ref-type="bibr" rid="B81">Zipf 1945</xref>), and this generalization appears to be universal (e.g. <xref ref-type="bibr" rid="B7">Bentz &amp; i Cancho 2016</xref>; <xref ref-type="bibr" rid="B47">Linders &amp; Louwerse 2023</xref>). While it is contested to what extent this property of lexicons actually constitutes communicative optimization, or can emerge at random (e.g. see <xref ref-type="bibr" rid="B14">Caplan et al. 2020</xref>), it is nevertheless apparent that factors beyond local phonological interactions of sounds can influence the attestation of forms in lexicons, and in cross-linguistically consistent ways. Importantly, such pressures on lexicons need not be understood as teleological. The attestation of forms in lexicons can be shaped by historical change, wherein the retention of certain forms in the lexicon over time is constrained by a range of factors beyond local segmental interactions, like the cognitive biases of its users, learnability, communicative efficiency, or perceptual salience, among others (e.g. <xref ref-type="bibr" rid="B9">Blevins 2004</xref>; <xref ref-type="bibr" rid="B10">Blevins &amp; Wedel 2009</xref>). To the extent that these broader pressures are themselves universal or prevalent, they could complicate the prediction that lexicons should universally reflect phenomena like vowel harmony that may be desirable phonotactically.</p>
</sec>
</sec>
<sec>
<title>2 Current study</title>
<sec>
<title>2.1 Overview</title>
<p>This study investigates whether vowel co-occurrence is universally structured along featural dimensions. Specifically, we use Bayesian negative binomial regression to test whether languages&#8217; frequency distributions of vowel pairs are biased in favor of height and backness harmony, as well as segmental identity. Each analysis is replicated on two distinct corpora, 92 languages in the XPF corpus (<xref ref-type="bibr" rid="B19">Cohen Priva et al. 2021</xref>), as well as 107 Northern Eurasian languages in the NorthEuraLex corpus (<xref ref-type="bibr" rid="B26">Dellert &amp; J&#228;ger 2020</xref>), both of which are described in greater detail in the Materials section.</p>
<p>There are many dimensions along which a pair of vowels could, in principle, exhibit harmony or featural alignment. We choose to focus on the dimensions of height and backness, because speakers of many languages use at least the first and second formants (the acoustic correlates of height and backness, respectively) to distinguish vowels (e.g. <xref ref-type="bibr" rid="B57">Oganian et al. 2023</xref>). It is of course possible that other phonological dimensions (e.g. ATR, vowel rounding) may influence vowel co-occurrence patterns across languages. Given that height harmony and backness harmony are both well attested, and constitute at least loosely orthogonal articulatory and perceptual dimensions, they are a useful starting point for testing in a coarse-grained manner whether co-occurrence patterns might be featurally constrained in a language-general manner. To make our modeling approach sufficiently flexible to control for dependencies along a variety of dimensions, we include vowel-specific random effects (described in more detail in the Model section), but we ultimately leave more fine-grained and explicit testing of biases along other vowel dimensions to future work.</p>
</sec>
<sec>
<title>2.2 Materials and data</title>
<p>To evaluate the universality of potential vowel-to-vowel dependencies consistent with harmony and identity, the current study analyzes phonologically transcribed word lists for many languages from a diverse range of language families. We look for converging evidence from two resources with separate grapheme-to-phoneme pipelines: the XPF corpus (<xref ref-type="bibr" rid="B19">Cohen Priva et al. 2021</xref>) and NorthEuraLex (<xref ref-type="bibr" rid="B26">Dellert &amp; J&#228;ger 2020</xref>). The latter was the source of data for the recent investigation of consonant and vowel co-occurrence in 107 Northern Eurasian languages by Doucette et al. (<xref ref-type="bibr" rid="B27">2024</xref>).</p>
<sec>
<title>2.2.1 XPF corpus</title>
<p>The XPF corpus (<xref ref-type="bibr" rid="B19">Cohen Priva et al. 2021</xref>) is a grapheme-to-phoneme (G2P) engine that provides phonemic representations for 201 languages using rule-based G2P mappings. Specifically, XPF provides for each language a hand-specified G2P rule set and scripts that apply these rules to large word lists. For words or substrings that could not be parsed by the rules, the translator inserts a placeholder symbol. We exclude languages in which more than 2% of tokens contain such untranslatable material. The G2P rules in XPF are fully documented and based on descriptive linguistic sources for each language. The orthographies of the languages in the corpus exhibit high transparency; that is, they make it possible to deduce the phonemic makeup of words from their written form (e.g. like Spanish, and unlike English). The XPF corpus does not provide word frequencies, and we therefore supplement it with a variety of corpora that do provide word counts for the languages in the corpus. These include corpora collected by the Linguistic Data Consortium (<xref ref-type="bibr" rid="B8">Bills et al. 2016</xref>; only Georgian in the current study), OpenSubtitles (<xref ref-type="bibr" rid="B70">Tiedemann 2016</xref>; based on subtitles in the languages), and the Cr&#250;bad&#225;n corpus (<xref ref-type="bibr" rid="B65">Scannell 2007</xref>; based on web-scraped written text from publicly available web resources). These are described separately in the following sections.</p>
<p>The combination of the XPF corpus&#8217; G2P rules and the word frequency corpora makes it possible to detect gradient distributional tendencies that might emerge at the aggregate level within and across many languages. XPF thus provides a reproducible foundation for cross-linguistic phonological analysis in the absence of broad-coverage phonemically transcribed corpora. Previous research has shown that segment-level distributional properties tend to be relatively stable across corpora (<xref ref-type="bibr" rid="B20">Cohen Priva et al. 2020</xref>); the specific choice of corpus is thus unlikely to have a substantial effect on results based on such distributional properties.</p>
<p>One concern that might arise due to the usage of G2P resources is the inescapable presence of noise in the data. The XPF corpus itself attempts to address this by providing information about words present in the word frequency corpus that do not appear to follow the G2P conventions of the language (e.g. &#8220;q&#8221; in a language that doesn&#8217;t use it). Furthermore, though noise is likely still present in the input, it is not apparent a priori that such noise is likely to skew the results in any specific direction, and certainly not in a way that an abundance of data across many languages would not be able to overcome. Nevertheless, in the current study, we exclude languages that have more than 2% untranslated tokens, to keep any potentially systematic transcription failure from biasing the sample of attested forms (see Supplementary Appendix, <xref ref-type="table" rid="T2">Table 2</xref>).</p>
</sec>
<sec>
<title>2.2.2 Cr&#250;bad&#225;n word lists</title>
<p>The Cr&#250;bad&#225;n project (<xref ref-type="bibr" rid="B65">Scannell 2007</xref>) is the resource from which the lexical statistics for a large majority of the languages transcribed by the XPF corpus in this study (84 out of 92) are derived. It was developed to provide textual resources for low-resource languages and compiles publicly available text data across many languages. Scannell (<xref ref-type="bibr" rid="B65">2007</xref>) reports using a recursive web-crawler to attempt to obtain an exhaustive sample of words written in a given language, and the number of documents varies by language (see Supplementary Appendix, <xref ref-type="table" rid="T4">Table 4</xref>).<xref ref-type="fn" rid="n2">2</xref> Given that many languages in the Cr&#250;bad&#225;n project are under-resourced, common source corpora for obtaining word lists are documents that are translated into multiple languages, most commonly Bible translations, the UN Declaration of Human Rights, as well as all Wikipedia pages if available in the language.</p>
<p>While variability in source corpora can be a source of noise, the utility of this resource for large-scale corpus analysis is nevertheless supported by prior work. Cohen Priva et al. (<xref ref-type="bibr" rid="B20">2020</xref>) compared phoneme-level unigram and bigram frequencies between Cr&#250;bad&#225;n word lists and corpora that more closely reflect spoken language, finding that systematic differences at such local levels were relatively uncommon, which mitigates the potential problem that sources for some languages are not representative of language use. Cohen Priva &amp; Jaeger (<xref ref-type="bibr" rid="B17">2018</xref>) also find that estimates of parameters like segment informativity derived from small sub-samples of corpus data closely align with full corpora, which also mitigates the potential for spurious effects as a result of variability in amount of data. In the current work, we only consider type frequencies of sub-lexical configurations and not token frequencies. Counts of phonological patterns based on their type frequencies are less likely to be biased by the source corpus than those based on raw token frequencies (e.g. certain words might be over-represented only in certain texts). Using type-frequencies thus prevents phonological count measures from being skewed by corpus-specific biases in lexical frequencies.</p>
</sec>
<sec>
<title>2.2.3 OpenSubtitles and LDC corpora</title>
<p>In addition to Cr&#250;bad&#225;n, for a smaller subset of languages in our study, we used word lists derived from the Linguistic Data Consortium (LDC) and OpenSubtitles corpora and transcribed them using the XPF corpus G2P rules. OpenSubtitles (<xref ref-type="bibr" rid="B70">Tiedemann 2016</xref>) is a large-scale multilingual corpus composed of movie and television subtitles. It is characterized by a more informal and conversational register, and thus potentially contains more colloquial word forms. We use OpenSubtitles word lists for Hungarian, Turkish, Korean, Spanish, Greek, Bulgarian and Malayalam.</p>
<p>LDC (<xref ref-type="bibr" rid="B8">Bills et al. 2016</xref>) is a curated repository of linguistic datasets primarily intended for research and development in computational linguistics and language technology. These corpora offer relatively high-quality transcription standards, and their linguistic coverage tends to reflect more formal or structured registers of the language. Of the 92 languages in the current study, only the Georgian word list comes from the LDC corpus.</p>
</sec>
<sec>
<title>2.2.4 NorthEuraLex</title>
<p>To ensure the robustness of our results, we replicate each analysis on the NorthEuraLex corpus (<xref ref-type="bibr" rid="B26">Dellert &amp; J&#228;ger 2020</xref>). NorthEuraLex 0.9 provides standardized phonemic transcriptions for 1,016 shared concepts (e.g. culturally non-specific concepts such as body parts and animals) across 107 Northern Eurasian languages from 21 language families. These include the Uralic, Indo-European, Turkic, Mongolic, Tungusic, and Caucasian families, as well as isolates such as Basque and Burushaski (see Supplementary Appendix, <xref ref-type="table" rid="T3">Table 3</xref> for a full breakdown). This corpus was the subject of the cross-linguistic analysis in Doucette et al. (<xref ref-type="bibr" rid="B27">2024</xref>).</p>
<p>NorthEuraLex differs from XPF and related corpora in several ways. First, its word lists are limited to a fixed list of 1,016 shared concepts (such as animal names, kinship terms, and body parts), with each language contributing wordforms only for those concepts. This contrasts with the XPF corpus, which includes multiple thousands of words per language drawn from general-purpose web corpora. Thus, while NorthEuraLex has less data per language, its word lists are potentially less subject to cross-linguistic variability in source corpora. The transcriptions in NorthEuraLex are also generated independently of XPF&#8217;s G2P rules. Most phonemic forms are derived automatically via language-specific orthography-to-IPA transducers developed from grammatical descriptions (see Dellert &amp; J&#228;ger (<xref ref-type="bibr" rid="B26">2020</xref>) for a more detailed description). For a subset of languages where such automation was infeasible, e.g. due to non-transparent orthographies, enhanced dictionary sources were used. All transcriptions are stored in standardized IPA format, and translation rules are publicly documented at <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.northeuralex.org">www.northeuralex.org</ext-link>. Also, while XPF and NorthEuraLex contain some forms with affixation as well as bare forms, unlike XPF, NorthEuraLex is curated to mostly contain lemmas, so including analyses with this corpus should at least partially guard against spurious results from a lack of normalization for inflection. Ultimately, while NorthEuraLex has a smaller set of observations and more restricted areal scope (Northern Eurasia only, including mostly Indo-European and Uralic languages), converging evidence for phonological biases across both XPF and NorthEuraLex would suggest that observed phonological biases are unlikely to be artifacts of any data source or transcription scheme.</p>
</sec>
<sec>
<title>2.2.5 PHOIBLE</title>
<p>The XPF and NorthEuraLex corpora offer phonemic transcriptions but not sub-phonemic featural information. To assign featural labels to phonemes in both corpora, we rely on PHOIBLE (<xref ref-type="bibr" rid="B56">Moran &amp; McCloy 2019</xref>), a publicly available database containing phonological inventory and corresponding feature data for 2,186 distinct languages.</p>
</sec>
</sec>
<sec>
<title>2.3 Data filtering</title>
<p>The 92 languages from the XPF corpus were selected out of a total of 201 languages. Given that grapheme-to-phoneme transcription can be noisy, we aimed to maximize the reliability of our data by filtering out languages whose phonologically transcribed word lists are likely to be unreliable. For every language in the XPF corpus (<xref ref-type="bibr" rid="B19">Cohen Priva et al. 2021</xref>), we applied the following exclusion procedure:</p>
<list list-type="order">
<list-item><p>We initially exclude XPF words that have a token frequency less than 5, or with a token frequency smaller than the word ranked 10,000th by frequency.</p></list-item>
<list-item><p>To ensure that lexicons are of sufficient quality and representative of the corpus from which they are derived, we eliminate languages with more than 2% untranslated words in each corpus. This threshold is set low to minimize the risk that transcription failures systematically bias a language&#8217;s segmental distributions in a particular direction. This filtering step alone rules out 100 languages.</p></list-item>
<list-item><p>We eliminate languages with fewer than 2,500 word types in the original dataset, in order to ensure that the word list approximates a reliable and exhaustive sample of language use. This filtering step rules out seven more languages (18 in total, including those ruled out by the previous step).</p></list-item>
<list-item><p>We eliminate all languages marked in the XPF corpus metadata as having compromised vowel transcriptions, as previously judged by the trained linguists who curated the XPF corpus (this information is available on the corpus website). This filtering step applies to 1 additional language (16 in total, including those ruled out by the previous steps).</p></list-item>
<list-item><p>Lastly, we also eliminate Vietnamese, because it has only 46 multisyllabic words in the dataset. The language with the next lowest count is Cof&#225;n, with 562 multisyllabic words, making Vietnamese qualitatively an outlier.</p></list-item>
</list>
<p>An exhaustive breakdown of which languages met which exclusion criterion is shown in Supplementary Appendix, <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<p>For NorthEuraLex, we do not filter languages based on the number of word types in the original dataset, because all languages have fewer than 2,500 total words. The fact that the data consist of shared concepts independently helps guarantee that these corpora, while small, are representative of language use and are comparable across languages.</p>
<p>For the remaining phonemically transcribed words across the XPF languages and NorthEuraLex languages, we employ the following modifications to transcriptions:</p>
<list list-type="order">
<list-item><p>We eliminate lexical items that have diphthongs or adjacent vowels not separated by a consonant. This is to avoid conflating non-adjacent dependencies between separate vowels with effects that might arise from direct interactions between vowels (e.g. hiatus avoidance). Diphthongs are omitted because a single vowel representing multiple vowel qualities poses a problem for classifying a vowel sequence as adhering to a particular harmony pattern.</p></list-item>
<list-item><p>We eliminate any monosyllabic words since vowel-vowel pairs are necessary to carry out the analysis.</p></list-item>
<list-item><p>We are primarily interested in vowels&#8217; spectral properties, so we ignore vowel length and replace long vowels with their short counterparts. We also omit vowel devoicing diacritics for the same reason.</p></list-item>
<list-item><p>We exclude any two-word lexical entries (only applicable to the NorthEuraLex data).</p></list-item>
</list>
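<p>The word-level steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the pipeline used in the study: the five-vowel set <monospace>VOWELS</monospace> and the function names are hypothetical, words are assumed to be already split into segments, and the exclusion of two-word entries is not shown.</p>
<code language="python"><![CDATA[
```python
import unicodedata

# Hypothetical toy vowel set; the study takes segment classes from PHOIBLE.
VOWELS = {"a", "e", "i", "o", "u"}
LENGTH_MARK = "\u02d0"     # IPA length mark
VOICELESS_RING = "\u0325"  # combining ring below (devoicing diacritic)


def normalize_segment(seg):
    """Remove length and devoicing marks from a single segment."""
    decomposed = unicodedata.normalize("NFD", seg)
    kept = "".join(ch for ch in decomposed
                   if ch not in (LENGTH_MARK, VOICELESS_RING))
    return unicodedata.normalize("NFC", kept)


def clean_word(segments):
    """Return the normalized segment list, or None if the word is excluded."""
    segs = [normalize_segment(s) for s in segments]
    is_vowel = [s in VOWELS for s in segs]
    # Exclude words containing adjacent vowels (hiatus, or diphthongs
    # transcribed as vowel sequences).
    if any(a and b for a, b in zip(is_vowel, is_vowel[1:])):
        return None
    # Exclude monosyllables: at least two vowels are needed to form a pair.
    if sum(is_vowel) < 2:
        return None
    return segs
```
]]></code>
<p>Under these assumptions, a form with a long vowel is kept with the length mark removed, while a form with adjacent vowels or with a single vowel is dropped.</p>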
<p>Across both datasets, filtering removed roughly one quarter of lexical items on average (25.8% in XPF; 29.0% in NorthEuraLex), with mean post-filter lexicon sizes remaining substantial (4,632 and 763 types per language, respectively). No language was excluded based on these criteria, and even the smallest post-filter lexicons (1,181 types in XPF; 200 in NorthEuraLex) were sufficiently large for the vowel co-occurrence analyses because observations are vowel bigrams extracted from all multi-vowel lexical items, and not the lexical items themselves (see Supplementary Appendix, <xref ref-type="table" rid="T2">Tables 2</xref> and <xref ref-type="table" rid="T3">3</xref> for a breakdown of word counts and vowel pair counts by language). Because not all symbols across the two datasets are compatible with PHOIBLE, we conduct corpus-specific manual data cleaning for both the XPF and NorthEuraLex corpora to ensure that the transcription conventions are compatible with the symbols used in the PHOIBLE dataset. We note that nasal contrasts between vowels were collapsed in the NorthEuraLex pipeline but not in our XPF pipeline. For our purposes, this amounts to a difference in whether oral vowels and their nasal counterparts are treated as identical or not. More generally, &#8220;identity&#8221; in the current study should be understood simply as two vowels having identical labels given the granularity of labels available in the datasets and the normalization across transcriptions used; we do not make a theoretical commitment to a phonetic or phonological threshold based on which a pair of vowels should be judged as being identical or not.</p>
<p>In NorthEuraLex, there are 26 Uralic languages and 8 Turkic languages, all of which are reported to have backness harmony. Our sample of XPF languages includes 3 Uralic languages, and 7 Turkic languages. We show which languages we were able to confirm have some sort of vowel harmony based on academic sources in Supplementary Appendix, <xref ref-type="table" rid="T2">Tables 2</xref> and <xref ref-type="table" rid="T3">3</xref>. We ultimately control for any harmony biases based on genetic affiliation by including language and language family as random effects (discussed in more detail below).</p>
<p>Eighteen languages appear in both XPF and NorthEuraLex: Armenian, Bashkir, Basque, Bulgarian, Czech, Erzya, Georgian, Hungarian, Kannada, Korean, Malayalam, Romanian, Slovak, Spanish, Tatar, Telugu, Turkish, and Ukrainian. Concurrent analyses across multiple corpora thus allow for judging whether different datasets of the same languages yield similar co-occurrence estimates.</p>
</sec>
<sec>
<title>2.4 Modeling approach: Negative binomial regression</title>
<p>The analytical framework used in this paper measures the over-representation of segmental sequences relative to a baseline or expected probability, yielding observed over expected (O/E) values (e.g. <xref ref-type="bibr" rid="B67">Stanton 2021</xref>; <xref ref-type="bibr" rid="B3">Alderete &amp; Finley 2016</xref>). We implement this approach using a Bayesian negative binomial model (see <xref ref-type="table" rid="T1">Table 1</xref>), which predicts the counts of vowel sequences from binary predictors corresponding to constraints like height or backness harmony. More specifically, the model predicts how often each vowel pair occurs in a corpus relative to how often it would be expected to occur purely on the basis of the individual frequencies of the two vowels. The log link function means that predictors are interpreted as multiplicative effects on counts (e.g., a coefficient of 0.7 corresponds to roughly twice the expected count, since exp(0.7) &#8776; 2.01). The model&#8217;s estimates for binary phonological predictors serve a purpose similar to that of the aforementioned O/E values used in studies like Frisch et al. (<xref ref-type="bibr" rid="B33">2004</xref>). Using a log-linear model instead provides an estimate of a given bias for one pattern while controlling for the effect of other patterns or constraints (e.g. <xref ref-type="bibr" rid="B79">Wilson &amp; Obdeyn 2009</xref>; <xref ref-type="bibr" rid="B2">Albright &amp; Breiss 2024</xref>; <xref ref-type="bibr" rid="B27">Doucette et al. 2024</xref>).</p>
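<p>As a minimal illustration of how log-scale coefficients translate into multipliers on counts, the following Python sketch exponentiates a coefficient and combines several active predictors with an expected count. The function names and numeric values are hypothetical and serve only to illustrate the log link.</p>
<code language="python"><![CDATA[
```python
import math


def multiplicative_effect(coef):
    """Convert a log-scale coefficient into a multiplier on the expected count."""
    return math.exp(coef)


def predicted_count(expected, coefs):
    """Scale an expected count by the combined effect of all active predictors.

    Under a log link, log(mu) = log(expected) + sum(coefs), so the
    coefficients add on the log scale and multiply on the count scale.
    """
    return expected * math.exp(sum(coefs))
```
]]></code>
<p>For instance, <monospace>multiplicative_effect(0.7)</monospace> is approximately 2.01, and two coefficients of equal size and opposite sign cancel, leaving the expected count unchanged.</p>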
<table-wrap id="T1">
<caption>
<p><bold>Table 1:</bold> Fixed-effect predictors used in the full model.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"><bold>Term</bold></td>
<td align="left" valign="top"><bold>Meaning</bold></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>pair.count</monospace> &#8764;</td>
<td align="left" valign="top">We predict counts of vowel bigrams extracted from words.</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>1 +</monospace></td>
<td align="left" valign="top">The intercept, a term the model estimates for the average count of vowel pairs in the entire dataset when all other predictors are at 0.</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>identity +</monospace></td>
<td align="left" valign="top">1 if the two vowels in the pair are identical, 0 otherwise. As a main effect, the model uses this variable to estimate the overall average multiplicative effect of a vowel pair being identical on the predicted count. The same logic applies to all other main effect terms below.</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>backness_harm_nident +</monospace></td>
<td align="left" valign="top">1 if the pair shows non-identical backness harmony, 0 otherwise.</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>height_harm_nident +</monospace></td>
<td align="left" valign="top">1 if the pair shows non-identical height harmony, 0 otherwise.</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>backness_viol_avoid +</monospace></td>
<td align="left" valign="top">&#8211;1 if the pair violates backness harmony, 0 otherwise.</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>height_viol_avoid +</monospace></td>
<td align="left" valign="top">&#8211;1 if the pair violates height harmony, 0 otherwise.</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>inventory +</monospace></td>
<td align="left" valign="top">Standardized log of the vowel inventory size for each language.</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>identity: inventory +</monospace></td>
<td align="left" valign="top">Interaction: does the effect of identity vary by inventory size?</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>backness_harm_nident: inventory +</monospace></td>
<td align="left" valign="top">Interaction: does the effect of backness harmony (non-identical) vary by inventory size?</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>height_harm_nident: inventory +</monospace></td>
<td align="left" valign="top">Interaction: does the effect of height harmony (non-identical) vary by inventory size?</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>backness_viol_avoid: inventory +</monospace></td>
<td align="left" valign="top">Interaction: does the effect of backness violations vary by inventory size?</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>height_viol_avoid: inventory +</monospace></td>
<td align="left" valign="top">Interaction: does the effect of height violations vary by inventory size?</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>offset(log.expected)</monospace></td>
<td align="left" valign="top">Offset term normalizing the predictors for the expected pair frequencies assuming independence of v1 and v2.</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Following prior work (e.g. <xref ref-type="bibr" rid="B67">Stanton 2021</xref>; <xref ref-type="bibr" rid="B27">Doucette et al. 2024</xref>), we consider as observations type frequencies of local vowel pairs extracted from a word and separated by at least one consonant. A word like [animo], for example, would result in the pairs [a, i] and [i, o] (see Supplementary Materials for an analysis that treats non-local pairs as observations as well, and includes locality as a covariate). Our model thus uses properties like identity and featural harmony to predict the counts of these pairs. <xref ref-type="table" rid="T1">Tables 1</xref> and <xref ref-type="table" rid="T2">2</xref> summarize the model parameters, and <xref ref-type="table" rid="T3">Table 3</xref> illustrates the input into our negative binomial model on a toy dataset.</p>
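<p>The extraction of local vowel pairs can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions: the vowel set and function names are hypothetical, and each word type in the lexicon is assumed to appear once, so that the resulting counts are type frequencies.</p>
<code language="python"><![CDATA[
```python
from collections import Counter

VOWELS = {"a", "e", "i", "o", "u"}  # hypothetical toy inventory


def local_vowel_pairs(segments):
    """Return pairs of successive vowels separated by at least one consonant."""
    positions = [i for i, s in enumerate(segments) if s in VOWELS]
    pairs = []
    for i, j in zip(positions, positions[1:]):
        if j - i > 1:  # at least one consonant intervenes
            pairs.append((segments[i], segments[j]))
    return pairs


def pair_type_counts(lexicon):
    """Count vowel pairs over a list of word types (each type counted once)."""
    counts = Counter()
    for word in lexicon:
        counts.update(local_vowel_pairs(word))
    return counts
```
]]></code>
<p>Applied to the segments of [animo], <monospace>local_vowel_pairs</monospace> returns the pairs (a, i) and (i, o).</p>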
<table-wrap id="T2">
<caption>
<p><bold>Table 2:</bold> Random-effect structures for the full model. The <monospace>&#34;&#124;&#124;&#34;</monospace> notation (as opposed to <monospace>&#34;&#124;&#34;</monospace>) means that the model does not compute a correlation matrix between random slopes, which keeps the model from being over-parameterized.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"><bold>Term</bold></td>
<td align="left" valign="top"><bold>Meaning</bold></td>
</tr>
<tr>
<td align="left" valign="top"><monospace>(1 + identity + backness_harm_nident + height_harm_nident + backness_viol_avoid + height_viol_avoid &#124;&#124; v1.in.lang)</monospace></td>
<td align="left" valign="top">Random intercept and random slopes for all main predictors grouped by <monospace>v1.in.lang</monospace>. This allows the model to account for variation due to individual vowels (within a language), preventing such variation from inflating the language general fixed-effect estimates (e.g., of identity). Random intercepts are simply group-specific deviations in the average overall count of vowel pairs (i.e. the number of observations in that grouping variable), and are not directly relevant for interpreting the extent of harmony or identity biases.</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>(1 + identity + backness_harm_nident + height_harm_nident + backness_viol_avoid + height_viol_avoid &#124;&#124; name)</monospace></td>
<td align="left" valign="top">Random intercept and slopes for all main predictors grouped by <monospace>name</monospace> (language). This allows the model to capture language-specific over- or under-representation of identity or (dis)harmony patterns, and keeps languages with unique patterns for any predictor (e.g. a language with strict backness harmony) from biasing the main effect.</td>
</tr>
<tr>
<td align="left" valign="top"><monospace>(1 + identity + backness_harm_nident + height_harm_nident + backness_viol_avoid + height_viol_avoid &#124;&#124; Family)</monospace></td>
<td align="left" valign="top">Random intercept and slopes for main predictors grouped by <monospace>Family</monospace>. This allows the model to detect family-level deviations from aggregate language-general tendencies, and keeps biases attributable to particular language families from inflating the aggregate main effect.</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T3">
<caption>
<p><bold>Table 3:</bold> Basic structure of input into model.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"><bold>Lang</bold></td>
<td align="left" valign="top"><bold>V1.V2</bold></td>
<td align="left" valign="top"><bold>Observed</bold></td>
<td align="left" valign="top"><bold>Expected (log)</bold></td>
<td align="left" valign="top"><bold>Identity</bold></td>
<td align="left" valign="top"><bold>Backness Violation-Avoid</bold></td>
<td align="left" valign="top"><bold>Backness Harmony (nonident)</bold></td>
</tr>
<tr>
<td align="left" valign="top">A</td>
<td align="left" valign="top">a.a</td>
<td align="left" valign="top">4</td>
<td align="left" valign="top">ln((4/22) * (4/22))</td>
<td align="left" valign="top">1</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">0</td>
</tr>
<tr>
<td align="left" valign="top">A</td>
<td align="left" valign="top">u.i</td>
<td align="left" valign="top">6</td>
<td align="left" valign="top">ln((6/22) * (18/22))</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">&#8211;1</td>
<td align="left" valign="top">0</td>
</tr>
<tr>
<td align="left" valign="top">A</td>
<td align="left" valign="top">i.i</td>
<td align="left" valign="top">12</td>
<td align="left" valign="top">ln((12/22) * (18/22))</td>
<td align="left" valign="top">1</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">0</td>
</tr>
<tr>
<td align="left" valign="top">B</td>
<td align="left" valign="top">i.o</td>
<td align="left" valign="top">1</td>
<td align="left" valign="top">ln((1/11) * (8/11))</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">&#8211;1</td>
<td align="left" valign="top">0</td>
</tr>
<tr>
<td align="left" valign="top">B</td>
<td align="left" valign="top">e.i</td>
<td align="left" valign="top">3</td>
<td align="left" valign="top">ln((6/11) * (3/11))</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">1</td>
</tr>
<tr>
<td align="left" valign="top">B</td>
<td align="left" valign="top">e.o</td>
<td align="left" valign="top">3</td>
<td align="left" valign="top">ln((6/11) * (8/11))</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">&#8211;1</td>
<td align="left" valign="top">0</td>
</tr>
<tr>
<td align="left" valign="top">B</td>
<td align="left" valign="top">o.o</td>
<td align="left" valign="top">4</td>
<td align="left" valign="top">ln((4/11) * (8/11))</td>
<td align="left" valign="top">1</td>
<td align="left" valign="top">0</td>
<td align="left" valign="top">0</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As mentioned, we test for harmony biases along the dimensions of backness and height. We code each dimension in our model using two distinct predictors, labeled <italic>dimension_harm_nident</italic> and <italic>dimension_viol_avoid</italic>. For example, we define <italic>backness_viol_avoid</italic> as any pair of vowels on opposite ends of the backness dimension, where one vowel is marked as [+back] and the other as [+front], and accordingly <italic>height_viol_avoid</italic> as any non-identical pair of vowels where one vowel is marked as [+high] and the other as [+low]. We define the variable <italic>backness_harm_nident</italic> as any non-identical pair of vowels that are both [+front] or both [+back].</p>
<p>One motivation for this coding scheme is that along a given featural dimension like backness, the separate predictors can account for distinct ways in which vowels marked as doubly negative (e.g. /a/, which is marked as [-front, -back] for backness in PHOIBLE) might participate in harmony. The binary <italic>viol_avoid</italic> predictor groups pairs containing such vowels with the non-violating pairs, and is intended to detect a preference against extreme differences along a given dimension, which may include a preference for vowels marked as [-front, -back] for a particular dimension. The <italic>harm_nident</italic> predictor, in contrast, captures harmony biases that manifest as alignment of vowels with positive values, and it is restricted to non-identical pairs to avoid collinearity with identity.</p>
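<p>The coding of a single vowel pair can be sketched as follows. This Python sketch is an illustration under explicit assumptions: the feature values in <monospace>FEATURES</monospace> are a hypothetical five-vowel inventory in the style of PHOIBLE rather than values taken from the database, the function names are invented, and height harmony is assumed to be defined analogously to backness harmony (both [+high] or both [+low]).</p>
<code language="python"><![CDATA[
```python
# Hypothetical feature values for a toy five-vowel inventory.
FEATURES = {
    "i": {"front": True, "back": False, "high": True, "low": False},
    "e": {"front": True, "back": False, "high": False, "low": False},
    "a": {"front": False, "back": False, "high": False, "low": True},
    "o": {"front": False, "back": True, "high": False, "low": False},
    "u": {"front": False, "back": True, "high": True, "low": False},
}


def code_pair(v1, v2):
    """Return the binary predictors for one vowel pair."""
    f1, f2 = FEATURES[v1], FEATURES[v2]
    identical = v1 == v2

    def opposed(pos, neg):
        # One vowel is positive for `pos`, the other positive for `neg`.
        return (f1[pos] and f2[neg]) or (f1[neg] and f2[pos])

    def aligned(pos, neg):
        # Both vowels share a positive value on the dimension.
        return (f1[pos] and f2[pos]) or (f1[neg] and f2[neg])

    return {
        "identity": int(identical),
        "backness_viol_avoid": -1 if opposed("front", "back") else 0,
        "height_viol_avoid": -1 if (not identical and opposed("high", "low")) else 0,
        "backness_harm_nident": int(not identical and aligned("front", "back")),
        "height_harm_nident": int(not identical and aligned("high", "low")),
    }
```
]]></code>
<p>With these toy features, the pair u.i is coded as a backness violation, e.i as non-identical backness harmony, and any pair containing /a/ receives 0 on both backness predictors because /a/ is neither [+front] nor [+back].</p>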
<p>This operationalization of predictors yields an orthogonal coding scheme that allows the posterior distributions for two sets of predictors to be added to produce a joint distribution that shows the overall strength of the underlying construct. For example, the sum of the posteriors for <italic>backness_viol_avoid</italic> and <italic>backness_harm_nident</italic> can be interpreted as reflecting the overall effect of alignment along the dimension of backness.</p>
<p>The model includes random intercepts and random slopes, which allow it to capture variation across languages rather than assuming that all languages behave the same way. A random intercept means that each language is allowed its own baseline rate of vowel co-occurrence; some languages may show higher or lower overall counts than others, independent of any specific effect we test. A random slope, which is more critical for our purposes, means that the size or even direction of a particular effect (for example, a tendency for identical vowels to co-occur) can differ from one language to another. In other words, the model estimates both the overall trend (the main effect) across all languages and how individual languages deviate from that trend (the random slope). This approach allows the analysis to generalize to the population of languages while also measuring between-language variability in particular biases and estimating biases for specific languages.</p>
<p>We control for positional frequencies of each vowel in each language by using position-specific probabilities of vowels to compute an expected proportion, which is treated in the model as an offset term, following Doucette et al. (<xref ref-type="bibr" rid="B27">2024</xref>). In short, the offset term normalizes the model&#8217;s estimates based on the expected counts of vowel pairs. This ensures that the model asks not simply which pairs occur frequently, but which occur more (or less) often than would be predicted from the marginal frequencies of the vowels involved.</p>
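<p>The computation of the offset for one language can be sketched as follows, assuming that the expected proportion of a pair is the product of the position-specific proportions of its two vowels. The function name is hypothetical, and the sketch covers attested pairs only; how unattested pairs are handled is not addressed here.</p>
<code language="python"><![CDATA[
```python
import math
from collections import Counter


def log_expected(pair_counts):
    """Map each attested pair to log(P(v1 in first slot) * P(v2 in second slot))."""
    total = sum(pair_counts.values())
    first, second = Counter(), Counter()
    for (v1, v2), n in pair_counts.items():
        first[v1] += n   # frequency of v1 as the first member of a pair
        second[v2] += n  # frequency of v2 as the second member of a pair
    return {
        (v1, v2): math.log((first[v1] / total) * (second[v2] / total))
        for (v1, v2) in pair_counts
    }
```
]]></code>
<p>For a toy language with the pair counts a.a = 4, u.i = 6 and i.i = 12 (22 pairs in total), the offset for u.i is ln((6/22) * (18/22)), because /u/ fills the first slot in 6 pairs and /i/ fills the second slot in 18.</p>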
<p>To control for the effect of a language&#8217;s vowel inventory size on co-occurrence biases, we include a predictor <italic>inventory</italic> (the standardized log of inventory size) and its interaction with the binary main effects. Along with the position-specific baseline and between-language random slopes, having <italic>inventory</italic> as a covariate should keep effects from being skewed by the trivial fact that languages with smaller inventories are likely to have a greater raw proportion of any configuration like identity just by virtue of having fewer possible vowel combinations.</p>
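<p>The inventory covariate can be sketched as a z-score of log inventory sizes across languages. This is a minimal Python illustration: the function name and the inventory sizes in the usage note are hypothetical, and the use of the sample standard deviation is an assumption.</p>
<code language="python"><![CDATA[
```python
import math
import statistics


def standardized_log_inventory(sizes):
    """Map each language to the z-score of its log vowel inventory size."""
    logs = {lang: math.log(n) for lang, n in sizes.items()}
    values = list(logs.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return {lang: (x - mean) / sd for lang, x in logs.items()}
```
]]></code>
<p>For three hypothetical languages with 3, 5 and 9 vowels, the resulting values are centered on zero and increase with inventory size, so that the main effects are evaluated at an average (log) inventory size.</p>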
<p>For all binary fixed effects, we also include random slopes for each vowel category in each language in the first position of a vowel pair (<italic>v1.in.lang</italic>). This grouping variable provides additional control for the independent probabilities of vowel categories, and also allows the model to robustly capture idiosyncrasies of particular vowels and particular languages. Allowing the model to account for phoneme-specific variability across languages helps make the model sufficiently flexible to control for co-occurrence biases along a variety of dimensions.</p>
<p>To aid in interpreting this model given collinearity between predictors, we fit an auxiliary model identical to the model in <xref ref-type="table" rid="T1">Table 1</xref>, except that the <italic>identity</italic> predictor is omitted. This allows us to carry out model comparison to assess the unique contribution of identity to the predictive capacity of the model. It also allows us to separately quantify the extent of featural biases when identity is assumed to play no role in restricting vowel co-occurrence. By removing the identity predictor from the model structure, we can ask whether it is even possible for the model to account for co-occurrence counts with the given featural predictors alone.</p>
<sec>
<title>2.4.1 Model hyperparameters</title>
<p>For all negative binomial models, we use weakly informative priors across all binary fixed effects. Specifically, we place a normal prior centered at 0 with a standard deviation of 3 on all binary fixed effects and their interactions:</p>
<disp-formula id="FD1">
<alternatives>
<mml:math id="Eq001-mml"><mml:mrow><mml:mrow><mml:mo>&#x03B2;</mml:mo><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mtext>Normal</mml:mtext><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mo lspace="0em">.</mml:mo></mml:mrow></mml:math>
<tex-math id="M1">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\beta\sim{\rm Normal}(0,3).
\]
\end{document}
</tex-math>
<graphic xlink:href="glossa-11-17737-e1.gif"/>
</alternatives>
</disp-formula>
<p>This weakly informative prior reflects no prior knowledge about the direction of any of the effects, but allows for a large degree of variability in the magnitude of the effect. For the standard deviations of the random effects (family, language, and vowel-in-language), we use the following prior:</p>
<disp-formula id="FD2">
<alternatives>
<mml:math id="Eq002-mml"><mml:mrow><mml:mrow><mml:mo>&#x03C3;</mml:mo><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>3</mml:mn><mml:mo>,</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>2.5</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mo lspace="0em">.</mml:mo></mml:mrow></mml:math>
<tex-math id="M2">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\sigma\sim t(3,0,2.5).
\]
\end{document}
</tex-math>
<graphic xlink:href="glossa-11-17737-e2.gif"/>
</alternatives>
</disp-formula>
<p>This reflects our expectation that the fixed effects are likely to vary by language and language family, while also regularizing the model: it guards against overfitting by not attributing excessive variability to group-level effects unless this is strongly supported by the data.</p>
<p>For the dispersion parameter <italic>&#981;</italic> in the negative binomial distribution, we use a log-normal prior with a mean of 0 and a standard deviation of 0.5 on the log scale:</p>
<disp-formula id="FD3">
<alternatives>
<mml:math id="Eq003-mml"><mml:mrow><mml:mrow><mml:mo>&#x03D5;</mml:mo><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mtext>LogNormal</mml:mtext><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>0.5</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mo lspace="0em">.</mml:mo></mml:mrow></mml:math>
<tex-math id="M3">
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\phi\sim{\rm LogNormal}(0,0.5).
\]
\end{document}
</tex-math>
<graphic xlink:href="glossa-11-17737-e3.gif"/>
</alternatives>
</disp-formula>
<p>This prior reflects a prior belief that the data may be moderately more dispersed than a Poisson model would predict (and is the default prior in brms). Under a Poisson model, the variance is assumed to be equal to the mean, while a negative binomial model allows the variance to exceed the mean by an amount governed by this parameter, which is estimated from how counts are actually distributed across words. We fit every model using R&#8217;s brms package (<xref ref-type="bibr" rid="B13">B&#252;rkner 2018</xref>). For every model, we use 4 chains and 2000 iterations per chain, with a burn-in period of 1000 iterations.</p>
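<p>The contrast between the two count distributions described above can be written out explicitly. The sketch below assumes the mean-and-shape parameterization of the negative binomial, with mean <italic>&#956;</italic> and dispersion <italic>&#981;</italic>; the numerical values that follow are hypothetical and chosen only for illustration.</p>
<disp-formula>
<tex-math>
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\mathrm{Var}_{\mathrm{Poisson}}(y) = \mu
\]
\[
\mathrm{Var}_{\mathrm{NB}}(y) = \mu + \frac{\mu^{2}}{\phi}
\]
\end{document}
</tex-math>
</disp-formula>
<p>For example, with <italic>&#956;</italic> = 100 and <italic>&#981;</italic> = 1 (the median of the LogNormal(0, 0.5) prior), the negative binomial variance is 100 + 100<sup>2</sup>/1 = 10,100, compared with 100 under a Poisson model. As <italic>&#981;</italic> grows, the negative binomial variance approaches the Poisson variance.</p>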
</sec>
</sec>
<sec>
<title>2.5 Results</title>
<p>Across all analyses and plots, positive coefficient values for a particular pattern like identity indicate that the model estimates that pattern to be over-represented relative to what would be expected at chance, and negative values (below or to the left of zero) indicate that the model estimates that pattern to be under-represented relative to what would be expected at chance and relative to the effect of other predictors. The plotted posterior distributions represent the model&#8217;s uncertainty for each coefficient. Each distribution shows the full range of parameter values that are plausible given the data, with wider distributions indicating greater uncertainty and narrower ones indicating more precise estimates.</p>
<p>All r-hat values across the XPF-trained and NorthEuraLex-trained models were between 1.0 and 1.01, suggesting that the chains adequately mixed and that the models converged. <xref ref-type="fig" rid="F1">Figure 1</xref> shows the posterior distributions for the binary model estimates. <xref ref-type="table" rid="T4">Tables 4</xref> and <xref ref-type="table" rid="T5">5</xref> show the raw estimates and credible intervals for every fixed effect coefficient for the NorthEuraLex and XPF data respectively. Given the presence of <italic>inventory.size</italic> as an interaction term, the coefficients reflect the expected effect of each variable at the average inventory size across all languages, which is 5.82 in our sample of XPF languages, and 8.21 in the 107 NorthEuraLex languages.</p>
<fig id="F1">
<caption>
<p><bold>Figure 1:</bold> Posterior distributions (within 95% CI) for main effects of models across both corpora. Both models show reliable identity effects. Positive values indicate biases in the direction of greater segmental overlap (harmony or identity). Distributions for XPF are likely narrower due to a greater total number of observations.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossa-11-17737-g1.png"/>
</fig>
<p>Identity is a reliable positive predictor of adjacent vowel pair counts in both the XPF and NorthEuraLex datasets. In XPF, the identity effect is estimated at <italic>&#946;</italic> = 0.50 (95% CI: [0.26, 0.72]), corresponding to a mean 65% increase over expected frequency (exp(0.50)=1.65; 95% CI: [exp(0.26)=1.30, exp(0.72)=2.05]). In NorthEuraLex, the point estimate is slightly larger at <italic>&#946;</italic> = 0.55 (95% CI: [0.29, 0.82]; exp(0.55)=1.73; 95% CI: [1.34, 2.27]). Together, these estimates suggest that adjacent identical vowels are reliably over-represented cross-linguistically.</p>
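<p>The conversion from a log-scale coefficient to a percentage change over the expected frequency, as used in the preceding paragraph, is spelled out below for the two identity estimates. Only the reported point estimates are used.</p>
<disp-formula>
<tex-math>
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\%\,\mathrm{change} = 100 \times (\exp(\beta) - 1)
\]
\[
\mathrm{XPF:}\quad 100 \times (\exp(0.50) - 1) \approx 100 \times (1.65 - 1) = 65
\]
\[
\mathrm{NorthEuraLex:}\quad 100 \times (\exp(0.55) - 1) \approx 100 \times (1.73 - 1) = 73
\]
\end{document}
</tex-math>
</disp-formula>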
<table-wrap id="T4">
<caption>
<p><bold>Table 4:</bold> Population-Level Effects (Point Estimates): NorthEuraLex model.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"><bold>Parameter</bold></td>
<td align="left" valign="top"><bold>Estimate</bold></td>
<td align="left" valign="top"><bold>SE</bold></td>
<td align="left" valign="top"><bold>2.5% CI</bold></td>
<td align="left" valign="top"><bold>97.5% CI</bold></td>
</tr>
<tr>
<td align="left" valign="top">Intercept</td>
<td align="left" valign="top">6.98</td>
<td align="left" valign="top">0.07</td>
<td align="left" valign="top">6.85</td>
<td align="left" valign="top">7.12</td>
</tr>
<tr>
<td align="left" valign="top">identity</td>
<td align="left" valign="top">0.55</td>
<td align="left" valign="top">0.13</td>
<td align="left" valign="top">0.29</td>
<td align="left" valign="top">0.82</td>
</tr>
<tr>
<td align="left" valign="top">inventory</td>
<td align="left" valign="top">&#8211;0.02</td>
<td align="left" valign="top">0.01</td>
<td align="left" valign="top">&#8211;0.05</td>
<td align="left" valign="top">0.01</td>
</tr>
<tr>
<td align="left" valign="top">backness_harm_nident</td>
<td align="left" valign="top">0.22</td>
<td align="left" valign="top">0.11</td>
<td align="left" valign="top">&#8211;0.02</td>
<td align="left" valign="top">0.43</td>
</tr>
<tr>
<td align="left" valign="top">height_harm_nident</td>
<td align="left" valign="top">0.01</td>
<td align="left" valign="top">0.05</td>
<td align="left" valign="top">&#8211;0.09</td>
<td align="left" valign="top">0.11</td>
</tr>
<tr>
<td align="left" valign="top">backness_viol_avoid</td>
<td align="left" valign="top">0.07</td>
<td align="left" valign="top">0.11</td>
<td align="left" valign="top">&#8211;0.16</td>
<td align="left" valign="top">0.30</td>
</tr>
<tr>
<td align="left" valign="top">height_viol_avoid</td>
<td align="left" valign="top">&#8211;0.08</td>
<td align="left" valign="top">0.09</td>
<td align="left" valign="top">&#8211;0.25</td>
<td align="left" valign="top">0.10</td>
</tr>
<tr>
<td align="left" valign="top">inventory:identity</td>
<td align="left" valign="top">&#8211;0.05</td>
<td align="left" valign="top">0.02</td>
<td align="left" valign="top">&#8211;0.08</td>
<td align="left" valign="top">&#8211;0.02</td>
</tr>
<tr>
<td align="left" valign="top">inventory:backness_harm_nident</td>
<td align="left" valign="top">&#8211;0.00</td>
<td align="left" valign="top">0.01</td>
<td align="left" valign="top">&#8211;0.02</td>
<td align="left" valign="top">0.02</td>
</tr>
<tr>
<td align="left" valign="top">inventory:height_harm_nident</td>
<td align="left" valign="top">0.01</td>
<td align="left" valign="top">0.01</td>
<td align="left" valign="top">&#8211;0.02</td>
<td align="left" valign="top">0.03</td>
</tr>
<tr>
<td align="left" valign="top">inventory:backness_viol_avoid</td>
<td align="left" valign="top">0.01</td>
<td align="left" valign="top">0.01</td>
<td align="left" valign="top">&#8211;0.02</td>
<td align="left" valign="top">0.03</td>
</tr>
<tr>
<td align="left" valign="top">inventory:height_viol_avoid</td>
<td align="left" valign="top">0.03</td>
<td align="left" valign="top">0.01</td>
<td align="left" valign="top">0.01</td>
<td align="left" valign="top">0.05</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T5">
<caption>
<p><bold>Table 5:</bold> Population-Level Effects (Point Estimates): XPF model.</p>
</caption>
<table>
<tbody>
<tr>
<td align="left" valign="top"><bold>Parameter</bold></td>
<td align="left" valign="top"><bold>Estimate</bold></td>
<td align="left" valign="top"><bold>SE</bold></td>
<td align="left" valign="top"><bold>2.5% CI</bold></td>
<td align="left" valign="top"><bold>97.5% CI</bold></td>
</tr>
<tr>
<td align="left" valign="top">Intercept</td>
<td align="left" valign="top">8.92</td>
<td align="left" valign="top">0.17</td>
<td align="left" valign="top">8.59</td>
<td align="left" valign="top">9.25</td>
</tr>
<tr>
<td align="left" valign="top">identity</td>
<td align="left" valign="top">0.50</td>
<td align="left" valign="top">0.11</td>
<td align="left" valign="top">0.26</td>
<td align="left" valign="top">0.72</td>
</tr>
<tr>
<td align="left" valign="top">inventory</td>
<td align="left" valign="top">0.05</td>
<td align="left" valign="top">0.04</td>
<td align="left" valign="top">&#8211;0.03</td>
<td align="left" valign="top">0.13</td>
</tr>
<tr>
<td align="left" valign="top">backness_harm_nident</td>
<td align="left" valign="top">0.03</td>
<td align="left" valign="top">0.09</td>
<td align="left" valign="top">&#8211;0.15</td>
<td align="left" valign="top">0.20</td>
</tr>
<tr>
<td align="left" valign="top">height_harm_nident</td>
<td align="left" valign="top">&#8211;0.10</td>
<td align="left" valign="top">0.13</td>
<td align="left" valign="top">&#8211;0.37</td>
<td align="left" valign="top">0.16</td>
</tr>
<tr>
<td align="left" valign="top">backness_viol_avoid</td>
<td align="left" valign="top">0.06</td>
<td align="left" valign="top">0.10</td>
<td align="left" valign="top">&#8211;0.14</td>
<td align="left" valign="top">0.25</td>
</tr>
<tr>
<td align="left" valign="top">height_viol_avoid</td>
<td align="left" valign="top">&#8211;0.10</td>
<td align="left" valign="top">0.05</td>
<td align="left" valign="top">&#8211;0.19</td>
<td align="left" valign="top">&#8211;0.00</td>
</tr>
<tr>
<td align="left" valign="top">inventory:identity</td>
<td align="left" valign="top">0.07</td>
<td align="left" valign="top">0.04</td>
<td align="left" valign="top">&#8211;0.00</td>
<td align="left" valign="top">0.14</td>
</tr>
<tr>
<td align="left" valign="top">inventory:backness_harm_nident</td>
<td align="left" valign="top">0.04</td>
<td align="left" valign="top">0.03</td>
<td align="left" valign="top">&#8211;0.02</td>
<td align="left" valign="top">0.11</td>
</tr>
<tr>
<td align="left" valign="top">inventory:height_harm_nident</td>
<td align="left" valign="top">0.03</td>
<td align="left" valign="top">0.04</td>
<td align="left" valign="top">&#8211;0.05</td>
<td align="left" valign="top">0.11</td>
</tr>
<tr>
<td align="left" valign="top">inventory:backness_viol_avoid</td>
<td align="left" valign="top">0.00</td>
<td align="left" valign="top">0.02</td>
<td align="left" valign="top">&#8211;0.04</td>
<td align="left" valign="top">0.05</td>
</tr>
<tr>
<td align="left" valign="top">inventory:height_viol_avoid</td>
<td align="left" valign="top">&#8211;0.00</td>
<td align="left" valign="top">0.02</td>
<td align="left" valign="top">&#8211;0.05</td>
<td align="left" valign="top">0.04</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In XPF, vowel pairs that avoid violations of backness harmony show a small positive estimate (<italic>&#946;</italic> = 0.06, 95% CI: [&#8211;0.14, 0.25]; exp(0.06)=1.06; 95% CI: [0.87, 1.28]), though the effect is not reliable. NorthEuraLex shows a similarly small, uncertain effect (<italic>&#946;</italic> = 0.07, 95% CI: [&#8211;0.16, 0.30]; exp(0.07)=1.07; 95% CI: [0.85, 1.35]). Vowel pairs that conform to non-identical backness harmony (e.g. two +front or two +back vowels) are at most weakly favored, with credible intervals that include zero in both corpora (XPF: <italic>&#946;</italic> = 0.03, 95% CI: [&#8211;0.15, 0.20]; exp(0.03)=1.03; 95% CI: [0.86, 1.22]; NorthEuraLex: <italic>&#946;</italic> = 0.22, 95% CI: [&#8211;0.02, 0.43]; exp(0.22)=1.25; 95% CI: [0.98, 1.54]). This pattern suggests that vowel pairs involving unmarked values for backness (e.g., two central vowels that are neither back nor front) may be preferred.</p>
<p>There are no reliable positive effects of height harmony in either corpus. In fact, violations of height harmony appear to be favored: in XPF, height-violation-avoiding pairs are underrepresented (<italic>&#946;</italic> = &#8211;0.10, 95% CI: [&#8211;0.19, &#8211;0.00]; exp(&#8211;0.10)=0.90; 95% CI: [0.83, 1.00]), while in NorthEuraLex the direction is similar (<italic>&#946;</italic> = &#8211;0.08, 95% CI: [&#8211;0.25, 0.10]; exp(&#8211;0.08)=0.92; 95% CI: [0.78, 1.11]), but not reliable based on credible intervals. Meanwhile, vowel pairs that share non-identical height features show no meaningful bias (XPF: <italic>&#946;</italic> = &#8211;0.10, 95% CI: [&#8211;0.37, 0.16]; exp(&#8211;0.10)=0.90; 95% CI: [0.69, 1.17]; NorthEuraLex: <italic>&#946;</italic> = 0.01, 95% CI: [&#8211;0.09, 0.11]; exp(0.01)=1.01; 95% CI: [0.91, 1.12]).</p>
<p>Interactions with inventory size are generally small and weak, suggesting that vowel inventory size does not exhibit a consistent effect on the extent of identity or harmony biases. The identity&#8211;inventory.size interaction is positive in XPF (<italic>&#946;</italic> = 0.07, 95% CI: [&#8211;0.00, 0.14]; exp(0.07)=1.07; 95% CI: [1.00, 1.15]) but negative in NorthEuraLex (<italic>&#946;</italic> = &#8211;0.05, 95% CI: [&#8211;0.08, &#8211;0.02]; exp(&#8211;0.05)=0.95; 95% CI: [0.92, 0.98]), indicating opposing trends of small magnitude: the identity bias is slightly stronger in larger-inventory languages in XPF, and slightly weaker in larger-inventory languages in NorthEuraLex. A small positive interaction between non-identical backness harmony and inventory size appears in XPF (<italic>&#946;</italic> = 0.04, 95% CI: [&#8211;0.02, 0.11]; exp(0.04)=1.04; 95% CI: [0.98, 1.12]), though its credible interval includes zero; its direction is consistent with the idea that languages with larger vowel inventories are somewhat less likely to under-represent harmonious backness pairs. For height violations, NorthEuraLex shows a modest positive interaction (<italic>&#946;</italic> = 0.03, 95% CI: [0.01, 0.05]; exp(0.03)=1.03; 95% CI: [1.01, 1.05]), suggesting that the attestation of height violations relative to baseline decreases slightly with a higher inventory size. Generally, the small magnitude of these interactions with inventory size likely reflects the fact that inventory size partially overlaps with language-level variation, which is already captured through random intercepts. In addition, trivial effects of inventory size on the prevalence of identity (e.g. higher identity proportions in systems with fewer vowels) are likely to be accounted for by the model&#8217;s position-specific expected values, so the inventory size interaction estimates reflect effects beyond those trivial effects.</p>
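<p>To illustrate how an interaction coefficient modifies a main effect, the sketch below evaluates the identity effect at a hypothetical distance <italic>d</italic> from the mean inventory size. It assumes that <italic>inventory.size</italic> enters the model centered on the corpus mean and that the interaction is linear in <italic>d</italic>; the unit of <italic>d</italic> (one vowel, or one standard deviation if the predictor was standardized) is not stated in this section, so the numbers are illustrative only.</p>
<disp-formula>
<tex-math>
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\beta_{\mathrm{identity}}(d) = \beta_{\mathrm{identity}} + \beta_{\mathrm{inventory:identity}}\, d
\]
\[
\mathrm{XPF},\ d = 2:\quad 0.50 + 0.07 \times 2 = 0.64, \qquad \exp(0.64) \approx 1.90
\]
\[
\mathrm{NorthEuraLex},\ d = 2:\quad 0.55 - 0.05 \times 2 = 0.45, \qquad \exp(0.45) \approx 1.57
\]
\end{document}
</tex-math>
</disp-formula>
<p>Under these assumptions the identity bias remains positive in both corpora at <italic>d</italic> = 2, in line with the interactions being small relative to the main effect.</p>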
<p>We also show distributions of raw O/E values for each binary main effect predictor in <xref ref-type="fig" rid="F2">Figure 2</xref>. Values for each language are derived by comparing the observed count of a particular configuration like identity or backness harmony to the overall expected count of that configuration. Qualitatively, the distributions of languages&#8217; O/E values for each binary predictor are broadly compatible with the effects reported in the model. They align with the notion that observed counts of vowel pairs do not strongly deviate from those predicted by a random baseline, and thus that a harmony bias along featural dimensions is not likely to be reflected in vowel co-occurrence patterns. There is also no language that strongly under-represents identity in either corpus.</p>
<fig id="F2">
<caption>
<p><bold>Figure 2:</bold> Raw O/E values for each language, plotted as a point, along with a density curve describing the distribution of points. Based on raw O/E values, identity is over-represented (observed exceeds expected) in most languages: 73/107 in NorthEuraLex and 61/92 in XPF.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossa-11-17737-g2.png"/>
</fig>
<fig id="F3">
<caption>
<p><bold>Figure 3:</bold> All posterior samples for backness and height across both models, with both posterior samples for predictors of each feature added together. In both models, there is a small but reliable preference for vowels to align relative to backness. Values to the right of the diagonal (x=y) are samples where the estimate for backness is more positive than the estimate for height.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossa-11-17737-g3.png"/>
</fig>
<p>Next we explore whether there is any difference between the extent of backness harmony and the extent of height harmony. To get an overall estimate of the harmony bias for a given feature, we add posterior samples for violation avoidance and non-identical harmony (<xref ref-type="fig" rid="F3">Figure 3</xref>). In both corpora, there appears to be a slight positive bias for backness relative to height.</p>
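<p>The combination step can be sketched as follows, in our own notation: for each posterior draw <italic>s</italic>, the two coefficients belonging to a feature are summed. Because the mean of a sum is the sum of the means, the posterior means of the combined quantities can be recovered, up to rounding, from the point estimates in <xref ref-type="table" rid="T4">Tables 4</xref> and <xref ref-type="table" rid="T5">5</xref>, assuming that those estimates are posterior means. The spread of the combined quantities cannot be recovered this way, since it depends on the posterior correlation between the two coefficients.</p>
<disp-formula>
<tex-math>
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
b_{\mathrm{back}}^{(s)} = \beta_{\mathrm{back.harm}}^{(s)} + \beta_{\mathrm{back.viol}}^{(s)}, \qquad b_{\mathrm{height}}^{(s)} = \beta_{\mathrm{height.harm}}^{(s)} + \beta_{\mathrm{height.viol}}^{(s)}
\]
\[
\mathrm{NorthEuraLex:}\quad 0.22 + 0.07 = 0.29 \ (\mathrm{backness}), \qquad 0.01 - 0.08 = -0.07 \ (\mathrm{height})
\]
\[
\mathrm{XPF:}\quad 0.03 + 0.06 = 0.09 \ (\mathrm{backness}), \qquad -0.10 - 0.10 = -0.20 \ (\mathrm{height})
\]
\end{document}
</tex-math>
</disp-formula>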
</sec>
<sec>
<title>2.6 Model comparison</title>
<p>It is possible, given collinearity between predictors, that a model without an identity predictor would sufficiently account for the count data, which would suggest that a bias in favor of identity could still be reducible directly to the aggregate of the featural predictors. Specifically, because a given vowel pair with harmony might also be identical, there might be no optimal way to allocate credit for a high count of pairs that exhibit both harmony and identity, and the model may thus inflate the estimate of one predictor, like identity, at the expense of the harmony predictors. To verify whether identity in fact robustly contributes to the predictive capacity of the model, we conduct a model comparison with the same model as in <xref ref-type="table" rid="T1">Table 1</xref>, but in which there is no identity predictor. The main effects of this &#8220;null&#8221; model are shown in <xref ref-type="fig" rid="F4">Figure 4</xref>, and the full data are shown in Supplementary Materials.</p>
<fig id="F4">
<caption>
<p><bold>Figure 4:</bold> Posterior samples (within 95% CI) for all binary predictors in null model where identity predictor is omitted. Positive values are all in the direction of more harmony.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossa-11-17737-g4.png"/>
</fig>
<p>The model comparison between the null model and the model with identity is carried out using approximate leave-one-out cross-validation (LOO; <xref ref-type="bibr" rid="B73">Vehtari et al. 2017</xref>), which estimates the out-of-sample predictive performance of each model. LOO evaluates how well a model predicts data it has not seen. If a model with an identity predictor has better LOO performance than a model without identity, this indicates that identity meaningfully improves predictions of vowel co-occurrence patterns. For the NorthEuraLex data, the identity model was strongly favored over the null model (ELPD difference = &#8211;233.5, SE = 33.2), indicating that the identity effect captures meaningful structure in the data beyond what can be explained by a simpler model. For XPF data, the identity model was also favored over the null model (ELPD difference = &#8211;150.5, SE = 32.3). Because LOO accounts for model complexity by evaluating predictive performance on held-out data, these differences cannot be attributed solely to the additional flexibility that the identity predictor gives the model.</p>
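<p>For reference, the quantity that LOO estimates and a simple way of gauging the reported differences are sketched below. The notation follows common usage for LOO rather than anything given in this section, and comparing the magnitude of a difference to its standard error is a widely used heuristic added here for illustration, not a criterion stated by the authors.</p>
<disp-formula>
<tex-math>
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\mathrm{elpd}_{\mathrm{loo}} = \sum_{i=1}^{n} \log p(y_{i} \mid y_{-i})
\]
\[
\mathrm{NorthEuraLex:}\quad \frac{233.5}{33.2} \approx 7.0, \qquad \mathrm{XPF:}\quad \frac{150.5}{32.3} \approx 4.7
\]
\end{document}
</tex-math>
</disp-formula>
<p>In both corpora the magnitude of the ELPD difference is several times its standard error.</p>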
<p>To assess whether a preference against alignment for height relative to backness in <xref ref-type="fig" rid="F3">Figure 3</xref> might arise solely as an artifact of collinearity with identity, we examine these biases in the null model where identity is excluded as a predictor, and thus where the feature-based predictors can assume &#8220;credit&#8221; for all cases of identity. <xref ref-type="fig" rid="F4">Figure 4</xref> shows the posterior distributions for all binary effects in the model. <xref ref-type="fig" rid="F5">Figure 5</xref> shows a weaker asymmetry between height and backness. This suggests that identical pairs may account for some of the positive effects of height harmony (the credit for which the full model assigns to identity rather than to height harmony).</p>
<fig id="F5">
<caption>
<p><bold>Figure 5:</bold> Null model posterior samples for backness and height alignment. Overall, while the asymmetry between backness and height persists, it is less reliable than in the full model. Relative to the full model, the distribution of the NorthEuraLex height bias shifts upward, suggesting that the presence of the identity predictor explains why the full model fails to detect a unique height bias.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossa-11-17737-g5.png"/>
</fig>
<p>To further qualify the relationship between featural alignment along the dimensions of backness and height absent the effect of identity, we fit a model identical to the null model except that identical vowel pairs are omitted from the data entirely. In this model, featural predictors are left to account for just variability among non-identical pairs, which distills the relative influence of backness and height on vowel co-occurrence absent any possible effect of identity. The binary fixed effects are shown in <xref ref-type="fig" rid="F6">Figure 6</xref>. While evidence for a robust cross-linguistic harmony bias is still weak, there is evidence that co-occurrence patterns favor alignment along the dimension of backness more strongly than along the dimension of height. The qualitative similarities in the correspondence between backness predictors in this and the original model (e.g. see <xref ref-type="fig" rid="F3">Figure 3</xref>) suggest that backness and height predictors in the full model are adequate for capturing featural harmony as an over- or under-representation of vowel pairs with height or backness harmony.</p>
<fig id="F6">
<caption>
<p><bold>Figure 6:</bold> Posterior samples for height and backness in a model (with no identity predictor) fit to data from which pairs of identical vowels are excluded. Given that these estimates are relative to expected counts, this shows that the height harmony bias is absent when identical pairs are omitted.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossa-11-17737-g6.png"/>
</fig>
<sec>
<title>2.6.1 Variability across languages and language families</title>
<p>For a phenomenon like an identity bias to be considered &#8220;universal&#8221; in a strict sense, it is necessary to determine whether that bias is apparent in most or all languages, and not just whether it appears as an asymmetry in an aggregate trend across many languages that individually might not exhibit that bias. We explore how the effects of patterns like identity or backness harmony on pair counts vary across languages and language families in the full model by examining its random slopes. As discussed before, random slopes represent the language- or family-specific deviations from the main effect of a given pattern, like identity. In fact, we can derive the model&#8217;s O/E-like estimate for a particular pattern like identity in a particular language. Specifically, family-specific posterior distributions are derived by adding the by-family random slopes (the family-specific deviations from the main effect) to each main effect at each level of family, and language-specific distributions are derived by adding the language-specific deviations to these family-specific estimates, together with the value of the interaction with inventory size at that language&#8217;s inventory size.</p>
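<p>The derivation just described can be sketched as follows, using our own labels: <italic>&#946;</italic> is the main effect of a pattern, <italic>u</italic><sub><italic>f</italic></sub> and <italic>u</italic><sub><italic>l</italic></sub> are the family- and language-specific deviations, <italic>&#946;</italic><sub>inv</sub> is the pattern&#8217;s interaction with inventory size, and <italic>d</italic><sub><italic>l</italic></sub> is the inventory size of language <italic>l</italic> on the scale used in the model (assumed here to be centered). The sums are computed for every posterior draw.</p>
<disp-formula>
<tex-math>
\documentclass[10pt]{article}
\usepackage{wasysym}
\usepackage[substack]{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage[mathscr]{eucal}
\usepackage{mathrsfs}
\usepackage{pmc}
\usepackage[Euler]{upgreek}
\pagestyle{empty}
\oddsidemargin -1.0in
\begin{document}
\[
\theta_{f} = \beta + u_{f}
\]
\[
\theta_{l} = \theta_{f(l)} + u_{l} + \beta_{\mathrm{inv}}\, d_{l}
\]
\end{document}
</tex-math>
</disp-formula>
<p>Given a log link, exponentiating <italic>&#952;</italic><sub><italic>l</italic></sub> gives a multiplicative, O/E-like estimate for that pattern in language <italic>l</italic>.</p>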
<p><xref ref-type="fig" rid="F7">Figures 7</xref> and <xref ref-type="fig" rid="F8">8</xref> show posterior estimates for specific languages in the XPF corpus and NorthEuraLex corpus respectively. No language family in either corpus exhibits even a weak negative bias against identity.</p>
<fig id="F7">
<caption>
<p><bold>Figure 7:</bold> Posterior distributions of all languages in the XPF corpus for identity and for combined height and combined backness.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossa-11-17737-g7.png"/>
</fig>
<fig id="F8">
<caption>
<p><bold>Figure 8:</bold> Posterior distributions of all languages in the NorthEuraLex corpus for identity and for combined height and combined backness.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossa-11-17737-g8.png"/>
</fig>
<p>These family- and language-specific effects are thus consistent with the notion that while a backness bias manifests robustly in Turkic and Uralic languages (which are known to have some form of backness harmony), vowel co-occurrence in most languages does not appear to be robustly restricted along these dimensions. While NorthEuraLex exhibits more between-language variability for identity than XPF, few if any languages in either corpus appear to under-represent identity based on their raw posterior means. Additionally, no evidence of an asymmetry between backness and height observed for global main effect estimates does emerge when observing the means of individual languages (see Section 2.2, Supplementary Materials).</p>
</sec>
<sec>
<title>2.6.2 Between-corpus overlap</title>
<p>As mentioned, 18 languages appear in both XPF and NorthEuraLex datasets: Armenian, Bashkir, Basque, Bulgarian, Czech, Erzya, Georgian, Hungarian, Kannada, Korean, Malayalam, Romanian, Slovak, Spanish, Tatar, Telugu, Turkish, and Ukrainian. Language-specific estimates of the full model were generally consistent for each language across both corpora. Out of the 18 languages the number that had the same sign for the point estimate across both corpora was 18 for <italic>back.viol</italic> and <italic>height.harm</italic>, 16 for <italic>identity</italic>, 15 for <italic>back.harm</italic> and 12 for <italic>height.viol</italic>. While the analyses in this study were tailored for detecting global aggregate trends and not for reverse-engineering language-specific patterns, this between-corpus consistency for individual languages at least broadly validates that the model was capable of arriving at qualitatively similar estimates for particular languages.</p>
</sec>
</sec>
<sec>
<title>2.7 Variability of identity effects across vowels</title>
<sec>
<title>2.7.1 Random slopes of the full model</title>
<p>The data thus far suggest that there is evidence for a cross-linguistic bias in favor of identity, but a lack of such a bias in favor of featural harmony. We now explore the extent to which vowels vary in their proclivity to over-represent identity across languages. Phenomena like vowel harmony potentially reflect a more general fact that substantive articulatory or perceptual attributes of vowels are relevant for their non-adjacent co-occurrence. One empirical prediction of this claim is that identity biases should be more pronounced for some vowel categories than for others. To this end, we look at random slopes for <italic>identity</italic> by <italic>v1.in.lang</italic> (<xref ref-type="fig" rid="F9">Figure 9</xref>), to evaluate whether identity is cross-linguistically preferred for some vowel categories relative to others.</p>
<fig id="F9">
<caption>
<p><bold>Figure 9:</bold> Random slopes for <italic>v1.in.language</italic> organized by vowel category on the y-axis. Each point represents the mean estimate for each language. We show vowels that appear in the most languages across the two corpora.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossa-11-17737-g9.png"/>
</fig>
<p>The <italic>v1.in.lang</italic> grouping variable has hundreds of levels for each model; the model estimates a separate identity effect (slope) for each language-specific vowel, such as &#8216;a.in.English&#8217;, rather than a unified effect for the phoneme /a/ across languages. In this post-hoc exploration, we assess whether phoneme-specific tendencies emerge independently across all languages. We also compute raw Observed/Expected values by comparing raw counts of identity to expected counts in languages where those vowels occur.</p>
<p>Additionally, to get more interpretable random effect estimates, but at the expense of obscuring between-language variability, we fit a separate model with only identity (and its interaction with inventory size) as a fixed-effect predictor, and the first vowel in the pair (<italic>v1</italic>) as a global, rather than language-specific, grouping variable.<xref ref-type="fn" rid="n3">3</xref> Thus, a random effect of identity for each vowel can be understood as that vowel&#8217;s deviation from the global vowel-general effect of identity on pair counts. A negative random slope for a particular vowel means that the identity effect is weaker for that vowel than the overall identity effect, and a positive random slope means that the identity effect is stronger for that vowel.</p>
<p><xref ref-type="fig" rid="F9">Figure 9</xref> shows all posterior means for random slopes grouped by vowel category, such that each vowel category is represented as a distribution of language-specific random slopes for that category. We show the seven categories that appear in the most languages across corpora. The data show somewhat consistent tendencies regarding cross-linguistic vowel-specific identity biases. In particular, while not robust, <xref ref-type="fig" rid="F9">Figure 9</xref> suggests that identity might be weakest for /a/ and /i/. There is a clear difference between the distribution for /u/ and /o/ and the distribution for /a/ or /i/, where most languages have positive random slopes and raw O/E values for /u/ and negative ones for /a/ or /i/.</p>
<p>For the model with only identity as a phonological predictor and global random slopes for vowels, we observe similar patterns. We show the posterior distributions for the same seven frequent vowels in <xref ref-type="fig" rid="F10">Figure 10</xref> (for a plot with all vowels, see Supplementary Materials). Both XPF and NorthEuraLex show broad similarities, namely a relative under-representation of identity for /a/ and /i/ compared to the global average.</p>
<fig id="F10">
<caption>
<p><bold>Figure 10:</bold> By-vowel random slopes for identity-only model. All random slopes are relative to the global main effect for identity, which is positive.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="glossa-11-17737-g10.png"/>
</fig>
<p>This exploration of variability of identity&#8217;s effect on vowel co-occurrence biases across vowel categories provides preliminary evidence that identity is sensitive to substantive acoustic and articulatory properties of vowels and thus that the abundance of identity is unlikely to be driven by operations that are entirely unmediated by such attributes. More generally, these differences suggest that despite the lack of robust featural alignment, substantive articulatory and perceptual properties that are common to vowel categories across languages nevertheless can influence their co-occurrence patterns.</p>
</sec>
</sec>
<sec>
<title>2.8 Summary</title>
<p>To summarize, the current study of vowel co-occurrence patterns across two large cross-linguistic datasets shows that languages over-represent identity, while there is a lack of strong evidence for universal harmony along featural dimensions, as most languages individually do not represent any form of harmony. A model comparison further confirms that identity meaningfully contributes to the predictive capacity of the model, beyond what would be expected from the added model complexity of including an extra predictor and its random effects. Lastly, a post-hoc exploration of between-vowel variability for identity reveals some systematic tendencies for identity to be over-represented in certain vowels more than others, suggesting that vowel co-occurrence is nevertheless likely to be sensitive to phonetic and phonological properties of vowel categories.</p>
</sec>
</sec>
<sec>
<title>3 General discussion</title>
<p>This study set out to investigate whether vowel co-occurrence patterns in the world&#8217;s lexicons universally reflect a soft vowel harmony bias. Concurrent analyses of 92 languages in the XPF corpus as well as 107 languages in the NorthEuraLex corpus independently offer evidence that vowel identity is robustly over-represented across lexicons. In contrast, evidence for systematic over-representation of featural alignment was limited or absent, which is consistent with prior work (e.g. <xref ref-type="bibr" rid="B27">Doucette et al. 2024</xref>). We also show that across both corpora, identity for vowel categories like /a/ and /i/ is under-represented relative to the global identity effect, suggesting that substantive properties of vowels are consequential for their non-adjacent co-occurrence.</p>
<sec>
<title>3.1 The role of featural alignment in vowel co-occurrence</title>
<p>The findings for featural harmony are consistent with the notion that vowel co-occurrence is not driven by either a preference for or avoidance of similarity, which aligns with work by Walter (<xref ref-type="bibr" rid="B75">2010</xref>) and Doucette et al. (<xref ref-type="bibr" rid="B27">2024</xref>). Interestingly, the aggregate vowel similarity measures used by Doucette et al. (<xref ref-type="bibr" rid="B27">2024</xref>) did not detect identity-independent similarity biases even for Turkic languages, so the current results nevertheless underscore that aggregate similarity measures like those used in that study can obscure real featurally structured dependencies in vowel co-occurrence. Still, given evidence for identity biases but no evidence for universal partial similarity biases (outside of languages with productive harmony), one might conclude that the articulatory properties of vowels or the effects of vowel-to-vowel coarticulation are generally too weak to be reflected in the co-occurrence patterns of non-adjacent vowels. Our exploration of vowel-specific effects on vowel identity biases, however, shows that some vowels exhibit a greater over-representation of identity than others. In particular, across both the XPF and NorthEuraLex corpora, identity for /i/ and /a/ is under-represented relative to other vowel categories like /y/, /e/ and /o/. Thus, there are likely to be language-general tendencies for certain identical vowel sequences to be more over-represented than others, and the current results suggest more broadly that universal properties of vowels can affect their co-occurrence patterns in measurable ways.</p>
</sec>
<sec>
<title>3.2 Why is vowel identity over-represented?</title>
<p>The statistical bias toward vowel identity was the most reliable cross-linguistic effect observed in this study and is in line with prior findings (<xref ref-type="bibr" rid="B3">Alderete &amp; Finley 2016</xref>; <xref ref-type="bibr" rid="B67">Stanton 2021</xref>; <xref ref-type="bibr" rid="B27">Doucette et al. 2024</xref>). While languages varied in the extent to which they over-represented identity, this pattern persisted across modeling approaches and corpora, and no language showed reliable evidence of a robust statistical bias against vowel identity. The findings align with earlier corpus studies that have reported positive vowel identity effects, and support the view that segmental identity is distinct from alignment along featural dimensions, or more generally that the relationship between vowel pairs&#8217; similarity and their attestation is nonlinear.</p>
<p>One possible explanation for the presence of an identity bias and absence of harmony biases is that identity emerges as a result of vowel assimilation but that assimilation is itself nonlinear. This view is compatible with some accounts of phonology proposing that assimilation is subject to similarity thresholds (<xref ref-type="bibr" rid="B76">Wayment 2009</xref>; <xref ref-type="bibr" rid="B22">Cole &amp; Trigo 1988</xref>; <xref ref-type="bibr" rid="B34">Gallagher &amp; Coon 2009</xref>). For example, parasitic vowel harmony is a well-documented nonlinearity in vowel co-occurrence whereby agreement along one featural dimension A is contingent on agreement along another featural dimension B (e.g. <xref ref-type="bibr" rid="B3">Alderete &amp; Finley 2016</xref>; <xref ref-type="bibr" rid="B22">Cole &amp; Trigo 1988</xref>). Wayment (<xref ref-type="bibr" rid="B76">2009</xref>) argues that examples of nonlinear assimilation like parasitic harmony follow from a general attractor principle, wherein assimilation is driven by forces that strengthen representational connections between sufficiently similar segments. It is thus possible that there are universal interaction effects between particular featural dimensions that account for the excess of identity. That the strength of identity varies by vowel category is at least broadly consistent with this notion, as interactive thresholding might result in asymmetries in which vowel in a pair is a target of a change that results in identity. We ultimately leave investigation of such interactions, and the extent to which they are typologically widespread or language-specific, to future work.</p>
<p>An alternative account is that identity effects in the lexicon arise not from assimilation but from distinct copying mechanisms such as reduplication or epenthesis. Some work has suggested that identity is uniquely privileged in phenomena like reduplication (e.g. <xref ref-type="bibr" rid="B53">McCarthy 1995</xref>) as well as in echo epenthesis, where for some languages, the epenthetic vowel is a copy of another vowel in the stem rather than a single default vowel (e.g. <xref ref-type="bibr" rid="B46">Kitto &amp; de Lacy 1999</xref>; <xref ref-type="bibr" rid="B44">Kawahara 2007</xref>). While the precise computational mechanism of these processes is contested, a generalization is that vowel identity can in principle emerge independently of processes for enforcing sub-phonemic featural similarity, via mechanisms that simply add information to an unfilled slot. Additionally, because they add to an unfilled slot, such mechanisms can also preserve lexical contrast and are thus less likely than assimilation to be disruptive to the phonological organization of the lexicon. Consider a toy case of reduplication:</p>
<list list-type="simple">
<list-item><p>(a) <italic>/etik/</italic> &#8594; <italic>[tet-etik]</italic></p></list-item>
<list-item><p>(b) <italic>/otik/</italic> &#8594; <italic>[tot-otik]</italic></p></list-item>
</list>
<p>These changes increase identity while maintaining the distinctiveness of the original stems. Identity in this case is no different from adding an affix. In contrast, consider the partial assimilation in (d), where <italic>/o/</italic> &#8594; <italic>[e]</italic> is conditioned by a following <italic>/i/</italic> to resolve a lack of agreement for backness:</p>
<list list-type="simple">
<list-item><p>(c) <italic>/etikt/</italic> &#8594; <italic>[etikt]</italic> (unchanged)</p></list-item>
<list-item><p>(d) <italic>/otikt/</italic> &#8594; <italic>[etikt]</italic> (changes to accommodate harmony restriction)</p></list-item>
</list>
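<p>The contrast between the reduplication cases in (a) and (b) and the assimilation cases in (c) and (d) can be checked mechanically. The Python sketch below is one possible reading of the toy rules, not an analysis of any real language: <monospace>reduplicate</monospace> prefixes a copy of the stem&#8217;s first consonant, first vowel, and first consonant again, and <monospace>assimilate</monospace> fronts /o/ to [e] when the next vowel is /i/. The function names and the five-vowel set are hypothetical.</p>
<preformat><![CDATA[
```python
VOWELS = set("aeiou")  # assumed toy inventory


def identity_count(form):
    """Count adjacent-on-the-vowel-tier pairs whose vowels are identical."""
    vowels = [ch for ch in form if ch in VOWELS]
    return sum(v1 == v2 for v1, v2 in zip(vowels, vowels[1:]))


def reduplicate(stem):
    """Toy copying: prefix C1 + V1 + C1 of the stem, as in (a) and (b)."""
    consonant = next(ch for ch in stem if ch not in VOWELS)
    vowel = next(ch for ch in stem if ch in VOWELS)
    return consonant + vowel + consonant + stem


def assimilate(stem):
    """Toy backness assimilation: /o/ becomes [e] before /i/, as in (d)."""
    chars = list(stem)
    positions = [i for i, ch in enumerate(chars) if ch in VOWELS]
    for here, following in zip(positions, positions[1:]):
        if chars[here] == "o" and chars[following] == "i":
            chars[here] = "e"
    return "".join(chars)


copied = [reduplicate(s) for s in ("etik", "otik")]
assimilated = [assimilate(s) for s in ("etikt", "otikt")]

# Copying adds one identical pair per word and keeps the two words distinct.
print(copied, [identity_count(w) for w in copied], len(set(copied)))
# Assimilation merges the two stems into a single surface form.
print(assimilated, len(set(assimilated)))
```
]]></preformat>
<p>Under these assumed rules, copying yields two distinct outputs that each contain one identical vowel pair, whereas assimilation collapses the two stems into one form, which is the loss of contrast described above.</p>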
<p>If synchronic processes or diachronic changes akin to morphological or epenthetic copying are in fact distinct from local assimilation as some work suggests (e.g. <xref ref-type="bibr" rid="B46">Kitto &amp; de Lacy 1999</xref>; <xref ref-type="bibr" rid="B53">McCarthy 1995</xref>), this could provide one account for why identity is systematically preserved in the lexicon, while featural assimilation is not. It is not clear, however, that such mechanisms alone could account for vowel-specific (and language-general) tendencies for exhibiting identity, meaning that a preference for identity must at least to some extent be sensitive to the phonological properties of particular sounds and cannot &#8220;bypass&#8221; such properties entirely. While the current results suffice to show that identity is unlikely to emerge solely from linear assimilation, more work is necessary to empirically substantiate the possibility that distinct non-assimilatory mechanisms account for an identity over-representation in any meaningful way.</p>
<p>It is also possible that surface-level factors like the relative perceptual salience of identical vowel pairs may contribute to their over-representation compared to featurally aligned pairs. For example, Mintz et al. (<xref ref-type="bibr" rid="B55">2018</xref>) found that identity, but not featural harmony, facilitated lexical segmentation in infants. They argue that featural harmony does not offer a sufficiently robust cue for segmentation. Because identity is directly perceptible on the surface, possibly at the level of conscious awareness, and does not require listeners to group vowels by sub-phonemic features, it may be easier to encode and recall, leading to its persistence in the lexicon over time. To the extent that some vowels are more perceptually salient than others, such a proposal is consistent with the finding that identity biases vary systematically by vowel category.</p>
<p>More generally, there is evidence that the constraints learners infer from linguistic data do not follow linearly from the data; for example, Breiss &amp; Albright (<xref ref-type="bibr" rid="B12">2022</xref>) find that learners make super-additive inferences from exposure to vowel co-occurrence data in an artificial language, and their grammaticality judgments do not match the frequency distributions of the input. In short, language users may themselves exhibit nonlinearities in terms of the inferences they draw from phonetic input, and to the extent that such inferences underlie grammars, they could result in nonlinearity in lexical data. Whether nonlinearities in learners&#8217; inferences can actually account for the empirical results presented here remains to be determined.</p>
<p>To better adjudicate the role that communicative or information-theoretic factors like preservation of lexical contrasts play in determining attestation of assimilation-driven phonological patterns, it would be necessary to quantify the actual effect such statistical biases have on the lexicon. An inviolable constraint for complete identity within words is intuitively untenable, so there is likely an upper limit beyond which a language cannot sustain an over-representation of identity. What constitutes such an upper limit, however, is ill-defined. The current O/E-based linear modeling approach is well-equipped to detect deviations from a baseline where sounds are allowed to combine freely, but is limited in its ability to show how close any effect is to such an upper limit. It is thus also necessary to describe the nature and extent of redundancy already present in a language&#8217;s baseline model of free combination that is attributable only to the relative frequency of sounds themselves. If a substantial degree of structure emerges even in the absence of explicit co-occurrence restrictions, then the overattestation of phenomena like identity could be lower than in a language whose baseline exhibits a high degree of lexical distinctiveness.</p>
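<p>The point about baseline redundancy can be made concrete with a small Python sketch. If first and second vowels combined freely, the expected share of identical pairs would be the sum, over vowels, of the product of their probabilities in the two positions. The function name <monospace>expected_identity_rate</monospace> and both frequency tables are invented for illustration and are not drawn from either corpus.</p>
<preformat><![CDATA[
```python
def expected_identity_rate(p_first, p_second):
    """Share of identical vowel pairs expected under free combination.

    p_first, p_second: dicts mapping each vowel to its probability of
    appearing as the first or second member of a pair, respectively.
    """
    return sum(p * p_second.get(vowel, 0.0) for vowel, p in p_first.items())


# Two invented five-vowel languages: one uniform, one dominated by /a/.
uniform = {v: 0.2 for v in "aeiou"}
skewed = {"a": 0.6, "e": 0.1, "i": 0.1, "o": 0.1, "u": 0.1}

print(round(expected_identity_rate(uniform, uniform), 3))
print(round(expected_identity_rate(skewed, skewed), 3))
```
]]></preformat>
<p>With these made-up frequencies, the skewed language already expects roughly twice as many identical pairs as the uniform one before any co-occurrence restriction applies, so the same O/E ratio would correspond to very different absolute amounts of identity in the two lexicons.</p>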
</sec>
<sec>
<title>3.3 Limitations and future directions</title>
<p>While this study shows that there is no universal alignment of vowel pairs along the dimensions of backness or height, the effect of other featural dimensions cannot be ruled out. It is also possible that all languages exhibit harmony-like biases in at least one dimension but that the dimensions of these biases are language-specific. This proposal is consistent with the notion that productive harmony systems more generally vary by the features they use to enforce agreement, such as ATR and roundness. These features were not included in the current models because they are not contrastive in many languages, and roundness is redundant with backness in many languages. The hierarchical modeling structure partially mitigates this limitation. Because the model includes random slopes by language and family, languages are not forced to conform to the patterns predicted by these two features alone; languages exhibiting harmony or co-occurrence patterns based on other dimensions (e.g. ATR, rounding, nasality) will show this through systematic deviations in their language-specific random effects. But more work is necessary to explicitly rule out the possibility that these or other featural dimensions consistently mediate co-occurrence patterns.</p>
<p>Relatedly, the current corpora do not make use of fully narrow transcriptions, and thus the analyses may ignore pressures in favor of similarity that occur along more granular dimensions that could be captured with narrower transcriptions. Future work should extend the current approach to more featural dimensions and to data with narrower transcriptions. Analyses of individual languages should also be carried out to uncover sources of variability that are potentially obscured in the large-scale approach taken here.</p>
</sec>
</sec>
<sec>
<title>4 Conclusion</title>
<p>The results of this study are consistent with the notion that there is a universal bias for vowel identity but that vowel co-occurrence is otherwise relatively unconstrained based on featural or gradient similarity (e.g. <xref ref-type="bibr" rid="B27">Doucette et al. 2024</xref>). One caveat is that the degree of identity biases varies by vowel category in language-general ways, suggesting that vowel co-occurrence is at least broadly sensitive to cross-linguistically generalizable substantive properties of vowel categories. Together, these results suggest that local assimilatory processes do not straightforwardly affect the attestation of certain vowel sequences, and in particular that some nonlinear mechanisms or biases are necessary to account for the prevalence of identity biases relative to partial featural similarity biases. More generally, the current findings suggest distributions of sounds in lexicons are subject to global lexical factors and not just the well-formedness of sound sequences.</p>
</sec>
</body>
<back>
<sec>
<title>Supplementary files</title>
<p><bold>Appendix:</bold> Corpus and Language Metadata Tables. DOI: <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://doi.org/10.16995/glossa.17737.s1">https://doi.org/10.16995/glossa.17737.s1</ext-link></p>
<p><bold>Supplementary Materials.</bold> DOI: <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://doi.org/10.16995/glossa.17737.s2">https://doi.org/10.16995/glossa.17737.s2</ext-link></p>
</sec>
<sec>
<title>Acknowledgements</title>
<p>We would like to thank Stefon Flego for helpful comments at an early stage of this project. We would also like to thank both anonymous reviewers for helpful feedback that improved the paper.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors have no competing interests to declare.</p>
</sec>
<fn-group>
<fn id="n1"><p>This paper uses terms like &#8220;bias&#8221; and &#8220;preference&#8221; as a convenient shorthand for the statistical over-representation of certain patterns. This language is not used to advance any theoretical stance regarding what that numerical over-representation means.</p></fn>
<fn id="n2"><p>The complete word lists were downloaded in 2018 and are no longer publicly available online.</p></fn>
<fn id="n3"><p>The model formula is <italic>pair.count ~ identity*inventory.size + (1 + identity | v1) + (1 + identity | language) + (1 + identity | family)</italic>. All hyper-parameters are identical to those of the full model.</p></fn>
</fn-group>
<ref-list>
<ref id="B1"><mixed-citation publication-type="journal"><string-name><surname>Albright</surname>, <given-names>Adam</given-names></string-name>. <year>2009</year>. <article-title>Modeling analogy as probabilistic grammar</article-title>. <source>Analogy in Grammar</source> <volume>3</volume>. <fpage>185</fpage>&#8211;<lpage>213</lpage>. DOI: <pub-id pub-id-type="doi">10.1093/acprof:oso/9780199547548.003.0009</pub-id></mixed-citation></ref>
<ref id="B2"><mixed-citation publication-type="book"><string-name><surname>Albright</surname>, <given-names>Adam</given-names></string-name> &amp; <string-name><surname>Breiss</surname>, <given-names>Canaan</given-names></string-name>. <year>2024</year>. <chapter-title>A Poisson model of phonological cooccurrence restrictions</chapter-title>. In <source>Proceedings of the 19th Conference on Laboratory Phonology (LabPhon 19)</source>. <publisher-loc>Seoul, South Korea</publisher-loc>.</mixed-citation></ref>
<ref id="B3"><mixed-citation publication-type="journal"><string-name><surname>Alderete</surname>, <given-names>John</given-names></string-name> &amp; <string-name><surname>Finley</surname>, <given-names>Sara</given-names></string-name>. <year>2016</year>. <article-title>Gradient vowel harmony in Oceanic</article-title>. <source>Language and Linguistics</source> <volume>17</volume>(<issue>6</issue>). <fpage>769</fpage>&#8211;<lpage>796</lpage>. DOI: <pub-id pub-id-type="doi">10.1177/1606822X16660960</pub-id></mixed-citation></ref>
<ref id="B4"><mixed-citation publication-type="journal"><string-name><surname>Aoki</surname>, <given-names>Haruo</given-names></string-name>. <year>1968</year>. <article-title>Toward a typology of vowel harmony</article-title>. <source>International Journal of American Linguistics</source> <volume>34</volume>(<issue>2</issue>). <fpage>142</fpage>&#8211;<lpage>145</lpage>. DOI: <pub-id pub-id-type="doi">10.1086/465006</pub-id></mixed-citation></ref>
<ref id="B5"><mixed-citation publication-type="book"><string-name><surname>Archangeli</surname>, <given-names>Diana</given-names></string-name> &amp; <string-name><surname>Pulleyblank</surname>, <given-names>Douglas</given-names></string-name>. <year>2007</year>. <chapter-title>Harmony</chapter-title>. In <string-name><surname>de Lacy</surname>, <given-names>Paul</given-names></string-name> (ed.), <source>The Cambridge Handbook of Phonology</source>, <fpage>353</fpage>&#8211;<lpage>378</lpage>. <publisher-loc>Cambridge, UK</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>. DOI: <pub-id pub-id-type="doi">10.1017/CBO9780511486371.016</pub-id></mixed-citation></ref>
<ref id="B6"><mixed-citation publication-type="journal"><string-name><surname>Beddor</surname>, <given-names>Patricia S.</given-names></string-name> <year>2009</year>. <article-title>A coarticulatory path to sound change</article-title>. <source>Language</source>, <fpage>785</fpage>&#8211;<lpage>821</lpage>. DOI: <pub-id pub-id-type="doi">10.1353/lan.0.0165</pub-id></mixed-citation></ref>
<ref id="B7"><mixed-citation publication-type="book"><string-name><surname>Bentz</surname>, <given-names>Christian</given-names></string-name> &amp; <string-name><surname>i Cancho</surname>, <given-names>Ramon Ferrer</given-names></string-name>. <year>2016</year>. <chapter-title>Zipf&#8217;s law of abbreviation as a language universal</chapter-title>. In <source>Proceedings of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics</source>, <fpage>1</fpage>&#8211;<lpage>4</lpage>. <publisher-name>University of T&#252;bingen</publisher-name>.</mixed-citation></ref>
<ref id="B8"><mixed-citation publication-type="book"><string-name><surname>Bills</surname>, <given-names>Aric</given-names></string-name> &amp; <string-name><surname>Bishop</surname>, <given-names>Judith</given-names></string-name> &amp; <string-name><surname>David</surname>, <given-names>Anne</given-names></string-name> &amp; <string-name><surname>Dubinski</surname>, <given-names>Eyal</given-names></string-name> &amp; <string-name><surname>Fiscus</surname>, <given-names>Jonathan G.</given-names></string-name> &amp; <string-name><surname>Hammond</surname>, <given-names>Simon</given-names></string-name> &amp; <string-name><surname>Gann</surname>, <given-names>Ketty</given-names></string-name> &amp; <string-name><surname>Harper</surname>, <given-names>Mary</given-names></string-name> &amp; <string-name><surname>Hefright</surname>, <given-names>Brook</given-names></string-name> &amp; <string-name><surname>Kazi</surname>, <given-names>Michael</given-names></string-name> &amp; <string-name><surname>Lam</surname>, <given-names>Julie</given-names></string-name> &amp; <string-name><surname>Ray</surname>, <given-names>Jessica</given-names></string-name> &amp; <string-name><surname>Richardson</surname>, <given-names>Fred</given-names></string-name> &amp; <string-name><surname>Rytting</surname>, <given-names>Anton</given-names></string-name> &amp; <string-name><surname>Walter</surname>, <given-names>Marle</given-names></string-name>. <year>2016</year>. <chapter-title>IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a LDC2016S12</chapter-title>. Web Download. <publisher-loc>Philadelphia</publisher-loc>: <publisher-name>Linguistic Data Consortium</publisher-name>. DOI: <pub-id pub-id-type="doi">10.35111/dcr5-ga44</pub-id></mixed-citation></ref>
<ref id="B9"><mixed-citation publication-type="book"><string-name><surname>Blevins</surname>, <given-names>Juliette</given-names></string-name>. <year>2004</year>. <source>Evolutionary Phonology: The Emergence of Sound Patterns</source>. <publisher-name>Cambridge University Press</publisher-name>. DOI: <pub-id pub-id-type="doi">10.1017/CBO9780511486357</pub-id></mixed-citation></ref>
<ref id="B10"><mixed-citation publication-type="journal"><string-name><surname>Blevins</surname>, <given-names>Juliette</given-names></string-name> &amp; <string-name><surname>Wedel</surname>, <given-names>Andrew</given-names></string-name>. <year>2009</year>. <article-title>Inhibited sound change: An evolutionary approach to lexical competition</article-title>. <source>Diachronica</source> <volume>26</volume>(<issue>2</issue>). <fpage>143</fpage>&#8211;<lpage>183</lpage>. DOI: <pub-id pub-id-type="doi">10.1075/dia.26.2.01ble</pub-id></mixed-citation></ref>
<ref id="B11"><mixed-citation publication-type="journal"><string-name><surname>Boersma</surname>, <given-names>Paul</given-names></string-name> &amp; <string-name><surname>Hayes</surname>, <given-names>Bruce</given-names></string-name>. <year>2001</year>. <article-title>Empirical tests of the gradual learning algorithm</article-title>. <source>Linguistic Inquiry</source> <volume>32</volume>(<issue>1</issue>). <fpage>45</fpage>&#8211;<lpage>86</lpage>. DOI: <pub-id pub-id-type="doi">10.1162/002438901554586</pub-id></mixed-citation></ref>
<ref id="B12"><mixed-citation publication-type="journal"><string-name><surname>Breiss</surname>, <given-names>Canaan</given-names></string-name> &amp; <string-name><surname>Albright</surname>, <given-names>Adam</given-names></string-name>. <year>2022</year>. <article-title>Cumulative markedness effects and (non-)linearity in phonotactics</article-title>. <source>Glossa: A Journal of General Linguistics</source> <volume>7</volume>(<issue>1</issue>). <fpage>1</fpage>&#8211;<lpage>34</lpage>. DOI: <pub-id pub-id-type="doi">10.16995/glossa.5713</pub-id></mixed-citation></ref>
<ref id="B13"><mixed-citation publication-type="journal"><string-name><surname>B&#252;rkner</surname>, <given-names>Paul-Christian</given-names></string-name>. <year>2018</year>. <article-title>Advanced Bayesian multilevel modeling with the R package brms</article-title>. <source>The R Journal</source> <volume>10</volume>(<issue>1</issue>). <fpage>395</fpage>&#8211;<lpage>411</lpage>. DOI: <pub-id pub-id-type="doi">10.32614/RJ-2018-017</pub-id></mixed-citation></ref>
<ref id="B14"><mixed-citation publication-type="journal"><string-name><surname>Caplan</surname>, <given-names>Spencer</given-names></string-name> &amp; <string-name><surname>Kodner</surname>, <given-names>Jordan</given-names></string-name> &amp; <string-name><surname>Yang</surname>, <given-names>Charles</given-names></string-name>. <year>2020</year>. <article-title>Miller&#8217;s monkey updated: Communicative efficiency and the statistics of words in natural language</article-title>. <source>Cognition</source> <volume>205</volume>. <elocation-id>104466</elocation-id>. DOI: <pub-id pub-id-type="doi">10.1016/j.cognition.2020.104466</pub-id></mixed-citation></ref>
<ref id="B15"><mixed-citation publication-type="journal"><string-name><surname>Casali</surname>, <given-names>Roderic F.</given-names></string-name> <year>2008</year>. <article-title>ATR harmony in African languages</article-title>. <source>Language and Linguistics Compass</source> <volume>2</volume>(<issue>3</issue>). <fpage>496</fpage>&#8211;<lpage>549</lpage>. DOI: <pub-id pub-id-type="doi">10.1111/j.1749-818X.2008.00064.x</pub-id></mixed-citation></ref>
<ref id="B16"><mixed-citation publication-type="journal"><string-name><surname>Cohen Priva</surname>, <given-names>Uriel</given-names></string-name>. <year>2017</year>. <article-title>Informativity and the actuation of lenition</article-title>. <source>Language</source> <volume>93</volume>(<issue>3</issue>). <fpage>569</fpage>&#8211;<lpage>597</lpage>. DOI: <pub-id pub-id-type="doi">10.1353/lan.2017.0037</pub-id></mixed-citation></ref>
<ref id="B17"><mixed-citation publication-type="journal"><string-name><surname>Cohen Priva</surname>, <given-names>Uriel</given-names></string-name> &amp; <string-name><surname>Jaeger</surname>, <given-names>T. Florian</given-names></string-name>. <year>2018</year>. <article-title>The interdependence of frequency, predictability, and informativity in the segmental domain</article-title>. <source>Linguistics Vanguard</source> <volume>4</volume>(<issue>s2</issue>). <elocation-id>20170028</elocation-id>. DOI: <pub-id pub-id-type="doi">10.1515/lingvan-2017-0028</pub-id></mixed-citation></ref>
<ref id="B18"><mixed-citation publication-type="journal"><string-name><surname>Cohen Priva</surname>, <given-names>Uriel</given-names></string-name> &amp; <string-name><surname>Strand</surname>, <given-names>Elizabeth</given-names></string-name>. <year>2023</year>. <article-title>Schwa&#8217;s duration and acoustic position in American English</article-title>. <source>Journal of Phonetics</source> <volume>96</volume>. <elocation-id>101198</elocation-id>. DOI: <pub-id pub-id-type="doi">10.1016/j.wocn.2022.101198</pub-id></mixed-citation></ref>
<ref id="B19"><mixed-citation publication-type="webpage"><string-name><surname>Cohen Priva</surname>, <given-names>Uriel</given-names></string-name> &amp; <string-name><surname>Strand</surname>, <given-names>Emily</given-names></string-name> &amp; <string-name><surname>Yang</surname>, <given-names>Shiying</given-names></string-name> &amp; <string-name><surname>Mizgerd</surname>, <given-names>William</given-names></string-name> &amp; <string-name><surname>Creighton</surname>, <given-names>Abigail</given-names></string-name> &amp; <string-name><surname>Bai</surname>, <given-names>Justin</given-names></string-name> &amp; <string-name><surname>Mathew</surname>, <given-names>Rebecca</given-names></string-name> &amp; <string-name><surname>Shao</surname>, <given-names>Allison</given-names></string-name> &amp; <string-name><surname>Schuster</surname>, <given-names>Jordan</given-names></string-name> &amp; <string-name><surname>Wiepert</surname>, <given-names>Daniela</given-names></string-name>. <year>2021</year>. <source>The Cross-Linguistic Phonological Frequencies (XPF) Corpus manual</source>. <uri>https://cohenpr-xpf.github.io/XPF/manual/xpf_manual.pdf</uri>.</mixed-citation></ref>
<ref id="B20"><mixed-citation publication-type="journal"><string-name><surname>Cohen Priva</surname>, <given-names>Uriel</given-names></string-name> &amp; <string-name><surname>Yang</surname>, <given-names>Shiying</given-names></string-name> &amp; <string-name><surname>Strand</surname>, <given-names>Emily</given-names></string-name>. <year>2020</year>. <article-title>The stability of segmental properties across genre and corpus types in low-resource languages</article-title>. In <source>Proceedings of the Society for Computation in Linguistics</source>, vol. <volume>3</volume>. <fpage>1</fpage>&#8211;<lpage>9</lpage>.</mixed-citation></ref>
<ref id="B21"><mixed-citation publication-type="journal"><string-name><surname>Cole</surname>, <given-names>Jennifer</given-names></string-name>. <year>2009</year>. <article-title>Emergent feature structures: Harmony systems in exemplar models of phonology</article-title>. <source>Language Sciences</source> <volume>31</volume>(<issue>2&#8211;3</issue>). <fpage>144</fpage>&#8211;<lpage>160</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.langsci.2008.12.004</pub-id></mixed-citation></ref>
<ref id="B22"><mixed-citation publication-type="journal"><string-name><surname>Cole</surname>, <given-names>Jennifer</given-names></string-name> &amp; <string-name><surname>Trigo</surname>, <given-names>Loren</given-names></string-name>. <year>1988</year>. <article-title>Parasitic harmony</article-title>. <source>Features, Segmental Structure and Harmony Processes (Part II)</source>. <fpage>19</fpage>&#8211;<lpage>38</lpage>. DOI: <pub-id pub-id-type="doi">10.1515/9783110250497-004</pub-id></mixed-citation></ref>
<ref id="B23"><mixed-citation publication-type="book"><string-name><surname>Coleman</surname>, <given-names>John</given-names></string-name> &amp; <string-name><surname>Pierrehumbert</surname>, <given-names>Janet</given-names></string-name>. <year>1997</year>. <chapter-title>Stochastic phonological grammars and acceptability</chapter-title>. In <source>Proceedings of the Third Meeting of the ACL Special Interest Group in Computational Phonology</source>, <fpage>1</fpage>&#8211;<lpage>8</lpage>. <publisher-name>Association for Computational Linguistics</publisher-name>.</mixed-citation></ref>
<ref id="B24"><mixed-citation publication-type="journal"><string-name><surname>Dautriche</surname>, <given-names>Isabelle</given-names></string-name> &amp; <string-name><surname>Mahowald</surname>, <given-names>Kyle</given-names></string-name> &amp; <string-name><surname>Gibson</surname>, <given-names>Edward</given-names></string-name> &amp; <string-name><surname>Piantadosi</surname>, <given-names>Steven T.</given-names></string-name> <year>2017</year>. <article-title>Wordform similarity increases with semantic similarity: An analysis of 100 languages</article-title>. <source>Cognitive Science</source> <volume>41</volume>(<issue>8</issue>). <fpage>2149</fpage>&#8211;<lpage>2169</lpage>. DOI: <pub-id pub-id-type="doi">10.1111/cogs.12453</pub-id></mixed-citation></ref>
<ref id="B25"><mixed-citation publication-type="journal"><string-name><surname>De Smet</surname>, <given-names>Ive</given-names></string-name> &amp; <string-name><surname>Rosseel</surname>, <given-names>Laurens</given-names></string-name>. <year>2023</year>. <article-title>Who&#8217;s afraid of homophones? A multimethodological approach to homophony avoidance</article-title>. <source>Language and Cognition</source>, <fpage>1</fpage>&#8211;<lpage>24</lpage>. DOI: <pub-id pub-id-type="doi">10.1017/langcog.2023.50</pub-id></mixed-citation></ref>
<ref id="B26"><mixed-citation publication-type="journal"><string-name><surname>Dellert</surname>, <given-names>Johannes</given-names></string-name> &amp; <string-name><surname>J&#228;ger</surname>, <given-names>Gerhard</given-names></string-name>. <year>2020</year>. <article-title>NorthEuraLex: A wide-coverage lexical database of Northern Eurasia</article-title>. <source>Language Resources and Evaluation</source> <volume>54</volume>. <fpage>273</fpage>&#8211;<lpage>301</lpage>. DOI: <pub-id pub-id-type="doi">10.1007/s10579-019-09480-6</pub-id></mixed-citation></ref>
<ref id="B27"><mixed-citation publication-type="journal"><string-name><surname>Doucette</surname>, <given-names>Abigail</given-names></string-name> &amp; <string-name><surname>O&#8217;Donnell</surname>, <given-names>Timothy J.</given-names></string-name> &amp; <string-name><surname>Sonderegger</surname>, <given-names>Morgan</given-names></string-name> &amp; <string-name><surname>Goad</surname>, <given-names>Heather</given-names></string-name>. <year>2024</year>. <article-title>Investigating the universality of consonant and vowel co-occurrence restrictions</article-title>. <source>Glossa: A Journal of General Linguistics</source> <volume>9</volume>(<issue>1</issue>). <fpage>1</fpage>&#8211;<lpage>33</lpage>. DOI: <pub-id pub-id-type="doi">10.16995/glossa.9373</pub-id></mixed-citation></ref>
<ref id="B28"><mixed-citation publication-type="journal"><string-name><surname>Fagyal</surname>, <given-names>Zsuzsanna</given-names></string-name> &amp; <string-name><surname>Nguyen</surname>, <given-names>No&#235;l</given-names></string-name> &amp; <string-name><surname>Boula de Mare&#252;il</surname>, <given-names>Philippe</given-names></string-name>. <year>2003</year>. <article-title>From dilation to coarticulation: Is there vowel harmony in French?</article-title> <source>Studies in the Linguistic Sciences</source> <volume>32</volume>(<issue>2</issue>). <fpage>1</fpage>&#8211;<lpage>21</lpage>.</mixed-citation></ref>
<ref id="B29"><mixed-citation publication-type="journal"><string-name><surname>Finley</surname>, <given-names>Sara</given-names></string-name>. <year>2010</year>. <article-title>Exceptions in vowel harmony are local</article-title>. <source>Lingua</source> <volume>120</volume>(<issue>6</issue>). <fpage>1549</fpage>&#8211;<lpage>1566</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.lingua.2009.10.003</pub-id></mixed-citation></ref>
<ref id="B30"><mixed-citation publication-type="book"><string-name><surname>Finley</surname>, <given-names>Sara</given-names></string-name> &amp; <string-name><surname>Badecker</surname>, <given-names>William</given-names></string-name>. <year>2008</year>. <chapter-title>Analytic biases for vowel harmony languages</chapter-title>. In <string-name><surname>Chan</surname>, <given-names>Natasha</given-names></string-name> &amp; <string-name><surname>Henderson</surname>, <given-names>Claire</given-names></string-name> &amp; <string-name><surname>Kurisu</surname>, <given-names>Kenshi</given-names></string-name> (eds.), <source>Proceedings of the 27th West Coast Conference on Formal Linguistics (WCCFL)</source>, <fpage>168</fpage>&#8211;<lpage>176</lpage>. <publisher-name>Cascadilla Proceedings Project</publisher-name>.</mixed-citation></ref>
<ref id="B31"><mixed-citation publication-type="journal"><string-name><surname>Flego</surname>, <given-names>Stefon</given-names></string-name> &amp; <string-name><surname>Forrest</surname>, <given-names>Jessica</given-names></string-name>. <year>2021</year>. <article-title>Leveraging the temporal dynamics of anticipatory vowel-to-vowel coarticulation in linguistic prediction: A statistical modeling approach</article-title>. <source>Journal of Phonetics</source> <volume>88</volume>. <elocation-id>101093</elocation-id>. DOI: <pub-id pub-id-type="doi">10.1016/j.wocn.2021.101093</pub-id></mixed-citation></ref>
<ref id="B32"><mixed-citation publication-type="book"><string-name><surname>Flemming</surname>, <given-names>Edward</given-names></string-name>. <year>2004</year>. <chapter-title>Contrast and perceptual distinctiveness</chapter-title>. In <string-name><surname>Hayes</surname>, <given-names>Bruce</given-names></string-name> &amp; <string-name><surname>Kirchner</surname>, <given-names>Robert</given-names></string-name> &amp; <string-name><surname>Steriade</surname>, <given-names>Donca</given-names></string-name> (eds.), <source>Phonetically Based Phonology</source>, <fpage>232</fpage>&#8211;<lpage>276</lpage>. <publisher-name>Cambridge University Press</publisher-name>. DOI: <pub-id pub-id-type="doi">10.1017/CBO9780511486401.008</pub-id></mixed-citation></ref>
<ref id="B33"><mixed-citation publication-type="journal"><string-name><surname>Frisch</surname>, <given-names>Stefan A.</given-names></string-name> &amp; <string-name><surname>Pierrehumbert</surname>, <given-names>Janet B.</given-names></string-name> &amp; <string-name><surname>Broe</surname>, <given-names>Michael B.</given-names></string-name> <year>2004</year>. <article-title>Similarity avoidance and the OCP</article-title>. <source>Natural Language &amp; Linguistic Theory</source> <volume>22</volume>(<issue>1</issue>). <fpage>179</fpage>&#8211;<lpage>228</lpage>. DOI: <pub-id pub-id-type="doi">10.1023/B:NALA.0000005557.78535.3c</pub-id></mixed-citation></ref>
<ref id="B34"><mixed-citation publication-type="journal"><string-name><surname>Gallagher</surname>, <given-names>Gillian</given-names></string-name> &amp; <string-name><surname>Coon</surname>, <given-names>Jessica</given-names></string-name>. <year>2009</year>. <article-title>Distinguishing total and partial identity: Evidence from Chol</article-title>. <source>Natural Language &amp; Linguistic Theory</source> <volume>27</volume>. <fpage>545</fpage>&#8211;<lpage>582</lpage>. DOI: <pub-id pub-id-type="doi">10.1007/s11049-009-9075-3</pub-id></mixed-citation></ref>
<ref id="B35"><mixed-citation publication-type="journal"><string-name><surname>Goldsmith</surname>, <given-names>John</given-names></string-name>. <year>1985</year>. <article-title>Vowel harmony in Khalkha Mongolian, Yaka, Finnish and Hungarian</article-title>. <source>Phonology</source> <volume>2</volume>. <fpage>253</fpage>&#8211;<lpage>275</lpage>. DOI: <pub-id pub-id-type="doi">10.1017/S0952675700000452</pub-id></mixed-citation></ref>
<ref id="B36"><mixed-citation publication-type="journal"><string-name><surname>Goldsmith</surname>, <given-names>John</given-names></string-name> &amp; <string-name><surname>Riggle</surname>, <given-names>Jason</given-names></string-name>. <year>2012</year>. <article-title>Information theoretic approaches to phonological structure: The case of Finnish vowel harmony</article-title>. <source>Natural Language &amp; Linguistic Theory</source> <volume>30</volume>. <fpage>859</fpage>&#8211;<lpage>896</lpage>. DOI: <pub-id pub-id-type="doi">10.1007/s11049-012-9169-1</pub-id></mixed-citation></ref>
<ref id="B37"><mixed-citation publication-type="book"><string-name><surname>Gurevich</surname>, <given-names>Naomi</given-names></string-name>. <year>2013</year>. <source>Lenition and contrast: The functional consequences of certain phonetically conditioned sound changes</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Routledge</publisher-name>. DOI: <pub-id pub-id-type="doi">10.4324/9780203505052</pub-id></mixed-citation></ref>
<ref id="B38"><mixed-citation publication-type="journal"><string-name><surname>Harrison</surname>, <given-names>K. David</given-names></string-name> &amp; <string-name><surname>Dras</surname>, <given-names>Mark</given-names></string-name> &amp; <string-name><surname>Kapicioglu</surname>, <given-names>Berk</given-names></string-name>. <year>2002</year>. <article-title>Agent-based modeling of the evolution of vowel harmony</article-title>. In <source>Proceedings of the North East Linguistics Society (NELS)</source>, vol. <volume>32</volume>. <elocation-id>14</elocation-id>.</mixed-citation></ref>
<ref id="B39"><mixed-citation publication-type="journal"><string-name><surname>Hay</surname>, <given-names>Jennifer</given-names></string-name> &amp; <string-name><surname>Pierrehumbert</surname>, <given-names>Janet</given-names></string-name> &amp; <string-name><surname>Beckman</surname>, <given-names>Mary</given-names></string-name>. <year>2004</year>. <article-title>Speech perception, well-formedness, and the statistics of the lexicon</article-title>. <source>Papers in Laboratory Phonology VI</source>. <fpage>58</fpage>&#8211;<lpage>74</lpage>.</mixed-citation></ref>
<ref id="B40"><mixed-citation publication-type="journal"><string-name><surname>Hayes</surname>, <given-names>Bruce</given-names></string-name> &amp; <string-name><surname>Londe</surname>, <given-names>Zsuzsa Czir&#225;ky</given-names></string-name>. <year>2006</year>. <article-title>Stochastic phonological knowledge: The case of Hungarian vowel harmony</article-title>. <source>Phonology</source> <volume>23</volume>(<issue>1</issue>). <fpage>59</fpage>&#8211;<lpage>104</lpage>. DOI: <pub-id pub-id-type="doi">10.1017/S0952675706000765</pub-id></mixed-citation></ref>
<ref id="B41"><mixed-citation publication-type="journal"><string-name><surname>Hayes</surname>, <given-names>Bruce</given-names></string-name> &amp; <string-name><surname>Wilson</surname>, <given-names>Colin</given-names></string-name>. <year>2008</year>. <article-title>A maximum entropy model of phonotactics and phonotactic learning</article-title>. <source>Linguistic Inquiry</source> <volume>39</volume>(<issue>3</issue>). <fpage>379</fpage>&#8211;<lpage>440</lpage>. DOI: <pub-id pub-id-type="doi">10.1162/ling.2008.39.3.379</pub-id></mixed-citation></ref>
<ref id="B42"><mixed-citation publication-type="journal"><string-name><surname>Huang</surname>, <given-names>Tingyu</given-names></string-name> &amp; <string-name><surname>Do</surname>, <given-names>Youngah</given-names></string-name>. <year>2023</year>. <article-title>Substantive bias and variation in the acquisition of vowel harmony</article-title>. <source>Glossa: A Journal of General Linguistics</source>. DOI: <pub-id pub-id-type="doi">10.16995/glossa.9313</pub-id></mixed-citation></ref>
<ref id="B43"><mixed-citation publication-type="journal"><string-name><surname>Hyman</surname>, <given-names>Larry M.</given-names></string-name> <year>2003</year>. <article-title>Sound change, misanalysis, and analogy in the Bantu causative</article-title>. <source>Journal of African Languages and Linguistics</source> <volume>24</volume>(<issue>1</issue>). <fpage>55</fpage>&#8211;<lpage>90</lpage>. DOI: <pub-id pub-id-type="doi">10.1515/jall.2003.004</pub-id></mixed-citation></ref>
<ref id="B44"><mixed-citation publication-type="journal"><string-name><surname>Kawahara</surname>, <given-names>Shigeto</given-names></string-name>. <year>2007</year>. <article-title>Copying and spreading in phonological theory: Evidence from echo epenthesis</article-title>. <source>UMOP</source> <volume>32</volume>, <fpage>111</fpage>&#8211;<lpage>144</lpage>.</mixed-citation></ref>
<ref id="B45"><mixed-citation publication-type="journal"><string-name><surname>King</surname>, <given-names>Adam</given-names></string-name> &amp; <string-name><surname>Wedel</surname>, <given-names>Andrew</given-names></string-name>. <year>2020</year>. <article-title>Greater early disambiguating information for less-probable words: The lexicon is shaped by incremental processing</article-title>. <source>Open Mind</source> <volume>4</volume>. <fpage>1</fpage>&#8211;<lpage>12</lpage>. DOI: <pub-id pub-id-type="doi">10.1162/opmi_a_00030</pub-id></mixed-citation></ref>
<ref id="B46"><mixed-citation publication-type="book"><string-name><surname>Kitto</surname>, <given-names>Catherine</given-names></string-name> &amp; <string-name><surname>de Lacy</surname>, <given-names>Paul</given-names></string-name>. <year>1999</year>. <chapter-title>Correspondence and epenthetic quality</chapter-title>. In <string-name><surname>Shahin</surname>, <given-names>Kimary N.</given-names></string-name> &amp; <string-name><surname>Blake</surname>, <given-names>Susan</given-names></string-name> &amp; <string-name><surname>Kim</surname>, <given-names>Eun-Sook</given-names></string-name> (eds.), <source>Toronto Working Papers in Linguistics</source>, vol. <volume>17</volume>, <fpage>163</fpage>&#8211;<lpage>187</lpage>. <publisher-name>University of Toronto Department of Linguistics</publisher-name>.</mixed-citation></ref>
<ref id="B47"><mixed-citation publication-type="journal"><string-name><surname>Linders</surname>, <given-names>Guido M.</given-names></string-name> &amp; <string-name><surname>Louwerse</surname>, <given-names>Max M.</given-names></string-name> <year>2023</year>. <article-title>Zipf&#8217;s law revisited: Spoken dialog, linguistic units, parameters, and the principle of least effort</article-title>. <source>Psychonomic Bulletin &amp; Review</source> <volume>30</volume>(<issue>1</issue>). <fpage>77</fpage>&#8211;<lpage>101</lpage>. DOI: <pub-id pub-id-type="doi">10.3758/s13423-022-02142-9</pub-id></mixed-citation></ref>
<ref id="B48"><mixed-citation publication-type="thesis"><string-name><surname>Linebaugh</surname>, <given-names>Gary Dean</given-names></string-name>. <year>2007</year>. <source>Phonetic grounding and phonology: Vowel backness harmony and vowel height harmony</source>. Ph.D. dissertation, <publisher-name>University of Illinois at Urbana-Champaign</publisher-name>.</mixed-citation></ref>
<ref id="B49"><mixed-citation publication-type="journal"><string-name><surname>Magen</surname>, <given-names>Harriet S.</given-names></string-name> <year>1997</year>. <article-title>The extent of vowel-to-vowel coarticulation in English</article-title>. <source>Journal of Phonetics</source> <volume>25</volume>(<issue>2</issue>). <fpage>187</fpage>&#8211;<lpage>205</lpage>. DOI: <pub-id pub-id-type="doi">10.1006/jpho.1996.0041</pub-id></mixed-citation></ref>
<ref id="B50"><mixed-citation publication-type="journal"><string-name><surname>Mahowald</surname>, <given-names>Kyle</given-names></string-name> &amp; <string-name><surname>Dautriche</surname>, <given-names>Isabelle</given-names></string-name> &amp; <string-name><surname>Gibson</surname>, <given-names>Edward</given-names></string-name> &amp; <string-name><surname>Piantadosi</surname>, <given-names>Steven T.</given-names></string-name> <year>2018</year>. <article-title>Word forms are structured for efficient use</article-title>. <source>Cognitive Science</source> <volume>42</volume>(<issue>8</issue>). <fpage>3116</fpage>&#8211;<lpage>3134</lpage>. DOI: <pub-id pub-id-type="doi">10.1111/cogs.12689</pub-id></mixed-citation></ref>
<ref id="B51"><mixed-citation publication-type="journal"><string-name><surname>Martin</surname>, <given-names>Alexander</given-names></string-name> &amp; <string-name><surname>Peperkamp</surname>, <given-names>Sharon</given-names></string-name>. <year>2020</year>. <article-title>Phonetically natural rules benefit from a learning bias: A re-examination of vowel harmony and disharmony</article-title>. <source>Phonology</source> <volume>37</volume>(<issue>1</issue>). <fpage>65</fpage>&#8211;<lpage>90</lpage>. DOI: <pub-id pub-id-type="doi">10.1017/S0952675720000044</pub-id></mixed-citation></ref>
<ref id="B52"><mixed-citation publication-type="journal"><string-name><surname>Martin</surname>, <given-names>Alexander</given-names></string-name> &amp; <string-name><surname>White</surname>, <given-names>James</given-names></string-name>. <year>2021</year>. <article-title>Vowel harmony and disharmony are not equivalent in learning</article-title>. <source>Linguistic Inquiry</source> <volume>52</volume>(<issue>1</issue>). <fpage>227</fpage>&#8211;<lpage>239</lpage>. DOI: <pub-id pub-id-type="doi">10.1162/ling_a_00375</pub-id></mixed-citation></ref>
<ref id="B53"><mixed-citation publication-type="journal"><string-name><surname>McCarthy</surname>, <given-names>John J.</given-names></string-name> <year>1995</year>. <article-title>Extensions of faithfulness: Rotuman revisited</article-title>. Tech. Rep. ROA-63, Rutgers Optimality Archive. Amherst, MA.</mixed-citation></ref>
<ref id="B54"><mixed-citation publication-type="journal"><string-name><surname>Mersad</surname>, <given-names>Karima</given-names></string-name> &amp; <string-name><surname>Nazzi</surname>, <given-names>Thierry</given-names></string-name>. <year>2011</year>. <article-title>Transitional probabilities and positional frequency phonotactics in a hierarchical model of speech segmentation</article-title>. <source>Memory &amp; Cognition</source> <volume>39</volume>. <fpage>1085</fpage>&#8211;<lpage>1093</lpage>. DOI: <pub-id pub-id-type="doi">10.3758/s13421-011-0074-3</pub-id></mixed-citation></ref>
<ref id="B55"><mixed-citation publication-type="journal"><string-name><surname>Mintz</surname>, <given-names>Toben H.</given-names></string-name> &amp; <string-name><surname>Walker</surname>, <given-names>Rachel L.</given-names></string-name> &amp; <string-name><surname>Welday</surname>, <given-names>Ashlee</given-names></string-name> &amp; <string-name><surname>Kidd</surname>, <given-names>Celeste</given-names></string-name>. <year>2018</year>. <article-title>Infants&#8217; sensitivity to vowel harmony and its role in segmenting speech</article-title>. <source>Cognition</source> <volume>171</volume>. <fpage>95</fpage>&#8211;<lpage>107</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.cognition.2017.10.020</pub-id></mixed-citation></ref>
<ref id="B56"><mixed-citation publication-type="journal"><string-name><surname>Moran</surname>, <given-names>Steven</given-names></string-name> &amp; <string-name><surname>McCloy</surname>, <given-names>Daniel</given-names></string-name>. <year>2019</year>. <article-title>PHOIBLE 2.0</article-title>. <source>Jena: Max Planck Institute for the Science of Human History</source>, <volume>10</volume>.</mixed-citation></ref>
<ref id="B57"><mixed-citation publication-type="journal"><string-name><surname>Oganian</surname>, <given-names>Yulia</given-names></string-name> &amp; <string-name><surname>Bhaya-Grossman</surname>, <given-names>Ilina</given-names></string-name> &amp; <string-name><surname>Johnson</surname>, <given-names>Keith</given-names></string-name> &amp; <string-name><surname>Chang</surname>, <given-names>Edward F.</given-names></string-name> <year>2023</year>. <article-title>Vowel and formant representation in the human auditory speech cortex</article-title>. <source>Neuron</source> <volume>111</volume>(<issue>13</issue>). <fpage>2105</fpage>&#8211;<lpage>2118</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.neuron.2023.04.004</pub-id></mixed-citation></ref>
<ref id="B58"><mixed-citation publication-type="journal"><string-name><surname>Ohala</surname>, <given-names>John J.</given-names></string-name> <year>1994</year>. <article-title>Towards a universal, phonetically-based, theory of vowel harmony</article-title>. In <source>ICSLP</source>, vol. <volume>3</volume>. <fpage>491</fpage>&#8211;<lpage>494</lpage>. DOI: <pub-id pub-id-type="doi">10.21437/ICSLP.1994-113</pub-id></mixed-citation></ref>
<ref id="B59"><mixed-citation publication-type="journal"><string-name><surname>Omane</surname>, <given-names>Paul Okyere</given-names></string-name> &amp; <string-name><surname>Benders</surname>, <given-names>Titia</given-names></string-name> &amp; <string-name><surname>Boll-Avetisyan</surname>, <given-names>Natalie</given-names></string-name>. <year>2024</year>. <article-title>Vowel harmony preferences in infants growing up in multilingual Ghana (Africa)</article-title>. <source>Developmental Psychology</source> <volume>60</volume>(<issue>8</issue>). <elocation-id>1372</elocation-id>. DOI: <pub-id pub-id-type="doi">10.1037/dev0001776</pub-id></mixed-citation></ref>
<ref id="B60"><mixed-citation publication-type="journal"><string-name><surname>Piantadosi</surname>, <given-names>Steven T.</given-names></string-name> &amp; <string-name><surname>Tily</surname>, <given-names>Harry</given-names></string-name> &amp; <string-name><surname>Gibson</surname>, <given-names>Edward</given-names></string-name>. <year>2012</year>. <article-title>The communicative function of ambiguity in language</article-title>. <source>Cognition</source> <volume>122</volume>(<issue>3</issue>). <fpage>280</fpage>&#8211;<lpage>291</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.cognition.2011.10.004</pub-id></mixed-citation></ref>
<ref id="B61"><mixed-citation publication-type="journal"><string-name><surname>Pozdniakov</surname>, <given-names>Konstantin</given-names></string-name> &amp; <string-name><surname>Segerer</surname>, <given-names>Guillaume</given-names></string-name>. <year>2007</year>. <article-title>Similar place avoidance: A statistical universal</article-title>. <source>Linguistic Typology</source> <volume>11</volume>(<issue>2</issue>). <fpage>307</fpage>&#8211;<lpage>348</lpage>. DOI: <pub-id pub-id-type="doi">10.1515/LINGTY.2007.025</pub-id></mixed-citation></ref>
<ref id="B62"><mixed-citation publication-type="journal"><string-name><surname>Prince</surname>, <given-names>Alan</given-names></string-name> &amp; <string-name><surname>Smolensky</surname>, <given-names>Paul</given-names></string-name>. <year>1997</year>. <article-title>Optimality: From neural networks to universal grammar</article-title>. <source>Science</source> <volume>275</volume>(<issue>5306</issue>). <fpage>1604</fpage>&#8211;<lpage>1610</lpage>. DOI: <pub-id pub-id-type="doi">10.1126/science.275.5306.1604</pub-id></mixed-citation></ref>
<ref id="B63"><mixed-citation publication-type="journal"><string-name><surname>Ringen</surname>, <given-names>Catherine O.</given-names></string-name> &amp; <string-name><surname>Vago</surname>, <given-names>Robert M.</given-names></string-name> <year>1998</year>. <article-title>Hungarian vowel harmony in Optimality Theory</article-title>. <source>Phonology</source> <volume>15</volume>(<issue>3</issue>). <fpage>393</fpage>&#8211;<lpage>416</lpage>. DOI: <pub-id pub-id-type="doi">10.1017/S0952675799003632</pub-id></mixed-citation></ref>
<ref id="B64"><mixed-citation publication-type="book"><string-name><surname>Rose</surname>, <given-names>Sharon</given-names></string-name> &amp; <string-name><surname>Walker</surname>, <given-names>Rachel</given-names></string-name>. <year>2011</year>. <chapter-title>Harmony systems</chapter-title>. In <string-name><surname>Goldsmith</surname>, <given-names>John A.</given-names></string-name> &amp; <string-name><surname>Riggle</surname>, <given-names>Jason</given-names></string-name> &amp; <string-name><surname>Yu</surname>, <given-names>Alan C. L.</given-names></string-name> (eds.), <source>The handbook of phonological theory</source>, <fpage>240</fpage>&#8211;<lpage>290</lpage>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Wiley-Blackwell</publisher-name> <edition>2nd</edition> edn. DOI: <pub-id pub-id-type="doi">10.1002/9781444343069.ch8</pub-id></mixed-citation></ref>
<ref id="B65"><mixed-citation publication-type="book"><string-name><surname>Scannell</surname>, <given-names>Kevin P.</given-names></string-name> <year>2007</year>. <chapter-title>The Cr&#250;bad&#225;n project: Corpus building for under-resourced languages</chapter-title>. In <source>Building and Exploring Web Corpora (WAC3-2007): Proceedings of the 3rd Web as Corpus Workshop, Incorporating Cleaneval</source>, vol. <volume>4</volume>. <elocation-id>5</elocation-id>. <publisher-name>Presses univ. de Louvain</publisher-name>.</mixed-citation></ref>
<ref id="B66"><mixed-citation publication-type="journal"><string-name><surname>Sol&#225;-Llonch</surname>, <given-names>Elizabeth</given-names></string-name> &amp; <string-name><surname>Sundara</surname>, <given-names>Megha</given-names></string-name>. <year>2025</year>. <article-title>Young infants&#8217; sensitivity to precursors of vowel harmony is independent of language experience</article-title>. <source>Infant Behavior and Development</source> <volume>78</volume>. <elocation-id>102032</elocation-id>. DOI: <pub-id pub-id-type="doi">10.1016/j.infbeh.2025.102032</pub-id></mixed-citation></ref>
<ref id="B67"><mixed-citation publication-type="book"><string-name><surname>Stanton</surname>, <given-names>Juliet</given-names></string-name>. <year>2021</year>. <chapter-title>An identity preference in Ngbaka vowels</chapter-title>. In <source>Proceedings of the annual meetings on phonology</source>. <publisher-loc>Washington, DC</publisher-loc>: <publisher-name>Linguistic Society of America</publisher-name>. DOI: <pub-id pub-id-type="doi">10.3765/amp.v9i0.5151</pub-id></mixed-citation></ref>
<ref id="B68"><mixed-citation publication-type="journal"><string-name><surname>Suomi</surname>, <given-names>Kari</given-names></string-name>. <year>1983</year>. <article-title>Palatal vowel harmony: A perceptually motivated phenomenon?</article-title> <source>Nordic Journal of Linguistics</source> <volume>6</volume>(<issue>1</issue>). <fpage>1</fpage>&#8211;<lpage>35</lpage>. DOI: <pub-id pub-id-type="doi">10.1017/S0332586500000949</pub-id></mixed-citation></ref>
<ref id="B69"><mixed-citation publication-type="journal"><string-name><surname>Suomi</surname>, <given-names>Kari</given-names></string-name> &amp; <string-name><surname>McQueen</surname>, <given-names>James M.</given-names></string-name> &amp; <string-name><surname>Cutler</surname>, <given-names>Anne</given-names></string-name>. <year>1997</year>. <article-title>Vowel harmony and speech segmentation in Finnish</article-title>. <source>Journal of Memory and Language</source> <volume>36</volume>(<issue>3</issue>). <fpage>422</fpage>&#8211;<lpage>444</lpage>. DOI: <pub-id pub-id-type="doi">10.1006/jmla.1996.2495</pub-id></mixed-citation></ref>
<ref id="B70"><mixed-citation publication-type="journal"><string-name><surname>Tiedemann</surname>, <given-names>J&#246;rg</given-names></string-name>. <year>2016</year>. <article-title>Finding alternative translations in a large corpus of movie subtitle</article-title>. In <source>Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC&#8217;16)</source>. <fpage>3518</fpage>&#8211;<lpage>3522</lpage>.</mixed-citation></ref>
<ref id="B71"><mixed-citation publication-type="journal"><string-name><surname>Trott</surname>, <given-names>Sean</given-names></string-name> &amp; <string-name><surname>Bergen</surname>, <given-names>Benjamin</given-names></string-name>. <year>2020</year>. <article-title>Why do human languages have homophones?</article-title> <source>Cognition</source> <volume>205</volume>. <elocation-id>104449</elocation-id>. DOI: <pub-id pub-id-type="doi">10.1016/j.cognition.2020.104449</pub-id></mixed-citation></ref>
<ref id="B72"><mixed-citation publication-type="journal"><string-name><surname>van der Hulst</surname>, <given-names>Harry</given-names></string-name>. <year>2016</year>. <article-title>Vowel harmony</article-title>. In <source>Oxford research encyclopedia of linguistics</source>. DOI: <pub-id pub-id-type="doi">10.1093/acrefore/9780199384655.013.38</pub-id></mixed-citation></ref>
<ref id="B73"><mixed-citation publication-type="journal"><string-name><surname>Vehtari</surname>, <given-names>Aki</given-names></string-name> &amp; <string-name><surname>Gelman</surname>, <given-names>Andrew</given-names></string-name> &amp; <string-name><surname>Gabry</surname>, <given-names>Jonah</given-names></string-name>. <year>2017</year>. <article-title>Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC</article-title>. <source>Statistics and Computing</source> <volume>27</volume>. <fpage>1413</fpage>&#8211;<lpage>1432</lpage>. DOI: <pub-id pub-id-type="doi">10.1007/s11222-016-9696-4</pub-id></mixed-citation></ref>
<ref id="B74"><mixed-citation publication-type="journal"><string-name><surname>Vroomen</surname>, <given-names>Jean</given-names></string-name> &amp; <string-name><surname>Tuomainen</surname>, <given-names>Jyrki</given-names></string-name> &amp; <string-name><surname>de Gelder</surname>, <given-names>Beatrice</given-names></string-name>. <year>1998</year>. <article-title>The roles of word stress and vowel harmony in speech segmentation</article-title>. <source>Journal of Memory and Language</source> <volume>38</volume>(<issue>2</issue>). <fpage>133</fpage>&#8211;<lpage>149</lpage>. DOI: <pub-id pub-id-type="doi">10.1006/jmla.1997.2548</pub-id></mixed-citation></ref>
<ref id="B75"><mixed-citation publication-type="journal"><string-name><surname>Walter</surname>, <given-names>Mary Ann</given-names></string-name>. <year>2010</year>. <article-title>Harmony versus the OCP: Vowel and consonant cooccurrence in the lexicon</article-title>. <source>Laboratory Phonology</source> <volume>1</volume>(<issue>2</issue>). <fpage>395</fpage>&#8211;<lpage>413</lpage>. DOI: <pub-id pub-id-type="doi">10.1515/labphon.2010.020</pub-id></mixed-citation></ref>
<ref id="B76"><mixed-citation publication-type="book"><string-name><surname>Wayment</surname>, <given-names>Adam</given-names></string-name>. <year>2009</year>. <source>Assimilation as attraction: Computing distance, similarity, and locality in phonology</source>. Dissertation, <publisher-name>The Johns Hopkins University</publisher-name>.</mixed-citation></ref>
<ref id="B77"><mixed-citation publication-type="journal"><string-name><surname>Wedel</surname>, <given-names>Andrew</given-names></string-name> &amp; <string-name><surname>Jackson</surname>, <given-names>Scott</given-names></string-name> &amp; <string-name><surname>Kaplan</surname>, <given-names>Abby</given-names></string-name>. <year>2013</year>. <article-title>Functional load and the lexicon: Evidence that syntactic category and frequency relationships in minimal lemma pairs predict the loss of phoneme contrasts in language change</article-title>. <source>Language and Speech</source> <volume>56</volume>(<issue>3</issue>). <fpage>395</fpage>&#8211;<lpage>417</lpage>. DOI: <pub-id pub-id-type="doi">10.1177/0023830913489096</pub-id></mixed-citation></ref>
<ref id="B78"><mixed-citation publication-type="journal"><string-name><surname>Wilson</surname>, <given-names>Colin</given-names></string-name>. <year>2006</year>. <article-title>Learning phonology with substantive bias: An experimental and computational study of velar palatalization</article-title>. <source>Cognitive Science</source> <volume>30</volume>(<issue>5</issue>). <fpage>945</fpage>&#8211;<lpage>982</lpage>. DOI: <pub-id pub-id-type="doi">10.1207/s15516709cog0000_89</pub-id></mixed-citation></ref>
<ref id="B79"><mixed-citation publication-type="book"><string-name><surname>Wilson</surname>, <given-names>Colin</given-names></string-name> &amp; <string-name><surname>Obdeyn</surname>, <given-names>Marieke</given-names></string-name>. <year>2009</year>. <source>Simplifying subsidiary theory: Statistical evidence from Arabic, Muna, Shona, and Wargamay</source>. <publisher-name>Ms, Johns Hopkins University</publisher-name>.</mixed-citation></ref>
<ref id="B80"><mixed-citation publication-type="journal"><string-name><surname>Yin</surname>, <given-names>Sora Heng</given-names></string-name> &amp; <string-name><surname>White</surname>, <given-names>James</given-names></string-name>. <year>2018</year>. <article-title>Neutralization and homophony avoidance in phonological learning</article-title>. <source>Cognition</source> <volume>179</volume>. <fpage>89</fpage>&#8211;<lpage>101</lpage>. DOI: <pub-id pub-id-type="doi">10.1016/j.cognition.2018.05.023</pub-id></mixed-citation></ref>
<ref id="B81"><mixed-citation publication-type="journal"><string-name><surname>Zipf</surname>, <given-names>George Kingsley</given-names></string-name>. <year>1945</year>. <article-title>The meaning-frequency relationship of words</article-title>. <source>The Journal of General Psychology</source> <volume>33</volume>(<issue>2</issue>). <fpage>251</fpage>&#8211;<lpage>256</lpage>. DOI: <pub-id pub-id-type="doi">10.1080/00221309.1945.10544509</pub-id></mixed-citation></ref>
<ref id="B82"><mixed-citation publication-type="thesis"><string-name><surname>Zuraw</surname>, <given-names>Kie</given-names></string-name>. <year>2000</year>. <source>Patterned exceptions in phonology</source>. Ph.D. dissertation, <publisher-name>University of California</publisher-name>, <publisher-loc>Los Angeles</publisher-loc>.</mixed-citation></ref>
</ref-list>
</back>
</article>