Cumulative markedness effects and (non-)linearity in phonotactics

How do grammars assess the well-formedness of words with multiple phonotactic violations? Certain models predict that as the strength of phonotactic restrictions decreases, forms that violate multiple restrictions should be less acceptable than expected, in a pattern we term super-linear cumulativity. We test this prediction using a series of Artificial Grammar Learning experiments, in which we vary the number of exceptions to phonotactic patterns in artificial languages. We find that super-linear cumulativity is indeed observed in the conditions with the weakest restrictions. Strikingly, participants exhibit super-linear cumulativity even when the trained language does not contain evidence for it.


Introduction
This paper addresses the relationship between the strength of phonotactic constraints and the way in which multiple coincident violations of such constraints interact in the grammar. Some grammatical approaches predict that violations simply stack up to yield a penalty that is the sum of the component penalties. Other approaches predict that forms with multiple violations are better or worse than would be obtained by adding the individual penalties, and indeed, cases of this sort have been observed in lexical counts and experimental results. As we will demonstrate, in some grammatical approaches, the predicted size of the penalty varies depending on the strength of the restrictions involved. We investigate whether there is a causal relationship between the strength of a given phonotactic restriction and how it combines with other restrictions in the grammar.
Using an Artificial Grammar Learning (AGL) paradigm, we find that as we decrease the strength of phonotactic restrictions by introducing exceptions, we observe an increasing penalty for multiple violations beyond the simple combination of the independent penalties. That is, participants' acceptability ratings for doubly-marked forms are lower than what is obtained by adding up the independent penalties in acceptability for each of those forms' individual violations. We argue that this supports a grammatical model in which the degree of penalty assigned to multiple constraint violations is a deterministic function of the weights of the constraints involved. We discuss the implications of this model for theories of phonotactics, and the contents of the constraint set.

Linear, super-linear, and sub-linear cumulativity
We begin by laying out some terminology in order to state our hypothesis as precisely as possible.
The broad domain of inquiry is about the acceptability of words that contain multiple marked structures. This contains an empirical question (how does the acceptability of multiply marked words relate to that of singly marked words), and a theoretical question (how do grammatical models combine violations to compute an overall grammaticality).
Empirically, the question is how decomposable acceptability judgments of strings are into separate components. A natural default assumption is that if a word has two dispreferred substrings (i.e., two Markedness violations), each contributes its own penalty independently, so the doubly-marked form is exactly as unacceptable or improbable as one would expect based on its individual violations. There are various ways of computing such an expectation. In this section, we focus on expectations implemented in terms of probability, because several current grammatical formalisms generate probability distributions over outputs. If a model is able to predict the probability of a form with a single Markedness violation, then a word with two Markedness violations would have a probability equal to the joint probability of the two Markedness violations.
The joint probability of two violations is equal to the product of the independent probabilities of those violations, or, in log-space, their sum. We use the term linear to refer to the situation where the Markedness violations of a string all affect the outcome independently. The assumption of linear interactions is seen, for example, in how the "Expected" values in Observed/Expected counts are typically calculated (Frisch et al. 2004; Wilson & Obdeyn 2009), and also in how n-gram models combine probabilities of each successive n-gram (Jurafsky & Martin 2009: chapter 4). Weighted constraint models such as Harmonic Grammar (Legendre et al. 1990) and MaxEnt (Smolensky 1986; Goldwater & Johnson 2003) also calculate the Harmony of a candidate as the linear sum of its weighted violations. However, this alone does not guarantee that we will observe linear interactions empirically, since the way that the acceptability or probability of a form is determined from its Harmony in these frameworks may make the actual acceptability or probability of a doubly-marked form higher or lower than the joint probability of its parts (more on this below).
With this definition of linearity in hand, it is now straightforward to define deviations from linearity. Specifically, if the probability or acceptability of a multiply-marked form is lower than expected based on the independent probability or acceptability of its parts, we follow Smith & Pater (2020) in calling this a super-linear interaction.¹ Similarly, we can say that if the probability or acceptability of a multiply-marked form is higher than expected, it is a sub-linear interaction.
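The definitions above can be made concrete with a short sketch. The function names and the probabilities used in the usage lines are hypothetical, chosen only for illustration; the sketch assumes that the linear expectation for a doubly-marked form is the product of the probabilities of the two singly-marked forms, as described in the text.

```python
import math


def linear_expectation(p_a: float, p_b: float) -> float:
    """Expected probability of a doubly-marked form if the two
    violations contribute independently (product of probabilities)."""
    return p_a * p_b


def log_linear_expectation(p_a: float, p_b: float) -> float:
    """The same expectation in log-space, where the product becomes a sum."""
    return math.log(p_a) + math.log(p_b)


def classify(p_double: float, p_a: float, p_b: float, tol: float = 1e-9) -> str:
    """Label an interaction by comparing the doubly-marked form's
    probability with the linear expectation."""
    expected = linear_expectation(p_a, p_b)
    if p_double < expected - tol:
        return "super-linear"  # larger penalty than expected
    if p_double > expected + tol:
        return "sub-linear"    # smaller penalty than expected
    return "linear"


# Hypothetical probabilities, for illustration only.
print(classify(0.25, 0.5, 0.5))  # matches the product exactly
print(classify(0.10, 0.5, 0.5))  # lower than the product
print(classify(0.40, 0.5, 0.5))  # higher than the product
```

Note that the label "super-linear" attaches to the lower-than-expected probability, because it is the penalty that exceeds the linear expectation.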
On the theoretical side, linearity can also be a property of grammatical models. Here, it refers to how models combine different theoretical quantities to yield an overall grammaticality value. For example, as noted above, a model that adds weighted Markedness violations to yield a Harmony value is linear, in the sense that Harmony is decomposable into the component violations. For present purposes, we are not directly concerned with whether a given grammatical model is a linear model, though in practice, all of the models that we consider are. Rather, we are concerned with what models predict for the candidates' grammaticality-determined probability, as observed through acceptability judgments.

Evidence for cumulativity of violations
A growing body of evidence in the phonological literature supports the view that Markedness violations are cumulative: when speakers judge the well-formedness of a word, their judgement is not based on only the most marked structure it contains (as predicted by strict-ranking constraint-based models such as Optimality Theory (Prince & Smolensky 1993) and its variants). Rather, speakers attend to all relevant structures in a domain, and weight their importance according to their severity (as predicted by weighted-constraint models such as Harmonic Grammar (Legendre et al. 1990) and its variants). This aggregation of evidence across different structures was termed cumulativity by Jäger & Rosenbach (2006),² and is observed both in the probability of a given structure in the lexicon and in experimentally determined acceptability.
¹ The term super-linear is used even though the probability or acceptability is lower than expected, because the penalty is higher than expected under linear combination.
² Note that the distinction that Jäger & Rosenbach (2006) make between counting and ganging cumulativity is orthogonal to the current discussion of different degrees of cumulativity (linear, sub-linear, or super-linear).
Recent work has focused on how exactly the contributions to markedness from each of a number of structures are combined in the grammar. Specifically, there are some indications that the total markedness of a word containing multiple marked structures might not be accurately measured by the simple combination of the markedness of its parts. In lexical attestation, nonce word judgments, and phonological patterning, it has been observed that strings with two marked structures are sometimes penalized to a greater extent than obtained by adding up the markedness of each of the violations assessed alone, that is, super-linearly. An example of this type of cumulativity can be found in the lexicon of English: as part of a study of English monosyllable phonotactics, Albright (2012) found that 491 (8.2%) of monosyllables in the CELEX database (Baayen et al. 1996) had a stop+l onset, and 47 (3.2%) had an s+stop coda. However, the number of #stop+l… s+stop# words was lower than either of these, with only 7 occurrences (0.11%). In this instance, the cumulativity exhibited is super-linear in nature: the combination of independent probabilities of the marked syllable margins alone predicts that 8.2% × 3.2% = 0.26% of the monosyllables in the database (about 16 unique words) should exhibit both the marked onset and marked coda. Similar data in lexical studies have also been noted in Albright (2008), which finds that in Lakhota, multiple structures that are only moderately uncommon, such as consonant clusters and fricatives, co-occur in dramatically fewer roots than predicted by their joint probability. Also in this vein is a study by Yang et al. (2018), who carry out a comparison of English and Mandarin monosyllables and find that the attested monosyllabic lexicons are more well-formed than would be expected by the independent probabilities of their parts.
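The English example can be worked through as an Observed/Expected calculation. This is a minimal sketch using only the proportions and the observed count reported above; the lexicon size is not stated in the text, so it is estimated here from the reported count and percentage for stop+l onsets, which is an assumption. Variable names are illustrative.

```python
# Proportions and observed count as reported in the text (Albright 2012).
p_onset = 0.082        # proportion of monosyllables with a stop+l onset
p_coda = 0.032         # proportion of monosyllables with an s+stop coda
observed_both = 7      # monosyllables with both marked margins

# ASSUMPTION: the total number of monosyllables is not given in the text;
# it is estimated from the statement that 491 words correspond to 8.2%.
n_words = round(491 / p_onset)

# Linear expectation: the two margins are assumed to be independent.
expected_proportion = p_onset * p_coda
expected_count = expected_proportion * n_words

# Observed/Expected ratio; values below 1 indicate under-attestation.
oe_ratio = observed_both / expected_count

print(f"estimated lexicon size: {n_words}")
print(f"expected doubly-marked words: {expected_count:.1f}")
print(f"O/E ratio: {oe_ratio:.2f}")
```

An O/E ratio well below 1 for the doubly-marked shape is what the text describes as super-linear cumulativity in the lexicon.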
Although lexical statistics are often advanced as evidence of synchronic phonological knowledge, divergences between lexical statistics and productive grammatical knowledge are well-known (Becker et al. 2011; Hayes & White 2013, among others). Indeed, Frisch (1996), Martin (2007; 2011), and Beguš (2018) highlight how the phonotactic structure of the lexicon can change over time so as to favor well-formed words at the expense of marked forms as part of a self-amplifying feedback cycle with basic properties of the synchronic phonological grammar.
Thus, simply observing that a generalization holds of a language's lexicon does not necessarily imply that it enjoys a cognitively real status in the synchronic grammar of its speakers. It is therefore important to ask whether super-linear cumulativity is exhibited synchronically. Super-linear cumulativity has also been observed in nonce word judgments, though the data are relatively scarce. Albright (2012) replicated a nonword acceptability judgment task from Bailey & Hahn (2001) which asked subjects to rate the acceptability of novel English monosyllables containing onset clusters (e.g., [krɛn, draf]), coda clusters (e.g., [lɛsk, mɪsp]), or both (e.g., [drɪsp, krɛsk]). Albright then modeled whether the acceptability of the doubly-marked forms could be predicted solely on the basis of their constituent violations and found that it could not: doubly-marked forms such as [drɪsp] were rated less acceptable than predicted by the sum of their independent penalties.
Other cases of super-linearity have been documented in phonological alternations: for example, Smith & Pater (2020) note that super-linear behavior is observed in the interaction of deletion and epenthesis in the surface realization of French schwa. Green & Davis (2014) find that multiple optional syllable structure simplifications in colloquial Bamana are dramatically less likely to co-occur than expected given the product of the probability of each independent simplification process. Kim (2019), building on Kumagai (2017), demonstrates the cumulative effect of nasals on blocking rendaku, the inter-morpheme obstruent-voicing process in Japanese compounds, which also displays super-linear behavior. Kawahara & Kumagai (2021) re-examine the data on nasals with a better-controlled experiment, and do not replicate the finding of super-linearity in Kumagai (2017). However, they unexpectedly find that two approximants ([w] or [j]) in the second element of a compound do exert a blocking effect on rendaku that is dramatically stronger than that of a single approximant, again a case of super-linear cumulativity. Super-linear cumulativity has also been observed in the contribution of different phonological structures to the likelihood of belonging to a specific lexical class (Shih 2017).
At the same time, not all studies that have examined cumulativity have found it to be super-linear: Breiss (2020) tested for cumulativity in phonotactic markedness using an AGL paradigm, and found that, when trained on a language which conformed to two exceptionless phonotactics, participants judged words that violated both phonotactics as less well-formed than those which violated only one, again demonstrating cumulativity but without evidence of super-linearity.
Durvasula & Liter (2020) also used an AGL task to examine multiple concurrent phonological generalizations learned over representations of different grain-sizes, and also found results that are compatible with linear cumulativity. Moving beyond the domain of linguist-created languages, Kawahara & Breiss (2021) examined cumulativity in sound symbolism, and found that participants combined multiple phonological cues to the same sound-symbolic quality in a cumulative manner in the domain of Pokémon names (see also Kawahara & Moore 2021; Kawahara 2021). Pizzo (2015) found that English-speaking participants judged words which violated English syllable-margin phonotactics in one location (e.g., plavb, tlag) as less acceptable than one which violated none (plag), and crucially more acceptable than those which violated both (e.g., tlavb). Importantly, the penalty for doubly-marked forms in her data was not more than the expected value under linear cumulativity (though we return to these findings in more detail in section 6.3).
Summarizing the state of the literature on cumulativity reviewed above, we find that there are conflicting claims about the linearity of cumulative phonological interactions, and further there is a lack of clarity about which factor(s) might lead a given instance of cumulativity to be (non-)linear in the first place, since studies on the topic draw on acceptability judgements from both real and artificial languages, as well as studies of lexical attestation, the distribution of subclasses of forms within the lexicon, and factors influencing phonological alternations.

Deriving non-linear cumulativity with grammatical models
Grammatical models differ in whether they predict the existence of linear, super-linear and sub-linear effects. Optimality Theory (OT; Prince & Smolensky 1993) assumes strict constraint domination, and predicts no super-linear interactions. Categorical OT cannot derive probabilities other than 0 or 1 at all, and if a candidate contains two different intolerable (p = 0) violations, it will be eliminated by the higher ranked violation, with no additional cumulative effect of the lower-ranked violation; that is, only one violation contributes, but this is indistinguishable from the effect of two intolerable violations (probability of 0 is equivalent to probability 0 × probability 0) (see Coetzee 2004 for further discussion of grammaticality in categorical OT). Stochastic OT (Boersma et al. 1997; Boersma & Hayes 2001) can assign gradient probabilities, and Smith & Pater (2020) have shown that doubly-marked candidates may receive a probability that is not identical to the probability of its highest violation, but the interaction is always sub-linear, and never super-linear.
Weighted constraint models, by contrast, do not employ strict domination, and as mentioned above, all of the weighted violations in a form are summed to compute the Harmony of a candidate. Whether or not adding multiple Markedness violations leads to linear or super-linear interactions depends on how acceptability or probability is then determined, based on the Harmony of the candidates. In Harmonic Grammar (Legendre et al. 1990), the candidate with the best Harmony is chosen as the categorical winner, with the consequence that a single intolerable violation is all that matters in eliminating forms, as in categorical Optimality Theory. Noisy Harmonic Grammar assigns probabilities much like Stochastic OT by imposing noise on Harmony values, and the predictions for how this affects probability depend on implementational details of how noise is added (Hayes 2017; Zuraw & Hayes 2017; Flemming 2021). This has the potential to derive not only sub-linear and linear cumulativity, but also super-linear cumulativity under certain circumstances (Smith & Pater 2020 and others).
Maximum Entropy (MaxEnt) models (Smolensky 1986; Goldwater & Johnson 2003) have the potential to derive a wider range of non-linear interactions. In MaxEnt models, the probability of a candidate is derived from the Harmony via a non-linear transformation: exp(Harmony) (for details see Jurafsky & Martin 2009: chapter 5). Whether or not this yields super-linear interactions depends on certain assumptions about the candidate set, and how Markedness and Faithfulness constraints interact (Pater 2009b). The tableaux in Table 1 illustrate one way in which the probability of a doubly-marked form may come to be less than the product of the probability of individual violations (super-linearity). In these tableaux, we assume that the fully faithful form competes with a single "Null Parse" candidate, represented as [⊙], which represents the choice not to produce the form (Prince & Smolensky 1993, p. 51; Wolf & McCarthy 2010).³ The Null Parse violates a single constraint, MParse. The Harmony (H) of a candidate is the negated weighted sum of its violations, and the probability is exp(H) divided by the summed exponentiated Harmony for all candidates. The Markedness constraints Agree[±back] and Agree[±nasal] demand that adjacent vowels have the same value for backness, and adjacent consonants have the same value for nasality, respectively. The tableaux show that if MParse is assigned a weight of 5 and the Agree constraints are assigned weights of 3, the probability of the doubly-marked form [poni], which violates both Agree[±back] and Agree[±nasal], is only .27, which is far lower than the product of the probabilities of the independent violations in [poti] and [ponu] (.88² = .78). In a MaxEnt model that uses the Null Parse in this way, whether or not a cumulative interaction is expected to be super-linear, linear, or even sub-linear depends on the strengths of the restrictions (cf. Smith & Pater 2020: p. 23).
³ When the competition is defined as a two-way choice between the faithful output and the Null Parse, we avoid the "trading off" relations between Markedness and Faithfulness constraints observed by Pater (2009b), thus permitting a wider range of super-linear interactions. Note that the same type of effect can be observed in the interaction of multiple Markedness constraints with a single Faithfulness constraint, as in Smith & Pater (2020)'s analysis of French schwa epenthesis and deletion.
In the example in Table 1, the restrictions against disharmonic forms are, qualitatively speaking, relatively weak, and super-linear cumulativity is predicted. Compare this behavior with the example in Table 2, where the same restrictions are stronger, reflected in the lower weight of MParse relative to the Markedness constraints. Here, we find a less obvious degree of super-linear cumulativity, since the probability assigned to a single violation is already low (.17), and the joint probability of two independent violations (.03) is scarcely different from the predicted probability of a doubly-marked form (.01). Floor effects of this type are not the only circumstance in which this model can predict linear cumulativity, but this example is chosen to resemble the exceptionless phonotactic restrictions in the Breiss (2020) experiment, which failed to detect super-linear cumulativity.
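The two-candidate calculation described for Table 1 can be sketched as follows. The function name is illustrative, and the sketch assumes, as in the text, that both Agree constraints carry the same weight and that the only competitor is the Null Parse. The weights for Table 2 are not recoverable from the text, so the values used for it below (MParse = 1.4, Agree = 3) are hypothetical, picked only because they yield the reported probabilities.

```python
import math


def p_faithful(n_violations: int, w_markedness: float, w_mparse: float) -> float:
    """Probability of the faithful candidate when it competes only with
    the Null Parse. Harmony is the negated weighted sum of violations."""
    h_faithful = -w_markedness * n_violations  # one Agree violation each
    h_null = -w_mparse                         # Null Parse violates MParse once
    z = math.exp(h_faithful) + math.exp(h_null)
    return math.exp(h_faithful) / z


# Table 1 weights as stated in the text: MParse = 5, Agree constraints = 3.
p_single = p_faithful(1, w_markedness=3, w_mparse=5)  # [poti] or [ponu]
p_double = p_faithful(2, w_markedness=3, w_mparse=5)  # [poni]
print(f"single: {p_single:.2f}  double: {p_double:.2f}  "
      f"linear expectation: {p_single ** 2:.2f}")

# ASSUMPTION: hypothetical weights standing in for Table 2 (not given in the text).
q_single = p_faithful(1, w_markedness=3, w_mparse=1.4)
q_double = p_faithful(2, w_markedness=3, w_mparse=1.4)
print(f"single: {q_single:.2f}  double: {q_double:.2f}  "
      f"linear expectation: {q_single ** 2:.2f}")
```

Under the first set of weights the doubly-marked form falls far below the product of the singly-marked probabilities; under the second, all three values are already close to zero, which is the floor effect the text describes.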
In this framework, it is also possible to derive sub-linear cumulativity under certain weighting conditions. For example, as shown in Table 3, if the weight of MParse is 1.4 and the weights of the Markedness constraints are .2, the predicted probability of a doubly-marked form (.73) is actually greater than the joint probability of two independent violations (.77² = .59). We return to the issue of sub-linear cumulativity in section 6.3.
The preceding examples show that the MaxEnt approach with a Null Parse candidate has the expressive power to capture various types of linear, super-linear, and sub-linear cumulativity. The approach is constrained, however: it is not able to capture any arbitrary interaction, but rather, the degree of (non-)linearity emerges as a by-product of the strength of the restrictions involved, and the
In an experimental manipulation, we cannot vary the weights that learners assign to markedness and MParse directly, but rather, we vary how strongly the markedness restriction is enforced. The goal of this study is to test the prediction that the degree of linearity in the cumulative interaction of two constraints depends on the strength of the restrictions involved. Note that since we do not have any way to derive expectations about the absolute weights of constraints in the learned grammar, we do not make a specific prediction about the amount of non-linearity that should be introduced by a particular manipulation of the strength of a restriction. We do expect that by exposing learners to languages with varying strengths of phonotactic restriction, we should observe different points along a single vertical "slice" of Figure 2, with the concomitant shift between linear and non-linear cumulativity. Furthermore, for a large portion of weight space, the model predicts that as markedness restrictions get weaker (from bottom to top of the plot), their predicted interaction shifts from linear to super-linear. In what follows, we will first test whether speakers exhibit super-linear cumulativity as phonotactic restrictions get weaker. We then test whether learners infer super-linear cumulativity as a function of the strength of the restrictions, even in the absence of overt evidence. A positive answer to both will support a theoretical device like the MaxEnt model illustrated here, in which super-linear cumulativity is an automatic consequence of the constraint weights. This finding also has the potential to shed light on the mixed empirical results in the literature summarised in section 6.3, in which both linear and super-linear cumulativity have been observed.
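The dependence of (non-)linearity on constraint strength can be explored with a small sweep over Markedness weights. This is a sketch under stated assumptions: MParse is held at 5 (an arbitrary choice for illustration), both Markedness constraints share one weight, and the "gap" measure (linear expectation minus predicted probability) is an illustrative statistic, not necessarily the quantity plotted in Figure 2. The Table 3 weights (MParse = 1.4, Markedness = .2) come from the text.

```python
import math


def p_faithful(n_violations: int, w_markedness: float, w_mparse: float) -> float:
    """Faithful candidate's probability against the Null Parse.
    Equivalent to exp(H_faithful) / (exp(H_faithful) + exp(H_null))."""
    return 1.0 / (1.0 + math.exp(n_violations * w_markedness - w_mparse))


def cumulativity_gap(w_markedness: float, w_mparse: float) -> float:
    """Linear expectation minus the predicted probability of the
    doubly-marked form. Positive: super-linear; negative: sub-linear;
    near zero: approximately linear."""
    p_single = p_faithful(1, w_markedness, w_mparse)
    p_double = p_faithful(2, w_markedness, w_mparse)
    return p_single ** 2 - p_double


# Table 3 weights from the text: a sub-linear case.
print(f"Table 3 gap: {cumulativity_gap(0.2, 1.4):+.2f}")

# ASSUMPTION: MParse fixed at 5; Markedness weight swept from strong to weak.
for w in (8.0, 6.0, 4.0, 3.0, 2.0, 1.0, 0.5):
    print(f"Markedness weight {w:>4}: gap = {cumulativity_gap(w, 5.0):+.3f}")
```

In this sweep, very strong restrictions give a gap near zero because all probabilities sit at the floor, intermediate weights give a large positive gap, and very weak restrictions drift toward a small negative gap.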

Testing for non-linear cumulativity
In this study, we use an AGL task to test whether we can observe non-linear interactions between phonotactic restrictions synchronically in speaker judgements. AGL tasks allow the or not, to test whether learners infer them even in the absence of overt evidence. This approach allows us to make controlled comparisons in a way that is impossible with natural languages.
Ultimately, though, we believe that whatever results we observe here should also be confirmed by studies of speakers' intuitions about how phonotactic restrictions in their native language interact.
Our strategy (following a design employed by Breiss 2020) is to create languages in which two distinct Markedness constraints hold: backness harmony between vowels, and nasal harmony between consonants. This combination of phonotactic restrictions is useful in probing super-linear cumulativity, because they are orthogonal: simultaneous violations of backness and nasal harmony (e.g., [poni]) do not create violations of any other known constraint (see section 6.1 for further discussion). In each language, the constraints are enforced with a specific strength, meaning we manipulate the percentage of words that violate them. Participants were trained on mini-lexicons, and then asked to rate novel items that violated neither, one, or both Markedness constraints. What we are interested in measuring is the penalty for doubly-marked forms relative to the singly-marked ones, as modulated by the strength of the phonotactic restrictions.
At this point it is important to note that, just as we cannot experimentally observe and manipulate the weights in a speaker's grammar, we likewise cannot directly observe the probabilities that the grammar assigns. In general, we assume that grammars assign grammaticality values, which are used to judge the acceptability of linguistic expressions, which in turn guides responses in experimental tasks. The MaxEnt grammar that we employ assigns probabilities to competing candidates. However, experiments do not measure probabilities of candidates directly, but rather, probabilities of responses in a task. For this reason, the relation between grammatical probability and experimentally obtained measurements is necessarily indirect. We seek an experimental effect that bears the hallmarks of the expected grammatical effect. Specifically, we seek an experimental response that allows us to quantify the penalties for forms with individual markedness violations, and use these to predict responses for forms with multiple violations. The expected grammatical effect is that multiply marked forms should be judged worse than expected, based on their individual violations. In the experiments reported here, we have chosen a ratings task as a first way to explore this prediction. Ratings tasks allow us to quantify the penalty for individual violations, by comparing ratings for forms with zero vs. one violation. As described below in section 4, we use linear modeling to predict ratings for doubly marked forms, and we test whether participants' ratings are lower than expected.
Although the computation of expected values in the linear model is mathematically different from the computation of probabilities in the grammatical model, we believe that observing such an effect in ratings is a good first step in testing the super-linearity prediction of the grammatical model.
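The logic of predicting doubly-marked ratings from single-violation penalties can be illustrated with simple cell-mean arithmetic. This is only a sketch: the statistical models actually used are described in section 4, and every rating below is a hypothetical number invented for illustration, not a result from the experiments.

```python
def expected_double_rating(r_none: float, r_nasal: float, r_back: float) -> float:
    """Rating expected for a doubly-marked form if the two single-violation
    penalties simply add (linear cumulativity on the rating scale)."""
    penalty_nasal = r_none - r_nasal   # cost of a nasal harmony violation
    penalty_back = r_none - r_back     # cost of a backness harmony violation
    return r_none - penalty_nasal - penalty_back


def interaction(r_none: float, r_nasal: float, r_back: float, r_double: float) -> float:
    """Observed minus expected rating for doubly-marked forms.
    Negative: super-linear; positive: sub-linear; zero: linear."""
    return r_double - expected_double_rating(r_none, r_nasal, r_back)


# HYPOTHETICAL mean ratings on the 0-100 scale, for illustration only.
r_none, r_nasal, r_back, r_double = 70.0, 55.0, 60.0, 35.0

print(expected_double_rating(r_none, r_nasal, r_back))  # additive prediction
print(interaction(r_none, r_nasal, r_back, r_double))   # deviation from it
```

In a linear model with two binary violation predictors, this deviation corresponds to the coefficient on their interaction term.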
In Experiment 1, we begin by manipulating the number of exceptions to the two phonotactic restrictions. In this experiment, participants are trained on a lexicon that largely conforms to backness and nasal harmony, but has a certain number of exceptions to each independently (depending on the Condition). Doubly-marked forms that violate both backness and nasal harmony are withheld in training, and we then test whether participants rate them exactly as predicted given their judgments about single violations (linear cumulativity), or whether they are rated better/worse (sub-/super-linear cumulativity). At a basic level, this experiment tests whether speakers show non-linear cumulativity in how they enforce restrictions synchronically. It also tests whether the degree of non-linearity depends on the strength of the phonotactics.
The design of Experiment 1 leaves open the possibility that participants exhibit super-linear cumulativity precisely because the doubly-marked forms were absent (withheld). Therefore, in Experiment 2, we test whether participants still exhibit super-linear cumulativity, even when the training language contains exactly as many doubly-marked forms as expected under linear cumulativity. This tests not only whether speakers are able to represent super-linear cumulativity, but also whether they are compelled to do so, even when such forms are not actually underrepresented.
We will see that speakers do in fact infer super-linear cumulativity, even when it is not present in the training data.

Experiment 1
This experiment tests the relation between the strength of phonotactic restrictions and the type of cumulativity that they produce. The design described in this section was also employed in Breiss (2020), and the results of Experiment 3b of Breiss (2020) are included as Condition A.

Stimuli
The exposure phase contained 32 unique CVCV, initially-stressed nonwords, with consonants ∈ {/p, t, m, n/} and vowels ∈ {/i, e, u, o/}. As noted above, one of the two phonotactics was a requirement that consonants harmonize with respect to the feature [nasal], such that both consonants in the word were drawn from either {/p, t/} or {/m, n/} (exhibiting nasal harmony).
The other phonotactic required that vowels harmonize with respect to the feature [back], such that both vowels in the word were drawn from either {/i, e/} or {/u, o/} (backness harmony). For more on these types of consonant and vowel harmony respectively, see Hansson (2010) and Walker (2011).
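The two harmony requirements can be expressed as simple checks over CVCV strings. The sketch below enumerates the full space of logically possible CVCV forms over the stated inventory; it does not reproduce the actual 32 training items, which are not listed here, and the function names are illustrative.

```python
from collections import Counter
from itertools import product

ORAL, NASAL = {"p", "t"}, {"m", "n"}
FRONT, BACK = {"i", "e"}, {"u", "o"}
CONSONANTS = sorted(ORAL | NASAL)
VOWELS = sorted(FRONT | BACK)


def violates_nasal_harmony(word: str) -> bool:
    """True if the two consonants of a CVCV word disagree in nasality."""
    c1, c2 = word[0], word[2]
    return (c1 in NASAL) != (c2 in NASAL)


def violates_backness_harmony(word: str) -> bool:
    """True if the two vowels of a CVCV word disagree in backness."""
    v1, v2 = word[1], word[3]
    return (v1 in BACK) != (v2 in BACK)


def n_violations(word: str) -> int:
    """Number of harmony restrictions (0, 1, or 2) that a word violates."""
    return int(violates_nasal_harmony(word)) + int(violates_backness_harmony(word))


# Every logically possible CVCV form over the inventory (4 * 4 * 4 * 4 = 256).
all_words = ["".join(segs) for segs in product(CONSONANTS, VOWELS, CONSONANTS, VOWELS)]
counts = Counter(n_violations(w) for w in all_words)

print(counts)
for w in ("totu", "poti", "ponu", "poni"):
    print(w, n_violations(w))
```

The forms [poti] and [ponu] each violate one restriction and [poni] violates both, matching the examples used in the tableaux discussion.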
Five distinct training Conditions (A-E) were distinguished by the number of items that violated each of the phonotactic patterns in the language: 0%, 6.25%, 12.5%, 18.75% or 25%.
There were no training items which violated both phonotactics at once, so even in the most exceptionful Condition (Condition E) each phonotactic received support from 75% of the words in the training phase. The verification phase used 16 pairs of minimally-differing nonwords: one member of each pair was a fully-conforming word from the exposure phase, and the other was created by reversing the featural specification for backness (and rounding) or nasality of one of the consonants or vowels in the fully-conforming word. This yielded a pair of words differing only in a single instance of that phoneme. Eight pairs differed in a violation of nasal harmony, and eight in a violation of backness harmony, with differences between pair-members balanced for segmental placement and identity. Verification pairs were balanced so that when a fully-conforming verification word had identical consonants (e.g., totu), it differed only in the violation of backness harmony (e.g., totu vs. toti). The same condition was imposed on verification trials whose conforming word contained identical vowels. There were no doubly-violating words in the verification phase, since its purpose was simply to ensure that participants had learned each of the two phonotactic constraints independently.
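The exception rates for the five training Conditions translate into word counts as sketched below. Two assumptions are made here and are not stated explicitly in the text: that Conditions A through E correspond to the rates in ascending order, and that each rate applies to each phonotactic separately out of the 32 training words. The second assumption is consistent with the statement that each phonotactic is supported by 75% of words in Condition E.

```python
N_TRAINING_WORDS = 32

# ASSUMPTION: Conditions A-E map onto the exception rates in ascending order.
EXCEPTION_RATES = {"A": 0.0, "B": 0.0625, "C": 0.125, "D": 0.1875, "E": 0.25}


def condition_summary(rate: float, n_words: int = N_TRAINING_WORDS):
    """Return (exceptions per phonotactic, fully conforming words,
    proportion of words supporting each phonotactic).

    No training word violates both restrictions, so the exceptions to the
    two phonotactics are disjoint sets of words."""
    per_phonotactic = round(rate * n_words)
    fully_conforming = n_words - 2 * per_phonotactic
    support = (n_words - per_phonotactic) / n_words
    return per_phonotactic, fully_conforming, support


for name, rate in EXCEPTION_RATES.items():
    exceptions, conforming, support = condition_summary(rate)
    print(f"Condition {name}: {exceptions} exceptions per phonotactic, "
          f"{conforming} fully conforming, support = {support:.2%}")
```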
The test phase used a set of 48 novel nonwords which varied in conformity to both phonotactics.
All words were recorded in a sound-attenuated room by a phonetically trained female native English speaker using PCQuirer. They were digitized at 44,100 Hz and normalized for amplitude to 70 dB.

Design
Participants were assigned to one of the five Conditions, and learned the language by listening to a continuous speech stream containing 20 randomized repetitions of the 32 words selected for that particular training phase. After exposure, participants completed 16 self-paced two-alternative forced choice verification trials. Participants were allowed to advance to the generalization phase if they learned each of the phonotactics to a nonsignificantly-different degree. This was operationalized by imposing a condition that the difference in number of correct answers between pairs differing only in a nasal harmony violation and those differing only in a backness harmony violation was not allowed to be greater than 3, chosen by using Fisher's exact test (Fisher 1934) to determine the level at which the proportion of correct answers for each phonotactic significantly differed, across the range of possible accuracies. If participants did not meet the criterion after two exposure blocks (one initial and one after failing to meet the criterion during the verification phase), they were simply asked to complete the final demographic questionnaire and did not generate data in the generalization phase (although we will see in section 4.1.4 that no participants were excluded for this reason).
If participants met the criterion on the verification phase, they advanced to a generalization phase which consisted of a ratings task containing 48 novel words in which participants were asked to rate each of the words on a scale from 0 (very bad) to 100 (very good) based on how good they sounded as an example of the language they had learned during the exposure phase. At the end of the experiment, demographic and language-background information was collected. The entire experiment lasted approximately 20-30 minutes, depending on the number of additional exposure blocks each participant required.
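The verification criterion can be explored with a small pure-Python version of Fisher's exact test. The text does not say whether the test was one- or two-sided or which significance level was used; the sketch assumes a one-sided test at α = .05 and eight trials per phonotactic, and under those assumptions it arrives at the same cutoff of 3. Function names are illustrative.

```python
from math import comb

N_TRIALS = 8   # verification pairs per phonotactic
ALPHA = 0.05   # ASSUMPTION: significance level is not stated in the text


def fisher_one_sided(correct_a: int, correct_b: int, n: int = N_TRIALS) -> float:
    """One-sided Fisher's exact p-value (ASSUMPTION: sidedness is not stated).

    With the total number of correct answers held fixed, this is the
    probability that phonotactic A receives at least `correct_a` of them."""
    total_correct = correct_a + correct_b
    denom = comb(2 * n, total_correct)
    upper = min(n, total_correct)
    tail = sum(comb(n, k) * comb(n, total_correct - k)
               for k in range(correct_a, upper + 1))
    return tail / denom


def smallest_p_for_difference(diff: int, n: int = N_TRIALS) -> float:
    """Smallest p-value over all accuracy levels with the given difference."""
    return min(fisher_one_sided(b + diff, b, n) for b in range(0, n - diff + 1))


# Largest difference in correct answers that never reaches significance.
max_allowed = max(d for d in range(N_TRIALS + 1)
                  if smallest_p_for_difference(d) >= ALPHA)

for d in range(N_TRIALS + 1):
    print(f"difference {d}: smallest p = {smallest_p_for_difference(d):.4f}")
print("largest non-significant difference:", max_allowed)
```

A two-sided test at the same α would give a more lenient cutoff, so the match with the stated criterion depends on the one-sided assumption.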

Procedure
The experiment was conducted in a sound-attenuated room using a modified version of the Experigen platform (Becker & Levine 2020). At the start of the experiment, participants were informed that they would first be learning a new language, and that they then would be tested on their knowledge of that language. During the exposure phase, participants were instructed to simply sit and listen to the speech stream and, if they felt themselves getting bored, to try to count how many unique words they could find in the speech stream (this task was suggested simply to encourage participants to attend to the speech stream). The exposure phase lasted about ten minutes.
Following the exposure phase, participants completed a self-paced verification phase. On each verification trial participants were played a pair of nonwords in a random order, and were instructed to choose the one that sounded like it could belong to the language they had learned. The generalization phase followed a similar structure, except that each trial contained a single novel nonword to which participants assigned a numerical rating. After completing the generalization phase (or after failure to meet the criterion during the verification phase), participants completed a brief demographic questionnaire.

Participants
375 undergraduate students were recruited from the SONA Psychology subject pool at the University of California, Los Angeles, and were compensated with course credit. Participants' data were excluded if they failed to meet the criterion for sufficient learning as assessed during the verification phase (n = 0; see section 4.1.2 for details), if they had not spoken English consistently in some context (home, school, etc.) since early childhood (n = 43), or in the case of experimenter error (n = 3), leaving data from 329 participants included in the final analysis.

Results
The results from the generalization phase are plotted in Figure 3. As anticipated, stimuli that conformed to both restrictions received the highest ratings, stimuli that violated both restrictions received the lowest ratings, and stimuli that violated only one of the two restrictions received
Note that this difference emerged in spite of the fact that, to a first approximation, participants indicated comparable levels of sensitivity to violations of nasal and backness harmony in the verification phase. There are several possible sources of this discrepancy. First, the criterion for comparable accuracy in the verification phase was a difference of 3 responses or fewer, which translates to a difference of up to ∼19%; thus, participants may have learned nasal harmony more strongly and still passed the verification phase. Second, the verification phase involved trained items, whereas the generalization phase involved novel items, so it is conceivable that participants used memory to perform better on backness harmony in the verification phase than in the generalization phase. Finally, it is conceivable that the discrepancy reflects a difference in either the sensitivity of the measures or the strategy that participants used to complete the verification vs. generalization tasks.
In addition to an overall difference between nasal and backness harmony, the interaction with Condition raises the question of whether the learning of backness harmony was impeded by exceptions in a way that the learning of nasal harmony was not. We can address this in a preliminary way by examining participants' performance in the verification phase. We calculated each participant's nasal advantage score, a measure ranging between -3 and 3 which corresponded to the difference between the number of correct answers (out of 8) that participant gave on questions testing backness vs. nasal harmony in the verification phase. A positive score indicates that a participant got more correct answers on the nasal-harmony-assessing questions (ex., potu vs. ponu) than on backness-harmony-assessing questions (potu vs. poti), and a negative score
We are now in a position to assess how the cumulative interaction of nasal and backness harmony varied across Conditions. This interaction is seen in the re-plotted data in Figure 5, which shows a gradual divergence in the slopes of the two lines representing the effect of nasal harmony, in the presence or absence of backness harmony violations. Recall from section 2.1 that we define linear cumulativity as the scenario where each Markedness violation has its own independent effect on the well-formedness of a form, independent of any other violations present. Conversely, non-linear cumulativity means that certain combinations of violations yield a greater reduction in well-formedness than we could deduce from the sum of their violations alone. Statistically speaking, this means that we first fit a model of participants' ratings, in which we attempt to predict a form's rating as a function of its (non-)conformity to backness and nasal harmony, independently. Specifically, we fit a linear mixed effects regression model using the lme4 package (Bates et al. 2015) in R (R Core Team 2021), modeling the ratings data from the generalization phase. In this model, each constraint violation constitutes a main effect, with the possibility that it may combine forces with another constraint violation (an interaction). Thus, the model included fixed effects for the two Markedness constraints: violation of vowel (backness) harmony (y/n, reference level = n) and violation of consonant (nasal) harmony (y/n, reference level = n). We can assess the linearity of constraint cumulativity by looking at whether the interaction between the markedness effects is significantly different from zero; the interaction term indicates the degree to which the rating is ill-formed above and beyond the contribution attributable to each of the component violations independently. Finally, we are interested in not only the cumulativity of any two violations per se but also the relationship between the strength of the individual constraints (Conditions A-E) and the cumulativity of those constraints. Therefore, we also included a continuous fixed effect corresponding to the percentage of exceptions to individual phonotactics in a given participant's training Condition. We are crucially interested in the three-way interaction between the two phonotactic violations and Condition: if it is significantly negative, that means that as the percentage of exceptions increases, the penalty for doubly-violating forms increasingly exceeds what can be accounted for based on the independent penalties for each violation. We will take such an interaction as initial support for a model that produces super-linear cumulativity.
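The logic of the two-way interaction term can be illustrated with a small worked example. The sketch below is not the mixed-effects model itself: it ignores random effects and Condition, and the four cell means are invented purely for illustration. It shows only that, with both predictors coded against a "no violation" reference level, the interaction corresponds to the gap between the observed rating of doubly-violating forms and the rating predicted by adding the two independent penalties.

```python
# Hypothetical mean ratings (0-100 scale) for the four word types.
# These numbers are invented for illustration, not taken from the study.
# Keys are (violates nasal harmony, violates backness harmony).
mean_rating = {
    ("no", "no"): 70.0,    # violates neither restriction
    ("yes", "no"): 50.0,   # violates nasal harmony only
    ("no", "yes"): 62.0,   # violates backness harmony only
    ("yes", "yes"): 30.0,  # violates both restrictions
}


def interaction(ratings):
    """Observed doubly-violating rating minus the additive prediction."""
    baseline = ratings[("no", "no")]
    nasal_penalty = ratings[("yes", "no")] - baseline
    backness_penalty = ratings[("no", "yes")] - baseline
    additive_prediction = baseline + nasal_penalty + backness_penalty
    return ratings[("yes", "yes")] - additive_prediction


# A value of zero corresponds to linear cumulativity; a negative value means
# doubly-violating forms are rated lower than the additive prediction.
print(interaction(mean_rating))
```

On this view, the three-way interaction with Condition asks whether this quantity becomes more negative as the percentage of exceptions in training increases.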
Recall from section 3 that this way of calculating deviations from expected grammaticality is not identical to the probability-based definition given in section 2.3, but what they have in common is that a response value (probability, rating) for doubly marked forms is lower than expectations based on singly marked forms.
Following Barr et al. (2013), we began by fitting a model with a maximally-specified random effect structure and simplified as necessary to achieve convergence.The final model contained the three-way interaction between the fixed effects outlined above, plus random intercepts for participant and nonword.
This model revealed that violating the nasal harmony phonotactic was associated with significantly lower ratings (β = -24.93, p < 0.001). The interaction between violation of nasal harmony and Condition was significant (β = 0.29, p < 0.001), indicating that as the percentage of forms violating the nasal harmony phonotactic in the training data increased, novel forms which violated this phonotactic were judged less ill-formed. The analogous main effect and interaction between violation of the backness harmony phonotactic and training group were also significant (main effect: β = -9.95, p = 0.015; interaction: β = 0.19, p < 0.001). There was also a significant main effect of training group, indicating that as the number of fully-conforming words heard in training decreased, fully-conforming words were judged less well-formed as a baseline (β = -0.18, p < 0.001). Critically, the three-way interaction between violation of nasal harmony, violation of backness harmony, and Condition was significant (β = -0.17, p < 0.002).
The negative coefficient indicates that as the percentage of nonconforming words in training increased, the difference between singly-marked and unmarked items decreased, while the relative markedness associated with the doubly-marked items remained approximately unchanged.

Local discussion
Experiment 1 found that speakers are able to represent super-linear patterns in their grammar, and that this super-linearity is related to the strength of the phonotactic restrictions involved.
We found that as the number of exceptions in the training increased, learners judged doubly-violating items as more and more ill-formed than one would expect, based on their judgements of singly-violating forms. These results are consistent with the proposed model that is able to represent super-linear cumulativity under particular weighting conditions.
Experiment 1 manipulated the number of forms that violated each phonotactic restriction; that is, we introduced violations of backness harmony (ex., poti) and of nasal harmony (ex., ponu). A by-product of this manipulation was that the Conditions also differed in the expected rate of doubly-violating forms, that is, forms that violated both backness and nasal harmony simultaneously, like poni. Recall that the expected rate of doubly-violating forms is the product of the probabilities of each individual violation. For Condition A, with zero exceptions, the rate of one violation is 0%, and the expected rate of two violations is 0% × 0% = 0%. For Condition E, on the other hand, the rate of single violations is 25%, and the expected rate of two violations is 25% × 25% = 6.25%, or 2 words in a lexicon of 32 words. However, such doubly-violating forms were withheld completely in training for all Conditions, since we were interested in testing participants' judgments about an untrained word type. This raises the possibility that learners were sensitive to the lack of doubly-violating forms, particularly in Conditions D and E, and used this to learn a grammar that specifically penalized them. The question, then, is whether learners in Experiment 1 noticed the one or two missing forms and used highly parameterized grammars to accommodate super-linearity, or whether they projected it as a by-product of enforcing the individual restrictions. We address this question in Experiment 2.
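The arithmetic behind the expected number of doubly-violating forms can be spelled out as a short sketch. It assumes, as in the text, that the two violations occur independently and at the same rate; the function name is ours and purely illustrative.

```python
def expected_double_violations(single_rate, lexicon_size):
    """Joint rate of two independent violations, and the implied word count."""
    joint_rate = single_rate * single_rate
    return joint_rate, joint_rate * lexicon_size


# Rates of single violations stated in the text for Conditions A and E,
# applied to a training lexicon of 32 words.
for condition, rate in [("A", 0.0), ("E", 0.25)]:
    joint_rate, n_words = expected_double_violations(rate, 32)
    print(condition, joint_rate, n_words)
```

For Condition E this gives a joint rate of 0.0625 and 2 expected words, which is the number of doubly-violating forms whose absence a learner would in principle have to notice.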

Experiment 2
We carried out a replication of Condition E from Experiment 1, except that the training data included two doubly-violating forms, so that they were no longer underrepresented in the training data.If this experiment finds linear cumulativity, we can conclude that the super-linear effect observed in Experiment 1 Condition E was due to the lack of doubly-violating forms, suggesting overt learning.If we nonetheless observe super-linear cumulativity, we can conclude that learners project super-linear cumulativity of weak phonotactic restrictions, even when this deviates from the observed frequencies. 4

Methods
The stimuli, design, and procedure for Experiment 2 were identical to those of Experiment 1 Condition E, except that two of the singly-violating forms were altered so as to also violate the other phonotactic; see

Results
The results of Experiment 2 are shown in Figure 6. Comparing Experiment 1 Condition E and Experiment 2, we see that in both cases, forms that violated neither phonotactic restriction were rated highest, and forms that violated backness harmony were rated essentially as high. Forms that violated nasal harmony were rated lower, while forms that violated both nasal and backness harmony received lower ratings still. As above, the question of interest is whether the penalty for violating both nasal and backness harmony continued to be greater than expected (super-linear) in Experiment 2, based on the independent penalties associated with each individual violation. To test this, we analyzed the two datasets together in a mixed-effects linear regression model. Since we anticipate a null result, in contrast to Experiment 1 we opted for a Bayesian implementation of the model, using the brms package (Bürkner et al. 2017).
4 We leave open whether learners project super-linearity because their grammatical mechanism is so tightly parameterized that the degree of linearity in cumulativity is necessarily determined by the strength of the restrictions, or whether they project it due to prior expectations about constraint weights that yield super-linearity of weak phonotactic restrictions. In order to address this, we would need to provide learners with more evidence for super-linearity; for example, a larger number of forms, so that the discrepancy between observed and expected numbers of doubly-marked forms would be greater.
5 Bayesian models estimate a range of probable values for the parameters of interest; thus we can conclude that an effect is robust to the extent that the range containing 95% of these values, a measure known as a 95% Credible Interval (abbreviated to "95% CI", followed by upper and lower bounds in square brackets), does not include zero. The inverse of this is that if the range is centered on zero, then we can say there is evidence for no effect of the parameter of interest on the dependent variable. Thus, the Bayesian model allows us to present evidence that supports, rather than simply fails to reject, the null hypothesis. For a linguistically-oriented introduction to Bayesian methods for both theory-building and data analysis, see Nicenboim & Vasishth (2016); for tutorial materials on the brms package in a linguistic context, see Vasishth et al. (2018); Nalborczyk et al. (2019); for a more general primer in Bayesian statistical modeling, see Kruschke (2014).
As in Experiment 1, the dependent variable was the numerical rating given to each word in the generalization phase. Also as in Experiment 1, the model contained a fixed effect of whether the form violated backness harmony (y/n, reference level = n), whether the form violated nasal harmony (y/n, reference level = n), and a binary factor for Experiment (one/two, reference level = one), as well as all two- and three-way interactions of these predictors. The model also contained random intercepts for nonword with slopes for Experiment, and random intercepts for subject with slopes for the interaction of the two binary phonotactic predictors.
We can interpret the output of the model as follows: if the 95% Credible Interval for the three-way interaction of violating backness harmony, violating nasal harmony, and Experiment excludes zero, it indicates that the degree of linearity in the cumulative interaction of violating both phonotactics together compared to their independent violations differed meaningfully between studies.If the 95% Credible Interval for the interaction is centered on zero, we can conclude that the cumulative effect of violating both phonotactics did not differ between studies, and thus was unlikely to have been overtly learned in Experiment 1.
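The decision rule just described can be sketched in a few lines. The example below is an illustration only: it computes an equal-tailed 95% interval from a list of posterior draws and checks whether that interval excludes zero. The draws are simulated placeholders rather than output from the model reported here, and all function names are ours.

```python
import random


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already-sorted list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def credible_interval(samples, mass=0.95):
    """Equal-tailed interval containing `mass` of the posterior draws."""
    ordered = sorted(samples)
    tail = (1 - mass) / 2
    return quantile(ordered, tail), quantile(ordered, 1 - tail)


def excludes_zero(interval):
    """True if both bounds lie on the same side of zero."""
    lower, upper = interval
    return lower > 0 or upper < 0


# Placeholder draws for a three-way interaction coefficient (hypothetical).
random.seed(1)
hypothetical_draws = [random.gauss(0.0, 5.0) for _ in range(4000)]
interval = credible_interval(hypothetical_draws)
print(interval, excludes_zero(interval))
```

An interval that excludes zero would indicate that the degree of linearity differed between the two studies; an interval centered on zero would indicate that it did not.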

Local discussion
Experiment 2 tested whether the super-linear cumulativity observed in Experiment 1 was a result of participants overtly learning a super-linear penalty from the super-linear underrepresentation in their data. We found that the linearity of cumulativity was not affected by whether or not the training data contained a subtle super-linear pattern. We take this to be compelling evidence in support of a synchronic link between exceptionality in learning data and super-linear cumulativity, as discussed in section 2, and against the possibility of the effect having been overtly learned.

Discussion
The experimental results in this study have shown that speakers can enforce super-linear cumulativity between phonotactic restrictions as a synchronic effect, and in fact even assume super-linearity under certain conditions, even when it is not present in the data.Using AGL experiments, we first systematically varied the number of exceptions to phonotactic restrictions in training, and found that the degree of non-linearity depends on the strength of those restrictions in the grammar.We then varied the amount of evidence that learners received for super-linear cumulativity in the training data, and found that learners continued to exhibit it even when such evidence was removed entirely.
On the basis of these data, we conclude that speakers can represent super-linear cumulativity in their synchronic grammar, and that this super-linearity was emergent from the interaction of the two constraints (a property of the grammar itself) rather than overtly learned from the training data.

Super-linear cumulativity or one constraint?
In our experimental results, we observe an interaction between two harmony restrictions: backness of vowels and nasality of consonants. We have assumed that these restrictions are enforced by separate Markedness constraints, and that the observed effect must reflect a super-linear interaction between two constraints. It is crucial for this interpretation that it is not due to the action of a single constraint, "nasal and backness CV harmony", which penalizes only those sequences which violate both independent Agree constraints. Thus, it is important to consider whether participants were employing such a unitary constraint.
There are two ways of thinking about a putative "nasal and backness CV harmony" constraint. On the one hand, it could be a unitary constraint enforcing simultaneous agreement of consonant nasality and vowel backness. On the other hand, it could be a conjoined constraint, Agree[±back] & Agree[±nas] (Smolensky 1993; Ito & Mester 2003).6 In either case, we have no particular reason to believe that there is such a constraint, since we know of no formal, phonetic, or typological connection between these two restrictions. More importantly, even if such a constraint existed, it would be mysterious why participants in Experiment 2 inferred its presence or activity, since we removed any trace of nasal plus backness harmony from speakers' learning data. We therefore conclude that the effects that we observe involve super-linear cumulativity of two separate constraints.

Whence super-linearity? MParse and beyond
We have based much of the framing of this paper on a model of phonotactic acceptability in which each form competes against a Null Parse candidate for existence, and then this probability is mapped onto a rating given by the participant.We illustrated this using a MaxEnt model, in which super-linearity emerges as a consequence of how probability is calculated from Harmony.
We adopted the MaxEnt framework because it is easy to demonstrate how it derives super-linearity, but similar effects can be derived in other probabilistic constraint-based models, too, such as Noisy Harmonic Grammar (Boersma & Pater 2016; Hayes 2017; Smith & Pater 2020).
We do not believe that the current results uniquely support a MaxEnt model, though they are consistent with it.
A distinguishing feature of this model that does play an important role in deriving super-linearity is the use of MParse. In most existing frameworks, unacceptability is modeled with grammars that assign low probability to a form, and high probability to a competitor: either an unfaithful rendition of the UR (Prince & Smolensky 1993) or other competitor strings which are more probable (Hayes & Wilson 2008). In models that employ MParse, unacceptability may also be modeled as the selection of the Null Parse, which violates only a single constraint (Prince & Smolensky 1993; Smolensky 1993; Wolf & McCarthy 2010). In the model illustrated in section 2.1, we crucially assumed that the Null Parse is not only a competing candidate, but the only competing candidate. This allows for the grammar to set a threshold of markedness above which the marked form is quite probable, and below which the Null Parse quickly becomes the more favored candidate (see also footnote 9 in Legendre et al. 1998). This thresholding effect is not so readily available in models that have candidate sets in which Markedness and Faithfulness violations trade off against each other in a one-to-one manner (Pater 2009b), or models which are based on the relative Harmony of different non-null candidates, such as the model proposed in Hayes & Wilson (2008).
A consequence of choosing this model is that, since it lacks Faithfulness, it cannot model any process in which a string must be repaired, such as alternations, loanword adaptation, and others. This leaves open the question of how to model such phenomena. The question of how closely tied phonological repairs are to phonotactic restrictions is an area of long-standing debate (Sommerstein 1974; McCarthy 2002: p. 77; Pizzo 2015; Chong 2017; Do & Yeung 2021). We see several possible answers to this question that can accommodate super-linearity in phonotactics, while also producing repairs. One is that phonotactic acceptability and phonological alternations are completely separate processes, as suggested by Hayes & Wilson (2008). However, our use of MParse does not require two separate grammars. Phonotactic restrictions and repairs could be derived with a single grammar, with a single set of Markedness constraints and weights, but in which different candidate sets are considered in different contexts or for different tasks. For example, we could model phonotactic acceptability judgments as a competition between the fully faithful candidate and the Null Parse, in which MParse is the arbiter. Alternations, on the other hand, could be modeled as a competition between the fully faithful candidate and possible repairs, decided by Faithfulness constraints.
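The two-candidate competition can be made concrete with a minimal sketch. It assumes the standard MaxEnt formulation, in which a candidate's probability is proportional to the exponential of its negative Harmony penalty; the form incurs the summed weights of the Markedness constraints it violates, and the Null Parse incurs only the weight of MParse. The weights below are hypothetical, and the sketch does not reproduce the linearity criterion of section 2.3.

```python
import math


def probability_of_form(markedness_weights, violations, mparse_weight):
    """Probability of the form when its only competitor is the Null Parse."""
    # Penalty of the form: weighted sum of its Markedness violations.
    penalty_form = sum(w * v for w, v in zip(markedness_weights, violations))
    # Penalty of the Null Parse: it violates MParse once and nothing else.
    penalty_null = mparse_weight
    score_form = math.exp(-penalty_form)
    score_null = math.exp(-penalty_null)
    return score_form / (score_form + score_null)


# Hypothetical weights for the two harmony constraints and for MParse.
weights = [1.0, 1.0]
mparse = 3.0
profiles = [
    ("no violations", [0, 0]),
    ("one violation", [1, 0]),
    ("two violations", [1, 1]),
]
for label, profile in profiles:
    print(label, round(probability_of_form(weights, profile, mparse), 3))
```

Because the form's probability is a logistic function of the difference between the weight of MParse and the form's summed penalty, equal steps in penalty do not yield equal steps in probability, which is the source of the thresholding behavior described above.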

Super-linearity vs. sub-linearity
The model that we have explored here has the ability to capture both linear and super-linear

Conclusion
The work presented here is a first step towards a fuller understanding of the empirical and typological landscape of (non-)linear cumulativity.The dependency between constraint strength and cumulative behavior proposed by our model makes strong predictions about both the wide scope of constraints that can enter into non-linear cumulative relationships, and also specific claims about the weighting requirements that must be met for such effects to be observed.A great deal of further empirical research is therefore needed to test and refine these predictions going forward.
absolute value of the constraint weights. The relation between the weight of the constraints and their predicted cumulative interaction is shown in Figure 1, which illustrates how varying the weight of MParse and Markedness constraints determines whether the interaction is super-linear, linear, or sub-linear. A formal description of the specific weighting conditions under which Maximum Entropy grammars with MParse exhibit different types of linearity is provided in the appendix.

Figure 1: Relationship between weight of a singly-violating candidate and the weight of MParse.

Figure 2 recasts the relation between MParse and markedness, focusing on how linearity depends on the probability assigned to outputs with a single Markedness violation. A probability of zero reflects a strongly enforced markedness restriction, and a probability of one reflects a completely unenforced restriction.

Figure 2: Relationship between probability of singly-violating form and the weight of MParse.
experimenter to manipulate properties of languages, to perform controlled comparisons of what participants learn under minimally different learning conditions. Such tasks have been used to manipulate the formal complexity (Moreton 2008; Moreton & Pater 2012a; b; Lai 2015; Öttl et al. 2015; McMullin 2016; Avcu & Hestvik 2020) and phonological substance (Wilson 2006; Finley & Badecker 2009; White 2013; Finley 2015; Glewwe 2019) of phonotactic restrictions and alternations. In order to test the effect of the strength of phonotactic restrictions, we can control the probability of individual Markedness violations by introducing exceptions (cf. also Hudson Kam & Newport 2005; Schuler et al. 2021 among many others). This allows us to calculate the joint probability of two violations, and compare it to participants' acceptability judgements. It also allows us to manipulate those probabilities, to test whether the presence or strength of super-linear interactions depends on the strength of the individual Markedness violations. Finally, we can directly control whether super-linear interactions are present in the training data
intermediate ratings. Furthermore, as the number of exceptions in training increased (Condition A through Condition E), the ratings of violating forms generally increased, as well. Unexpectedly, as the number of exceptions in training increased, the ratings of fully-conforming items also decreased, particularly in Conditions D and E.

Figure 3: Experiment 1 results, group-level rating plotted on the vertical axis with standard error, Condition plotted on the horizontal axis. Color denotes which phonotactics were violated.
As it turns out, although exceptions to backness and nasal harmony were presented with equal frequency in the training data, violations of backness harmony were judged better than violations of nasal harmony, even converging with ratings of fully-conforming items in Conditions D and E.
indicates the reverse. If participants were simply not learning the backness-harmony phonotactic, we should expect to see participants in training Conditions with more exceptions having a higher nasal advantage score. Figure 4 plots nasal advantage scores by Condition. A linear model confirmed the visual impression that training Condition (coded as a numerical predictor corresponding to the percentage of training data conforming to both phonotactics) does not significantly predict nasal advantage score (β = -0.015, p = 0.791). We therefore conclude that although backness harmony was enforced less stringently than nasal harmony, and this led to an eventual convergence with fully-conforming items in the most exceptionful Conditions, it is not the case that manipulating the number of exceptions had a differential effect on learning backness vs. nasal harmony.
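The nasal advantage score and its relation to the verification criterion can be stated compactly. The sketch below is an illustration with hypothetical function names; it assumes 8 verification questions per phonotactic and the maximum permitted difference of 3 described in the Methods, which is why scores of included participants fall between -3 and 3.

```python
def nasal_advantage(nasal_correct, backness_correct):
    """Positive when more nasal-harmony questions were answered correctly."""
    return nasal_correct - backness_correct


def meets_criterion(nasal_correct, backness_correct, max_difference=3):
    """True if the two phonotactics were learned to a comparable degree."""
    return abs(nasal_advantage(nasal_correct, backness_correct)) <= max_difference


# Hypothetical participants: (correct out of 8 on nasal, correct out of 8 on backness).
for nasal, backness in [(8, 6), (5, 8), (8, 4)]:
    print(nasal_advantage(nasal, backness), meets_criterion(nasal, backness))
```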

Figure 4: Nasal advantage score by Condition: one dot is one participant's score (jitter added for readability).

+
Null Parse model allow for the above suggestion of super-linear cumulativity via overt learning? Here, the expected degree of super-linearity is a function of the weights of the constraints involved. Figure 2 shows that for most weights of MParse, the model predicts that as the weight of Markedness decreases (and the probability of singly-violating forms correspondingly increases), the penalty for multiply-violating forms becomes super-linear. If the weight of MParse is invariant, the degree of super-linearity should be an emergent by-product of the strength of the phonotactic restrictions. Other models allow a broader range of cumulative effects through additional parameters. If the weight of MParse is variable, learners would be able to capture a wider (though still quite constrained) range of linear or non-linear effects by setting the weight of MParse in response to the data. An even more powerful approach is to induce a conjoined constraint, such as Agree[±back] & Agree[±nasal], which allows for any degree of super-linearity (Smolensky 1993; Ito & Mester 2003; Shih 2017; see section 6.1 for further discussion).

Figure 6: Comparison of mean and standard error of ratings by word type in Experiment 1 Condition E (left), and Experiment 2 (right).

6 Numerous authors have pointed out that local constraint conjunction has the potential to radically expand the predicted phonological typology beyond what has been observed (Pater 2009a; b; Potts et al. 2010). Moreover, the putative constraint Agree[±back] & Agree[±nas] pushes the limits of what is allowed for constraint conjunction, since the locus of violation spans multiple segments and syllables (Łubowicz 2005).
cumulativity, but in a restricted fashion: as seen in Figure 2, the degree of cumulativity depends on the strength of the phonotactic restrictions involved.This is precisely what we observed in our experimental results.The fact that the very same phonotactic restrictions can interact in different ways depending on their strength has the potential to shed light on a discrepancy in the literature, between studies that do (Albright 2008; 2012; Green & Davis 2014; Kumagai 2017; Shih 2017; Yang et al. 2018; Kim 2019; Smith & Pater 2020) and do not (Pizzo 2015; Breiss 2020; Durvasula & Liter 2020; Kawahara 2021; Kawahara & Breiss 2021; Kawahara & Moore 2021) observe super-linearity.

Figure 2 predicts exactly the type of transition we observed: as we move upwards along the vertical axis from a stronger to a weaker phonotactic restriction, we observe a transition from one type of interaction to another, and the specific transition depends on the weight of MParse. For most values of MParse, the prediction is a shift from linear to super-linear interactions, as in our experiments. However, for some values of MParse, the model also predicts a region of sub-linear cumulativity. In fact, there are possible indications of sub-linear cumulativity in the literature: Pizzo (2015) found that violations of syllable-margin restrictions in English interacted sub-linearly in phonotactic acceptability. It is conceivable, therefore, that the apparently discrepant results in the literature are simply a consequence of the weights of markedness and MParse involved. Continued systematic experimental investigation of how phonotactic restrictions of varying strengths interact will reveal whether linearity, super-linearity, and sub-linearity emerge under the predicted weighting conditions. AGL tasks like the one employed here are a useful tool for probing this question, because they allow us to vary phonotactic strength independent of other properties of the language.

Table 1: Super-linear cumulativity in a MaxEnt + Null Parse model of phonotactics.

Table 2: Approximately linear cumulativity in a MaxEnt + Null Parse model of phonotactics.

Table 3: Sub-linear cumulativity in a MaxEnt + Null Parse model of phonotactics.

Table 4 displays the counts and violation profiles of stimuli.

Table 4: Distribution of stimuli across Conditions in Experiment 1.

Table 5.
86 undergraduate students were recruited from the same subject pool to participate in the experiment, and were compensated for their time with course credit. Of these, 15 were excluded for not having spoken English consistently in some context since before the age of seven, leaving data from 71 participants for analysis.

Table 5: Distribution of training items by type, comparing Experiment 1 Condition E to Experiment 2.