A MaxEnt learner for super-additive counting cumulativity

Whereas most previous studies on (super-)gang effects examined cases where two weaker constraints jointly beat another stronger constraint (Albright 2012; Shih 2017; Breiss & Albright 2022), this paper addresses gang effects that arise from multiple violations of a single constraint, which Jäger & Rosenbach (2006) referred to as counting cumulativity. The focus of this paper is the super-additive version of counting cumulativity: cases where multiple violations of a weaker constraint not only overpower a single violation of a stronger constraint, but also surpass the mere multiplication of the severity of a single violation. I report two natural language examples in which a morphophonological alternation in a compound is suppressed by the presence of marked segments in a super-additive manner: laryngeally marked consonants in Korean compound tensification and nasals in Japanese Rendaku. Using these two test cases, this paper argues that this type of super-additivity cannot be entirely captured by the traditional MaxEnt grammar; instead, a modified MaxEnt model is proposed in which the degree of penalty is scaled up by the number of violations through a power function. The paper also provides a computational implementation of the proposed MaxEnt model which learns the necessary parameters from quantitative language data. A series of learning simulations on Korean and Japanese shows that the MaxEnt learner is able to detect super-additive constraints and find the appropriate exponent values for them, correctly capturing the probability distributions in the input data.


Introduction
A gang effect has been traditionally defined as a cumulative constraint interaction in which two weaker constraints jointly beat another stronger constraint (Guy 1997; Pater 2009; Albright 2012; Shih 2017). Counter to traditional Optimality Theory (Prince & Smolensky 1993), where constraints are strictly ranked, Harmonic Grammar (Legendre et al. 1990) can capture gang effects without any stipulation, through linear interaction between weighted constraints. Consider the tableaux in (1). In (1)-a and (1)-b, the candidate that violates either ℂ 2 or ℂ 3 wins since violating ℂ 1 is more fatal. In (1)-c, however, the candidate that violates both ℂ 2 and ℂ 3 loses because the summed weights of ℂ 2 and ℂ 3 outweigh ℂ 1 .
(1) Gang effect
a. w(ℂ 1 ) > w(ℂ 2 )
b. w(ℂ 1 ) > w(ℂ 3 )
c. w(ℂ 1 ) < w(ℂ 2 ) + w(ℂ 3 )

It was reported that the mere addition of constraint weights often cannot capture the cumulative effect correctly, because forms with two marked structures occur even less often in natural languages than the grammar predicts by multiplying the probabilities of the two separate forms with one marked structure each (Albright 2012; Shih 2017; Breiss & Albright 2022). For example, Albright (2012) showed that some combinations of relatively marked structures are strongly underattested in both Lakhota and English. More specifically, singly marked structures are somewhat tolerated in the lexicon but doubly marked structures are not, appearing even more rarely than expected.
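To make the weighted-sum logic concrete, the sketch below computes harmonies for the schematic competition in (1) under Harmonic Grammar. The specific weights are hypothetical values chosen only to satisfy the conditions in (1); any values meeting those inequalities reproduce the same pattern.

```python
# A minimal Harmonic Grammar sketch of the gang effect in (1).
# Hypothetical weights: w1 > w2, w1 > w3, but w1 < w2 + w3.
w = {"C1": 3.0, "C2": 2.0, "C3": 2.0}

def harmony(violations):
    """Weighted sum of constraint violations (higher = worse)."""
    return sum(w[c] * n for c, n in violations.items())

# (1)-a: violating C2 alone is better than violating C1.
print(harmony({"C2": 1}) < harmony({"C1": 1}))            # True
# (1)-c: violating C2 and C3 together is worse than violating C1:
# the two weaker constraints gang up on the stronger one.
print(harmony({"C2": 1, "C3": 1}) > harmony({"C1": 1}))   # True
```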
This super-gang effect calls for an additional penalty for combinations of markedness on top of the summed penalty of the dispreferred subparts (Albright 2012). Shih (2017) convincingly showed that super-gang effects can be captured by adding a weighted conjoined constraint to the grammar; the addition of the conjoined constraint significantly improves the grammar's fit when modeling quantitative natural language patterns. Breiss & Albright (2022), through a series of artificial language experiments, investigated the relationship between the strength of an individual restriction and how it gangs up with other violations. They found that violations are more likely to super-gang with each other if the individual restrictions are weaker, or in other words, have more exceptions. They concluded that a MaxEnt grammar, although the best fitting among contemporary frameworks, can capture these super-gang effects only if very specific weighting conditions are met: the interacting constraints must both be low-weighted.

Jäger & Rosenbach (2006) distinguish two types of cumulativity. Ganging-up cumulativity is what most previous works refer to as "gang effects". As described earlier in (1), it is characterized as a case in which two weaker constraints jointly overcome another stronger constraint. Counting cumulativity refers to cases where multiple violations of a single weaker constraint overpower a single violation of a stronger constraint. In (2)-a, the candidate which violates ℂ 1 loses because violating ℂ 1 is more severe than violating the weaker constraint ℂ 2 . In (2)-b, however, the candidate that incurs multiple violations of ℂ 2 loses because a single violation of ℂ 1 is outweighed by multiple violations of ℂ 2 .
(2) Counting cumulativity
a. w(ℂ 1 ) > w(ℂ 2 )
b. w(ℂ 1 ) < w(ℂ 2 ) × n, where n ≥ 2

This paper focuses on the super-additive version of this counting cumulativity, which I call super-additive counting cumulativity: multiple violations of a weaker constraint not only overpower a single violation of a stronger constraint, but also go beyond what is predicted by simply multiplying the severity of a single violation. For example, as I will show in §3, an OCP constraint against a pair of laryngeally marked consonants in Korean gangs up in a super-additive manner, so that one violation is negligible whereas multiple violations are severe. Throughout the paper, I use the term super-additive cumulativity to refer only to this specific type of gang effect. I report two examples of super-additive counting cumulativity in natural language to show that these effects can be well captured in MaxEnt Harmonic Grammar (Hayes & Wilson 2008) if the penalty is scaled up by the number of violations (n) through a power function (f(n) = n^b, where b > 1). I also provide a computational implementation of this framework, called the Power Function MaxEnt Learner, which automatically detects constraints that show super-additive counting cumulativity and learns the appropriate b value.
The remainder of the paper is organized as follows. Through an illustration of a toy example of super-additive counting cumulativity in §2, I first show that these effects cannot be entirely captured in MaxEnt Harmonic Grammar (Hayes & Wilson 2008). I further show that the super-additivity is better predicted by a MaxEnt grammar provided that a power function is incorporated into the violation assessment method. I also introduce the Power Function MaxEnt Learner, which will be used later in the paper to fit the natural language patterns that show super-additive counting cumulativity. In §3, I investigate the super-additive cumulativity of laryngeally marked consonants in Korean compound tensification, a productive morphophonological process. I analyze the pattern by employing two OCP constraints that are sensitive to the position of a word boundary and to whether the marked structure is derived, and that vary in counting cumulativity. In §4, I examine the effects of nasals on blocking Rendaku, a well-known compounding process in Japanese that is similar to Korean compound tensification. Various restrictions in the Rendaku process are attributed to gradient similarity-avoidance effects in Japanese, which are formally analyzed in the paper as a set of OCP constraints on different natural classes and with varying degrees of counting cumulativity. For each example in §3 and §4, I present learning simulations on quantitative data acquired from a survey and dictionaries to show that the MaxEnt model detects super-additive constraints and learns appropriate parameters to successfully predict the frequency distributions of the data. I summarize the paper in §5.

I start by defining terms that are used in the paper in §2.1. In §2.2, I present a toy dataset with super-additive counting cumulativity and compare three different violation assessment methods to see how each captures the data. I first analyze the toy data with a conventional penalty assessment method, where the product of the constraint weight and the number of violations is summed over all the constraints; I show that super-additivity cannot be entirely captured in a conventional MaxEnt grammar. In §2.3, I take an alternative approach, a constraint family model, and discuss how it fails to capture the monotonicity of such patterns, wherein more violations incur more severe penalties. In §2.4, I introduce the power function model, in which the degree of penalty is scaled up according to the number of violations through a power function. In §2.5, I introduce a MaxEnt implementation of the power function model, called the Power Function MaxEnt Learner, with a focus on how it differs from the conventional MaxEnt learner. I present two learning simulations to show the capacity of this learner. I first fit a dataset without super-additive counting cumulativity and show that the learner correctly lets all the b values stay at 1. Subsequently, I fit the super-additive data illustrated in §2.2, showing that the learner is able to detect the super-additive constraint and adjust the parameters accordingly, producing precise frequency matching. I conclude the section by explaining how the property of the power function guarantees a monotonic increase in penalties with more violations.

Terminology
Various terms regarding constraint interaction have been used in the literature. In this section, I introduce these terms and determine which of them will be used in this paper.
Ganging and cumulativity have been used interchangeably to refer to the existence of some interaction between multiple violations, whether of one constraint or more. As opposed to traditional Optimality Theory (Prince & Smolensky 1993), in which constraints are strictly ranked, Harmonic Grammar (Legendre et al. 1990) and its variants assume that constraints can gang up or be cumulative with each other. I use the terms cumulative and cumulativity in this paper.
Linearity and additivity have both been used in the literature to refer to types of cumulativity in constraint interaction. According to Breiss & Albright (2022), these terms concern the relationship between the actual probability of doubly-marked structures and their predicted probability, computed by multiplying the probabilities of the two corresponding singly-marked structures. They also briefly comment on how linearity is only subtly different from additivity and choose to use the term linearity in keeping with previous literature. Linearity, super-linearity, and sub-linearity refer to cases where the observed probability is equal to, lower than, or higher than the predicted probability, respectively.
As mentioned in §1, this paper focuses on super-additive counting cumulativity: cases where multiple violations of a single constraint overpower a mere multiplication of one violation. As will be explained in more detail in §2.5, in the conventional MaxEnt grammar the probability of a phonological form x is computed by exponentiating its harmony, the weighted sum of x's constraint violations (Goldwater et al. 2003; Hayes & Wilson 2008). Because of this exponentiation, a linear increase in the harmony domain can affect probability in a super-linear manner; nevertheless, simply adding up constraint violations is insufficient in some cases, as I will report in this paper. Thus, I adopt the terminology of additivity. Specifically, I use the term super-additivity to refer to cases where a monotonic increase in constraint violations cannot fully capture the amount of decrease in the probability domain.
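The exponentiation step can be illustrated directly. The short sketch below converts harmonies into MaxEnt probabilities for a two-candidate competition; the single constraint weight is a hypothetical value used only to show the shape of the curve.

```python
import math

def maxent_probs(harmonies):
    """P(x) = exp(-H(x)) / Z, with Z summing over all candidates."""
    expo = [math.exp(-h) for h in harmonies]
    z = sum(expo)
    return [e / z for e in expo]

# Hypothetical weight for a single constraint violated n times by
# candidate (i); candidate (ii) is left unpenalized for simplicity.
w = 1.0
for n in range(4):
    p = maxent_probs([w * n, 0.0])[0]
    print(n, round(p, 3))   # 0.5, 0.269, 0.119, 0.047
# Harmony grows linearly in n, but the probability of candidate (i)
# falls along a logistic curve, i.e. super-linearly in places.
```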

Conventional MaxEnt Grammar
Tableau (3) illustrates an example that exhibits super-additive counting cumulativity, motivated by the Korean dataset that will be introduced later in §3. The weights of the constraints and the predicted probabilities are computed by the MaxEnt Grammar Tool (Hayes et al. 2009).
A violation of ℂ 1 is always incurred by candidate (ii) and is constant across inputs. Violations of ℂ 2 are always incurred by candidate (i), increasing monotonically down the tableau. The Korean dataset in §3 will not include inputs like (d), because the majority of Korean native nouns are monosyllabic or disyllabic and therefore not long enough to incur three violations of ℂ 2 . Even where word length allows it, a co-occurrence of three laryngeally marked consonants is highly disfavored, as reported in Park (2020) on the basis of phonotactic learning simulations over existing Korean native monomorphemic nouns. However, input (d) is still included for the purpose of demonstration, because the expected probability of nonce forms like (3)-d-i is most certainly zero.
(3) Toy data exhibiting super-additive counting cumulativity [tableau]

In (3)-a and (3)-b, a single violation of ℂ 2 lowers the observed frequency of the winner only marginally and does not reverse the choice of winner. Multiple violations of ℂ 2 can reverse the winner, however, as seen in (3)-c and (3)-d. This is a clear case of counting cumulativity, where multiple violations of the weaker constraint ℂ 2 overpower a single violation of the stronger constraint ℂ 1 . The weighting condition for these two constraints is summarized in (4). The notation w(ℂ) refers to the weight of the constraint ℂ and n refers to the number of violations.

(4) Weighting schema for the counting cumulativity of ℂ 2 : w(ℂ 2 ) × n > w(ℂ 1 ) > w(ℂ 2 ), where n ≥ 2

Taking a closer look at the observed probability distributions of (3)-c and (3)-d, the losing candidates with multiple violations of ℂ 2 barely occur between the two possible outputs. This shows that ℂ 2 is cumulative in a super-additive manner: multiple violations of ℂ 2 surpass a mere multiplication of the severity of a single violation.
The weights of the constraints, computed by the MaxEnt Grammar Tool (Hayes et al. 2009), satisfy the weighting condition given in (4): one violation of ℂ 2 (0.6) weighs less than one violation of ℂ 1 (0.7), which in turn weighs less than two or three violations of ℂ 2 (1.2, 1.8). However, the candidate probabilities predicted by the weighted constraints do not successfully match the input distributions. To reproduce the observed pattern of super-additivity, a grammar must satisfy the following two conditions. First, one violation of ℂ 2 should weigh low enough to correctly capture the marginal frequency difference between the winners of (3)-a and (3)-b. Second, multiple violations of ℂ 2 should weigh high enough to capture the extreme gang effect in (3)-c and (3)-d. This weighted grammar meets neither condition because w(ℂ 2 ) is stuck in the middle: candidate (3)-b-i is penalized too much for violating ℂ 2 once (predicted 52%, observed 60%), while (3)-c-i is not penalized enough for incurring multiple violations of ℂ 2 (predicted 38%, observed 4%).
The simulation illustrates that there is no way for a traditional MaxEnt grammar to satisfy the two conditions at once: w(ℂ 2 ) × 1 must be low enough while w(ℂ 2 ) × 2 and w(ℂ 2 ) × 3 must be high enough. This is because the disparity between one and multiple violations is wide, but w(ℂ 2 ) is fixed and violations are assessed only linearly. Under this linear strategy, multiple violations are merely doubly or triply penalized, which is not enough to capture the super-additivity.
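The bind can be verified numerically. With the fitted weights from above (w(ℂ 1 ) = 0.7, w(ℂ 2 ) = 0.6), linear assessment reproduces exactly the mismatched 52% and 38% predictions, and no alternative value of w(ℂ 2 ) fixes both at once; the alternative values tried below are hypothetical probes.

```python
import math

def p_winner(h_i, h_ii):
    """MaxEnt probability of candidate (i) against candidate (ii)."""
    return math.exp(-h_i) / (math.exp(-h_i) + math.exp(-h_ii))

w1, w2 = 0.7, 0.6  # weights fitted by the MaxEnt Grammar Tool

# Candidate (i) violates C2 n times; candidate (ii) violates C1 once.
print(round(p_winner(1 * w2, w1), 2))  # 0.52: too low (observed 60%)
print(round(p_winner(2 * w2, w1), 2))  # 0.38: too high (observed 4%)

# Lowering w2 fixes the first prediction but worsens the second;
# raising w2 does the reverse. Linear assessment cannot satisfy both.
for w2_try in (0.2, 0.6, 1.5):
    print(w2_try, round(p_winner(1 * w2_try, w1), 2),
          round(p_winner(2 * w2_try, w1), 2))
```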

Alternative: constraint family
The observed pattern of super-additive counting cumulativity can be captured by a constraint family where a separate self-conjoined constraint is responsible for each number of violations.
For the toy data, ℂ 2 was replaced by the set of constraints in (5), whose weights were computed using the MaxEnt Grammar Tool (Hayes et al. 2009), as presented in (6).
The constraint family provides precise frequency matching, which is expected: there is a separate constraint responsible for each number of violations, and its weight can be adjusted to cater to each input and its candidate distribution. In this approach, however, constraints that stand for greater numbers of violations are not guaranteed to have higher weights. For example, a constraint family can describe a language in which violating w(ℂ 2 ) 3 is less severe than w(ℂ 2 ) 1 but more severe than w(ℂ 2 ) 2 , which is highly unnaturalistic and unattested (e.g., w(ℂ 2 ) 1 > w(ℂ 2 ) 3 > w(ℂ 2 ) 2 ). A de Lacian stringency hierarchy (de Lacy 2002), where each constraint penalizes n or fewer marked structures, does not have this problem, since candidates with one violation also violate the other constraints in the hierarchy, such as *TwoOrFewer, *ThreeOrFewer, and so forth. Even this approach, however, only guarantees weak monotonicity. As Jäger & Rosenbach (2006) point out, a stringency hierarchy can describe a language where the probability of candidate (6)-i is constant at 62% up to 3 violations of ℂ 2 , decreases to 60% at 4 violations, remains constant until 6 violations, and approaches 0% with more. Thus, this line of approaches disregards the nature of super-additive counting cumulativity, wherein more violations lead to a strictly lower probability of the offending candidate, as Zymet (2014) notes.
In the same vein, a constraint family also dismisses learners' ability to make predictions about larger numbers of violations that are not evidenced in the existing lexicon. For example, the Korean native lexicon has very few monomorphemic nouns with two marked consonants, as briefly mentioned in §2.2; nevertheless, speakers know that tensifying a compound formed with such nouns is extremely unlikely.

Power function
A power function is a function of the form f(a) = a^b, where the independent variable a is raised to a constant power b. I substitute the number of ℂ 2 violations n(ℂ 2 ) with n(ℂ 2 )^b (b > 1) to scale up the penalty, following Zymet (2014) (scaling of penalties was also used in Coetzee 2009; Pater 2009; Kimper 2011; McPherson & Hayes 2016; Hayes 2017). Note that the number of violations itself stays the same; only its contribution to harmony is scaled up. For example, one violation indicates a severity of 1^b, while two and three violations indicate severities of 2^b and 3^b, respectively. Since 1^b always equals 1 for any b, applying the power function to the number of violations does not affect the severity of a single violation. If the b value is greater than 1, the penalty is enlarged for multiple violations. Thus, the parameter b can be adjusted to capture different degrees of super-additivity; a larger b captures more extreme super-additivity. With a b value less than 1, the number of violations n is scaled down: n^b = 1 when b = 0 and n^b < 1 when b is negative. In fact, Zymet (2014) exploits a negative power function in order to capture distance-based decay patterns, where the weight of a single constraint is scaled down according to the distance between the interacting segments.
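A minimal sketch of this scaled assessment, using the weight (0.2) and exponent (4.5) that the learner arrives at in (10) below:

```python
w, b = 0.2, 4.5  # weight and exponent from the fit in (10)

for n in range(4):
    # The violation count n contributes n ** b to harmony before weighting.
    print(n, round(w * n ** b, 2))
# n=0 -> 0.0, n=1 -> 0.2 (unchanged, since 1 ** b == 1),
# n=2 -> 4.53, n=3 -> 28.06: multiple violations are penalized
# far beyond a linear doubling or tripling.
```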

Power Function MaxEnt Learner
This section outlines the MaxEnt implementation that automatically learns the parameters of the power function model introduced in §2.4. In the conventional MaxEnt grammar, the harmony of a phonological form x is defined as the weighted sum of x's constraint violations (Goldwater et al. 2003; Hayes & Wilson 2008), shown in (7). Here, N is the number of constraints, w_i is the weight of the ith constraint, and n_i(x) is the number of times that x violates the ith constraint.

(7) H(x) = ∑_{i=1}^{N} w_i · n_i(x)

The learner of the power function model differs crucially in the harmony calculation, since the number of violations n_i(x) is exponentiated by b_i, as seen in (8). Here, b_i is the exponent of the ith constraint. The other components of the learning algorithm remain unchanged from the standard MaxEnt learning algorithm (Hayes & Wilson 2008).

(8) H(x) = ∑_{i=1}^{N} w_i · n_i(x)^{b_i}

The learner's goal is to find the set of weights that minimizes the negative log likelihood of the training data. Therefore, I formalize learning as minimizing the standard loss function, the sum of the negative log likelihoods. For optimization, batch gradient descent is used. The weights for all constraints are initialized at 0 and the b values are all initialized at 1. From there, at each iteration, the gradients of the loss function with respect to the current set of weights and b values are computed using the Python library Autograd, which efficiently computes partial derivatives of given functions. The constraint weights as well as the b values are updated based on these gradients, with a learning rate of 0.1 for weights and 0.01 for bs.
Notably, with this setup for the bs, in which they are initialized at 1 and adjusted only slowly with the lower learning rate of 0.01, a more conservative assumption about the grammar is built in: every constraint's penalty increases linearly by default, as in a regular MaxEnt grammar, and the learner departs from that initial assumption only when the data provide evidence otherwise (e.g., the existence of super-additivity). If any b value falls below 1 after an update, it is automatically reset to 1, because b must be at least 1 in order for the power function to scale the number of violations up rather than down. Similarly, if any weight falls below 0 after an update, it is reset to 0. After each update, the loss function is recalculated with the updated parameters.
The learner is programmed to terminate whenever the loss function increases rather than decreases.
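To make the procedure concrete, here is a minimal sketch of such a learner, assuming tableaux encoded as (violation-matrix, observed-probability) pairs; the encoding, function names, and the zero-violation guard are my own illustrative choices, not necessarily those of the published implementation.

```python
# A minimal sketch of the Power Function MaxEnt Learner. Each tableau
# is a (violations, observed) pair: `violations` is a candidates-by-
# constraints count matrix; `observed` holds observed probabilities.
import autograd.numpy as np
from autograd import grad

def neg_log_likelihood(weights, bs, tableaux):
    total = 0.0
    for viols, obs in tableaux:
        # Guard n == 0 so the gradient of n ** b stays well defined.
        safe = np.where(viols > 0, viols, 1.0)
        scaled = np.where(viols > 0, safe ** bs, 0.0)
        # Harmony per candidate: weighted sum of scaled violations, as in (8).
        harmonies = np.dot(scaled, weights)
        log_probs = -harmonies - np.log(np.sum(np.exp(-harmonies)))
        total = total - np.sum(obs * log_probs)
    return total

def fit(tableaux, n_constraints, lr_w=0.1, lr_b=0.01, max_iters=100000):
    weights = np.zeros(n_constraints)   # weights initialized at 0
    bs = np.ones(n_constraints)         # exponents initialized at 1
    grad_w = grad(neg_log_likelihood, 0)
    grad_b = grad(neg_log_likelihood, 1)
    loss = neg_log_likelihood(weights, bs, tableaux)
    for _ in range(max_iters):
        weights = weights - lr_w * grad_w(weights, bs, tableaux)
        bs = bs - lr_b * grad_b(weights, bs, tableaux)
        weights = np.maximum(weights, 0.0)  # reset negative weights to 0
        bs = np.maximum(bs, 1.0)            # reset bs below 1 back to 1
        new_loss = neg_log_likelihood(weights, bs, tableaux)
        if new_loss > loss:                 # terminate when loss increases
            break
        loss = new_loss
    return weights, bs
```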
I first fit a dataset without super-additivity, in which a linear increase in the number of violations results in a fairly gradual decrease in observed probabilities, using the Power Function MaxEnt Learner. In (9), for each input, the observed probability of candidate (i) gradually decreases with more violations of ℂ 2 ; the data portray cases where the counting cumulativity of ℂ 2 is no more than additive. The observed probabilities in this tableau are adjusted from (3) for demonstration, but this type of gradual decrease/increase in probabilities arising from violations of a scalar constraint is well attested in linguistics (English genitive variation in Jäger & Rosenbach (2006), distance decay in Latin and other languages in Zymet (2014) and Stanton (2016), and Tommo So vowel harmony in McPherson & Hayes (2016)).

(9) Counting cumulativity without super-additivity: good fit [tableau]

The Power Function MaxEnt Learner converged after 815 iterations and successfully detected that increasing the b for ℂ 2 is not necessary: for both constraints, the b values stayed at 1, as shown in (9). Moreover, the learner exactly replicated the weights and predicted probabilities optimized by the MaxEnt Grammar Tool (Hayes et al. 2009).
Subsequently, the dataset with super-additivity in (3) was fitted using the Power Function MaxEnt Learner. The learner converged after 50,134 iterations. As seen in (10), only the exponent for ℂ 2 increased, to 4.5, whereas the exponent for ℂ 1 stayed at 1, which shows that the learner is able to detect the super-additive constraint given the input data and let b rise only for that constraint.
Crucially, the learner needs to see at least three different levels of scalar violations in the input data in order to determine whether a constraint is super-additive. For example, with (10), the difference between zero and one violation is compared to the difference between one and two violations; since the second step reduces the probability more steeply than the first, the learner can confirm that an increased b value for ℂ 2 would be beneficial. The parameters in (10) show that the power function approach enables the grammar to reflect the crucial aspects of super-additive counting cumulativity: a single violation of ℂ 2 (0.2) is outweighed by a single violation of ℂ 1 (0.5), which in turn is outweighed by multiple violations of ℂ 2 (0.2 × 2^4.5 ≈ 4.5).
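The step comparison can be made explicit: in a MaxEnt grammar, each added violation shifts the log-odds against the offending candidate by w · (n^b − (n−1)^b), so equal steps are what b = 1 predicts. In the check below, the 60% and 4% figures are quoted from (3); the zero-violation baseline is a hypothetical value for illustration.

```python
import math

# Observed probability of candidate (i) by number of C2 violations.
# 60% and 4% come from (3); the 65% baseline is hypothetical.
observed = {0: 0.65, 1: 0.60, 2: 0.04}

def log_odds(p):
    return math.log(p / (1 - p))

step1 = log_odds(observed[0]) - log_odds(observed[1])
step2 = log_odds(observed[1]) - log_odds(observed[2])
print(round(step1, 2), round(step2, 2))  # 0.21 vs. 3.58
# The second step is far steeper than the first: the signature
# that pushes b above 1 for this constraint.
```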
(10) Output of the Power Function MaxEnt Learner: good fit [tableau]

The power function model gives precise frequency matching with the addition of only one parameter on top of a single baseline constraint, which is more parsimonious than the family of constraints in §2.3. Moreover, by the nature of the mathematics, the penalty is guaranteed to increase strictly and monotonically as the number of violations increases (for b > 1).

Summary
I illustrated a toy example of super-additive counting cumulativity and evaluated three different models: linear assessment, a constraint family, and a power function. In §2.2, it was shown that the conventional way of assessing violations cannot properly capture super-additivity. In §2.3, a set of self-conjoined constraints provided a good fit to the toy data, but these constraints can be arbitrarily weighted, in which case the observed monotonicity is not guaranteed. In §2.4 and §2.5, a grammar with exponentiated penalties showed a good fit to the toy example. With a slight modification to the traditional MaxEnt model, the Power Function MaxEnt Learner fits the parameters that are necessary for the power function model. More importantly, the mathematical nature of the power function guarantees that more violations are penalized more severely when the exponent b is larger than 1.

Case study 1: laryngeally marked consonants in Korean compound tensification
As the first real-life example exhibiting super-additive cumulativity, Korean compound tensification is investigated in this section. I introduce the phenomenon in §3.1. In §3.2, I analyze the observed pattern using a set of OCP constraints that are sensitive to the position of a word boundary and to whether the offending structure is derived. The observed super-additivity is attributed to the counting cumulativity of a specific OCP constraint. In §3.3, I report a learning simulation using the Power Function MaxEnt Learner and show that the suggested method correctly captures the observed super-additive pattern.

To investigate the phonological factors that contribute to the likelihood of tensification in Korean compounds, Kim (2017) collected large-scale pronunciation data by conducting a survey of Seoul Korean speakers. The survey material consists of 304 native noun-noun compounds whose W B initial onset is a lax obstruent, collected from two data sources: Korean Usage Frequency (Kang & Kim 2009) and the major Korean dictionary (National Institute of Korean Language 1999). In the survey questionnaire, the two components of each compound were given in separate parentheses with the morpheme boundary symbol '+' in between. For each compound, two possible pronunciation forms written in standard Korean orthography, one with and one without tensification, were given as options. The link to the Google Docs questionnaire was sent to 21 participants, undergraduate or graduate students at two universities in Seoul (Seoul National University and Yonsei University). They were asked to choose their pronunciation of the word formed by compounding the two given nouns. Table 1 reports the tensification rate according to the type/number of consonants in W A and W B .

First, the presence of a single laryngeally marked consonant in W B significantly lowers the tensification rate. Unlike the strong effect of a marked consonant in W B , there need to be two marked consonants in W A in order to exhibit significant OCP effects: the presence of a single laryngeally marked consonant is not significantly different from the complete absence of marked consonants, whereas the presence of two marked consonants in W A makes the tensification rate plunge. If a laryngeally marked consonant is considered a tensification blocker, the effect of two in W A goes beyond doubling the effect of one.

To assess the significance of these OCP effects, I constructed a mixed effects logistic regression model using the glmer function from the lme4 package (Bates et al. 2018) in R (R Core Team 2013). The dependent variable was the occurrence of tensification (binary: no tensification (ref), tensification) and the independent variables were the number of marked consonants in W B (backward difference coded: zero, one) and the number of marked consonants in W A (backward difference coded: zero, one, two). The model also included random slopes and random intercepts for participants and compounds. The model is shown in Table 2.

In the model reported in Table 2, the coefficients indicate how strongly each factor contributes to decreasing or increasing the likelihood of tensification. The model reflects all the observed OCP effects: (i) a single laryngeally marked consonant in W B significantly reduces the likelihood of tensification, (ii) the presence of one marked consonant in W A has no effect on tensification and is not significantly different from the absence of marked consonants, and (iii) the presence of two marked consonants in W A significantly dampens tensification. Moreover, model comparison using the anova() function revealed that the number of marked consonants in W A makes a significant improvement (χ²(2) = 24.3, p < 0.001).

Analysis: word boundary and counting cumulativity
This section provides a formal analysis of Korean compound tensification. I first introduce the constraints that are responsible for the occurrence of tensification. Then, I analyze the OCP pattern observed in §3.1: one laryngeally marked consonant in W B suffices to block tensification, while more than one in W A is needed to clearly show the blocking effect. This is attributed to different markedness thresholds in different morphological domains: word and compound.
Traditionally, tensification in Korean compounds has been considered a way of helping a compound be perceived as consisting of two elements (Kim-Renaud 1974; Chung 1980; Ahn 1985). Tensification of the initial segment of W B has been formally described in the literature either as the insertion of a stop segment such as /s/ or /t/ at the compound juncture, which would feed an automatic process of post-obstruent tensification, or as the insertion of a floating feature, such as [constricted glottis] or [tense], that would allow tensification of the following segment. Among many, following the recent study of Ito (2014), I assume in this paper that a floating [tense] feature is inserted at the compound juncture. Regarding the realization of this [tense] feature, I adopt RealizeMorpheme (Kurisu 2001), which requires that the floating [tense] be realized on some segment. This [tense] feature cannot dock onto the W A final coda, because codas are unreleased and hence cannot be tensified in Korean (Kim-Renaud 1974); instead, the feature docks onto the W B initial onset segment. While RealizeMorpheme encourages tensification, Id (tense) prevents W B initial onset segments from undergoing tensification, as demonstrated in (11). The weights in the tableau are computed by the MaxEnt Grammar Tool (Hayes et al. 2009).

An offending structure (a co-occurrence of two laryngeally marked consonants) is treated differently depending on whether a word boundary intervenes. First, considering that the site of tensification is the initial onset of W B , the reason for the stronger OCP effect in W B is that a co-occurrence of two laryngeally marked consonants is significantly less preferred within a single morpheme. This strong OCP effect in W B can be captured by a trigram constraint that penalizes an offending structure with a preceding word boundary, defined in (12). In the constraint definitions, T represents a laryngeally marked consonant and + represents a boundary.
(12) *+TT: Assign a violation mark for every subsequence of a word boundary followed by a laryngeally marked consonant followed by another laryngeally marked consonant.
On the other hand, a laryngeally marked consonant in W A cannot block tensification as strongly, because a co-occurrence of two marked consonants is better tolerated in a compound. This can be captured by a similar trigram constraint (13), which penalizes an offending structure with an intervening word boundary.
(13) *T+T: Assign a violation mark for every subsequence of a laryngeally marked consonant followed by a word boundary followed by another laryngeally marked consonant.
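As a concrete illustration of how these constraints assess violations, the sketch below counts T+T pairs over the schematic string encoding used later in (21) (T = a syllable with a laryngeally marked consonant, σ = one without, + = the compound boundary); the pairwise counting follows the "every subsequence" wording of (12) and (13), and the string encoding itself is introduced here only for illustration.

```python
def count_plus_TT(form):
    """*+TT violations: pairs of marked consonants (T) that both
    follow the compound boundary, i.e. co-occur within W_B."""
    wb = form.split("+")[1]
    n = wb.count("T")
    return n * (n - 1) // 2

def count_T_plus_T(form):
    """*T+T violations: pairs of marked consonants separated by the
    boundary, one in W_A and one in W_B."""
    wa, wb = form.split("+")
    return wa.count("T") * wb.count("T")

# A W_A with two marked consonants plus a tensified W_B, as in the
# pattern (16) will discuss: two T+T pairs across the boundary.
print(count_T_plus_T("TT+Tσ"))   # 2
print(count_plus_TT("σσ+TTσ"))   # 1: one pair inside W_B
```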
The constraint *+TT should be weighted high enough to capture the clear OCP effect in W B , while *T+T should be weighted low enough to allow two marked consonants to co-occur across a word boundary. The OCP pattern of Korean compound tensification corroborates the observation that some phonological restrictions that hold within a morpheme can be loosened or even lifted at morpheme boundaries (Martin 2011; Gallagher et al. 2019).
The tensification-blocking effect of two laryngeally marked consonants in W A is attributed to the counting cumulativity of *T+T violations, as summarized in (14): a single violation of *T+T is outweighed by the tensification-triggering constraint (RealizeMorpheme), which is in turn outweighed by multiple violations of *T+T. Sample tableaux and the weights of the constraints computed by the MaxEnt Grammar Tool (Hayes et al. 2009) are shown in (15) and (16). The example in (15) represents cases where a single marked consonant in W A does not block tensification. The example in (16), /k'oc h i/ 'skewer' + /kui/ 'roast', represents cases where two marked consonants in W A block tensification. In these tableaux, the input frequencies are not accurately matched even though the weights satisfy the weighting condition laid out in (14), because two violations of *T+T are more severe than simply doubling the severity of one violation, indicating that *T+T cumulates in a super-additive manner.

The example in (17), with /kuks'u/ 'noodle' as W B , illustrates cases where W A and W B both include a laryngeally marked consonant. The candidate with tensification, (17)-a, still occurs 20% of the time, in contrast to (16)-a, which almost never occurs, although it violates *T+T twice for having two pairs of laryngeally marked consonants across the word boundary: k h +k' and k h +s'. This is because the underlying T+T subsequence, k h +s', is tolerated and does not contribute to the super-additivity of *T+T violations. In contrast, the two T+T pairs included in (16)-a, k'+k' and c h +k', both contribute to the super-additivity, since both include a derived tense consonant k'. Since underlying T+T subsequences behave differently from derived ones, I replace *T+T with *T+T O and *T+T N , as seen in the re-evaluation of the same example from (17), presented in (18). In the tableau, RM represents RealizeMorpheme. Whereas *T+T N is violated by derived T+T subsequences, *T+T O is violated by subsequences that are already present in the fully faithful candidate, defined as a candidate that violates no faithfulness constraints (Comparative Markedness; McCarthy 2003). While there is no evidence that *T+T O is super-additive, *T+T N clearly is. This likewise supports splitting *+TT into *+TT N and *+TT O , where *+TT N is violated by derived TT pairs and *+TT O by underlying TT pairs already present in the fully faithful candidate; *+TT O is not included in the analysis because the dataset used here has no compound with two underlying marked consonants in W B , so *+TT O is vacuously satisfied by all the compounds.

Learning super-additive counting cumulativity
In this section, I fit the Korean data using the Power Function MaxEnt Learner. I first summarize the necessary constraints. Then, I present the results of the learning simulation on the Korean data. I show that incorporating the power function into violation assessment allows accurate frequency matching of the input data. I also show that the Power Function MaxEnt Learner is able to raise b only for the super-additive constraint and let it remain at 1 for the others.
As mentioned above, I assume that RealizeMorpheme (Kurisu 2001) requires the inserted [tense] feature to be phonologically realized, by association to the initial onset consonant of W B . In contrast, Ident (tense) requires that the specification for the [tense] feature of the input segments be identical to that of the output segments.
The laryngeal co-occurrence patterns are captured by a set of two OCP constraints, which penalize derived marked pairs with a preceding or an intervening word boundary, as defined again in (19) and (20). The counterpart constraints *T+T O and *+TT O are excluded from the analysis, as they are not decisive in predicting candidate distributions: *+TT O is never violated in the entire dataset, as mentioned above, and *T+T O is violated by either all candidates or no candidates for each input.
(19) *+TT N : Assign a violation mark for every subsequence of a word boundary followed by a laryngeally marked consonant followed by another laryngeally marked consonant that is not present in the input.
(20) *T+T N : Assign a violation mark for every subsequence of a laryngeally marked consonant followed by a word boundary followed by another laryngeally marked consonant that is not present in the input.
The Korean data were fitted using the Power Function MaxEnt Learner. The weights of the constraints were all initialized at 0; at each iteration, the update to a constraint's weight was the negative of the learning rate (0.1) times the gradient computed by Autograd. The b values were all initialized at 1; the update to a constraint's exponent b was the negative of the learning rate (0.01) times the gradient computed by Autograd. The learner converged after 39,587 iterations. The resulting set of weights and exponents is shown in (21). In the inputs and outputs, T represents a syllable with a laryngeally marked consonant, whereas σ represents a syllable without one. Other factors that could potentially be relevant to the likelihood of tensification, such as word length and the position of laryngeally marked segments, were not significant in the statistical analysis above and are disregarded in this paper. For instance, the input in (21)-d represents words of any length, which is either 3 or 4 syllables in the data, whose W B underlyingly has a single laryngeally marked segment.
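Under the tableau encoding assumed in the learner sketch from §2.5, an input from (21) can be written as a violation matrix. The entry below is a hypothetical illustration for an input with two marked consonants in W A and none in W B ; the candidate rows follow the constraint definitions above, and the observed distribution is invented for demonstration.

```python
import autograd.numpy as np

# One hypothetical tableau in the encoding assumed by fit() in §2.5.
# Constraint order: [RealizeMorpheme, Ident(tense), *+TT_N, *T+T_N].
viols = np.array([
    [0., 1., 0., 2.],  # (i)  tensified: derived T pairs with both W_A Ts
    [1., 0., 0., 0.],  # (ii) faithful: the floating [tense] goes unrealized
])
obs = np.array([0.05, 0.95])  # invented observed distribution
korean_tableaux = [(viols, obs)]
# weights, bs = fit(korean_tableaux, n_constraints=4)
```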
In the tableau, only the exponent for the super-additive constraint *T+T N increased to 4.5 whereas those for all the other constraints stayed at 1, which shows that the learner is able to detect the super-additive constraint given the input data and let b rise only for that constraint.
With these weights and powers, the grammar reproduced the observed pattern closely.
The fact that one violation of *T+T N has no tensification-blocking effect is reflected in the very low weight of *T+T N (w = 0.2). The large disparity between one and two violations of *T+T N is reflected in its exponent of 4.5. With these parameters, the weighted grammar correctly captures the super-additive counting cumulativity of *T+T N : one violation of *T+T N (w = 0.2) is outweighed by the tensification-triggering constraint (w = 0.7), which in turn is far outweighed by multiple violations of *T+T N (0.2 × 2^4.5 ≈ 4.5).

Summary
In this section, I investigated the super-additivity of laryngeally marked consonants in Korean compound tensification, a process whereby the initial onset of W B can be tensified when compounded. The tensification likelihood of a given compound can be partially predicted by a number of phonological factors, one of which is the presence of other marked consonants, motivated by laryngeal co-occurrence restrictions. Co-occurrences of marked consonants are tolerated differently in different morphological domains (word and compound): whereas a single laryngeally marked consonant in W B blocks tensification, there must be two marked consonants in W A to block the process. The observed OCP patterns were captured by multiple OCP constraints that are sensitive to the position of the word boundary and to the input-output mapping, one of which (*T+T N ) was super-additive and therefore responsible for the super-additive cumulativity of marked consonants in W A . The Power Function MaxEnt Learner was able to detect this constraint as super-additive and adjusted its parameters, successfully matching the frequencies of the input data.

Case study 2: nasals in Japanese Rendaku
As another case of super-additive counting cumulativity, nasals in Japanese Rendaku are investigated in this section. I provide a brief introduction to the process in §4.1. I describe the compound databases that I use and report some tendencies found in the data in §4.2. In §4.3, I give a phonological analysis of the observed pattern using OCP constraints that operate on different natural classes, one of which cumulates super-additively. In §4.4, I use the Power Function MaxEnt Learner to fit the Japanese data. I summarize the section in §4.5.

Background: Rendaku and Lyman's Law
In Rendaku, the initial voiceless obstruent of the second element of a compound becomes voiced, as in /kaza/ 'wind' + /kuruma/ 'car' → [kazaguruma] 'windmill'. The process is blocked when the second element already contains a voiced obstruent, a restriction known as Lyman's Law (Lyman 1894), which states that a stem must not contain more than one voiced obstruent. This pattern has been formally accounted for through the OCP effect, i.e., co-occurrence restrictions on the voice feature (Itô & Mester 1986).

Corpus study
I employ two complementary compound databases to investigate the phonological restrictions on Rendaku. The main database, Irwin et al. (2020), is a collection of all 35,328 Rendaku candidate compounds found in two major dictionaries, Kenkyūsha (Watanabe et al. 2008) and Kōjien (Shinmura 2008). Here, Rendaku candidate compounds are compounds whose W B begins with a voiceless obstruent and does not include a voiced obstruent. The database provides the compounds' pronunciations as retrieved from either or both of the two dictionaries; I work with the pronunciation data sourced from the Kenkyūsha dictionary. I included compounds whose pronunciations in Kenkyūsha are specified as either "+", meaning Rendaku is exhibited, or "-", meaning Rendaku is not exhibited. I excluded compounds whose pronunciations are not available in Kenkyūsha or are annotated as "+-", meaning that both Rendaku and non-Rendaku pronunciations are possible, in order to treat Rendaku application as a binary variable.
As I am investigating phonological restrictions on Rendaku, I excluded items that could potentially be affected by other factors that have been reported to dampen Rendaku. Irwin et al. (2020) includes not only noun-noun compounds but also compounds comprising verbs or adjectives; I excluded compounds composed only of verbs, as these rarely undergo Rendaku (Okumura 1955; Vance 2008; Irwin 2012). Irwin et al. (2020) also includes compounds with elements from any of the Japanese lexical strata: Yamato (native), Sino-Japanese, or foreign. I only included compounds whose W B is from the native stratum, as it has been reported that a W B from either the Sino-Japanese or the foreign stratum can resist Rendaku (Vance 2007; Irwin 2005; 2011). It has further been noted that Rendaku is only applicable to elements on the rightmost branch of the morphological tree (the Right Branching Condition; Itô & Mester 1986; Otsu 1980). I included only compounds where W B is written with exactly one Kanji character in order to eliminate the effect of the Right Branching Condition. For example, compounds with /kanamono/ (金物) 'hardware' or /tatemono/ (建物) 'building' as the second noun were excluded.
This is not a perfect metric for determining the monomorphemicity of a word, because there are simplex words that can be written with two Kanji characters, as Tanaka (2017) points out, but it does guarantee that W B is monomorphemic and that the initial segment of W B belongs to the right branch of the compound.
I also excluded coordinate compounds, as they are known to systematically avoid Rendaku (Okumura 1955). Lastly, I only included compounds whose W B is three moras or longer. This process resulted in an ideal dataset for investigating phonological restrictions on Rendaku: 2,130 items, of which 1,772 (83%) underwent Rendaku.
Since compounds with a voiced obstruent in W B are excluded from the main database, Irwin et al. (2020), by design, I obtained another database: Rosen (2018). This database consists of 645 Yamato compounds with an initial voiceless obstruent in W B whose total moraic count does not exceed four. Rosen (2018) and the equivalent subset of Irwin et al. (2020), restricted to the 4,519 Yamato noun-noun compounds that fall within the same length limit (<5μ), are highly comparable in terms of Rendaku probabilities under different phonological conditions: (i) no voiced obstruent in W B : 78% vs. 75%; (ii) no nasal in W B : 75% vs. 72%; (iii) one nasal in W B : 85% vs. 87%; and Rosen (2018) includes no compounds with two nasals in W B . Therefore, I use the counts and Rendaku probability obtained from Rosen (2018) for the category of compounds with one voiced obstruent in W B . With these items included in the extracted Irwin et al. (2020) dataset, the overall Rendaku probability shows a slight decrease (79%; 1,772 out of 2,246), as none of the 116 added items undergo Rendaku.
The Rendaku probability depends on the type and number of context consonants, as summarized in Table 3. First, the most influential factor in Rendaku, the effect of Lyman's Law, is confirmed in the corpus data: the presence of one voiced obstruent in W B categorically blocks Rendaku. Unlike the strong effect of voiced obstruents, it has been reported that the presence of a nasal in W B does not block Rendaku (Rice 2005), as in /kaza/ 'wind' + /kuruma/ 'car' → [kazaguruma] 'windmill'. This traditional observation is confirmed in the corpus data: for the 2,129 compounds that do not contain any voiced obstruents, the Rendaku rate with a single nasal differs from that with no nasals by only 4.4%. However, two or three nasals in W B can heavily dampen Rendaku, as in /touzoku/ 'thief' + /kamome/ 'seagull' → [touzokukamome] 'pomarine jaeger (a type of seabird)' and /hito/ 'one' + /tsumami/ 'knob' → [hitotsumami] 'easy victory'. If we regard a nasal consonant as a Rendaku blocker, the effect of two blockers goes beyond merely doubling the effect of one: two nasals in W B cumulate in a super-additive manner and block the application of Rendaku.

A reviewer asked whether two liquids or two approximants can also contribute to blocking Rendaku. The current dataset does not allow a clear investigation of these effects due to the lack of relevant data, but it is likely that they exist. First, of the 51 compounds with two /r/s in W B , only 7 (14%) undergo voicing; but all of these examples can be explained away as either having a verb as W B or having a tautomorphemic W B , violating the Right Branching Condition. Second, while there were no compounds with two approximants in W B in the current database, the recent nonce-word study by Kawahara & Kumagai (2021) found a significant super-additive cumulativity of two approximants in blocking Rendaku. I focus on nasals in this paper because of the insufficient data and leave the effects of other sonorants to future study.

To assess the significance of the observed nasal effects, I constructed a logistic regression model using the glm function in R (R Core Team 2013). The occurrence of Rendaku was the dependent variable (binary: no Rendaku (ref), Rendaku) and the number of nasals in W B (backward difference coded: zero, one, two or more) was the independent variable. The result is shown in Table 4. In the model, a negative coefficient means that the factor contributes to dampening Rendaku, and a larger absolute value means that the effect is stronger. As can be seen in Table 4, the presence of a single nasal significantly reduces the likelihood of Rendaku application, and the presence of two nasals exerts an even bigger effect in reducing Rendaku probability. Note that this regression model is based on the extracted database (2,130 items), in which the potential effects of known confounding factors, such as the Right Branching Condition, are removed. I ran another regression model with a larger database, which includes all 2,130 compounds used in the previous model as well as the 312 compounds that had been excluded due to their polymorphemic W B (2,442 items in total). The model is reported in Table 5. As can be seen, the effect of the Right Branching Condition is clearly present, which is compatible with the traditional description of Rendaku: Rendaku is more likely with a monomorphemic W B . However, this model also reveals that the nasal effects remain significant even when the Right Branching Condition is included. Moreover, model comparison using the anova() function showed that adding the number of nasals in W B makes a significant improvement to the fit (χ²(2) = 27.86, p < 0.001). This confirms that the Rendaku-dampening effect of two nasals in W B is robust in the Japanese lexicon.

Kumagai (2017a) was the first study to report a significant Rendaku-blocking effect of a sequence of nasals, discovered through a nonce word experiment. A larger nonce word experiment was recently carried out by Kawahara & Kumagai (2021), re-examining the Rendaku-blocking effect of two nasals. The general trend in their results is compatible with the nonce-word study of Kumagai (2017a) and with my corpus data: Rendaku probability was indeed lowered by two nasals but not by one. However, the effect of two nasals was not statistically significant in their experiment. Hopefully future research will explain why the nasal effect did not reach significance in the experiments of Kawahara and Kumagai.

Analysis: similarity avoidance and counting cumulativity
In this section, I first introduce constraints that are responsible for the occurrence of Rendaku.
Then, I give a formal analysis of the patterns described in §4.2. I explain why a single nasal still tolerates the application of Rendaku whereas a single voiced obstruent or multiple nasals significantly lower the Rendaku probability, by connecting the pattern to previously established similarity avoidance effects in Japanese: higher similarity between surface segments is more strongly avoided. I define OCP constraints that operate on different natural classes, one of which is responsible for the super-additive cumulativity of nasals.
As in the Korean compound tensification case, it is assumed that a compound juncture marker [+voice] is inserted between the two nouns when they form a compound. The constraint RealizeMorpheme requires the [+voice] feature to be realized (Kurisu 2001), by association to the initial onset consonant of W B , whereas Ident (voice) blocks voicing.
The resistance to Rendaku increases if the Rendaku process results in a co-occurrence of more similar segments. The effect of similarity avoidance on Rendaku applicability has been frequently brought up in the literature. Kawahara & Sano (2014b) showed through a nonce word experiment that Rendaku is both blocked and triggered by an output constraint that bans a co-occurrence of two identical CV morae: Rendaku is less likely if it creates total identity of two consecutive morae over the boundary (/iga/+/kaniro/ → [igakaniro], *[igaganiro]) and more likely if the input contains two identical CV morae (/ika/+/kaniro/ → [ikaganiro], *[ikakaniro]). By the same token, Kawahara & Sano (2014a) showed that Rendaku is more likely to be blocked if it results in two consecutive identical CV morae in W B (W A +/tadanu/ → [W A tadanu], *[W A dadanu]). Thus, the application of Rendaku is a way of either resolving or preventing identity at the moraic level.
Moreover, there is a growing body of experimental studies showing that the similarity avoidance effect is gradient and that higher similarity tends to be more strongly avoided, e.g., Sano (2013) on identity avoidance in Japanese verbal inflection and Kumagai (2017b) on Rendaku. In order to capture the different Rendaku-blocking effects of nasals and voiced obstruents, I employ OCP constraints defined over two sets of segments: voiced consonants and voiced obstruents. In line with the gradient similarity-avoidance effects, the strength of these constraints depends on the homogeneity of the natural class each constraint is defined over: voiced obstruents ([bdzg]) are more homogeneous than voiced consonants ([bdgzmnjwr]), and the OCP constraint on voiced obstruents is the stronger one.
First, the OCP constraint that bans a co-occurrence of two voiced consonants is defined in (22). In the constraint definition, C̬ represents any voiced consonant ([+voice, +consonantal]). This constraint is responsible for blocking Rendaku application when there is a nasal in W B . The weights computed by the MaxEnt Grammar Tool (Hayes et al. 2009) are shown in the sample tableaux (23)-(24). Throughout this section, tableaux with abstract inputs are presented in order to show the overall trend in the lexicon rather than the behavior of individual items. Since this paper only focuses on the effects of W B consonants on Rendaku, the first element of the compound is represented as an abstract form W A . The second element is represented by three consecutive CV moraic units, or light syllables (e.g., CVCVCV). For simplicity, syllables with a voiced obstruent are represented by D, syllables with a nasal by N, and syllables with neither by σ.
(22) *+C̬ C̬ : Assign a violation mark for every subsequence of a word boundary followed by a voiced consonant followed by another voiced consonant.

Unlike the Korean case, where the OCP constraint needed to be sensitive to whether the offending structure is derived, Japanese does not require *+C̬ C̬ to be separated into *+C̬ C̬ N and *+C̬ C̬ O , because *+C̬ C̬ is consistently inert in both dynamic alternations and static phonotactic restrictions of Japanese. As we just observed, a single nasal in W B does not inhibit Rendaku, which implies that *+C̬ C̬ is not powerful enough to work against the morphophonological alternation. Moreover, it is violated fairly frequently in existing words, as there are stems where two nasals freely co-occur (/mono/ 'object').
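Because (22) is stated over subsequences, every pair of voiced consonants inside W B counts. The sketch below makes this concrete over romanized segments; the segment inventory and the Rendaku-applied form [gamome] (a hypothetical voicing of /kamome/ 'seagull') are simplifications for illustration.

```python
from itertools import combinations

# Romanized stand-ins for the [+voice, +consonantal] class in (22).
VOICED = set("bdgzmnjwr")

def count_occ_voiced(wb):
    """*+C̬C̬ violations: every pair of voiced consonants following
    the boundary, counted as subsequences."""
    voiced = [c for c in wb if c in VOICED]
    return len(list(combinations(voiced, 2)))

# A W_B with two nasals that undergoes Rendaku contains a voiced
# obstruent plus two nasals: three voiced-voiced pairs.
print(count_occ_voiced("gamome"))  # 3
print(count_occ_voiced("kamome"))  # 1 (the two nasals alone)
```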

Learning super-additive counting cumulativity
In this section, I present the results of a learning simulation on the Japanese data to show that incorporating the power function into violation assessment enables the grammar to accurately match the frequencies of data with super-additive counting cumulativity. I also show that the Power Function MaxEnt Learner is able to detect the super-additive constraint and raise the b value only for that constraint, while letting b remain at 1 for the others.

As introduced in §4.3, RealizeMorpheme and Ident (voice) are the constraints responsible for the application of Rendaku, and the different Rendaku-blocking effects of nasals and voiced obstruents in W B are captured by a set of OCP constraints: *+C̬ C̬ and *+DD. The Japanese database was fitted using the Power Function MaxEnt Learner. The weights and exponents of the constraints were initialized and updated in the same way as in the Korean simulation. The learner converged after 13,280 iterations. The resulting set of weights and exponents, as well as the fit of the grammar to the data, are reported in (28). As mentioned above, tableau (28) shows only the phonological contexts of W B . To reiterate, D represents a light syllable with a voiced obstruent, N represents a light syllable with a nasal consonant, and σ represents a light syllable with neither. As in the Korean case study, other phonological factors, such as the position of Rendaku-blocking segments, are disregarded here. Thus, for example, input (28)-b represents cases where a trimoraic/trisyllabic W B underlyingly has a single nasal consonant.

The Power Function MaxEnt Learner successfully detected the super-additive constraint, *+C̬ C̬ , and raised the exponent for that constraint only. The adjusted exponents and the weighted constraints correctly capture the crucial aspects of the observed Rendaku patterns. First, the nearly inviolable restriction of Lyman's Law is reflected in the high weight of *+DD. Second, RealizeMorpheme (w = 1.8) outweighs a single violation of *+C̬ C̬ (w = 0.3), which reflects the observation that a single nasal does not severely reduce Rendaku application. In turn, three violations of *+C̬ C̬ (w = 0.3 × 3^1.6 ≈ 2), which are incurred by a W B with two nasals undergoing Rendaku, outweigh RealizeMorpheme, correctly capturing the stronger Rendaku-dampening effect of multiple nasals. With these parameters, the grammar successfully reproduced the observed frequency distributions.

Summary
In this section, the effects of nasals on blocking Rendaku were examined. The presence of one voiced obstruent in the second element significantly blocks voicing due to Lyman's Law (Lyman 1894), which states that stems must not contain more than one voiced obstruent. Nasals show super-additive counting cumulativity in blocking the voicing process: whereas one nasal in the second noun only slightly lowers the likelihood of Rendaku, two nasals cumulate in a super-additive manner and significantly dampen Rendaku application. The observed patterns were attributed to a gradient similarity-avoidance effect in Japanese, formally analyzed with a set of OCP constraints that operate on different natural classes with varying degrees of counting cumulativity. I presented a learning simulation on the quantitative data and showed that the Power Function MaxEnt Learner successfully detected the super-additive constraint and matched the input distributions.

Conclusions
This paper investigated a type of cumulativity observable in natural languages, called super-additive counting cumulativity, where multiple violations of a weaker constraint go beyond a mere multiplication of its own severity. I demonstrated through a toy example that super-additivity in counting cumulativity cannot be entirely captured by the traditional MaxEnt grammar, where violations are assessed linearly. Instead, a slight modification was made to the conventional violation assessment method to accommodate this case: a power function f(a) = a^b was applied to the number of violations in order to scale up the penalty (b > 1). The advantage of the power function model is best recognized when compared to the alternative, a constraint family model: whereas a constraint family employs a set of self-conjoined constraints that can all be independently and arbitrarily weighted, the power function model internalizes counting cumulativity through the mathematics itself, in which the penalties are guaranteed to increase strictly and monotonically with a larger number of violations.
I reported two natural language examples that show super-additive counting cumulativity.
In compounding processes of Korean and Japanese, the initial onset of the second element undergoes an alternation. The presence of marked consonants inhibits this alternation, via OCP constraints, in a super-additive manner: the presence of one marked consonant has only negligible effects, whereas the presence of two significantly dampens the alternation process.
The negligible effect of one violation was analyzed in Korean as a morpheme-internal restriction being weakened over a morpheme boundary (e.g., w(*+TT) > w(*T+T)), and in Japanese as a weaker co-occurrence restriction on a less homogeneous natural class (e.g., w(*+DD) > w(*+C̬ C̬ )). In the analysis, I associated these weaker constraints with super-additivity, which is why multiple violations of them can block the process.
This paper has also provided an implementation of the modified MaxEnt model which learns the exponents and weights of the constraints. Learning simulations on quantitative data obtained from a survey and a corpus showed that the implemented MaxEnt learner successfully detected super-additive constraints and raised the exponents only for those constraints. With the adjusted parameters, the MaxEnt model was able to reproduce the probabilistic candidate distributions.

Table 1: Tensification rate according to the type/number of consonant in W A and W B .

Table 2: Regression model for the Korean survey results.


Table 3: Rendaku rate according to the type/number of consonant in W B .

Table 4: Regression model for the Rendaku database.


Table 5: Regression model with the Right Branching Condition as an extra predictor.

Table 6 illustrates how the presence of a single nasal, two nasals, and a voiced obstruent in W B differently contribute to the overall similarity between consonants.

Table 6: Similarity of hypothetical Rendaku-applied W B forms (Kawahara & Sano 2014b).