MaxEnt grammar is a probabilistic version of Harmonic Grammar in which the harmony scores of candidates are mapped onto probabilities. It has become the tool of choice for analyzing phonological phenomena involving probabilistic variation or gradient acceptability, but there is a competing proposal for making Harmonic Grammar probabilistic, Noisy Harmonic Grammar, in which variation is derived by adding random ‘noise’ to constraint weights. In this paper these grammar frameworks, and variants of them, are analyzed by reformulating them all in a format where noise is added to candidate harmonies, and the differences between frameworks lie in the distribution of this noise. This analysis reveals a basic difference between the models: in MaxEnt the relative probabilities of two candidates depend only on the difference in their harmony scores, whereas in Noisy Harmonic Grammar they also depend on the differences in the constraint violations incurred by the two candidates. This difference leads to testable predictions which are evaluated against data on variable realization of schwa in French (
Stochastic phonological grammars assign probabilities to outputs, making it possible to analyze variation and gradient acceptability in phonology. While phonological variation has long been a central concern in sociolinguistics (e.g.
MaxEnt grammar is currently the most widely used framework for stochastic phonological grammars. It is based on Harmonic Grammar (
MaxEnt grammar and NHG at least superficially involve very different approaches to making Harmonic Grammar stochastic: MaxEnt takes the harmony scores assigned by a Harmonic Grammar and maps them onto probabilities, while NHG derives variation by adding random ‘noise’ to constraint weights. Given this difference we would expect these frameworks to be empirically distinguishable, but while previous work has demonstrated distinct predictions of the two frameworks (
The approach adopted here is to identify a uniform framework for analyzing and comparing stochastic Harmonic Grammars. We then use analysis based on this uniform framework to draw out distinct predictions of MaxEnt and NHG, and test these predictions against data on variable realization of schwa in French (
In the uniform framework for stochastic Harmonic Grammars proposed here, Harmonic Grammar is made stochastic by adding random noise to the harmony scores of candidates, then selecting the candidate with the highest harmony. We will see that the difference between MaxEnt and NHG lies in the distribution of the noise: independent Gumbel noise in MaxEnt grammar, and normal noise that can be correlated between candidates in NHG. Variants of these grammar formalisms can easily be accommodated in the same framework, such as a variant of MaxEnt grammar with independent normal noise.
Given this formulation of stochastic Harmonic Grammars, the probability of a candidate being selected is the probability that its harmony is higher than that of any other candidate. This probability depends on the distribution of the noise added to candidate harmonies, so the relationship between harmony and candidate probabilities differs between stochastic Harmonic Grammar frameworks. Specifically, it will be shown that in MaxEnt the relative probabilities of two candidates depend only on the difference in their harmonies, whereas in NHG they also depend on the pattern of their constraint violations. This difference leads to testable predictions concerning the effects on the probabilities of candidates of adding or subtracting constraint violations from a tableau. We will see that testing these predictions against data from Smith & Pater’s (
In the next section we review Harmonic Grammar, and the two dominant proposals for making Harmonic Grammar stochastic, MaxEnt and NHG.
Harmonic Grammar (
The mechanics of Harmonic Grammar are illustrated by the tableau in (1). The constraint weights are shown in the top row of the tableau. Constraints assign violations, as in OT, but the violations are negative integers, representing the number of times that the relevant candidate violates the constraint. Candidates are compared in terms of the sum of their weighted constraint violations, or harmony score. The harmony score of a candidate
(1)
Harmonic Grammar tableau
(2)
As can be seen from this example, Harmonic Grammar is deterministic, like standard Optimality Theory. That is, each input is mapped onto a single output, the optimal candidate for that input, so Harmonic Grammar must be modified to be able to assign probabilities to candidates. We turn now to the two main proposals for making Harmonic Grammar probabilistic.
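The deterministic evaluation just described can be sketched in a few lines of Python. The weights and violation profiles below are illustrative, not the values from tableau (1); the point is only the mechanics: harmony is the sum of weighted (negative) violation counts, and the candidate with the highest harmony wins.

```python
# Deterministic Harmonic Grammar: the winner is the candidate with the
# highest harmony (sum of weighted violation counts, violations negative).
# Weights and violation profiles are illustrative, not from tableau (1).

weights = {"ConstraintA": 3.0, "ConstraintB": 2.0}

# candidate -> {constraint: violation count (negative integers)}
candidates = {
    "cand-a": {"ConstraintA": 0, "ConstraintB": -1},
    "cand-b": {"ConstraintA": -1, "ConstraintB": 0},
    "cand-c": {"ConstraintA": -1, "ConstraintB": -1},
}

def harmony(violations, weights):
    # harmony = sum over constraints of weight * violation count
    return sum(weights[c] * v for c, v in violations.items())

scores = {name: harmony(v, weights) for name, v in candidates.items()}
winner = max(scores, key=scores.get)
```

With these invented numbers, cand-a has harmony −2, cand-b −3 and cand-c −5, so cand-a is selected deterministically.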
As observed in the introduction, the two main approaches to making Harmonic Grammar probabilistic are Noisy Harmonic Grammar and Maximum Entropy Grammar.
In Noisy Harmonic Grammar (NHG), Harmonic Grammar is made stochastic by adding random ‘noise’ to each constraint weight at each evaluation. As a result, even with a fixed input, the harmony of a given candidate varies each time we derive an output, so different candidates can win on different occasions.
In Boersma & Pater (
The probability of a candidate being selected as the output is the probability that it has higher harmony than all the other candidates. These probabilities are recorded in the last column of (3), headed
(3)
Noisy Harmonic Grammar tableau
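NHG evaluation can be approximated by Monte Carlo simulation: sample fresh noise on the weights at each evaluation and record how often each candidate wins. The weights, violations, and noise s.d. (1.0, a common default) below are illustrative rather than taken from tableau (3).

```python
import random

# Noisy Harmonic Grammar, sketched by simulation: at each evaluation,
# independent normal noise is added to every constraint weight before
# harmonies are computed.  Weights and violations are illustrative.

random.seed(1)
weights = [3.0, 2.0]                       # one weight per constraint
candidates = {"a": [0, -1], "b": [-1, 0]}  # violation vectors per candidate

def sample_winner(weights, candidates, sd=1.0):
    noisy = [w + random.gauss(0.0, sd) for w in weights]
    scores = {name: sum(nw * v for nw, v in zip(noisy, viols))
              for name, viols in candidates.items()}
    return max(scores, key=scores.get)

n = 20000
wins = sum(sample_winner(weights, candidates) == "a" for _ in range(n))
p_a = wins / n   # estimated probability of candidate (a)
```

Here candidate (a) wins whenever the noisy harmony difference favors it; analytically that probability is Φ(1/√2) ≈ 0.76, and the simulation estimate converges on that value.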
Maximum Entropy Grammar is also a stochastic form of Harmonic Grammar, but adopts what appears to be a very different mechanism from Noisy Harmonic Grammar, directly mapping candidate harmonies onto probabilities (
(4)
Probability of candidate
The formula in (4) can be understood as asserting that the probability of a candidate is proportional to the exponential of its harmony,
(5)
Maximum Entropy Grammar tableau
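The MaxEnt mapping in (4) is just a softmax over harmonies. The harmonies below are invented for illustration; note that candidates with equal harmony necessarily receive equal probability, unlike in NHG.

```python
import math

# MaxEnt maps harmonies directly to probabilities:
# p(x) = exp(H(x)) / sum over y of exp(H(y)).  Harmonies are illustrative.

harmonies = {"a": -2.0, "b": -5.0, "c": -5.0}

z = sum(math.exp(h) for h in harmonies.values())   # normalizing constant
probs = {name: math.exp(h) / z for name, h in harmonies.items()}
```

With these numbers, candidates (b) and (c) get exactly equal probability because their harmonies are equal, and the probabilities sum to 1 by construction.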
Comparing the tableaux in (3) and (5), it can be seen that NHG and MaxEnt can yield different probabilities when applied to the same HG tableau. Of course, both assign the highest probability to the candidate with the highest harmony, candidate (a), but MaxEnt assigns equal probability to candidates (b) and (c) because they have equal harmony, while tableau (3) shows that the relationship between harmony and probability is less straightforward in NHG, because it assigns a higher probability to candidate (b) than to candidate (c).
This comparison shows that these two proposals for stochastic versions of Harmonic Grammar make different predictions. The goal of the paper is to draw out these differences so they can be tested against data. The strategy we adopt in comparing and contrasting these models is to reformulate them in a common framework. This helps to clarify their similarities and differences, and situates them within a broader space of stochastic Harmonic Grammar models. The common framework we use to characterize these models is that of Random Utility Models, a type of model that is widely used to model choice between discrete alternatives in economics (e.g.
The common format we will use to analyze and compare stochastic Harmonic Grammars is one in which the harmony of candidate
It is straightforward to map NHG onto this structure: Although we described the random noise as being added to the constraint weights rather than to the harmonies of each candidate, the resulting harmony expression can be separated into fixed and random parts,
It is less obvious that MaxEnt Grammar can be reformulated as a Random Utility Model, but it is a basic result in the analysis of these models that the MaxEnt equation (4) follows from a Random Utility Model where the
Probability density functions of the Gumbel (solid) and normal (dashed) distributions.
Thus NHG and MaxEnt can be analyzed as adopting the same basic strategy for making Harmonic Grammar stochastic: add random noise to the harmony of each candidate. The difference between the models lies in the nature of the noise that is added to candidate harmonies: In MaxEnt the noise terms are drawn from identical Gumbel distributions, whereas in NHG, the noise terms are drawn from normal distributions whose variance depends on the number of constraint violations.
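This equivalence can be checked numerically: adding independent standard Gumbel noise to each harmony and taking the argmax yields choice probabilities that converge to the MaxEnt softmax. The harmonies below are illustrative; the Gumbel samples are drawn by inverse-CDF sampling.

```python
import math
import random

# Random Utility Model view of MaxEnt: add independent standard Gumbel
# noise to each candidate's harmony and pick the argmax; the resulting
# choice probabilities match the softmax of the harmonies.

random.seed(2)
harmonies = {"a": -2.0, "b": -5.0, "c": -5.0}   # illustrative values

def gumbel():
    # standard Gumbel via inverse CDF: G = -log(-log(U)), U ~ Uniform(0,1)
    return -math.log(-math.log(random.random()))

n = 50000
wins = {name: 0 for name in harmonies}
for _ in range(n):
    noisy = {name: h + gumbel() for name, h in harmonies.items()}
    wins[max(noisy, key=noisy.get)] += 1

p_hat = {name: w / n for name, w in wins.items()}   # simulated probabilities

z = sum(math.exp(h) for h in harmonies.values())
p_maxent = {name: math.exp(h) / z for name, h in harmonies.items()}
```

The simulated probability of candidate (a) agrees with its softmax probability (≈ 0.91 with these numbers) up to sampling error.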
This analysis suggests a space of possibilities for stochastic Harmonic Grammars differentiated by the nature of the noise that is added to candidates’ harmonies. An obvious third candidate to consider is one which is like MaxEnt in that the noise terms are independent and drawn from identical distributions, but the distribution is the more familiar normal distribution, in place of the Gumbel distribution (
The Random Utility Model formulation of stochastic Harmonic Grammars provides the basis for a general analysis of the relationship between harmony and candidate probabilities in these frameworks. We will see that the testable differences between stochastic Harmonic Grammar frameworks follow from differences in this relationship. Specifically, in MaxEnt the relative probabilities of two candidates depend only on the difference in their harmonies, whereas in NHG the relative probabilities of candidates also depend on the pattern of their constraint violations. We turn to this analysis next.
The building block for a general analysis of the relationship between the harmonies and probabilities of candidates is an analysis of the competition between two candidates. Given two candidates,
(6)
This situation is illustrated in
The probability density function of the noise difference
The probability of a random variable having a value below some threshold is given by the cumulative distribution function of that variable, so
(7)
We will see that it is also useful to be able to express the harmony difference between candidates as a function of candidate probabilities. This is achieved by applying the inverse of
(8)
The differences between varieties of stochastic HG lie in the nature of the cumulative distribution function,
Density functions of the normal (solid) and logistic (dashed) distributions.
The logistic cumulative distribution function
(9)
(10)
Note that the expression in (9) looks different from the formula we derive by applying the usual MaxEnt probability formula in (4) to the case of two candidates (11), but the two are in fact equivalent: (11) is derived from (9) by multiplying its numerator and denominator by
(11)
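The equivalence of the two-candidate softmax and the logistic function of the harmony difference is easy to verify numerically. The harmonies below are illustrative.

```python
import math

# For two candidates, the MaxEnt probability of x over y can be written
# either as the softmax over the two harmonies, or as the logistic
# function of the harmony difference d = H(x) - H(y).

h_x, h_y = -2.0, -3.5          # illustrative harmonies
d = h_x - h_y                  # harmony difference, here 1.5

p_softmax  = math.exp(h_x) / (math.exp(h_x) + math.exp(h_y))
p_logistic = 1.0 / (1.0 + math.exp(-d))
```

The two expressions give identical values (≈ 0.82 here), which is the sense in which (9) and (11) are the same formula.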
The relationship between harmony and probability is more complicated in NHG. As discussed above, in NHG noise is added to constraint weights, so the
(12)
Noisy Harmonic Grammar tableau
The variance of the
(13)
The variance of the difference between
(14)
Variance of
Since the noise difference
(15)
The inverse of the normal cumulative distribution function, Φ⁻¹, is called the probit function. If we apply this function to both sides of (15), we obtain (16).
(16)
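The two-candidate NHG relation can be computed directly: the probability of candidate x is the normal CDF of the harmony difference divided by the standard deviation of the noise difference, where (on the assumption of independent normal noise with s.d. σ on each weight) that standard deviation is σ times the square root of the sum of squared violation differences. All numbers below are illustrative.

```python
import math
from statistics import NormalDist

# Two-candidate NHG: p(x) = Phi(d / sigma_d), where d is the harmony
# difference and sigma_d**2 = sigma**2 * (sum of squared violation
# differences), one term per constraint.  Numbers are illustrative.

sigma = 1.0
viol_diff = [-1, 1, 1]     # violations of x minus violations of y
weights = [2.0, 1.0, 0.5]  # constraint weights

d = sum(w * v for w, v in zip(weights, viol_diff))          # harmony difference
sigma_d = sigma * math.sqrt(sum(v * v for v in viol_diff))  # s.d. of noise diff

p_x = NormalDist().cdf(d / sigma_d)
```

With these numbers d = −0.5 and σ_d = √3, giving p(x) ≈ 0.39; changing a violation difference changes both d and σ_d, which is the crux of the contrast with MaxEnt developed below.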
Comparing (10) and (16), the expressions relating candidate probabilities to candidate harmonies in MaxEnt and NHG, respectively, we can see that in both cases the relative probabilities of two candidates depend on the difference in their harmonies,
The second difference is the more important – we will show that it leads to testable predictions regarding the effects on the relative probabilities of a pair of candidates of changing some of their constraint violations. The first difference is relatively minor because the logit and probit functions are very similar (
The logit (dashed line) and probit (solid line) functions.
In MaxEnt with normal noise the
(17)
(18)
We will see that normal MaxEnt is similar to MaxEnt, as expected, but the difference between the logit and probit functions at probabilities close to 0 and 1 is large enough to result in significant differences in fits to data. We will also see in section 9 that there are further differences that become apparent in cases where there are three or more variant realizations for a given input, but those predictions will not be tested here since we do not have relevant data.
Before we draw out the predictions that follow from this analysis of MaxEnt and NHG, it is important to clarify that we have only analyzed the relationship between harmony and probability for a pair of candidates. We will see in section 9 that this analysis provides the building blocks for calculating the probability of a candidate winning over any number of competitors, but we defer that discussion because the current analysis is sufficient for tableaux in which only two candidates have probabilities significantly greater than zero, and that is true of our test case, which concerns the probabilities of forms with and without a schwa. Even if a tableau contains many candidates, if all but two of those have sufficient constraint violations that their probability is effectively zero, then all of the probability mass is divided between the two remaining candidates, and the other candidates are irrelevant to the calculation of their probabilities.
At this point it is useful to introduce the data that will be used to test the distinct predictions made by MaxEnt and NHG so we can use those data to exemplify the predictions. The data are from an experiment reported in Smith & Pater (
(19)
Environments for schwa realization studied by Smith & Pater (
Context | clitic-final /ə/ | word-final /∅/
---|---|---
C_σ́ | eva t(ə) 'ʃɔk | yn bɔt(ə) 'ʒon
CC_σ́ | mɔʁiz t(ə) 'sit | yn vɛst(ə) 'ʒon
C_σσ́ | eva t(ə) ʃɔ'kɛ | yn bɔt(ə) ʃin'waz
CC_σσ́ | mɔʁiz t(ə) si'tɛ | yn vɛst(ə) ʃin'waz
Smith & Pater’s analysis of schwa realization in these contexts involves the constraints in (20)–(23), together with M
(20) | N | Assign one violation for every [ə] in the output.
(21) | *CCC | Assign one violation for every sequence of three consonants.
(22) | *C | Assign one violation for every sequence of two or more consonants.
(23) | *C | Assign one violation for every two adjacent stressed syllables.
The most general constraints in Smith & Pater’s analysis are *C
Final schwa is realized more frequently in clitics than in full words. Smith & Pater analyze this difference as following from the schwa being underlying in the clitic, whereas it is epenthetic in word-final position, consequently M
Schwa is realized more frequently in the context CC_C, where it is preceded by two consonants, than in C_C, where it is preceded by only one. This is attributed to a constraint *CCC, which penalizes the triconsonantal cluster that results from non-realization of schwa in the former context ((28)
Schwa is also realized more frequently when the following word is a monosyllable rather than a disyllable (bɔtə'ʒon > bɔtəʃin'waz). Smith & Pater attribute this to clash avoidance: since stress falls on the last non-schwa vowel in a word, non-realization of schwa results in adjacent stressed syllables when the following word is a monosyllable (['bɔt'ʒon]), but not if it is a disyllable (or longer) (['bɔtʃin'waz]), e.g. (26) vs. (25). Adjacent stressed syllables are penalized by *C
(24)
/ə/, C_σσ́
(25)
/∅/, C_σσ́
(26)
/∅/, C_σ́
(27)
/∅/, CC_σσ́
(28)
/∅/, CC_σ́
Smith & Pater’s data set provides a good testing ground for distinguishing MaxEnt from NHG because the factorial design of the experiment allows us to compare many pairs of tableaux that differ minimally in their constraint violations, and these frameworks make distinct predictions concerning the relationship between candidate probabilities across such pairs of tableaux, as is shown in the next section.
Our starting point for analysis follows Smith & Pater, but we will also consider variants of their analysis. In particular we consider analyses that eliminate redundancies from their constraint set. For example, Smith & Pater follow standard practice in positing separate D
These alternative constraint sets encompass some competing analyses of the distribution of schwa in French. For example, it has been argued that schwa at clitic boundaries is epenthetic just like schwa at word boundaries (e.g.
We will now use Smith & Pater’s data and analysis to illustrate the implications of the difference between MaxEnt and NHG demonstrated in section 5.
The implications of the difference between MaxEnt and NHG can be seen by considering the effect of adding or subtracting constraint violations from a tableau. We will see that in MaxEnt a given change in constraint violations always has the same effect on logit(
Consider a tableau with two candidates
We can see how this analysis applies to the French schwa data by considering pairs of tableaux such as those in (24)–(28), above. We will refer to the candidates in these tableaux as the ə candidate and the ∅ candidate. The pairs of tableaux (25)–(26) and (27)–(28) are identical except that in the second tableau of each pair, the ∅ candidate incurs an additional violation of *C
The relevant information in these tableaux is more succinctly represented in a difference tableau, which records the constraint violations of the ə candidate minus those of the ∅ candidate. For example, the difference tableaux corresponding to (25) and (26) are shown in (29) and (30), with illustrative constraint weights. The harmony difference
(29)
/∅/, C_σσ́
(30)
/∅/, C_σ́
The same reasoning generalizes to pairs of tableaux that each differ in violations of a set of constraints. For example, pairs of tableaux for words and clitics in the same context differ in violations of both M
(31)
/∅/, C_σ́
(32)
/ə/, C_σ́
(33)
/∅/, C_σσ́
(34)
/ə/, C_σσ́
Smith & Pater’s data set provides several comparisons of these kinds. The table in (35) summarizes the difference tableaux for all eight contexts. It can be seen that four pairs differ by adding 1 to the difference in *C
(35)
Difference tableaux for all contexts
| Context | N | *CCC | *C | M | D | *C
---|---|---|---|---|---|---|---
1 | /∅/, C, _σσ́ | –1 | 0 | 0 | 0 | –1 | +1
2 | /∅/, C, _σ́ | –1 | 0 | +1 | 0 | –1 | +1
3 | /∅/, CC, _σσ́ | –1 | +1 | 0 | 0 | –1 | 0
4 | /∅/, CC, _σ́ | –1 | +1 | +1 | 0 | –1 | 0
5 | /ə/, C, _σσ́ | –1 | 0 | 0 | +1 | 0 | +1
6 | /ə/, C, _σ́ | –1 | 0 | +1 | +1 | 0 | +1
7 | /ə/, CC, _σσ́ | –1 | +1 | 0 | +1 | 0 | 0
8 | /ə/, CC, _σ́ | –1 | +1 | +1 | +1 | 0 | 0
NHG does not predict that adding or subtracting a constraint violation should always have the same effect on candidate probabilities because changing constraint violations alters both the harmony difference
(36)
(37)
Variance of noise difference
If we start from a tableau where the harmony difference is
(38)
Change in probit(
(a)
(b)
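The contrast with MaxEnt can be made concrete: in NHG, probit(p) = d/σ_d, so adding a violation changes both the numerator (by the constraint's weight) and the denominator (by increasing the number of differing violations), and the resulting shift in probit(p) varies with the baseline tableau. All numbers below are illustrative, with the weight-noise s.d. set to 1.

```python
import math

# NHG contrast: adding a violation changes both the harmony difference d
# and the noise s.d. sigma_d, so the shift in probit(p) = d / sigma_d
# depends on the rest of the tableau.  Numbers are illustrative.

def probit_p(d, viol_diffs, sigma=1.0):
    # probit(p) = d / sigma_d, with sigma_d**2 = sigma**2 * sum of
    # squared violation differences (independent normal weight noise)
    sigma_d = sigma * math.sqrt(sum(v * v for v in viol_diffs))
    return d / sigma_d

w = 1.5  # weight of the constraint whose violation is added (invented)

# baseline tableau 1: d = -2.2 with two differing violations
shift1 = probit_p(-2.2 + w, [1, 1, 1]) - probit_p(-2.2, [1, 1])
# baseline tableau 2: d = 0.4 with three differing violations
shift2 = probit_p(0.4 + w, [1, 1, 1, 1]) - probit_p(0.4, [1, 1, 1])
```

Unlike the constant logit shift in MaxEnt, `shift1` and `shift2` differ substantially (≈ 1.15 vs. ≈ 0.72 with these numbers): the effect of the same added violation depends on the baseline harmony difference and violation pattern.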
For example, consider the tableaux in (29) and (30). The harmony difference in (29) is –2.2 and, assuming that the variance of the noise added to constraint weights,
Another pair of tableaux that differ by a single *C
(39)
/∅/, CC_σσ́
(40)
/∅/, CC_σ́
In summary, we have seen that MaxEnt and NHG make distinct predictions concerning the effect on candidate probabilities of adding or subtracting constraint violations. In MaxEnt, the effect on logit(
We will test these predictions against Smith & Pater’s experimental data on the rate of realization of schwa in French. However, before turning to these tests we need to add one more form of stochastic HG to the comparison because many researchers who have adopted NHG have employed a variant of NHG in which the noise added to constraint weights is prevented from making those weights negative, so it is important to analyze the properties of this framework as well.
The final stochastic grammar model that we will consider is a variant of NHG with a non-normal noise distribution. This variant is motivated by a desire to prevent noise from making constraint weights negative. Adding normal noise to a low constraint weight can easily result in a negative weight, which effectively reverses the constraint, favoring the configurations that it is supposed to penalize. It is also necessary to ensure that noise cannot make constraint weights less than or equal to zero in order for harmonically bounded candidates to be assigned zero probability (
Smoothed samples from censored normal distributions, censored at –2 (left) and –0.5 (right).
Summing censored normal random variables results in
One novel feature of this model that turns out to have considerable importance is that the variance of
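Censored NHG no longer yields closed-form candidate probabilities, but it is straightforward to simulate, on the assumption that censoring means flooring each noisy weight at zero (equivalently, censoring the noise on weight w at −w). The weights and violations below are illustrative; the low weight is chosen so that censoring actually bites.

```python
import random

# Censored NHG, sketched: normal noise is added to each weight, but the
# noisy weight is floored at zero so a constraint can never be reversed.
# Weights and violations are illustrative.

random.seed(3)

def noisy_weight(w, sigma=1.0):
    return max(w + random.gauss(0.0, sigma), 0.0)   # censor at zero

def sample_winner(weights, candidates, sigma=1.0):
    noisy = [noisy_weight(w, sigma) for w in weights]
    scores = {name: sum(nw * v for nw, v in zip(noisy, viols))
              for name, viols in candidates.items()}
    return max(scores, key=scores.get)

weights = [0.5, 2.0]                       # low weight -> frequent censoring
candidates = {"x": [0, -1], "y": [-1, 0]}

n = 20000
p_x = sum(sample_winner(weights, candidates) == "x" for _ in range(n)) / n
```

Because censoring truncates the lower tail of the noise on low-weighted constraints, the effective noise contributed by such constraints is reduced, which is the property exploited in the model comparisons below.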
We test the predictions outlined in the previous section against Smith & Pater’s data on the probability of realizing schwa in a variety of contexts in French. First, we ask which predictions are best supported by the data, by comparing the overall fit of grammars in each framework, and by probing how well the specific predictions are supported. However, this process reveals that even the best grammars fail to account for significant patterns observed in the data, motivating the addition of a constraint to the analysis. A second round of comparisons using this revised constraint set leads to the conclusion that the predictions of MaxEnt are best supported, although censored NHG also fits the data well.
We want to test the performance of the various stochastic Harmonic Grammars as grammar frameworks, independent of the performance of any learning algorithms that might be proposed to learn constraint weights in that grammar framework, so it is important to compare the grammars that provide the best fit to the data. For example, our conclusions differ somewhat from Smith & Pater (
Our criterion for goodness of fit is Maximum Likelihood (ML). That is, we searched for the constraint weights that maximize the probability of the data given that grammar model (e.g.
NHG does not correspond to a standard statistical model, but given constraint weights, it is straightforward to calculate candidate probabilities using equations (14) and (15), so standard optimization algorithms can be used to search for the ML constraint weights. We used the Nelder-Mead algorithm, as implemented in the
Censored NHG is more problematic because it is not possible to calculate candidate probabilities – they have to be estimated through simulation. However, with one million simulations per grammar it was possible to obtain probability estimates that were sufficiently stable to search for ML constraint weights using the Nelder-Mead algorithm. This process was slow, but was able to find substantially better constraint weights than those found by Smith & Pater using the HG-GLA algorithm (
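The maximum-likelihood objective being optimized can be sketched for a toy MaxEnt grammar. The data and violation differences below are invented; a crude coordinate search stands in for Nelder-Mead, purely to illustrate the objective being maximized.

```python
import math

# Maximum-likelihood fitting of MaxEnt weights, sketched with a crude
# coordinate search (the analyses in the text use Nelder-Mead).
# Data and violation differences are invented.

# per context: violation differences (winner minus loser, per constraint)
# and observed counts of (success, failure) outcomes
data = [
    ([-1,  1], (30, 70)),
    ([-1,  0], (10, 90)),
]

def loglik(weights):
    ll = 0.0
    for diffs, (k, n_k) in data:
        d = sum(w * v for w, v in zip(weights, diffs))  # harmony difference
        p = 1.0 / (1.0 + math.exp(-d))                  # MaxEnt probability
        ll += k * math.log(p) + n_k * math.log(1.0 - p)
    return ll

weights = [0.0, 0.0]
step = 0.5
while step > 1e-4:
    improved = False
    for i in range(len(weights)):
        for delta in (step, -step):
            trial = weights[:]
            trial[i] += delta
            if loglik(trial) > loglik(weights):
                weights = trial
                improved = True
    if not improved:
        step /= 2.0
```

With two free weights and two contexts the model is saturated, so the ML weights reproduce the observed proportions exactly: here w₁ = −logit(0.1) ≈ 2.20 and w₂ = w₁ + logit(0.3) ≈ 1.35.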
The results are summarized in
ML constraint weights for the stochastic HGs described in the text.
| MaxEnt | Normal MaxEnt | NHG | Censored NHG
---|---|---|---|---
N | 2.08 | 1.68 | 2.20 | 11.23
*CCC | 2.84 | 2.29 | 2.98 | 12.02
*C | 0.48 | 0.39 | 0.57 | –0.04
M | 2.14 | 1.69 | 2.21 | 1.96
D | 0 | 0 | 0 | –1.32
*C | 0 | 0 | 0 | 9.43
Observed probabilities of pronouncing schwa in each context, and fitted probabilities from each stochastic HG.
Context | Observed | MaxEnt | Normal MaxEnt | NHG | Censored NHG
---|---|---|---|---|---
/∅/, C, _σσ́ | 0.09 | 0.11 | 0.12 | 0.10 | 0.10
/∅/, C, _σ́ | 0.12 | 0.17 | 0.18 | 0.21 | 0.17
/∅/, CC, _σσ́ | 0.68 | 0.68 | 0.67 | 0.67 | 0.70
/∅/, CC, _σ́ | 0.83 | 0.78 | 0.76 | 0.75 | 0.77
/ə/, C, _σσ́ | 0.56 | 0.52 | 0.50 | 0.50 | 0.54
/ə/, C, _σ́ | 0.65 | 0.63 | 0.61 | 0.61 | 0.62
/ə/, CC, _σσ́ | 0.91 | 0.95 | 0.95 | 0.96 | 0.95
/ə/, CC, _σ́ | 0.94 | 0.97 | 0.97 | 0.96 | 0.96
deviance | | 14.6 | 21.7 | 26.0 | 12.8
The MaxEnt and NHG grammars have 0 weights for two constraints, D
D
A second point to observe is that the censored NHG grammar has constraints with negative weights (*C
The performance of the different grammars is summarized in
A common way to compare models is in terms of their AIC values (
Comparisons between MaxEnt and censored NHG are more complicated. As discussed above, the MaxEnt models make use of only four of the six constraints in
Whether censored NHG is penalized for additional parameters or not, the conclusions are similar. With the complexity penalty, MaxEnt has the lowest AIC, but the AIC of censored NHG is only 2.2 higher, and Burnham & Anderson (
So in terms of AIC, MaxEnt and censored NHG are similar. Normal MaxEnt is substantially worse than the closely comparable MaxEnt model, and normal NHG is clearly the worst model. However, it is revealing to look more closely at the details of the model fits and how they relate to the distinct predictions laid out in section 7.
The observed probabilities of pronouncing schwa in each of the eight contexts are compared to the fitted probabilities from the models in
Observed and fitted probabilities of pronouncing schwa in each context. Fitted probabilities for the models have been separated on the x-axis to make it easier to distinguish their plotting symbols.
Observed and fitted logit probabilities of pronouncing schwa in each context. Contexts are numbered for ease of reference.
Observed and fitted probit probabilities of pronouncing schwa in each context. Contexts are numbered for ease of reference. Fitted probabilities for the models have been separated on the x-axis.
The analysis in section 7 demonstrated that MaxEnt predicts that all pairs of contexts that differ in the same constraint violations should show the same difference in logit(
Examination of the data in
These visual impressions can be confirmed statistically. In a logistic regression model of
Normal MaxEnt makes similar predictions but with regard to probit(
NHG predicts systematic variation in differences in probit(
(41)
Change in probit(
normal NHG | censored NHG | ||||
---|---|---|---|---|---|
context | |||||
1 | /∅/, C, _σσ́ | –2.20 | –1.84 | 1.43 | |
2 | /∅/, C, _σ́ | –1.63 | 2 | –1.46 | 1.54 |
3 | /∅/, CC, _σσ́ | 0.77 | 0.74 | 1.43 | |
4 | /∅/, CC, _σ́ | 1.34 | 2 | 1.13 | 1.54 |
5 | / ə /, C, _σσ́ | 0.01 | 0.17 | 1.72 | |
6 | / ə /, C, _σ́ | 0.58 | 2 | 0.56 | 1.81 |
7 | / ə /, CC,_σσ́ | 2.98 | 2.76 | 1.72 | |
8 | / ə /, CC, _σ́ | 3.55 | 2 | 3.14 | 1.81 |
The adjacent contexts in
Turning to pairs that differ only in whether their preceding context is CC_ or C_ (1-3, 2-4, 5-7, 6-8), the CC_ context adds 1 to the difference in *CCC violations while subtracting 1 from the difference in *C
In summary, the contextual variation in differences in probit(
While the formula for change in probit(
This phenomenon has two consequences: First, low-weighted constraints contribute less noise, so their effect on
In addition, the presence of redundant pairs of constraints like M
In summary, with censored NHG, it is possible to mitigate the bad predictions observed with normal NHG, and it is possible to use a redundant constraint to adjust noise variance to partially model some observed contextual variation in the effects of differences in constraint violations on probit(
It is clear from examination of both varieties of NHG that the distinctive predictions of these models about the ways in which the effect of adding or subtracting a constraint should depend on the pattern of violations in the rest of the tableau are not confirmed. Censored NHG is only able to compete with MaxEnt because it can exploit the redundancy between M
Besides revealing the unanticipated effect of redundant constraints in censored NHG, this examination of the fit of the four stochastic HGs suggests that none of them capture all of the significant patterns in the data. We have already noted that MaxEnt fails to capture a significant difference in the effect of *CCC/*C
Given that this interaction effect is not successfully modeled in any of the frameworks, the source of the problem presumably lies in the constraint set: an additional constraint is required to fit the schwa data. We will see that comparison of the stochastic Harmonic Grammar frameworks with respect to this revised constraint set provides a better test of their predictions since the best models fit the data well and redundant constraints no longer contribute to the fit of the censored NHG model. The results provide support for MaxEnt over censored NHG, and the procedure illustrates methods that are generally applicable to the analysis of stochastic Harmonic Grammars.
Evaluating the adequacy of stochastic grammars is tricky. The condition for adequacy cannot be a precise match between observed and predicted probabilities because we expect mismatches between observed and predicted probabilities in any finite sample of data, even given the true grammar. Instead we want to determine when those mismatches are small enough to conclude that the grammar accounts for the data.
Here we adopt a standard statistical method for assessing the fit of a probability model to data, a Likelihood Ratio Test of lack of fit (
The test reveals that all of the grammars considered show significant lack of fit. For models with four constraint weights (and thus four residual degrees of freedom, given that we are analyzing eight contexts), the deviance threshold for significant lack of fit at
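The deviance statistic used in this lack-of-fit test can be computed directly: it is twice the difference between the log-likelihood of the saturated model (fitted probability = observed proportion in each context) and that of the grammar. The counts and fitted probabilities below are invented for illustration; the critical value 9.49 is the χ² quantile for 4 residual degrees of freedom at α = .05.

```python
import math

# Deviance-based lack-of-fit check, sketched.  Counts and fitted
# probabilities are invented, not the values from the text.

def deviance(observed, fitted):
    # observed: list of (successes, trials); fitted: model probabilities.
    # deviance = 2 * [loglik(saturated) - loglik(model)]
    dev = 0.0
    for (k, n), p in zip(observed, fitted):
        for kk, pp in ((k, p), (n - k, 1.0 - p)):
            if kk > 0:
                dev += 2.0 * kk * math.log((kk / n) / pp)
    return dev

observed = [(12, 100), (68, 100), (83, 100), (91, 100)]
fitted   = [0.11, 0.67, 0.78, 0.95]

d = deviance(observed, fitted)
lack_of_fit = d > 9.49   # chi-square critical value, 4 df, alpha = .05
```

With these invented numbers the deviance (≈ 4.4) falls below the threshold, so this toy grammar would show no significant lack of fit.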
The main shortcoming of these grammars is that they fail to capture the fact that the difference in the probability of schwa candidates between C_ and CC_ contexts is smaller in clitics (with underlying /ə/) than in words (with underlying /∅/). Given the current constraint set, MaxEnt predicts that the difference in logit(
We can verify that this is the source of the problem by adding a constraint whose violation depends on both preceding context (CC_ vs. C_) and whether the form contains underlying /∅/ or /ə/, and showing that this makes it possible to formulate grammars that show no significant lack of fit. The additional constraint could take a variety of forms, but one possibility is a constraint that penalizes CCC clusters only if the entire cluster falls within the same intermediate phrase (iP), inspired by related constraints proposed by Côté (
This analysis posits that the relevant difference is not between clitics and words per se, but between the prosodic contexts in which they appear in the experimental materials. In the clitic sentences, potential clusters are split over the boundary between the subject and the VP (e.g. [mɔʁi
Some support for the hypothesis that the effect is due to prosodic structure comes from Dell’s (
The ML constraint weights for grammars using this expanded constraint set are shown in
ML constraint weights and deviances for grammars using the revised constraint set.
| MaxEnt | Normal MaxEnt | NHG | Censored NHG
---|---|---|---|---
N | 1.10 | 0.89 | 1.19 | 11.31
*CCC | 2.12 | 1.68 | 2.18 | 12.25
*CCC/iP | 1.18 | 1.09 | 1.62 | 1.55
*C | 0.50 | 0.40 | 0.59 | 0.16
M | 1.28 | 1.07 | 1.41 | 1.29
*C | 0 | 0 | 0 | 10.24
deviance | 2.2 | 2.5 | 6.7 | 4.5
The revised grammars provide a better test of the predictions of the different stochastic Harmonic Grammar models because the comparison set now includes grammars that fit the data well, and because the MaxEnt and NHG grammars are now distinguished by their fundamental predictions rather than by their ability to exploit constraints that happen to be redundant in the present data set. The revised MaxEnt grammar has lower deviance than the revised censored NHG grammar: 2.2 vs. 4.5. Since the NHG grammar still requires one more constraint than the MaxEnt grammar, *C
In summary, we have tested a basic difference between MaxEnt and NHG against data on schwa realization in French: In MaxEnt, a given change in constraint violations always has the same effect on logit(
The predictions are most directly tested by the comparison between MaxEnt and normal MaxEnt on the one hand and regular NHG on the other, and NHG gives a substantially poorer fit to the data with both Smith & Pater’s original constraint set and with the augmented constraint set including *CCC/iP. MaxEnt differs from NHG not only in these basic predictions, but also in the function that relates probability to harmony: logit in MaxEnt and probit in NHG. This difference is eliminated in the comparison between normal MaxEnt and NHG, and the normal MaxEnt grammar still performs substantially better than NHG, especially with the augmented constraint set.
The comparison between MaxEnt and censored NHG introduces a third difference: censored NHG predicts that the relative probabilities of candidates should be affected by the weights of the constraints that show violation differences, because lower-weighted constraints introduce less noise in this framework. This property enables censored NHG to achieve a fit comparable to MaxEnt with the original constraint set, but that is only in conjunction with redundant constraints that make it possible to use constraint weights purely to adjust noise. If that redundancy is eliminated by reducing the constraint set, or made irrelevant by augmenting it, then censored NHG performs worse than MaxEnt. Censored NHG only achieves a lower deviance than NHG with the augmented constraint set because censoring results in more uniform standard deviations for the noise difference (
So (i) MaxEnt’s prediction that a given change in constraint violations should always result in the same change in candidate probabilities, when those probabilities are measured on the appropriate scale, is supported over NHG’s prediction that the change should depend on the number of violation differences between the candidates. (ii) Measuring probability changes on the logit scale (MaxEnt) seems to yield better results than using the probit scale (normal MaxEnt), but the difference is minimal with the augmented constraint set.
Before concluding, we will briefly address the extension of the analysis of stochastic Harmonic Grammars to cases where three or more variant forms have probabilities significantly above zero.
As noted in section 5, the analysis of the relationship between candidate harmonies and their probabilities developed so far only applies to the analysis of tableaux where two candidates have probabilities significantly above zero, as in the French schwa data. In this section we show how the analysis can be generalized to tableaux with any number of variants and briefly consider further predictions that arise.
The analysis in section 5 considered the case of competition between two candidates,
(42)
For candidate
(43)
Candidate probability in MaxEnt
P(x_i) = exp(H(x_i)) / Σ_j exp(H(x_j))
It is apparent from (43) that it remains true in the general case that the relative probabilities of two candidates depend only on the difference in their harmonies (44).
(44)
Ratio of probabilities of two candidates in MaxEnt
P(x_1) / P(x_2) = exp(H(x_1)) / exp(H(x_2)) = exp(H(x_1) − H(x_2))
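The invariance in (44) can be illustrated numerically. The following sketch is in Python rather than the R of the supplementary materials, and the harmony values are invented for illustration:

```python
import math

def maxent_probs(harmonies):
    """Map candidate harmonies onto probabilities (exponentiate and normalize)."""
    exps = [math.exp(h) for h in harmonies]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical tableaux whose candidates differ in harmony by the same
# amount (1.5), even though the absolute harmonies differ:
p1 = maxent_probs([-1.0, -2.5])
p2 = maxent_probs([-4.0, -5.5])

# The ratio of candidate probabilities is exp(harmony difference) in both
# tableaux, regardless of the absolute harmonies:
print(p1[0] / p1[1], p2[0] / p2[1])  # both ≈ exp(1.5) ≈ 4.48
```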
If the
For example, consider tableau (1), repeated here as (45). In normal MaxEnt, the noise added to the harmony of each candidate is drawn from identical normal distributions. Candidate (a) is selected if its harmony is higher than the harmonies of candidates (b) and (c), which is the case if
Contour plots of the joint probability distribution of pairs of noise terms derived from tableau (45). The shaded areas are the regions in which candidate (a) (left panel) and candidate (b) (right panel) are optimal.
The calculation for candidate (b) is represented in
(45)
Tableau with probabilities assigned by Normal MaxEnt and NHG
In general, given a tableau with N candidates, the problem of calculating the probability of candidate
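Candidate probabilities in grammars with normal candidate noise can also be estimated by straightforward simulation. A minimal Python sketch with invented harmony values (the supplementary materials provide R code for the exact calculation):

```python
import random

def simulate_normal_maxent(harmonies, sd=1.0, trials=100_000, seed=1):
    """Estimate normal MaxEnt candidate probabilities by simulation:
    add i.i.d. normal noise to each candidate's harmony and count how
    often each candidate ends up with the highest noisy harmony."""
    rng = random.Random(seed)
    wins = [0] * len(harmonies)
    for _ in range(trials):
        noisy = [h + rng.gauss(0.0, sd) for h in harmonies]
        wins[noisy.index(max(noisy))] += 1
    return [w / trials for w in wins]

# Three-candidate tableau with invented harmonies:
probs = simulate_normal_maxent([0.0, -1.0, -2.0])
```

The estimated probabilities approach the exact values as the number of trials grows, at the usual Monte Carlo rate.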
With three or more variants, the predictions of MaxEnt and its normal variant diverge: In MaxEnt the relative probabilities of a pair of candidates depend only on the difference in their harmonies, as shown above, but in normal MaxEnt, candidate probabilities depend on the harmonies of all candidates in the tableau. For example, in (45), if the only candidates were
NHG is still distinguished from both varieties of MaxEnt by the fact that candidate probabilities depend on the pattern of violations across the whole tableau, not just on the harmony differences between candidates. As we have already seen, the variance of the noise difference between a pair of candidates is equal to the sum of the squared violation differences between the two candidates (14), so in (45) the variance of
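For a pair of candidates, the NHG prediction can be computed directly: the probability of the first candidate is a normal CDF of the harmony difference divided by the standard deviation of the noise difference. A Python sketch, with invented weights and violation vectors:

```python
from statistics import NormalDist

def nhg_prob_first(weights, viols1, viols2, noise_sd=1.0):
    """P(candidate 1) in a two-candidate NHG tableau: a probit function
    of the harmony difference, scaled by the sd of the noise difference,
    whose variance is noise_sd^2 times the sum of squared violation
    differences between the candidates."""
    h1 = -sum(w * v for w, v in zip(weights, viols1))
    h2 = -sum(w * v for w, v in zip(weights, viols2))
    sq = sum((v1 - v2) ** 2 for v1, v2 in zip(viols1, viols2))
    return NormalDist().cdf((h1 - h2) / (noise_sd * sq ** 0.5))

# Same harmony difference (2.0) in both tableaux, but more violation
# differences in the second, so its noise difference has higher variance
# and the probability is pulled toward 0.5:
p_one_diff = nhg_prob_first([2.0], [0], [1])              # Phi(2/1)
p_two_diffs = nhg_prob_first([1.0, 1.0], [0, 0], [1, 1])  # Phi(2/sqrt(2))
```

This makes the basic NHG prediction concrete: spreading the same harmony difference over more violation differences lowers the probability of the favored candidate.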
Censored NHG has to be analyzed by simulation regardless of the number of candidates, but its predictions remain qualitatively similar to NHG, modulated by the effect of constraint weights on noise variances and covariances.
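A simulation of censored NHG is equally short. The sketch below (Python, with invented weights and violations) assumes the censoring step that gives the framework its name: negative noisy weights are set to zero before harmonies are computed.

```python
import random

def simulate_censored_nhg(weights, violations, noise_sd=1.0,
                          trials=100_000, seed=1):
    """Estimate censored NHG candidate probabilities by simulation: add
    normal noise to each constraint weight, censor negative noisy weights
    at zero, and select the candidate with the highest resulting harmony."""
    rng = random.Random(seed)
    wins = [0] * len(violations)
    for _ in range(trials):
        noisy_w = [max(0.0, w + rng.gauss(0.0, noise_sd)) for w in weights]
        harmonies = [-sum(w * v for w, v in zip(noisy_w, viols))
                     for viols in violations]
        wins[harmonies.index(max(harmonies))] += 1
    return [w / trials for w in wins]

# Invented tableau: two constraints (weights 3 and 1), three candidates.
probs = simulate_censored_nhg([3.0, 1.0], [[1, 0], [0, 1], [1, 1]])
```

Because censoring truncates the noise on the lower-weighted constraint more often, candidates violating only low-weighted constraints receive less effective noise, which is the property discussed above.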
We have analyzed stochastic Harmonic Grammars by reformulating them as Random Utility Models, in which Harmonic Grammar is made stochastic by adding random noise to the harmony of each candidate. In this formulation, the differences between varieties of stochastic Harmonic Grammar follow from differences in the nature of this added noise. More precisely, it is the distribution of the differences between these noise terms that is crucial: the relative probabilities of two candidates depend on the difference in their harmonies divided by the standard deviation of the difference between their noise variables, so the probability of reversing a given harmony difference between two candidates increases as the variance of the noise added to the candidate harmonies increases (Section 5).
The varieties of stochastic Harmonic Grammar that we have considered differ in the shape of the distribution of noise differences and whether the variance of the distribution is fixed or depends on the pattern of constraint violations. In MaxEnt noise differences follow a logistic distribution, while they follow a normal distribution in NHG and normal MaxEnt, and a sum of censored normal distributions in censored NHG. The shape of the distribution determines the precise function that relates the difference in harmonies of two candidates to their probabilities. However, the logistic and normal distributions are similar, so the effects of this difference are generally subtle, although it can result in measurably distinct predictions as probabilities approach 0 or 1, as seen in Section 8.
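The similarity of the two distributions, and their divergence in the tails, can be checked by comparing the logistic function with a normal CDF matched to the logistic’s standard deviation (π/√3). A small Python check:

```python
import math
from statistics import NormalDist

def logistic_cdf(x):
    """CDF of the standard logistic distribution (sd = pi/sqrt(3))."""
    return 1.0 / (1.0 + math.exp(-x))

# Normal distribution with the same standard deviation as the logistic:
matched_normal = NormalDist(0.0, math.pi / math.sqrt(3))

# The two CDFs are close in the middle of the range, but the logistic has
# heavier tails, so they diverge as probabilities approach 0 or 1:
for x in (0.0, 1.0, 2.0, 4.0, 6.0):
    print(x, round(logistic_cdf(x), 4), round(matched_normal.cdf(x), 4))
```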
The more important difference between these models concerns the variance of the noise differences: In MaxEnt and normal MaxEnt, the noise added to each candidate’s harmony has the same variance, so the noise difference also has the same variance for any pair of candidates. In NHG and censored NHG, the variance of the noise difference depends on the number of violation differences for that pair of candidates, so it differs between pairs. Given that candidate probabilities depend on the difference in their harmonies divided by the standard deviation of the noise difference, the fixed variance of the noise difference in both varieties of MaxEnt means that the relative probabilities of candidates depend only on the differences in their harmonies in these frameworks. Where variance of the noise difference depends on violation differences, as in both varieties of NHG, candidate probabilities also depend on the differences in constraint violations between the candidates.
This basic distinction between the grammar models leads to testable predictions concerning the effects of changing constraint violations: In MaxEnt, a given change in constraint violations always has the same effect on the logit of candidate probabilities, whereas in NHG, the effect on candidate probabilities depends on the violation pattern in the whole tableau. In all frameworks, a given change in constraint violations always has the same effect on the harmony difference between candidates. Given fixed variance of noise differences, as in MaxEnt, this means the change in probabilities is also always the same (when measured in logits), but in NHG, the variance of the noise difference can change when constraint violations are changed, so the effect on candidate probabilities depends on the differences in constraint violations between the candidates.
We tested these predictions against Smith & Pater’s (
Censored NHG was more competitive with MaxEnt, but that is because in censored NHG noise variance is lower on candidates that violate lower-weighted constraints. This makes it possible to use the weights of redundant constraints to adjust noise variances to better fit the data. However, this is not an advantage of the censored NHG framework, because the redundancy of constraints here is an artifact of the limited data set being studied. In the absence of the effects of redundant constraints, censored NHG performed worse than MaxEnt, and comparably to regular NHG.
However, evidence from a single data set is obviously not decisive concerning the relative merits of these stochastic Harmonic Grammar frameworks, so the value of this study lies as much in the methods developed here for comparing and evaluating stochastic Harmonic Grammars, which can be applied in further studies, as in its empirical findings.
The additional file for this article can be found as follows:
Calculating candidate probabilities in stochastic HGs with normal noise. DOI:
R code for working with stochastic HGs and reproducing analyses in the paper. DOI:
HG = Harmonic Grammar, MaxEnt = Maximum Entropy Grammar, ML = Maximum Likelihood, NHG = Noisy Harmonic Grammar, OT = Optimality Theory
This is a somewhat misleading label, since the Maximum Entropy principle that gives MaxEnt grammar its name actually yields the logistic model. A more appropriate label for the normal variant might be ‘HG with normal candidate noise’, but ‘normal MaxEnt’ is shorter and makes explicit the similarity to MaxEnt grammar.
R code for analyses reported in this paper is included in the supplementary materials.
Smith & Pater’s censored NHG grammar has deviance 27.2 compared to 12.8 for the grammar reported here. The grammar reported here also performs better on the metrics employed by Smith & Pater: summed absolute errors 0.247 vs. 0.295, summed squared errors 0.010 vs. 0.015.
Smith & Pater use an underlying normal distribution with standard deviation of 0.2 for censored NHG, resulting in lower constraint weights.
Deviance is –2 times the difference in log-likelihood between the model and a ‘saturated’ model with one parameter for each observation (
Thanks to Benjamin Storme for bringing this work to my attention and suggesting that the difference in syntactic structure might be relevant here. Côté (
Thanks to Joe Pater for suggesting that data he had collected with Brian Smith might be a good testing ground for MaxEnt and NHG, to an audience at AMP 2017 at NYU for feedback on the early stages of this project, and to two anonymous reviewers for helpful comments on this paper.
The author has no competing interests to declare.