We explore the interaction of two phonological factors that condition schwa–zero alternations in French: schwa is more likely after two consonants than after a singleton; and schwa is more likely between stressed syllables than elsewhere. Using new data from a judgment study, we show that both factors play a role in schwa epenthesis and deletion, and that the two factors interact cumulatively: they have a stronger effect together than individually. Treating each factor as a constraint, we find that their cumulative interaction is better modeled with weighted than with ranked constraints. We provide a characterization of patterns of cumulativity in probability space in terms of the effect of a constraint on its own versus its effect in a cumulative interaction with another constraint. Stochastic OT can model cumulative interactions, but only sublinear ones, where the effect of a constraint is weaker in the cumulative context than on its own. Weighted constraint models, MaxEnt and Noisy HG, can model the full range of cumulativity: sublinear, linear, and superlinear. In examining the ability of these models to fit our experimental data, we find that Stochastic OT is hampered by the fact that the data display superlinear cumulativity. Noisy HG and MaxEnt fare better on this dataset, with MaxEnt yielding the best fit.
In his landmark study originally published in 1973, Dell (
(1)
Schwa deletion and epenthesis
a.
/dœvrɛ/ → [dvrɛ]
Tu
‘You should go.’
b.
/mœ/ → [m]
Tu
‘You owe me money.’
c.
/film/ → [filmœ]
un
‘a Danish film’
In all of these examples, the process applies variably: (1a, b) could be produced with a surface schwa, and (1c) without one. Although speech rate and speech register affect the probability of deletion in these contexts, according to Dell both variants are possible in what might be described as a neutral rate and register.
In this paper, we focus on the interaction of two phonological factors that affect the probability of schwa deletion and epenthesis. The first factor is whether a singleton consonant or a consonant cluster precedes the schwa. Deletion is less likely, and epenthesis more likely, when schwa is preceded by a cluster. Dell’s rule of schwa deletion applies only when a single consonant precedes, as in the examples in (1a, b), and his rule of epenthesis applies only after a morpheme that ends in a cluster, as in (1c). Dell’s analysis abstracts away from the fact that schwa deletion can also apply when a cluster precedes, as in (2a, b). Deletion almost certainly applies in these examples with lower probability than in (1a, b), but it seems possible for most if not all speakers of this variety.
(2)
Examples of deletion in the CC_ context in
a.
[ʒak
Jacques
dvʁɛ
devrait
paʁtiʁ]
partir.
‘Jacques should leave.’
b.
[ʒak
Jacques
m
me
dwa
doit
dœ
de
laʁʒɑ̃]
l’argent.
‘Jacques owes me money.’
A second factor that plays a role in conditioning the probability of both deletion and epenthesis is the position of the schwa in the phrase. Deletion is less likely, and epenthesis more likely, when the schwa is followed by a stressed monosyllabic word, and schwa’s presence avoids a stress clash (see Section 2 for references). For example,
Examples like these raise both empirical and theoretical challenges. On the empirical side, data on the relative frequency of outcomes are harder to collect than data on categorical differences. Single-speaker intuitions and observations like those of Dell (
Constraint-based models address both of these theoretical challenges. Such models allow a single factor, or constraint, to play a role across multiple processes, and as we will discuss below, there are several probabilistic constraint-based models that can generate degrees of optionality as required by the schwa data (see
(3)  Constraints on schwa deletion and epenthesis
a.  *CLASH
Definition:  Assign one violation for every two adjacent stressed syllables.
Effect:  Schwa is more likely to be realized in σ́_σ́ than in σ́_σσ́.
b.  *CCC
Definition:  Assign one violation for every sequence of three consonants.
Effect:  Schwa is more likely to be realized in VCC_C than in VC_C.
In French, these two constraints appear to interact
Probability of realizing an underlying schwa estimated from the experiment.

Following context      Preceding context
                       C_      CC_
_σ́                    0.65    0.94
_σσ́                   0.56    0.91
We use these differences in predictions to compare three constraint-based models of variation in detail: Stochastic OT (
To compare the three frameworks, we report and model experimental data on French schwa, using judgments from multiple native speakers on the acceptability of realized schwa across contexts. Of the three models, MaxEnt provides the best fit to our data. Our results add to a growing body of work showing that weighted constraints provide a better fit to probabilistic natural language data than ranked constraints, particularly when it comes to cumulativity (
The paper is structured as follows. In Section 2, we provide a brief review of the two phonological factors conditioning French schwa and formalize these factors as phonological constraints. After the presentation of the experiment in Section 3, we present a full model of the data in Section 4, using the probabilities from the experiment to compare different constraintbased models of phonological variation.
In this section, we provide background on the two phonological factors, repeated in (4), which play a role in both schwa epenthesis and deletion, and define the constraints for the formal analysis.
(4)  Phonological conditions on schwa realization  
a.  The cluster factor: schwa is more likely to be realized in CC_C than in C_C  
b.  The stress factor: schwa is more likely to be realized in σ́_σ́ than in σ́_σσ́ 
There are three morphological environments where schwa alternates with zero: at clitic boundaries, at word boundaries, and morpheme-internally. Our analysis assumes that underlying schwas are found morpheme-internally, such as the one in
(5)  Schwa deletion and epenthesis  
a.  /dœvʁɛ/ → [dvʁɛ]  Tu 

b.  /mœ/ → [m]  Tu 

c.  /film/ → [filmœ]  un 

d.  /vɛst/ → [vɛstœ]  un 
The treatment of underlying and epenthetic schwa is not universal. Dell (
The justification for treating schwas at word boundaries as epenthetic is the alternation’s productivity. Schwa can appear at
(6)
Data from Dell (
a.
[yn
une
vɛst
vest
ʁuʒ]
rouge
(
‘a red jacket’
b.
[yn
une
vɛst
vestɇ
ʁuʒ
rouge
e
et
blɑ̃ʃ]
blanc
(
‘a red and white jacket’
c.
[ɛgzakt
exact
(
‘exactly’
d.
[masivmɑ̃]
massivɇment
(
‘massively’
e.
[ɛ̃
un
ʃɔʁt
short
vɛʁ]
vert
(
‘a green pair of shorts’
We treat schwas at clitic boundaries as underlying because schwa–zero alternations only occur in a subset of clitics. For example, schwa is optional in the object clitic
(7)
Schwa–zero alternations are lexically restricted
a.
[sœ
ce
kœ
que
ʒo
Joe
t(œ)
t(e)
di]
dit
‘what Joe told you’
b.
[sœ
ce
kœ
que
ʒo
Joe
lœʁ
leur
di],
dit
*[sœ kœ ʒo lœʁœ di]
‘what Joe told them’
c.
[si
si
ʒ(œ)
j(e)
kuʁ]
cours
‘if I run’
d.
[si
si
ɛl
ellɇ
kuʁ],
court
*[si ɛlœ kuʁ]
‘if she runs’
A model in which schwas at clitic boundaries are epenthetic must prevent epenthesis in contexts such as (7b) and (7d), while motivating optional epenthesis in (7a) and (7c).
To solve this problem, we posit that alternating [œ]’s in clitics are underlying. This analysis has the added benefit of straightforwardly accounting for the generalization that schwa is realized more often in clitics than at word boundaries (see e.g.
(8)  MAX: Assign one violation for every input segment that has no output correspondent.
(9)  DEP: Assign one violation for every output segment that has no input correspondent.
M
For both underlying and epenthetic schwa, schwa is realized more often after two or more consonants than after a singleton consonant. The examples in (10) show this for deletion, while controlling for phrase position. In all examples, schwa is also followed by a consonant, since schwa is rarely realized adjacent to a vowel.
(10)
Deletion and the cluster factor (
a.
[mɑ̃ʒ
mange
l
l
gato]
gâteau
CC
b.
[mɑ̃ʒɛ
mangez
l(œ)
l(e)
gato]
gâteau
C(e) σσ
c.
[ʒak
Jacques
d
d
paʁtiʁ]
partir
CC
d.
[ɑ̃ʁi
Henri
d(œ)vʁɛ
d(e)vrait
paʁtiʁ]
partir
C(e) σσσ
The number of preceding consonants also plays a role in epenthesis, as shown in (11).
(11)
Epenthesis and the cluster factor (
a.
[la
la
sɛkt(œ)
sect(e)
paʁtɛ]
partait
CC(e) σσ
b.
[l
l’
astɛk
Aztèquɇ
paʁtɛ]
partait
Cɇ σσ
Support for the probability judgments above is found in our experimental results, previewed in Table
Probability of schwa realization from our experiment.
Following context      Preceding context
                       C_      CC_
Deletion (clitic boundary):
_σ́                    0.65    0.94
_σσ́                   0.56    0.91
Epenthesis (word boundary):
_σ́                    0.12    0.83
_σσ́                   0.09    0.68
In the constraintbased models that follow, we model the cluster factor with the constraint *
(12)  * 
Under the *CCC analysis, schwa is inserted in phrases such as
An effect of phrase position on the probability of realizing schwa has been observed at least since Léon (
(13)
Epenthesis: position plays a role when schwa is after two consonants (
a.
[lœ
le
gaʁd
gard
mɑ̃]
ment
CC
‘the guard lies’
b.
[lœ
le
gaʁd(œ)
gard(e)
mɑ̃ˈtɛ]
mentait
CC (e) σσ
‘the guard was lying’
(14)
Deletion: position plays a role when schwa is after two consonants (
a.
[la
la
tɛʁ
terre
s
s
vɑ̃]
vend
CC
‘the land is selling’
b.
[la
la
tɛʁ
terre
s(œ)
s(e)
vɑ̃
vend
bjɛ̃]
bien
CC (e) σσ
‘the land is selling well’
In both (13) and (14), schwa occurs after two consonants in the context VCC_C. In the context VC_C, it has been claimed that there is no effect of the number of following syllables, regardless of whether the schwa is at a word boundary or clitic boundary. Côté and Morrison (
(15)
Realization of schwa in the context VC_C is unaffected by the number of following syllables (
a.
[lo
l’eau
s(œ)
s(e)
vɑ̃]
vend
C (e) σ
‘water sells’
b.
[lo
l’eau
s(œ)
s(e)
vɑ̃
vend
bjɛ̃]
bien
C (e) σσ
‘water sells well’
c.
[il
il
dɔn
donnɇ
pø]
peu
C ɇ σ
‘he gives little’
d.
[il
il
dɔn
donnɇ
boku]
beaucoup
C ɇ σσ
‘he gives a lot’
Contrary to the generalization in (15), our experimental data show an effect of the number of following syllables even after a single consonant. Across segmental and morphological contexts, schwa is realized more often before one syllable than before two syllables, although the effect is very weak at floor and ceiling, when probabilities are close to 0 (as in C_ at a word boundary) or 1 (as in CC_ at a clitic boundary).
In our model, the fact that schwa is realized more often before one syllable than two follows from stress clash avoidance. The constraint *C
(16)  *CLASH
Assign one violation for every two adjacent stressed syllables.
This approach is similar in spirit to the analysis of Mazzola (
Stress in French is not fixed at the word level, but falls on the last non-schwa syllable of the phonological phrase (
(17)
Examples of stress assignment in French
a.
(le garde)_{PP}
σ σ́
(mentait)_{PP}
σ σ́
‘the guard was lying’
b.
(une
σ
veste
σ̀
marron)_{PP}
σ σ́
‘a brown jacket’
c.
(une
σ
veste)_{PP}
σ́
(marron)_{PP}
σ σ́
‘a brown jacket’
Given that stress always occurs on the last full syllable of the phonological phrase, when schwa is followed by a phrase-final monosyllabic word, it’s also followed by a stressed syllable. As shown in (18), *CLASH
(18)  Schwa insertion avoids a stress clash  
a.  (le garde)_{PP} (ment)_{PP}  [lœ ˈgaʁd 
σ́ 

b.  (le garde)_{PP} (mentait)_{PP}  [lœ ˈgaʁd(œ) mɑ̃ˈte]  σ́(e) σσ́  
c.  (une veste rouge)_{PP}  [yn ˌvɛst 
σ̀ 

d.  (une veste marron)_{PP}  [yn ˌvɛst(œ) maˈʁɔ̃]  σ̀(e) σσ́ 
*C
(19)
Clash resolution, stressed syllables are in small caps (
a.
[laˌ
l’
midalˈ
a
fʁɛd]
Al
σσ̀ σσ́
‘Alfred’s friend’
b.
[ˌlamidˈ
pjɛʁ]
P
σ̀σ σ́
‘Pierre’s friend’
c.
[laˌ
l’
midœˈ
a
pjɛʁ]
P
σσ̀e σ́
‘Pierre’s friend’
Crucially, the example in (19c) shows that schwa can serve as a buffer between stresses, avoiding a stress clash and making retraction unnecessary.
Côté (
(20)
A position effect without stress clash (
a.
[d
d
ˈlo]
l’eau
‘some water’
b.
[d(œ)
d(e)
loˈdas]
l’audace
(e)σσ́
‘some audacity’
(21)
a.
[v(œ)ˈne]
v(e)nez
(e)σ́
‘come’
b.
[v(œ)ne
v(e)nez
iˈsi]
ici
(e)σσ́
‘come here’
Côté (
One last restriction on schwa, which is relevant to our experimental design, is that schwa generally doesn’t occur next to another vowel, even in contexts where the stress factor favors its realization.
(22)
No schwa next to a vowel
a.
[ɛma
Emma
tɛd],
t’aide
*[ɛma tœ ɛd]
‘Emma helps you’
b.
[ɛma
Emma
tœ
t
gid]
guide
‘Emma guides you’
c.
[uvʁ œf],
ouvrɇoeuf
*[uvʁœ œf]
‘egg opener’
d.
[uvʁœ
une
bwat]
ouvr
‘can opener’
The exception to this generalization is h-aspiré words, which phonetically begin with a vowel (or glottal stop), but pattern in many ways as if they begin with a consonant (see e.g.
As shown in the previous section, the realization of schwa is conditioned by segmental context and rhythmic context, which we analyze using the constraints *CCC and *CLASH.
This section reports the results of a judgment experiment designed to estimate the probability of schwa across contexts, and determine how the four constraints contribute to the probability of schwa realization.
We conducted the experiment over the internet, using the web-based psycholinguistics experiment platform Ibex (
In addition to choosing between schwa and no schwa, participants indicated their confidence in the answer as
Screenshot of experiment in progress.
Previous work has shown that French speakers are capable of estimating the frequency of schwa realization in this manner. For example, Racine (
The experiment followed a 2 × 2 × 2 factorial design, with 8 conditions.
(23)  Factorial design
a.  Cluster before schwa site: C_ vs. CC_
b.  Position of schwa site: _σ́ vs. _σσ́
c.  Underlying or epenthetic schwa: clitic boundary vs. word boundary
The construction of items differed for underlying and epenthetic schwas. Items with epenthetic schwas were constructed according to the template in (24), consisting of a noun followed by a postnominal adjective, with the site of the epenthetic schwa at the boundary between them.
(24)  C’est un <Noun> <Adjective>
<Noun>: C-final or CC-final, all final consonants are obstruents, mostly monosyllabic
<Adjective>: σ́ or σσ́, all obstruent-initial
Depending on the condition, nouns ended in either one or two consonants, and adjectives were one or two syllables long. We controlled for segmental and prosodic context as much as possible. All but two nouns were monosyllabic, and the disyllabic nouns were balanced across conditions. All nouns in the experiment ended in stops, and all adjectives began with obstruents. This ensured that all clusters in the experiment consisted of only obstruents, and in three-consonant clusters, the middle consonant was always a stop, controlling for the influence of sonority on the rate of schwa realization. Examples of the four epenthesis conditions are in (25), with parentheses indicating the alternating
(25)
Examples of epenthesis items with alternating schwa in parentheses
a.
C_σ́
[yn
une
bɔt(œ)
bott(e)
ˈʒon]
jaune
‘a yellow boot’
b.
CC_σ́
[yn
une
vɛst(œ)
vest(e)
ˈʒon]
jaune
‘a yellow jacket’
c.
C_σσ́
[yn
une
bɔt(œ)
bott(e)
ʃinˈwaz]
chinoise
‘a Chinese boot’
d.
CC_σσ́
[yn
une
vɛst(œ)
vest(e)
ʃinˈwaz]
chinoise
‘a Chinese jacket’
Deletion items contained the clitic
(26)  <Name> te <Verb>
<Name>: C-final or V-final, all final consonants are obstruents, disyllabic
<Verb>: σ́ (present) or σσ́ (imperfect), all obstruent-initial
All of the names that occurred before
(27)
Examples of deletion items with alternating schwa in parentheses
a.
C_σ́
[eva t(œ) ˈʃɔk]
Eva t(e) choque
‘Eva shocks you’
b.
CC_σ́
[mɔʁiz t(œ) ˈsit]
Maurice t(e) cite
‘Maurice cites you’
c.
C_σσ́
[eva t(œ) ʃɔˈkɛ]
Eva t(e) choquait
‘Eva shocked you’
d.
CC_σσ́
[mɔʁiz t(œ) siˈtɛ]
Maurice t(e) citait
‘Maurice cited you’
Each participant saw 6 items per condition, 24 for deletion and 24 for epenthesis, in addition to 30 fillers. Fillers consisted of tenses (simple past, simple future) and phonological environments that differed from the test items. Most importantly, some fillers contained phrases with schwa adjacent to vowels, which we used as catch trials. We excluded from analysis any participant who judged that schwa should
(28)  Summary of experimental design
78 judgments per participant
24 deletion: 6 per type in (27), no name or verb repeated
24 epenthesis: 6 per type in (25), no adjective or noun repeated
20 fillers for deletion (e.g. Anna s(e) est levée)
10 fillers for epenthesis (e.g. un iguan(e) solitaire)
Participants were recruited over the internet through word of mouth. We excluded any participant who did not self-identify as a native speaker of French or chose
The proportions of schwa responses for both underlying and epenthetic contexts are presented in Table
Proportion of schwa realization from experiment. The values in parentheses indicate the range of the 95% confidence interval, specifically the Wilson score interval.
Following context      Preceding context
                       C_                  CC_
Deletion (clitic boundary):
_σ́                    0.65 (0.57–0.72)    0.94 (0.89–0.97)
_σσ́                   0.56 (0.48–0.64)    0.91 (0.86–0.95)
Epenthesis (word boundary):
_σ́                    0.12 (0.08–0.18)    0.83 (0.76–0.89)
_σσ́                   0.09 (0.05–0.14)    0.68 (0.61–0.75)
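The Wilson score intervals in the table can be reproduced with a short function. This is a minimal sketch of the standard formula; the function name is ours, and the per-cell trial count used in the usage line is a hypothetical illustration (the actual counts are not repeated here).

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """95% Wilson score interval for a binomial proportion.

    Unlike the simple normal approximation, the Wilson interval stays
    inside [0, 1] and remains sensible near floor and ceiling, which
    suits schwa rates close to 0 or 1."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n
                                   + z**2 / (4 * n**2))
    return center - half, center + half

# e.g. a cell with 65% schwa responses out of a hypothetical 150 trials
lo, hi = wilson_interval(0.65, 150)
```

Note that the interval is asymmetric around the raw proportion whenever the proportion is away from 0.5, which is visible in the high-probability cells of the table.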
Proportion of schwa realization from experiment. Whiskers show Wilson score intervals.
Across all four phonological contexts, schwa is judged as better in deletion contexts than in epenthesis contexts. Schwa is also generally judged as better after two consonants than one consonant (the cluster factor), and better before one syllable than two syllables (the stress factor).
To evaluate the statistical significance and effect size of the factors, we fit a mixed effects logistic regression model in R (
Coding of fixed effects for regression model.
Fixed effect                       Level        Coding
Stress (stress factor)             _σ́          +1
                                   _σσ́         –1
Seg (cluster factor)               CC_          +1
                                   C_           –1
Ep/Del (epenthesis or deletion)    Deletion     +1
                                   Epenthesis   –1
Given interspeaker variation in the production of schwa, the use of random intercepts and slopes ensures that the model generalizes across speakers. The inclusion of random intercepts means that speakers with exceptionally high or low baseline rates of schwa will have less of an influence on the estimates of the model predictors, while the inclusion of random slopes means that some speakers can be exceptionally sensitive or insensitive to phonological context. In this way, the random effects structure controls for dialectal and sociolinguistic variation, both of which are well-documented for French schwa. Previous work modeling French schwa has taken the same approach. In Bayles et al. (
The coding of the fixed effects is shown in Table
The fitted values for the model are shown in Table
Mixed effects model: logistic regression (positive = greater likelihood of schwa).
                     Coefficient (β)   S.E.    Z       Pr > |Z|
(Intercept)          0.94              0.26
Stress = _σ́         0.31              0.11    2.70    <0.01
Seg = CC_            1.75              0.15    11.51   <0.001
Ep/Del = deletion    1.48              0.24    6.25    <0.001
Stress × Seg         –0.06             0.11    –0.55   0.59
All fixed effects are significant, except the interaction of Stress × Seg. The presence of a preceding cluster has the biggest effect on the realization of schwa; as shown by the coefficient of Seg (β = 1.75), schwa is more likely after clusters than singletons. Schwa is also more likely in deletion contexts than epenthesis contexts (β = 1.48), and more likely when followed by one syllable than when followed by two (β = 0.31). Although the effect of stress is relatively small, it’s significant in the model. The lack of significance for Stress × Seg suggests that the effect of stress is not limited to one segmental context (or vice versa). Both Stress and Seg exhibit independent effects on the probability of schwa realization.
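The fixed-effect estimates can be converted into population-level predicted probabilities with the inverse logit. The sketch below assumes the ±1 sum coding from the coding table and ignores the by-speaker random intercepts and slopes, so its outputs are fixed-effects-only predictions rather than the model's full fitted values; the function and dictionary names are ours.

```python
import math

def inv_logit(x):
    # logistic function: maps a log-odds value to a probability
    return 1 / (1 + math.exp(-x))

# Fixed-effect estimates from the mixed effects model reported above
BETA = {"intercept": 0.94, "stress": 0.31, "seg": 1.75,
        "epdel": 1.48, "stress_x_seg": -0.06}

def p_schwa(stress, seg, epdel):
    """Population-level P(schwa) from the fixed effects alone.
    stress: +1 = _stressed syllable, -1 = two syllables follow;
    seg: +1 = CC_, -1 = C_; epdel: +1 = deletion, -1 = epenthesis."""
    eta = (BETA["intercept"]
           + BETA["stress"] * stress
           + BETA["seg"] * seg
           + BETA["epdel"] * epdel
           + BETA["stress_x_seg"] * stress * seg)
    return inv_logit(eta)

# Most schwa-favoring cell: deletion, CC_ before a stressed monosyllable
p_best = p_schwa(stress=+1, seg=+1, epdel=+1)
# Least favoring cell: epenthesis, C_ before a disyllable
p_worst = p_schwa(stress=-1, seg=-1, epdel=-1)
```

Because the random effects are ignored, these values will not exactly reproduce the empirical cell means, but the ordering of contexts is preserved.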
In this section, we compare the ability of three models of variation to fit our experimental data: MaxEnt, Stochastic OT and Noisy HG. In the first section, we introduce the models by discussing some of the distributions that each one can generate for a subset of the French contexts, and some of the restrictions that each model places on the distributions it can generate relative to the other models. We then show how the models fare in fitting the actual French data. There has been some previous comparison of these theories (see Hayes &
To illustrate how the models function, we will consider the set of contexts that we analyze as environments for schwa deletion, as opposed to epenthesis. The constraints are given in (29–31). For simplicity, we omit faithfulness constraints here, but include them below when needed.
(29)  *CCC
Assign one violation for every sequence of three consonants.
(30)  *CLASH
Assign one violation for every two adjacent stressed syllables.
(31)  NOSCHWA
Assign one violation for every [œ] in the output.
The contexts are illustrated in Table
Examples of schwa in the four phonological contexts to be modeled.
Following context      Preceding context
                       C_                     CC_
_σ́                    le vín s(œ) vénd       la térre s(œ) vénd
_σσ́                   le vín s(œ) vend bién  la térre s(œ) vend bién
We consider two candidates for each context: faithful realization of an underlying schwa, and deletion. The tableau in (32) shows violations for the two candidates in the context where two constraints are violated by deletion. Violations are marked with negative integers.
(32)  Constraint violations marked with negative integers  
The table in (33) uses the more compact representation of difference vectors, which result from subtracting the deletion candidate’s violations from the faithful candidate’s violations. Positive values indicate constraints that prefer schwa’s presence, negative values indicate constraints that prefer schwa’s absence, and zeroes indicate constraints that are indifferent.
(33)  Difference vectors for constraint scores: negative values favor schwa’s absence, positive values favor schwa’s presence  
The table of difference vectors in (33) clearly shows the tradeoffs in constraint violations in each context. Faithful realization of the schwa always violates NOSCHWA; which constraints deletion violates depends on the context.
In Optimality Theory (OT:
In a deterministic version of Harmonic Grammar (HG; see
(34)  Cumulativity in deterministic Harmonic Grammar  
In terms of our difference vectors, schwa realization is optimal when the sum of the difference scores, each multiplied by its constraint’s weight, is above zero (see further
(35)  Weighted Harmony differences  
This gang effect, or cumulative constraint interaction, cannot be modeled in standard deterministic OT: no ranking of these constraints will produce an output schwa in only the top row; it will always be accompanied by an optimal output schwa in one of the middle rows. In the next section, we will see that a probabilistic version of OT will give the top row higher probability than the middle ones.
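The weighted evaluation behind this gang effect can be sketched directly from the difference vectors in (33), using the illustrative weights (3, 2, 2) mentioned in connection with (35). The dictionary and function names below are our own.

```python
# Difference vectors from (33): the faithful candidate's violations
# minus the deletion candidate's, in the order (NOSCHWA, *CCC, *CLASH).
# Positive values favor schwa's presence, negative values its absence.
DIFFS = {
    "la terre se vend":      (-1, +1, +1),  # CC_ before σ́: cumulative
    "la terre se vend bien": (-1, +1,  0),  # CC_ before σσ́
    "le vin se vend":        (-1,  0, +1),  # C_ before σ́
    "le vin se vend bien":   (-1,  0,  0),  # C_ before σσ́: neutral
}

def harmony_difference(diff, weights):
    """Weighted sum of a difference vector; in deterministic HG the
    schwa candidate is optimal exactly when this sum is positive."""
    return sum(d * w for d, w in zip(diff, weights))

WEIGHTS = (3, 2, 2)  # (NOSCHWA, *CCC, *CLASH)
schwa_optimal = {ctx: harmony_difference(d, WEIGHTS) > 0
                 for ctx, d in DIFFS.items()}
# Only the cumulative context selects schwa: -3 + 2 + 2 = 1 > 0,
# while each markedness constraint alone loses: -3 + 2 = -1 < 0.
```

The arithmetic in the comments is exactly the gang effect: neither markedness constraint can overcome NOSCHWA alone, but their summed weights can.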
Stochastic OT (
Jäger & Rosenbach (
To get to a definition of sublinear cumulativity, we must first explain what we mean by the
Contexts: neutral (A), non-cumulative (C, B), and cumulative (D).

        _σσ́   _σ́
CC_     C      D
C_      A      B
Proportion realized schwa in output distributions with constraints set to ranking value 1: sublinear cumulativity in Stochastic OT.

Context                          P(schwa)   Constraint contributions
la terre se vend (CC_σ́)         0.67       Δ*CLASH = 0.17, Δ*CCC = 0.17
la terre se vend bien (CC_σσ́)   0.5        Δ*CCC = 0.5
le vin se vend (C_σ́)            0.5        Δ*CLASH = 0.5
le vin se vend bien (C_σσ́)      0          —
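These Stochastic OT proportions can be approximated by Monte Carlo simulation. The sketch below assumes the usual Gaussian evaluation noise with standard deviation 2; the names and trial count are our own illustrative choices.

```python
import random

# Difference vectors as in (33); order (NOSCHWA, *CCC, *CLASH).
DIFFS = {
    "la terre se vend":      (-1, +1, +1),
    "la terre se vend bien": (-1, +1,  0),
    "le vin se vend":        (-1,  0, +1),
    "le vin se vend bien":   (-1,  0,  0),
}

def stot_p_schwa(diff, values, sd=2.0, trials=100_000, seed=1):
    """Monte Carlo Stochastic OT: perturb each ranking value with
    Gaussian evaluation noise; the highest-ranked constraint that
    distinguishes the two candidates decides the winner."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        noisy = [v + rng.gauss(0, sd) for v in values]
        # pick the constraint with the highest noisy value among those
        # that differentiate the candidates; schwa wins if it favors schwa
        _, decisive = max((nv, d) for nv, d in zip(noisy, diff) if d != 0)
        wins += decisive > 0
    return wins / trials

# With all three ranking values at 1, this approximates (0.67, 0.5, 0.5, 0).
p = {ctx: stot_p_schwa(d, (1, 1, 1)) for ctx, d in DIFFS.items()}
```

With equal ranking values, the cumulative context's 0.67 is simply the chance that either markedness constraint, rather than NOSCHWA, ends up on top of the sampled order.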
Whether a case of cumulativity is
To see why Stochastic OT can only represent sublinear cumulativity, we can consider the differences across environments in terms of the summed probabilities of constraint rankings. In Table
Illustration of Stochastic OT cumulativity.
Ranking                              C_σσ́   CC_σσ́   C_σ́   CC_σ́
a.  NOSCHWA >> *CCC >> *CLASH
b.  NOSCHWA >> *CLASH >> *CCC
c.  *CCC >> NOSCHWA >> *CLASH                 X               X
d.  *CCC >> *CLASH >> NOSCHWA                 X        X      X
e.  *CLASH >> NOSCHWA >> *CCC                          X      X
f.  *CLASH >> *CCC >> NOSCHWA                 X        X      X
In Table
We now turn to patterns of cumulativity in Maximum Entropy Grammar (MaxEnt;
So, given the weights (3, 2, 2) used for illustrative purposes in (35), the probability of realized schwa would be 0.73 in the top row (since the difference in harmonies is 1), 0.27 in each of the middle rows (since the difference is –1), and 0.05 in the bottom row (since the difference is –3). This is shown in Table
Proportion realized schwa in output distributions with NOSCHWA at weight 3 and *CCC and *CLASH at weight 2: superlinear cumulativity in MaxEnt.

Context                          P(schwa)   Constraint contributions
la terre se vend (CC_σ́)         0.73       Δ*CLASH = 0.46, Δ*CCC = 0.46
la terre se vend bien (CC_σσ́)   0.27       Δ*CCC = 0.22
le vin se vend (C_σ́)            0.27       Δ*CLASH = 0.22
le vin se vend bien (C_σσ́)      0.05       —
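In a two-candidate comparison, these MaxEnt probabilities are just the logistic function of the weighted harmony difference; a minimal sketch (names ours):

```python
import math

# Difference vectors as in (33); order (NOSCHWA, *CCC, *CLASH).
DIFFS = {
    "la terre se vend":      (-1, +1, +1),
    "la terre se vend bien": (-1, +1,  0),
    "le vin se vend":        (-1,  0, +1),
    "le vin se vend bien":   (-1,  0,  0),
}

def maxent_p_schwa(diff, weights):
    """Two-candidate MaxEnt: P(schwa) is the logistic function of the
    weighted sum of the difference vector (the harmony difference)."""
    h = sum(d * w for d, w in zip(diff, weights))
    return 1 / (1 + math.exp(-h))

# With (NOSCHWA, *CCC, *CLASH) = (3, 2, 2), the harmony differences are
# 1, -1, -1, and -3, yielding roughly 0.73, 0.27, 0.27, and 0.05.
p = {ctx: maxent_p_schwa(d, (3, 2, 2)) for ctx, d in DIFFS.items()}
```

The logistic form makes the superlinearity transparent: equal steps in harmony produce larger probability steps near 0.5 than near the floor.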
Noisy HG is like Stochastic OT, except the values of the constraints are used in a weighted constraint evaluation of the candidate set. Like MaxEnt, it can generate superlinear cumulativity, though as we will see, the patterns the two models predict are not identical.
To begin our comparison of the three models, we first consider the probability distributions they produce when constraint values are set at 1, shown in Table
Proportion realized schwa in output distributions with constraint weights set to 1.
Context                 Stochastic OT   Noisy HG   MaxEnt   MaxEnt contributions
la terre se vend        0.67            1          0.73     Δ*CLASH = 0.23, Δ*CCC = 0.23
la terre se vend bien   0.5             0.5        0.5      Δ*CCC = 0.23
le vin se vend          0.5             0.5        0.5      Δ*CLASH = 0.23
le vin se vend bien     0               0          0.27     —
The MaxEnt probabilities arise because realized schwa is preferred by a Harmony score of 1 in the top row, dispreferred relative to deletion by 1 in the bottom row, and the two outcomes have equal Harmony in the middle rows. Noisy HG also assigns equal probability in the middle rows (as does Stochastic OT). For the top row in Noisy HG, a noise value of 0.2 has a very low probability of subverting the pre-noise preference for the faithful candidate by making the sum of the weights of *
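The Noisy HG pattern can likewise be checked by simulation. The sketch below assumes, following the discussion above, Gaussian noise with standard deviation 0.2 added to each weight at evaluation time; names and trial count are ours.

```python
import random

# Difference vectors as in (33); order (NOSCHWA, *CCC, *CLASH).
DIFFS = {
    "la terre se vend":      (-1, +1, +1),
    "la terre se vend bien": (-1, +1,  0),
    "le vin se vend":        (-1,  0, +1),
    "le vin se vend bien":   (-1,  0,  0),
}

def noisy_hg_p_schwa(diff, weights, sd=0.2, trials=100_000, seed=1):
    """Monte Carlo Noisy HG: Gaussian noise is added to each weight at
    evaluation time; the candidate with the higher weighted harmony
    wins the trial."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        noisy = [w + rng.gauss(0, sd) for w in weights]
        wins += sum(d * w for d, w in zip(diff, noisy)) > 0
    return wins / trials

# With all weights at 1, this approximates the (1, 0.5, 0.5, 0) column.
p = {ctx: noisy_hg_p_schwa(d, (1, 1, 1)) for ctx, d in DIFFS.items()}
```

In the top row the pre-noise harmony margin is a full weight unit, so noise of 0.2 per constraint almost never reverses the outcome; in the middle rows the margin is zero, so the noise decides, giving 0.5.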
Because cumulativity in Stochastic OT is predictably sublinear, we know that there is no set of constraint values that will allow it to model the linear cumulativity produced in Noisy HG and MaxEnt with values of 1. It is also the case that Noisy HG is unable to match the distribution produced by Stochastic OT. For Noisy HG, if the weights are small enough to allow N
When the probability in the middle rows is at 0.5, MaxEnt is necessarily strictly linear. This can be understood based on Zuraw & Hayes’ (
Probability of a candidate relative to Harmony difference.
The contribution on either side of probability 0.5 is equal: if adding a violation difference increases probability from a baseline of 0.4 to 0.5, it will also increase probability from 0.5 to 0.6. This is the situation we have looked at in the tables thus far, and this explains why MaxEnt cannot match the Stochastic OT (0.67, 0.5, 0.5, 0) distribution in the table, nor the Noisy HG (0.72, 0.5, 0.5, 0.12) distribution discussed in the text.
To escape the clutches of linearity in MaxEnt, we can change the probability of faithful schwa in the non-cumulative context. For example, if we give *
Proportion realized schwa in output distributions with NOSCHWA weighted below *CCC and *CLASH (MaxEnt weights: *CCC = *CLASH = 2, NOSCHWA = 1): sublinear cumulativity in MaxEnt.

Context                 Stochastic OT   Noisy HG   MaxEnt   MaxEnt contributions
la terre se vend        1               1          0.95     Δ*CLASH = 0.22, Δ*CCC = 0.22
la terre se vend bien   1               1          0.73     Δ*CCC = 0.46
le vin se vend          1               1          0.73     Δ*CLASH = 0.46
le vin se vend bien     0               0          0.27     —
MaxEnt can of course match the Stochastic OT and Noisy HG distributions to the degree of resolution we are examining. With the current constraint set, the MaxEnt distribution is completely out of reach of the other frameworks because the faithful schwa gets non-negligible probability in the bottom row, where it is harmonically bounded by deletion. To give them a chance to match it, we can add McCarthy and Prince’s (
Finally, Noisy HG and MaxEnt can display superlinear cumulativity in probability differences, as shown in Table
Proportion realized schwa in output distributions with NOSCHWA weighted above *CCC and *CLASH (MaxEnt weights: NOSCHWA = 2, *CCC = *CLASH = 1): superlinear cumulativity.

Context                 Stochastic OT   Noisy HG   MaxEnt   MaxEnt contributions
la terre se vend        0               0.5        0.5      Δ*CLASH = 0.23, Δ*CCC = 0.23
la terre se vend bien   0               0          0.27     Δ*CCC = 0.15
le vin se vend          0               0          0.27     Δ*CLASH = 0.15
le vin se vend bien     0               0          0.12     —
Since Stochastic OT is predictably sublinear, superlinear patterns are predictably beyond its scope. MaxEnt and Noisy HG can model the Stochastic OT pattern by assigning N
In sum, we have shown that each model has restrictions on the types of probabilistic patterns it can model. This means that we should be able to test them in their relative ability to match natural language cumulativity. The biggest difference amongst the models appears to be Stochastic OT’s weaker cumulativity with respect to the other two: it is always sublinear. MaxEnt’s degree of cumulativity, sublinear, linear, or superlinear, was shown to be related to where the effect of a single competing constraint lands in probability space: below 0.50, at 0.50, or above. Noisy HG’s degree of cumulativity is less predictable in that it can model sublinear patterns out of reach of MaxEnt; in this respect it falls between the two other theories, as might be expected, since it combines Stochastic OT’s noise with MaxEnt’s weighted evaluation.
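The three-way classification used throughout this section reduces to comparing two probability differences. The helper below is our own rendering, assuming the symmetric two-constraint setup of the tables above, where the neutral, single-violation, and cumulative contexts give three probabilities.

```python
def classify_cumulativity(p_neutral, p_single, p_cumulative, tol=1e-6):
    """Compare a constraint's contribution on its own (neutral ->
    single-violation context) with its contribution in the cumulative
    context (single -> cumulative). Returns 'sublinear', 'linear',
    or 'superlinear'."""
    solo = p_single - p_neutral
    cumulative = p_cumulative - p_single
    if cumulative < solo - tol:
        return "sublinear"
    if cumulative > solo + tol:
        return "superlinear"
    return "linear"

# Stochastic OT with ranking values at 1: (0, 0.5, 0.67) is sublinear.
# MaxEnt with weights (3, 2, 2): (0.05, 0.27, 0.73) is superlinear.
# MaxEnt with all weights 1: (0.27, 0.5, 0.73) is linear.
```

The MaxEnt case makes the dependence on probability space explicit: the same one-unit harmony step counts for more or less depending on whether the baseline sits below, at, or above 0.5.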
Along with cases of underlying schwa discussed in the previous section, our judgment experiment examined four parallel epenthesis contexts, illustrated in Table
Examples of epenthetic schwa contexts.
Following context  Preceding context  

C_  CC_  
_σ́  la bott 
mets ta vest 
_σσ́  la bott 
mets ta vest 
We assume that the vowels in these cases are not underlying, but are supplied through epenthesis. In the contexts in the rightmost column, the epenthetic schwa avoids a consonant cluster, and in those in the top row, it avoids a stress clash.
The grand means of realized schwa from the experiment are repeated in Table
Experimental results (proportion realized schwa).
Following context      Preceding context
                       C_       CC_
Deletion (clitic boundary):
_σ́                    0.648    0.938
_σσ́                   0.562    0.914
Epenthesis (word boundary):
_σ́                    0.122    0.833
_σσ́                   0.090    0.683
The constraint set for these models includes the three markedness constraints introduced in the last section for the deletion cases: N
(36)  Difference vectors for constraint scores: negative values favor schwa deletion, positive differences favor schwa realization  
We first present a MaxEnt model whose weights were obtained by using a batch learner (
Table
MaxEnt’s predicted probabilities after batch training, errors in parentheses.
Following context      Preceding context
                       C_                CC_
Deletion (clitic boundary):
_σ́                    0.633 (–0.015)    0.967 (0.029)
_σσ́                   0.514 (–0.048)    0.948 (0.034)
Epenthesis (word boundary):
_σ́                    0.167 (0.045)     0.775 (–0.058)
_σσ́                   0.109 (0.019)     0.678 (–0.005)
The constraint weights producing these probabilities are shown in the Table
MaxEnt constraint weights after batch training.
*CCC       2.845
DEP        1.084
MAX        1.069
NOSCHWA    1.015
*C         0.490
*C         0.000
The contribution of the high weighted *
*C
To obtain fitted models for Stochastic OT and Noisy HG, we must use online learners; no batch approaches are available because it is computationally costly to calculate or estimate model-predicted probabilities in those frameworks. In online learning, the learner receives a single piece of data at each learning step and uses the grammar to generate a prediction just for that datum, updating the constraint values if the learning datum and the prediction mismatch. Conveniently, it is possible to conduct online learning in a nearly identical way across the three frameworks. For MaxEnt, the online method is referred to as Stochastic Gradient Ascent (
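The shared error-driven update can be sketched as a perceptron-style rule over difference vectors. This is our own simplified rendering of the update shared across the frameworks, not the exact implementation used in the simulations; the function name and learning rate are illustrative.

```python
def online_update(values, diff, observed_schwa, predicted_schwa, rate=0.1):
    """One error-driven learning step: on a mismatch between the
    observed datum and the grammar's sampled prediction, every
    constraint preferring the observed form is promoted by `rate`,
    and every constraint preferring the wrong prediction is demoted.
    `diff` is the difference vector (positive = prefers schwa)."""
    if observed_schwa == predicted_schwa:
        return values  # no error, no update
    direction = 1 if observed_schwa else -1
    return [v + rate * direction * d for v, d in zip(values, diff)]

# Datum has schwa but the grammar predicted deletion in the CC_ +
# stress-clash context: promote *CCC and *CLASH, demote NOSCHWA.
new_values = online_update([1.0, 1.0, 1.0], (-1, +1, +1), True, False)
```

For MaxEnt, this single-sample update can be read as a stochastic approximation to the gradient; for Stochastic OT and Noisy HG, the same rule applies to ranking values and weights, respectively, with the prediction generated under noise.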
The learning simulations were conducted in Praat (
The MaxEnt model trained online predicts distributions very similar to those of the model trained in a batch fashion. Table
Proportions of schwa for MaxEnt after online training.
Following context      Preceding context
                       C_                CC_
Deletion (clitic boundary):
_σ́                    0.637 (–0.011)    0.968 (0.030)
_σσ́                   0.518 (–0.043)    0.950 (0.036)
Epenthesis (word boundary):
_σ́                    0.166 (0.043)     0.778 (–0.056)
_σσ́                   0.109 (0.019)     0.682 (–0.001)
The weights producing that distribution, shown in Table
MaxEnt constraint weights after online training.
*CCC       3.532
NOSCHWA    1.798
MAX        1.184
DEP        0.982
*C         0.670
*C         0.502
The predictions of the best fitting Stochastic OT model are shown in Table
Proportions of schwa of the best fitting Stochastic OT model.
Following context      Preceding context
                       C_                CC_
Deletion (clitic boundary):
_σ́                    0.648 (0.000)     0.914 (–0.025)
_σσ́                   0.567 (0.005)     0.907 (–0.006)
Epenthesis (word boundary):
_σ́                    0.169 (0.047)     0.778 (–0.064)
_σσ́                   0.109 (0.019)     0.682 (0.073)
Like the MaxEnt models, the Stochastic OT model captures the general pattern of cumulative constraint interactions, and some individual fits are even somewhat better. The bulk of the error is in the rightmost column, epenthetic schwa: the values of the two rows are too close together relative to the empirical data, which means the effect of *C
The Stochastic OT constraint values producing this distribution are shown in Table
Stochastic OT constraint values.
*      2.402
D      2.144
M      2.097
N      2.047
*C     1.977
*C     1.551
The final set of predictions is that of the best-fitting Noisy HG model, shown in Table
Proportions of schwa for the best-fitting Noisy HG model.
                     Preceding context
Following context    C_                CC_
_σ́                  0.634 (–0.014)    0.977 (0.038)
_σσ́                 0.527 (–0.035)    0.963 (0.050)

_σ́                  0.195 (0.072)     0.766 (–0.067)
_σσ́                 0.107 (0.016)     0.690 (0.002)
The Noisy HG model succeeds in getting a greater spread than Stochastic OT between CC_σ́ and CC_σσ́ for epenthesis, in this respect mimicking MaxEnt and approaching the empirical spread. In doing so, though, it also creates a greater spread between the values in the C_ column than is motivated by the empirical data. Here Noisy HG is producing a slightly sublinear pattern: the effect of *C
The weights producing the Noisy HG distribution are given in Table
Noisy HG constraint weights.
*      2.299
N      1.955
*C     1.746
M      0.211
D      0.166
*C     0.034
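Although our fitted values come from Praat's learners, the Noisy HG evaluation procedure itself is simple to sketch by Monte Carlo simulation: each evaluation perturbs the weights with Gaussian noise, and probabilities are estimated by counting winners over many evaluations. The candidate sets below are made up for illustration, and the evaluation noise of 2.0 is an assumed setting; Stochastic OT evaluation is analogous, with noisy ranking values and strict domination in place of weighted summation.

```python
import random

def noisy_hg_winner(weights, cands, noise_sd=2.0, rng=random):
    """One Noisy HG evaluation: perturb each weight with Gaussian noise,
    then return the candidate with the least weighted violation."""
    noisy = [w + rng.gauss(0.0, noise_sd) for w in weights]
    return min(cands, key=lambda c: sum(w * v for w, v in zip(noisy, c[1])))[0]

def estimate_prob(weights, cands, name, n=20000, seed=0):
    """Monte Carlo estimate of the probability that candidate `name` wins."""
    rng = random.Random(seed)
    wins = sum(noisy_hg_winner(weights, cands, rng=rng) == name
               for _ in range(n))
    return wins / n
```

Because predicted probabilities must be estimated by sampling in this way, rather than calculated in closed form as in MaxEnt, batch fitting is costly for Noisy HG and Stochastic OT, which is why we used online learners for those frameworks.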
Our comparisons of models’ fit to the empirical data have thus far been made in terms of differences in raw probability. There are other ways of measuring fit, and one might wonder whether the outcome is different using other metrics. In Table
Error for each model.
                 Absolute Error          Sum of Squared Error    KL Divergence
                 Mean    Min     Max     Mean    Min     Max     Mean    Min     Max
Stochastic OT    0.330   0.299   0.381   0.043   0.037   0.052   0.086   0.064   0.112
Noisy HG         0.327   0.295   0.371   0.035   0.031   0.045   0.035   0.034   0.037
MaxEnt           0.256   0.240   0.269   0.021   0.019   0.023   0.020   0.020   0.021
In sum, all three models – MaxEnt, Noisy HG and Stochastic OT – were able to capture the general pattern of cumulative constraint interaction seen in the empirical data, and provided reasonable fits to the attested values. The MaxEnt model did slightly better than the other models, and in comparison to Stochastic OT, at least some of that success is attributable to its ability to produce superlinear cumulativity in probability space.
Since the predictions of our generative models are only as trustworthy as the data they're trained on, we've taken many steps to model the simplest, most controlled data set possible, collecting a large amount of judgment data for a relatively small set of contexts. This is necessary because the realization of schwa is conditioned by a multitude of factors, naturally occurring data are very noisy, and accurately estimating probabilities requires many tokens. As a result, using corpus data makes it difficult to isolate the fine-grained differences in the predictions of the models.
Although we've looked at the interaction of just three factors that condition the realization of schwa (type of boundary, stress, and number of preceding segments), we expect the same types of constraint interaction regardless of the phonological constraints under consideration. A richer model would take into account factors that we controlled for and mentioned in passing, such as the sonority profile of the consonant cluster, the number of preceding syllables, h aspiré, and individual differences between speakers. Future work will determine whether our present findings scale up when more factors are considered in naturally occurring speech.
In this paper, we described and modeled the interaction of two phonological factors that condition French schwa alternations: schwa is more likely after two consonants than one (the cluster factor) and in the penultimate syllable than elsewhere (the stress factor). Each of these factors has been identified in the literature on French schwa, but their interaction in probability space hasn't been previously described or formalized. Using data from a judgment study, we showed that both factors play a role in schwa epenthesis and deletion, including in contexts where the stress factor has previously been described as having no effect. We then provided a characterization of patterns of cumulative interaction as sub- through superlinear, showing that Stochastic OT is limited to sublinear cumulativity. Because superlinearity is attested in our experimental data, Stochastic OT fared less well in fitting the data than the weighted constraint probabilistic models Noisy HG and MaxEnt, with MaxEnt yielding the best fit to the data. These results add to a growing body of work showing that weighted constraints provide a better fit to probabilistic natural language data than ranked constraints, particularly when it comes to cumulativity.
The additional files for this article can be found as follows:
List of experimental items. DOI:
Participant background. DOI:
HG = Harmonic Grammar, MaxEnt = Maximum Entropy Grammar, OT = Optimality Theory
Two of these papers — Guy (
Using our experimental data, we can roughly estimate how these notational devices correspond to the probability of schwa realization. Contexts for which schwa realization is described as forbidden, “ɇ”, have a probability of schwa realization of up to 0.12 in our experiment, contexts for which realization is described as obligatory, “
One such possibility is presented in Kaplan (
There were three nouns requiring an
Location is relevant because French schwa is subject to regional variation. Interspeaker differences (e.g., region, gender, age, social class) are discussed in the next section, where we use random effects in our statistical model to minimize their influence on our conclusions.
The glmer equation in R: Schwa ~ Ep/Del + Stress * Seg + (1 | Item) + (1 + Ep/Del + Stress * Seg | Subject).
The usual MaxEnt calculation for the probability of one of two candidates with Harmony H1 and H2 respectively is e^{H1}/(e^{H1} + e^{H2}). Because we have subtracted out the constraint scores for one of the candidates, its probability in the equation can be represented as e^{0} = 1. See Zuraw and Hayes (
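The equivalence described here is easy to verify numerically; the harmonies in the sketch below are arbitrary illustrative values.

```python
import math

def maxent_p(h1, h2):
    # standard two-candidate form: e^{H1} / (e^{H1} + e^{H2})
    return math.exp(h1) / (math.exp(h1) + math.exp(h2))

def maxent_p_rescored(h2_minus_h1):
    # after subtracting candidate 1's constraint scores, its numerator
    # becomes e^0 = 1, leaving the logistic form
    return 1.0 / (1.0 + math.exp(h2_minus_h1))
```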
Edward Flemming (p.c.) points out that one can characterize the difference between MaxEnt and Noisy HG in terms of MaxEnt, but not Noisy HG, being linear in log space.
Because *
Absolute error was calculated with respect to the probability of schwa in each context. Sum of squared error and KL divergence were calculated over the probabilities of both schwa and no-schwa. KL divergence is formulated to be calculated over entire probability distributions. If SSE were calculated over just the probability of schwa, the value would be half of that reported, and if absolute error were calculated for both schwa and no-schwa, it would double.
We are grateful to three anonymous reviewers for their comments, which greatly improved the quality of the manuscript. We are also grateful for feedback and comments from the audiences at the 46th Linguistic Symposium on Romance Languages, Stony Brook University, as well as discussions with François Dell, Edward Flemming, Bruce Hayes, and Kie Zuraw.
Joe Pater’s work on this project was supported by NSF grants 1424077 and 1650957 to the University of Massachusetts Amherst.
The authors have no competing interests to declare.