In this paper, we introduce a novel domain-general, statistical learning model for P&P grammars: the Expectation Driven Parameter Learner (EDPL). We show that the EDPL provides a mathematically principled solution to the Credit Problem (
Understanding how learners overcome the pervasive ambiguity inherent to the language acquisition process is a foundational question of linguistics, and cognitive science more generally. In this paper, we focus on a type of structural ambiguity sometimes referred to as the Credit Problem (
The Principles and Parameters (P&P) approach to language typology and acquisition (
Within phonology, the Credit Problem is particularly clear in the domain of metrical stress. To succeed, the P&P learner must discover the language-specific settings of stress parameters, such as footing directionality and headedness, which are not directly observable in the language input. In this context, the Credit Problem refers to the learner’s uncertainty about which parameter setting to “credit” for a successful prediction and which to “blame” for an unsuccessful prediction. As mentioned above, to deal with the Credit Problem, existing work on stress parameter learning (
Contra Dresher & Kaye (
Our results indicate that the NPL does not possess the necessary mechanisms to cope with the pervasive ambiguity inherent to parametric stress, while the EDPL does. We conclude that domain-general learning of parametric stress remains a viable hypothesis, but only if it incorporates mechanisms that directly address the Credit Problem. These conclusions have implications for the nature and content of Universal Grammar (UG), but they also complement theoretical work within and across frameworks by enriching our understanding of various theoretical frameworks’ computational properties. While our focus is on the P&P approach to metrical phonology, the learning models we examine are broadly applicable to P&P theories in phonology and beyond, and connect to current approaches to learning in Optimality Theory (OT;
The rest of this paper is structured as follows. We present a detailed overview of the Credit Problem in Dresher & Kaye’s (
Dresher & Kaye (
(1)  i.  The word-tree is strong on [Left/Right] ( 
ii.  Feet are [Binary/Unbounded] ( 

iii.  Feet are built from the [Left/Right] ( 

iv.  Feet are strong on the [Left/Right] ( 

v.  Feet are quantity sensitive (QS) [Yes/No] ( 

vi.  Feet are QS to the [Rime/Nucleus] ( 

vii.  A strong branch of a foot must itself branch [No/Yes] ( 

viii.  There is an extrametrical syllable [No/Yes] ( 

ix.  It is extrametrical on the [Left/Right] ( 

x.  Feet consisting of a single light syllable are removed [No/Yes] ( 

xi.  Feet are non-iterative [No/Yes] ( 
For D&K, bounded feet can have the shapes {(L), (LL), (H), (HL), (LH)}. Feet of the shape (
When
The most ambiguous stress pattern in D&K’s framework is initial/final stress. A form with initial stress, as in (2), could be attributed to settings of completely unrelated parameters, corresponding to distinct assignments of hidden structure, as illustrated in (2a–c). In (2a), binary trochees are built throughout the word, of which the leftmost receives main stress, with no overt stress projected from the other trochees. In (2b), a single unbounded trochee is built over the entire word, and it receives main stress; note that this is consistent with
(2)  a.  ( 
b.  ( 

c.  σ́σσσσσ  
In (2), the three sets of parameter settings are compatible with one another: combining any of these three settings will also yield initial stress. In other cases, structural analyses of the same overt stress pattern can be mutually incompatible. The pattern with penultimate main stress and alternating secondary stress provides an example, as shown in (3) for odd-syllable words and (4) for even-syllable words. In the iambic parses in (3a) and (4a), Right-to-Left iambic feet are built with right extrametricality such that degenerate feet are allowed, and the rightmost foot receives main stress. In (3b) and (4b), Right-to-Left trochaic feet are built without extrametricality and degenerate feet are disallowed. This is also global ambiguity since both types of parses yield the same stress patterns for all words. However, here the learner must find a consistent combination of several interdependent parameters to produce the right stress pattern. If the learner chooses trochees, it must also posit no extrametricality and no degenerate feet; if it chooses iambs, it must also posit right extrametricality and permit degenerate feet.
(3)  a.  (σ 
b.  σ( 

(4)  a.  ( 
b.  ( 

Such cases of global ambiguity require learners to be sensitive to the interdependence between parameters: the learning data will never provide unambiguous information about the settings of some parameters. In (3–4), both odd- and even-parity words are compatible with trochees and iambs: there is no learning data that will unambiguously require one or the other foot type. The same holds for the extrametricality, directionality and degenerate foot parameters.
In addition to global ambiguity, the learner must also cope with
The learner must be able to combine information across multiple data forms to arrive at a grammar that accounts for all observed stress patterns. For example, a language with antepenultimate main stress and alternating leftward secondary stresses (e.g., [ta.mà.na.pò.la.tú.ti.la]) must be analysed with Right-to-Left trochees and right extrametricality. However, each individual word has an alternative analysis with Left-to-Right trochees. For even-parity words like <na>(pò.la)(tú.ti)la, this Left-to-Right analysis requires left extrametricality, whereas odd-parity words like (mà.na)(pò.la)(tú.ti)la require no left extrametricality. It is only by comparing even-parity and odd-parity words that the learner can conclude that the correct analysis is indeed Right-to-Left. Note that this conclusion also requires sensitivity to the interdependence between parameters, since it depends on identifying a consistent setting of the extrametricality parameters which, in combination with the directionality setting, produces the correct stress pattern across all forms.
Thus, learning parametric stress requires facing ambiguity resulting from two types of interdependence: interdependence between parameters and interdependence between word forms. The data may be globally ambiguous and require the learner to commit to a combination of interdependent parameter settings to specify a working grammar. Moreover, a given learning datum may be locally ambiguous and require the learner to cope with interdependence between data forms to set crucial parameter settings. This results in a difficult computational challenge since the learner cannot solve the problem by considering parameters or data points in isolation. Such ambiguities, especially the one in (3), are also relevant in other frameworks like Hayes (
Interdependencies like these are a challenge for an incremental learner. When the learner’s current hypothesis correctly generates the stress pattern for an observed word, the learner faces the Credit Problem: it is unclear which parameter setting to credit with this correctness. For example, if the learner’s current grammar generates (σσ̀)(σσ́) with Right-to-Left iambs, matching the observed pattern σσ̀σσ́, the learner does not know whether it is iambic footing or Right-to-Left directionality that should get credit for this match. The same problem occurs when the model fails to generate the correct stress pattern. For example, if the observed pattern is σσ̀σσ́ and the learner’s current grammar produces the mismatching <σ>(σ́σ)σ, it is not immediately obvious whether left extrametricality, the presence of extrametricality, the trochaic foot, or the lack of degenerate feet led to the mismatch between the observed and the predicted stress pattern. Simply observing that a combination of parameter settings leads to a match or mismatch does not mean that all parameter settings should share credit or blame equally.
To address the Credit Problem, the learner must have a way to gauge which parameters are the most relevant to and responsible for a given data point. Not only must the learner resolve this ambiguity and ultimately succeed in reaching the target grammar, it must do so using a computationally feasible learning procedure. In addition to considering learning success, in this paper we consider two fundamental measures of computational complexity.
Given the preceding discussion, it should be clear that brute-force approaches cannot cope with the computational challenges inherent to learning parameter settings from structurally ambiguous data, as has been discussed extensively in previous work (D&K 1990;
Unfortunately, this strategy has two fatal flaws. First, as Fodor points out, explicitly enumerating all combinations of all parameter settings for each datum is computationally intractable: the processing complexity grows exponentially with the number of parameters. Second, this strategy would fail to learn a complete grammar in cases of global ambiguity like examples (2–4). There, successful analyses of the learning data vary on all settings of numerous parameters: no single parameter setting is shared between them, and no datum in the language can help this learner break out of the ambiguity. In other words, the data are not guaranteed to contain triggers for every parameter. For extensive discussion of this issue in the domain of syntactic parameters, see Gibson & Wexler (
In the domain of stress, several learning models for parameter setting have been proposed and can be broadly classified as either domain-specific or domain-general. Both domain-specific and domain-general learning approaches assume UG is available to the learner: the learner has access to the universal set of parameters, their possible settings, and the system that generates linguistic structures based on specified parameter settings. They differ, however, in whether the posited learning mechanisms are domain-specific themselves.
Domain-specific learners for parameters may have prior knowledge of ambiguities that arise in linguistic data for a particular set of parameters, the best order in which to set these parameters, and/or default settings of these parameters (
In contrast, domain-general learners, while having access to UG and being able to manipulate parameters, have learning strategies that do not depend on the content or identity of any specific parameter (domain-specific knowledge). Crucially, a domain-general learner cannot rely on the identity of a parameter to make inferences about its setting or connect it to data. Domain-general learners rely on mechanisms that generalise beyond a given system or linguistic domain.
In the domain of stress, one well-known domain-specific learning approach relies on cues to parameter setting (D&K 1990;
Cues tell the learner which data points are informative for a given parameter setting. For instance, if we again consider the example of ambiguity in initial-stress words like σ́σσσσσ (see (2) in §2.1), cues can tell the learner that this data point is uninformative for many different parameters. D&K (1990) propose that
In addition to cues for each parameter, D&K specify an order in which parameters should be acquired.
Gillis et al. (
More recently, Pearl (
Thus, Pearl’s arguments are based on a direct comparison of domain-specific and domain-general learning strategies, but the argument focuses on one analysis of one language. D&K propose a parametric system and a detailed domain-specific learning model for that system, but they do not explore alternative models with weaker assumptions. This is the question we take up in this paper by systematically evaluating two domain-general learning models on D&K’s typology.
Domain-general approaches have not been extensively explored in the domain of stress parameter learning. Beyond our own proposal, the only application of domain-general learning models to parametric stress is the series of studies by Pearl (
The NPL is an online, incremental learning algorithm for probabilistic P&P grammars, the details of which are presented in §4. The algorithm is
Pearl (
To summarise, the most prominent and extensively studied approaches to the Credit Problem in stress parameter learning rely on domain-specific learning mechanisms. Existing work on domain-general learning of stress (
Domain-specific approaches have a number of drawbacks. One is that they make strong assumptions about the genetic endowment, positing domain-specific knowledge beyond the knowledge of UG itself. Tesar & Smolensky (
Domain-general approaches do not share these disadvantages. The assumptions about the genetic endowment are more modest, and a domain-general model tested on one parametric system can be applied without modification to any other parametric system, whether that be an alternative theory of stress parameters or a parametric system in another domain, such as syntax.
In the rest of this paper, we explore the possibilities of domain-general learning in D&K’s parametric stress framework. Before presenting our novel approach in §5, we first explain how the NPL works, since our proposal shares all but one aspect of its inner workings with the NPL.
Yang (
The novel proposal in the current paper, the Expectation Driven Parameter Learner (EDPL, see §5), shares the majority of its machinery with the NPL: the probabilistic parameter grammar framework and the linear update rule, which will be discussed in §4.1 and §4.2, respectively. The difference between the learners lies in how the reward/penalty value in the update rule is calculated. The NPL’s method for doing so is covered in §4.2 (see §5 for the EDPL’s method). Then, a pseudo-batch modification to the NPL is presented in §4.3, while §4.4 presents some crucial challenges for the NPL.
Yang defines a probabilistic parameter grammar in terms of a set of independent Bernoulli distributions, one for each (binary) parameter in UG, as exemplified in (5). The probability of a parameter setting stands for how often this setting will be chosen at a given instance of the grammar’s use. The probabilities of the settings for each parameter sum to 1, and there is no relationship between the probabilities of settings of different parameters.
(5) 
When a grammar is used to generate an output for a given input (in our case, a stress pattern given a sequence of syllables), each parameter is given a categorical setting sampled from that parameter’s probability distribution. An output is then generated based on the parameter specification generated in this way (cf. (6)).
When learning, this predicted output is compared to an observed output, resulting in a stress match or mismatch. In (6), two parameter specifications are generated from the grammar in (5). In (6a), rightmost main stress and L-to-R feet are selected, while in (6b), leftmost main stress and R-to-L feet are selected. Because of the probabilities in the grammar, the specification in (6b) is more likely to be chosen than the one in (6a). In both cases, feet are bounded, since this option has a probability of 1 in the grammar. In terms of production and comparison to the hypothetical observed forms, the specification in (6a) leads to a match between the predicted and the observed stress pattern, while the specification in (6b) leads to a mismatch.
(6)  a.  Sample parameter specification 1: 

probability: 0.4 × 1 × 0.3 = 0.12  
(kàla)(máta)na  observed: [kàlamátana]  match=TRUE  
b.  Sample parameter specification 2: 

probability: 0.6 × 1 × 0.7 = 0.42  
ka(láma)(tàna)  observed: [kàlamátana]  match=FALSE 
Both the NPL and the EDPL represent their knowledge of language in terms of Yang-style grammars. If the stress system of a language is categorical, with no variation between or within words, this can be represented by a grammar where all crucial parameter settings have a probability of 1. However, if a language does exhibit variation, it is possible to represent this by giving parameter settings probabilities between 0 and 1. In this paper, we only consider categorical systems as targets of learning, but patterns with variation are an important intermediate stage, and target patterns with variation are an important test case for future work.
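To make this representation concrete, the following Python sketch implements a Yang-style grammar as a set of independent Bernoulli distributions. The parameter names and probabilities are hypothetical stand-ins chosen only to reproduce the worked probabilities in (6); they are not D&K’s actual parameters.

```python
import random

# Hypothetical three-parameter grammar in the style of (5): each entry is
# P(setting = True) for one independent binary parameter.
grammar = {
    "MainStressRight": 0.4,   # P(False) = 0.6
    "FeetBounded":     1.0,   # categorical: this setting is always chosen
    "FeetLeftToRight": 0.3,   # P(Right-to-Left) = 0.7
}

def sample_specification(grammar, rng=random):
    """Sample one categorical parameter specification from the grammar."""
    return {p: rng.random() < prob for p, prob in grammar.items()}

def specification_probability(grammar, spec):
    """Joint probability of a specification: since the parameters'
    distributions are independent, this is simply a product."""
    prob = 1.0
    for p, setting in spec.items():
        prob *= grammar[p] if setting else 1.0 - grammar[p]
    return prob
```

With these illustrative numbers, a specification analogous to (6a) has probability 0.4 × 1 × 0.3 = 0.12, and one analogous to (6b) has probability 0.6 × 1 × 0.7 = 0.42, as in the text.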
Both the NPL and EDPL use the Linear Reward-Penalty Scheme (LRPS;
(7) 
λ∈[0,1] is the learning rate
For the NPL, each parameter’s Reward value is either 0 or 1, with no intermediate values. This value is determined based on a single parameter specification sample,
For example, consider scenario (6a).
The NPL fares exceedingly well on processing complexity. For each data point, it takes one sample for each parameter, and uses the resulting match or mismatch to compute which settings get rewarded and which get penalised. The time complexity grows linearly with the number of parameters: each additional parameter in the grammar requires a constant amount of additional computation for the processing of one data form. In §6, we present tests that investigate the NPL’s data complexity and success rates.
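The NPL’s per-datum computation can be sketched as follows. The update moves the probability of each sampled setting a fraction λ of the way toward the Reward value, which for the NPL is simply 1 on a match and 0 on a mismatch; since the update rule in (7) can be written in several equivalent ways, treat this as one common formulation rather than a verbatim transcription. The `generate` function, mapping a specification to a stress pattern, is a stand-in for the parametric grammar.

```python
import random

def lrps_update(p, reward, lam=0.1):
    """Linear reward-penalty step: move p toward the Reward value by a
    fraction lam. With reward = 1 this is p + lam*(1 - p); with
    reward = 0 it is (1 - lam)*p."""
    return p + lam * (reward - p)

def npl_step(grammar, observed, generate, lam=0.1, rng=random):
    """One NPL update: sample a full specification, generate a stress
    pattern, and reward (match) or penalise (mismatch) every sampled
    setting uniformly."""
    spec = {p: rng.random() < prob for p, prob in grammar.items()}
    reward = 1.0 if generate(spec) == observed else 0.0
    for p, setting in spec.items():
        if setting:
            # grammar stores P(True); move it toward the reward
            grammar[p] = lrps_update(grammar[p], reward, lam)
        else:
            # the sampled setting was False: move P(False) = 1 - P(True)
            grammar[p] = 1.0 - lrps_update(1.0 - grammar[p], reward, lam)
    return reward
```

Note that the loop touches every parameter exactly once per datum, which is the linear time complexity discussed above.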
As discussed earlier, Pearl (
The result of this pseudo-batch procedure is that parameter settings that work for certain words but not others are less likely to be rewarded, since their successes may be offset by immediately following failures, leading to a counter vacillating around 0. On the other hand, parameter settings that are crucial for the target language are more likely to be successful in succession, leading to a rapidly rising counter and a Reward value of 1 in many cases. However, this does not guarantee finding all crucial parameter settings, due to the accomplice scenario explained in §4.4.
We examine the effect of pseudo-batch learning in the simulations in §6, comparing the NPL with and without pseudo-batch learning side by side.
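The counter mechanism just described can be sketched as follows; this is an illustration of the idea rather than a reproduction of Pearl’s implementation, whose bookkeeping may differ in detail. Each setting accumulates +1 per match and −1 per mismatch, and only a counter that reaches the batch size b triggers an actual reward (or, at −b, a penalty).

```python
import collections

class PseudoBatchCounter:
    """Hedged sketch of pseudo-batch bookkeeping: each sampled parameter
    setting accumulates +1 on a stress match and -1 on a mismatch; an
    actual reward/penalty update is only triggered once the counter
    reaches +b (reward) or -b (penalty), after which it resets."""

    def __init__(self, batch_size):
        self.b = batch_size
        self.counters = collections.defaultdict(int)

    def record(self, setting, match):
        """Record one match/mismatch for a setting; return 1 to reward,
        0 to penalise, or None if no update is triggered yet."""
        self.counters[setting] += 1 if match else -1
        if self.counters[setting] >= self.b:
            self.counters[setting] = 0
            return 1
        if self.counters[setting] <= -self.b:
            self.counters[setting] = 0
            return 0
        return None
```

With alternating successes and failures, the counter vacillates around 0 and never triggers an update, which is exactly the damping effect described above.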
Yang (
In the hitchhiker scenario, parameter settings are rewarded despite having no responsibility for a match. For example, suppose the learner observes the data point [kàlamátana] (cf. (6)) and generates its output using a parameter specification that contains the crucial setting
In the accomplice scenario, parameter settings are penalised despite not being responsible for a mismatch. Consider again the data point [kàlamátana], and suppose the learner samples a parameter specification from their grammar with the crucial setting
Because of the widespread ambiguity in stress setting discussed earlier, hitchhikers and accomplices can lead to serious challenges for the model. The same parameter setting can be rewarded by accident for one data point and penalised by accident for the next. These spurious updates create substantial noise that disrupts the learning process. As discussed above, the pseudo-batch strategy may smooth over some of this noise, and we consider its effects on the performance of the model in §6.
More problematic than random noise is the general failure of the model to cope with interdependence between parameters, which is the underlying source of accomplices and hitchhikers in the NPL. A crucial parameter setting only leads to a match with the data if all other crucial parameters for that data point are set appropriately for the target language. Unless the learner’s grammar is already very close to the target grammar, sampling the correct combination of all crucial parameter settings is a statistically rare occurrence. This means that the vast majority of updates are mismatches involving spurious penalties for crucial parameter settings (accomplices), causing the learner to make no progress, vacillating endlessly until it happens to sample a correct combination of all crucial parameter settings to reward (in which case it is at risk of rewarding hitchhikers). However, it is possible to design a domain-general learner that overcomes these challenges, as shown in §5, where we propose the Expectation Driven Parameter Learner, which addresses the Credit Problem directly in the formulation of the Reward value.
The EDPL model proposed here extends Jarosz’s (
Formally, the Reward is defined as the expected value of a parameter setting given the current grammar and the data point currently under examination. Computing this value relies on two crucial steps (
Rather than updating all parameters equally, the EDPL defines the Reward value
(9)  a.  
b. 
Rather than computing the probability in (9a) directly using parsing, we follow Jarosz (
(10)  a.  
b. 
Defining the Reward value this way yields an online, sampling-based approximation to Expectation Maximization (EM;
We use constrained sampling from the production grammar to estimate the conditional likelihood
(11) 
During this process, the probabilities for the other parameters are left untouched. This makes it possible to isolate the effects of manipulating a single parameter. At the same time, using the full production grammar ensures that the consequences of any interactions with other parameters are taken into account.
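These two steps can be sketched in Python as follows. The `generate` function is again a stand-in for the parametric grammar, and the sample size `r` is an arbitrary illustrative choice; the fallback to the prior when neither setting can generate the datum is our own assumption for that edge case, not something fixed by the definitions above.

```python
import random

def estimate_likelihood(grammar, param, setting, observed, generate,
                        r=100, rng=random):
    """Estimate P(observed | setting) by constrained sampling: clamp
    `param` to `setting`, sample every other parameter from the current
    grammar, and count how often the result generates the observed
    pattern."""
    matches = 0
    for _ in range(r):
        spec = {p: rng.random() < prob for p, prob in grammar.items()}
        spec[param] = setting  # the single clamped parameter
        if generate(spec) == observed:
            matches += 1
    return matches / r

def edpl_reward(grammar, param, observed, generate, r=100):
    """Reward for `param` = True: the estimated posterior probability of
    that setting given the datum, i.e. likelihood times prior,
    normalised over both settings."""
    prior = grammar[param]
    like_t = estimate_likelihood(grammar, param, True, observed, generate, r)
    like_f = estimate_likelihood(grammar, param, False, observed, generate, r)
    numer = like_t * prior
    denom = numer + like_f * (1.0 - prior)
    # Assumption: if neither setting can generate the datum, fall back
    # to the prior so the update leaves this parameter unchanged.
    return prior if denom == 0.0 else numer / denom
```

For a parameter irrelevant to the datum, the two likelihood estimates come out (approximately) equal, so the Reward is close to the setting’s current probability and the grammar barely moves; a necessary setting gets a Reward near 1 and an incompatible one near 0.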
To summarise, the EDPL computes a separate Reward value for each parameter setting (e.g., for the grammar in (5):
The EDPL provides a principled solution to the Credit Problem without domainspecific mechanisms. It distinguishes necessary, incompatible, and irrelevant parameter settings by computing Reward values separately for each parameter setting, where Reward values are defined as the degree to which each parameter setting is responsible for each data point.
A parameter setting necessary for data point
Conversely, a parameter setting that is incompatible with data point
Finally, for a parameter setting irrelevant to data point
(12)  a.  Parameter setting 
b.  Parameter setting 

c.  Parameter setting 

It must be stressed that the three update scenarios discussed above are only the edge cases. As discussed in §2, stress parameters are often ambiguous: multiple parameter settings are potentially compatible with the same data point. It is in such cases that EDPL Reward values other than 1, 0, or
The EDPL relies on the learner’s current knowledge of one parameter to make inferences about settings of other, interdependent parameters to deal with such ambiguous cases. For example, suppose the learner’s current grammar has
In this section, we will present systematic evaluations of the NPL and the EDPL on a diverse range of stress systems, constituting the first systematic typological test for these learners in the stress domain. Our primary goal is to gauge how well these models cope with the Credit Problem and the kinds of interdependencies that are inherent to parametric stress systems. By considering both models’ performance on the full typology proposed by D&K (1990), we get the most complete picture possible regarding their capacities to deal with hidden structure in a complex stress parameter system. This kind of evaluation is analogous to Tesar & Smolensky’s (
The 11 stress parameters posited by D&K together define 2^{11} = 2048 unique parameter specifications. To complement the discussion of ambiguity in §2, here we introduce a quantitative measure of ambiguity and use it to examine the rate and distribution of ambiguity in the typological space defined by these parameters. This measure provides a richer characterisation of the types of ambiguities present in parametric systems, as exemplified by D&K’s system.
The 2048 combinations in D&K’s system yield just 302 unique stress systems on overt forms of up to 7 syllables. Long words were included to avoid collapsing differences between systems that arise only in long words (
P-volume distribution in the stress systems defined by D&K (1990).
P-volume  330  118  32  16  8  6  4  3  2  1 
# systems  2  2  8  16  4  16  42  8  116  88 
As can be seen in
P-volume provides a way to assess how the learner responds to global ambiguity. On the one hand, systems with high P-volume should be easier to find by pure chance, so a learner that relies substantially on luck to find the target language may be expected to fare better on high P-volume systems on average. On the other hand, systems with high P-volume are highly globally ambiguous and often involve interdependencies between parameters that must be disentangled to settle on a complete specification for a target language. From this perspective, high P-volume languages may pose a challenge to a learning model that must incrementally commit to parameter settings, without clear or consistent evidence for crucial settings.
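P-volume, as we use it, can be computed by exhaustive enumeration: map each of the 2^{11} specifications to the overt stress system it generates and count how many specifications share each system. The sketch below uses a toy 3-parameter stand-in for the generation function, since reproducing D&K’s full parametric mapping is beyond the scope of an illustration.

```python
from collections import Counter
from itertools import product

def p_volumes(generate_system, n_params):
    """Map every categorical specification to the (hashable) stress
    system it generates and count specifications per system; the count
    for a system is its P-volume."""
    counts = Counter()
    for spec in product([False, True], repeat=n_params):
        counts[generate_system(spec)] += 1
    return counts

# Toy stand-in: only the first of three parameters is overtly visible,
# so the two resulting "systems" each have a P-volume of 4 (= 2**2).
toy_volumes = p_volumes(lambda spec: spec[0], n_params=3)
```

The same loop, run over D&K’s generator with n_params=11, would yield the 302 systems and the P-volume distribution reported above.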
The 302 unique stress systems described in §6.1 were presented to the learners as data files consisting of strings of CV, CVV, and CVC syllables with corresponding stress patterns and likelihoods of occurrence (e.g., CVV.CV.CV.CV.CV σ̀σσ́σσ 0.0003, CV.CV.CVC.CV σ̀σσ́σ 0.0003). Strings with a length of 1–7 syllables were used, and all possible combinations of the 3 syllable types of these lengths were included, yielding a total of 3 + 3^{2} + … + 3^{7} = 3,279 pseudowords for each stress system. During learning, a pseudoword was sampled at each iteration. Here, we assumed equal likelihood for each pseudoword.
Both the NPL and the EDPL were tested on these datasets. For the NPL, three settings for batch size were used: no pseudo-batch learning (henceforth: NPL0), pseudo-batch learning with
To better evaluate these four learners, we also ran a random baseline (brute-force) model as a sanity check. Checking performance against a simple baseline is standard practice in computational linguistics and important to ensure that the proposed learning mechanisms compare favourably to random search and other brute-force strategies. It provides a way to gauge whether learning models’ performance on relatively simple learning tasks has the potential to scale to the full problem of language learning.
The random baseline model encounters one data point at a time, just like the other models, but instead of gradually updating a probabilistic parameter grammar, it simply samples a random parameter specification from a uniform distribution at the first iteration and whenever a stress mismatch occurs. Since the space of categorical grammars is finite, and all the languages in D&K’s typology are categorical, this baseline model will eventually reach any target system. Because it only considers a finite space of stress systems and can flip parameters categorically, this baseline sets quite a strict standard for the NPL and EDPL, which are designed to search an infinite space of probabilistic target grammars and can only update their parameter settings gradually.
Crucially, this baseline is not a serious proposal for learning stress parameters. As discussed earlier, the learning time for random search grows exponentially with the number of parameters, quickly becoming intractable as the language learning problem grows. Only learning strategies that can cope with the learning data more efficiently than random search have a chance of solving the actual language learning problem faced by children. In addition, while simulations involving variable stress are beyond the scope of the current paper, both the NPL and EDPL can in principle cope with nondeterminism, while this baseline cannot. Finally, the random baseline cannot model the kind of incremental acquisition of stress seen in children, while the NPL and EDPL can. Thus, the baseline provides a benchmark for interpreting the data complexity results, but we do not consider it to be a competing model of language acquisition.
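A minimal version of this baseline can be written as follows; `generate` is again a stand-in for the parametric grammar, and for brevity this sketch checks the whole corpus after every match rather than every 100 iterations as in our actual runs.

```python
import random

def random_baseline(corpus, generate, n_params, max_iters=1_000_000,
                    rng=random):
    """Guess-and-resample baseline: keep one uniformly sampled
    categorical specification, resample it whenever it mismatches a
    sampled datum, and stop once the current guess accounts for the
    whole corpus. Returns (specification, iterations), or
    (None, max_iters) on timeout."""
    spec = tuple(rng.random() < 0.5 for _ in range(n_params))
    for it in range(1, max_iters + 1):
        word, observed = rng.choice(corpus)
        if generate(spec, word) != observed:
            # mismatch: throw the whole guess away and resample uniformly
            spec = tuple(rng.random() < 0.5 for _ in range(n_params))
        elif all(generate(spec, w) == o for w, o in corpus):
            return spec, it
    return None, max_iters
```

Because every resample is uniform over 2**n_params specifications, the expected search time grows exponentially in n_params, which is the intractability noted above.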
Each model was run 10 times on each of the 302 stress patterns, yielding 3020 runs per model. For the three versions of the NPL and the random baseline, each run was allowed up to 10,000,000 iterations, where an iteration is the processing of one data point in the system. Pearl (
For all learners, the simulation was stopped once the convergence criterion (at least 99% accuracy on each word in the corpus) was reached. Accuracy was assessed by sampling 100 parameter specifications from the current grammar, computing the resulting stress patterns for all words, and counting, for each word, how many specifications led to a stress match; if even one word had more than 1 mismatch out of 100, the criterion was not met. For globally ambiguous stress systems, this means that any grammar that leads to the desired stress assignment is accepted (see §2.1). Since assessing convergence is the most computationally intensive component of running the models, this was only done every 100 iterations (for the NPL and the random baseline, accuracy was checked even less often after 20,000 iterations: every 10,000 iterations between 20,000 and 100,000, every 100,000 iterations between 100,000 and 1,000,000, and every 1,000,000 iterations thereafter, with final checks at 9,999,900 and 10,000,000).
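The convergence criterion just described amounts to the following check, sketched in Python; `generate` is once more a stand-in for the parametric mapping from a specification and word to a stress pattern.

```python
import random

def has_converged(grammar, corpus, generate, n_specs=100, rng=random):
    """Sample n_specs categorical specifications from the current
    probabilistic grammar and require at most 1 mismatch out of n_specs
    for every word in the corpus (i.e., at least 99% per-word
    accuracy)."""
    specs = [
        {p: rng.random() < prob for p, prob in grammar.items()}
        for _ in range(n_specs)
    ]
    for word, observed in corpus:
        mismatches = sum(
            1 for spec in specs if generate(spec, word) != observed
        )
        if mismatches > 1:
            return False
    return True
```

Note that the check quantifies over words, not specifications: any mixture of specifications is accepted as long as every word is stressed correctly at least 99% of the time, which is what licenses globally ambiguous analyses.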
Success rates and data complexity results (λ = 0.1 for all EDPL and NPL runs).
EDPL  NPL0  NPL5  NPL10  Random baseline  

# of runs that converge (% of 3020)  2765 (91.6%)  20 (0.7%)  143 (4.7%)  143 (4.7%)  3020 (100%) 
# of stress systems that converge on ≥1 run (% of 302)  281 (93.0%)  2 (0.7%)  28 (9.3%)  24 (7.9%)  302 (100%) 
# of stress systems that converge on all 10 runs (% of 302)  269 (89.1%)  2 (0.7%)  8 (2.6%)  9 (3.0%)  302 (100%) 
Median (maximum) # of iterations/data points till convergence  200 (66,200)  200,000 (700,000)  6,300 (9,999,900)  3,400 (9,999,900)  800 (30,000) 
As can be seen in
The NPL, on the other hand, fares poorly: it is unable to learn the typology, and if it does converge, it does so considerably slower than guessing at random. NPL0 learns less than 1% of the typology, and for the successful 0.7% of runs, the median number of data points required is much higher than for the random baseline (200,000 vs. 800). The NPL with pseudobatch learning fares slightly better, but still fails to learn more than 90% of stress systems, and the median number of iterations is still considerably greater than for the random baseline on the small proportion of runs that are successful. This level of performance falls short of the goals we set out for the learning models and below a minimal standard required for successful language learning.
The simulation results reveal a marked difference between the NPL and the EDPL. While the EDPL learns almost all stress systems and does so faster than random guessing, the NPL learns only a few stress systems, and does so slower than random guessing. We conclude that the NPL is not a viable model of stress parameter learning, at least not for complex stress systems with the sorts of ambiguities that are present in D&K’s typology.
Beyond establishing extensive quantitative evaluations for both learning models, these results also undermine the argument for domain-specific mechanisms. Pearl’s (
The encouraging results of the EDPL on D&K’s full typology show that the domain-specific learning mechanisms D&K posit for their parametric system are likely unnecessary. The EDPL, a domain-general learner, succeeds in efficiently learning 93% of stress systems in this typology without the use of cues, parameter ordering, or defaults. As mentioned earlier, Gillis et al. report an 80% success rate on a similar experiment with D&K’s domain-specific learner, which does not surpass the EDPL’s performance. To better understand how to interpret our quantitative results, we present in §7 an in-depth analysis of the NPL’s and EDPL’s learning outcomes.
The few stress systems that the NPL learns successfully tend to be those with very high global ambiguity. As shown in
Somers’ D rank correlation: successful convergence dependent on P-volume.
EDPL  NPL0  NPL5  NPL10  

Somers’ D  .03  1  .75  .77 
In contrast, the EDPL’s successes show virtually no correlation with P-volume. This confirms that the EDPL is not driven by random chance. The only randomness in its updates comes from the sampling used to estimate the Reward value and the random order in which data are presented to the learner. Since the EDPL offers a principled solution to the Credit Problem, global ambiguity does not necessarily make learning easier. It can be helpful in cases where high P-volume corresponds to a stress system with many mutually compatible analyses and few crucial parameter settings. However, as discussed in §2.1, high P-volume can also correspond to cases where there are many mutually incompatible analyses, and a learner sensitive to the Credit Problem must disentangle this confusing evidence. The next section takes a closer look at how this affects learning for the EDPL.
For the EDPL, learning success depends on the extent to which the learning data unambiguously support crucial parameter settings. See Hucklebridge (
Success on learning stress systems split up by secondary stress and foot shape.

                                  Type A        Type B       Type C       Type D
Number of stress systems (runs)   196 (1960)    40 (400)     36 (360)     30 (300)
EDPL successful runs (%)          1932 (98.6%)  400 (100%)   320 (88.9%)  113 (37.7%)
NPL0 successful runs (%)          0 (0%)        20 (5.0%)    0 (0%)       0 (0%)
NPL5 successful runs (%)          0 (0%)        112 (28.0%)  25 (6.9%)    6 (2.0%)
NPL10 successful runs (%)         0 (0%)        123 (30.8%)  24 (6.7%)    0 (0%)
(13) 
As indicated in (13), Type A stress systems are those with overt secondary stress. Type B stress systems have no overt secondary stress and are compatible with unbounded feet. Type B includes systems with fixed initial, peninitial, penultimate, and final stress.
Overt secondary stress (Type A) guarantees that the head of each foot is expressed as a stress mark, which means that the number of feet and the approximate location of foot boundaries (always adjacent to a stressed syllable) can be read off the overt form. This gives Type A stress systems a relatively unambiguous relationship between overt form and foot structure.
The absence of overt secondary stress in the data (Types B/C/D) introduces additional ambiguity: there could be multiple feet even though there is just one stress. However, for unbounded feet (Type B), the division of the word into feet is still signalled by the segmental makeup of the word: foot boundaries are either at the word boundary (modulo extrametricality) or just before/after a heavy syllable: consider (ka)(láːdamatana)<bi> for
Types C and D lack secondary stress and require bounded feet: they require multiple feet in (longer) words, but only the head foot gets stress. Such systems present a single stress mark that alternates between two positions in the word (e.g., penult vs. final): this minimal information is the only evidence of the existence and details of an iterative bounded footing system. For example, if stress falls on the rightmost L-to-R QI non-degenerate trochee, its location varies between the penultimate and antepenultimate syllable depending on the length of the word: (kala)(máta), (kala)(máta)na.
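This alternation follows mechanically from left-to-right binary footing with degenerate feet banned. A minimal sketch (the function name and the restriction to all-light words are our simplifications) for a system stressing the rightmost L-to-R QI non-degenerate trochee:

```python
def rightmost_ltr_trochee(n_syllables):
    """Return the 0-based index of the main-stressed syllable in a word of
    n_syllables light syllables, for a system that foots binary trochees
    left-to-right (no degenerate feet) and stresses the rightmost foot."""
    feet = [(i, i + 1) for i in range(0, n_syllables - 1, 2)]
    # An odd final syllable remains unfooted, since degenerate feet are banned.
    head_foot = feet[-1]   # main stress goes to the rightmost foot
    return head_foot[0]    # trochee: the foot's left syllable is its head
```

For (kala)(máta) this yields the penult (`rightmost_ltr_trochee(4) == 2`), and for (kala)(máta)na the antepenult (`rightmost_ltr_trochee(5) == 2`): the same grammar, two surface stress positions.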
Among the patterns with bounded feet, the Type C systems, which can be represented with full QS feet, provide the most information for locating silent bounded feet: heavy syllables provide landmarks for the location of feet, while a restriction to full (non-degenerate) feet severely limits the possible locations of feet relative to stress position: initial stress on a light syllable only works with trochees – and final stress on a light syllable only works with iambs – if degenerate feet are prohibited. For instance, this is the case in Creek (
The remaining stress systems (Type D) provide the greatest hidden structure challenge. These systems lack secondary stress, but require bounded feet that are either quantity-insensitive or (optionally) degenerate. In a quantity-insensitive system, the only landmarks for the location of bounded feet are the word edges and the location of main stress, meaning that the foot boundaries between unstressed syllables are not overtly cued – consider the case where stress falls on the rightmost L-to-R QI non-degenerate iamb: (kala)(maːta)(nabí); in this case, the boundary la)(maː is only signalled by the location of primary stress on [bi], whereas foot boundaries are also signalled by heavy syllables in Type C Creek. If degenerate feet are necessary to account for the stress pattern, evidence for foot headedness can be unclear: for instance, there might be initial stress in a crucially iambic system – consider the case where stress falls on the leftmost R-to-L potentially degenerate iamb: (ká)(lama)(tana). Compare this to Type C Creek, where there can be final but not initial stress, which cues iambs. As discussed in §7.3, Type D systems are unattested except for fixed antepenultimate stress (where stress falls on the rightmost L-to-R QI trochee with right extrametricality and silent secondary stress).
A further significant division in terms of ambiguity among Type C/D stress systems is whether they ever assign stress more than one syllable away from the word edge. Let “1-in” stress systems be those that have stress at most one syllable from the edge (initial, peninitial, penultimate, or final) and “2-in” stress systems be those that can have stress two syllables from the edge (post-peninitial or antepenultimate). Examples: Type C 1-in: stress rightmost L-to-R QS non-degenerate iamb, no extrametricality: (kala)(matá)na, (kala)(matá), (kaː)(lamá)ta; Type C 2-in: like Type C 1-in, but with right extrametricality: (kala)(matá)<na>, (kalá)ma<ta>, (kaː)(lamá)<ta>; Type D 1-in: stress rightmost L-to-R QI potentially degenerate trochee, no extrametricality: (kala)(mata)(ná), (kala)(máta); Type D 2-in: like Type D 1-in, but with right extrametricality: (kala)(matá)<na>, (kalá)ma<ta>.
In “1-in” stress systems, none of the individual data points requires bounded feet, while the stress system as a whole does: final/initial stress and penult/peninitial stress are all consistent with unbounded feet (see (2,3) in §2.1). Bounded feet are only strictly necessary to represent post-peninitial or antepenultimate stress data points: <σ>(σσ́)… or …(σ́σ)<σ>. This gives “2-in” stress systems an advantage: they feature data points with unambiguous evidence for bounded feet.
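This can be stated operationally (our own simplification, restricted to all-light words): a data point is incompatible with every unbounded-feet analysis only when its stressed syllable lies at least two syllables from both word edges, i.e., it is neither initial, peninitial, penult, nor final:

```python
def edge_distance(length, stress_index):
    # distance (in syllables) from the stressed syllable to the nearer word edge
    return min(stress_index, length - 1 - stress_index)

def requires_bounded_feet(length, stress_index):
    """A (word length, 0-based stress index) data point over all-light
    syllables strictly requires bounded feet only when the stressed
    syllable is at least two syllables from both edges."""
    return edge_distance(length, stress_index) >= 2

def classify_system(data):
    """'2-in' if the system produces at least one such crucial data point."""
    return "2-in" if any(requires_bounded_feet(l, s) for l, s in data) else "1-in"
```

A fixed-penultimate system like {(4, 2), (5, 3)} comes out 1-in, whereas a fixed-antepenultimate system like {(4, 1), (5, 2)} comes out 2-in, because its five-syllable words stress a syllable two in from both edges.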
As shown in
Success on learning Type C/D systems split by stress-to-word-edge distance.

                                  Type C 2-in  Type C 1-in  Type D 2-in  Type D 1-in
Number of stress systems (runs)   20 (200)     16 (160)     14 (140)     16 (160)
EDPL successful runs (%)          200 (100%)   120 (75.0%)  100 (71.4%)  13 (8.1%)
NPL0 successful runs (%)          0 (0%)       0 (0%)       0 (0%)       0 (0%)
NPL5 successful runs (%)          1 (0.5%)     24 (15.0%)   0 (0%)       6 (3.8%)
NPL10 successful runs (%)         15 (7.5%)    9 (5.6%)     0 (0%)       0 (0%)
Our analysis indicates that, in general, learning data that provide highly ambiguous or contradictory support for settings of crucial parameters are particularly difficult for the EDPL. It is not the presence of hidden structure per se that creates challenges for the learner; rather, challenges arise when hidden structure creates ambiguity about crucial parameter settings. Sensitivity to this sort of ambiguity is therefore a general prediction of the EDPL that extends beyond D&K’s framework. However, the relative learning difficulty of each stress system also depends on the grammatical framework assumed, which determines what analytical options are available to the learner. This connection between theory and learning has the potential to provide novel predictions that differentiate linguistic theories. We return to this topic in the conclusion.
A rich body of experimental and computational work supports the hypothesis that soft biases outside of the grammatical system play an important role in shaping linguistic typology (see
While we believe these results are encouraging, we wish to emphasize several limitations of these findings. First, as discussed above, the relative learning difficulty of each stress system depends crucially on the D&K framework. Second, the simulations here make the simplifying assumption that all syllable type sequences are equally frequent in the input: learning outcomes could change with different input statistics. Third, classifying stress systems as attested or not is not entirely trivial. Our analysis relies on a manual search of StressTyp2 (
Attested stress systems broken down by Type.

                     Type A           Type B          Type C       Type D
Attested languages   23               16              8            1
Convergent runs      228/230 (99.1%)  160/160 (100%)  60/80 (75%)  10/10 (100%)
In fact, there are just two attested stress systems that are consistently not learned by the EDPL.
We tentatively conclude there is an intriguing correspondence between patterns that pose difficulty for the EDPL and patterns that appear to be unattested typologically. The two exceptions discussed above show that much further work, beyond the scope of this paper, is needed before firm conclusions about the connections between learning and typology can be made. This work will require examining how distributional properties of the data – in particular, the distribution of word lengths (
In this paper, we introduce a novel domain-general learning model for P&P grammars. We show how the proposed learning model provides a mathematically principled solution to the Credit Problem. The solution relies on probabilistic inference to formalise and quantify each parameter setting’s relative responsibility for each data point. We show that these learning updates can be computed efficiently and incrementally without the need for any specialized parsing mechanisms. The proposed learning algorithm, the EDPL, can be viewed as an extension of the NPL wherein a parameter-specific, continuously-valued responsibility replaces the global, binary reward value used by the NPL.
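To make the contrast concrete, here is a schematic sketch of one EDPL-style update (our illustration, not the authors’ implementation: the sampling scheme, learning-rate update, and toy match function are our assumptions). For each binary parameter, the responsibility of setting 1 is estimated as the proportion of sampled grammars, among those matching the datum, in which that parameter was set to 1; the parameter probability then moves toward that responsibility:

```python
import random

def edpl_update(probs, datum, matches, n_samples=100, rate=0.1):
    """One sketched EDPL update on a single datum.
    probs[i] is the current probability that binary parameter i is set to 1."""
    counts = [[0, 0] for _ in probs]  # counts[i][v]: matching samples with parameter i = v
    for _ in range(n_samples):
        grammar = [1 if random.random() < p else 0 for p in probs]
        if matches(grammar, datum):
            for i, v in enumerate(grammar):
                counts[i][v] += 1
    new_probs = []
    for i, p in enumerate(probs):
        total = counts[i][0] + counts[i][1]
        if total == 0:
            new_probs.append(p)  # no sample matched: leave the grammar unchanged
            continue
        responsibility = counts[i][1] / total
        new_probs.append(p + rate * (responsibility - p))
    return new_probs

# Toy credit assignment: only parameter 0 matters for matching the datum.
random.seed(1)
probs = [0.5, 0.5]
for _ in range(200):
    probs = edpl_update(probs, None, lambda g, d: g[0] == 1)
```

After learning, the probability of the crucial parameter approaches 1, while the irrelevant parameter receives no systematic credit or blame and stays near its initial value; an NPL-style update, by contrast, rewards or penalizes all parameters with the same global binary signal.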
We present the first systematic tests of both the NPL and EDPL on a full stress typology, namely, the one predicted by Dresher & Kaye’s (
These results have further implications for P&P frameworks of stress and phonological theory more generally. Since the EDPL is a domain-general learning model, its high performance undermines arguments made in previous work for the necessity of domain-specific learning mechanisms. We provide the first results to suggest that combining a P&P theory of universal grammar with general statistical learning mechanisms may be sufficient to account for successful learning of stress. This brings learning results for P&P closer in line with those of OT, where a number of existing domain-general learning models have been demonstrated to have similar levels of performance on metrical stress systems (
Our findings have broader implications beyond P&P and beyond the domain of stress. The EDPL has already been applied outside phonology to the domain of syntax with promising results (
D&K – Dresher and Kaye
EM – Expectation Maximization
EDPL – Expectation Driven Parameter Learner
LRPS – Linear Reward-Penalty Scheme
NPL – Naïve Parameter Learner
OT – Optimality Theory/Theoretic
P&P – Principles and Parameters
QI – Quantity-Insensitive
QS – Quantity-Sensitive
UG – Universal Grammar
Although the Markedness-over-Faithfulness bias (
We use “trochee”/“iamb” as a shorthand for “left-headed”/“right-headed” even when feet are longer than two syllables.
D&K acknowledge that this cue would also be triggered by morphological/lexical influences on stress. Dresher (
In our simulations, we disregard such builtin dependencies between parameters since they make the model more complex and do not affect which stress patterns the model generates.
The proposed order is
We present the algorithm using binary parameters here, but the algorithm can be straightforwardly extended to
To avoid division by zero, we added a very small number (10^{–250}) to the number of matches for each parameter setting, which results in no update to the grammar when
The Reward value of
We also considered including 8- and 9-syllable words, but these do not lead to any additional unique stress systems.
As an anonymous reviewer points out, this global ambiguity may be greatly reduced if foot-conditioned segmental phonological processes provide additional evidence for foot boundaries.
An anonymous reviewer points out that these high P-volumes will be drastically lower if primary stress is assigned non-metrically (
Unlike the baseline, the EDPL’s and NPL’s speed of learning depends on the learning rate. Since the EDPL outperformed the baseline with this learning rate, we did not investigate even larger learning rates, which would be expected to result in faster learning.
This was computed using SomersDelta() in the DescTools package for R (
As opposed to D&K’s (1990) original proposal, we did consider unbounded QI feet in our simulations; see footnote 4.
The EDPL needs slightly more iterations to learn Type B QS systems (median=300, maximum=800) than it does for the Type B QI systems (median=200, maximum=500).
One other system not learned by the EDPL resembles Cairene Arabic to some extent (
See, e.g., Jun & Fougeron (
See Apoussidou (
We would like to thank Adam Albright, Gašper Beguš, Elan Dresher, Naomi Feldman, Edward Flemming, Isaac Gould, Bruce Hayes, Mark Johnson, Michael Kenstowicz, Andrew Lamont, Armin Mester, Erin Olson, Joe Pater, Lisa Pearl, Ezer Rasin, Juliet Stanton, Kristine Yu, Sam Zukoff, the students in LING 730 in the Fall of 2020 at UMass Amherst, and audiences at UMass Amherst, MIT, USC, Warsaw University, NECPhon 10, AMP 2017, the LSA 2017 Annual Meeting, and SCIL 2019 for their feedback that has greatly benefited this work. We would also like to thank John Iyalla Alamina for his help with technical issues. All errors remain our own.
The authors have no competing interests to declare.