1 Introduction

Knowing a language involves at least being able to judge the well-formedness of strings relative to that language. Explanations of this ability differ on a variety of axes: (i) whether or not there is a psychologically real distinction between notions of well-formedness, such as grammaticality and acceptability (Chomsky 1965 et seq); (ii) whether either (or both) concepts are discrete or continuous; and (iii) what exactly is required to collect the supporting judgments (Bard et al. 1996; Keller 2000; Sorace & Keller 2005; Sprouse 2007; 2011; Featherston 2005; 2007; Gibson & Fedorenko 2010; 2013; Sprouse & Almeida 2013; Sprouse et al. 2013; Schütze & Sprouse 2014; Lau et al. 2017; Sprouse et al. 2018 among many others). But few doubt that some explanation of well-formedness is a crucial component of a theory of linguistic knowledge.

Well-formedness is something that must be learned—at least in part—from linguistic experience. A major question is how direct the relationship between well-formedness and linguistic experience is. Approaches that directly address this question have tended to focus on complex syntactic effects—e.g. effects that might arise from constraints on syntactic movement, such as island effects (see Kluender & Kutas 1993; Hofmeister & Sag 2010; Sprouse et al. 2012; Kush et al. 2018 among others). The abstractness of these constraints makes them useful test cases, since they are prime candidates for knowledge that might arise from inductive biases innate to language learners—i.e. not directly from statistical properties of the input (though see Pearl & Sprouse 2013). However, complex syntactic phenomena are far from the only factor contributing to well-formedness.

Effects on well-formedness that are at least partially a product of lexical knowledge have garnered much less focused attention within theoretical linguistics—possibly, because they seem more directly dependent on statistical properties of the input (though see Bresnan 2007; Bresnan et al. 2007; White 2015; White et al. 2018a).1 That is, lexical (as opposed to grammatical) constraints on well-formedness might have a direct connection to co-occurrence statistics in language learners’ input, and therefore might be learnable using relatively simple strategies—e.g. tracking co-occurrence frequencies, which has long been believed to be well within the capabilities of even young children (Saffran et al. 1996a; b; Aslin et al. 1998; Maye et al. 2002). As such, lexical constraints on well-formedness could in principle make a better case for a direct connection between linguistic experience and grammatical knowledge.

Against this background, this paper makes three main contributions: (i) we introduce and validate an experimental method for collecting lexically determined well-formedness patterns at the scale of an entire lexicon; (ii) we use this method to collect a lexicon-scale dataset focused on verbs’ c(ategory)-selection behavior—i.e. what kinds of syntactic structures verbs are acceptable in; and (iii) we use this dataset to investigate the connection between linguistic experience (in the form of corpus data) and linguistic knowledge (in the form of acceptability patterns). The findings of this investigation cast doubt on the possibility of a direct connection between frequency and acceptability, rather supporting the idea that language learners must employ substantial abstraction in order to be able to achieve adult/human-like knowledge of acceptability.

Our investigation focuses, in particular, on verbs that take subordinate clauses—henceforth, clause-embedding verbs—such as think, want, and tell.

(1) a. Jo {thought, told Mo} that Bo left.
  b. Jo {wanted, told} Bo to leave.

Clause-embedding verbs are a useful test case because (i) subordinate clauses can have a wide variety of syntactic structures; (ii) many verbs can take a large subset of these clause types; and (iii) across the (at least) 1,000 clause-embedding verbs in English (White & Rawlins 2016), there is high variability in which subset of clause types different verbs can take. For instance, remember can combine with a wide variety of differently structured clauses, as in (2a)–(2e), as well as noun phrases (2f) or nothing at all (2g).

(2) a. Jo remembered that Bo left.
  b. Jo remembered Bo to have left.
  c. Jo remembered Bo leaving.
  d. Jo remembered to leave.
  e. Jo remembered leaving.
  f. Jo remembered Bo.
  g. Jo remembered.

Our investigation contrasts with prior work, which tends to focus on only a small set of key verbs and frames (Fisher et al. 1991; Lederer et al. 1995; Bresnan et al. 2007; White et al. 2018b; though see Kann et al. 2019). The relatively small size of previous investigations is likely a product of the fact that scaling standard methodologies to a larger set of verbs is infeasible without introducing unwanted biases or an insurmountable workload. Nonetheless, we take a lexicon-scale investigation to be necessary for answering the sorts of questions about lexical knowledge we focus on here. This necessity is the impetus for the novel method we propose for automatically scaling standard acceptability judgment methods while avoiding the introduction of such biases into the judgments: the bleaching method.

In Section 3, we report on an experiment validating the bleaching method against a more standard acceptability judgment collection method, focusing on a small set of clause-embedding verbs. In Section 4, we report on an experiment in which we deploy the bleaching method on 1,000 clause-embedding verbs in 50 syntactic frames to create the MegaAcceptability dataset, which was first reported on in White & Rawlins 2016 and is publicly available at megaattitude.io under the auspices of the MegaAttitude Project.2 In Section 5, we use the MegaAcceptability dataset, in conjunction with a very large dataset of verbs’ subcategorization frequencies, to show that the relationship between acceptability and frequency, when considering this entire sublexicon, is surprisingly weak. This finding throws into question the assumption that c-selectional behavior can be directly read off frequency distributions.

Nonetheless, adult native speakers are able to judge the acceptability of the items that make up this task. Even if they are unable to acquire the information needed to do this directly from cooccurrence frequencies (even assuming access to data that is both ideal and uniform), they must have some way to get it. This suggests that some abstraction of the frequency distributions in the input is necessary. In Section 6, we consider a variety of such abstractions, showing (i) that common, shallow factorization methods yield minuscule improvements in the prediction of acceptability over the more direct models investigated in Section 5; but (ii) that methods involving multiple layers of abstraction can predict acceptability quite well. The models we present here are proof-of-concept models rather than attempts at actual learning models, and thus our results do not fully answer the question of exactly how a learner will gather this information; but we take them to strongly support the necessity of substantial abstraction in grammatical theory relative to input frequency. In Section 7, we conclude with remarks on what these findings imply for the acquisition of grammatical knowledge about lexical items.

2 Background

We begin with a discussion of previous work relating frequency and acceptability (Section 2.1), acceptability and selection (Section 2.2), and frequency and selection (Section 2.3) and then discuss two hypotheses about the joint relationship among the three (Section 2.4).

2.1 Frequency and acceptability

In a substantial body of work, Clark, Lappin, and colleagues argue that knowledge of grammaticality, by way of acceptability, can be modeled, to a large extent, with probabilistic models that involve a direct link between probability and acceptability (Clark & Lappin 2011; Clark et al. 2013a; b; Lau et al. 2017; see also Bresnan et al. 2007; Bresnan 2007). These models have two components: (i) a way of estimating the probability of a sentence of a language; and (ii) a way of translating probabilities to acceptabilities. For the most part, the latter is quite straightforward: some variant of log probabilities, normalized for sentence length and unigram frequency effects. The former is where much of the action is.
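Component (ii) can be illustrated with one normalization of this general kind, the syntactic log-odds ratio (SLOR), which subtracts the unigram log probability of the sentence from the model log probability and divides by sentence length. The sketch below assumes that a sentence-level log probability from some language model and per-token unigram log probabilities are already available; the function name and the toy numbers are hypothetical and are not taken from the studies cited above.

```python
import math

def slor(sentence_logprob, unigram_logprobs):
    """Syntactic log-odds ratio: (log p_model(s) - log p_unigram(s)) / |s|.

    sentence_logprob: log probability of the whole sentence under a model.
    unigram_logprobs: one unigram log probability per token in the sentence.
    """
    if not unigram_logprobs:
        raise ValueError("sentence must contain at least one token")
    # Unigram log probability of the sentence, treating tokens as independent.
    unigram_total = sum(unigram_logprobs)
    return (sentence_logprob - unigram_total) / len(unigram_logprobs)

# Toy numbers, made up for illustration: a four-token sentence.
toy_unigram = [math.log(0.01), math.log(0.001), math.log(0.02), math.log(0.005)]
toy_sentence_logprob = math.log(1e-7)
toy_score = slor(toy_sentence_logprob, toy_unigram)
```

Higher scores indicate that the model assigns the sentence more probability than its length and word frequencies alone would lead one to expect.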

This body of work considers a range of possibilities for how to model probability, ranging from simple n-gram frequency models to neural language models in Lau et al. 2017 (see also Warstadt et al. 2019). Lau et al. consider two kinds of data: (i) sentences sampled from the BNC corpus, automatically translated to a range of languages (using Google Translate), and then translated back to English (with the goal of obtaining a range of acceptabilities in the resulting English sentences); and (ii) a sample of sentences from Adger 2003 by way of Sprouse & Almeida’s (2013) dataset. They then run a variety of acceptability judgment studies on Amazon Mechanical Turk, providing the core data for their experiments. Across the board, the models that Lau et al. present show what the authors describe as an “encouraging degree of accuracy” in predicting human judgments of acceptability—well exceeding baseline models but generally falling well short of human performance.

The authors take this to be a signal that probabilistic models are the current best way to incorporate gradience in human judgments about acceptability into a theory of grammatical knowledge. This conclusion is controversial (Sprouse et al. 2018), but we take it as a starting point that current probabilistic models can at least do reasonably well at capturing some—though probably not all—facets of acceptability.

Clark, Lappin, and colleagues’ work therefore sets the stage for the central question that we are addressing here: what is the relationship between frequency of use, probability, and acceptability? Though the model generating the probabilities itself may be complicated, on Lau et al.’s view, the relationship between probability and acceptability is direct. That is, there is a simple transformation that, as long as some basic normalization is taken care of, more or less directly predicts acceptability.

Existing work has focused on testing hypotheses like this on data that is extremely varied—e.g. random samples of naturalistic corpus data or broad datasets of grammaticality judgments from linguistic theory (Sprouse et al. 2018). For this reason, it is an independently challenging and interesting problem to estimate the probability of sentences across such data, and as Lau et al. (2017) demonstrate, in at least some domains more sophisticated models from natural language processing (NLP) will lead to better predictions of acceptability—e.g. models which involve more complex relationships between the frequencies they are trained on and the probabilities they output. But this approach to the underlying data leads to an additional problem: conclusions about the probabilistic nature of the grammar rest on the degree to which the [0, 1] interval values that these models are producing are in fact good estimates of the probabilities of particular sentences, also making it rather challenging to know what the driving factor for sentence-level probabilities is.

In the present work, we take a different approach. Rather than a broad and diverse data sample, we pick a single phenomenon where we can obtain acceptability data exhaustively, and estimate probabilities from corpora in a relatively transparent way, producing a direct idealization of linguistic experience.3 The particular data we investigate is selectional patterns for clause-embedding predicates, where the main point of variation between items is just the verb and its selectional frame. This also allows us to begin to localize the kinds of grammatical knowledge that are likely to be involved in variation in acceptability.

2.2 Acceptability and selection

The selectional behavior of verbs in general, and of clause-embedding verbs in particular, is a classic topic in linguistic theory (Chomsky 1965; Gruber 1965; Fillmore 1970; Jackendoff 1972; Chomsky 1973; Grimshaw 1979; 1990; Pesetsky 1982; 1991 among many others). There are broadly accepted to be two crucial factors leading to variability in whether a verb is acceptable in a particular sentence (other things being equal): (i) semantic constraints imposed by the argument structure of the verb (s(emantic)-selection); and (ii) (morpho-)syntactic constraints imposed by a verb on its complements (c(ategory)-selection; Grimshaw 1979).

It is a matter of substantial debate whether some or all of these constraints might be predictable from independent factors, e.g. event structure (see Levin & Rappaport Hovav 2005 and references therein). We do not attempt to settle this question here. Regardless, this domain is useful for our purposes because (i) it is empirically rich, even if we focus just on selection of clauses; and (ii) it involves both a large set of possible patterns and a large set of apparently lexical item-specific idiosyncrasies, such as Grimshaw’s classic contrast between the clause-taking behavior of wonder and think (her Ex. 1).

(3) a.   John wondered who Bill saw.
  b. *John wondered that Bill saw someone.
  c.   John thought that Bill saw someone.
  d. *John thought who Bill saw.

We suggest that this kind of data, when scaled up to the entire lexicon, is a perfect test-bed for questions about acceptability and grammar.

In this case, the two verbs differ inversely in whether they license interrogative vs. declarative complements (though see White 2020), something we expect would be mirrored in corpus frequencies (after controlling for, e.g., the fact that think is itself much more frequent than wonder).

While there are many more frames than just these two, there are a limited number of possible selectional patterns instantiated in English; and while, as we have suggested, there are many more verbs that might potentially participate in patterns like this, this number (likely currently around 1,000) is still tractable with modern experimental and corpus methods. Further, there is a large body of literature arguing that such patterns involve many subregularities that a learner could find, and the key points of between-sentence variation involved in pairs like this are fairly minimal and (relatively) possible to extract from corpus data. Finally, we would expect some gradience in the judgments—e.g. (3b) is sometimes claimed to be “not as bad” as (3d).

To date, there is not a large body of experimental linguistics work on acceptability and selection. But the role of selectional patterns has been crucial in research on language acquisition—and verb learning, in particular—since Landau & Gleitman’s (1985) and Gleitman’s (1990) seminal work. This line of work suggests that children use syntactic frame information to infer semantic representation when learning the meanings of verbs. We do not review this literature in detail (see White 2015 for a recent review), but since the acceptability judgment method in Fisher et al. 1991 (experiment 1, part B) is a crucial predecessor to what we develop here, we briefly discuss it (see also Lederer et al. 1995).

In Fisher et al.’s (1991) method, a set of verbs and a set of syntactic frames are selected, and the full cartesian product of these sets is constructed. For each pair in the cartesian product one or more sentences are produced by instantiating the syntactic frames with lexical items, placing the past tense form of the verb in the resulting instantiation (unless the frame is explicitly specified for some other tense/modal). Participants then do an acceptability judgment task that (across all participants) exhausts the matrix of verb-frame pairs.

In this study, Fisher et al. use carefully hand-constructed non-idiomatic sentences for each cell of this matrix: 24 verbs × 39 frames = 936 sentences. For many purposes, this method can work well as long as the verbs and frames are appropriately sampled relative to the questions at hand; but compared to the total scale of the lexicon, it does not come close to being exhaustive. Moreover, the method does not easily scale much beyond the size of the original Fisher et al. study due to the labor-intensive nature of hand-constructing items; for 1,000 verbs and the same number of frames, hand-construction would require the experimenter to write 39,000 sentences. While it may be possible to automate this process to some degree, as Fisher et al. (1991: 347) note, substantial care and effort is required in selecting the sentences to be judged because of the potential for unintended item effects. Thus, the challenge is to understand how it might be possible to generate sentences on this scale without requiring hand-inspection of every such sentence. We take this challenge on in Sections 3 and 4.

2.3 Frequency and selection

The connection between frequency and acceptability in the domain of clause selection is not in general well-studied—or really, even on the radar of the theoretical linguistics work mentioned above (though see Bresnan 2007; Bresnan et al. 2007 and references in Footnote 1). But as for acceptability and selection, frequency of exposure has played a role in discussions of the acquisition of verb meaning. In particular, it has been hypothesized that the frequency with which a verb occurs in a syntactic structure (along with the frequency of that syntactic structure across verbs) plays a role in learning that verb’s meaning (Lederer et al. 1995; Alishahi & Stevenson 2008; Barak et al. 2012; White 2015; White et al. 2018a).

An important component of these proposals is the set of mechanisms they employ for normalizing frequency information across verbs (in particular, the cooccurrence frequencies for verbs in different syntactic structures) and subsequently abstracting that frequency information. These mechanisms tend to take the form of clustering models (Lederer et al. 1995; Schulte im Walde 2006), mixture models (Alishahi & Stevenson 2008; Perfors et al. 2010; Parisien & Stevenson 2010; Barak et al. 2012), or matrix factorization models (White 2015; White et al. 2018a). We defer detailed discussion of these models until Sections 5 and 6.

2.4 Hypotheses

Based on the prior work discussed in this section, we suggest two hypotheses.

(4) H1: Verb-frame co-occurrence frequencies predict acceptability.
(5) H2: Verb-frame co-occurrence frequencies require nontrivial abstractions of raw distributional information in order to predict acceptability.

Given the findings of this prior work, it would be quite surprising if H1 turns out to be entirely false: if a verb frequently occurs in a particular syntactic context (relative to the overall frequency of that verb), it seems likely that the verb is acceptable in that context. It is less intuitively clear that the converse holds: if a verb is acceptable in a particular syntactic context, does it occur frequently in that context? The answer may be affirmative; but insofar as that acceptability is inferable from co-occurrence statistics—e.g. by abstracting across distributional patterns present in the co-occurrence frequencies—the answer need not be affirmative for the acceptability of a verb-frame pair to be acquirable. To this end, prior acquisition models, such as those cited above, have proposed fairly “shallow” forms of abstraction with some success, but since researchers have not had access to lexicon-scale acceptability data, these models have yet to be tested at that scale. Further, no comparison between the performance of such “shallow” abstractions and the “deeper” abstractions now common in NLP (see Goldberg 2017 and references therein) has been undertaken in this domain.

The program for testing these hypotheses is, at this point, straightforward: we need a large acceptability dataset focused on c-selectional phenomena, an estimation of verb-frame co-occurrence frequencies, and models that compare the two by varying the amount of abstraction applied to the co-occurrence frequencies. We now turn to the first of these requirements.

3 The bleaching method

A major obstacle to scaling standard acceptability judgment tasks to entire subregions of the lexicon is the need to control for plausibility effects—i.e. effects on acceptability that are driven by how prototypical the situation described by the sentence is (as opposed to effects driven by syntactic well-formedness). We propose a method to control for such effects by “semantically bleaching” all lexical category words besides a word of interest—in our case, a clause-embedding verb. Specifically, we manipulate the syntactic context that a word appears in while instantiating all NPs in that context with indefinites (someone, something) and all verbs in that context (besides the one of interest) with a “low content” eventive (happen, do) or stative (have) verb.

We first demonstrate the validity of this method on a small set of verbs by showing that agreement is high among naïve participants’ acceptability ratings when responding to “contentful sentences” v. when responding to “bleached sentences” that are otherwise matched in terms of structure—effectively comparing our bleaching method against Fisher et al.’s (1991) more standard method (described above). We use the data reported in White et al. 2018b as a dataset of acceptability judgments for contentful sentences—as it focuses on exactly the phenomena we are interested in—and collect acceptability judgments for bleached sentences that are structurally matched with theirs.

3.1 Materials

We follow White et al. (2018b) in using the 30 propositional attitude verbs found in (6), which were selected in such a way that they evenly span the verb classes presented in Hacquard & Wellwood 2012.

(6) a. think
  b. realize
  c. understand
  d. suppose
  e. guess
  f. expect
  g. imagine
  h. remember
  i. forget
  j. see
  k. hear
  l. feel
  m. tell
  n. say
  o. promise
  p. hope
  q. worry
  r. doubt
  s. pretend
  t. deny
  u. forbid
  v. allow
  w. believe
  x. love
  y. hate
  z. bother
  aa. amaze
  bb. demand
  cc. want
  dd. need

We also follow White et al. in testing frames built from the same combinations of syntactic features, including various tense-aspect combinations within matrix and embedded clauses as well as various forms of NP and PP arguments.4 They construct 46 subcategorization frames from these combinations of features.5 These frames are given in abstracted form in (7), with our instantiation for each constituent type in (9). Importantly, these frames cover a wide range of syntactic contexts and do not just limit themselves to frames with clauses in them. This choice is driven by our main research questions, which are about verb knowledge, and so we cannot exclude, e.g., intransitive or simple transitive NP-taking frames for verbs that also take clauses in some cases.

(7) a. NP ___ed
  b. NP ___ed NP
  c. NP ___ed NP NP
  d. NP ___ed about NP
  e. NP ___ed NP about NP
  f. NP ___ed so
  g. NP ___ed to
  h. NP ___ed S
  i. NP ___ed that S
  j. NP ___ed if S
  k. NP ___ed SWH
  l. NP ___ed NP S
  m. NP ___ed NP that S
  n. NP was ___ed that S
  o. NP ___ed it that S
  p. NP ___ed to NP that S
  q. NP ___ed for NP to VP
  r. NP ___ed to VP
  s. NP ___ed WH to VP
  t. NP ___ed NP to VP
  u. NP was ___ed to VP
  v. NP ___ed there to VP
  w. NP ___ed VPing
  x. NP ___ed NP VPing
  y. NP ___ed NP VP
  z. It ___ed NP that S
  aa. It ___ed NP SWH
  bb. It ___ed NP WH to VP
  cc. It ___ed NP to VP
  dd. S, NP ___ed
  ee. S, I ___
  ff. S, NP ___ed

To instantiate these abstract frames, White et al. fill each of the phrases with contentful lexical items (following Fisher et al. 1991).6 For instance, (8) shows one of three items that instantiate the pair (think, NP ___ed that S) in White et al.’s experiment.

(8) Gary thought that she fit the part.

As noted above, one potential issue that arises when using contentful lexical items to instantiate a frame is that ratings of the resulting items are susceptible to plausibility effects—some verbs might sound less plausible in certain frame instantiations even if that verb is otherwise perfectly fine in some other instantiation of that same frame. White et al. control for such effects by creating three different instantiations for each frame and taking into account possible item variability in their analysis. But if at all possible, it is ideal not to do this, since it increases the number of items and lowers the statistical power of subsequent analyses—thus requiring more ratings to get an accurate estimate of the acceptability of any particular verb-frame pair.

To address this both in the current experiment and in the large-scale experiment reported in Section 4, we instead create only one instantiation of each frame with as little lexical content as possible. All instantiations we use are listed in (9).

(9) a. NP Noun phrase (someone or something)
  b. VP Verb phrase with verb in bare form (do something)
  c. VPing Verb phrase with verb in present progressive form (doing something)
  d. S Full clause without complementizer (something happened)
  e. SWH Full embedded interrogative clause7 (which thing happened)
  f. S[-TENSE] Tenseless embedded clause (something happen)

For example, (10) gives the item instantiating the pair (think, NP ___ed that S).8

(10) Someone thought that something happened.
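The instantiation step can be sketched as a simple template fill. This is a minimal illustration under stated assumptions, not the authors' generation code: the `FILLERS` table copies the bleached material from (9), while the function name, the requirement that the caller supply the past tense form, and the convention that the caller chooses someone or something for each NP slot are all assumptions of the sketch.

```python
# Bleached material for each non-NP constituent type, copied from (9).
FILLERS = {
    "VP": "do something",
    "VPing": "doing something",
    "S": "something happened",
    "SWH": "which thing happened",
}

def instantiate(frame, past_form, np_fillers):
    """Fill an abstract frame such as 'NP ___ed that S' with bleached material.

    past_form: the past tense form of the verb (supplied by the caller, so
        irregular verbs need no special handling here).
    np_fillers: one indefinite per NP slot, left to right. Which indefinite
        goes in which slot is left to the caller (an assumption of this sketch).
    """
    nps = iter(np_fillers)
    words = []
    for token in frame.split():
        if token == "NP":
            words.append(next(nps))
        elif token == "___ed":
            words.append(past_form)
        else:
            # Constituent labels are replaced; function words pass through.
            words.append(FILLERS.get(token, token))
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."

example_10 = instantiate("NP ___ed that S", "thought", ["someone"])
example_12b = instantiate("NP ___ed NP NP", "told", ["someone", "someone", "something"])
```

With these inputs the function reproduces the bleached items in (10) and (12b). Frames with passive or expletive material would need additional token handling that is not shown here.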

From the resulting set of 30 verbs × 46 frames = 1,380 items, we constructed 23 lists of 60 items, constrained such that each verb occurred exactly twice in each list (always with a distinct frame) and each frame occurred between one and two times in each list.9 Each item consists of a sentence paired with a 7-point ordinal (Likert) acceptability scale. Participants were presented with items in the browser, following instructions and other introductory material. Figure 4 gives the instructions, which were the same for the pilot and the full experiment, modulo the number of items mentioned.
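One way to see that the list constraints are jointly satisfiable is to build a set of lists that meets them. The construction below is a hypothetical one chosen for simplicity (a cyclic offset scheme); it is not claimed to be the procedure actually used, and a real deployment would presumably also randomize the assignment and the order of items within each list.

```python
from collections import Counter

N_VERBS, N_FRAMES, N_LISTS = 30, 46, 23

def build_lists(n_verbs=N_VERBS, n_frames=N_FRAMES, n_lists=N_LISTS):
    """Return n_lists lists of (verb_index, frame_index) pairs.

    In list k, verb v is paired with frames (v + k) and (v + k + n_lists),
    both modulo n_frames. The scheme relies on n_frames == 2 * n_lists, so
    that each verb meets every frame exactly once across all lists.
    """
    if n_frames != 2 * n_lists:
        raise ValueError("this scheme assumes n_frames == 2 * n_lists")
    lists = []
    for k in range(n_lists):
        items = []
        for v in range(n_verbs):
            items.append((v, (v + k) % n_frames))
            items.append((v, (v + k + n_lists) % n_frames))
        lists.append(items)
    return lists

lists = build_lists()

# Per-list counts that can be checked against the constraints in the text.
verb_counts = [Counter(v for v, _ in items) for items in lists]
frame_counts = [Counter(f for _, f in items) for items in lists]
```

For the sizes used here, every list contains 60 items, every verb appears twice with two different frames, and every frame appears once or twice.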

3.2 Participants

We recruited 115 unique participants (23 lists × 5 participants per list) through Amazon Mechanical Turk. All participants reported speaking American English as their native language.

3.3 Predictions

To measure interannotator agreement, White et al. compute the Spearman rank correlation between the responses for each pair of participants that did the same list and report a mean correlation of 0.64. This measure is not comparable to ours, however, since our items were specifically designed to block lexical information from being used in the acceptability judgments. This means that participants in principle have more potential interpretations to consider when making a judgment; and depending on the variability in acceptability of such interpretations, we expect no less (and possibly more) variability in responses to bleached items. Thus, we expect lower agreement. To assess how much lower we should expect, we attempt to factor out the variability across items that instantiate a particular verb-frame pair. To derive this more comparable measure, we simulate the amount of agreement we would expect, assuming they had used a method like ours.

First, we fit an ordinal (linked logit) mixed effects model to the ratings from White et al.’s data, with fixed effects for verb, frame, and their interaction and random unconstrained cutpoints for each participant (see Appendix B for details). We then use this model to simulate how each participant from their experiment would respond to each verb-frame pair in their experiment by (i) using the ordinal model to produce a predicted probability distribution over the ordinal scale ratings for each participant and item; (ii) sampling once from each of those distributions; and (iii) computing the Spearman correlation between responses given by all pairs of simulated participants. We repeat this simulation 999 times, computing the mean agreement each time.10 This yields a mean correlation of 0.516 (95% CI: [0.511, 0.521]) across all simulations.
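Steps (i) and (ii) of this simulation can be sketched for a single participant and item. The sketch assumes a standard cumulative logit parameterization, in which the probability of a rating at or below j is the logistic function of the j-th cutpoint minus the item's linear predictor; the exact parameterization of the fitted model is given in Appendix B and may differ. The cutpoint values and the linear predictor below are made up for illustration.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def rating_probabilities(eta, cutpoints):
    """P(rating = 1..K) under a cumulative logit model.

    eta: linear predictor for the verb-frame pair (higher = more acceptable).
    cutpoints: K-1 increasing participant-specific cutpoints.
    """
    # P(rating <= j) for j = 1..K-1, followed by P(rating <= K) = 1.
    cumulative = [logistic(c - eta) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs

def sample_rating(eta, cutpoints, rng):
    """Draw one rating on the 1..K scale from the predicted distribution."""
    probs = rating_probabilities(eta, cutpoints)
    return rng.choices(range(1, len(probs) + 1), weights=probs, k=1)[0]

# Hypothetical values for one participant on a 7-point scale.
example_cutpoints = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
example_probs = rating_probabilities(1.0, example_cutpoints)
example_rng = random.Random(0)
example_rating = sample_rating(1.0, example_cutpoints, example_rng)
```

Repeating the sampling step for every participant and item, and then correlating the simulated responses pairwise, gives one simulated agreement value per run.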

3.4 Results

We are concerned with two sorts of results in this validation: (i) interannotator agreement among participants in the validation compared to interannotator agreement in White et al.’s data; and (ii) agreement in the aggregated ratings for verb-frame pairs.

To measure interannotator agreement, we compute the Spearman rank correlation between the responses for each pair of participants that did the same list. This yields a mean correlation of 0.528 (95% CI: [0.509, 0.545]). Since this interval contains the simulated mean of 0.516, the level of interannotator agreement we observe is in line with what we expect given data collected under a more standard methodology.
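The agreement measure itself is straightforward to compute. The sketch below implements Spearman's rank correlation with average ranks for ties (ties are unavoidable on a 7-point scale) and averages it over all pairs of participants who rated the same list. The function names are hypothetical, and how pairs are pooled across lists is an assumption rather than something the text specifies.

```python
from itertools import combinations

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    # Undefined (division by zero) if either sequence is constant.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def mean_pairwise_agreement(responses_by_participant):
    """Mean Spearman correlation over all pairs of participants on one list.

    responses_by_participant: equal-length rating sequences, aligned by item.
    """
    pairs = list(combinations(responses_by_participant, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)
```

A participant who gives the same rating to every item has no defined rank correlation with anyone, so such cases would need to be handled separately before averaging.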

Next, we turn to agreement between the average ratings for each verb-frame pair computed from each dataset. To compute these average ratings, we fit the ordinal mixed model described in the last subsection to each dataset separately and then compute the predicted real-valued acceptability for each verb-frame pair.11

Figure 2 plots the Spearman rank correlation between these normalized, real-valued verb-frame acceptabilities by frame (across verbs), and Figure 1 plots the same correlation by verb (across frames). In both plots, the dashed line shows the mean interannotator agreement, and error bars show 95% confidence intervals.

Figure 1

Correlation by verb between mean normalized verb-frame acceptability in White et al.’s (2018b) data and our replication.

Figure 2

Correlation by frame between mean normalized verb-frame acceptability in White et al.’s (2018b) data and replication.

In Figure 1, we see that all verbs show correlations above the mean interannotator agreement. This suggests that as a measure of verbs’ syntactic distributions, our data are encoding essentially the same distributional information that White et al.’s data are, and there are no substantial differences between the two experiments tied directly to the verbs.

In Figure 2, we see that most frames show average correlations close to or above the mean interannotator agreement, and we also take these cases to involve no substantial differences between the two experiments tied to those frames. There are five frames that do not show a correlation that is significantly different from zero: NP ___ed NP, NP ___ed NP NP, It ___ed NP that S, It ___ed NP SWH, and It ___ed NP WH to VP. We therefore discuss potential explanations for why these frames differ between the two experiments.

For NP ___ed NP and NP ___ed NP NP, it seems likely that the disagreement arises because White et al.’s instantiations of those frames only ever include object NPs that denote concrete inanimate entities that cannot be straightforwardly associated with propositional content—e.g. cups, tables, bottles. In contrast, the inanimate indefinites we use could denote either contentful or non-contentful inanimates. This is likely the cause of higher acceptabilities observed for predicates like believe and tell. Compare (11a) and (11b) with (12a) and (12b).

(11) a. #I believed the table.
  b. #I told her the table.12
(12) a. Someone believed something.
  b. Someone told someone something.

Since we aim to factor out effects due to lexical items, this is a point in favor of our method.

For It ___ed NP that S, It ___ed NP SWH, and It ___ed NP WH to VP, we suspect that the low agreement stems from inherent variability in the judgments for items with expletive subjects. One reason this may arise, pointed out by White et al., is that it can be read referentially in these frames, and thus participants’ judgments might vary depending on their interpretation of it. This predicts that expletive subject frames should show lower agreement on average, which appears to be the case.

Regardless of its source, the existence of this low agreement for expletive subject frames suggests that we should be wary of including such frames in a large-scale experiment like the one we report on in this paper. Nonetheless, we would still like to capture information about whether a verb allows expletive subjects. We discuss our approach to this below.

3.5 Discussion

The results reported above suggest that the bleaching method is a promising way to avoid item effects in acceptability judgment tasks for selectional patterns. Since it involves an extremely simple generation strategy, it therefore is also a promising way of scaling standard acceptability judgment tasks to entire subregions of the lexicon. But even with bleaching, one must thread the needle between useful referential ambiguities—such as the one between entity and propositional reference introduced by the use of something—and plausibly syntactic ambiguities that introduce variability into the judgments—such as the one between referential and expletive it pointed out above.

We address the particular issue of expletive it in our lexicon-scale annotation by using an alternative set of structures for capturing the acceptability of predicates that occur with expletive it: at least those predicates that take experiencer objects. Our approach, which we describe in the next section, is to use passivized versions of the expletive object frames: compare (13a) and (13b).

(13) a. It amazed someone that something happened.
  b. Someone was amazed that something happened.

This approach introduces some ambiguity (we don’t know whether a verb that is acceptable in contexts like (13b) takes contentful or expletive subjects), but this ambiguity is resolvable by looking at the acceptability of that verb in contexts such as (14).

(14) ???Someone amazed someone that something happened.

That is, a verb is licensed in an expletive subject frame if it is licensed in a passive transitive frame, and not licensed in a non-passive ditransitive frame. In our large-scale experiment, we present an expanded set of frames using this manipulation.
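The decision rule just stated can be sketched as follows. This is only an illustration: the function name, the example scores, and the zero threshold on normalized acceptability are assumptions introduced here, not values taken from the experiment.

```python
def licenses_expletive_subject(passive_score, active_score, threshold=0.0):
    """True if a verb is acceptable in the passive frame, as in (13b),
    but not in the non-passive frame with a contentful subject, as in (14).

    The threshold is a hypothetical cut-off on normalized acceptability.
    """
    return passive_score > threshold and active_score <= threshold

# Hypothetical normalized scores (more positive = more acceptable).
scores = {
    "amaze": {"passive": 1.8, "active": -1.5},
    "tell": {"passive": 1.6, "active": 1.9},
}
expletive_verbs = [
    verb for verb, s in scores.items()
    if licenses_expletive_subject(s["passive"], s["active"])
]
```

Under these made-up scores, only amaze is classified as taking an expletive subject, since tell is also acceptable with a contentful subject.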

4 The MegaAcceptability dataset

The main goal of collecting our large-scale acceptability judgment dataset is to obtain a single normalized acceptability score for every clause-embedding verb in the English lexicon, along with an estimate of the variability in judgments for that item. We discuss our data collection method and how we derive these estimates here. In Section 5, we describe experiments that attempt to predict these normalized acceptabilities from frequency data.

4.1 Materials and data collection method

To scale up the materials, we selected a set of frames and a set of verbs, and automated a method of constructing bleached sentences for every member of the Cartesian product of the two.13
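The construction can be illustrated with a small sketch. It is not the generation script used for the dataset: the two verbs, the two frame templates, the `IRREGULAR_PAST` table, and the `inflect_past` helper are hypothetical stand-ins, and the naive inflection rule is an assumption that would not cover the full verb set.

```python
from itertools import product

# Hypothetical miniature inputs; the real verb and frame sets are far larger.
verbs = ["believe", "tell"]
frames = {
    "NP ___ed that S": "Someone {verb} that something happened.",
    "NP ___ed NP that S": "Someone {verb} someone that something happened.",
}

# Assumed irregular forms for this toy example only.
IRREGULAR_PAST = {"tell": "told"}

def inflect_past(verb):
    """Naive past-tense inflection (an assumption, not the authors' rule)."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    return verb + "d" if verb.endswith("e") else verb + "ed"

# One bleached sentence for every (verb, frame) pair in the Cartesian product.
items = [
    {"verb": v, "frame": name, "sentence": template.format(verb=inflect_past(v))}
    for v, (name, template) in product(verbs, frames.items())
]
```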

For verb selection, we attempted to exhaustively select every verb in English which could take a clause of some kind. First, we took the union of several lists of clause-embedding verbs collected in previous work (Hacquard & Wellwood 2012; Anand & Hacquard 2013; 2014; Rawlins 2013; White et al. 2014) as a seed set. Helpfully, a range of existing lists were already aggregated in Rawlins (2013) and constituted the bulk of the seed. This gave us about 500 verbs.

We then searched in VerbNet (Kipper-Schuler 2005)—a database that is, in large part, directly derived from the verb classes in Levin 1993—to find all verbs in all VerbNet classes that any of the seed verbs were present in. We then conducted a hand-filtering pass to remove obvious errors—e.g. cooking verbs.

To pick the set of frames, we first collected a set of eight basic syntactic features that are believed to be relevant to selectional patterns and selected either all or the most frequent values for these features. In this case, we did not aim for full exhaustivity, but rather to get as big a sample as possible within constraints imposed by the already large experiment.

For example, for prepositional phrases, we consider only the prepositions to and about, though many other prepositional markers, such as of and from, may be relevant to the ultimate question of how to represent selectional patterns. For embedded constituent interrogatives, we chose to use an embedded D-linked WH-phrase (which thing) in order to maximize acceptability. To these features, we added passivization—in order to handle expletive subjects as described above—and two more idiosyncratic frame manipulations: declarative slifting (Ross 1973) and the pro-form so (Ross 1972; Hankamer & Sag 1976).

(15) a. Complementizer: ∅, that, for, whether, which thing
  b. Embedded tense/aspect/modality: past, future, infinitival, present participle, bare
  c. Matrix direct object count: 0, 1, 2
  d. Matrix preposition phrase: ∅, to, about
  e. Embedded subject?: true, false
  f. Passivized verb?: true, false
  g. So pro-form?: true, false
  h. Slifting manipulation: Something happened, I ___

The frame instantiations are shown in Figure 3. As can be seen in this figure, the bleaching manipulation we applied is identical to that applied in the validation experiment reported in Section 3.

Figure 3

All frame instantiations in the MegaAcceptability dataset (White & Rawlins 2016, Figure 4).

Figure 4

Instructions and example items for each list.

Each item consisted of a sentence constructed using the bleaching method from a verb and a frame as described above, paired with an ordinal acceptability judgment using a 1–7 point ordinal (Likert) scale. From the base set of verbs and frames described so far, we constructed 1,000 lists of 50 items each. For this experiment (in contrast to the pilot experiment reported above), each frame and each verb appears at most once in each list. Each list was presented to the participant in a browser via Amazon Mechanical Turk (AMT) as a single page, with subsequent items reached by scrolling. A sample view of the first four items of a list is provided in Figure 5.
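One way to satisfy the constraint that no verb and no frame repeats within a list is a cyclic (Latin-square style) assignment. The sketch below is an assumption about how such lists could be built, not a description of the procedure actually used (which may, for example, have been randomized). The counts of 1,000 verbs and 50 frames are chosen for illustration because they yield exactly 1,000 lists of 50 items; `build_lists` and the placeholder names are hypothetical.

```python
def build_lists(verbs, frames):
    """Partition verbs x frames into lists in which no verb or frame repeats.

    Verbs are grouped into blocks of len(frames); within a block, list number
    `shift` pairs the j-th verb with frame (j + shift) mod len(frames).
    """
    n = len(frames)
    assert len(verbs) % n == 0, "sketch assumes verb count is a multiple of frame count"
    lists = []
    for start in range(0, len(verbs), n):
        block = verbs[start:start + n]
        for shift in range(n):
            lists.append([(verb, frames[(j + shift) % n]) for j, verb in enumerate(block)])
    return lists

# Placeholder names; the counts are assumptions chosen to give 1,000 lists of 50.
verbs = [f"verb{i}" for i in range(1000)]
frames = [f"frame{i}" for i in range(50)]
lists = build_lists(verbs, frames)
```

Every verb-frame pair lands in exactly one list under this scheme, so the lists jointly cover the full Cartesian product.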

Figure 5

Four items from a sample list.

Each list was provided to participants as a Human Intelligence Task (HIT) in AMT. Participants were presented with the instructions and training items depicted in Figure 4, followed by several demographics questions (including native language), IRB information, and finally the items from their list. Each item involved rating the acceptability of one of the constructed sentences using a 1–7 point ordinal (Likert) scale. In order to submit each HIT, participants needed to check a box indicating their consent to participate in the study.

4.2 Participants

727 unique participants were recruited through AMT to rate sentences in the 1,000 lists of 50. Participants were allowed to respond to as many unique lists as they liked.14 No one participant was allowed to rate the same list twice, and each list was rated by five unique participants, leading to five unique ratings per item. Each participant responded to a median of four lists (mean: 6.9, min: 1, max: 56).

Four participants reported being native speakers of a language other than English. These participants’ responses were removed from the dataset prior to analysis, for a loss of 600 responses total (∼0.2% of the data). None of these participants rated the same list.

4.3 Response normalization

We use a slightly modified form of the ordinal model-based normalization procedure described in Section 3 to produce two pieces of information associated with each verb-frame pair: a real-valued acceptability value (more positive is more acceptable) and the mean (log-)likelihood associated with all acceptability judgments for a particular item.15 The second score can be viewed as a measure of variability in the judgments: the lower this likelihood score is, the higher the variability in ordinal responses to a particular verb-frame.

As an example of what these two measures look like and their relationship to the original ratings, Figure 6 plots the mean ratings for the NP ___ed that S and NP ___ed NP that S frames (treating the ordinal ratings as though they were interval data), and Figure 7 plots the normalized acceptability scores for those same frames, where more to the right (top) means higher normalized acceptability and more to the left (bottom) means lower acceptability.16 Each point is a verb and only a subset of points are labeled. In Figure 7 smaller labels and grayer points correspond to higher mean variability—i.e. lower likelihood score.

Figure 6

Mean of raw ratings for two frames. Each point is a verb (jittered to mitigate overplotting) and only a subset of points are labeled.

Figure 7

Normalized judgments for two frames. Each point is a verb and only a subset of points are labeled. Smaller labels and grayer points correspond to higher mean variability.

As one would expect, verbs like think, assume, discover, and notice are very acceptable in the NP ___ed that S frame but quite bad in the NP ___ed NP that S frame, and there is little variability in these ratings. In contrast, verbs like tell, remind, and notify are very good in the NP ___ed NP that S frame but middling in the NP ___ed that S frame, with more variability in their ratings. This variability is due in particular to the judgments for the NP ___ed that S frame, suggesting that some participants are okay with dropping the object while others are not. This contrasts with a verb like persuade, for which participants are more unified in their dislike of object drop.

4.4 Reliability

In Section 3, we compared the normalized acceptability obtained from White et al.’s (2018b) experiment to our replication that used the bleaching method. We cannot derive similar agreement estimates here because we only have a single dataset and because the lists were built in such a way that participants only saw one instance of a particular verb and one instance of a particular frame; it is therefore not possible to assess the correlations within a particular frame.

But because we do have estimates of the variability in judgments for each verb-frame pair, we can get a sense for how much agreement there is within judgments for a particular frame by taking the mean of the above-defined variability scores for each frame, across verbs. Remember that these variability scores are just mean likelihood values and that higher likelihood values correspond to lower variability (see Appendix B). Figure 8 plots these means in terms of probabilities. A value of 0.14 (1/7) is the lowest possible probability, roughly corresponding to each participant giving equally spaced values along the ordinal scale. We see that nearly all frames fall within a narrow band between 0.3 and 0.5, suggesting that no frame shows particularly high disagreement, in contrast to what we saw in Section 3 for the expletive subject and NP direct object frames. Further, our replacements for the expletive subject frames (the passivized frames) do not show systematically higher variability, suggesting that our approach was successful.17

Figure 8

Means of verb-frame variability judgments, for each frame. Higher probability means lower variability.

5 Relating frequency and acceptability

We now turn to the main question of this paper: to what extent can a verb’s subcategorization behavior be predicted directly from the frequency with which it occurs in different syntactic structures in a corpus? To obtain our measure of frequency, we use the VALEX dataset, which is the largest publicly available dataset of subcategorization frame frequencies (Korhonen et al. 2006).18 VALEX is built from over 900 million words of text and contains 163 subcategorization frame types, described in Briscoe & Carroll 1997, and over 6,000 verbs, 958 of which are shared with the MegaAcceptability dataset. (The verbs that are missing tend to be particle verbs, as these are not treated as separate verbs by VALEX.)

One obstacle to using any frequency dataset for predicting acceptability is that we must determine the importance of any particular verb-frame co-occurrence in the context of its entire distribution. For example, observing a high-frequency verb like think once or twice in a ditransitive frame should make us less certain that think is highly acceptable in that frame than observing a lower frequency verb like begrudge. This situation is delicate, though, since it requires the specification of some frequency normalization model for processing the raw frequency data, and we do not want to introduce too much inductive bias at this stage to ensure that we are in fact testing the predictability of acceptability directly from frequency.

To thread this needle, we consider only normalization models that represent verbs’ distributions directly in terms of the original subcategorization frames—as opposed to some set of latent syntactic or semantic factors—while accounting for the importance of a particular observation in the context of those distributions. We then use each of these representations as predictors in a linear model of the normalized acceptability judgments described in the last section.

The use of a linear model for this purpose is important so as not to introduce any further inductive bias, as the use of a kernelized support vector machine or multi-layer perceptron might. Further, since linear models learn linear functions from one representation to another and since linear functions are all and only the homomorphisms (structure-preserving mappings) between those representations, this setup allows us to draw stronger conclusions about the character of the relationship: specifically, whether or not the (normalized) frequency distributions and acceptability are homomorphic (structurally similar).

5.1 Normalization models

Inspired by work in the subcategorization frame extraction literature (Brent 1991; 1993; Manning 1993; Ushioda et al. 1993; Briscoe & Carroll 1997; Carroll & Rooth 1998; Gahl 1998; Lapata 1999; Korhonen et al. 2000; O’Donovan et al. 2005; Preiss et al. 2007; Van de Cruys et al. 2012; Lippincott et al. 2012; Baker et al. 2014; among others), we consider two probabilistic models and two information theoretic models of subcategorization frame frequency distributions. Our aim in looking at multiple models is less to compare their relative performance, and more to give the frequency information the best chance at explaining acceptability.

5.1.1 Probabilistic models

The first probabilistic model we consider models the conditional probability ℙ(f | v) of seeing a particular frame f given a particular verb v as a categorical distribution with parameters (probabilities) θv.


We use the frequencies cvf for each verb v and frame f to compute the posterior probability ℙ(θv | cv) of θv under the assumptions (i) that the verb-frame pairs are sampled independently; and (ii) that the prior probability ℙ(θv; α) is given by a Dirichlet distribution with parameter α.


We then use the most likely probabilities θ̂v (i.e. the maximum a posteriori, or MAP, estimate) for each verb as our representation of a verb’s distribution; that is, θ̂v is what we use to predict a verb’s acceptability. When α is a constant positive vector (λ+1)1NF (where NF is the number of frames), this turns out to be equivalent to standard add-λ smoothing.

\[
\hat{\theta}_v = \arg\max_{\theta_v} \mathbb{P}\left(\theta_v \mid c_v;\ \alpha = (\lambda+1)\mathbf{1}_{N_F}\right) = \left[ \frac{c_{v1}+\lambda}{\sum_i (c_{vi}+\lambda)},\ \frac{c_{v2}+\lambda}{\sum_i (c_{vi}+\lambda)},\ \ldots \right]
\]

Thus, a special case of this model (λ = 0) just involves dividing a verb’s frequency in a frame by the verb’s frequency across all frames. Regardless of the setting of λ, the verb’s frequency representation always sums to 1 in this model, and λ > 0 enforces that, even if a verb-frame pair hasn’t been seen, there is still some probability that it might be seen in the future, with the amount of probability assigned to those unseen verb-frame pairs dependent on the size of λ (see Jurafsky & Martin 2009: Chapter 4).
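As a small worked illustration of add-λ smoothing, the sketch below computes the estimate for a single verb. The function name and the counts are hypothetical; note that with λ = 0 the estimate is undefined for a verb whose counts are all zero.

```python
def map_frame_probs(counts, lam=0.0):
    """MAP estimate of P(frame | verb) under a symmetric Dirichlet(lam + 1) prior.

    Equivalent to add-lam smoothing: (c_f + lam) / sum_i (c_i + lam).
    """
    total = sum(counts) + lam * len(counts)
    return [(c + lam) / total for c in counts]

counts = [8, 2, 0]  # hypothetical counts of one verb in three frames
relative_freq = map_frame_probs(counts, lam=0.0)  # plain relative frequencies
smoothed = map_frame_probs(counts, lam=1.0)       # unseen frame gets some mass
```

With λ = 0 the unseen third frame receives probability zero, whereas with λ = 1 it receives 1/13; in both cases the representation sums to 1.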

The second probabilistic model attempts to directly extract acceptability from the frequency data by finding, for each verb-frame pair, a probability that that verb is acceptable in that frame (White 2015; White et al. 2018a). In this model, the conditional probability of seeing a particular frame f with some frequency cvf given a verb v is assumed to have a negative binomial distribution with probability πvf (the probability that v is acceptable in f) and rate rv (roughly, corresponding to the overall frequency of the verb v).


This distribution is a natural choice both (i) because it is known to be a good model of similar kinds of count data (Church & Gale 1995); and (ii) because the parameters themselves have natural interpretations: (a) the parameter πvf can be viewed as a probability of acceptability: when it is close to one, we expect to see more instances of a frame with a verb (though there is a non-zero probability of seeing it rarely); when it is close to zero, we expect to see fewer; and (b) the verb’s rate parameter rv roughly controls its overall frequency. Thus, unlike for the Dirichlet-Categorical model, we straightforwardly separate our knowledge of acceptability (πvf) from our knowledge of frequency (rv; for further discussion, see White 2015). This is particularly evident from the fact that the Dirichlet-Categorical model’s representation must sum to 1, thus telling us the probability of seeing a particular verb-frame pair, while the Beta-Negative Binomial model’s representation has no such requirement: a verb v can be acceptable in more than one frame f, represented by πvf being near 1 for each of those frames; information about the probability of actually seeing the verb-frame pair is largely factored into rv.
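The interpretation of the two parameters can be made concrete with a short sketch. It assumes one standard parameterization of the negative binomial, P(c) = Γ(c + r) / (c! Γ(r)) · π^c · (1 − π)^r, which matches the description above in that counts grow as π approaches one; the exact likelihood used in the paper may be parameterized differently, and the function names and values are hypothetical.

```python
from math import exp, lgamma, log

def neg_binomial_logpmf(count, pi, rate):
    """log P(count) under the assumed parameterization:
    Gamma(c + r) / (c! Gamma(r)) * pi**c * (1 - pi)**r."""
    log_coef = lgamma(count + rate) - lgamma(count + 1) - lgamma(rate)
    return log_coef + count * log(pi) + rate * log(1.0 - pi)

def neg_binomial_mean(pi, rate):
    """Expected count under the same parameterization: r * pi / (1 - pi)."""
    return rate * pi / (1.0 - pi)

# Hypothetical verb with rate 2.0: a frame with pi = 0.9 versus one with pi = 0.1.
high = neg_binomial_mean(0.9, 2.0)
low = neg_binomial_mean(0.1, 2.0)
# Even for an acceptable frame, never observing the pair has non-zero probability.
p_zero_high = exp(neg_binomial_logpmf(0, 0.9, 2.0))
```

For a fixed rate, the expected count rises steeply with π, while the rate scales expected counts for all of a verb's frames at once.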

As with the θv parameter of the Dirichlet-Categorical model, we aim to find the most likely pairing (π̂v, r̂v) for each verb v, given the counts cv. We assume that the prior probability ℙ(πvf; β1, β2) in this case is beta distributed with parameters β1 = β2, henceforth referred to via γ = β1 − 1 = β2 − 1.19 We assume an improper (uniform) prior for rv.


Unlike for θ̂v, this MAP estimate for (π̂v, r̂v) cannot be computed in closed form, so we use gradient descent to obtain it.

For both the Dirichlet-Categorical model and the Beta-Negative Binomial model, we refer to the hyperparameters λ (in α = (λ+1)1NF) and γ as smoothing parameters. We consider multiple settings of the smoothing parameters in our experiments (described below).

5.1.2 Information theoretic models

The first information theoretic model we consider uses the pointwise-mutual information (PMI; Church & Hanks 1990) between a verb and a frame.


This quantity is commonly used to find collocations—i.e. common pairings of words or phrases (see Manning & Schütze 1999: Chapter 5)—by measuring how much more (or less) likely two items are to occur together than one might expect if those items were independently sampled.

To compute this quantity, we assume the Dirichlet-Categorical model described above and obtain MAP estimates for the parameters of the joint distribution ℙ(v, f). The parameters of the marginal distributions ℙ(v) and ℙ(f) can then be obtained from the joint. We estimate these distributions using the same definition of the smoothing parameter λ used above.
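A minimal sketch of this computation follows, assuming that the joint distribution is estimated by add-λ smoothing over the whole verb-by-frame count table and that PMI takes its usual form, log ℙ(v, f) − log ℙ(v) − log ℙ(f). The function names and counts are hypothetical; with λ = 0, any zero count would make the logarithm undefined.

```python
from math import log

def smoothed_joint(counts, lam):
    """Add-lam smoothed joint P(v, f) from a verb-by-frame count table."""
    total = sum(sum(row) for row in counts) + lam * len(counts) * len(counts[0])
    return [[(c + lam) / total for c in row] for row in counts]

def pmi(counts, lam=1.0):
    """PMI(v, f) = log P(v, f) - log P(v) - log P(f), marginals taken from the joint."""
    joint = smoothed_joint(counts, lam)
    p_verb = [sum(row) for row in joint]
    p_frame = [sum(col) for col in zip(*joint)]
    return [
        [log(p) - log(p_verb[i]) - log(p_frame[j]) for j, p in enumerate(row)]
        for i, row in enumerate(joint)
    ]

counts = [[8, 2], [1, 9]]  # hypothetical: rows are verbs, columns are frames
scores = pmi(counts, lam=1.0)
```

In this toy table the first verb has positive PMI with the first frame (they co-occur more often than independence predicts) and negative PMI with the second.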

The second information theoretic model we consider uses the terms of the G statistic, which Dunning (1993) suggests to be superior to PMI for finding collocations because it better controls for relatively poorer probability estimates for pairings involving low frequency items. In our case, we might expect it to perform better for low frequency verbs, since the frames are all relatively high frequency.

Computing this statistic’s terms amounts to scaling the PMI by the smoothed co-occurrence count of the verb and the frame.20


As for PMI, we assume the Dirichlet-Categorical model described above and obtain MAP estimates for the parameters of the joint and marginal distributions using the same definition of the smoothing parameter λ used above.
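The relationship between the G statistic's terms and PMI can be sketched as follows, using the standard form G = 2 Σ O · ln(O / E), where O is an observed (here, smoothed) count and E is the count expected under independence. The function name and counts are hypothetical, and the smoothing scheme is the same add-λ assumption as for PMI.

```python
from math import log

def g_terms(counts, lam=1.0):
    """Per-cell terms (c_vf + lam) * PMI(v, f) of the G statistic (sketch)."""
    smoothed = [[c + lam for c in row] for row in counts]
    total = sum(sum(row) for row in smoothed)
    verb_totals = [sum(row) for row in smoothed]
    frame_totals = [sum(col) for col in zip(*smoothed)]
    return [
        [
            # observed * log(observed / expected), expected = row * column / total
            o * log(o * total / (verb_totals[i] * frame_totals[j]))
            for j, o in enumerate(row)
        ]
        for i, row in enumerate(smoothed)
    ]

counts = [[8, 2], [1, 9]]  # hypothetical: rows are verbs, columns are frames
terms = g_terms(counts, lam=1.0)
g_statistic = 2 * sum(sum(row) for row in terms)
```

Each term has the same sign as the corresponding PMI value but is weighted by the smoothed count, so cells with few observations contribute less.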

5.2 Experiments

We compute MAP estimates for the parameters of the Dirichlet-Categorical model θv and the Beta-Negative Binomial model πv, rv with smoothing parameters λ, γ ∈ {0, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50}. We compute PMI and G using MAP estimates based on the same settings of λ.

We regress the normalized acceptability judgments for each verb in each frame on each of these representations in a multivariate ridge regression—i.e. a linear regression with L2 regularization. To set the regularization parameter α ∈ {0.01, 0.1, 0.2, 0.5, 1, 2, 5, 10}, we use a 10-fold cross-validation. To compute the generalizability of this model, we nest this cross-validation within another 10-fold cross-validation and compute the mean R2 (variance explained) on the held out datasets in this outer cross-validation.
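The nested cross-validation procedure can be sketched with NumPy as below. This is an illustrative reimplementation under stated assumptions, not the code used for the reported results: the closed-form ridge solver with an unpenalized intercept, the pooled R2 definition, the random fold assignment, and the synthetic data are all choices made here for the example. The regularization parameter α in this sketch is unrelated to the Dirichlet parameter α of Section 5.1.

```python
import numpy as np

ALPHAS = [0.01, 0.1, 0.2, 0.5, 1, 2, 5, 10]

def fit_ridge(X, Y, alpha):
    """Closed-form multivariate ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, y_mean - x_mean @ W

def r2(Y, Y_hat):
    """Variance explained, pooled over all output columns."""
    return 1.0 - ((Y - Y_hat) ** 2).sum() / ((Y - Y.mean(axis=0)) ** 2).sum()

def cv_r2(X, Y, alpha, folds):
    """Mean held-out R2 of ridge with a fixed alpha over the given folds."""
    scores = []
    for test in folds:
        train = np.setdiff1d(np.concatenate(folds), test)
        W, b = fit_ridge(X[train], Y[train], alpha)
        scores.append(r2(Y[test], X[test] @ W + b))
    return float(np.mean(scores))

def nested_cv_r2(X, Y, n_folds=10, seed=0):
    """Outer folds estimate generalization; inner folds select alpha."""
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for test in outer:
        train = np.setdiff1d(np.arange(len(X)), test)
        inner = np.array_split(rng.permutation(train), n_folds)
        # Inner loop: choose alpha using only the outer-training data.
        best = max(ALPHAS, key=lambda a: cv_r2(X, Y, a, inner))
        W, b = fit_ridge(X[train], Y[train], best)
        scores.append(r2(Y[test], X[test] @ W + b))
    return float(np.mean(scores))

# Synthetic stand-in data: 200 "verbs", 5 predictors, 3 "frames".
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
Y = X @ rng.normal(size=(5, 3)) + 0.1 * rng.normal(size=(200, 3))
mean_r2 = nested_cv_r2(X, Y)
```

Because the regularization parameter is chosen without access to the outer held-out fold, the outer R2 is not inflated by the hyperparameter search.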

Similar to our reasoning for using multiple kinds of normalization models, our aim in using ridge regression with cross-validation as opposed to an unregularized regression fit to the entire dataset is to give each model the best chance at explaining acceptability for verb-frame pairs it has not seen. If we simply fit a linear regression for the whole dataset and then reported measures of fit on the same data, we could substantially overestimate the model’s performance (see standard machine learning texts, such as Bishop 2006: Chapters 1–3). Importantly, though, the result of ridge regression just is a linear model, thus satisfying our goal of finding a homomorphic (structure-preserving) mapping from the frequency representation to the acceptability representation.

5.3 Results

Figure 9 shows the mean R2 across the 10 cross-validation folds for each model and smoothing parameter—except the G model, which reliably has R2 < 0 across all values of the smoothing parameter λ. We see that the Beta-Negative Binomial model (γ = 0.1) is the best-performing model, with the PMI model (λ = 5) a close second. The Dirichlet-Categorical model (λ = 0) is a close third, performing only slightly (but reliably) worse. The G model consistently does more poorly than the other models—possibly because it too strongly downweights the scores for low frequency verbs (see Korhonen et al. 2000 for additional discussion in a related context). But even for the best-performing models, the scores are quite low. This suggests that, while the joint frequency of a verb and a frame carries some information about the acceptability of that verb in that frame, it is far from enough to determine that acceptability.

Figure 9

Mean variance explained in normalized acceptability judgments in 10-fold/10-fold nested cross-validation for each model and smoothing parameter. The G model is excluded because it performs reliably worse than zero across all smoothing parameters.

One question that arises here is whether this poor performance is due to verb-frame pairs that received highly variable judgments. In this case, we should expect a positive correlation between judgment variability (as defined in Section 4) and models’ absolute errors. We test this hypothesis using our best-performing model’s absolute error on the held-out data for each fold of the cross-validation. Rather than finding a positive correlation, we find a weak (but reliable) negative Spearman rank correlation of –0.192 (95% CI = [–0.200, –0.184]).
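For readers unfamiliar with the statistic, a Spearman rank correlation is the Pearson correlation of the two variables' ranks. The sketch below computes it for a handful of made-up values; the data and function names are hypothetical, and the procedure used to obtain the confidence interval is not reproduced here because it is not described in this section.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            result[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return result

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-item variability scores and absolute prediction errors.
variability = [0.1, 0.4, 0.2, 0.9, 0.7]
abs_error = [1.2, 0.8, 1.0, 0.3, 0.5]
rho = spearman(variability, abs_error)
```

In this toy example the two rankings are exactly reversed, so the correlation is at its minimum of −1; the value reported above is much closer to zero.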

This negative correlation suggests that highly variable judgments are actually slightly easier to predict than less variable judgments. One reason this may come about is that more variable judgments tend to have normalized acceptabilities near the center of the acceptability scale, which arises from the fact that high variability is a consequence of extreme responses from participants that average out to the middle of the scale, as can be seen in Figure 10. This means that if a model incorrectly predicts a very high or very low acceptability score it will tend to be less wrong for the highly variable predicates in the middle of the scale than for predicates that participants were more certain about.

Figure 10

Normalized acceptability plotted against variability.

A similar question arises with respect to frequency: the poor performance we observe could be due to poor estimates for the distributions of low frequency verbs. In this case, we should expect a negative correlation between verb frequency and models’ absolute errors. We again test this hypothesis using our best-performing model’s absolute error on the held-out data for each fold of the cross-validation. We instead find a reliably positive Spearman rank correlation here, though it is extremely weak: 0.021 (95% CI = [0.011, 0.029]). This finding suggests that poor estimates of verbs’ distributions (at least their frequency distributions) are not the cause of our models’ poor performance.

5.4 Discussion

What is the source of the models’ low performance then? We believe it is likely due to a systematic bias in the kinds of information frequency distributions contain. Specifically, we posit that those aspects of a verb’s distribution that are predictable either from its abstract syntactic properties or from its meaning will not necessarily be directly encoded in its frequency distribution. That is, one will not necessarily observe all frames a particular verb is acceptable in insofar as the acceptability of that verb in that frame is predictable from its acceptability in another frame (see Grimshaw 1981; Pinker 1984; 1989; Lasnik 1989; Kako 1997; Lidz et al. 2004 for how this might work; see also Featherston 2008 on the Iceberg Phenomenon). Conversely, a verb being acceptable in a frame does not entail observing that verb in that frame, unless that acceptability is not predictable from its meaning.

One piece of evidence for this comes from which frames our models perform worst on. Figure 11 plots the R2 for the best-performing Beta-Negative Binomial model (γ = 0.1) broken down by frame. We see that the model systematically does more poorly in predicting verbs’ acceptability in frames involving direct and indirect objects and a tensed embedded clause. This is consonant with our hypothesis insofar as verbs’ acceptability in these frames is predictable from some semantic property—plausibly in this case, whether the verb is communicative or not.

Figure 11

Variance explained by best-performing Beta-Negative Binomial model (γ = 0.1) broken out by frame.

Part of this hypothesis appears to fly in the face of previous work in the syntactic bootstrapping literature demonstrating that distributional cues are useful for inferring a word’s meaning (Landau & Gleitman 1985; Gleitman 1990; Naigles 1990; 1996; Fisher et al. 1991; 1994; 2010; Naigles et al. 1993; Fisher 1994; Lederer et al. 1995; Gillette et al. 1999; Snedeker & Gleitman 2004; Lidz et al. 2004; Gleitman et al. 2005; Papafragou et al. 2007; White 2015; Dudley 2017; Lewis et al. 2017; White et al. 2018a; b). But this conflict is only apparent. A key part of our hypothesis is that acceptability is not directly encoded in frequency distributions. But certain components of that distribution may be observed for particular verbs, and the observation of that component may imply the acceptability of others. For instance, there is a relatively strong correlation between acceptability in an NP ___ed NP that S frame and acceptability in an NP ___ed NP whether S frame, and so it is generally a safe bet that if a verb is acceptable in one it will be acceptable in the other.

This view implies that, insofar as abstract syntactic and semantic properties reveal themselves in verbs’ subcategorization frame frequency distributions, it may be possible to infer those properties from regularities observable across those distributions. So far, the results are consistent only with our hypothesis H2 from Section 2, and not with its direct counterpart: we apparently cannot predict acceptability in selectional behavior from frequency. We consider various models of this abstraction in the next section.

6 Abstracting frequency

Constructing useful abstractions of frequency distributions is a major component of much work in NLP. Abstraction techniques can take myriad forms, both probabilistic and neural. We consider four popular abstraction techniques, selected to be roughly analogous to the probabilistic and information theoretic models presented in the last section.

Our main goal in doing this is to determine the extent to which these abstractions represent acceptability directly, by which we mean that the space of abstractions and acceptability are homomorphic. As noted in the last section, the homomorphisms between vector spaces are just the linear functions, and so as in the last section, we will attempt to predict acceptability using linear regression on the different representations we consider. We point this out because, for some of the representations we construct, it is common practice to learn nonlinear functions to a quantity of interest; and while this can be useful for understanding whether a particular abstraction implicitly contains information about a quantity, it does not tell us the extent to which that abstraction is potentially a representation of that quantity in an algebraic sense.

6.1 Models

The first model we consider is Latent Dirichlet Allocation (LDA; Blei et al. 2003), which is analogous to the Dirichlet-Categorical model presented in Section 5. This model is closely related to Alishahi & Stevenson’s (2008) model of verb learning (see also Perfors et al. 2010; Parisien & Stevenson 2010; Barak et al. 2012 for similar acquisition models as well as related methods discussed by Schulte im Walde & Brew 2002; Korhonen et al. 2003; Schulte im Walde 2006; Sun et al. 2008; Vlachos et al. 2009). It assumes that each verb is probabilistically associated with a set of K latent syntactic and semantic properties via a conditional categorical distribution ℙ(k | v) = θvk and that each frame is probabilistically associated with that same set of properties via a conditional categorical distribution ℙ(f | k) = φkf. The probability of seeing a verb v in a frame f is then modeled via these two distributions.


As for the Dirichlet-Categorical model from Section 5, the parameters θv and φk are assumed to be distributed Dirichlet.
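Given the two conditional distributions just described, the probability of a frame given a verb can be obtained by summing over the latent properties, ℙ(f | v) = Σk θvk φkf. The sketch below illustrates that computation for one verb; the function name and all parameter values are hypothetical.

```python
def frame_probs(theta_v, phi):
    """P(f | v) = sum_k P(k | v) * P(f | k) for one verb."""
    n_frames = len(phi[0])
    return [
        sum(theta_v[k] * phi[k][f] for k in range(len(theta_v)))
        for f in range(n_frames)
    ]

# Hypothetical values with K = 2 latent properties and 3 frames.
theta_v = [0.75, 0.25]                    # P(k | v)
phi = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]  # P(f | k), one row per property
probs = frame_probs(theta_v, phi)
```

The verb's distribution over frames is thus a mixture of the K property-specific frame distributions, weighted by how strongly the verb is associated with each property.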

The second model we consider is logistic factor analysis (LFA) with a negative binomial likelihood (see Zhou 2018). This model is closely related to the Poisson Factor Analysis (Zhou et al. 2012; Zhou & Carin 2015) model proposed as a model of syntactic bootstrapping by White (2015) and further developed in White et al. (2018a). This model is analogous to the Beta-Negative Binomial model presented in Section 5, using the same likelihood function but modeling Π via two matrices, U (of size NV × K) and A (of size K × NF), where NV is the number of verbs and NF is the number of frames.


Similar to Θ in LDA, one way to view U is as encoding verbs’ abstract syntactic and/or semantic properties; and similar to Φ in LDA, one way to view A is as encoding the syntactic properties of a frame, along with whatever aspects of verbal semantics project onto that frame (White & Rawlins 2016). As for the Beta-Negative Binomial model presented in Section 5, we infer U, A, and the rate parameters r of the negative binomial likelihood using gradient descent.
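One natural reading of "logistic factor analysis" is that the matrix of acceptability probabilities is obtained by applying the logistic function elementwise to the product UA. The sketch below illustrates that reading; it is an assumption made here, since the defining equation is not reproduced in this section, and the function names and loadings are hypothetical.

```python
from math import exp

def logistic(x):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-x))

def acceptability_probs(U, A):
    """Pi = logistic(U A) elementwise: one probability per verb-frame pair."""
    return [
        [logistic(sum(u_k * A[k][f] for k, u_k in enumerate(u))) for f in range(len(A[0]))]
        for u in U
    ]

# Hypothetical loadings: 2 verbs, K = 2 latent properties, 3 frames.
U = [[2.0, 0.0], [0.0, 2.0]]
A = [[1.5, -1.5, 0.0], [-1.5, 1.5, 0.0]]
Pi = acceptability_probs(U, A)
```

Unlike the LDA representation, the rows of Π need not sum to 1, so a verb can have a high probability of acceptability in several frames at once.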

The third model we consider uses Global Vectors (GloVe; Pennington et al. 2014), which is a popular word embedding method in NLP. GloVe itself is not directly analogous to the PMI method from Section 5, but it is closely related (Levy & Goldberg 2014; Suzuki & Nagata 2015). In essence, it is a factor analysis of the log co-occurrence counts cij for words i and j.21


As for LDA and LFA, W represents a relation between verbs and latent syntactic and/or semantic properties and W′ represents the relation between these properties and frames.

We consider two versions of this GloVe-based model. The first uses pretrained GloVe to compute a neural bag of words (NBoW) representation for each sentence (Iyyer et al. 2015).22 In NBoW, the point-wise mean of the vector for each word in a multiset is computed. In our case, this multiset is the multiset of words in each sentence of MegaAcceptability. We then predict the acceptability for that sentence from its NBoW representation.
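The NBoW computation itself is simple enough to show in full. In the sketch below, the two-dimensional vectors, the whitespace tokenization, and the lowercasing step are assumptions for illustration; pretrained GloVe vectors have many more dimensions and the preprocessing actually used may differ.

```python
def nbow(sentence, embeddings):
    """Point-wise mean of the vectors for every token in the sentence."""
    vectors = [embeddings[token] for token in sentence.lower().split()]
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]

# Hypothetical 2-dimensional embeddings; pretrained GloVe vectors are much larger.
embeddings = {
    "someone": [1.0, 0.0],
    "knew": [0.0, 1.0],
    "something": [1.0, 1.0],
    "happened": [2.0, 2.0],
}
sentence_vector = nbow("Someone knew something happened", embeddings)
```

Because the mean ignores word order, two sentences containing the same multiset of words receive identical NBoW representations.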

In addition to using pretrained GloVe, we train our own GloVe embeddings on the basis of the VALEX verb-frame counts cvf. This yields an embedding for each verb and each frame on a K-dimensional latent space. We consider the same settings of K for this space as for the LDA and LFA models.

The final model we consider uses contextual word embeddings produced using pretrained Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2019).23 A full technical explication of BERT is not possible here, but in essence, BERT consists of multiple layers of interacting neural network modules known as transformers (Vaswani et al. 2017) that are trained to predict the probability of a word given the surrounding words in the sentence as well as sentence ordering in a document. This means that BERT’s representation of each word in a sentence contains some amount of information about other words in the sentence.

We use BERT to encode each sentence in MegaAcceptability and then extract the embedding of the sentence start token—i.e. the classifier token ([CLS])—following standard practice (Devlin et al. 2019).24 This is analogous to using NBoW with GloVe embeddings in that the sentence start token contains information about all words in the sentence (along with their positions) due to the way the model is trained.

6.2 Experiments

We compute MAP estimates for the parameters of LDA using the default hyperparameters in the sklearn package, and we compute maximum likelihood estimates for the parameters of the logistic factor analysis model and the GloVe representations we train on VALEX. For all three models, we consider numbers of latent components K ∈ {2, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50}. For LDA and LFA, we additionally concatenate the predicted distributions over the subcategorization frames and the best-performing normalized distributions from the Dirichlet-Categorical (λ = 0) and Beta-Negative Binomial (γ = 0.1) models, respectively.

As in Section 5, we regress the normalized acceptability judgments for each verb in each frame on each of these representations in a multivariate ridge regression, i.e. a linear regression with L2 regularization. To set the regularization parameter α ∈ {0.01, 0.1, 0.2, 0.5, 1, 2, 5, 10}, we use 10-fold cross-validation. To estimate the generalizability of this model, we nest this cross-validation within another 10-fold cross-validation and compute the mean R² (variance explained) on the held-out datasets in this outer cross-validation.
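
The nested procedure can be sketched as follows. This is a minimal numpy-only illustration, not the analysis code: it uses a closed-form ridge solution with an unpenalized intercept, pools R² over all output columns (other conventions average R² per column), and the function names (`fit_ridge`, `nested_cv_r2`, and so on) are hypothetical. Rows of `X` stand for verbs' representations and columns of `Y` for frames.

```python
import numpy as np

ALPHAS = (0.01, 0.1, 0.2, 0.5, 1, 2, 5, 10)


def fit_ridge(X, Y, alpha):
    """Closed-form multivariate ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    return coef, y_mean - x_mean @ coef


def r2(Y, Y_hat):
    """Variance explained, pooled over all output columns."""
    ss_res = np.sum((Y - Y_hat) ** 2)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot


def kfold(n, k, rng):
    """Return a list of (train_indices, test_indices) pairs for k shuffled folds."""
    folds = np.array_split(rng.permutation(n), k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i]) for i in range(k)]


def mean_cv_r2(X, Y, alpha, splits):
    scores = []
    for train, test in splits:
        coef, intercept = fit_ridge(X[train], Y[train], alpha)
        scores.append(r2(Y[test], X[test] @ coef + intercept))
    return np.mean(scores)


def nested_cv_r2(X, Y, alphas=ALPHAS, k_outer=10, k_inner=10, seed=0):
    """Mean held-out R² over outer folds, choosing alpha by inner cross-validation."""
    rng = np.random.default_rng(seed)
    outer_scores = []
    for train, test in kfold(len(X), k_outer, rng):
        # Inner loop: pick alpha using only the outer training portion.
        inner = kfold(len(train), k_inner, rng)
        best = max(alphas, key=lambda a: mean_cv_r2(X[train], Y[train], a, inner))
        # Refit on the full outer training portion and score on the held-out fold.
        coef, intercept = fit_ridge(X[train], Y[train], best)
        outer_scores.append(r2(Y[test], X[test] @ coef + intercept))
    return float(np.mean(outer_scores))
```

The point of the nesting is that the held-out fold in the outer loop never influences the choice of α, so the reported R² is not inflated by hyperparameter selection.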

6.3 Results

Figure 12 plots the mean R² across the 10 cross-validation folds for each model and number of latent components. The black dashed line shows the best performing model from Section 5: Beta-Negative Binomial (γ = 0.1).

Figure 12

Mean variance explained in normalized acceptability judgments in 10-fold/10-fold nested cross-validation for each model and number of latent components.

We see that BERT is far and away the best performing model with LFA a distant second. In turn, LFA (K = 5) outperforms the Beta-Negative Binomial model (γ = 0.1) by approximately one point (though not reliably) as well as the best-performing LDA (K = 30) and GloVe (K = 15) models by five points (reliably).

Figure 13 plots the mean R² across the 10 cross-validation folds for the BERT model, broken out by frame. We see that, in contrast to the analogous Figure 11, frames involving NP direct objects do not show systematically poor performance, suggesting that BERT may be able to capture some regularity about selection of a direct object that the models based on frame frequencies were not able to.

Figure 13

Variance explained by BERT model broken out by frame.

6.4 Discussion

At a high level, these results suggest two things. First, there is some amount of information in verbs’ subcategorization frame frequency distributions that is not accessible directly from those distributions themselves, even after various forms of clever normalization. Accessing that information requires some amount of abstraction of the frequencies, confirming both hypotheses H1 and H2 from Section 2.

Second, the amount of extra information that can be gleaned from the subcategorization frame frequency distributions alone is relatively small—especially compared to the gains obtained in using models, such as BERT, that additionally have access to the particular lexical items that cooccur with a verb (see Grimshaw 1994; Pinker 1994; Resnik 1996; White et al. 2017b for reasons this might be true). But not just any model of lexical cooccurrence will do, since pretrained GloVe, which does have access to such cooccurrence statistics, is one of the worst performing models of all.

These results, as of yet, do not support strong commitments as to the nature of the abstraction that confirms H2. Part of this poor performance of the shallow (non-BERT) models may be a product of the heavy constraints that we place on the space of functions that we considered from abstractions to acceptabilities. But this makes the good performance of BERT even more surprising because, though state-of-the-art performance has been demonstrated on multiple NLP tasks using its embeddings, those models often learn nonlinear mappings from the embeddings to the quantity of interest—as is common practice for neural methods in NLP (see Goldberg 2017 and references therein). While the results are consistent with ideas from linguistic theory and acquisition about what these abstractions might be like—BERT is typically thought to be rich enough that substantial syntactic/semantic information is present in it (Devlin et al. 2019)—it will require substantially more investigation to understand exactly how this model is predicting acceptability so well. We leave this as an open question for future work, noting that relevant (though still inconclusive) investigations exist in the NLP literature (Linzen et al. 2018; 2019).

7 Conclusion

This paper addresses the question of how direct the relationship between well-formedness and linguistic experience is, focusing in particular on lexical knowledge. To do this, we developed the bleaching method for scaling standard acceptability judgment experiments to very large sets of verbs. After validating this method against more standard methods, we deployed it on 1,000 clause-embedding verbs in 50 syntactic frames to create the MegaAcceptability dataset, which is publicly available at megaattitude.io under the auspices of the MegaAttitude Project. Using this dataset, which we take to exhaust the set of clause-embedding verbs in English, we found that the relationship between acceptability and subcategorization frame frequency is surprisingly weak and that shallow abstractions of the data yield minuscule improvements in the prediction of acceptability. The performance of BERT suggests that deeper abstractions, however, can do surprisingly well at predicting acceptability, though we still see quite a bit of variation.

We take our results to imply that accounts of how knowledge of c-selection is acquired must posit something beyond simple smoothing or shallow factorization, as previous computational accounts have done (Alishahi & Stevenson 2008; Barak et al. 2012). One form this might take is to rely exclusively on deep, domain-general abstraction mechanisms, like BERT. Another is to enrich shallow domain-general factorization models with tunable domain-specific biases (White et al. 2018a). There is an inherent trade-off between these approaches: (i) Occam’s razor implores us to posit as few inherent biases as possible; but (ii) the data hungriness of deep abstraction mechanisms makes a strictly domain-general model suspect, unless it comes with biases specific to language learning. A hybrid approach is almost certainly necessary, and we believe the MegaAcceptability dataset will prove useful in evaluating the success of such approaches.

Supplementary Files

All of the datasets collected by the authors for this paper, including the MegaAcceptability dataset, are available at megaattitude.io. All of the code necessary for replicating the analyses presented in this paper is available at megaattitude.io as well.

Additional Files

The additional files for this article can be found as follows:

Appendices of “Frequency, acceptability, and selection: A case study of clause-embedding”.

The experimental materials for the validation experiment and the MegaAcceptability dataset (Appendix A), the normalization procedures and corresponding analysis (Appendices B and C), and the description of a method for adding verbs to the MegaAcceptability dataset. DOI: https://doi.org/10.5334/gjgl.1001.s1


NLP = Natural Language Processing, NP = noun phrase, VP = verb phrase, S = sentence, WH = WH word, AMT = Amazon Mechanical Turk, HIT = Human Intelligence Task, MAP = maximum a posteriori, PMI = pointwise mutual information, LDA = Latent Dirichlet Allocation, LFA = logistic factor analysis, GloVe = Global Vectors, NBoW = neural bag of words, BERT = Bidirectional Encoder Representations from Transformers


  1. See also much work in the sentence processing literature (Trueswell et al. 1993; Spivey-Knowlton & Sedivy 1995; Garnsey et al. 1997; McRae et al. 1998; Altmann & Kamide 1999; Hale 2001; Levy 2008; Wells et al. 2009; Fine & Jaeger 2013; Linzen & Jaeger 2016 among others) and the language acquisition literature (Landau & Gleitman 1985; Pinker 1984; 1989; Gleitman 1990; Naigles 1990; 1996; Naigles et al. 1993; Fisher et al. 1991; Fisher 1994; Fisher et al. 1994; 2010; Lederer et al. 1995; Gillette et al. 1999; Yang 2003; 2016; Snedeker & Gleitman 2004; Lidz et al. 2004; Gleitman et al. 2005; Papafragou et al. 2007 among others).
  2. As discussed in Section 4, this dataset has appeared in brief form in five proceedings papers: White & Rawlins 2016, where it was introduced in the context of building a computational model to infer semantic types, and White & Rawlins 2018; White et al. 2018c; An & White 2020; Moon & White to appear, where it was used as a starting point for developing datasets focused on veridicality, neg(ation)-raising, and temporal interpretation that are not relevant here. The present paper reviews in much greater detail the methods for constructing the dataset, and presents validation experiments and arguments in favor of the bleaching method that have not previously appeared.
  3. This is analogous to recent approaches within the NLP literature that aim to probe what linguistic knowledge different models learn from corpus data using an array of focused datasets (Linzen et al. 2016; White et al. 2017a; 2018c; Gulordava et al. 2018; Kuncoro et al. 2018; Peters et al. 2018; Poliak et al. 2018; Conneau et al. 2018; Wang et al. 2018; Wilcox et al. 2018; McCoy et al. 2019; Kann et al. 2019; among others).
  4. An anonymous reviewer asks why White et al. 2018b, and by extension us, do not investigate just frames that take clauses. The vast majority of verbs that do take clauses participate in frames where there are no clauses present, and it is not plausible to assume that cases like this involve different verb senses (at least a priori). For example, one widely discussed example is that of verbs that take so-called concealed question NPs and the relationship of the distribution of those NP frames to full clause-embedding frames (Heim 1979; Romero 2005; Nathan 2006; Frana 2010; among others). Therefore, since the research questions are about lexical representation, not clauses per se, we include all syntactic frames that we believe may bear on this lexical representation, not just ones with clauses.
  5. White et al. note that their experiment in fact included some syntactic frames that involve degree modification, though they do not list these frames. White et al. 2014, which presents a preliminary analysis of the dataset presented in White et al. 2018b, lists four such frames (see the appendix of that paper). We do not include these in our experiment, since the analyses in White et al. 2018b do not include them, and thus our statistics would not be comparable to theirs if we did.
  6. See Appendix A for an explicit mapping of the abstract frames in (7) to their corresponding instantiated frames.
  7. White et al. only use adjunct questions to avoid free relative readings. We instead opt for D-linked WH questions, though for the same reason.
  8. The something instantiation of NP is only used for NPs in object position of a simple transitive (NP ___ed NP) and second object position in a double object construction (NP ___ed NP NP).
  9. This distribution is necessary because there is no way to enforce that each verb occur an equal number of times and that each frame occur an equal number of times without having extremely small or extremely large lists. And because our aim is to match the frames used by White et al. as closely as possible, it would be problematic to manipulate the number of frames to make this constraint feasible.
  10. Throughout the remainder of the paper, all confidence intervals are computed using nonparametric bootstraps with 999 replicates.
  11. This procedure is analogous to z-scoring the ratings by participant and then computing the average z-scored rating for each item. As shown by White et al. (2018b), the method used here better models how participants actually make acceptability judgments. See Appendices B and C for further details on the conceptual and empirical relationship between z-scoring and the ordinal model-based normalization we use.
  12. Note that these ditransitive frames are licit with verbs such as allow, deny, and forbid.
  13. These materials were first described in White & Rawlins 2016. See Appendix A for an explicit mapping of the abstract frames to their corresponding instantiated frames in Figure 3.
  14. In allowing participants to respond to multiple lists, we were attempting to balance three pressures. First, ideally, we would ask participants to respond to as many sentences as possible because an increase in the number of responses from any particular participant allows us to better normalize that participant’s ratings relative to other participants (via the prior distribution on random effects in the mixed effects model-based normalizer). Second, working against this first concern, we attempted to have as many distinct participants as possible to enable better estimation of common patterns in participants’ response behavior, which also helps us to better normalize judgments, especially for participants who gave fewer responses (again, via the prior distribution on random effects). And third, we did not know whether or not it would be feasible to recruit 5,000 distinct participants, since our experiment is substantially larger than others that do not allow repeat participation. Further, most crowd-sourcing annotation tasks of similar size to ours (e.g. those found in the NLP literature; see Callison-Burch 2019 and references therein) allow repeat participation. Thus, it was not clear whether 5,000 unique annotators would actually complete our task, which is much more involved than most large-scale crowd-sourcing tasks wherein the task takes on the order of seconds. Further, as laid out in the text, we have taken pains to correct for any potential annotation biases arising from allowing repeat participation.
  15. A full specification of this procedure, including a comparison to alternative methods for aggregating participants’ responses to particular sentences, can be found in Appendix C.
  16. The axes in Figure 7 are not labeled because these normalized scores do not have inherent meaning beyond measuring the relative acceptability of a verb in a frame.
  17. See White & Rawlins 2016 for further discussion of how this dataset might be used or validated in a formal semantics context.
  18. VALEX is available at https://ilexir.co.uk/valex/. We use the raw counts provided with the data, since all other counts involve some amount of smoothing and/or filtering.
  19. This is analogous to the Categorical-Dirichlet model: larger values of λ encourage estimates of θ̂_vf nearer 1/N_F, and larger values of γ encourage values of π_vf nearer 1/2.
  20. In the special case of λ = 0, where the MAP and maximum likelihood estimates are equivalent, this formulation follows from the relationship of the G statistic to the log-likelihood ratio test statistic, with a null hypothesis that verbs and frames are independently distributed categorical and an alternative hypothesis that they are jointly distributed categorical.

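    \[
    \mathrm{LLR}(\text{null}, \text{alt}) = -2 \log\left[\frac{\prod_{v,f} p_{\text{null}}(v,f)^{c_{vf}}}{\prod_{v,f} p_{\text{alt}}(v,f)^{c_{vf}}}\right] = 2 \sum_{v,f} c_{vf} \log\frac{\hat{p}(v,f)}{\hat{p}(v)\,\hat{p}(f)} = 2 \sum_{v,f} G(v,f)
    \]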

  21. It diverges slightly from a standard factor analysis in downweighting the contributions of low frequency words by a factor of f(c_ij) = min(1, c_ij / c_cutoff)^α. In the pretrained models, c_cutoff is set to 100 and α is set to 3/4. In the models we train ourselves, we set c_cutoff to 10 and retain the same α, since VALEX contains an order of magnitude fewer observations than the cooccurrence matrices pretrained GloVe is trained on.
  22. Pretrained GloVe is available at https://nlp.stanford.edu/projects/glove/. We specifically use the uncased, 300 dimensional vectors trained on 42 billion words of Common Crawl.
  23. Pretrained BERT models are available at https://github.com/google-research/bert. We specifically use the BERT-base-uncased models.
  24. We also experimented with extracting the embedding of the clause-embedding verb in the sentence. The results were the same.


The authors wish to thank two anonymous reviewers, Ben Van Durme, Dee Ann Reisinger, Rachel Rudinger, Charles Yang, members of the Formal and Computational Semantics Lab (FACTS.lab) at UR and the JHU Semantics Lab, and audiences at SALT 26, DGfS 2017, NELS 2017, NELS 2018, Johns Hopkins University, the University of Rochester, and Stanford University for helpful discussion of this work.

A variety of Python and R packages were used for the analyses presented in this paper, including numpy (van der Walt et al. 2011), scipy (Virtanen et al. 2020), pandas (McKinney 2011), sklearn (Pedregosa et al. 2011), tensorflow (Abadi et al. 2015), torch (Paszke et al. 2019), transformers (Wolf et al. 2019), lme4 (Bates et al. 2015), and turktools (Erlewine & Kotek 2016). All plots were generated using ggplot2 (Wickham 2016). Version information is explicitly specified in the associated analysis code available at megaattitude.io.

Funding Information

This research was funded by the following National Science Foundation grants: BCS-1748969/BCS-1749025 (The MegaAttitude Project: Investigating selection and polysemy at the scale of the lexicon), DDRIG BCS-1456013 (Learning attitude verb meanings), INSPIRE BCS-1344269 (Gradient symbolic computation) as well as the JHU Science of Learning Institute.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

White and Rawlins collaborated on designing the materials for the MegaAcceptability dataset and writing this paper. White designed the validation experiment; implemented and conducted all experiments; and developed and implemented all models and analyses.


Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu & Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/.

Adger, David. 2003. Core syntax: A minimalist approach. Oxford: Oxford University Press.

Alishahi, Afra & Suzanne Stevenson. 2008. A computational model of early argument structure acquisition. Cognitive Science 32(5). 789–834. DOI:  http://doi.org/10.1080/03640210801929287

Altmann, Gerry & Yuki Kamide. 1999. Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition 73(3). 247–264. DOI:  http://doi.org/10.1016/S0010-0277(99)00059-1

An, Hannah & Aaron White. 2020. The lexical and grammatical sources of neg-raising inferences. In Allyson Ettinger, Gaja Jarosz & Max Nelson (eds.), Proceedings of the Society for Computation in Linguistics 3. 220–233. Amherst, MA: ScholarWorks. DOI:  http://doi.org/10.7275/yts0-q989

Anand, Pranav & Valentine Hacquard. 2013. Epistemics and attitudes. Semantics and Pragmatics 6(8). 1–59. DOI:  http://doi.org/10.3765/sp.6.8

Anand, Pranav & Valentine Hacquard. 2014. Factivity, belief and discourse. In Luka Crnič & Uli Sauerland (eds.), The art and craft of semantics: A festschrift for Irene Heim 1. 69–90. Cambridge, MA: MIT Working Papers in Linguistics.

Aslin, Richard, Jenny Saffran & Elissa Newport. 1998. Computation of conditional probability statistics by 8-month-old infants. Psychological Science 9(4). 321–324. DOI:  http://doi.org/10.1111/1467-9280.00063

Baker, Simon, Roi Reichart & Anna Korhonen. 2014. An unsupervised model for instance level subcategorization acquisition. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 278–289. Doha, Qatar: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/v1/D14-1034

Barak, Libby, Afsaneh Fazly & Suzanne Stevenson. 2012. Modeling the acquisition of mental state verbs. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012), 1–10. Montréal, Canada: Association for Computational Linguistics.

Bard, Ellen Gurman, Dan Robertson & Antonella Sorace. 1996. Magnitude estimation of linguistic acceptability. Language 72(1). 32–68. DOI:  http://doi.org/10.2307/416793

Bates, Douglas, Martin Mächler, Ben Bolker & Steve Walker. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67(1). 1–48. DOI:  http://doi.org/10.18637/jss.v067.i01

Bishop, Christopher M. 2006. Pattern recognition and machine learning. New York: Springer.

Blei, David M., Andrew Y. Ng & Michael I. Jordan. 2003. Latent dirichlet allocation. The Journal of Machine Learning Research 3. 993–1022.

Brent, Michael R. 1991. Automatic acquisition of subcategorization frames from untagged text. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, 209–214. Berkeley, CA: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/981344.981371

Brent, Michael R. 1993. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics 19(2). 243–262.

Bresnan, Joan. 2007. Is syntactic knowledge probabilistic? Experiments with the English dative alternation. In Sam Featherston & Wolfgang Sternefeld (eds.), Roots: Linguistics in search of its evidential base, vol. 96 (Studies in Generative Grammar), 77–96. Berlin: De Gruyter Mouton.

Bresnan, Joan, Anna Cueni, Tatiana Nikitina & R. Harald Baayen. 2007. Predicting the dative alternation. In Gerlof Bouma, Irene Kramer & Joost Zwarts (eds.), Cognitive foundations of interpretation, 69–94. Chicago: University of Chicago Press.

Briscoe, Ted & John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Conference on Applied Natural Language Processing, 356–363. Washington, DC: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/974557.974609

Callison-Burch, Chris. 2019. Crowdsourcing and human computation. http://crowdsourcing-class.org/. Accessed: 2019-11-26.

Carroll, Glenn & Mats Rooth. 1998. Valence induction with a head-lexicalized PCFG. In Proceedings of the Third Conference on Empirical Methods for Natural Language Processing, 36–45. Granada, Spain: Association for Computational Linguistics.

Chomsky, Noam. 1965. Aspects of the theory of syntax. Cambridge, MA: MIT Press. DOI:  http://doi.org/10.21236/AD0616323

Chomsky, Noam. 1973. Conditions on transformations. In S. Anderson & P. Kiparsky (eds.), A festschrift for Morris Halle, 232–286. New York: Holt, Rinehart, & Winston.

Church, Kenneth W. & William A. Gale. 1995. Poisson mixtures. Natural Language Engineering 1(2). 163–190. DOI:  http://doi.org/10.1017/S1351324900000139

Church, Kenneth Ward & Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics 16(1). 22–29.

Clark, Alexander, Gianluca Giorgolo & Shalom Lappin. 2013a. Statistical representation of grammaticality judgements: The limits of n-gram models. In Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL), 28–36. Sofia, Bulgaria: Association for Computational Linguistics.

Clark, Alexander, Gianluca Giorgolo & Shalom Lappin. 2013b. Towards a statistical model of grammaticality. In Proceedings of the 35th Annual Conference of the Cognitive Science Society, 2064–2069. Austin, TX: Cognitive Science Society.

Clark, Alexander & Shalom Lappin. 2011. Linguistic nativism and the poverty of the stimulus. Chichester, UK: Wiley-Blackwell. DOI:  http://doi.org/10.1002/9781444390568

Conneau, Alexis, German Kruszewski, Guillaume Lample, Loïc Barrault & Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2126–2136. Melbourne, Australia: Association for Computational Linguistics. DOI:  http://doi.org/10.18653/v1/P18-1198

Devlin, Jacob, Ming-Wei Chang, Kenton Lee & Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Minnesota: Association for Computational Linguistics. DOI:  http://doi.org/10.18653/v1/N19-1423

Dudley, Rachel. 2017. The role of input in discovering presuppositions triggers: Figuring out what everybody already knew. College Park, MD: University of Maryland dissertation. DOI:  http://doi.org/10.13016/M2SX6496Z

Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1). 61–74.

Erlewine, Michael Yoshitaka & Hadas Kotek. 2016. A streamlined approach to online linguistic surveys. Natural Language & Linguistic Theory 34(2). 481–495. DOI:  http://doi.org/10.1007/s11049-015-9305-9

Featherston, Sam. 2005. Magnitude estimation and what it can do for your syntax: Some wh-constraints in German. Lingua 115(11). 1525–1550. DOI:  http://doi.org/10.1016/j.lingua.2004.07.003

Featherston, Sam. 2007. Data in generative grammar: The stick and the carrot. Theoretical Linguistics 33(3). 269–318. DOI:  http://doi.org/10.1515/TL.2007.020

Featherston, Sam. 2008. Thermometer judgments as linguistic evidence. In Claudia Maria Riehl & Astrid Rothe (eds.), Was ist linguistische evidenz? 69–89. Aachen: Shaker Verlag.

Fillmore, Charles John. 1970. The grammar of hitting and breaking. In Roderick A. Jacobs & Peter S. Rosenbaum (eds.), Readings in English transformational grammar, 120–133. Waltham, MA: Ginn.

Fine, Alex B. & T. Florian Jaeger. 2013. Evidence for implicit learning in syntactic comprehension. Cognitive Science 37(3). 578–591. DOI:  http://doi.org/10.1111/cogs.12022

Fisher, Cynthia. 1994. Structure and meaning in the verb lexicon: Input for a syntax-aided verb learning procedure. Language and Cognitive Processes 9(4). 473–517. DOI:  http://doi.org/10.1080/01690969408402129

Fisher, Cynthia, D. Geoffrey Hall, Susan Rakowitz & Lila Gleitman. 1994. When it is better to receive than to give: Syntactic and conceptual constraints on vocabulary growth. Lingua 92. 333–375. DOI:  http://doi.org/10.1016/0024-3841(94)90346-8

Fisher, Cynthia, Henry Gleitman & Lila R. Gleitman. 1991. On the semantic content of subcategorization frames. Cognitive Psychology 23(3). 331–392. DOI:  http://doi.org/10.1016/0010-0285(91)90013-E

Fisher, Cynthia, Yael Gertner, Rose M. Scott & Sylvia Yuan. 2010. Syntactic bootstrapping. Wiley Interdisciplinary Reviews: Cognitive Science 1(2). 143–149. DOI:  http://doi.org/10.1002/wcs.17

Frana, Ilaria. 2010. Concealed questions. In search of answers. Amherst, MA: University of Massachusetts dissertation.

Gahl, Susanne. 1998. Automatic extraction of subcorpora based on subcategorization frames from a part-of-speech tagged corpus. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1. 428–432. Montreal, Quebec: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/980845.980918

Garnsey, Susan M., Neal J. Pearlmutter, Elizabeth Myers & Melanie A. Lotocky. 1997. The contributions of verb bias and plausibility to the comprehension of temporarily ambiguous sentences. Journal of Memory and Language 37(1). 58–93. DOI:  http://doi.org/10.1006/jmla.1997.2512

Gibson, Edward & Evelina Fedorenko. 2010. Weak quantitative standards in linguistics research. Trends in Cognitive Science 14(6). 233–234. DOI:  http://doi.org/10.1016/j.tics.2010.03.005

Gibson, Edward & Evelina Fedorenko. 2013. The need for quantitative methods in syntax and semantics research. Language and Cognitive Processes 28(1–2). 88–124. DOI:  http://doi.org/10.1080/01690965.2010.515080

Gillette, Jane, Henry Gleitman, Lila Gleitman & Anne Lederer. 1999. Human simulations of vocabulary learning. Cognition 73(2). 135–176. DOI:  http://doi.org/10.1016/S0010-0277(99)00036-0

Gleitman, Lila. 1990. The structural sources of verb meanings. Language Acquisition 1(1). 3–55. DOI:  http://doi.org/10.1207/s15327817la0101_2

Gleitman, Lila R., Kimberly Cassidy, Rebecca Nappa, Anna Papafragou & John C. Trueswell. 2005. Hard words. Language Learning and Development 1(1). 23–64. DOI:  http://doi.org/10.1207/s15473341lld0101_4

Goldberg, Yoav. 2017. Neural network methods for natural language processing (Synthesis Lectures on Human Language Technologies 37). San Rafael, CA: Morgan & Claypool. DOI:  http://doi.org/10.2200/S00762ED1V01Y201703HLT037

Grimshaw, Jane. 1979. Complement selection and the lexicon. Linguistic Inquiry 10(2). 279–326.

Grimshaw, Jane. 1981. Form, function and the language acquisition device. In C. L. Baker & John J. McCarthy (eds.), The logical problem of language acquisition, 165–182. Cambridge, MA: MIT Press.

Grimshaw, Jane. 1990. Argument structure. Cambridge, MA: MIT Press.

Grimshaw, Jane. 1994. Lexical reconciliation. Lingua 92. 411–430. DOI:  http://doi.org/10.1016/0024-3841(94)90348-4

Gruber, Jeffrey Steven. 1965. Studies in lexical relations. Cambridge, MA: Massachusetts Institute of Technology dissertation.

Gulordava, Kristina, Piotr Bojanowski, Edouard Grave, Tal Linzen & Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1195–1205. New Orleans, LA: Association for Computational Linguistics. DOI:  http://doi.org/10.18653/v1/N18-1108

Hacquard, Valentine & Alexis Wellwood. 2012. Embedding epistemic modals in English: A corpus-based study. Semantics and Pragmatics 5(4). 1–29. DOI:  http://doi.org/10.3765/sp.5.4

Hale, John. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies (NAACL’01), 1–8. Stroudsburg, PA: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/1073336.1073357

Hankamer, Jorge & Ivan Sag. 1976. Deep and surface anaphora. Linguistic Inquiry 7(3). 391–428.

Heim, Irene. 1979. Concealed questions. In Rainer Bäuerle, Urs Egli & Arnim von Stechow (eds.), Semantics from different points of view (Springer Series in Language and Communication 6), 51–60. Springer. DOI:  http://doi.org/10.1007/978-3-642-67458-7_5

Hofmeister, Philip & Ivan A. Sag. 2010. Cognitive constraints and island effects. Language 86(2). 366–415. DOI:  http://doi.org/10.1353/lan.0.0223

Iyyer, Mohit, Varun Manjunatha, Jordan Boyd-Graber & Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1681–1691. Beijing, China: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/v1/P15-1162

Jackendoff, Ray. 1972. Semantic interpretation in generative grammar. Cambridge, MA: MIT Press.

Jurafsky, D. & J. H. Martin. 2009. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, NJ: Pearson-Prentice Hall.

Kako, Edward. 1997. Subcategorization semantics and the naturalness of verb-frame pairings. University of Pennsylvania Working Papers in Linguistics 4(2). 155–167.

Kann, Katharina, Alex Warstadt, Adina Williams & Samuel R. Bowman. 2019. Verb argument structure alternations in word and sentence embeddings. In Gaja Jarosz, Max Nelson, Brendan O’Connor & Joe Pater (eds.), Proceedings of the Society for Computation in Linguistics 2. 287–297. DOI:  http://doi.org/10.7275/q5js-4y86

Keller, Frank. 2000. Gradience in grammar: Experimental and computational aspects of degrees of grammaticality. Edinburgh, UK: University of Edinburgh dissertation.

Kipper-Schuler, Karin. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon. Philadelphia, PA: University of Pennsylvania dissertation.

Kluender, Robert & Marta Kutas. 1993. Subjacency as a processing phenomenon. Language and Cognitive Processes 8(4). 573–633. DOI:  http://doi.org/10.1080/01690969308407588

Korhonen, Anna, Genevieve Gorrell & Diana McCarthy. 2000. Statistical filtering and subcategorization frame acquisition. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 199–206. Hong Kong, China: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/1117794.1117819

Korhonen, Anna, Yuval Krymolowski & Ted Briscoe. 2006. A large subcategorization lexicon for natural language processing applications. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06). Genoa, Italy: European Language Resources Association (ELRA).

Korhonen, Anna, Yuval Krymolowski & Zvika Marx. 2003. Clustering polysemic subcategorization frame distributions semantically. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 64–71. Sapporo, Japan: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/1075096.1075105

Kuncoro, Adhiguna, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark & Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1426–1436. Melbourne, Australia: Association for Computational Linguistics. DOI:  http://doi.org/10.18653/v1/P18-1132

Kush, Dave, Terje Lohndal & Jon Sprouse. 2018. Investigating variation in island effects. Natural Language & Linguistic Theory 36(3). 743–779. DOI:  http://doi.org/10.1007/s11049-017-9390-z

Landau, Barbara & Lila R. Gleitman. 1985. Language and experience: Evidence from the blind child (Cognitive Science Series 8). Cambridge, MA: Harvard University Press.

Lapata, Maria. 1999. Acquiring lexical generalizations from corpora: A case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 397–404. College Park, MD: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/1034678.1034740

Lasnik, Howard. 1989. On certain substitutes for negative data. In R. J. Matthews & William Demopoulos (eds.), Learnability and linguistic theory (Studies in Theoretical Psycholinguistics 9), 89–105. Dordrecht: Springer. DOI:  http://doi.org/10.1007/978-94-009-0955-7

Lau, Jey Han, Alexander Clark & Shalom Lappin. 2017. Grammaticality, acceptability, and probability: A probabilistic view of linguistic knowledge. Cognitive Science 41(5). 1202–1241. DOI:  http://doi.org/10.1111/cogs.12414

Lederer, Anne, Henry Gleitman & Lila Gleitman. 1995. Verbs of a feather flock together: Semantic information in the structure of maternal speech. In M. Tomasello & W. E. Merriman (eds.), Beyond names for things: Young children’s acquisition of verbs, 277–297. Hillsdale, NJ: Lawrence Erlbaum.

Levin, Beth. 1993. English verb classes and alternations: A preliminary investigation. Chicago: University of Chicago Press.

Levin, Beth & Malka Rappaport Hovav. 2005. Argument realization. Cambridge: Cambridge University Press. DOI:  http://doi.org/10.1017/CBO9780511610479

Levy, Omer & Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence & Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, 2177–2185. Red Hook, NY: Curran Associates, Inc.

Levy, Roger. 2008. Expectation-based syntactic comprehension. Cognition 106(3). 1126–1177. DOI:  http://doi.org/10.1016/j.cognition.2007.05.006

Lewis, Shevaun, Valentine Hacquard & Jeffrey Lidz. 2017. “Think” pragmatically: Children’s interpretation of belief reports. Language Learning and Development 13(4). 395–417. DOI:  http://doi.org/10.1080/15475441.2017.1296768

Lidz, Jeffrey, Henry Gleitman & Lila Gleitman. 2004. Kidz in the hood: Syntactic bootstrapping and the mental lexicon. In D. Geoffrey Hall & Sandra R. Waxman (eds.), Weaving a lexicon, 603–636. Cambridge, MA: MIT Press.

Linzen, Tal, Grzegorz Chrupała & Afra Alishahi (eds.) 2018. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP. Brussels, Belgium: Association for Computational Linguistics.

Linzen, Tal, Emmanuel Dupoux & Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4. 521–535. DOI:  http://doi.org/10.1162/tacl_a_00115

Linzen, Tal, Grzegorz Chrupała, Yonatan Belinkov & Dieuwke Hupkes (eds.) 2019. Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP. Florence, Italy: Association for Computational Linguistics.

Linzen, Tal & T. Florian Jaeger. 2016. Uncertainty and expectation in sentence processing: Evidence from subcategorization distributions. Cognitive Science 40(6). 1382–1411. DOI:  http://doi.org/10.1111/cogs.12274

Lippincott, Thomas, Anna Korhonen & Diarmuid Ó Séaghdha. 2012. Learning syntactic verb frames using graphical models. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 420–429. Jeju Island, Korea: Association for Computational Linguistics.

Manning, Chris & Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

Manning, Christopher D. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 235–242. Columbus, OH: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/981574.981606

Maye, Jessica, Janet F. Werker & LouAnn Gerken. 2002. Infant sensitivity to distributional information can affect phonetic discrimination. Cognition 82(3). B101–B111. DOI:  http://doi.org/10.1016/S0010-0277(01)00157-3

McCoy, Tom, Ellie Pavlick & Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3428–3448. Florence, Italy: Association for Computational Linguistics. DOI:  http://doi.org/10.18653/v1/P19-1334

McKinney, Wes. 2011. pandas: A foundational Python library for data analysis and statistics. Python for High Performance and Scientific Computing 14.

McRae, Ken, Michael J. Spivey-Knowlton & Michael K. Tanenhaus. 1998. Modeling the influence of thematic fit (and other constraints) in on-line sentence comprehension. Journal of Memory and Language 38(3). 283–312. DOI:  http://doi.org/10.1006/jmla.1997.2543

Moon, Ellise & Aaron Steven White. to appear. The source of nonfinite temporal interpretation. In Proceedings of the 50th Annual Meeting of the North East Linguistic Society. Amherst, MA: GLSA Publications.

Naigles, L., Henry Gleitman & Lila Gleitman. 1993. Syntactic bootstrapping and verb acquisition. In Esther Dromi (ed.), Language and cognition: A developmental perspective (Human Development Series). Norwood, NJ: Ablex.

Naigles, Letitia. 1990. Children use syntax to learn verb meanings. Journal of Child Language 17(2). 357–374. DOI:  http://doi.org/10.1017/S0305000900013817

Naigles, Letitia. 1996. The use of multiple frames in verb learning via syntactic bootstrapping. Cognition 58(2). 221–251. DOI:  http://doi.org/10.1016/0010-0277(95)00681-8

Nathan, Lance Edward. 2006. On the interpretation of concealed questions. Cambridge, MA: Massachusetts Institute of Technology dissertation.

O’Donovan, Ruth, Michael Burke, Aoife Cahill, Josef van Genabith & Andy Way. 2005. Large-scale induction and evaluation of lexical resources from the Penn-II and Penn-III treebanks. Computational Linguistics 31(3). 329–366. DOI:  http://doi.org/10.1162/089120105774321073

Papafragou, Anna, Kimberly Cassidy & Lila Gleitman. 2007. When we think about thinking: The acquisition of belief verbs. Cognition 105(1). 125–165. DOI:  http://doi.org/10.1016/j.cognition.2006.09.008

Parisien, Christopher & Suzanne Stevenson. 2010. Learning verb alternations in a usage-based Bayesian model. In Stellan Ohlsson & Richard Catrambone (eds.), Proceedings of the 32nd Annual Meeting of the Cognitive Science Society, 2674–2679. Austin, TX: Cognitive Science Society.

Paszke, Adam, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai & Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Hanna Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily Fox & Roman Garnett (eds.), Advances in Neural Information Processing Systems 32, 8024–8035. Red Hook, NY: Curran Associates, Inc.

Pearl, Lisa & Jon Sprouse. 2013. Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition 20(1). 23–68. DOI:  http://doi.org/10.1080/10489223.2012.738742

Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot & Edouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12. 2825–2830.

Pennington, Jeffrey, Richard Socher & Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. Doha, Qatar: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/v1/D14-1162

Perfors, Andrew, Joshua B. Tenenbaum & Elizabeth Wonnacott. 2010. Variability, negative evidence, and the acquisition of verb argument constructions. Journal of Child Language 37(3). 607–642. DOI:  http://doi.org/10.1017/S0305000910000012

Pesetsky, David. 1982. Paths and categories. Cambridge, MA: Massachusetts Institute of Technology dissertation.

Pesetsky, David. 1991. Zero syntax: Vol. 2: Infinitives.

Peters, Matthew, Mark Neumann, Luke Zettlemoyer & Wen-tau Yih. 2018. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 1499–1509. Brussels, Belgium: Association for Computational Linguistics. DOI:  http://doi.org/10.18653/v1/D18-1179

Pinker, Steven. 1984. Language learnability and language development (Cognitive Science Series 7). Cambridge, MA: Harvard University Press.

Pinker, Steven. 1989. Learnability and cognition: The acquisition of argument structure (Learning, Development, and Conceptual Change). Cambridge, MA: MIT Press.

Pinker, Steven. 1994. How could a child use verb syntax to learn verb semantics? Lingua 92. 377–410. DOI:  http://doi.org/10.1016/0024-3841(94)90347-6

Poliak, Adam, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White & Benjamin Van Durme. 2018. Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 67–81. Brussels, Belgium: Association for Computational Linguistics. DOI:  http://doi.org/10.18653/v1/D18-1007

Preiss, Judita, Ted Briscoe & Anna Korhonen. 2007. A system for large-scale acquisition of verbal, nominal and adjectival subcategorization frames from corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 912–919. Prague, Czech Republic: Association for Computational Linguistics.

Rawlins, Kyle. 2013. About ‘about’. Semantics and Linguistic Theory 23. 336–357. DOI:  http://doi.org/10.3765/salt.v23i0.2688

Resnik, Philip. 1996. Selectional constraints: An information-theoretic model and its computational realization. Cognition 61(1). 127–159. DOI:  http://doi.org/10.1016/S0010-0277(96)00722-6

Romero, Maribel. 2005. Concealed questions and specificational subjects. Linguistics and Philosophy 28(6). 687–737. DOI:  http://doi.org/10.1007/s10988-005-2654-9

Ross, John Robert. 1972. Act. In Donald Davidson & Gilbert Harman (eds.), Semantics of natural language, 70–126. Dordrecht: Springer Netherlands. DOI:  http://doi.org/10.1007/978-94-010-2557-7_4

Ross, John Robert. 1973. Slifting. In Maurice Gross, Morris Halle & Marcel-Paul Schützenberger (eds.), The formal analysis of natural languages, 133–170. The Hague: Mouton de Gruyter. DOI:  http://doi.org/10.1515/9783110885248-009

Saffran, Jenny, Elissa Newport & Richard Aslin. 1996b. Word segmentation: The role of distributional cues. Journal of Memory and Language 35(4). 606–621. DOI:  http://doi.org/10.1006/jmla.1996.0032

Saffran, Jenny, Richard Aslin & Elissa Newport. 1996a. Statistical learning by 8-month-old infants. Science 274(5294). 1926–1928. DOI:  http://doi.org/10.1126/science.274.5294.1926

Schulte im Walde, Sabine. 2006. Experiments on the automatic induction of German semantic verb classes. Computational Linguistics 32(2). 159–194. DOI:  http://doi.org/10.1162/coli.2006.32.2.159

Schulte im Walde, Sabine & Chris Brew. 2002. Inducing German semantic verb classes from purely syntactic subcategorisation information. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 223–230. Philadelphia, PA: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/1073083.1073121

Schütze, Carson T. & Jon Sprouse. 2014. Judgment data. In Robert J. Podesva & Devyani Sharma (eds.), Research methods in linguistics, 27–50. Cambridge: Cambridge University Press. DOI:  http://doi.org/10.1017/CBO9781139013734.004

Snedeker, Jesse & Lila Gleitman. 2004. Why it is hard to label our concepts. In D. Geoffrey Hall & Sandra R. Waxman (eds.), Weaving a lexicon, 257–294. Cambridge, MA: MIT Press.

Sorace, Antonella & Frank Keller. 2005. Gradience in linguistic data. Lingua 115(11). 1497–1524. DOI:  http://doi.org/10.1016/j.lingua.2004.07.002

Spivey-Knowlton, Michael & Julie C. Sedivy. 1995. Resolving attachment ambiguities with multiple constraints. Cognition 55(3). 227–267. DOI:  http://doi.org/10.1016/0010-0277(94)00647-4

Sprouse, Jon. 2007. Continuous acceptability, categorical grammaticality, and experimental syntax. Biolinguistics 1. 123–134.

Sprouse, Jon. 2011. A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory. Behavior Research Methods 43(1). 155–167. DOI:  http://doi.org/10.3758/s13428-010-0039-7

Sprouse, Jon, Beracah Yankama, Sagar Indurkhya, Sandiway Fong & Robert C. Berwick. 2018. Colorless green ideas do sleep furiously: Gradient acceptability and the nature of the grammar. The Linguistic Review 35(3). 575–599. DOI:  http://doi.org/10.1515/tlr-2018-0005

Sprouse, Jon, Carson T. Schütze & Diogo Almeida. 2013. A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001–2010. Lingua 134. 219–248. DOI:  http://doi.org/10.1016/j.lingua.2013.07.002

Sprouse, Jon & Diogo Almeida. 2013. The empirical status of data in syntax: A reply to Gibson and Fedorenko. Language and Cognitive Processes 28(3). 222–228. DOI:  http://doi.org/10.1080/01690965.2012.703782

Sprouse, Jon, Matt Wagers & Colin Phillips. 2012. A test of the relation between working-memory capacity and syntactic island effects. Language 88(1). 82–123. DOI:  http://doi.org/10.1353/lan.2012.0004

Sun, Lin, Anna Korhonen & Yuval Krymolowski. 2008. Automatic classification of English verbs using rich syntactic features. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II. Hyderabad, India: Asian Federation of Natural Language Processing.

Suzuki, Jun & Masaaki Nagata. 2015. A unified learning framework of skipgrams and global vectors. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 186–191. Beijing, China: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/v1/P15-2031

Trueswell, John C., Michael K. Tanenhaus & Christopher Kello. 1993. Verb-specific constraints in sentence processing: Separating effects of lexical preference from garden-paths. Journal of Experimental Psychology: Learning, Memory, and Cognition 19(3). 528. DOI:  http://doi.org/10.1037//0278-7393.19.3.528

Ushioda, Akira, David A. Evans, Ted Gibson & Alex Waibel. 1993. The automatic acquisition of frequencies of verb subcategorization frames from tagged corpora. In Branimir Boguraev & James Pustejovsky (eds.), Acquisition of lexical knowledge from text. Columbus, OH: Association for Computational Linguistics.

Van de Cruys, Tim, Laura Rimell, Thierry Poibeau & Anna Korhonen. 2012. Multi-way tensor factorization for unsupervised lexical acquisition. In Proceedings of COLING 2012, 2703–2720. Mumbai, India: The COLING 2012 Organizing Committee.

van der Walt, Stéfan, S. Chris Colbert & Gael Varoquaux. 2011. The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering 13(2). 22–30. DOI:  http://doi.org/10.1109/MCSE.2011.37

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser & Illia Polosukhin. 2017. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna Wallach, Rob Fergus, S. V. N. Vishwanathan & Roman Garnett (eds.), Advances in Neural Information Processing Systems 30, 5998–6008. Red Hook, NY: Curran Associates, Inc.

Virtanen, Pauli, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C. J. Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt & SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods 17. 261–272. DOI:  http://doi.org/10.1038/s41592-019-0686-2

Vlachos, Andreas, Anna Korhonen & Zoubin Ghahramani. 2009. Unsupervised and constrained Dirichlet process mixture models for verb clustering. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, 74–82. Athens, Greece: Association for Computational Linguistics. DOI:  http://doi.org/10.3115/1705415.1705425

Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy & Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP, 353–355. Brussels, Belgium: Association for Computational Linguistics. DOI:  http://doi.org/10.18653/v1/W18-5446

Warstadt, Alex, Amanpreet Singh & Samuel R. Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7. 625–641. DOI:  http://doi.org/10.1162/tacl_a_00290

Wells, Justine B., Morten H. Christiansen, David S. Race, Daniel J. Acheson & Maryellen C. MacDonald. 2009. Experience and sentence processing: Statistical learning and relative clause comprehension. Cognitive Psychology 58(2). 250–271. DOI:  http://doi.org/10.1016/j.cogpsych.2008.08.002

White, Aaron Steven. 2015. Information and incrementality in syntactic bootstrapping. College Park, MD: University of Maryland dissertation. DOI:  http://doi.org/10.13016/M2X938

White, Aaron Steven. 2020. Nothing’s wrong with believing (or hoping) whether.

White, Aaron Steven & Kyle Rawlins. 2016. A computational model of S-selection. Semantics and Linguistic Theory 26. 641–663. DOI:  http://doi.org/10.3765/salt.v26i0.3819

White, Aaron Steven & Kyle Rawlins. 2018. The role of veridicality and factivity in clause selection. In Sherry Hucklebridge & Max Nelson (eds.), Proceedings of the 48th Annual Meeting of the North East Linguistic Society, 221–234. Amherst, MA: GLSA Publications.

White, Aaron Steven, Philip Resnik, Valentine Hacquard & Jeffrey Lidz. 2017b. The contextual modulation of semantic information.

White, Aaron Steven, Pushpendre Rastogi, Kevin Duh & Benjamin Van Durme. 2017a. Inference is everything: Recasting semantic resources into a unified evaluation framework. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 996–1005. Taipei, Taiwan: Asian Federation of Natural Language Processing.

White, Aaron Steven, Rachel Dudley, Valentine Hacquard & Jeffrey Lidz. 2014. Discovering classes of attitude verbs using subcategorization frame distributions. In Hsin-Lun Huang, Ethan Poole & Amanda Rysling (eds.), Proceedings of the 43rd Annual Meeting of the North East Linguistic Society, 249–260. Amherst, MA: GLSA Publications.

White, Aaron Steven, Rachel Rudinger, Kyle Rawlins & Benjamin Van Durme. 2018c. Lexicosyntactic inference in neural models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4717–4724. Brussels, Belgium: Association for Computational Linguistics. DOI:  http://doi.org/10.18653/v1/D18-1501

White, Aaron Steven, Valentine Hacquard & Jeffrey Lidz. 2018a. The labeling problem in syntactic bootstrapping: Main clause syntax in the acquisition of propositional attitude verbs. In Kristen Syrett & Sudha Arunachalam (eds.), Semantics in acquisition (Trends in Language Acquisition Research 24), 198–220. John Benjamins Publishing Company. DOI:  http://doi.org/10.1075/tilar.24.09whi

White, Aaron Steven, Valentine Hacquard & Jeffrey Lidz. 2018b. Semantic information and the syntax of propositional attitude verbs. Cognitive Science 42(2). 416–456. DOI:  http://doi.org/10.1111/cogs.12512

Wickham, Hadley. 2016. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag. DOI:  http://doi.org/10.1007/978-0-387-98141-3

Wilcox, Ethan, Roger Levy, Takashi Morita & Richard Futrell. 2018. What do RNN language models learn about filler–gap dependencies? In Proceedings of the 2018 EMNLP workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP, 211–221. Brussels, Belgium: Association for Computational Linguistics. DOI:  http://doi.org/10.18653/v1/W18-5423

Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest & Alexander M. Rush. 2019. Huggingface’s Transformers: State-of-the-art natural language processing.

Yang, Charles. 2003. Knowledge and learning in natural language. Oxford: Oxford University Press.

Yang, Charles. 2016. The price of linguistic productivity: How children learn to break the rules of language. Cambridge, MA: MIT Press. DOI:  http://doi.org/10.7551/mitpress/9780262035323.001.0001

Zhou, M. & L. Carin. 2015. Negative binomial process count and mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(2). 307–320. DOI:  http://doi.org/10.1109/TPAMI.2013.211

Zhou, Mingyuan. 2018. Nonparametric Bayesian negative binomial factor analysis. Bayesian Analysis 13(4). 1065–1093. DOI:  http://doi.org/10.1214/17-BA1070

Zhou, Mingyuan, Lauren Hannah, David Dunson & Lawrence Carin. 2012. Beta-negative binomial process and Poisson factor analysis. In Neil D. Lawrence & Mark Girolami (eds.), Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, vol. 22 (Proceedings of Machine Learning Research), 1462–1471. La Palma, Canary Islands: PMLR.