The reliability of acceptability judgments across languages

The reliability of acceptability judgments made by individual linguists has often been called into question. Recent large-scale replication studies conducted in response to this criticism have shown that the majority of published English acceptability judgments are robust. We make two observations about these replication studies. First, we raise the concern that English acceptability judgments may be more reliable than judgments in other languages. Second, we argue that it is unnecessary to replicate judgments that illustrate uncontroversial descriptive facts; rather, candidates for replication can emerge during formal or informal peer review. We present two experiments motivated by these arguments. Published Hebrew and Japanese acceptability contrasts considered questionable by the authors of the present paper were rated for acceptability by a large sample of naive participants. Approximately half of the contrasts did not replicate. We suggest that the reliability of acceptability judgments, especially in languages other than English, can be improved using a simple open review system, and that formal experiments are only necessary in controversial cases.


Introduction
Acceptability judgments are a major source of data in linguistics. Most of the acceptability contrasts reported in the literature reflect the judgment of a single individual (the author of the article), occasionally with feedback from colleagues. The reliability of such judgments has repeatedly come under criticism (Langendoen et al. 1973; Schütze 1996; Edelman & Christiansen 2003; Gibson & Fedorenko 2010; Gibson et al. 2013). It has been argued, for example, that "the journals are full of papers containing highly questionable data, as readers can verify simply by perusing the examples in nearly any syntax article about a familiar language" (Wasow & Arnold 2005: 1484). If this criticism turned out to be correct, decades of syntactic theory would appear to be standing on shaky empirical ground. Although some of the critics have singled out generative syntacticians in particular, this issue applies to other research communities as well, as the debate around the use of introspective judgments in The Cambridge Grammar of the English Language illustrates (Huddleston & Pullum 2002a; b).
Other authors have defended the field's reliance on individual linguists' judgments (Phillips & Lasnik 2003; Featherston 2009; Phillips 2010). Proponents of this methodology have pointed out that while acceptability judgments are initially made by a single author, they are subsequently subjected to several stages of formal and informal peer review before being published, and as such are likely to be robust. This defense of judgments given by an individual author is supported by the results of a number of recent judgment collection experiments, which replicated the overwhelming majority of English judgments from a Minimalist syntax textbook and from articles published in the generative linguistics journal Linguistic Inquiry (Sprouse & Almeida 2012; Sprouse et al. 2013; Mahowald et al. 2016).
The debate surrounding the reliability of acceptability judgments has so far been limited to English. While English is the source of a sizable proportion of the judgments in the literature (for better or worse, it accounted for about half of the syntactic acceptability judgments reported between 2001 and 2010 in Linguistic Inquiry; Sprouse et al. 2013), theoretical developments in linguistics are often driven by data from other languages. The current study takes a first step in expanding the debate on judgment reliability beyond English by conducting acceptability judgment replication studies in Hebrew and Japanese. Although there are fairly active communities of syntacticians working on both of these languages, those communities are undoubtedly smaller than the community of syntacticians working on English; the English syntax community includes not only a large number of native English speakers but also an even larger number of linguists who do not speak English as a native language but are highly proficient in it.
The contrasts selected for replication in previous experiments were sampled at random. In those studies, a large representative sample was necessary to estimate the proportion of reliable judgments in a particular body of work (Sprouse et al. 2013; Mahowald et al. 2016). As we argue in Section 2, however, many of the judgments in the literature are self-evident; for example, a syntax textbook might include the example *the bear snuffleds to illustrate that past tense forms in English do not carry overt person agreement marking. These are not the judgments that critics take issue with; short of deliberate fraud by the author, such judgments are very likely to be replicable. Since such judgments are fairly common, the replication rate in a study based on a random sample of sentences conflates the proportion of self-evident judgments in the sample with the reliability of potentially questionable judgments, and is therefore difficult to interpret.
Our goal is to show that questionable contrasts can be reliably identified by a linguist, and that consequently a more extensive review process, such as the one to which English judgments tend to be subjected, would have kept those judgments out of the published record. We therefore do not attempt to construct a representative sample of judgments; instead, we undersample self-evident contrasts (four in each language) and oversample ones that we as native speakers of Hebrew and Japanese deemed to be questionable (14 in each language).
The rest of this paper is organized as follows. Section 2 describes the methods we used to conduct acceptability judgment replication experiments in Hebrew and Japanese. In Section 3, we show that half of the Hebrew contrasts and a third of the Japanese contrasts that we deemed to be questionable failed to replicate in formal experiments. In Section 4, we discuss the interpretation of these results and suggest ways in which the benefits of informal peer review can be extended beyond English. Section 5 concludes the paper.

Participants
We conducted two acceptability rating experiments: one in Hebrew and one in Japanese. The Hebrew experiment was completed by 76 participants, and the Japanese experiment by 98 participants. All participants were volunteers recruited through Facebook (it is difficult to recruit a large enough sample of Hebrew- and Japanese-speaking participants on paid platforms such as Amazon Mechanical Turk). We asked participants not to take part in the study if they did not satisfy the following two conditions: (1) they lived in Israel / Japan in the first 13 years of their lives, except for short breaks; and (2) their parents spoke Hebrew / Japanese to them.

Materials
As we mentioned above, we did not attempt to select a representative sample of judgments from the literature. We illustrate the motivation for this decision using the three-way classification of syntactic judgments proposed by Marantz (2005). The first category of judgments discussed by Marantz, which we refer to as Class I judgments, consists of "word salads": sequences of words that are so far from the grammar of the language that they cannot even be assigned a phonological representation. The following "word salad", for example, illustrates what English sentences would look like if English were a head-final language like Japanese (Marantz 2005: 433):
(1) *Man the book a woman those to given has.
The second category (Class II) includes judgments that illustrate uncontroversial facts about the grammar of the language, facts of the sort that might be presupposed in a theoretical analysis. The following contrast, for example, shows that English verbs agree in number with their subject (Marantz 2005: 434):
(2) a. The men are leaving.
    b. *The men is leaving.
The third category (Class III) includes more subtle contrasts, such as constraints on wh-movement or on possible coreference relations across noun phrases. The judgments that critics take issue with typically fall into this category. Gibson et al. (2013) refer to this class as "theoretically meaningful contrasts". We prefer to use the more neutral term Class III judgments: it is often difficult to assess the theoretical significance of a particular contrast, and it is not clear whether there is a relationship between the "obviousness" of a judgment and its theoretical import. The work on English by Sprouse and colleagues attempted to replicate a random sample of all published judgments, regardless of their class. One of the contrasts from Adger (2003) replicated by Sprouse & Almeida (2012) is shown in (3):
(3) a. The bear snuffled.
    b. *The bear snuffleds.
This contrast was replicated by a large margin, as were other Class II judgments. One might object that textbooks such as Adger (2003) contain more Class II judgments for pedagogical reasons. In reality, however, such judgments are also fairly common in the sample from Linguistic Inquiry investigated by Sprouse et al. (2013), e.g.:
(4) a. I hate eating sushi.
    b. *I seem eating sushi.
(5) a. Wallace and Greg like each other.
    b. *Each other like Wallace and Greg.
We can be fairly confident that judgments of this type will be reliably replicated with naive subjects (Mahowald et al. 2016). Consequently, our study focuses on Class III judgments. For the experiments reported below, we (linguists who are native speakers of Hebrew or Japanese) selected 18 acceptability contrasts in each language, 14 of which were Class III contrasts from the literature that we believed were potentially questionable (henceforth "critical items") and four of which were uncontroversial Class II contrasts (henceforth "control items"). We limited the total number of contrasts in each language to 18 to keep the experiment short (this was necessary since all of our participants were volunteers). While we did not record the overall number of judgments (questionable and unquestionable) in the articles we examined, it is clear that there were many more unquestionable judgments than questionable ones. For example, Borer (1995), which was the source of three of our questionable Hebrew contrasts, contains more than a hundred Hebrew examples. At the same time, we did not attempt to compile an exhaustive list of all questionable judgments in the articles we examined; the particular paper mentioned above, for example, contains additional questionable judgments that are relatively similar to the three judgments we tested and so were not included in our experiment.
The full list of materials is given in Section 2.3 (for Hebrew) and Section 2.4 (for Japanese); these sections can be safely skipped in a first reading of this article.

Hebrew contrasts
The Hebrew judgments were primarily drawn from peer-reviewed articles, in particular the Special Hebrew Issue of Natural Language and Linguistic Theory (August 1995) and other issues of Natural Language and Linguistic Theory and Linguistic Inquiry, as well as from two books: a collection of articles (Armon-Lotem et al. 2008) and a frequently cited dissertation published as a book (Shlonsky 1997). Some of the Hebrew judgments concerned DPs (noun phrases) rather than entire sentences, such as a contrast from Belletti & Shlonsky (1995: 517). We embedded these DPs in simple sentences; the added material is represented in the appendix using square brackets. All of the contrasts involved judgments on strings (is this sentence acceptable?), rather than judgments under an interpretation (can this sentence have this particular meaning?). The Hebrew judgments concerned word order (H3, H8, H9, H10, H13), argument optionality (H12) and omissibility of elements in coordination (H5, H14), among other phenomena. The articles that the judgments were drawn from used a variety of romanization schemes; here we use a unified scheme that reflects modern pronunciation (x represents the voiceless velar fricative [x]). We kept the glosses used in the original articles even when our own judgments about the meaning of certain words diverged from the original authors'.

Japanese contrasts
The Japanese judgments were selected from a number of sources: peer-reviewed papers published in Natural Language and Linguistic Theory, Linguistic Inquiry, and Journal of East Asian Linguistics, as well as in Japanese-specific journals; a dissertation published as a book (Miyagawa 1989); and three unpublished but widely cited dissertations (Farmer 1980; Hoji 1985; Oku 1998). Some of the Japanese judgments were bound to particular semantic interpretations (e.g., scope interpretations). In those cases in which the acceptability of the sentences was to be evaluated given a particular interpretation, explicit contexts were given to the participants; the participants were asked to rate the sentences under those contexts. The English translations of the context sentences are indicated with parentheses.

Procedure
The experiments were administered using a website created for this purpose. The instructions were based on those used by Sprouse & Almeida (2012). The participants were asked to rate each sentence on a scale from 1 (very bad) to 7 (very good). We emphasized that an acceptable sentence was not necessarily one that would be approved by official language institutions, but rather one that would not sound out of place when uttered by a native speaker in a conversation. Only a single lexicalization of each contrast was presented to participants.
Our participants rated both members of each contrast separately; other sentences were presented between the two members of the contrast, as we describe below. This design differs from standard practice in cognitive psychology, where care is taken to ensure that the same participant is not exposed to multiple versions of the same item. We believe that the concerns that motivate this practice in cognitive psychology do not apply to the case of acceptability judgments, because the original data point is itself an explicit comparison between two sentences; in fact, in some judgment replication studies both members of the contrast are displayed simultaneously and participants are instructed to make a forced choice between them (Sprouse et al. 2013).
The stimuli were divided into two blocks; each block contained one of the members of each contrast. Participants were not made aware of this division. The assignment of contrast members to blocks was counterbalanced across participants: for a given contrast, approximately half of the participants rated the unstarred member of the contrast first, and the other half rated the starred member first. The allocation of contrast members to blocks was performed such that each block contained an equal number of unstarred and starred sentences, to avoid response bias (Sprouse 2009). The order of sentences within each block was pseudo-randomized such that no more than three consecutive sentences had the same acceptability annotation (starred or unstarred). The uncontroversial judgments were presented first in each block, to familiarize the participants with the task. Finally, we ensured that the first three sentences presented to a participant always included both starred and unstarred sentences (presented without the stars, of course).
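The run-length constraint on sentence order can be illustrated with a short sketch. It is not the code used in the experiments: the function names, the rejection-sampling strategy, and the 14-item toy block are assumptions made for illustration, and the sketch does not handle the placement of control items at the start of a block.

```python
import random


def longest_run(order):
    """Length of the longest stretch of consecutive items with the same annotation."""
    longest = current = 0
    previous = None
    for _, starred in order:
        current = current + 1 if starred == previous else 1
        previous = starred
        longest = max(longest, current)
    return longest


def pseudo_randomize(items, max_run=3, rng=None, max_tries=10_000):
    """Shuffle (sentence_id, is_starred) pairs until no run exceeds max_run.

    Rejection sampling: reshuffle and keep the first order that satisfies
    the constraint (an assumed strategy; the paper does not specify one).
    """
    rng = rng or random.Random()
    order = list(items)
    for _ in range(max_tries):
        rng.shuffle(order)
        if longest_run(order) <= max_run:
            return order
    raise RuntimeError("no valid order found within max_tries")


# Hypothetical block: 14 sentences, half starred and half unstarred.
block = [(f"S{i}", i % 2 == 0) for i in range(14)]
order = pseudo_randomize(block, rng=random.Random(0))
```

Rejection sampling keeps the sketch simple and leaves every admissible order equally likely, at the cost of a few discarded shuffles.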
Participants in the Hebrew experiment also rated a few unpaired sentences for acceptability; these sentences appeared in a middle block, between the two blocks reserved for acceptability contrasts. The ratings of these sentences are not analyzed in the current paper.

Results
The mean acceptability ratings for each of the sentences are shown in Figure 1 (for Hebrew) and Figure 2 (for Japanese). We assessed the statistical significance of the results using a two-tailed paired t test for each contrast separately (see Sprouse et al. 2013 for a discussion of analysis methods for this paradigm). Before the ratings were entered into the t test they were normalized ("z transformed") within each participant by subtracting the participant's mean rating and dividing the result by the standard deviation of the participant's ratings. This transformation, whose aim is to correct for differences between participants in their use of the scale, affected the resulting t statistics only slightly; none of the qualitative results for an individual contrast depended on whether or not it was applied. The full numerical results are reported in Table 1 for Hebrew and Table 2 for Japanese.
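The within-participant normalization and the paired t statistic can be sketched as follows. This is a minimal illustration, not the analysis code behind the reported results: the toy ratings are invented, the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used. A participant who gives every sentence the same rating would need special handling, because the standard deviation would be zero.

```python
from math import sqrt
from statistics import mean, stdev


def z_transform(ratings):
    """Normalize one participant's ratings: subtract the mean, divide by the SD."""
    m, s = mean(ratings), stdev(ratings)  # sample SD: an assumption
    return [(r - m) / s for r in ratings]


def paired_t(unstarred, starred):
    """Paired t statistic and degrees of freedom for one contrast.

    Each list holds one (normalized) rating per participant, in the same order.
    """
    diffs = [u - s for u, s in zip(unstarred, starred)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


# Invented toy data: each participant rated four sentences on the 1-7 scale.
# Index 0 is the unstarred member of a contrast, index 1 its starred member.
toy = {
    "p1": [6, 2, 7, 3],
    "p2": [5, 4, 6, 1],
    "p3": [7, 3, 5, 2],
}
z = {p: z_transform(r) for p, r in toy.items()}
t, df = paired_t([z[p][0] for p in z], [z[p][1] for p in z])
```

The two-tailed p value would then be read off a t distribution with `df` degrees of freedom, which requires a statistics library and is left out of the sketch.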

[Figures 1 and 2: mean acceptability rating per sentence, with separate panels for control and critical contrasts.]

Control contrasts
The control contrasts in both languages were robustly replicated (for all control contrasts, t > 15, p < 0.001). The average rating of each of the unstarred sentences was 5 or higher, whereas the starred sentences were rated 3 or lower.

Hebrew
Seven of the 14 Hebrew contrasts were replicated at the conventional statistical threshold of p < 0.05. Two contrasts showed a significant difference in the direction opposite to the one expected; in other words, the starred sentence was rated more highly than the unstarred one (H2: p = 0.003; H11: p = 0.04). The difference in ratings within the remaining five contrasts failed to reach significance. The sign of the difference in four of those contrasts was consistent with the originally reported judgments. This suggests that a larger sample size may result in a higher replication rate, though it should be kept in mind that our sample was already quite large (n = 76). Based on the variability of the responses, we estimate that the experiment was sensitive enough on average to detect a difference of 0.55 in ratings (see sensitivity analysis below).

Japanese
Ten of the 14 Japanese contrasts were replicated at the p < 0.05 level. The remaining four contrasts did not reach significance in either direction; the numerical difference in three of these contrasts went in the direction opposite to the one predicted. The higher proportion of replicated Japanese contrasts was not due to the larger sample of participants (the sensitivity of the Japanese experiment was almost identical to that of the Hebrew experiment, with an average detectable difference of 0.54) but rather to somewhat larger effect sizes: the average difference in ratings between unstarred and starred sentences in Japanese was 0.87 compared to 0.77 in Hebrew.

Variability across participants
Each of the participants rated both of the members of each contrast in their language. This makes it possible to investigate the distribution of the differences in ratings between the unstarred and starred member of each contrast (shown in Figure 3). In an ideal replication, all of the difference scores would be positive: every participant would rate the unstarred member higher than the starred one. This was only the case for one contrast (J102), though the other control contrasts approached this ideal picture; for example, only one out of 76 participants rated the starred member of H101 higher than the unstarred one. Most of the critical contrasts showed considerable variability; the most common difference score was often 0, indicating that a plurality of the participants gave the same rating to both members of the contrast.
In principle, the absence of a significant difference between the unstarred and starred members of a contrast could reflect dialectal differences (see Section 4.5.1): if there are two dialects, one consistent with the original judgment and the other consistent with its opposite, the two dialects could cancel out when the average is computed across all participants. The histograms in Figure 3 do not provide clear evidence for such an interpretation: the distributions appear to be unimodal (having a single peak), rather than bimodal as the dialectal differences hypothesis would predict.
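The per-contrast difference scores discussed above can be tallied with a few lines of code. The sketch below is illustrative only; the ratings are invented and the function name is hypothetical.

```python
from collections import Counter


def difference_histogram(unstarred, starred):
    """Count how often each difference score (unstarred minus starred) occurs."""
    return Counter(u - s for u, s in zip(unstarred, starred))


# Invented ratings from five participants for one contrast.
hist = difference_histogram([5, 6, 4, 7, 3], [5, 4, 4, 6, 3])
modal_difference, count = hist.most_common(1)[0]
```

A single dominant peak in such a tally corresponds to the unimodal pattern described above, whereas two well-separated peaks of opposite sign would be the signature expected under the dialectal hypothesis.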

Variability across contrasts
From the perspective of linguistic theory, only differences in rating within each contrast are relevant to the replicability debate. Syntactic theories typically do not make predictions about differences across unrelated contrasts; such differences could in principle be due to any number of non-syntactic factors (e.g., plausibility or lexical predictability). We nevertheless comment on the striking variability in ratings across contrasts. For instance, while contrast H8 was replicated (p < 0.001), its starred version received an average rating of 6.06, higher than the rating of nine of the 14 unstarred critical sentences in the Hebrew experiment (e.g., the rating of the unstarred member of the replicated contrast H4 was 4.03). In Japanese, while contrast J4 was replicated (p < 0.001), its unstarred version was rated 2.89 on average, lower than six of the 14 starred critical sentences in the Japanese experiment.
This pattern of results illustrates the care that should be taken in relating acceptability to grammaticality; clearly, it makes little sense to interpret a mean acceptability rating of 6.06 as showing that a sentence is ungrammatical if a mean acceptability rating of 2.89 is taken to show that a sentence is grammatical. At the same time, precisely because acceptability ratings reflect the influence of a multitude of factors, the fact that both sentences in a replicated contrast were rated very highly casts doubt on the conclusion that the slightly diminished acceptability of the starred sentence was due to ungrammaticality (the fact that it categorically cannot be generated by the grammar) rather than due to other factors, such as pragmatics, frequency or gradient preferences. In cases of syntactic variation, for example, one variant may be moderately but systematically preferred to another, even though neither variant is ungrammatical in the categorical sense.

Sensitivity of tests
We defined the sensitivity of our tests as the minimal mean difference in ratings between the unstarred and starred member of a contrast for which our power to detect the difference with a p < 0.05 threshold was at the standard level of 0.8 (calculated using the power.t.test function in R; Cohen 1992). We estimated the standard deviation of the difference in ratings by averaging the empirical standard deviations across all contrasts within a given language; the resulting estimated standard deviation was 1.7 for Hebrew and 1.88 for Japanese. The corresponding sensitivity estimates were 0.55 and 0.54, respectively.
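The sensitivity figures can be approximated without R. The sketch below swaps in a normal approximation to the paired t test, so it is not a reimplementation of power.t.test (which works with the noncentral t distribution); the function name is hypothetical. Under this approximation, the detectable difference is (z_{1-α/2} + z_{power}) · SD / √n.

```python
from math import sqrt
from statistics import NormalDist


def detectable_difference(sd, n, alpha=0.05, power=0.8):
    """Smallest mean paired difference detectable with the given power.

    Normal approximation to a two-tailed paired t test; sd is the standard
    deviation of the per-participant differences, n the number of participants.
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * sd / sqrt(n)


hebrew = detectable_difference(sd=1.7, n=76)
japanese = detectable_difference(sd=1.88, n=98)
```

With the standard deviations and sample sizes reported above, the approximation gives roughly 0.55 for Hebrew and 0.53 for Japanese. The small shortfall relative to the reported Japanese value is expected, because the normal approximation is slightly optimistic compared with the exact t-based calculation.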

Reanalysis with a smaller sample size
The number of participants in our experiments was fairly large (around twice that of Sprouse & Almeida 2012, for example). Such samples are not always easy to obtain for less widely spoken languages. To determine whether the statistical significance of our findings crucially depended on the large sample size, we repeated our analysis, this time restricting ourselves to the responses given by an arbitrarily selected subset of 20 participants (without collecting any new data).
With the smaller sample size, only four of the Hebrew contrasts were replicated at the conventional statistical threshold of p < 0.05. None of the differences in the remaining ten contrasts reached significance; four of these were negative, and the other six were positive. In Japanese, seven of the 14 contrasts were replicated, one (J6) showed a significant difference in the direction opposite to the one predicted, and the remaining six contrasts did not reach significance. The detailed results of the subset analysis are shown in Table 3 for Hebrew and in Table 4 for Japanese. Since the participants in the smaller sample were selected at random, any differences in the pattern of results between the two analyses (e.g., the reversal of the sign of the nonsignificant contrast H7) are due only to sampling noise.
We conclude that given the small effect sizes of the differences in rating in contrasts such as the ones we tested, a large number of participants (perhaps 100) is necessary to obtain clear results. The sensitivity of the paradigm can be increased by including multiple lexicalizations of the same contrast for each subject (i.e., creating multiple versions of the same contrast by replacing some lexical items with equivalent items) or by presenting both members of each contrast simultaneously (Sprouse & Almeida 2017).
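The subset reanalysis amounts to drawing 20 participants at random and recomputing the paired t statistic on their difference scores. The sketch below illustrates the idea on simulated data; the function name, the seed, and the simulated differences are all assumptions and do not reproduce the reported results.

```python
import random
from math import sqrt
from statistics import mean, stdev


def subsample_t(diffs_by_participant, k=20, seed=0):
    """Paired t statistic computed on a random subset of k participants.

    diffs_by_participant maps a participant id to that participant's
    difference score (unstarred minus starred) for one contrast.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(diffs_by_participant), k)
    diffs = [diffs_by_participant[p] for p in chosen]
    return mean(diffs) / (stdev(diffs) / sqrt(k)), chosen


# Simulated difference scores for 76 hypothetical participants.
sim = random.Random(1)
data = {f"p{i}": sim.gauss(0.5, 1.7) for i in range(76)}
t_subset, chosen = subsample_t(data, k=20)
```

Repeating the draw with different seeds gives a sense of how much the t statistic for a single contrast can move from one 20-participant subset to another.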

Discussion
Half of the Hebrew contrasts and a third of the Japanese contrasts that we identified as potentially controversial did not replicate in formal experiments. The experiments included a relatively large number of participants, and were sufficiently powerful to detect a difference in rating of approximately half a point on a 7-point Likert scale. Our participants rated six of the controversial contrasts (three in each language) in the opposite direction from the originally reported judgments: the starred sentences received higher ratings than their unstarred counterparts (the difference in the unexpected direction was significant in two of these cases, and nonsignificant in the remaining four). By contrast, cases that we judged to be Class II contrasts (control items) were consistently replicated by a comfortable margin.
The results of the experiments presented in this paper indicate that individual linguists can identify controversial contrasts with considerable accuracy. If between a third and a half of the controversial judgments that a single linguist was able to identify did not replicate, the field as a whole is indeed likely to be able to identify the majority of questionable judgments (keeping in mind, of course, that we do not have an estimate of the number of questionable judgments that we failed to identify). This validates the intuition that informal peer review can effectively weed out such judgments from the literature (Phillips 2010).
Our results reinforce the concern that some Class III contrasts in the literature may not be replicable, and suggest that replicability issues may be more common in languages other than English. Gibson et al. (2013) propose an uncompromising approach to addressing this concern: they argue that every acceptability judgment must be validated in a formal experiment. Our view is that given that the robustness of Class II contrasts is obvious to any native speaker of the language, it would be a waste of resources to test each and every judgment in a formal experiment (Culicover & Jackendoff 2010; Poeppel 2010), especially in smaller language communities where a large sample of participants would be difficult to recruit. We suggest that linguists concerned with data quality should focus on the small minority of potentially questionable Class III contrasts; formal acceptability rating experiments are necessary only in the cases in which there is disagreement among linguists about a particular Class III judgment.

The peer review process
Our results suggest that the peer review of Hebrew and Japanese judgments may be insufficient. To understand why, it is instructive to divide the review mechanisms discussed by Phillips (2010) into three stages.
The first stage is pre-publication peer review, which takes place primarily at conferences. As we have pointed out, pre-publication peer review is likely to be less rigorous in languages other than English: most conferences are likely to have few, if any, native speakers of the language in question, with the exception of conferences that focus on particular language families.
The second stage is the formal review that takes place as part of the journal publication process. Many papers do not undergo this process at all (e.g., book chapters, dissertations and conference proceedings papers). Questionable judgments can slip even into papers that are formally peer-reviewed; indeed, some of the judgments that failed to replicate in our experiments were drawn from journal articles. This issue is likely to be more acute in articles published in journals that are not language-specific; the editors of those journals may not be able to find reviewers who are simultaneously native speakers of the language and experts on the theoretical topic of the article. Anecdotally, the four Japanese judgments in our sample that were drawn from peer-reviewed East Asian linguistics journals (Journal of East Asian Linguistics and Journal of Japanese Linguistics), where the reviewers were more likely to be native speakers, were replicated in our experiment; the four Japanese judgments that did not replicate were taken from books or general journals (Natural Language and Linguistic Theory). Finally, judgments are even less likely to be vetted by a reviewer who is a native speaker when the paper is not predominantly about a particular language but includes one or two judgments from each of several languages (such judgments are typically elicited from the authors' colleagues via email or in hallway conversations).
The third stage in the process outlined by Phillips (2010) can be referred to as historical peer review. Phillips argues that questionable judgments do not make it into the "lore" of the discipline: they are ignored by subsequent researchers. Yet it is unclear whether this process could be effective if those researchers do not speak the language and are therefore incapable of evaluating the original judgments. As an example, after conducting our experiments we discovered that contrast H8 in the Hebrew experiment has been challenged by Siloni (2001: Footnote 15), but this fact does not seem to have undermined the influence of the analysis motivated in part by that contrast (Shlonsky 2004). Indeed, it is unclear whether the field is aware of Siloni's challenge to the validity of the contrast: a later paper on Welsh cites the Hebrew contrast without noting the disagreement about its status (Willis 2006).

Improving the peer review process
The weaknesses of the peer review process for less widely spoken languages can be remedied in a straightforward way. We propose an online crowdsourced database of published acceptability judgments, modeled after existing community resources such as Stack Overflow and Urban Dictionary. To help linguists who are not experts on a particular language to discover existing post-publication criticisms of published judgments in that language, links between different papers that discuss a given judgment will be automatically generated to the extent possible (some manual annotation may be necessary to complement this automatic process). Users will be given the option to comment on judgments online. Such comments might specify the set of contexts in which the judgment is valid, or provide attested examples that challenge it. A voting mechanism will allow users to quickly evaluate a judgment without commenting on it. Such "upvotes" and "downvotes" have been successful in weeding out uninformed answers to questions on websites such as Stack Overflow. The website could also provide facilities for collecting judgments from a large sample of naive participants, in the infrequent cases in which this will be found to be necessary. Some of the issues with peer review processes as currently implemented apply to widely studied languages as well: a questionable English judgment that has made it into a published paper may mislead linguists who are not native English speakers and are not aware of the controversy surrounding the judgment. We therefore believe that work on English will also benefit from the online crowdsourced database we have sketched.
In a recently published paper, Mahowald et al. (2016) recognized that many contrasts are very robust and that large-scale experiments are not always necessary to validate them. They calculated that unanimous judgments from seven participants on seven unique lexicalizations of a contrast are sufficient to establish the robustness of the contrast (see also Myers 2009). While we are not convinced that even a lightweight experiment is necessary to establish the robustness of judgments such as *the bear snuffleds, Mahowald et al.'s (2016) proposal strikes us as a reasonable middle ground between traditional methodology and the formal-experiments-only position expressed in Gibson et al. (2013); in fact, their proposal can be straightforwardly implemented using the platform we sketched above.

What is a failure to replicate?
A reviewer correctly points out that a failure to replicate a judgment does not necessarily indicate that the original judgment was incorrect; in particular, statistical analysis of the results of an experiment can fail to reach statistical significance due to insufficient statistical power (a Type II error) rather than because the underlying effect is exactly 0. The reviewer suggests that only sign reversals constitute evidence against a contrast provided in the literature. We disagree with this argument: even significant sign reversals can occur by chance, and are not always more informative than nonsignificant results (Gelman & Carlin 2014). In fact, the argument can be made that in the social sciences, including linguistics, no empirical effect is exactly zero. Given an extremely large sample size (say, five million subjects), any judgment would either be significantly replicated or yield a significant sign reversal; indeed, a randomly generated contrast would be "replicated" about half of the time.
In practice, the sample size of a replication experiment should be based on the minimal effect size that is seen as robust enough to inform theory formation in syntax. According to our sensitivity analysis, our experiments were able to detect a difference in ratings of 0.5 on a 7-point scale (at the conventional threshold of p < 0.05), with a sample size of 76 participants in Hebrew and 98 in Japanese. If linguists believe (1) that the difference in acceptability between a sentence that is generated by the grammar and a minimally different sentence that is not generated by the grammar can be much smaller than 0.5 on a 7-point scale, and (2) that an individual linguist can detect such a small effect by introspection, many thousands of participants will be necessary for adequately powered replications. Of course, the combination of these two assumptions creates a significant burden-of-proof asymmetry: the intuition of the original linguist is privileged over that of the dissenting linguist, who is required to provide evidence from an enormous number of subjects to support their position.
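The dependence of sample size on the minimal effect of interest can be illustrated with a back-of-the-envelope power calculation. The sketch below is not the sensitivity analysis reported above: it assumes a simple two-sided z-test on per-participant rating differences, and the standard deviation of 1.5 rating points is a hypothetical value chosen only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def power_two_sided(effect, sd, n, alpha=0.05):
    """Approximate probability of a significant result (in either
    direction) for a true mean difference `effect`, n participants,
    and per-participant difference scores with standard deviation sd."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(effect) / (sd / sqrt(n))  # true effect in standard-error units
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def required_n(effect, sd, alpha=0.05, power=0.8):
    """Smallest n reaching the target power under the same approximation
    (ignoring the negligible chance of a wrong-sign rejection)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(((z_crit + z_power) * sd / abs(effect)) ** 2)


ASSUMED_SD = 1.5  # hypothetical, not estimated from the experiments

for effect in (0.5, 0.05):
    print(effect, required_n(effect, ASSUMED_SD))

# A tiny but nonzero effect is almost certain to reach significance
# once the sample is extremely large.
print(power_two_sided(0.01, ASSUMED_SD, 5_000_000))
```

Under this approximation the required sample grows with the inverse square of the effect, so a contrast ten times subtler calls for roughly a hundred times as many participants.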

Should we expect judgments to replicate?
The notion that acceptability judgments given by linguists are expected to replicate in a representative sample of the population is not without its opponents. Some linguists have argued that judgments always reflect a particular linguist's idiolect, in which case replication studies with naive participants are entirely irrelevant: those participants may well have a different idiolect from the original author (Den Dikken et al. 2007). In a more nuanced criticism, Hoji (2015) argues that replication does have value, but only if it has been established that the participants' idiolect is similar to the original author's in the relevant respect. We cannot conclusively refute this objection. The lack of bimodality in the pattern of responses does not provide evidence for idiolectal variability in our sample, but it is of course possible that only a handful of participants shared the original author's idiolect; such a small subset of participants will not show up as a discernible second peak in the distribution.

Dialectal and generational variation
Dialectal variation in the population is hard to rule out definitively as an explanation for replication failures. We did not find evidence for different patterns of responses among our participants that could be ascribed to different dialects (see Section 3.3), but there may certainly be systematic dialectal or generational differences between our participants as a group and the original authors. We recruited our participants on the internet in the 2010s; it is quite plausible that they were at least one generation younger than the authors of the original papers, most of which were published in the 1980s and 1990s. This objection raises interesting questions about our ability to rely for theory construction on contrasts that have accumulated in the literature across different generations; these questions cannot be addressed by our data and are not specific to the interpretation of judgment replication experiments.

Comparison to English
We conjectured that English judgments are more reliable than judgments from other languages. Our empirical results are consistent with this conjecture but do not prove it. Our experiments included an intentionally biased sample of sentences; in contrast to previous experiments that attempted to replicate a random sample of English judgments, our design does not provide us with a simple way to estimate the proportion of judgments in Hebrew and Japanese that are difficult to replicate. We can nevertheless attempt to assess the difference between our results and the results of Sprouse and Almeida's work on English, in two ways (we thank Diogo Almeida for these suggestions).
First, out of the 148 sentence types that Sprouse et al. attempted to replicate, 13 were originally reported with a question mark (? or *?; Sprouse et al. 2013: 233), a sample size that is quite similar to ours. The linguists who originally provided these acceptability judgments may have anticipated objections to the judgments and used the question marks as a hedge indicating that these are "subtler" contrasts (presumably, with a smaller effect size). Yet only one out of these 13 test cases failed to replicate in the Sprouse et al. survey, a much lower rate than in our experiments (seven in Hebrew and four in Japanese); if the question mark annotation is a reliable guide to how subtle the judgment is, then this analysis indicates that English judgments are more reliable than Hebrew and Japanese ones.
Alternatively, we can examine English contrasts that are more likely to be controversial based on the numerical effect size estimated in Sprouse et al.'s experiment. There were 20 contrasts for which the effect size was medium (0.5) or smaller in the Sprouse et al. sample of 148 sentence types. In the Likert Scale task of Sprouse et al. (the task we used), five results went in the opposite direction from the one argued for in the original articles (see their Table 3), and six went in the predicted direction but did not reach significance. This pattern is more similar to our Hebrew and Japanese results. The conflicting results of these two indirect methods suggest that a more direct comparison between Hebrew, Japanese and English, and perhaps additional languages, would be an important direction for future work.

Sample of languages
While our study expands the number of languages in which judgment replication studies have been conducted from one (English) to three, we did not test a representative sample of languages. In particular, neither of the languages we have investigated is as underrepresented in linguistics as Estonian, Maltese or Chichewa might be. Our choice of Hebrew and Japanese was a matter of convenience: we are native speakers of those languages; the existence of a medium-sized community of linguists working on those languages made it possible to test a diverse range of acceptability judgments made by multiple authors; and the fact that millions of people speak each of those languages facilitated recruiting a satisfactory number of experimental participants. If our concerns about the peer review process turn out to be well-founded, however, we expect replication failures in languages with even smaller research communities to be at least as common as in the languages we have examined here.

Theoretical import of judgments
As in earlier studies by Sprouse and colleagues, we did not attempt to trace the influence of each acceptability judgment (replicated or not) on subsequent linguistic theory building: our goal was to evaluate the quality of the data reported in linguistics papers rather than the quality of the theories constructed based on that data. This kind of detective work could be quite informative: it is not unreasonable to conjecture that data points that crucially support one theory over another face greater scrutiny, especially if those theories themselves are widely cited; such theoretically critical data points are likely to be more robust (Phillips 2010). Yet such an exercise, while certainly worthwhile, would not be straightforward. Theories are rarely constructed based on a single data point, and it is often unclear which particular data points are seen as crucially supporting a theory. This work is best left to experts on the theoretical domains that have been informed by those data points.

Conclusion
The vast majority of published English judgments can be replicated with naive participants (Sprouse & Almeida 2012; Sprouse et al. 2013). We argued that this is due to two reasons. First, a large proportion of acceptability judgments illustrate obvious and uncontroversial contrasts (Class I/II judgments). Second, more subtle contrasts (Class III judgments) are informally vetted by a large community of linguists who are native English speakers. While not foolproof, this informal peer review process weeds out most questionable judgments (Phillips 2010).
To examine the efficacy of the peer review process in languages other than English, we selected acceptability judgments in Hebrew and Japanese that we deemed to be questionable. Half (in Hebrew) or a third (in Japanese) of the Class III contrasts we selected failed to replicate, while all Class II judgments were robustly replicated. These results suggest that (1) formal acceptability rating experiments are not necessary for each and every judgment, (2) linguists can effectively identify questionable contrasts, and (3) informal peer review mechanisms may be less effective for languages spoken by a smaller number of linguists. We proposed an online community resource that can extend the benefits of informal peer review to less widely spoken languages.
We stress that our results do not suggest that there is a "replicability crisis" in Hebrew or Japanese linguistics. Although we did not explicitly count the number of contrasts that we did not consider to be questionable, we estimate that there were dozens of such contrasts for each potentially questionable judgment. In other words, most judgments are not controversial. Our goal in this study was to point out that some potentially unreplicable judgments do exist in the literature, and those can be identified by linguists.

Figure 1: Results of the Hebrew experiment. Error bars represent bootstrapped 95% confidence intervals.


Table 2: Japanese results: all participants.

Table 3: Hebrew results: a sample of 20 participants.

Table 4: Japanese results: a sample of 20 participants.