Using social-media data to investigate morphosyntactic variation and dialect syntax in a lesser-used language: Two case studies from Welsh

Data gathered from social media have been used extensively to examine lexical dialect variation in widely used languages such as English and Spanish, but their use to date in morphosyntax and for lesser-used languages has been more limited. This paper tests the usefulness of data derived from Twitter for addressing traditional questions in dialect syntax and sociolinguistics. It uses two case studies from Welsh – the form of the second-person singular pronoun in various syntactic contexts, and the availability of auxiliary deletion – to assess whether datasets based on Twitter data can successfully replicate and enhance results derived by traditional means. The results of the case studies coincide to a large extent with distributions established in existing studies, even ones using entirely different methods, such as dialect questionnaires or acceptability judgment tests. Twitter data also show considerable success in establishing implicational hierarchies and conditioning factors comparable to those typical of the field. Where the results differ from existing studies, the differences may be due to the younger demographics of Twitter users, or to differences in the quantity of data provided by different methodologies. The results produce patterns closer to spoken data than to written data, giving us reasonable confidence in such data as a relatively good proxy for spoken usage of large numbers of language users.


Introduction
While data from social-media platforms such as Twitter and Facebook have been used by linguists to investigate lexical variation (Russ 2012; Gonçalves & Sánchez 2014) and change (Grieve, Nini & Guo 2016), use of such material for morphosyntax has been relatively limited to date. This paper aims to demonstrate the successful application of Twitter data to investigate morphosyntax, to examine ways of dealing with methodological problems, and to test the extent to which it is possible to replicate results produced by traditional methods of investigating geospatial variation in morphosyntax (dialect surveys and spoken corpora) using social-media data.
Use of social-media data to examine dialect variation has grown substantially in recent years. Early work typically limited itself in several ways: first, by restricting itself to a corpus of tweets for which users enabled the automatic inclusion of GPS location metadata with their tweet; and, above all, by focusing on variables that can easily be extracted via searches for particular strings of characters, typically lexical variables. The pioneering work of Gonçalves & Sánchez (2014), for instance, used a corpus of 50 million GPS-localized tweets in Spanish to map lexical variation across the Spanish-speaking world. Other studies in this tradition include Scheffler et al. (2014), Gonçalves & Sánchez (2016), Huang et al. (2016), Donoso & Sánchez (2017), Eisenstein (2017), Shoemark, Kirby & Goldwater (2017) and Grieve et al. (2019).
This work successfully demonstrates broad-brush variation; for instance, Gonçalves & Sánchez (2014) show that the Spanish word for 'swimming pool' is alberca in Mexico, pileta in Argentina and Paraguay, and piscina everywhere else. While this is promising, it has not been fully established that these methods can identify variation within regions at the local level. Furthermore, these studies do not address central questions in current work on language variation and change, much of which focuses on understanding how phonological and morphosyntactic innovations arise (actuate) and diffuse through space. To be fully integrated into mainstream work in language variation and change, social-media data need to be applied to core theoretical questions. For instance, typical variables in quantitative work show that the frequency of variants (whether innovative or in stable variation) is conditioned by both linguistic and external factors. External factors (gender, age, social status, network) may be difficult to establish in Twitter data, while the lexical variables chosen are often too simple for linguistic factors (phonological environment, clause type etc.) to be relevant. Abitbol et al. (2018: 1126) thus highlight the need for future work both to go beyond the study of lexical variation in English and to introduce subtler external factors such as social class into computational sociolinguistics. Current work has begun to address some of these challenges. Eisenstein (2015) extends Twitter-based linguistic research to phonology, showing that the frequency of phonetically based unconventional spellings is sensitive to phonological context in a way that is typical for phonological variables in sociolinguistic studies. For instance, orthographic deletion of <t d> in tweets in words such as jus(t) or ol(d) is inhibited by a following vowel in the same way that phonological deletion of /t d/ (coronal stop deletion) is inhibited by a following vowel in speech.
Nevertheless, the existing literature on this variable demonstrates complex hierarchies of conditioning by both preceding and following phonological context (Fasold 1972: 38-115; Tagliamonte & Temple 2005; Hazen 2011 etc.) alongside, in some studies, morphological factors (Guy & Boyd 1990). In this research context, the role of a following vowel is a relatively minor issue. Ideally, we would be able to address the same range of conditioning environments as found in sociolinguistically oriented studies.
Other studies have also demonstrated phonological variation using social-media data. Van Halteren, Van Hout & Roumans (2018) present data to suggest that the geospatial distribution of phonological variation in Limburg Dutch in Twitter data mirrors traditional dialect findings. Jones (2015) demonstrates previously undocumented geographically patterned phonological variation within tweets in African American Vernacular English (AAVE).
The question of observing ongoing change is addressed by Grieve, Nini & Guo (2016), who use Twitter to look at real-time diffusion of lexical innovations in American English over a period of several years. A number of works have also begun to look at morphosyntactic variation. Haddican & Johnson (2012) used a Twitter corpus to test for differences between and within US and British English in the frequency of discontinuous orders with particle verbs (put the lights out vs. put out the lights). Doyle (2014) showed that Twitter data could broadly establish the distribution of double modal might could across the southern United States, although the results still do not provide good resolution at local level, with questionable pockets of double modals appearing in major cities. Furthermore, conditioning factors could not be considered and the variable could not be defined in the standard way as the relative frequency of one variant compared to other possible variants (cf. Van Halteren, Van Hout & Roumans 2018: 140), since other variants, such as might have been able to, were not collected. Stevenson (2016) used Twitter to establish the geospatial distribution of variation in the syntax of ditransitive verbs (specifically the past tense of give and send) with pronominal objects in British English (gave it me/me it/it to me etc.), and showed that these patterns closely match those of the Survey of English Dialects (Upton & Widdowson 1996: 52). Claes (2017) used Twitter data to show that plural agreement in Spanish existential constructions is conditioned by tense, negation and the semantics of the associate noun phrase, and that this finding is replicated in a corpus of traditional spoken Spanish and in previous studies of Caribbean Spanish. Ljubešić, Miličević Petrović & Samardžić (2018) plot the distribution for 16 variables, including 7 morphosyntactic ones, across the Slavic languages of the western Balkans using Twitter data.
While they focus mainly on examining whether dialect differences match current political boundaries, they also successfully demonstrate the viability of using social-media data to investigate both morphological and syntactic variables. Strelluf (2019) paints a picture of the geospatial distribution of positive anymore in American English tweets that is in line with established distributions, and analyses those patterns in terms of language-internal factors whose relevance for the phenomenon is well established, namely negative-polarity licensing contexts and clausal position.
Despite these welcome contributions, the total volume of work on morphosyntax remains limited, and only a few attempts (notably Claes 2017 and Strelluf 2019) have been made to incorporate language-internal conditioning factors into investigations. Again, with a handful of exceptions, work also remains heavily focused on English and Spanish, languages where the volume of tweets available is vast, and little attention has been paid to what techniques are appropriate for languages with a lesser presence in social media. Finally, while consensus isoglosses have been successfully replicated in a number of cases, apart from Van Halteren, Van Hout & Roumans's (2018) work on phonological variation within Limburg, this has mostly been done at the macro rather than the micro level. Identification of micro-level patterns, typical of existing dialect atlases, is naturally a more challenging task and success therefore more difficult to demonstrate.
This paper addresses some of these issues by considering two cases of morphosyntactic dialect variation in Welsh, a language with a relatively small presence in social media.
The first case study concerns the dialect distribution of the Welsh strong second-person singular pronoun chdi. This occurs in northwestern Welsh dialects in various syntactic environments (after a non-inflecting preposition, in fronted focus position, as subject of an auxiliary etc.). The exact set of possible syntactic environments for its use varies from dialect to dialect according to an implicational hierarchy. Traditional studies of the dialect distribution of chdi show a concentric-ring pattern, with a core dialect in which chdi is permitted in the largest set of environments, and successively more distant dialects allowing it in fewer and fewer of them. This pattern results from historical wave-like distribution of chdi in more and more contexts via contagious diffusion from a central core (Bailey 1973: 65-109; see also Britain 2013 [2002] on models of diffusion more generally). The research question here is therefore to what extent both the overall pattern and the details of the linguistic factors that give rise to the implicational hierarchy can be established from Twitter data.
The second case study involves deletion of finite forms of auxiliary 'be' before subject pronouns in spoken Welsh verb-initial word order. In addition to the question of the geospatial distribution of this feature, and whether it can be accurately established using Twitter data, this variable allows us to test a second question, namely the extent to which social-media data provide a good proxy for spoken data. Auxiliary deletion is known to occur at exceptionally high rates in spoken Welsh, but it is not a feature of the formal written language. If Twitter data are a good proxy for spoken data, we would expect to find rates closer to those found in spoken corpora than in written corpora. This paper begins (section 2) by considering the data-collection procedure and methodological issues involved in Twitter research for a small language. Sections 3 and 4 set out the two case studies in turn, beginning in each case by describing the linguistic variable in question and the existing state of knowledge established using traditional methods (the reference distribution), before setting out and mapping the Twitter data for these variables in comparison. In both cases, the Twitter results turn out to be broadly consistent with the reference distribution and with other findings of existing studies. The reasons for mismatches in individual points of detail are discussed both in the case studies and in the conclusion (section 5).

Methods and data collection
Using Twitter to work with Welsh presents somewhat different issues from working with a major world language. For major world languages, the volume of data is such that it may be possible to discard a very large proportion (even 99%) of it, and still have a useful body of evidence. Welsh is regularly used in social media: Kevin Scannell (indigenoustweets.com) reports over 14,000 Twitter users as tweeting in Welsh with some 5.7 million tweets having been composed in Welsh, and the number has likely risen significantly since these figures were last updated in 2014. While substantial enough to be the basis for research, this corpus is by no means so large that we can afford to disregard large quantities of useful material from it. For this reason, it is not feasible to limit oneself to tweets by users who have enabled automatic GPS geotagging of their tweets on their mobile phones, as many studies have done. Only a small proportion of tweets (1-2%, Eisenstein 2017: 369) have such metadata. Users who enable this geotagging on their phones are also likely to be more urban and technologically minded than average, exacerbating existing biases in Twitter data (for a measure of the bias against rural users in GPS-localized Twitter data in the United States, see Hecht & Stephens 2014). Furthermore, relying on GPS-localized tweets alone would reduce the number of distinct users and hence the independence of the dataset, potentially introducing overreliance on idiosyncrasies of particular individuals in some part of the data. It was therefore decided to use all available tweets containing the relevant linguistic variables during a period of observation and, for mapping, to develop a strategy for assigning geographic locations to tweets based on other information provided by Twitter users.
As linguists mapping dialect variation, we are interested not in the location of a user when they are composing a tweet, nor even particularly where they live at the time they are tweeting, but rather where they acquired their language. We would also ideally like to know other demographic information associated with the user, such as their gender, age, social class and occupation, and, in the context of a lesser-used language such as Welsh, aspects of their language background, including the means by which they acquired Welsh (in the home, at school, as an adult learner etc.) and perhaps even the extent to which they participate in Welsh-language culture. These, and others, are all demographic factors that would be taken into consideration in a well-designed dialectological or sociolinguistic study of a given linguistic variable. Unfortunately, none of them is straightforwardly available to us for Twitter users. However, the sheer volume of easily available data is highly attractive, if these limitations can be overcome: a study of variation in Welsh with 14,000 informants is vastly beyond what could normally be achieved.
Although the corpus contained only tweets marked by the Twitter language-identification algorithm as being in Welsh, a considerable number were in other languages (mostly Kurdish, Bahasa Indonesia, Tagalog or Italian). These were removed manually. Also excluded were tweets containing only quotations (Bible verses, poetry), proper names, and obvious spam. None of the tweets appeared to have been produced by automated bots. Retweets and tweets from national-level institutional accounts (government organizations, broadcasters) were removed, along with resources for learners and tweets from users who identified themselves as second-language (L2) learners in their user description. The result was a dataset of 6,664 tweets in Welsh containing second-person singular pronouns from 2,932 distinct users. Although, in principle, such Twitter searches provide only a sample of tweets, manual checking of the data suggested that all or nearly all tweets coded as Welsh by Twitter's language-identification algorithm and matching the query terms had been returned.
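The exclusion steps just described were largely carried out by hand, but some of them could in principle be mechanized. The following Python sketch is illustrative only: the Tweet record, its field names, the hand-coded is_national_institution flag and the learner cue phrases are all assumptions introduced here, not part of the procedure actually used for the dataset.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical record; field names are assumptions made for this sketch.
    text: str
    user: str
    user_description: str = ""
    is_retweet: bool = False
    is_national_institution: bool = False  # assumed to be coded by hand


# Illustrative cue phrases for self-identified learners (an assumption).
LEARNER_CUES = ("dysgwr", "dysgu cymraeg", "welsh learner", "learning welsh")


def is_self_identified_learner(description: str) -> bool:
    """True if the user description contains one of the learner cues."""
    lowered = description.lower()
    return any(cue in lowered for cue in LEARNER_CUES)


def keep(tweet: Tweet) -> bool:
    """Apply the exclusions: retweets, national institutions, self-identified learners."""
    if tweet.is_retweet or tweet.is_national_institution:
        return False
    return not is_self_identified_learner(tweet.user_description)


def filter_corpus(tweets: list[Tweet]) -> list[Tweet]:
    """Return only the tweets that survive all exclusions."""
    return [t for t in tweets if keep(t)]
```

Removal of tweets in other languages, quotations and spam still requires manual inspection and is not covered by this sketch.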
Tweets from accounts of local institutions were retained as potentially reflecting local usage. The choice to retain local but not wider institutional tweets inevitably influences the extent to which spoken forms are found, since even local institutions are likely to be more formal in their linguistic preferences than personal users. Self-identified L2 speakers were removed, but others were probably present, so an unknown number remain in the dataset.
From subsequent analysis, it became clear that a proportion of the tweets collected were conventional social interactions frequently performed in Wales in Welsh even by non-Welsh-speakers (thanking people, wishing people a happy birthday and wishing people a happy Christmas). These are included in the dataset, but treated separately, and their effect will be considered separately in the analysis below.
Information about users' geographic origin is crucial to any study of dialect variation. Tweets include various metadata that are potentially of use for this task, most obviously the location and description fields of the metadata provided by users themselves. In the current study, the location field of the metadata was left blank in 26.1% of tweets. While the majority of data (73.9%) thus provides some kind of user-provided information, this was not always useful: many users provided their location simply as the "UK" (35 tweets, 0.5%), "Wales" (175 tweets, 2.6%) or its Welsh-language equivalent "Cymru" (427 tweets, 6.4%) or similar. A few gave non-geographic locations of the type "In my kitchen". Even when more specific locations were provided, they were not always particularly useful: a description such as "North Wales" or "Gogledd Cymru" (257 tweets, 3.9%) is of limited direct practical use for linguistic geography and such information was also disregarded in producing a localization. However, in many cases (3,062 tweets, 45.9%), the user location field did provide a specific city, town, village or small region sufficient to associate the tweet with a specific location. Where a user mentioned two locations, it was assumed that they had grown up in the smaller one and moved to the larger: formulations such as "Llanrwst/Cardiff" seemed mostly to be used by students to indicate their home town and university town/city. Such users were therefore treated as coming from the smaller location.
Where the user location field proved inadequate, the user description field was used. This is a free text field where users can provide any information they like about themselves, and, while some (1,181 tweets, 17.7%) left this field blank, most users wrote something here, including interests, hobbies, political views, parenthood, age, and either current location or the place they grew up in or identified with. Any information here was added to that obtained from the user location field, providing a new or more fine-grained localization for a further 304 tweets (4.6%).
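The localization heuristic can be sketched in code as follows. This is a minimal illustration, not the procedure actually applied (which relied on manual judgment): the gazetteer SIZE_RANK, with its purely ordinal size ranks, and the function names are hypothetical. The sketch disregards vague locations, pools place names from the location and description fields, and, where several places are mentioned, selects the smallest.

```python
import re
from typing import Optional

# Locations too vague to be used for mapping (examples taken from the text).
VAGUE = {"uk", "wales", "cymru", "north wales", "gogledd cymru"}

# Hypothetical gazetteer: place name -> relative size rank (1 = smallest).
# The ranks are illustrative ordinals, not population figures.
SIZE_RANK = {"llanrwst": 1, "bangor": 2, "cardiff": 3, "caerdydd": 3}


def places_in(text: str) -> list[str]:
    """Return the gazetteer places mentioned as whole words in the text."""
    lowered = text.lower()
    return [p for p in SIZE_RANK if re.search(rf"\b{re.escape(p)}\b", lowered)]


def localize(location: str, description: str = "") -> Optional[str]:
    """Pool places from both fields and return the smallest, or None."""
    candidates: list[str] = []
    for field in (location, description):
        if field.strip().lower() in VAGUE:
            continue  # a vague field contributes nothing to the localization
        candidates.extend(places_in(field))
    if not candidates:
        return None
    # Assumption from the text: the smaller place is the user's home town.
    return min(candidates, key=SIZE_RANK.get)
```

For example, localize("Llanrwst/Cardiff") returns "llanrwst", while localize("Cymru") and localize("In my kitchen") both return None.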
Of course, this procedure is no guarantee that users are mapped to the places where they grew up, but it is the best approximation that can be made. Furthermore, this procedure can be applied to many more tweets than those for which geotags (GPS location data) are available and, in any case, is more likely to yield users' actual places of upbringing. In total, a usable localization was thus established for 3,366 tweets (50.5% of the dataset). While it is inevitable that this procedure leads to some tweets being assigned localizations that do not accurately reflect where the users acquired their language, it was anticipated that such errors would be relatively insignificant in the overall picture. If the geospatial distribution of features that emerges from the study matches or enhances what we find in established work, that expectation will have been borne out.
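The counts and percentages reported for the localization procedure can be checked against the total of 6,664 tweets. The following short sketch simply recomputes each percentage from the figures given above; the dictionary labels are informal descriptions introduced here.

```python
TOTAL = 6664  # tweets in the dataset

# (count, percentage as reported in the text)
REPORTED = {
    "location given as 'UK'": (35, 0.5),
    "location given as 'Wales'": (175, 2.6),
    "location given as 'Cymru'": (427, 6.4),
    "location given as 'North Wales'/'Gogledd Cymru'": (257, 3.9),
    "specific place in location field": (3062, 45.9),
    "description field left blank": (1181, 17.7),
    "localized via description field": (304, 4.6),
    "usable localization in total": (3366, 50.5),
}


def pct(count: int, total: int = TOTAL) -> float:
    """Percentage of the dataset, rounded to one decimal place."""
    return round(100 * count / total, 1)


for label, (count, reported) in REPORTED.items():
    status = "ok" if pct(count) == reported else "MISMATCH"
    print(f"{label}: {count} = {pct(count)}% (reported {reported}%) {status}")

# The total is the sum of the two localization routes.
print(3062 + 304 == 3366)
```

All reported percentages are consistent with the counts, and the total of 3,366 localized tweets is the sum of those localized via the location field (3,062) and those localized via the description field (304).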
Relatively few users (336 tweets, 5.0%) gave direct information about their age. Users who did provide such information were almost always in their twenties or late teens. This age distribution is very similar to that found by Sloan et al. (2015) for a corpus of English-language tweets where age was identified directly from information in the user's description field. This is clearly much younger than the general population, although it is also clear from manual inspection of the data that Twitter users in their thirties and older are simply much less likely to state their age directly in their user description. The overall age profile of Twitter users in the UK is younger than the general population. IPSOS Connect (2017: 18) estimate 25% of users to be under the age of 25 (compared to 15% of the general population) and 46% under 35 (compared to 32% in the general population) (Great Britain only). Chaffey (2019) estimates that, as of October-December 2014, 28% of Twitter users were aged 16-24 and over 30% in the 25-34 age group (entire UK). If we assume a similar profile in our data, it is clear that many users in their thirties and above are present in the data, but do not identify themselves as such via their user description. The limited range of ages identifiable within the data makes it difficult to say anything about apparent-time variation within the data without further analysis (e.g. by assuming that, collectively, the group identifying themselves as parents is on average older than the group identifying themselves as students). Pending further investigation of how to extract information on age, gender, social class etc. in social-media-based linguistic research, the age dimension will not be considered further in this study.
Finally, tweets were annotated for the form of the pronoun and for syntactic context as explained for the individual case studies below. Since both case studies involve second-person singular pronouns in some form or other, the same base dataset is used in both cases. This dataset is provided in anonymized form in the Supplementary Files linked to this article.

The variable
In most varieties of Welsh, the second-person singular (informal) pronoun is ti. Another form, di, occurs in certain syntactic environments, such as the postverbal subject of certain auxiliaries that end in a vowel or as a possessor in a possessed noun phrase. In northwestern varieties of Welsh, an alternate form of the pronoun has arisen, namely chdi. Historically, this is the result of phonological reduction and dissimilation of the Middle Welsh strong second-person singular pronoun tydi. After certain prepositions and conjunctions, the consonant alternation known as aspirate mutation was (and still is) triggered (e.g. tŷ 'house' but â thŷ 'with a house', with /t/ > /θ/). This applied regularly to tydi in the relevant syntactic environments, hence Early Modern Welsh â thydi 'with you'. Syncope of the unstressed schwa led to â thdi (normally spelled â'th di in Early Modern Welsh). In northwestern and central northern varieties of Welsh, thdi was reanalysed as a new pronoun and spread beyond syntactic environments where aspirate mutation was triggered. Finally, a dissimilation occurred in some northwestern variety of nineteenth-century Welsh, so that thdi /θdi/ became chdi /χdi/, and this new form eventually supplanted the earlier form entirely (Willis 2017).
The dialect distribution of chdi is relatively well established. It was the subject of one question asked for the Welsh Dialect Survey (Thomas et al. 2000: 555), most of whose informants were born in the 1920s; see Figure 1. This shows chdi to be solidly attested in the whole of the northwest, with some scattered attestation in adjacent areas of the northeast; it is not present in the south.
The syntactic distribution of chdi is quite complicated. As we have seen, chdi (or rather its ancestor form thdi) arose in one particular syntactic environment, namely after a small group of prepositions and conjunctions that triggered aspirate mutation (namely â 'with', efo 'with', gyda 'with', â '(equative) as', na 'than', and a 'and'). From here, it seems to have spread first to focus constructions and to other independent contexts, such as free-standing answers to questions. Welsh distinguishes strong from weak pronouns: strong pronouns occur in contexts without agreement, while weak ones occur in contexts associated with agreement. The contexts in which chdi first appeared were typical strong contexts, and all dialects that use chdi use it in these contexts, namely after non-inflecting prepositions, (1); as the second conjunct in co-ordinated noun phrases, (2); in sentence-initial (fronted) focus position, (3); and in fragment answers, (4) (which can be thought of as a reduced form of the type in (3)).
(1) efo chdi
    with you
    'with you'
(2) fi a chdi
    me and you
    'me and you'
(3) Chdi sy 'n gwybod.
    you be.prs.rel prog know.inf
    'It's you that knows.'
(4) Pwy neith gymryd o? Chdi.
    who do.fut.3sg take.inf it you
    'Who will take it? You.'

However, chdi has spread beyond these environments, and is found also in various contexts traditionally associated with agreement. This may be part of a trend towards loss of agreement more generally in Welsh (Willis 2017: 44-46). Thus, we find chdi as the subject of complement clauses headed by bod, the nonfinite form (verbnoun) of bod 'be'. In place of the more traditional construction illustrated in (5), with second-person singular agreement marker dy and pronoun di, we now find (6), with no agreement marker and chdi alone.

(5) Dwi 'n gwbod dy fod di 'n siarad Cymraeg.
    be.prs.1sg prog know.inf 2sg be.inf you prog speak.inf Welsh
    'I know that you speak Welsh.'
(6) Dwi 'n gwbod bod chdi 'n siarad Cymraeg.
    be.prs.1sg prog know.inf be.inf you prog speak.inf Welsh
    'I know that you speak Welsh.'

There are a number of other contexts to which chdi has spread. With the object of an inflecting preposition, such as am 'about', amdanat ti 'about you' is replaced by amdana chdi. With the subject of various auxiliary and modal elements, oeddat ti 'you were' is replaced by oedda chdi; (by)sat ti 'you would (be)' is replaced by (by)sa chdi; rhaid (i) ti 'you must' (lit. 'it is necessary for you') is replaced by rhai' chdi; and byddi di 'you will (be)' is replaced by by(dd) chdi. As the object of a nonfinite verbal form, chdi replaces doubling of a preverbal agreement marker and postverbal di (the pattern is parallel to that in (5) and (6)).

In other contexts, namely those where agreement is secure, chdi has made little to no impact and ti (or its syntactically conditioned variant di) is compulsory in all dialects. Such contexts include the subject of lexical verbs, (11); the subject of the present tense of the verb bod 'be', (12); and the subject of imperatives, (13).
(11) Cei di/*chdi un.
     get.fut.3sg you one
     'You'll get one.'
(12) Ble wyt ti/*chdi?
     where be.prs.2sg you
     'Where are you?'

The spread of chdi is documented historically, and the historical trajectory is reflected in current dialect variation. Historically, we know that chdi is first attested in different syntactic contexts at different dates. The contexts in which it is attested earliest are also those where it is found today in the widest range of dialects. Willis (2017) investigates the historical sequence of the spread of chdi in the nineteenth and early-twentieth centuries and finds the following order of innovation:

(14) object of non-inflecting preposition > other independent (focus, dyna 'there is' etc.) > subject of nonfinite bod 'be' > after i 'for, to' > object of other inflecting preposition/subject of finite auxiliary

Chdi is absent in all other contexts in the historical data. Willis (2017) also conducts a study of contemporary patterns of geospatial variation by context. He shows that current variation mirrors the historical sequence of events fairly closely, consistent with wave-like diffusion from a single zone of innovation, in the sense that chdi is found to its fullest geographic extent only as the object of non-inflecting prepositions and in other independent contexts. In all other contexts, it is found in a subset of this area, with those that innovated earlier showing a wider geographic distribution than more recent ones. The relative order of contexts in dialect variation is the following (based on the intercept values of a global logistic regression, Willis 2017: 50):

Of these, efo 'with' and the independent contexts had already reached their current state in speakers born in the 1920s (Thomas et al. 2000: 555).
Apart from the subject of nonfinite bod 'be', which is not included in (15), and allowing for the fact that (15) includes a number of environments where chdi innovated only after the end of the historical period covered in (14), the order of the items is identical in the two implicational hierarchies. Given that the two hierarchies above are based on entirely independent data and methods, we can be reasonably confident that they reflect the actual course of development and geographical distribution of chdi. We can therefore treat (15) as the reference distribution, a standard of comparison for the Twitter data.
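The notion of an implicational hierarchy used here can be made concrete with a small sketch. Taking the order of contexts in (14) as given, a variety is consistent with the hierarchy if the set of contexts in which it allows chdi forms an unbroken initial segment of the list: use of chdi in a later context implies its use in every earlier one. The context labels are abbreviations introduced for this sketch, and the example sets in the usage note are invented for illustration and do not describe attested dialects.

```python
# Contexts in the historical order of innovation given in (14),
# from earliest (widest distribution) to latest (narrowest).
HIERARCHY = [
    "object of non-inflecting preposition",
    "other independent",
    "subject of nonfinite bod",
    "after i",
    "inflecting preposition / finite auxiliary",
]


def is_implicational(contexts: set[str]) -> bool:
    """True if the contexts form an initial segment of HIERARCHY.

    That is, once a context in the hierarchy lacks chdi, no later
    context may have it.
    """
    flags = [context in contexts for context in HIERARCHY]
    # Violation: a context with chdi directly follows one without it.
    return all(earlier or not later for earlier, later in zip(flags, flags[1:]))
```

A hypothetical variety allowing chdi only in the first two contexts satisfies the hierarchy, whereas one allowing it only after i 'for, to' would violate it.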
The first question, then, is whether Twitter data can accurately identify the geographic area in which chdi is used. If this question is answered in the affirmative, then a second, more demanding question is whether Twitter data can provide a level of contextual geographic detail comparable to that obtained by traditional questionnaire and/or corpus methods.

Global analysis of the data
First consider the overall distribution of results for all 6,664 tweets identified as containing a second-person singular pronoun. This is given in Table 1.
Before we turn to examine the overall patterns, some discussion of the status of formulas is necessary: formulaic expressions turn out to have a much higher propensity to contain ti than chdi and were therefore treated as a separate context. This concerned da (iawn) ti '(very) good (on) you', penblwydd hapus i ti 'happy birthday to you', Nadolig Llawen i ti 'Merry Christmas to you', diolch ti and diolch i ti 'thank you', helo ti 'hello, you' and hwyl ti 'bye, you'. In some of these cases the syntactic structure is not clear: is diolch ti 'thank you' an elided form of diolch i ti 'thank (to) you' or is it a calque of English thank you? If the latter, it is not clear whether ti should be treated as the object of the nonfinite verb diolch 'thank' or as a syntactically independent, unintegrated unit. Even where the structure is fairly clear, these formulas did not pattern at all with their associated context: penblwydd hapus i ti 'happy birthday to you' clearly exemplifies the context i 'to', but was only ever found with ti, never with chdi, even though i chdi 'to you' was common elsewhere. Formulaic expressions are widely recognized to be linguistically conservative and may resist linguistic innovations that have spread to most productive contexts for some centuries. Furthermore, in Wales, such formulas are widely known, in their standard form with ti, by people with no other knowledge or only limited knowledge of the language, and may therefore be used in an otherwise English-language context. Obvious instances of performance of Welsh by non-speakers were manually removed from the dataset (e.g. a single formulaic Welsh expression in a tweet otherwise in English or anti-Welsh racist insults from accounts otherwise tweeting in English). However, accounts could not be systematically examined to try to establish whether the user was a competent speaker of Welsh. Thus, it is likely that some uses of formulas represent unidentified cases of such performance (cf. the difficulties identified for Twitter research by Jones (2015: 411) arising from performance of African American Vernacular English). Thus, while searching for formulaic expressions in an untagged Twitter corpus might seem like an attractive way of extracting large amounts of data quickly, in practice it seems unlikely to yield satisfactory results.
Since the Twitter data include all instances of second-person singular pronouns, they inevitably cover more syntactic contexts than other studies. Restricting ourselves solely to the contexts included in (15), the following hierarchy emerges from Table 1. This hierarchy differs from that in (15) in a number of ways. On the positive side, it correctly places bydd 'will be' and gan 'with' at the far right edge of the hierarchy, in the correct order: Willis (2017) shows that these two contexts differ sharply from the others in allowing chdi over a much narrower geographic range. However, it fails to order correctly the left-hand portion of the hierarchy. In particular, the historically earliest contexts, namely object of non-inflecting preposition and other independent are not identified as the most favourable environments. A number of comments on this are in order. First, the four leftmost environments in (16) are not statistically significantly different from one another, and their relative ordering is therefore rather insecure. 3 Secondly, the age profile of Twitter users (cf. the statistics given in section 2 above) means that we are, on average, dealing with speakers considerably younger than were used to establish the hierarchies in (14) and (15). Consequently, the data can be interpreted as representing, on average, the language of speakers born in the 1990s. All but the final two contexts in (16) are ones that Willis (2017) found had run to completion in speakers of this age group. Consequently, we would expect the frequency of chdi in these contexts to have largely reached its ceiling, this ceiling being set by the proportion of Twitter users coming from the chdi-region.
While this is reassuring about the adequacy of the Twitter data, the idea that the change has run to completion in a number of syntactic contexts sits uneasily with some other observed significant differences in the percentages given in Table 1. Some of these other differences seem to be due to another, somewhat surprising fact, namely that the syntactic contexts themselves are not evenly distributed across the map. Above all, the context non-inflecting preposition is itself spatially autocorrelated. Figure 2 shows a k-nearest neighbours kernel density estimation (KDE) (bandwidth = 2*√(dataset size) = 115.8 nearest neighbours) of the proportion of data points in the dataset that are instances of the non-inflecting preposition context; that is, it shows the mean frequency of the context across the 115.8 nearest data points to each location. 4 The red areas on Figure 2 are those with above-average frequency of this context, with the darkest red showing areas where the frequency is more than 2 standard deviations away from the mean; the blue areas show the inverse, with areas below the mean. From this, it can be seen that, somewhat counterintuitively, the non-inflecting preposition context is found in the data significantly more frequently in the south than in the north, with a zone of very significantly elevated occurrence across all of the southwest. Since so much of the data for this context comes from the south, where chdi is not found, it is not surprising that the frequency of chdi overall in this context is lower than might be expected.
3 Chi-squared results: subj. of oedd vs. subj. of conditional χ² (df = 1, n = 487) = 0.620, p = 0.431; subj. of oedd vs. independent use χ² (df = 1, n = 380) = 1.774, p = 0.183; subj. of oedd vs. obj. of inflecting preposition χ² (df = 1, n = 262) = 1.749, p = 0.186; subj. of conditional vs. independent use χ² (df = 1, n = 371) = 1.086, p = 0.297; subj. of conditional vs. obj. of inflecting preposition χ² (df = 1, n = 253) = 1.400, p = 0.286; independent use vs. obj. of inflecting preposition χ² (df = 1, n = 262) = 0.027, p = 0.870. Despite this, it is not appropriate to merge these positions on the hierarchy. It is often the case that differences between adjacent points on this kind of hierarchy are not significant, while the entire hierarchy is statistically significant. In the current instance, the context of inflecting preposition is not statistically different from non-inflecting preposition or from conditional, but conditional auxiliary and non-inflecting preposition are significantly different. It is not possible to determine statistically whether inflecting preposition should be merged on the hierarchy with non-inflecting preposition or with conditional auxiliary. Consequently it is preferable not to perform either merger.
4 For further details on KDE as a smoothing procedure, see section 3.3 below.
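The p-values that accompany the chi-squared statistics in footnote 3 can be recovered from the statistic alone: with one degree of freedom, the upper-tail probability equals erfc(√(χ²/2)). The following is a minimal sketch of that conversion for three of the reported comparisons. The function name is illustrative and is not taken from the original analysis.

```python
import math

def chi2_p_value_df1(chi2):
    """Upper-tail p-value of a chi-squared statistic with df = 1."""
    # For df = 1, the statistic is the square of a standard normal deviate,
    # so P(X > chi2) = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

# Statistics reported in footnote 3 for three of the pairwise comparisons.
comparisons = [
    ("subj. of oedd vs. subj. of conditional", 0.620),
    ("subj. of oedd vs. independent use", 1.774),
    ("subj. of oedd vs. obj. of inflecting preposition", 1.749),
]
for label, chi2 in comparisons:
    print(f"{label}: p = {chi2_p_value_df1(chi2):.3f}")
```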
Once this issue is identified, the reason why it arises can be seen fairly readily. Some non-inflecting prepositions are found only in certain dialects; for instance fatha 'like' and efo 'with' are both northern only. However, in these cases, other dialects express the same meaning also with a non-inflecting preposition (fel and gyda respectively), cancelling out any effect. Conversely, the 'have'-construction, which varies significantly between dialects, contains a non-inflecting preposition in some dialects but not in others. Thus in southern varieties, the 'have' construction uses non-inflecting gyda 'with' in (17), while the commonest 'have' construction in the north uses inflecting gan 'with' in (18). A third construction using efo in (19) is confined to the north, but maps the object possessed to the object of the preposition, hence 'you' is the subject of the verb bod 'be' in this context and is counted in the statistics accordingly. The result is that non-inflecting prepositions are simply commoner in the south than in the north, artificially raising the global frequency of ti for this context. This highlights the need to interpret the data geospatially, as will be done in the next section.

Geospatial analysis of the data
The previous section showed the need to include a geospatial dimension to the linguistic analysis. This is what will now be attempted. Consider first the overall distribution of chdi, shown in Figure 3. Here, each circle represents a tweet location; the colour of the circle represents a KDE value for the frequency of chdi at that point: KDE calculates the mean frequency of the variant within a kernel centred on the point in question. The size of the kernel is determined by the KDE bandwidth. KDE has previously been used to map dialect variation by Blaxter (2017) and for dialect variation in Twitter data by Jones (2015) and Van Halteren, Van Hout & Roumans (2018). In all cases in the current article, the bandwidth is twice the square root of the subset of data under consideration, in this instance, the nearest 115.8 data points. 5 This method will be repeated for the various syntactic contexts below. Figure 3 shows the overall pattern for all the data. The frequency of chdi never exceeds 45% in any region, because these data include syntactic contexts in which chdi is categorically excluded in all varieties. The close similarity to the pattern found in the Welsh Dialect Survey in Figure 1 above is nevertheless clear, with a core region where chdi is strongest in the northwest, surrounded by a transitional ring to the east and southeast with moderate frequency of use.
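The smoothing step described above can be sketched as follows. This is an illustration only, under several assumptions that the text does not settle: distances are plain Euclidean distances on projected coordinates, the kernel is uniform (an unweighted mean over the nearest points), each point counts among its own neighbours, and the non-integer bandwidth would be rounded to a whole number of neighbours. The function names and the toy data are invented for the example.

```python
import math

def knn_bandwidth(n_points):
    """Bandwidth rule stated in the text: twice the square root of the subset size."""
    return 2 * math.sqrt(n_points)

def knn_smoothed_frequency(points, k):
    """For each point, return the mean of the binary variant indicator
    over its k nearest data points (the point itself included).

    points: list of (x, y, is_variant), with is_variant equal to 1 or 0.
    Euclidean distance on projected coordinates is an assumption.
    """
    smoothed = []
    for x0, y0, _ in points:
        nearest = sorted(
            points, key=lambda p: (p[0] - x0) ** 2 + (p[1] - y0) ** 2
        )[:k]
        smoothed.append(sum(p[2] for p in nearest) / len(nearest))
    return smoothed

# Invented toy data: three chdi tweets in one cluster, three ti tweets in another.
toy = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (10, 10, 0), (10, 11, 0), (11, 10, 0)]
k = 3  # on real data one might use round(knn_bandwidth(len(points)))
print(knn_smoothed_frequency(toy, k))
```

Because the bandwidth is defined as a number of neighbours rather than a fixed radius, the kernel automatically widens in sparsely sampled regions, which is why a handful of mislocalized points can still shift the estimate where data are thin.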
For direct evaluation of the success of the Twitter method, we need to compare each syntactic environment with parallel traditional data. This is done below by comparing the geospatial pattern found in the chdi data with data from the Syntactic Atlas of Welsh Dialects. In each case, two maps are provided, constructed according to the same principles: KDE is conducted for the SAWD data in the same way as it was done for the Twitter data.
The data collection for SAWD assumed that chdi was not found anywhere in the south; the south is thus blank on the SAWD maps because no questionnaire data was collected there. For the Twitter data, the south is retained, as it is not obvious in advance that the method can correctly rule out the presence of chdi in the south; indeed, some instances of chdi are localized to points in the south, although KDE smoothing means that these points do not have a significant impact on the final outcome.
An examination of data across the hierarchy of syntactic contexts shows that overall distributions produced by the two methods are rather similar. Figure 4 shows the two historically primary contexts, after a non-inflecting preposition and in independent syntactic contexts (mainly focus and sentence fragments).
Here, as in all cases examined, SAWD always produces higher absolute rates for chdi in regions where it is present, presumably because the oral questionnaire that was used focused speakers' attention on spoken usage and further because the Twitter data contain a proportion of material in standard written Welsh (see also discussion of Twitter as a proxy for spoken data in section 4.2.1 below). The Twitter data for object of non-inflecting preposition are for all such prepositions occurring in the corpus; the SAWD data are for the object of the one such preposition, efo 'with', that was included in the questionnaire.
For the object of non-inflecting prepositions, the Twitter data overstate the geospatial distribution of chdi by attributing majority chdi usage to the whole of the northeast. This is due to the relatively small size of the dataset once it is subdivided into different syntactic contexts: of the 277 data points categorized as non-inflecting preposition, 140 could be localized; and of the 266 categorized as independent use, 129 could be localized. With this size of dataset, mislocalization of one or two data points (for instance, because a user mentions only the place where they now live, but not where they grew up, in their user description) can have a significant impact even once smoothing has applied. This particularly affects the northeast, where there are rather few data points (8-10) for these contexts. This is an issue that is likely to arise from time to time when using Twitter for smaller languages, but its effect would likely be reduced with a larger dataset collected over a longer period of time. For the independent-use context, the data contained no such outlier points and the issue did not arise. The result is an isogloss from the Twitter data that is remarkably close to that found using traditional means. The two contexts discussed so far are historically primary; that is, the changes that produced the current dialect distribution occurred in the eighteenth and nineteenth centuries, and there is little evidence of ongoing change in their geographical distribution today. We now consider in turn contexts where there is more ongoing change.
Consider first the object of the semi-inflecting preposition i 'to' and the fully inflecting prepositions (mainly am 'about', ar 'on' and o 'from'), shown in Figure 5.
These are contexts where chdi emerged in the late-nineteenth century, and where there is some evidence today that chdi is still spreading geospatially. For i 'to' (186 localized points), the area for chdi defined by the two methods is again remarkably similar; for inflected prepositions (83 localized points), the Twitter result again slightly overstates the use of chdi in the northeast, and for the same reasons as before: the inflected-preposition dataset is too small to be immune to the presence of a small number of points whose localization does not reflect the place where the user in question grew up.
Next consider the subject of the imperfect auxiliary oedd 'were', the modal rhaid 'must', and the conditional auxiliary (bua)sa(i) or byddai. These are grouped together because previous research (Willis 2017: 58) has shown them to be undergoing similar rates of change today with respect to the spread of chdi, and the Twitter dataset clearly contained too few instances of each context (50 localized tweets for oedd, 17 for rhaid and 55 for the conditional) for analysis on an individual basis to be viable. Results for these three contexts combined are shown in Figure 6. The Twitter results agree with SAWD in identifying a smaller geographic area for chdi in these contexts than for the earlier contexts we have considered. Nevertheless, this is one context where the Twitter results suggest a somewhat wider geographic spread than the SAWD results. This may be due to the fact that these are contexts currently experiencing rapid diffusion of chdi, and the Twitter data reflect on average a younger age group, in which chdi does indeed have a wider geographic distribution. Closer inspection of the data suggests that the difference is mainly due to chdi being the overwhelming form among tweets localized to the Llŷn Peninsula around Pwllheli, while ti is the dominant choice (although by no means the only choice) of informants in this region in the SAWD questionnaire. This is probably a genuine instance of diffusion and change in progress.
Finally, consider the two contexts where chdi is a very recent innovation found only in a very small geographic area, namely the subject of the future auxiliary bydd 'will be' and the object of the preposition gan 'with'. The results for these contexts are shown in Figure 7. In both cases, the Twitter dataset is small (47 localized tweets for bydd and 79 localized tweets for gan), and, consequently, it is difficult to draw firm conclusions. However, the results are consistent with the SAWD questionnaires. The Twitter data agree with SAWD in showing that these are the contexts where chdi is most geographically restricted and where it is the minority choice everywhere. Both methods record slightly higher frequencies in the northwest. The greater richness of the SAWD data allows the identification of a centre of innovation in the Caernarfon area. The Twitter data do not contradict this: all the localized tweets for these contexts turn out to be either from the Caernarfon area or from users ultimately from this area who have moved elsewhere (but who ended up being localized to where they currently live).

Discussion
The overall geospatial distribution of chdi established using Twitter data broadly matches that established using traditional means (Figures 1 and 3). While the global analysis of the distribution by syntactic context produces a hierarchy of contexts that differs in points of detail from the traditional one, and does not identify the historically primary contexts, the more fine-grained geospatial analysis conducted in section 3.3 suggests that this is to some extent due to spatial autocorrelation of the syntactic contexts themselves within the dataset. When plotted as maps for individual contexts, the Twitter data largely agree with the result from the earlier SAWD project. Two sources of discrepancy were identified: in a number of cases, insufficient data, either for a particular syntactic context globally, or for a particular geographic region, resulted in oversmoothing or lack of geospatial detail for the Twitter maps; in one case, subject of auxiliaries in Figure 6, the greater geographic extent of chdi identified in the Twitter data may plausibly represent real ongoing diffusion that accurately reflects the younger demographic of the Twitter data.
The hierarchy of contexts established by traditional means in (15) above is also largely respected. This can be seen by looking at the estimated smoothed frequencies at various locations, both within and beyond the traditional chdi-region, as shown in Table 2. If the hierarchy in (15) is respected, the values should decrease monotonically from left to right within the traditional chdi-region. This is broadly the case. There are some exceptions for the four most well-established contexts on the left of Table 2, contexts where the change may well have run to completion within the chdi-region. Outside of the chdi-region, we find scattered noise in the estimates for the four southern locations (Aberystwyth, Cardiff, Cardigan and Carmarthen), where chdi is absent from the local varieties. Aberystwyth and Cardiff have significant student populations; Cardiff, as the capital, has other migration from the north, and, furthermore, as an area of language revitalization, has an emerging new variety with some levelling of features from the south, the north and the literary standard. It is therefore not surprising that Aberystwyth and Cardiff have more noise, in the form of higher estimates for chdi, than other southern locations. Finally, Denbigh, in the northeast, shows a similar overall pattern to the northwest, but at rather lower frequencies. As we have seen, this is partly due to lack of data and artefacts of smoothing, although it may also suggest some genuine spread of chdi to this area.
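The monotonicity criterion applied to Table 2 can be made explicit with a small check. The sketch below is not part of the original analysis: the function name, the optional tolerance and the example values are all hypothetical, and the values are not those of Table 2. It lists each adjacent pair of contexts in which the right-hand context has a higher estimated frequency of chdi than the left-hand one.

```python
def hierarchy_violations(frequencies, tolerance=0.0):
    """Return index pairs (i, i + 1) where the frequency rises from left
    to right by more than `tolerance`, i.e. where the expected
    monotonic decrease along the hierarchy is broken."""
    violations = []
    for i in range(len(frequencies) - 1):
        if frequencies[i + 1] > frequencies[i] + tolerance:
            violations.append((i, i + 1))
    return violations

# Hypothetical smoothed chdi frequencies for one location, ordered from
# the leftmost to the rightmost context of the hierarchy.
example = [0.82, 0.85, 0.60, 0.35, 0.10]
print(hierarchy_violations(example))
print(hierarchy_violations(example, tolerance=0.05))
```

A non-zero tolerance reflects the point made above that small reversals among contexts where the change has reached its ceiling need not count against the hierarchy.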

Auxiliary deletion
We turn now to consider a second variable using the same dataset, namely deletion of auxiliary 'be' in pre-subject position.

The variable
Deletion of auxiliary bod 'be' in the present tense is characteristic of all varieties of spoken informal Welsh today. Thus, in the AuxSVO structure in (20), the auxiliary (r)wyt may be reduced or omitted entirely.
(20) ((R)w(y)t) ti 'n chwarae pêl-droed.
be.prs.2sg you prog play.inf football
'You're playing football.'
Such deletion is not restricted to the second person singular and is found with most other pronouns, at least in some varieties. It has even been reported for clauses containing lexical NP subjects (Davies 2010: 270).
Deletion is not dependent on the auxiliary being in absolute clause-initial position: it is grammatical in an embedded AuxSVO clause in (21) and in a main-clause wh-question in (22).
(21) Mae 'n well na(g rwyt) ti 'n meddwl.
be.prs.3sg pred better than be.prs.2sg you prog think.inf
'It's better than you think.'
Auxiliary deletion in the second person singular requires ti rather than chdi as the subject pronoun (99.6% ti in Table 1 above). It thus patterns with its overt counterpart wyt ti 'you are' (100.0% ti), rather than with the independent-use context (68.0% ti). This suggests that auxiliary deletion should be interpreted as involving a real auxiliary that happens to be null, rather than as a construction with an independent pronoun not dependent on any auxiliary. Another argument in favour of this view is that a tag question with a full auxiliary may co-occur with a main clause containing auxiliary deletion (Borsley, Tallerman & Willis 2007: 260-61). If tag questions in some sense involve copying of the auxiliary of the main clause, this would suggest there is a syntactically represented auxiliary in the main clause.
It is agreed in the literature that the availability of auxiliary deletion is conditioned by person and (to a lesser extent) by number (Jones 2004: 101-2; Borsley, Tallerman & Willis 2007: 260-61; Breit 2012). There is a striking degree of variation between dialects in some person-number combinations; for instance, auxiliary deletion in the first person plural is common in the south but very rare in the north. Davies (2010: 258-335) and Davies (2016) investigate auxiliary deletion in the second person singular by 28 Welsh speakers from the Siarad Corpus of spoken Welsh (Deuchar, Webb-Davies & Donnelly 2018), 8 of whom grew up in the south and 20 of whom grew up in the north or with a northern background. Davies (2010: 285, 297) finds an overall frequency of auxiliary deletion of 92.8%, with slightly lower levels in older speakers (84.8% in the over-50 group, compared to 93.8% in the under-30 group) and thus age turns out to be a significant predictor of frequency of deletion. However, even some speakers born in the 1920s have deletion with a frequency approaching 100%, indicating that auxiliary deletion is a historically well-established feature of spoken Welsh. Indeed, Willis (2016) argues that the roots of auxiliary deletion go back to the nineteenth century, where it is used in literary representations of second-language Welsh. While Davies (2010: 293) found women to delete auxiliaries at a slightly higher rate than men, this difference was not significant; nor were there significant differences according to the region in which a speaker spent their first year of life (Davies 2010: 295). 6 The impact of factors other than person and number on the frequency of auxiliary deletion is less well understood. Davies (2010: 303-22) investigated the impact of four factors in the Siarad corpus, failing to find any significant effect for clause type (declarative vs. interrogative), linguality (the presence or absence of code-switching in the clause), or negation.
He found an inverse relationship between deletion of the auxiliary and deletion of an aspect marker in the same clause. 7 Breit (2012) administered an acceptability-judgment task to 20 native speakers, almost all from north Wales. He found negligible degradation of acceptability of deletion in negative and interrogative clauses, in line with Davies' findings that these are not factors conditioning variation. With focus fronting, a factor not investigated by Davies, he found a large reduction of acceptability of deletion in clauses with VP-fronting:
(23) Ffonio 'r gwasanaeth tân ?(wyt) ti.
phone.inf the service fire be.prs.2sg you
'Phoning the fire service you are. / You're phoning the fire service.' (Breit 2012: 83)
Such examples leave the sequence deleted auxiliary + subject pronoun in clause-final position and require deletion of the stressed auxiliary, leaving the unstressed pronoun to form the final phonological word of the sentence. This could be a phonological reason to disfavour deletion. Davies (2010: 323-8) argues that the initial innovation of auxiliary deletion was internally motivated and due to phonological erosion, but that, once it had been innovated internally, it was accelerated by external factors, namely isomorphism between the SV(O) word order that results and the normal word-order pattern in English. This accords well with the nineteenth-century perception of it as a feature of non-native Welsh.
Davies further suggests that dialect differences arose because deletion was favoured in contexts where the auxiliary began with a vowel, thus northern 'dan ni 'we are' resists auxiliary deletion, while the equivalent southern form ŷn ni favours it. Note that both forms are ultimately reductions from Early Modern Welsh yr ydym ni (prt be.prs.1pl we) > rydyn ni > dyn ni > dan ni in the north and yr ydym ni > rydyn ni > rŷn ni > ŷn ni > ni in the south. This account thus amounts to saying that reduction has proceeded further in the south than in the north. The reasons for this differential degree of reduction remain unclear.
The same dataset collected to investigate the geographic distribution of chdi can be used to investigate auxiliary deletion in the second-person singular. In this environment, auxiliary deletion is common in all dialects of Welsh. Previous studies have not identified geospatial variation to date. The first question addressed by the data here is whether Twitter data provide a useful proxy for spoken data. Clearly, Twitter is a written rather than a spoken medium, but the informal register of Twitter data can make it an attractive proxy for much more difficult to obtain transcriptions of spoken language. In this case, we can compare the frequency of auxiliary deletion in tweets with its frequency in spontaneous speech as reflected in the Siarad Corpus. In addition, we can examine whether existing statements about the effect of linguistic and geographic factors on variation are borne out by the Twitter data.

Twitter as a proxy for spoken data
In the Twitter corpus, the global ratio of deletion to non-deletion in the second person singular (including non-localizable tweets) is 1,784: 461 (79.5% deletion). While this is lower than the 92.8% of the Siarad corpus, it suggests a relatively good match between the two methods, especially considering that auxiliary deletion does not occur in standard written Welsh. Results derived from using Twitter data are thus unlikely to be radically different from those using spoken data, although the impact of standard forms within tweets does need to be considered in any analysis, and a gap in absolute values between the two sources will also need to be reckoned with. Further investigation is needed to establish whether the difference between recorded speech and the Twitter corpus is due to the impact of institutional and learner tweeting (both showing influence from the written standard) or whether it is an inherent property of the (ultimately written) Twitter medium even among users aiming to tweet "as they speak".
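The percentage quoted above follows directly from the two counts. As a rough indication of sampling uncertainty, the sketch below also computes a 95% Wilson score interval. The interval is not reported in the original analysis and rests on an assumption that does not strictly hold here, namely that the tweets are independent observations (several tweets may come from the same user). Function names are illustrative.

```python
import math

def deletion_rate(deleted, retained):
    """Proportion of contexts in which the auxiliary is deleted."""
    return deleted / (deleted + retained)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default).
    Assumes independent observations, which is only an approximation here."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

deleted, retained = 1784, 461          # counts given in the text
rate = deletion_rate(deleted, retained)
low, high = wilson_interval(deleted, deleted + retained)
print(f"deletion rate: {rate:.1%}")
print(f"approximate 95% interval: {low:.1%} to {high:.1%}")
```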

Linguistic factors
Linguistic factors examined in the data were clause type (main or subordinate), force type (declarative, interrogative, focus etc.), and polarity (affirmative or negative). All tweets containing a context for auxiliary deletion were coded for these factors. This allows us to examine the effect of several factors discussed previously in the literature, namely declarative vs. interrogative, not found to be significant by Davies (2010) and not found to lead to degradation in acceptability by Breit (2012: 46); focus vs. non-focus, with focus found to lead to degradation in some contexts by Breit; and negative polarity, not found to lead to substantial degradation by Breit. For clause type, clauses were counted as subordinate if they were introduced by a complementizer such as tra 'while', os 'if' or pan 'when', or were relative clauses or embedded wh-answers. Affirmative complement clauses are formally nonfinite in Welsh. It was assumed that clauses like (24) involve deletion of nonfinite bod 'be' (rather than of finite rwyt 'are'). Since existing studies have focused on deletion of finite 'be', such clauses were excluded from the analysis, rather than being included as subordinate.
(24) Dwi 'n gwybod ___ ti 'n ennill.
be.prs.1sg prog know.inf you prog win.inf
'I know you're winning.'
For force type, possible values were declarative VSO clause, declarative focus clause, yes-no question, wh-question, focus question, and conditional ('if'-clauses). Focus fronting (with object fronting) is illustrated in (25).
For polarity, possible values were affirmative and negative. 'Mond 'only' (< dim ond 'nothing but') and methu 'be unable, not be able' were treated as affirmative; heb, when used as the negative of the perfect marker wedi, was treated as negative.
The frequency of auxiliary deletion in each of the syntactic contexts examined is given in Table 3. Rates of deletion varied substantially in the Twitter data from context to context from 92.8% in declarative clauses to 33.3% in focus questions.
The impact of these factors was assessed by implementing a logistic regression model with presence or absence of auxiliary deletion as the binary dependent variable. The results of this model are given in Table 4, with affirmative declarative main clause as the reference level. Positive log odds indicate that a factor level favours auxiliary deletion relative to the reference level, while negative log odds indicate that it disfavours auxiliary deletion relative to the reference level.
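To illustrate how log odds relative to a reference level are read, the sketch below computes a raw, unadjusted log-odds difference from two of the deletion rates quoted above (92.8% in declarative clauses and 33.3% in focus questions). This is only an aid to interpretation: it takes no account of the other factors in the model, so the figure it produces is not a coefficient from Table 4. The function names are illustrative.

```python
import math

def logit(p):
    """Log odds corresponding to a proportion p (0 < p < 1)."""
    return math.log(p / (1.0 - p))

def raw_log_odds_vs_reference(rate, reference_rate):
    """Unadjusted log-odds difference between a level and the reference level.
    Negative values mean the level disfavours deletion relative to the reference."""
    return logit(rate) - logit(reference_rate)

declarative = 0.928      # deletion rate in declarative clauses (from the text)
focus_question = 0.333   # deletion rate in focus questions (from the text)
difference = raw_log_odds_vs_reference(focus_question, declarative)
print(f"raw log-odds difference: {difference:.2f}")
```

The sign of the result carries the interpretation given above: a negative value indicates that the level disfavours auxiliary deletion relative to the reference level.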
The contrast between main and subordinate clauses was not a significant factor in the model (the lower frequency of auxiliary deletion in subordinate clauses reducing largely to the effect of conditionals), while all other factors were significant. All clause types had a significant inhibiting effect on auxiliary deletion as compared to affirmative declarative main clauses. While the effect of negation and conditional clauses was rather small, the effect of interrogative and focus clause type was substantial. The effect of focus agrees with Breit's findings. The effect of interrogative clauses is less expected. It should be noted that Davies's study was based on 648 observations, while the current study is based on 2,245. With a larger and more independent sample size (1,329 Twitter users as against 28 speakers), significant effects are more liable to emerge. The lower overall frequency of auxiliary deletion in the current dataset also means that significant factors are less likely to be hidden by ceiling effects.
To facilitate comparison with traditional work in quantitative sociolinguistics, another presentation of the same model is given in Table 5 using RBrul (Johnson 2009), with the mean of all observations as the reference value and the log odds estimates also transformed into factor weights in the tradition of Sankoff & Labov (1979: 199). Factor weights above 0.5 indicate that a factor level favours auxiliary deletion, and those below 0.5 indicate that it disfavours it relative to the mean probability of deletion over the entire dataset. It is important to bear in mind the fact that the reference level is different in the two  presentations when interpreting differences between them. See Johnson (2009: 359-362) for a discussion of the differences between these two modes of presentation.
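The relation between the two presentations can be sketched as follows, on the assumption that a factor weight is the inverse logit of a log-odds estimate expressed as a deviation from the overall mean (sum contrasts), as in the approach described by Johnson (2009). The function name and the input values are illustrative and are not taken from Table 5.

```python
import math

def factor_weight(centred_log_odds):
    """Convert a mean-centred log-odds estimate into a factor weight
    on the 0-1 scale by applying the inverse logit."""
    return 1.0 / (1.0 + math.exp(-centred_log_odds))

# Illustrative values only: zero maps to the neutral weight 0.5,
# positive log odds to weights above 0.5, negative ones to weights below 0.5.
for log_odds in (-1.0, 0.0, 1.0):
    print(f"log odds {log_odds:+.1f} -> factor weight {factor_weight(log_odds):.3f}")
```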
A second model included interactions between subordination, clause type and polarity. Here, two interactions were significant (at p < 0.05). Negative subordinate clauses significantly decreased the propensity for auxiliary deletion (coefficient -1.622, standard error 0.694, p = 0.019). Negative yes-no questions significantly increased the propensity for auxiliary deletion (coefficient 1.691, standard error 0.683, p = 0.013).
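The p-values reported for the two interactions are consistent with a two-sided Wald test, in which the coefficient is divided by its standard error and the result is referred to the standard normal distribution. The sketch below reproduces that calculation from the figures given above. That the original software used exactly this test is an assumption, and the function name is illustrative.

```python
import math

def wald_p_value(coefficient, standard_error):
    """Two-sided p-value for a Wald z-test of a regression coefficient."""
    z = coefficient / standard_error
    # P(|Z| > |z|) for a standard normal Z equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

# Coefficients and standard errors reported in the text.
interactions = [
    ("negative x subordinate", -1.622, 0.694),
    ("negative x yes-no question", 1.691, 0.683),
]
for label, coef, se in interactions:
    print(f"{label}: z = {coef / se:.3f}, p = {wald_p_value(coef, se):.3f}")
```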

Geospatial distribution
Having looked at the impact of linguistic factors in the global distribution, we turn now to the geospatial distribution of auxiliary deletion. In total, of the 2,245 tweets containing either overt or deleted auxiliary 'be', 1,100 (49.0%) could be localized at 189 distinct localities. The KDE-smoothed distribution is shown in Figure 8. Auxiliary deletion is the majority option everywhere, although there is some regional variation. The highest values, 80-95% are found across the northwest, with slightly lower values (75-85%) in the southwest, and the lowest values in the east (65-75% in the northeast, 65-72% in the southeast).
The success of this result is more difficult to assess than the result for chdi in section 3.3 above, because the reference distribution against which to compare this result is itself not clearly established in the literature. While Davies (2010) found no statistically significant differences according to the region in which a speaker spent their first year of life, the analysis divided Wales into only two regions (north and south) and the speakers investigated were mostly from the north, so it is possible that such differences were missed.
Note: Factor weights that are not statistically significant are given in parentheses.
A partial comparison with data from the SAWD questionnaire is possible, although it too is not ideal for current purposes. Nevertheless, such a comparison reveals rather similar overall patterns. No question in SAWD aimed specifically at testing variation in auxiliary deletion in the second person singular. However, the environment arises fortuitously in nine questionnaire items. Unfortunately, these are all in wh-questions or in subordinate clauses. Furthermore, most of the questions where the environment does arise were asked only in one region (north or south), which makes the data less than ideal for the current comparison. The only relevant item asked in all areas was question 45 ('If you're not happy, don't come.'). No previous study has tested in detail whether a finite subordinate clause of this type is a favouring or disfavouring context for auxiliary deletion. Davies (2010: 316) notes that there are only 4 instances of this context in his materials, with a rate of deletion of 50.0%. This would in principle be a very low rate of deletion and would suggest that this is a strongly disfavouring context for auxiliary deletion. However, as Davies notes, the rarity of the context makes any conclusions difficult to draw. Nevertheless, the possibility that this is a disfavouring context should be borne in mind when interpreting the results.
A KDE plot of the responses for auxiliary deletion in this item (question 45) in SAWD is shown in Figure 9. The overall rate of auxiliary deletion is rather low, at 57.4% of 155 observations. This is either because the questionnaire-based interview favoured its retention or, in line with the discussion above, because the syntactic environment tested is itself one that favours retention. Geographically, we find the highest rates in the north, with 50-75% in the northwest and 50-65% in the northeast; rates are lower in the south, with 20-40% in the southwest, 25% around Cardiff and 10-20% in the central south.
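As a rough illustration of how a smoothed regional rate of this kind can be derived from point observations, the sketch below computes a Gaussian-kernel-weighted proportion at a target location. It is one common approach and not necessarily the procedure behind Figure 9. The function name smoothed_rate, the bandwidth of 20 km, and the toy coordinates and responses are all invented for the example and are not SAWD data.

```python
import math


def smoothed_rate(observations, target, bandwidth_km):
    """Gaussian-kernel-weighted proportion of deletion at `target`.

    observations: iterable of (x_km, y_km, deleted), with deleted equal to 0 or 1.
    target: (x_km, y_km) location at which the rate is estimated.
    """
    weighted_deletions = 0.0
    total_weight = 0.0
    for x, y, deleted in observations:
        dist_sq = (x - target[0]) ** 2 + (y - target[1]) ** 2
        weight = math.exp(-dist_sq / (2 * bandwidth_km ** 2))
        weighted_deletions += weight * deleted
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("no observations carry weight at this location")
    return weighted_deletions / total_weight


# Invented toy data: three 'northern' and three 'southern' responses.
toy = [(0, 100, 1), (10, 95, 1), (5, 90, 0), (0, 0, 0), (10, 5, 0), (5, 10, 1)]
north = smoothed_rate(toy, (5, 95), bandwidth_km=20)
south = smoothed_rate(toy, (5, 5), bandwidth_km=20)
print(f"north: {north:.2f}, south: {south:.2f}")
```

Because each estimate is a weighted average of nearby responses, areas with few observations yield unstable values, which is one reason to read regional percentages from such plots as approximate ranges.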
The north-south distinction is rather similar to that found in Twitter in Figure 8, albeit at lower absolute levels. An east-west division is clear in the SAWD data in the south, as in the Twitter data. This east-west effect is stronger in the Twitter data than in the SAWD data.
A reasonable hypothesis is that the east-west effect is due to the linguistic impact of language revitalization: in the east, closer to England, Welsh is a minority language and more dependent on revitalization efforts and Welsh-medium education to ensure language maintenance. Such a scenario promotes standard forms, reducing the frequency of colloquial options like auxiliary deletion. If so, it is not surprising to see this effect more strongly articulated in the Twitter data, where L2 speakers cannot be fully removed from the dataset. Further investigation is needed to establish whether this interpretation can be substantiated.

Conclusion
This paper has tested the usefulness of social-media data in examining traditional questions in dialect syntax and sociolinguistics. The three central questions considered have been: (i) to what extent do datasets based on Twitter data successfully establish geospatial distributions derived via traditional means? (ii) to what extent can Twitter data successfully derive implicational hierarchies of contexts in the same way as studies based on more traditional materials? (iii) to what extent can written Twitter data act as a proxy for spoken data?
Two case studies have addressed these questions: the distribution of Welsh second person singular pronoun variants dealt with the first two, while the distribution of auxiliary deletion in Welsh dealt with the first and last of these.
We have seen that overall geospatial patterns in the data are similar to those established by traditional means. Thus, for the second person singular pronoun chdi, Figure 3 closely mirrors Figure 1. In the case of auxiliary deletion, where geospatial variation has not been fully established by traditional means, the Twitter data were not out of line with what we know from other sources, and can make a useful contribution to ongoing research when considered alongside those sources.
The overall hierarchy of syntactic contexts that emerges for the second person singular pronoun chdi in section 3.3 turned out to be broadly similar to the existing implicational hierarchy in (15), once ceiling effects due to the continuation of ongoing change were taken into account. That is, in some cases, differences could be attributed to ongoing change, with Twitter typically reflecting a younger demographic. In some other cases, the quantity of Twitter data in the current study was insufficient once KDE smoothing had been applied.
For auxiliary deletion, the effects of internal linguistic factors uncovered in the data contrasted with the general absence of such effects in existing studies. While focus was found to be significant in inhibiting auxiliary deletion, in line with earlier work, both interrogative clause type and, to a lesser extent, negation were also found to inhibit deletion. These effects were found to be robust and based on a substantially larger dataset than existing work. Given the success of the Twitter data elsewhere, these results should feed into our broader understanding of the phenomenon at hand.
Finally, in comparison with data from the spoken Siarad Corpus, Twitter data emerged as a good, but not perfect, guide to spoken usage: while auxiliary deletion in the second-person singular occurs with a frequency of 92.8% in spoken corpora, its frequency in Twitter data was 79.5%. Social-media data in this respect occupy a grey area where the distinction between speech and writing is not so clear.
In many cases, it is striking that very different data-collection methodologies produced very similar results. We have considered written corpus data in Twitter, the spoken corpus data of the Siarad Corpus, and the elicited questionnaire data of the Syntactic Atlas of Welsh Dialects and the Welsh Dialect Survey. These striking similarities may suggest that the choice between questionnaire-based and corpus-based methodologies is not as crucial as might first appear. In any case, the findings here demonstrate the general viability of using Twitter data alongside traditional methods to investigate morphosyntactic variation and change.
Finally, the approach adopted here has implications for theoretical developments in language variation and change. Studies using social-media data have the potential to combine the large scale of dialect atlases with the social and linguistic depth of sociolinguistic studies based around a single community. This opens up the possibility that we may be able to answer questions of a theoretical nature that could not be addressed within a single study, for instance, whether linguistic and social conditioning factors are stable across geographic space. This in turn may inform our understanding of both the processes by which innovations spread and the synchronic analysis of the linguistic phenomena under investigation.

Abbreviations
Glosses follow the Leipzig glossing rules, except for: pred predicate marker.

Additional Files
The additional files for this article can be found as follows: •

Ethics and Consent
Ethical approval for this research was obtained from the Humanities and Social Sciences Research Ethics Committee of the University of Cambridge.