Formal variation in the Kata Kolok lexicon

Sign language lexicons incorporate phonological specifications. Evidence from emerging sign languages suggests that phonological structure emerges gradually in a new language. In this study, we investigate variation in the form of signs across 20 deaf adult signers of Kata Kolok, a sign language that emerged spontaneously in a Balinese village community. Combining methods previously used for sign comparisons, we introduce a new numeric measure of variation. Our nuanced yet comprehensive approach to form variation integrates three levels (iconic motivation, surface realisation, feature differences) and allows for refinement through weighting the variation score by token and signer frequency. We demonstrate that variation in the form of signs appears in different degrees at different levels. Token frequency in a given dataset greatly affects how much variation can surface, suggesting caution in interpreting previous findings. Different sign variants have different scopes of use among the signing population, with some more widely used than others. Both frequency weightings (token and signer) identify dominant sign variants, i.e., sign forms that are produced frequently or by many signers. We argue that variation does not equal the absence of conventionalisation. Indeed, especially in micro-community sign languages, variation may be key to understanding patterns of language emergence.
HANNAH LUTZENBERGER


Introduction
Despite being produced and perceived in distinct modalities, signed and spoken languages show parallels on all levels of linguistic structure.
Much like spoken languages, the lexicons of sign languages are shaped by phonological specifications. Stokoe (1960) showed that signs in American Sign Language (ASL) are compositional, consisting of the parameters handshape, movement and location, which differentiate between signs (phonemic contrasts). It was later recognised that signs are composed of finer feature distinctions which can be analysed by a feature geometry model, just like spoken phonemes (Brentari 1998; van der Kooij 2002). Feature sets are most easily determined through lexical contrast. For example, the sign pairs to-live-in 1 and holiday, and holiday and also, represent minimal pairs in Sign Language of the Netherlands (NGT; Figure 1). The first pair (to-live-in and holiday) differs in handshape features and the second (holiday and also) in location features (van der Kooij 2002: 21 f.). Specifically, to-live-in uses a handshape with thumb and index touching and holiday is produced with all fingers extended; holiday is articulated at the ipsilateral side of the mouth and also at the contralateral side of the chest. Besides these differences, all other features are shared.
The featural organisation of signs allows for comparing pairs of signs in terms of their formational similarity. Minimal pairs can serve as evidence for phonologically distinctive feature values of a language. Sign comparisons have also been widely operationalised to determine form similarities within and across sign language lexicons, in order to assess genealogical relationships (Woodward 1991;1993;McKee & Kennedy 2000;Currie et al. 2002;Xu 2006;Sasaki 2007;Al-Fityani & Padden 2008), and to correlate formational variation with social factors (Bayley et al. 2000;2002;Lucas et al. 2002;Schembri et al. 2009;Fenlon et al. 2013;Siu 2016).
While sign phonology has been extensively documented and analysed, most of this research concerns sign languages used in large urban deaf communities, here referred to as macro-community sign languages 2 (Schembri 2010; Schembri et al. 2018). Among sign languages used within small and tight-knit rural communities, or micro-community sign languages (Schembri 2010; Schembri et al. 2018), the status of phonology is unclear. Initial studies of sign language emergence in these settings show that the form of signs varies greatly across signers and iconicity is the strongest force driving sign formation (Sandler et al. 2011). For example, in Al-Sayyid Bedouin Sign Language (ABSL), a young micro-community sign language of a Bedouin village in Israel, signs to refer to a dog used across signers are motivated by the dog's barking, but signers differ greatly in how the barking is encoded (Sandler et al. 2011: 520). The authors also note that minimal pairs (as illustrated in Figure 1 for NGT) are unattested in ABSL, leading them to conclude that phonological structure has not yet emerged in this language. This raises the fundamental questions of if, how, and when phonological structure emerges in young sign languages. Crucially, micro-community sign languages are distinct from macro-community sign languages with respect to ecological niche, emergence setting and age, leaving multiple viable hypotheses regarding the causes and mechanisms underlying the attested phenomena (de Vos & Pfau 2015). Moreover, the methodological approach to studying variation has focused on the level of the iconic motivation in micro-community sign languages (e.g. Richie et al. 2014; Hou 2016; Horton 2018; Horton & Riggle 2019; Neveu 2019; Reed 2019; Mudd, Lutzenberger, et al. 2020) and on the sub-lexical level in macro-community sign languages (e.g. Bayley et al. 2000; 2002; Lucas et al. 2002; Fenlon et al. 2013; Siu 2016).
1 Following the convention, translation equivalents of signs are represented as glosses in small caps: gloss.
2 The discussion about terminology, especially in the field of sign language emergence, is extensive and ongoing (see for example de Vos & Pfau 2015; or Hou 2016). As laid out in de Vos & Pfau (2015), many determining factors are confounded and, depending on the context of discussion, different and multiple labels may be used to classify a language without being mutually exclusive. For the purpose of this study, we adopt the terminology first suggested by Schembri (2010) within the framework of sociolinguistic typology (Schembri et al. 2018). This terminology is chosen without taking position on the relevant theoretical background and for the sake of simplicity and clarity. Note that these terms are not always used by the authors themselves in previous literature (e.g. Sandler et al. (2011) and Meir et al. (2010) refer to ABSL as village sign language and as emerging sign language, and Kisch (2008) describes the community as shared signing community).
This study charts the formational variation in Kata Kolok (KK), another micro-community sign language that emerged six generations ago in a village community in rural Bali, Indonesia. We examine the extent of variation in the form of signs by analysing responses across 20 deaf KK signers to a picture elicitation task. We combine different techniques that have been used for sign comparisons and introduce a newly developed measure of variation. This method yields a numeric score and incorporates variation across i) iconic motivations, ii) surface realisations, iii) feature differences. A gradient measure is able to unify variation stemming from multiple sources, each of which may vary to differing degrees within the same sign. We argue that especially in studies discussing sign language emergence, both token distributions within the dataset and distribution of variants across participants are fundamental to capturing variation adequately.

Iconicity
Iconicity, defined as the structured mapping between meaning and form, acts as an organisational force across all natural languages (Perniss et al. 2010;Dingemanse et al. 2015). Taub (2001) describes iconicity in sign languages as a process of analogue building between semantic and phonological representations. Certain aspects from or related to a sensory image are first selected, then schematised, and finally linguistically encoded by mapping them onto the articulators, i.e., the signer's hands and body. More specifically, overlap in experiences and sociocultural and linguistic background of language users impacts the construal and the interpretation of iconicity (Wilcox 2004;Occhino 2017;Occhino et al. 2017).
One concept may result in many different iconic mappings. For example, the signs for bird in ASL and Turkish Sign Language (TİD) entail iconic mappings based on different sensory images and consequently, different schematisation and encoding (Figure 2). The ASL sign selects the sensory image 'beak' and maps the signer's hand to the bird's beak, articulating it at the mouth to reflect shared structural and functional traits of a bird's beak and a human's mouth. The TİD sign selects and schematises the wings of the bird, mapping the wings onto the signer's arms, with a flapping motion to represent flying.
Examining phonology and semantics shows that iconic mappings are compositional. Indeed, for the ASL sign bird, iconic mappings can be analysed at the phonological level by examining the  features (i) handshape, (ii) location, and (iii) handshape change. Thus, (i) a handshape with an aperture relation between index finger and thumb relates to the shape of the beak, (ii) the sign is produced at the mouth, reflecting shared properties between the beak and the mouth, (iii) the handshape opens and closes, reflecting the movement of a bird's beak. Both examples provided in Figure 2 show iconic mappings on multiple formational levels, yielding signs with different iconic motivations. Recent research has further suggested that iconic mappings extend beyond single lexemes; certain mappings are recurrent across the lexicon (van der Kooij & Zwitserlood submitted). Naturally, iconic mappings are situated within the cultural context in which a language is used, and/or emerges; thus, cultural knowledge shapes what features are available for selection and mapping.
To sum up, iconicity is a powerful force organising sign language lexicons. Semantic information related to a concept may be linguistically encoded through analogue-building. Properties of or related to the concept provide the raw materials which then become schematised and mapped onto the signer's articulators, creating iconic signs. This process can lead to different forms across languages and within a sign language.

Comparisons of signs
Comparisons of signs have long been a staple in both the field of lexicostatistics, where form similarities across different sign languages are used to establish phylogenetic relations (Woodward 1991;1993;McKee & Kennedy 2000;Currie et al. 2002;Xu 2006;Sasaki 2007;Al-Fityani & Padden 2008) and sociolinguistics, where comparisons of signs within a single language target sociolinguistic variables (Bayley et al. 2000;2002;Lucas et al. 2002;Schembri et al. 2009;Fenlon et al. 2013;Siu 2016). Studies on sign language emergence, and particularly emerging phonology, have also relied on comparisons of signs produced by signers with varying sociolinguistic profiles from the same linguistic community (Israel 2009;Israel & Sandler 2009;Sandler et al. 2011;Morgan 2015). In the following sections, we provide details about the methodological approaches and findings in each of the fields where sign comparisons are frequently used.

Cross-linguistic comparisons
Within lexicostatistics, pairs of signs for the same concept from different macro-community sign languages are compared on different formational parameters (handshape, location, movement, handedness, other), often classifying signs as identical, similar, or different (Woodward 1991; 1993; McKee & Kennedy 2000; Currie et al. 2002; Xu 2006; Sasaki 2007; Al-Fityani & Padden 2008). More recent cross-linguistic comparisons are based on the feature level, using a match/non-match criterion (Yu et al. 2018; Börstell, Crasborn & Whynot 2020) or an adapted Levenshtein distance, also known as the edit distance between two forms (Parks 2011; Omardeen 2018). Such comparisons result in a numeric score rather than a categorical outcome and are thus better suited to measuring differences across sign language lexicons, as they can capture subtle differences in sign formation.
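To illustrate the edit-distance idea mentioned above, the following is a minimal sketch of the standard Levenshtein distance applied to two signs coded as sequences of feature values. It is not the adapted measure of Parks (2011) or Omardeen (2018); the feature labels and the normalisation by the longer sequence are assumptions made purely for illustration.

```python
def levenshtein(a, b):
    """Standard Levenshtein (edit) distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalised_distance(a, b):
    """Edit distance divided by the length of the longer sequence
    (an illustrative normalisation, not taken from the cited studies)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Hypothetical feature codings of two signs for the same concept.
sign_x = ["handshape:flat", "location:mouth", "movement:open-close"]
sign_y = ["handshape:flat", "location:chest", "movement:open-close"]

distance = levenshtein(sign_x, sign_y)
score = normalised_distance(sign_x, sign_y)
```

In this toy coding the two signs differ in one of three feature slots, so the comparison yields a gradient score rather than a categorical identical/similar/different label.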
However, while form similarities in spoken languages have long been taken as evidence for linguistic relatedness, cognates in sign languages may also be due to overlap in iconic mappings stemming from similarities in human experiences that cross-cut cultures. Al-Fityani and Padden (2010) and Börstell and colleagues (2020) show that unrelated sign languages display form similarities. In particular across the basic vocabulary, sign languages show high degrees of formational similarity, possibly due to similar iconic mappings. For example, the sign languages of America, New Zealand and Australia share 19% of the signs from a Swadesh list, including signs like good, bird, cat, child, narrow, red, and sun (McKee & Kennedy 2000). However, integrating iconic motivations into sign comparisons has proven to be a challenge, especially for assessing relatedness, and thus, has not often been attempted. The implementation of a comparison of the iconic motivation before the parametric comparison as suggested by Xu (2006) still overestimates phylogenetic relations (Su & Tai 2009). It is possible that, as suggested by Ebling et al. (2015), a more fine-grained analysis of image-producing techniques and underlying motivation is needed. However, no successful method has been made available at this point. Summing up, cross-linguistic comparisons quantify form similarities based on parameter or feature overlap (Woodward 1991;1993;McKee & Kennedy 2000;Currie et al. 2002;Xu 2006;Sasaki 2007;Al-Fityani & Padden 2008;Parks 2011;Omardeen 2018;Börstell et al. 2020). Cross-linguistic similarities may be due to shared iconic motivations or historic relatedness, with iconicity providing a complicating factor in phylogenetic analyses. The few studies that take iconicity into account are limited in their capacity to scale-up and lack a numeric outcome (Xu 2006;Su & Tai 2009;Ebling et al. 2015).

Sociolinguistics
Comparisons of signs within the same language are frequently conducted to study sociolinguistic variation. Early work on signing varieties of Mexico and Costa Rica compares signs from word lists and dictionaries to determine dialectal variation (Bickford 1991;Woodward 1991). More recent studies are more specific about how linguistic and sociolinguistic factors affect sublexical variation, in particular age, region, and gender. Using large-scale datasets with signers of diverse demographics, these studies analyse the parametric deviation from a citation form using multivariate analyses (Bayley et al. 2000;2002;Lucas et al. 2002;Schembri et al. 2009;Fenlon et al. 2013;Siu 2016). For example, Bayley and colleagues (2000) find an effect of age by region in the use of the three variants of the ASL sign deaf (Figure 3) alongside a tendency for the third pictured variant (continuous contact) to occur in compounds. In studies examining variation within a language, sign comparisons are either concerned with a class of signs sharing a specific parameter (e.g. the extended index finger handshape in Bayley et al. 2002;Schembri et al. 2009; or the location parameter in Fenlon et al. 2013;Siu 2016;Lucas et al. 2002) or a class of signs that share the same meaning (e.g. Bayley et al. 2000). Iconicity, however, is not commonly considered, and does not appear as a major factor in the comparisons.
To conclude, comparative work on signs within a single macro-community sign language does not primarily quantify form similarities or engage with iconicity. Sociolinguistic studies analyse large samples to identify which sociolinguistic variables influence formational variation in terms of deviance from a citation form (Bayley et al. 2000;2002;Lucas et al. 2002;Schembri et al. 2009;Fenlon et al. 2013;Siu 2016). Nevertheless, the sociolinguistic determinants of the choice of specific variants in interaction remain to be investigated.

Emerging phonology
To examine phonological structure in young sign languages, researchers have employed picture-elicitation tasks, comparing how the same concept is encoded across different signers. The variants produced are then compared across signers to establish the degree of lexical convergence. To date, two studies have examined emerging phonological structure; Sandler and colleagues (Israel 2009; Israel & Sandler 2009; Sandler et al. 2011) examine ABSL, a young sign language of a rural community in Israel, and Morgan (2015) studies Kenyan Sign Language (KSL), a young sign language that arose in a deaf school in urban Kenya. These studies have suggested that in young sign languages, signers first converge on the same iconic image before aligning the exact phonological form (Sandler et al. 2011; Morgan 2015). As we are interested specifically in sublexical variation, studies investigating lexical variation in other small signing communities are not discussed here in detail. Nevertheless, these studies overlap with the present one in some methodological aspects, in particular the role of iconic motivations (Richie et al. 2014).
In a cross-linguistic study, Sandler and colleagues compare the systematicity across ten signers of ABSL, Israeli Sign Language, and ASL in the form of signs on a picture-elicitation task (Israel 2009; Israel & Sandler 2009; Sandler et al. 2011). Monomorphemic responses are analysed (these constitute 43% of the data), with only the most frequent iconic motivation considered. Comparing responses within the same sign language and then across languages, they find that ABSL, the young micro-community sign language, has the lowest degree of convergence, and ASL, the older macro-community sign language used across the US, the highest. Variation is measured on the phonological feature level using two metrics: mode, the count of the most frequent feature in a set, and number of variants, the number of features occurring in a set of sign responses with shared iconic motivation. Sandler et al. (2011) conclude that ABSL signers strive for a holistic iconic motivation rather than a compositional sign. However, it is difficult to know how far this outcome is affected by their decision to focus on monomorphemic responses, and to exclude competing variants with different iconic motivations.
Morgan (2015) also uses a picture-based elicitation task to examine the emerging lexicon of KSL. Within the KSL lexicon, the degree of conventionalisation was found to vary from sign to sign, leading Morgan (2015) to hypothesise that signers first converge on an iconic motivation (or conceptual target) before aligning their phonology. Across the 20 deaf KSL signers sampled, items like salt elicit uniform lexical responses while others such as island elicit highly descriptive and probably idiosyncratic responses, similar to ABSL. Morgan posits that during the process of conventionalisation, signs may either converge on a single iconic motivation (e.g. how the fruit is eaten in guava) or stabilise as a compound (e.g. red^tiny for beans) before aligning in phonological form (Morgan 2015).
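The two feature-level metrics described for the ABSL comparison, mode and number of variants, can be sketched as simple counts. This is one possible reading of those metrics rather than the original authors' implementation; the handshape labels and the ten responses below are invented for illustration.

```python
from collections import Counter


def mode_and_variants(values):
    """Return (mode, number of variants) for one coded feature.

    mode: how often the most frequent value occurs in the set.
    number of variants: how many distinct values occur in the set.
    """
    if not values:
        return 0, 0
    counts = Counter(values)
    return max(counts.values()), len(counts)


# Hypothetical handshape codings from ten signers responding to one item,
# all sharing the same iconic motivation.
handshapes = ["B", "B", "5", "B", "C", "B", "5", "B", "B", "B"]

mode, n_variants = mode_and_variants(handshapes)
```

Under this reading, a high mode together with a low number of variants indicates convergence among signers, whereas a low mode with many variants indicates the opposite.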
Summing up, sub-lexical variation has often been connected to the age of a sign language and its conventionalisation. As observed in KSL and ABSL, signers appear to conventionalise iconic mappings despite large amounts of feature-level variation. The fact that both of these sign languages show variation on the level of the iconic motivation and the sub-lexical level indicates that variation does not materialise on a single level.

Kata Kolok

Demographic sketch
The sign language Kata Kolok (KK) emerged spontaneously in a village community in North Bali, Indonesia, due to sudden and sustained incidences of hereditary deafness (de Vos 2012; Marsaja 2008). Deafness was propagated due to geographical isolation and consanguineous marriage patterns within the labour-intensive community (Friedman et al. 1995). KK has been passed on throughout at least six generations of deaf signers (Friedman et al. 1995; de Vos 2012). At present, 33 deaf signers from generation three through six reside in the village permanently, their ages ranging from three years (generation six) to their ~80s (generation three) (Lutzenberger in press). 3 Communal living in family compounds with shared religious, social and cultural practices has led to a high proportion of hearing villagers with signing skills (Marsaja 2008).
The tight-knit community of roughly 3,000 people is socially and geographically structured into ten clans for which membership is determined by birth (Lutzenberger in press). Deaf people have been born into all ten village clans (Marsaja 2008). Following a patrilineal tradition, women transition to their husband's clan through marriage. Within clans, intergenerational households are the norm. This results in tight family bonds where younger generations care for older generations, and childcare becomes a shared task, involving older generations.
The professional landscape of the villagers has long centred around subsistence farming, raising livestock, day labour, or running small local businesses. Recent advances in technology and mobility continue to affect the community's demographics. There is increasing employment within government-supported jobs such as constructing infrastructure or tourism. As a result, villagers may seek job opportunities outside the village and may even relocate to more densely populated areas. This also creates new opportunities for younger deaf villagers who increasingly attend school, or even grow up, in more urban parts of Bali where a variety of Indonesian Sign Language (BISINDO) is used (Lutzenberger in press).

Sketch of the phonology and the lexicon
KK is a sign language isolate, having developed without influence from other sign languages (Marsaja 2008; Perniss & Zeshan 2008; de Vos 2012). Moreover, the surrounding spoken languages Bahasa Indonesia and Balinese do not seem to have strongly influenced the structure of KK (Marsaja 2008; Perniss & Zeshan 2008; de Vos 2012). Evidence for this is, for example, the virtual absence of mouthing, i.e., the conventional pairing of manual signs with the imitation of a spoken word, which is a prominent feature in many macro-community sign languages (Bank 2015). In KK, mouthing has been observed only with limited vocabulary and by specific deaf signers; e.g., the word kopi alongside the sign coffee or apa accompanying the sign what. Increasing contact with BISINDO, especially among younger signers (Moriarty 2020) who have also received basic education, may trigger increased presence of mouthing and occasional lexical borrowings in specific conversational settings, for specific topics, and with specific interlocutors.
Low conventionalisation is found in core domains of the KK lexicon, paralleling other micro-community sign languages across the world (e.g. Washabaugh 1986; Nyst 2007; Schuit 2014). Both colour and kinship terms show a limited paradigm of lexicalised signs in KK. KK uses the four lexicalised colour signs white, black, red, and grue (de Vos 2011) and the three lexicalised kinship signs mother, father, and offspring (de Vos 2012).
High variation in the KK lexicon is partly influenced by social factors. Mudd et al. (2020) analyse the first target sign in response to a picture elicitation task of 36 common concepts from 20 deaf and 26 hearing KK signers to determine the effect of social factors on sign variation. Using measures for lexical distance and neighbourhood density, they find that gender and hearing status, but not other social variables, such as generation, may predict the use of specific signs (Mudd, Lutzenberger, et al. 2020). This contrasts with studies on macro-community sign languages that often find that age and region strongly predict the choice of sign variants (e.g. Bayley et al. 2002;Stamp et al. 2014).
Neither the lexicon nor the phonology of KK has been studied in detail. Marsaja (2008) summarises basic building blocks of KK in terms of handshapes, locations, and movement features attested in his corpus. However, their status as phonetic or phonologically contrastive features remains unclear. De Vos (2012) revisits Marsaja's classification and suggests some modifications in terms of frequency or discrepancy of a few handshapes. Crucially, KK uses an extended signing space, reflected in some sign locations that are infrequent or even unattested in other sign languages, e.g. the hip or the teeth (Marsaja 2008; de Vos 2012). Lutzenberger (2018) finds major cross-linguistic differences in the phonological characteristics of name signs, a particular group of signs attributed to individuals, between KK and NGT with respect to the use of locations, nonmanuals, and specific handshapes. Similar to preliminary characteristics of KK features and the atypical use of space (Marsaja 2008; de Vos 2012), these findings indicate typologically distinct patterns in KK's phonology.
To sum up, KK shows typological differences to other sign languages in terms of various aspects of its phonology and lexicon, and high degrees of variation in the lexicon.

Present study
Studies investigating spoken languages and macro-community sign languages often treat variation as a rich resource to answer questions about linguistic diversity, inter-speaker variability, and linguistic landscapes of communities. However, studies on sign language emergence typically focus on convergence between signers rather than variation. In this study, we aim to reach a better understanding of the phonological properties of the lexicon in KK through examining variation. To that end, we ask the following question: what is the variation in the form of signs found across 20 deaf KK signers in response to the same picture prompts as in Mudd et al. (2020)? We combine methods from lexicostatistics and previous work on emerging phonology in order to analyse the present state of the lexicon in KK. Systematic feature-based comparisons used in lexicostatistics have not previously been adapted to data from micro-community sign languages, and common analyses used in sociolinguistics are unsuitable for micro-community sign languages as it is hard to define a standard variant, as explained by Mudd et al. (2020). The measure introduced in this study combines techniques of all those methods, taking into account all relevant signs in a response from a large-scale sample of 60% of the deaf adult KK signers.

Participants
The sample in this study comprises the same 20 deaf signers (11 female) as in Mudd et al. (2020), all of whom permanently reside in the village and use KK as their primary mode of communication. Participants are sampled from generation three (ages ~65-80) through five (ages ~18-35) (see Table 1), as there are no deaf generation two (ages ~80+) signers alive and deaf generation six (ages ~2-5) signers are young children. We also sampled so as to maximise diversity in the socio-demographic profiles of the signers. 4

Stimuli
Due to the low literacy rate among our participants, we used picture stimuli. Participants were shown 36 pictures of common objects, spanning the semantic domains food, animals, colours, praying, miscellaneous (see Appendix A for a full list). Pictures were either taken in the field during previous field trips, or found on the internet (for materials see Lutzenberger et al. 2018). Stimuli selection was informed by the authors' linguistic and cultural knowledge of the community. In addition, as the task was originally administered to both deaf and hearing participants (Mudd, Lutzenberger, et al. 2020), stimuli were selected to match the knowledge of hearing people with varying degrees of signing fluency. In order to get a broad sample of signs with different degrees of conventionalisation, stimuli were selected based on the expected lexical variation for each item (high, medium, low). These classifications drew on insights from the KK Corpus, consulting local deaf research assistants and language knowledge of the first author, a fluent KK signer.

Procedure
Data was collected during a field trip in November, 2018. Recording took place in an empty room, and the task was administered by a local deaf research assistant. Before each session, the task was explained to the participants and consent was obtained through signature or thumbprint. The research assistant, who was instructed to act as interlocutor, sat opposite the participant and independently navigated the participant through the stimuli one-by-one on a laptop. The entire session was videotaped using a Canon Legria HF G26 camera at 25 frames per second.
The present set of stimuli was embedded into a larger set of picture- and object-based elicitations. All participants had previously participated in this kind of elicitation and were thus fairly familiar with the task. For this reason, instructions provided to the participant were minimal and mostly delivered during the informed consent by using fictive examples while explaining the procedure. The research assistant invited the participant to respond to a picture either by directing the participant's attention to the screen through eye gaze alone, or by using a short signed construction that translates into 'What is this?'. 5

Coding
The recorded responses were annotated using ELAN (Crasborn & Sloetjes 2008; ELAN [Computer software] 2020) and glossed using the KK dataset in the lexical database Global Signbank (Crasborn et al. 2018; Lutzenberger 2020). All coding was done by the first author, including annotations for activity of both hands of the signer, facial expressions, the stimulus item, any signing of the research assistant, the response type (descriptive, naming, example, etc.), and chunks directly related to the picture stimulus (Figure 4). In cases where the research assistant named the stimulus before the participant, all subsequent signs were marked as primed (pgloss). Immediate and exact repetitions, i.e., cases where the next sign produced by the participant has the same surface realisation as the sign produced by the research assistant, were marked as xgloss and excluded from the analyses.
4 For more information on the participants see Mudd et al. (2020).
Every sign produced in response to a picture stimulus was coded for form in order to preserve minimal formational differences (phonological or phonetic). The Global Signbank dataset was edited alongside the annotation process, creating a new entry whenever there was no entry for a specific sign variant yet. Sign variants were grouped by shared iconic motivation using numbers, e.g., pig-1 and pig-2. Different realisations of the same iconic motivation are marked with capital letters, e.g., cow-1a, cow-1b, cow-1c. Since an analysis and in-depth discussion of iconic patterns and associated mappings is beyond the scope of this paper, we provide a list of iconic motivations for signs in this study in Appendix B and refer to Mudd et al. (2020) for further discussion of patterned iconicity in the same dataset. In order to provide a measure for coding reliability, the first author re-coded the iconic motivation for 11% of the items (N = 4; first two and last two items), as well as the feature coding of surface realisations for 10% of the produced iconic motivations (randomly sampled). Intra-coder reliability is reported as percentage of overlap and, where available, Cohen's Kappa: iconic motivation (95% overlap; Kappa = 0.94), chunks of target responses (95% overlap; Kappa = 0.909), gloss of surface realisations (97% overlap), and feature coding (94.6% overlap; Kappa = 0.943). Videos and ELAN transcriptions are archived in The Language Archive.
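The two reliability statistics reported above can be computed as in the following minimal sketch. It uses the textbook definitions of percentage overlap and Cohen's Kappa for two coding passes over the same items; the gloss labels are invented and the sketch does not reproduce the actual reliability data of this study.

```python
from collections import Counter


def percent_overlap(first, second):
    """Proportion of items that received the same label in both passes."""
    if len(first) != len(second) or not first:
        raise ValueError("both passes must code the same, non-empty item set")
    return sum(a == b for a, b in zip(first, second)) / len(first)


def cohens_kappa(first, second):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(first)
    p_observed = percent_overlap(first, second)
    counts_first, counts_second = Counter(first), Counter(second)
    # Chance agreement: product of the marginal proportions per label.
    p_expected = sum(counts_first[k] * counts_second[k]
                     for k in counts_first) / (n * n)
    if p_expected == 1:
        return 1.0  # both passes used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical original coding and re-coding of four responses.
first_pass = ["pig-1", "pig-1", "pig-2", "pig-2"]
second_pass = ["pig-1", "pig-1", "pig-2", "pig-1"]

overlap = percent_overlap(first_pass, second_pass)
kappa = cohens_kappa(first_pass, second_pass)
```

For these toy codings, three of four labels match (75% overlap), while Kappa is lower (0.5) because part of that agreement is expected by chance.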

Analyses
Unlike previous work (Israel 2009; Israel & Sandler 2009; Mudd, Lutzenberger, et al. 2020), we analyse all sign tokens naming or directly describing the picture and exclude only those parts of answers that provide additional information. Responses were classified as off-target in case the stimulus was not recognised (correctly), and when signers provided lengthy explanations, personal narratives, descriptions of details and elaborations without explicitly referring to the target. In the following sections, we describe the analyses in detail. First, we explain the general idea and workings behind how variation may be measured. Then, we discuss how frequency can affect the proposed measuring and how it can be weighted to gain a more nuanced measure.
Figure 5: Variants of pig include different iconic motivations, marked by a number, namely killing a pig (pig-1) and a pig eating (pig-2). Variants of cow share the same iconic motivation of the cow's horns but show different surface realisations, marked by a number followed by a capital letter (cow-1a, cow-1b, cow-1c).

Measure of variation
In order to address the question as to how to quantify formational variation in our dataset, we developed a new measure of variation. This measure integrates three levels inherent to a sign (iconic motivation; surface realisation; feature differences; Figure 6), and results in a numeric and gradient score, the variation index.
Iconic Motivation (IM) functions as a grouping criterion for sign variants. As pointed out by Israel (2009), it does not make sense to compare the form of signs that do not share the same iconic motivation as no form similarity is expected. Returning to the example of sign variants for pig and cow provided in Figure 5, we can identify different iconic mappings: the cow's horns in cow-1, killing a pig in pig-1, and how a pig eats in pig-2. Each is scored as 1 on the level of IM for the variation index. IM can be used independently as an accumulative count of unique iconic motivations occurring in response to a given stimulus. In Figure 6, accumulative IM yields a score of 2 for pig as responses include two iconic motivations, namely pig-1 and pig-2, and a score of 1 for the item cow.
Surface Realisation (SR) is a count of the different surface realisations with the same iconic motivation. In Figure 5, both variants for pig, pig-1 and pig-2, have a single surface realisation, each resulting in a score of 1 for SR. Among the cow-1 signs, however, cow-1a, cow-1b, and cow-1c are three different surface realisations of the same iconic motivation. The SR for cow-1 is thus 3. Note that the relationship between IM and SR is minimally one-to-one; each iconic motivation occurs with a minimum of one SR.

Feature Differences (FD) counts the number of features that differ across the attested surface realisations. In the case of the three cow-1 variants pictured in Figure 5, all differences concern the configuration of the hands; specifically, the curvature of the fingers and thumb extension of the strong hand and the weak hand. The FD in cow-1 is therefore 2.
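The two counts can be sketched in Python as follows. The sketch assumes that each surface realisation is stored as a mapping from feature name to feature value; the feature names and values below are hypothetical stand-ins and do not reproduce the 14-feature coding scheme of this study.

```python
def count_levels(realisations):
    """Return (SR, FD) for the surface realisations of one iconic motivation.

    `realisations` maps a gloss such as "cow-1a" to a dict of
    feature name -> feature value (an assumed representation).
    """
    if not realisations:
        raise ValueError("at least one surface realisation is required")
    bundles = list(realisations.values())
    # SR: number of distinct feature bundles.
    distinct = {tuple(sorted(bundle.items())) for bundle in bundles}
    sr = len(distinct)
    # FD: number of features whose value is not shared by all realisations.
    feature_names = set().union(*bundles)
    fd = sum(
        1
        for name in feature_names
        if len({bundle.get(name) for bundle in bundles}) > 1
    )
    return sr, fd


# Hypothetical coding of the three cow-1 realisations: only finger
# curvature and thumb extension differ, location is shared.
cow_1 = {
    "cow-1a": {"curvature": "straight", "thumb": "extended", "location": "head"},
    "cow-1b": {"curvature": "curved", "thumb": "extended", "location": "head"},
    "cow-1c": {"curvature": "curved", "thumb": "closed", "location": "head"},
}

sr, fd = count_levels(cow_1)  # three realisations, two differing features
```

Note that FD counts each mismatching feature once, regardless of how many realisations diverge on it, which is why SR and FD can vary independently.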
Overall, this measure provides the opportunity to compare different levels as well as different sign variants. Within one item, we can compare variation across different iconic motivations based on the following formula:

\[ \text{variation index} = \frac{IM + SR + \frac{FD / 14}{SR}}{3} \]

A variation index is calculated per iconic motivation, hence IM is the constant 1; feature differences are normalised by dividing FD by the constant 14 for the 14 available features and by the number of surface realisations SR; the sum is divided by three for the three levels (IM, SR, FD). For example, cow-1 received a score of 1 for IM, a score of 3 on SR, and a score of 2 on FD. Each of these levels of variation is weighted equally, hence (1 + 3 + (2/14)/3)/3 = 1.35. Another fictitious sign variant cow-2 with a different iconic motivation and a single surface realisation results in a variation index of 0.67 (see Table 2). The difference in variation between the two cow-variants is reflected in their variation indices: cow-1 shows a higher score, i.e., more variation, than cow-2. Note that a variation index of 0.67 is the minimal value of any sign variant with a single surface realisation because of the one-to-one relationship between iconic motivation and surface realisation. We will refer to this baseline as no variation hereafter and elaborate in Section 6 on some implications of the fact that the absence of variation may be caused by a hapax or extreme uniformity.
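The worked example can be sketched in Python. The function below follows the arithmetic of the cow-1 example, (1 + 3 + (2/14)/3)/3; the function and parameter names are hypothetical, and the sketch is an illustration rather than the script used for the analyses.

```python
N_FEATURES = 14  # number of available features in the coding scheme


def variation_index(sr, fd, im=1, n_features=N_FEATURES):
    """Unweighted variation index for one iconic motivation.

    im: iconic motivation level (the constant 1 per iconic motivation)
    sr: number of surface realisations (at least 1)
    fd: number of features that differ across the surface realisations
    """
    if sr < 1:
        raise ValueError("an iconic motivation has at least one surface realisation")
    if not 0 <= fd <= n_features:
        raise ValueError("feature differences must lie between 0 and n_features")
    normalised_fd = (fd / n_features) / sr
    return (im + sr + normalised_fd) / 3


cow_1 = variation_index(sr=3, fd=2)  # rounds to 1.35
cow_2 = variation_index(sr=1, fd=0)  # rounds to 0.67, the "no variation" baseline
```

Because SR enters the sum directly while FD is scaled down by both 14 and SR, the number of surface realisations dominates the score and feature differences act as a fine-grained adjustment.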
Variation indices may be calculated per iconic motivation and account for the formal variation within an item at both the level of surface realisations and feature differences. Note that this measure focuses on the forms that occurred in the data. It does not take into account the linguistic environment of specific forms, any aspects of the participant/social factors, or the frequency of a sign variant. The following section discusses how frequency can be woven into the measure.

Weighted variation index
As a next step, we further develop the variation index in order to increase the ecological validity of the comparison among different sign variants. First, we combine the variation index with token frequency in the dataset. Second, we combine the variation index with the number of signers producing a specific variant, i.e., scope of use across the population. In other words, we apply weight by frequency, once token-based and once signer-based.
Token-based weighting aims to relativise the variation index in terms of reflecting high frequency or idiosyncratic sign variants. In the scope of this study, token frequency refers to sign variants produced within an item or pooled across all items. A token-weighted variation index is calculated as the product of the proportion of each sign variant and the unweighted variation index explained in the preceding section:

\[ \text{token-weighted variation index} = \frac{\text{tokens of sign variant}}{\text{total tokens}} \times \text{variation index} \]

First, we calculate token-weighted variation indices within items. We obtain the proportions for each iconic motivation as the number of tokens of an iconic motivation/surface realisation within an item divided by the total number of tokens of all iconic motivations/surface realisations of the particular item. As a fictitious example, cow-1 has been produced in 20 of 30 tokens and cow-2 in 10 of 30 tokens in cow. Thus, cow-1 accounts for 67% and cow-2 for 33% of the 30 sign tokens. The token-weighted variation index for cow-1 then is 0.9 (1.35*0.67) and for cow-2 0.2 (0.67*0.33). Comparing the token-weighted variation indices more accurately captures the idea that cow-1 shows more variation while also being more frequent than cow-2.
Second, we move beyond the item by deriving proportional values for each sign variant from the entire dataset (tokens of sign variant/total tokens). Following the same approach, proportional values for iconic motivations or surface realisations are multiplied with an updated unweighted variation index. Updating the unweighted variation index is necessary as some sign variants may feature in responses to multiple picture stimuli, e.g., sign variants for red are frequently produced in response to the picture stimulus dragon fruit (pitaya) as well as the colour red. Now, sign variants can be compared freely to each other, allowing us to fully capture the variation in the dataset.
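Both steps of token-based weighting use the same computation and differ only in the denominator: the tokens of one item, or the tokens of the entire dataset. A minimal Python sketch, with hypothetical function and variable names, is given below. The within-item values come from the fictitious cow example; the dataset-level call uses the dog-1a figures reported in the results (24 of 1,739 tokens, unweighted index 2.03) and assumes that the unweighted index for dog-1 needs no updating.

```python
def token_weighted(unweighted_index, variant_tokens, total_tokens):
    """Weight a variation index by the variant's share of tokens."""
    if total_tokens <= 0 or not 0 <= variant_tokens <= total_tokens:
        raise ValueError("token counts must satisfy 0 <= variant <= total, total > 0")
    return unweighted_index * variant_tokens / total_tokens


# Within an item: fictitious cow example with 30 tokens in total.
cow_1_within = token_weighted(1.35, variant_tokens=20, total_tokens=30)
cow_2_within = token_weighted(0.67, variant_tokens=10, total_tokens=30)

# Across the dataset: proportion relative to all 1,739 tokens.
dog_1a_pooled = token_weighted(2.03, variant_tokens=24, total_tokens=1739)
```

Since any single variant accounts for only a small share of the pooled tokens, dataset-level scores are much smaller in absolute terms than within-item scores; only their relative size is informative.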
However, token frequency does not account for a single signer producing a certain sign variant several times. We gain a more nuanced insight into how widespread specific sign variants are across our participant pool when weighting by attested variants per signer. Here, we count how many of the 20 signers produce a specific sign variant in each item and then multiply the fraction by the unweighted variation index:

\[ \text{signer-weighted variation index} = \frac{\text{signers producing sign variant}}{20} \times \text{variation index} \]
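As a minimal Python sketch of signer-based weighting, the function below counts each signer at most once per sign variant, no matter how many tokens that signer produced. The function name and the signer identifiers are hypothetical; the example values mirror the dog-1 figures reported in the results (unweighted index 2.03, dog-1a produced by 14 of 20 signers).

```python
from collections import defaultdict


def signer_weighted(tokens, unweighted_index, n_signers=20):
    """Weight a variation index by the share of signers using each variant.

    `tokens` is an iterable of (signer_id, variant) pairs; repeated
    productions by the same signer are counted only once.
    """
    signers_per_variant = defaultdict(set)
    for signer_id, variant in tokens:
        signers_per_variant[variant].add(signer_id)
    return {
        variant: unweighted_index * len(signers) / n_signers
        for variant, signers in signers_per_variant.items()
    }


# Hypothetical token list: 14 signers produce dog-1a, two of them twice.
tokens = [(f"S{i:02d}", "dog-1a") for i in range(1, 15)]
tokens += [("S01", "dog-1a"), ("S02", "dog-1a")]  # repetitions, ignored
tokens += [("S15", "dog-1b"), ("S16", "dog-1b"), ("S17", "dog-1d")]

scores = signer_weighted(tokens, unweighted_index=2.03)
```

Storing signers in a set is what separates this weighting from the token-based one: a variant repeated many times by one signer gains nothing here.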

Descriptive results
The data yielded a total of 1,739 relevant sign variants (151 iconic motivations) to refer to 35 stimuli that entered the analyses. Due to confusability in the picture stimulus, we excluded responses to salt altogether; the picture showed two different packages of salt, leading to misinterpretations of the picture and some signers creating contrast and others providing general descriptions. In all other items, off-target responses (n = 24) and immediate and exact repetitions of signs prompted by the research assistant (n = 21) were excluded. Note that we deal with the form of signs and their token frequency in the data obtained in this experiment only. Characteristics of the participants, e.g., individual verbosity or correlations of idiosyncratic variants, are not under investigation here. Initial analyses of selected social variables have been addressed in Mudd et al. (2020).

[...] cooker (see Figure 7).6 This broad range may indicate different degrees of conventionalisation in sign variants. Equally possible is a more general problem inherent to the nature of picture elicitations, namely differing degrees of precision when naming or describing the picture stimuli, or both.

Variation index
Variation indices are calculated for each iconic motivation in each item. Figure 8 plots all signs in the analysis and their respective variation indices by item. The majority of signs show a variation index between 1 and 1.5 (mean = 1.43; sd = 0.63; range = 0.67-3.03). Signs near the y-axis show no variation (variation index 0.67), mostly due to a single surface realisation. Signs with a variation index higher than 2 are infrequent.

Footnote 6: Note that the number of iconic motivations elicited per item differs from the one reported in Mudd and colleagues (2020) as this study is based on the full response rather than a single target sign per signer for each stimulus.

Variation occurs to different degrees on different levels. Stimuli that are represented by a single iconic motivation such as chicken, dog, and cow show different variation indices caused by variation on the surface and feature level (Figures 8 and 9). chicken-1 has three different surface realisations of the same iconic motivation while there are five for dog-1 and seven for cow-1 (Figure 9). The three surface realisations of chicken-1 vary in three features, the five surface realisations of dog-1 are a result of differences in six features, and the seven surface realisations of cow-1 differ in four features. Taken together, these differences are reflected in different variation indices for chicken-1 (1.36), dog-1 (2.03), and cow-1 (2.68). These examples clearly show that items eliciting a single iconic motivation do not necessarily yield similar degrees of variation on the other levels.
For items eliciting multiple iconic motivations such as camera, variation indices do not distribute evenly across all iconic motivations as sign tokens split into more iconic motivations (Figure 8). Three of the six iconic motivations in response to the picture stimulus camera, bright-1, camera-1, and camera-4, occur with a single surface realisation, resulting in no variation (variation index 0.67). The other three iconic motivations show different variation indices: camera-2 has two realisations that differ in seven features (variation index 1.10), camera-3 has six realisations with nine feature differences (variation index 2.37), and camera-5 has three variants varying in five features (variation index 1.37) (Figure 10). Once again, this shows that (i) variation emerges on different levels that may vary independently from each other (i.e., fewer surface realisations do not always mean fewer feature differences), and (ii) some iconic motivations show no variation.

The variation indices reported until now quantify variation in terms of unique iconic motivations. However, the more responses feature a certain iconic motivation, the more opportunity it has to show variation. Sign variants that are produced only once can only have a single surface realisation, resulting in no variation, while sign variants that occur n times could have maximally n different surface realisations and could differ in up to 14 features. Token frequency shapes how much variation can be attested in a given dataset, and indeed, variation indices correlate positively with frequency (r = 0.69, p < .001): sign variants that have been produced more often tend to have higher variation indices.

Weighted variation
In the following, we report weighted variation indices, first taking into account token frequency, and then the number of signers producing specific sign variants as laid out in Section 3.5.2.
The token-weighted variation index within an item shows different effects on stimuli eliciting a single iconic motivation or multiple ones. For items such as chicken, dog, and cow that elicited a single iconic motivation, weighting results in identical scores as the weighting factor is 1: the token-weighted variation index remains at 1.36 for chicken-1, at 2.03 for dog-1 and at 2.68 for cow-1 (Figure 11). For items that elicited more iconic motivations such as camera, the token-weighted variation index highlights differences: bright-1 (token-weighted 0.04; unweighted 0.67), camera-1 (token-weighted 0.09; unweighted 0.67), camera-2 (token-weighted 0.12; unweighted 1.10), camera-3 (token-weighted 1.0; unweighted 2.37), camera-4 (token-weighted 0.05; unweighted 0.67) and camera-5 (token-weighted 0.29; unweighted 1.37), reflecting the amount of variation in light of how much opportunity a sign variant has to vary. camera-3 remains the variant with the highest token-weighted variation index among the iconic motivations in this item, yet the variation is now proportional to the high number of signs corresponding to this iconic motivation (Figure 11). This facilitates comparisons of variation across iconic motivations from the same item. In short, a token-weighted variation index approximates the actual productions in the data better than the unweighted variation index and thus improves how the scores reflect the observed variation.
Treating the data as a mini-corpus, i.e., without grouping by item, the token-weighted variation index identifies a tendency for dominant variants. A large part of the data clusters at low(er) token-weighted variation indices, indicating no dominant variants (see Appendix C for all datapoints with variation), yet dominant surface realisations become apparent through a considerably higher token-weighted variation index. Figure 12 provides only a snippet of the data for the sake of readability, showing surface realisations elicited in three selected items with different degrees of variation: tea, dog, and camera. tea-variants cluster together, showing a wider distribution of token-weighted variation indices; no clearly dominant variant becomes apparent. In contrast, dog-1a (n = 24; token-weighted 0.03) shows a higher token-weighted variation index than dog-1b (n = 3; token-weighted 0.004), dog-1c (n = 3; token-weighted 0.004), dog-1d (n = 1; token-weighted 0.001), or dog-1e (n = 3; token-weighted 0.004), due to considerably higher frequency applied to the same unweighted variation index (Figure 12). Similarly, camera-3a (n = 7; token-weighted 0.01) has the highest token-weighted variation index out of all camera-variants as it occurred most frequently in the data. Note that the power of frequency is particularly visible in iconic motivations that occur in response to more than one stimulus, leading to a very high overall frequency (e.g., red-1 was produced in 68 tokens as compared to 24 dog-1 tokens and 16 camera-3 tokens). However, we must not forget that a considerable portion of the data shows no variation.
The signer-weighted variation index corroborates dominant variants. Whereas weight by token frequency only identifies which variants are produced often, weight by signer identifies how widespread variants are across participants. For ease of readability, Figure 13 illustrates the same selected examples as in Figure 12 (the full graph with all data with variation is given in Appendix D). Figure 13 demonstrates that tea-1 (brewing loose leaves) is more widespread across participants than tea-2 (steeping tea bag). Indeed, we may even pinpoint specific dominant surface realisations. Within tea-1 (unweighted 2.04), two surface realisations are produced by 12/20 signers: tea-1a (signer-weighted 0.51) and tea-1f (signer-weighted 0.71), only differing in handedness (tea-1a is two-handed; tea-1f is one-handed). Maybe even more striking is the example of dog-1 variants (unweighted 2.03): dog-1a (signer-weighted 1.42) was produced by 14/20 signers, dog-1b (signer-weighted 0.20) and dog-1c (signer-weighted 0.20) by two participants each, and dog-1d (signer-weighted 0.10) and dog-1e (signer-weighted 0.10) by one participant each. Without doubt, dog-1a represents the most widespread variant and the high unweighted variation index stems from variants produced by only a few participants. These examples are clear evidence of (i) patterned variation that often includes particularly widespread surface realisations, and (ii) variation as a result of few signers producing non-dominant variants, either due to idiosyncratic characteristics (potentially caused by individual socio-demographic profiles), or a preference for another iconic motivation.
To sum up, both frequency-based weightings improve the ecological validity of the measure; a token-weighted variation index demonstrates how tightly variation and frequency are intertwined, while a signer-weighted variation index identifies the dispersion of sign variants both in terms of iconic motivations and surface realisations across the population. In other words, Figures 12 and 13 as well as the graphs of the full data (Appendix C and Appendix D) provide two different ways of looking at the same dataset (i.e., the effect of, first, the number of produced tokens, and second, the number of signers using a particular variant). Nevertheless, both point towards the same dominant variants, i.e., iconic motivations and/or surface realisations that are (i) more frequent than others (token-weighted) and (ii) more widespread across participants (signer-weighted). Given that we allow for one signer producing multiple signs, the signer-weighted variation index more clearly displays dominant variants than the token-weighted variation index.

Measuring variation
Studies in sign language emergence focus on the iconic motivation as developing systems may not yet exhibit phonological structure. The following section is dedicated to laying out how the variation index compares to and improves existing methods. To this end, we provide a case study of three items from our data comparing our method to the measure of Israel (2009) and Israel and Sandler (2009): mode and number of variants. As a reminder, Israel and Sandler narrow their analysis to the most frequent iconic motivation provided in responses to a picture stimulus; they then do a feature analysis of this subset of signs calculating mode and number of variants for each feature class. Note that the mode is calculated relative to the number of signers who produce this iconic motivation, leading to different items having differently sized sets. Israel and Sandler use the number of surface realisations as a measure of sign-level variation.
For this case study, we re-use the three stimuli from Figures 12 and 13 as they elicited different degrees of variation: tea, dog, and camera. Following Israel and Sandler's method of analysing one-sign responses, we selected the first target sign in each response when the response contained multiple. For each of the three items, we identified the most frequent iconic motivation (tea-1, dog-1, camera-3) and then calculated the mode and the number of variants for each feature as described in Israel (2009) (Figure 14A). The set sizes differ greatly: dog-1 was produced by 19 signers, tea-1 by 16 signers, and camera-3 by nine signers.

Figure 14 illustrates that dog-1, tea-1 and camera-3 differ in the variation they exhibit in different feature classes. A low number of variants often coincides with high mode (i.e., low variation); for example, tea-1 has only one variant in Location, Repeated Movement and Alternating Movement and accordingly shows maximal mode in these features while camera-3's four variants in the handshape feature Aperture co-occur with a mode of 0.56 in this feature. Moreover, dog-1 is an example that nearly all signers produced, i.e., it represents the example with the largest set, and displays similar amounts of variation in different handshape subcomponents of the Strong Hand (2.23 variants and mode of 0.89): 1.9 variants in Finger Selection, 1.77 variants in Aperture, 1.66 in Spreading and 1.56 in Finger Configuration, and a mode of 0.87 for Finger Configuration, 0.85 for Spreading and 0.84 for both Finger Selection and Aperture. Sign-level variation, captured as the number of surface realisations in Israel (2009), is diverse in this case study: five variants were produced for dog-1, five for tea-1, and six for camera-3. Hence, on this measure, dog-1 and tea-1 are less variable than camera-3. Figure 13 exemplifies the same surface realisations using the signer-weighted variation.
We chose the signer-weighted variation index as a baseline for comparison to Israel and Sandler's method as the analysed iconic motivation is identified by the number of signers. In line with this method, dog-1, tea-1 and camera-3 are the dominant iconic motivations using the variation index. Nevertheless, Figure 13 demonstrates that in many cases, more than one iconic motivation is frequent, e.g., tea-1 is used by 16 signers and drink-1 by seven signers in tea and camera-3 was produced by nine signers and camera-5 by six signers in camera. This may be linked to including multiple signs per response; the knowledge of signers is not restricted to a single variant, and indeed, drink-1 sometimes features in the same response as tea-1. In cases like camera, it may be less clear what variant should be considered the most frequent one; camera-3 is only marginally more widespread than camera-5, which suggests that analysing more than one iconic motivation is necessary to appropriately represent the variation. In addition to identifying the most widely shared iconic motivations, we find that specific surface realisations are more frequent than others with the same iconic motivation: dog-1a (14/20 signers), tea-1f (7/20 signers), and camera-3a (6/20 signers). These surface realisations have higher signer-weighted variation indices (Figure 13) because they are shared across more signers. In short, while zeroing in on the most frequent iconic motivation is one way of reducing "noise", the variation index shows that preserving variation through factoring in token-frequency or signer-frequency later on may be helpful to uncover structure in variation.
While both the variation index and Israel and Sandler's method point to different degrees of variation within a sign, one of the major differences between the methods lies in the measure itself. Israel and Sandler's method presents us with many connected measures rather than one unified measure. Their studies aim at a cross-linguistic comparison of phonological stability on the language-level which might explain why they opted for this method. However, features do not occur in isolation and we therefore argue that separate feature measures need to be re-incorporated into the context of the sign. The variation index aims for a more encompassing measure that unifies different levels of a sign and takes into account contextual information, here token- and signer-frequency. In contrast to Israel and Sandler's measures, the variation index does not integrate particular feature values but instead tallies the mismatching features. Accounting for feature value differences could be incorporated into the variation index as an extra level in the future, potentially even by adapting Israel and Sandler's mode.
Israel and Sandler's method is driven by maximal convergence between signers while the variation index is based on charting variation, and thus capturing structure in variation also in less frequent sign variants. There are three main advantages of the variation index: (i) In line with many other studies, the measures of Israel and Sandler (2009) are based on one-sign-per-response type of data (e.g. Sandler et al. 2011; Hartzell et al. 2019; Mudd, Lutzenberger, et al. 2020). As multi-sign responses are commonly reported for lexical elicitation data from micro-community sign languages (e.g. Morgan 2015; Hartzell et al. 2019; Mudd, Lutzenberger, et al. 2020), the variation index accommodates responses with multiple signs. (ii) Mode and number of variants focus on the most frequent iconic motivation, which eliminates much of the variation in the very first step. By examining all produced iconic motivations, the variation index allows for participants knowing and providing synonyms or multiple variants in an elicitation task. (iii) Israel and Sandler's and our own data show that there is a lot of variation as to whether signers produce the same iconic motivation in response to an item. For some items, signers highly aligned on iconic motivation (e.g. dog, chicken, cow) while for others they are very dispersed (e.g. rice cooker). Rather than selecting only one iconic motivation (which may be only marginally more frequent), the variation index factors in token-frequency and participants as a weighting factor later on and thus preserves variation. This enables us to uncover conventions also in less frequent iconic motivations. Although the variation index may not (yet) integrate exact feature value measures, it accounts better for the ecological realities of a signing community.
Integrating all levels of a sign yields not only a more comprehensive but also a more nuanced account of variation that allows us to objectively assess and compare variation within and across emerging and established systems. In addition, the automatic comparison makes this method easy to scale up to large datasets.
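The two per-feature measures used in this comparison can be sketched in Python. The sketch assumes that the mode is the share of signers in the set who use the most common value of a feature, which is consistent with camera-3's mode of 0.56 for nine signers (5/9); the function name and the feature values are hypothetical, and the sketch is not Israel and Sandler's own implementation.

```python
from collections import Counter


def mode_and_variants(values):
    """Return (mode, number of variants) for one feature.

    `values` holds one feature value per signer in the analysed set.
    """
    if not values:
        raise ValueError("at least one signer is required")
    counts = Counter(values)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(values), len(counts)


# Hypothetical Aperture values for a set of nine signers.
aperture = ["open"] * 5 + ["closed"] * 2 + ["flat", "curved"]
mode, n_variants = mode_and_variants(aperture)  # four variants, mode 5/9
```

A feature on which all signers agree yields a mode of 1.0 with a single variant, i.e., the maximal mode mentioned above for Location in tea-1.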
Nevertheless, there are also some shortcomings of the variation index. The three levels considered are all interrelated which may make the variation index somewhat counter-intuitive. While iconic motivation and surface realisations stand in a one-to-one relationship to each other, this is not the case for surface realisations and feature differences. Surface realisations that differ from each other but share the same iconic motivation are grouped together. Form variation arises within an iconic motivation as minimally one surface realisation. Variation among these surface realisations is driven by differing features, resulting in more (or fewer) surface realisations than feature differences. Despite the different relations to each other, all levels contribute to the overall score where a higher value represents more variation. The major contributor of variation in a variant can, however, only be localised in a separate step; each level can provide a different piece of the puzzle, e.g., the number of produced iconic motivations per item provides information about how many mappings are used to refer to a given concept. Altogether, the variation index builds on the complexity of interrelated levels to provide a comprehensive approach to variation.
As explained previously, no variation means that an iconic motivation has been produced with a single surface realisation. Due to the one-to-one relationship between iconic motivation and surface realisation, this yields the variation index of 0.67. For the majority of cases, no variation is due to the lack of tokens; hapaxes, i.e., single tokens of (idiosyncratic) sign variants, cannot show variation. A limited number of signs, however, show no variation due to extreme uniformity across all tokens. shoe-1a (n token = 29; n signer = 18), offering-2a (n token = 19; n signer = 10), temple-ceremony-1a (n token = 13; n signer = 8), rice-1a (n token = 12; n signer = 8), mango-oval-1a (n token = 12; n signer = 4), gecko-5a (n token = 10; n signer = 5), and tridatu-bracelet-2a (n token = 9; n signer = 4) have been produced with a high(er) frequency but with a single surface form. Thus, albeit less frequently, a variation index of 0.67 may also result from the lack of type variation. Developing the variation index further could aim for normalising the scores to make them more easily interpretable and potentially even circumvent this ambiguity.

Discussion
This study has introduced and applied a new way of measuring variation in sign formation across a signing community numerically, taking into account three interrelated levels inherent to every sign: 1) iconic motivation, 2) surface realisation, 3) feature differences. As such, it builds on previous work which primarily focuses on the feature level (Israel 2009;Israel & Sandler 2009;Parks 2011;Sandler et al. 2011;Morgan 2015;Omardeen 2018;Börstell et al. 2020). These variation indices yield gradient outcome measures for sign variants, allowing us to capture that, across the 20 deaf signers sampled, different sign variants exhibit various degrees of variation on different levels. Moreover, this study shows the impact of frequency on what variation can be attested, cautioning the generalizations that can be made from a limited dataset. We have suggested two types of frequency-based weightings of the variation index to increase the ecological validity of the measure; weighting by token frequency identifies sign variants of high usage and weighting by signer identifies sign variants that are particularly widespread across the population. Both identify dominant and non-dominant variants, demonstrating a variation continuum.

Variation and frequency
This study demonstrates that limited datasets are at risk of overinflating variation; signs that are produced frequently have more opportunity to vary while a single token of a sign variant cannot vary from anything. Indeed, most studies on the emergence of phonology and the lexicon in micro-community sign languages are based on picture elicitations instead of corpus data with robust frequency information (e.g. Israel 2009; Israel & Sandler 2009; Sandler et al. 2011; Richie et al. 2014; Morgan 2015; Horton 2018; Hou 2018; Reed 2019; Hartzell et al. 2019). Variation as measured in this study can be applied to different kinds of data, ideally drawing on both elicited and spontaneous corpus data from different languages or even home sign data that can be compared directly.
Previous analyses on emerging phonology have focused on commonalities across signers, by singling out the most frequent iconic motivation and/or features among responses to a stimulus for analysing variation (Israel 2009;Israel & Sandler 2009;Sandler et al. 2011). For example, Sandler and colleagues (Israel 2009;Israel & Sandler 2009;Sandler et al. 2011) view the most frequent iconic motivation as (most) conventionalised. On these grounds, they examine variation by zooming in on the iconic motivation of barking for dog in ABSL, and examine handshape variation within that subset. Our results suggest that such methods might overinflate estimates of variation, given our observation that high frequency of an iconic motivation does not necessarily equal low variation on other levels, and that high-variation variants do not all vary to the same extent and in the same aspects. Especially in studies with limited datasets i.e. elicited data, it is fundamental to understand and acknowledge the correlation between frequency and variation to avoid overinterpretation.
In corpus-based research on spoken languages it has been established that high frequency words change more quickly than low frequency words (Frisch 1996;Bybee & Hopper 2001;Bybee 2010). We found that highly frequent iconic motivations often go hand in hand with high variation indices. As indicated previously, this may indeed be related to the increased opportunity for variation to surface. This phenomenon may also be linked to forces of language change, with high frequency signs demonstrating a locus of rapid language change while less frequent sign variants may be less variable. Nevertheless, our study deals with frequency as the tokens within this limited dataset of elicited productions rather than corpus-based frequencies.
It is unclear how the type of language task influences the obtained frequency distributions. To corroborate whether or not the variation patterns in this data set are related to token-frequency or reflect language change, we would need to expand the dataset.

High variation and synonyms
Micro-community sign languages have often been described to exhibit a high degree of variation (e.g. Washabaugh 1986;Meir et al. 2012). Multiple explanations have been suggested to account for this, including the lack of pressure to converge on linguistic symbols due to high common ground (de Vos 2011;Meir et al. 2012). Morgan (2015) observes that iconic motivations themselves are a locus of variation in KSL. We observe the same in KK; few items in our study elicit a single iconic motivation, e.g., dog, while the majority elicit multiple iconic motivations, e.g., camera or pig. Of the elicited iconic motivations, some are both more frequent and widely shared across signers, e.g., pig-1 with 25 tokens across 17 signers compared to pig-2 with 14 tokens across eight signers and animalears-1 with seven tokens by four signers. Morgan (2015) attributes the presence of multiple iconic motivations to ongoing convergence across signers and Mudd et al. (2020) argue that high familiarity with a concept may reduce variation. On top of this, greater prominence of the stimulus and/or the iconic mapping in one's surrounding might also trigger (individual) preferences for certain iconic mappings and thereby stimulate the selection and persistence of different iconic motivations. For example, pigs are commonly killed by men while feeding pigs might be considered more (but not exclusively) a female task. To test and disentangle these hypotheses empirically, iconicity ratings would have to be collected, potentially alongside a measure of visual prominence.
Research on patterned iconicity suggests that specific types of objects tend to elicit specific iconic strategies, i.e. signs for tools often relate to handling or manipulating the tool while signs for food items are often linked to size, shape and manipulation (Padden et al. 2013; Hwang et al. 2017; Hou 2018). In line with this, sign variants elicited for the items camera and pig show different strategies: most iconic motivations elicited for camera relate to holding, handling, or manipulating a video camera whereas iconic motivations in pig map to different aspects around a pig, namely handling/manipulating (killing), embodiment of the animal (feeding), and appearance. While it is possible that sign variants are indeed the result of selection and convergence as suggested in Morgan (2015), it is also possible that qualitatively different mappings such as in the pig-variants affect the preservation of multiple different iconic motivations in KK and other micro-community sign languages. Different ideas have been put forth to explain high variation in micro-community sign languages. As previously suggested by de Vos (2011), Meir and colleagues (2012) and Meir and Sandler (2019) and corroborated by a computational model by Thompson et al. (2019), the high degree of shared knowledge and the limited number of community members may allow for high variation. Both make it possible to tolerate idiosyncrasy, i.e., to remember idiosyncratic variants. The high overlap in experiences among signers of small communities may enhance the availability of a large range of possible iconic mappings and thereby accommodate a large number of (idiosyncratic) sign variants. Recently, Tkachman and Hudson Kam (2020) have argued that, rather than community size, tight kinship relations and early signing exposure across deaf signers account for the high lexical variation in young micro-community sign languages.
While the present study adds that dominant sign variants are shared by many participants and produced often, the question of how to account for the considerable variation of non-dominant variants still remains.
Nevertheless, the pressures that reduce the number of synonyms with different iconic motivations remain unclear. As explained in Section 2.1, concepts provide many properties for potential iconic mappings that may be linguistically encoded. As a result, sign language lexicons may include synonyms with different iconic motivations. For example, three variants of dog in NGT are based on three different iconic motivations (Figure 15): i) the dog's paws, ii) holding something in the mouth as dogs often do, iii) calling a dog by patting one's thigh.
If the reduction of iconic motivations is indeed a first step in conventionalisation as suggested by Morgan (2015: 14 f.), the abundance of synonyms with or without shared iconic motivation in macro-community sign languages points to an obvious lack of conventionalisation. In BSL, 22 conventionalised variants of purple 7 have been found as a result of regional variation (Stamp et al. 2014). In contrast to work on sign language emergence, this variation in the BSL or the NGT lexicon is argued to be rooted in and maintained by various sociolinguistic factors; e.g., age, region, educational background. Altogether, it is unclear how different pressures resulting in the reduction of variation on the level of iconic motivations interact, and whether they differ fundamentally in different signing communities.

Variation and conventionalisation
Different fields approach variation in different ways: studies on macro-community sign languages embrace variation as a sociolinguistic phenomenon, while, in research on sign language emergence, variation across signers is generally taken as a lack of conventionalisation. Where Sandler et al. (2011) argue that the extreme variation in ABSL is explained by the lack of a phonological system, we would like to propose that a more in-depth approach may help to uncover processes and mechanisms underlying variation and conventionalisation.
7 BSL Signbank lists 17 different variants: https://bslsignbank.ucl.ac.uk/dictionary/words/purple-1.html.
This study shows that a gradient measure exposes structured variation. Sign variants for dog in ABSL are presented as an example of extreme variation in Sandler et al. (2011). All dog-variants share the iconic motivation 'barking' but all ten ABSL participants produce different surface forms, leading to the claim that ABSL signs are driven by holistic, iconic prototypes without combinatorial structure (Sandler et al. 2011: 520). Among our KK signers, dog elicited a single iconic motivation, also 'barking', with substantial variation in surface realisations and feature differences: dog-variants in KK yield feature-level variation on six features distributed over five surface realisations. The signer-weighted variation index reflects that 70% of the KK participants produced the same surface realisation dog-1a. In contrast to ABSL, the variation in dog-1 results from six participants who produce different surface realisations. Thus, although dog elicits variation in both ABSL and KK, our approach gives more insight into the underlying structure of variation. In the case of dog-1, we can localise the variation to 30% of the signers, suggesting that the driver is likely to be individual participant effects rather than the lack of a phonological system. 8 We plan to explore the effect of social factors on sub-lexical variation in a future study by combining the current method with the method used in Mudd et al. (2020).
Sign language emergence scenarios often describe a development from i) no language to ii) a communication system with high variation and little structure, in which variation decreases as structure increases until iii) reaching structural benchmarks set by research on macro-community sign languages (Meir & Sandler 2019). While this route may be justified for some linguistic aspects, for example grammatical elements (e.g. Pfau & Steinbach 2011; Senghas & Coppola 2011; Johnston et al. 2015; Pfau 2015; but see Safar 2020), the fundamental idea rests on variation being reduced to an optimally low level (without taking into account the ecological niche of the particular language). Macro-community sign languages, however, may previously have escaped this pressure and, now characterised as established, are analysed on different grounds. For example, levelling, i.e. the reduction of variation, is attributed to language emergence in Nicaraguan Sign Language for spatially modified verbs (Senghas 2003) while it has been explained as sociolinguistic variation in the BSL lexicon, an older macro-community sign language (Stamp et al. 2014). In macro-community sign languages, variation is increasingly perceived as a sign of linguistic diversity and richness, while variation in micro-community sign languages retains a negative connotation of immaturity (Moriarty Harrelson 2017; Kusters & Sahasrabudhe 2018; Braithwaite 2020; Hou & Kusters 2020).
This tension may arise from the focus on convergence rather than variation in the literature on sign language emergence (Israel 2009; Israel & Sandler 2009; Morgan 2015; Meir & Sandler 2019). Conventionalisation and variation are often understood as "opposing forces" (Meir & Sandler 2019: 9), with the reduction of variation signalling conventionalisation. While we do not intend to question the general idea that decreasing variation increases conventionalisation, we would like to discuss two main problems with a reduction-based definition that are more broadly linked to a discussion about language emergence, language change, and language ideologies (Moriarty Harrelson 2017; Kusters & Sahasrabudhe 2018; Braithwaite 2020; Hou & Kusters 2020; Kusters et al. 2020). First, a reduction-based definition implies an ideal state of maximal convergence and minimal variation, i.e. a single variant, that any (emerging) language strives for. However, studies increasingly document substantial (sociolinguistic) variation on different levels of description across spoken languages (Bybee 2006) and well-researched macro-community sign languages (e.g. Stamp et al. 2014; Börstell & Östling 2016; Schembri et al. 2018). It is thus unclear why minimal variation is expected to be a hallmark of an "established language". Linked to this is the second issue, namely the paradox of determining a start and an end point of a process. Although conventionalisation is generally described as a process (Burling 1999; Schmid 2020), signs tend to be discussed in categorical terms (lexicalised vs. non-lexicalised, conventionalised vs. non-conventionalised). This generates the expectation of an ideal end state that can (and should) be reached, i.e. selection of one conventionalised variant, and dismisses the fact that variation may be structured. As in other domains, the increasing trend to examine gradience in language use (e.g. Cormier et al. 2012; Ferrara & Halvorsen 2017; Lepic 2019) may be beneficial to understanding both variation and conventionalisation.
Rather than viewing conventionalisation and variation as opposing forces, we may want to move on to understanding them as two sides of the same coin. Lepic (2019) argues that understanding signs as a forced choice between two categories (lexicalised vs. non-lexicalised) instead of on a lexicalisation continuum prevents us from understanding the degree to which mental representations are established. Similarly, internalising conventionalisation (and variation) as continua may help us gain a deeper understanding of how they shape language emergence, language change, and language use. The findings of this paper stress that discussing conventionalisation in a binary manner can be misleading: unlike states, processes require gradient measures that are complex, sensitive to frequency and able to capture different degrees of variation on different aspects of a sign (see also Meir & Sandler 2019;Tkachman & Hudson Kam 2020). Understanding conventionalisation and variation as tightly related continua may allow us to acknowledge and deal with them more appropriately in different types of signing communities, paving the way for valid comparisons across languages that differ greatly in their socio-cultural and ecological niche.

Conclusion
This paper has developed and applied a new measure of variation in the forms of signs. The comprehensive approach to variation suggested here relies heavily on integrating three interrelated levels as they may show different degrees of variation. In other words, there is not always a logical trade-off between more and less variation among different levels within a sign; low variation on one level does not equal low variation on another level. Indeed, we propose to consider the possibility that variation does not necessarily equal the absence of conventionalisation. We also suggest that, particularly in micro-community sign languages, footprints of individual language users may be especially prominent. Measuring variation as suggested here allows us to calculate and weight variation indices for individual iconic motivations and surface realisations, which can then be utilised to answer further questions about the effect of social factors or phonological characteristics. In this study, all signs produced in this dataset were treated as individual signs. As single sign responses are rare, weighted variation indices may be devised to address the issue of chains of signs, potentially identifying compounds and collocations in a more efficient way. Moreover, weighting by signer touches upon social factors that may play a fundamental role in variation across small communities. In a future study, we plan to address the contribution of social factors in order to explore the source and context of sign variants, helping to further disentangle the high collinearity of social factors in micro-community sign languages such as KK. Our study teaches us to be cautious about overinterpreting data from restricted datasets and provides a new tool to examine and even compare variation within diverse sign language lexicons.
This sets the ground for comprehensive comparisons of variation across micro- and macro-community sign languages with minimal methodological differences, bringing us closer to embracing variation in sign languages irrespective of the ecological niche in which they emerged.