Varying Abstractions: a conceptual vs. distributional view on prepositional polysemy

The term ‘meaning’, as it is presently employed in Linguistics, is a polysemous concept, covering a broad range of operational definitions. Focussing on two of these definitions, meaning as ‘concept’ and meaning as ‘context’ (also known as ‘distributional semantics’), this paper explores to what extent these operational definitions lead to converging conclusions regarding the number and nature of distinct senses a polysemous form covers. More specifically, it investigates whether the sense network that emerges from the principled polysemy model of over as proposed by Tyler & Evans (2001; 2003) can be reconstructed by the neural language model BERT. The study assesses whether the contextual information encoded in BERT embeddings can be employed to successfully (i) recognize the abstract sense categories and (ii) replicate the relative distances between the senses of over proposed in the principled polysemy model. The results suggest that, while there is partial convergence, the two models ultimately lead to different global abstractions, because the imagistic information that plays a key role in conceptual approaches to prepositional meaning may not be encoded in contextualized word embeddings.


Introduction
The aim of the present study is to empirically investigate whether there is any correspondence between the generalizations that emerge from different operational models of meaning representation. More specifically, this study focuses on the operational definition of meaning as 'concept', as commonly employed in Cognitive Linguistics, and meaning defined as (or derived from) 'context', also known as distributional semantics, which has rapidly gained popularity in Corpus Linguistics and Computational Linguistics/NLP. As a case study, it will home in on the semantics of the English preposition over.
Following Brugman's (1988) and Lakoff's (1987) extensive treatments of over, prepositions have become default examples in descriptions of the core tenets of 'the cognitive approach' to meaning (see, e.g. Lemmens 2016). In the cognitive-conceptual approach to semantics, meanings are defined as 'concepts' which are connected and grounded in complex knowledge structures (e.g. Clausner & Croft 1999), often described as complex chains or 'networks' of connected senses. Central to the conceptual approach is that these concepts (and their network of senses) are to a large extent experiential: that is, they are grounded in the physical or cultural experience of the language user.
When it comes to their discussion of polysemy, such cognitive-conceptual accounts have faced substantial criticism, predominantly aimed at their apparent lack of principled and objective methods to determine how many senses can be distinguished, and how the global design of the polysemy networks is construed. One part of the problem appeared to be that the identification of the core node (or 'prototype') of the network seemed to rely solely on subjective, introspective judgements (Sandra & Rice 1995; Rice 1996: 137). These criticisms triggered a search for concrete criteria and data-driven tests for prototypicality (e.g. Gilquin & McMichael 2018; Newman 2011). A similar discussion also arose regarding the position of derived nodes, with many scholars acknowledging that more objective, non-introspective discussions regarding 'distances' between senses will remain all but impossible as long as we are not "able to measure the degree of similarity between senses" (Gries & Divjak 2009: 57; emphasis mine).
In response to the need for more well-defined and data-driven definitions of word senses (e.g. Kilgarriff 2003), and more objective, measurable ways of determining distances between senses, a number of proposals have been devised that could be subsumed under the header of 'contextual' or 'distributional' approaches to semantics. Underlying these approaches is the premise that the meaning of a word can be derived from the context in which it occurs (an idea which dates back to at least Harris (1954) and Firth (1957)). While not necessarily equating context to meaning, distributional approaches to semantics are based on the assumption that co-occurrence patterns and other distributional frequencies are indicative of the meaning of a linguistic item (i.e., they serve as "proxies" for meaning representation; see, for instance, Baroni et al. (2014: 238)). Hence, distributional similarities between linguistic items in a corpus can be used to approximate a measurement of their functional or semantic similarity, thus placing more prominence on the geometrical relationship between linguistic items (Boleda & Erk 2015). Particularly in recent years, the approach has been met with great enthusiasm in Computational Linguistics and Machine Learning, as recent incarnations of such distributional semantic models seem to perform astonishingly well on a wide range of NLP tasks, including language production and machine translation (Young et al. 2018; Radford et al. 2018).
A similar (yet overall more cautious) enthusiasm has been expressed in Linguistics: being almost exclusively corpus-based, distributional approaches have provided a welcome bridge between the "well-established corpus-linguistic research tradition and Langacker's idea that linguistic representations emerge from linguistic usage" (Stefanowitsch 2010: 370). Furthermore, because some of the more traditional conceptual models of polysemy already depended on distributional criteria to some extent (Gries & Divjak 2009: 58; Geeraerts 2016: 241), the path to a full-fledged distributional approach had already been cleared. Such an approach involves virtually no introspective manual annotation (but see, e.g., comments in Gries & Divjak (2009) and Heylen et al. (2015: 154)), and hence it allows the elusive concept of meaning to be studied in a more rigorous, large-scale, and ultimately quantifiable way. The confidence that distributional methods provide an appropriate, objective and fully data-driven alternative to more introspective models appears to be growing steadily, as there have been some successes in replicating experimental, survey-based accounts of polysemy by means of distributional models.

Prioritizing depth over width, this study is set up as a detailed empirical comparison between two operational models of meaning representation, with one serving as a representative of the cognitive semantic ('concept') approach, and one representing the distributional ('context') approach. More specifically, this study focuses on one of the most well-developed cognitive-conceptual proposals, the principled polysemy model of over as set out by Tyler & Evans (2001; 2003), and aims to investigate whether the polysemy network that emerges from this theoretical model of meaning representation can be reconstructed by means of a recent neural distributional language model called BERT ('Bidirectional Encoder Representations from Transformers'). To this end, a stratified sample of 808 contextualized instances of the preposition over has been annotated following the criteria outlined in Tyler & Evans (2001; 2003). This annotated data set serves as a 'theory-specific standard' against which the output of BERT will be assessed. This assessment is targeted at determining whether one can distinguish the same sense categories as proposed in the principled polysemy model by means of BERT embeddings, but it also addresses the question whether there is overlap between the operational models in terms of the suggested similarities and relations between these senses.
What emerges from the analysis is that BERT clearly captures fine-grained, local semantic similarities between tokens. Even with an entirely unsupervised application of BERT, discrete, coherent token groupings can be discerned that correspond relatively well with the sense categories proposed by means of the principled polysemy model. Furthermore, embeddings of over also clearly encode information about conceptual domains, as concrete, spatial uses of over are neatly distinguished from more abstract, metaphorical extensions (into the conceptual domain of time, or other non-spatial domains). However, there are no indications that BERT embeddings also encode information about the abstract configurational resemblances between tokens across those domains. As such, the global picture of resemblance between sense categories that emerges from the unsupervised application of BERT differs substantially from the theoretical proposal by Tyler & Evans (2001; 2003), which heavily relies on the language user's ability to recognize schematic, imagistic similarities within and across conceptual domains. These findings highlight the fact that such imagistic similarities are not captured by the embeddings of over, which provides further insight into the kind of semantic information that can be encoded by means of (unsupervised) BERT embeddings. This can provide an interesting basis for further experimental research (e.g. testing to what extent these different operational models of meaning representation are complementary when assessed against elicited behavioural data), as well as a discussion on how we can bring about a "greater cross-fertilization of theoretical and computational approaches" to the study of meaning (Boleda 2020: 2; also see, e.g., Baroni & Lenci 2011; Pater 2019).

Cognitive-conceptual approaches to prepositional semantics
The interest in prepositional semantics in Cognitive Linguistics stems from the observation that language users are able to use a relatively small set of prepositions to refer to an indefinitely large number of relations and scenes because of their cognitive ability to categorize concepts schematically (Kreitzer 1997). The cognitive-conceptual approach to prepositional semantics relies on two important constructs: the so-called "embodiment" of meaning, i.e. "[t]he idea that the properties of certain categories are a consequence of the nature of human biological capacities and of the experience of functioning in a physical and social environment" (Lakoff 1987: 12), and the notion of image schemas, which can be considered as condensed, schematic, recurring patterns of perceptual experience (e.g. Oakley 2010; Gibbs et al. 1994).
Some key publications in developing the notions of image schemas and embodiment, and integrating those notions into the discussion of meaning representation, are Brugman (1988) and Lakoff (1987). In their analyses of over, Brugman and Lakoff distinguish a vast number of distinct image schemas, all of which map onto a distinct 'sense' of the preposition. The image schema underlying an example such as Devi lives over the hill, for instance, conveys a static horizontal spatial configuration, whereby the focal point or "trajector" (TR), Devi, is positioned on the other side of the "landmark" (LM), the hill. This schema is different from the one underlying an example such as The helicopter hung over the hill, which conveys a vertical spatial configuration in which the helicopter (TR) is positioned above the hill (LM). The schema furthermore differs from those underlying examples such as Devi walks over the hill or The helicopter flies over (the hill), which involve a (horizontal) path, and so on.
Yet, while they evoke different spatial configurations, the image schemas underlying these examples are still connected to one another, as humans are able to recognize general similarities between abstract image schemas (as demonstrated experimentally by, for instance, Gibbs et al. (1994)). Thus, a complex yet structured 'network' of linked senses is formed. Such networks, often termed 'lexical networks' or 'polysemy networks', comprise nodes which are situated at varying distances from one another, and are centered around a primary sense or prototype (Lakoff 1987; Rice 1996). At their core, prepositions are spatial expressions, but the general human ability to apply metaphorical and analogical reasoning allows language users to ubiquitously extend prepositions into non-physical domains (Kreitzer 1997: 317; Lee 1998: 334; Rice 1996: 135; Rice 1999: 227), as demonstrated by examples such as Devi works over the weekend (embodiment of time) and Devi is over her ex-boyfriend (embodiment of mental state).
As each small modification to an image schema is mapped onto a discrete sense category, the meticulous and comprehensive accounts set out by Brugman and Lakoff are sometimes called the "full-specification" approach. In the case of Lakoff (1987), the full-specification approach led to a fine-grained overview of 24 senses of over, which are connected in a sizable polysemy network. In later work, Lakoff's proposal was amply criticized for the fact that it leads to a virtually unconstrained number of sense categories, and for lacking methodological rigour (e.g. Rice 1996; Kreitzer 1997; Sandra & Rice 1995; Tyler & Evans 2001; 2003). This led Sandra (1998: 361) to coin the term "polysemy fallacy" in reference to "the tendency to look for polysemy even when there is no evidence for it".
Subsequent proposals, then, set out ways to tackle the apparent lack of a principled procedure to determine the number of distinct (sub)senses. Two notable examples are the proposal of Kreitzer (1997), and the "principled polysemy" approach advocated by Tyler & Evans (2001; 2003).
Drawing strongly on the spatial information encoded in linguistic expressions, Kreitzer (1997: 308) defines a prepositional sense as "a class of uses sharing a unique relational level image schema". In the case of over, Kreitzer argues that only three such schemata can be distinguished: (1) a static relation between two points on a vertical axis (over 1), (2) a dynamic relation involving a path schema (over 2), and (3) a static relation where one point occludes the other (over 3):

(1) The painting hung over the fireplace.
(2) The cat jumped over the fence.
(3) The mask is over my face.
These three relational schemata are also applicable to non-spatial domains, which, Kreitzer explains, are consistently conceptualized in terms of spatial image schemata: the use of over in I finally got over that relationship (indicating a path obstructed by an obstacle), for instance, can be motivated by the dynamic schema underlying over 2, whereas over in The box is over six feet tall (indicating excess) is motivated by over 1. Yet, as pointed out by Tyler & Evans (2001: 729), Kreitzer does not motivate the existence of those three relational image schemas in light of each other, as he "makes no attempt to account for how over 1 could give rise to over 2 and over 3 respectively". Additionally, many senses touched on by Lakoff (1987) are simply ignored in Kreitzer's account.
Addressing these issues, Tyler & Evans (2001; 2003) devised a more encompassing proposal based on slightly different principles. More specifically, they argue that senses can be considered distinct if (and only if) they meet the following criteria: (i) First, assuming that the primary sense of the preposition involves "a particular spatial relation between a TR and an LM", the distinct sense "must involve a meaning that is not purely spatial in nature", or "the spatial configuration between the TR and LM is changed vis-à-vis the other senses associated with a particular preposition" (Tyler & Evans 2001: 731). (ii) Second, the sense must exist in examples where it cannot be inferred from the combination of another sense and encyclopedic or contextual knowledge (i.e., they must be instantiated in semantic memory; Evans (2005)). Following these criteria, there is no reason to assume that the use of over involves two distinct and separately stored senses in The helicopter hovered over the hill and The helicopter flew over the hill, as both examples involve a spatial relation in which the TR (the helicopter) is located above the LM (the hill). Furthermore, the difference in stativity/dynamicity of the scene can simply be inferred from the lexical verb (hover vs. fly). As such, Tyler and Evans effectively constrain the full-specification network to a more digestible size.
An issue that remains, however, is that there is still no objective, measurable means of determining the global structure of the polysemy network. Besides further attempts to establish which sense constitutes the core node or prototype (Sandra & Rice 1995; Rice 1996: 137; Gilquin & McMichael 2018; Newman 2011), there is still much room for discussion regarding the position of derived nodes. To illustrate the issue, we can consider the multitude of possible derivation pathways of the repetitive sense of over, as in She sang the same song over (and over) (Tyler & Evans 2003: 105-106). First, based on their comparable, cyclical image schemas, the repetitive sense can be connected to reflexive uses of over (e.g. She turned the page over / The vase tipped over). Second, it is possible that repetitive over marks an iterative trajectory, in which case the sense could be derived from cases where over marks the end of a linear temporal trajectory or process (e.g. The race is over). A third possibility is that the repetitive sense constitutes a conceptual blend of reflexivity and trajectory completion, a notion which may equally apply to many other derived senses.
In their accounts, Tyler & Evans choose to remain agnostic on the matter, explaining that "language does not function like a logical calculus which would allow us to … establish absolutely a single, precise derivation for each sense" (Tyler & Evans 2003: 62). This is not to say that 'anything goes', but rather that there is a delimited set of general principles or paths of derivation which may individually or simultaneously give rise to derived senses, and different individuals may draw different connections between senses, if they draw any such connections at all (Langacker 2010: Ch. 10). Yet, even so, the agnostic position is somewhat unsatisfactory if one is interested in, for instance, comparing the general probability of multiple derivational paths across individuals, or even across time. Such queries will remain difficult to address in the absence of methods that enable researchers to quantify and measure the degree of similarity between sense categories. The further integration of distributional semantic models into Cognitive Linguistics, then, can at least partially be linked to the research community's growing desire to approach the study of polysemy (and synonymy) in a more rigidly corpus-driven and measurable way (Gries & Divjak 2009; Newman 2011).

Advances in Distributional Semantic Models
At its core, the distributional approach conceptualizes the meaning of a word (or, more generally, of constructions) as a function of its lexical and grammatical context, and as such, meaning can be approached statistically (Turney & Pantel 2010; Boleda 2020). Statistical approaches to meaning have a long tradition in corpus linguistics, with functional-semantic classifications into distinct usages being increasingly based on explicit, automatically detectable contextual cues as corpora grew increasingly large.
An interesting observation made by Heylen et al. (2015) concerns the statistical-manual hybridity of the corpus linguistic tradition. As an example, they take the "British tradition in corpus linguistics", in which lexical collocations and syntactic patterns are employed to capture or approximate word meaning, while the classification of these meanings into categories is conducted manually. By contrast, the more recent application of "Behavioral Profiles" (Gries 2006; Gries & Divjak 2009), for instance, presents a means of statistically automating the classification by means of hierarchical cluster analysis (or correspondence analysis in Glynn (2010)). In such cases, a set of tokens can be annotated along a number of variables or dimensions, such as the type of trajector (TR) and landmark (LM), as illustrated in Table 1 from Newman (2011).
Such data frames can subsequently be converted into a table with numeric information (e.g. the relative frequency of each example with each label), which can then be used as input for statistical analysis (Gries & Otani 2010). Newman (2011) presents such an analysis for over, focusing only on Trajector-Landmark combinations found in the ICE-GB corpus. The appeal of this approach, according to Newman (2011: 541-542), is that it "offers a systematic corpus-based procedure", which is "strongly grounded in facts of usage, complementing any other (intuition-based or experimentally based) methods the researcher might employ".
Still, the selection and annotation of the variables (which have been selected and defined by the analyst) is predominantly manual. As a "logical extension of the statistical state-of-the-art" (Heylen et al. 2015: 154), then, Semantic Vector Space Models were introduced, in which all aspects of semantic analysis are approached statistically. In such models, the contextual properties that are fed into statistical classification models are no longer manually annotated features, but automatically generated numeric representations of syntactic and lexical co-occurrence patterns.
Because the distributional approach to meaning is based on a relatively simple and concrete premise, one may be tempted to assume that studies adopting this approach are highly comparable, if not identical, in how they operationalize and model meaning. This would, however, be a mistaken assumption. It would be far beyond the scope of the present paper to survey the many different ways in which 'meaning as context' has been operationalized (for such a survey, one may consult Turney & Pantel (2010), Lenci (2018), Boleda (2020), or, specifically for deep learning based models, Young et al. (2018)). However, to clarify the model choice in the present study, a brief discussion of two relatively recent developments is warranted. This concerns (i) the rise of models operating with contextualized (or, rather, token-based) semantic vectors, and (ii) the rise of context-predicting models (also known as 'neural language models') that create semantic vectors often referred to as 'embeddings'.

Semantic vectors: type vs. token
A first development of note is the gradual turn from models that produce vectors of word types to models that are able to create token-based (or 'contextualized') vectors. The distinction between type-based and token-based is not so much one of whether or not the resulting vector representations include contextual information (this is the case for both type-level and token-level vectors), but whether or not all contextual occurrences of a single word are conflated into a single vector representation.
Type-based models work from the assumption that a word has a single, constant, 'core' meaning (which can be understood as a prototype, cf. Erk & Padó (2010: 92)), thus representing a 'lumped' approach to meaning representation. Given a number of examples involving the words cat and mouse, a type-based model will provide a single numeric representation for all of the contexts in which these words occur (see Table 2).

(11) The cat ate some food and purred.
(12) Do not pet the paws of a cat unless it purrs.
(13) The mouse held some food between its paws.
(14) I bought an external mouse and keyboard for my computer.
For a word such as mouse, for instance, the contextual information that suggests it is an animal would therefore be conflated with contextual information typical of the computer device. The problem with such aggregated vector representations is that they may render unsatisfactory or problematic vector representations in cases of polysemy, and, unarguably even more so, in cases of homonymy (Erk & Padó 2010; Desagulier 2019; De Pascale 2019; termed the 'meaning conflation deficiency' by Camacho-Collados & Pilehvar 2018).
In response to this issue, models that generate token-specific vector representations were developed. These token-based distributional models, in which individual vectors are assigned to, for instance, the two different examples of mouse as in Table 3, are better equipped to handle the complex internal semantic structure of words, and, hence, are naturally better suited for specific NLP tasks such as word sense disambiguation (see, e.g. ELMo (Peters et al. 2018a), as well as the model described in Heylen et al. (2015)). Because the aim of the present study is precisely to home in on the differences and similarities between different uses of a single preposition, it evidently employs a distributional model that produces vector representations at the token level.
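To make the type/token contrast concrete, the following sketch derives two token vectors for mouse from examples (13) and (14) above and averages them into a 'lumped' type vector. This is an illustration only, not part of the present study's pipeline; it assumes the HuggingFace transformers library and the pre-trained bert-base-uncased model.

```python
# Illustration only: token-based vs. type-based vectors for "mouse",
# using the (assumed) HuggingFace `transformers` library and the
# pre-trained bert-base-uncased model.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def token_embedding(sentence, word):
    """Final-hidden-layer embedding of `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    position = (inputs["input_ids"][0]
                == tokenizer.convert_tokens_to_ids(word)).nonzero()[0].item()
    return hidden[position]

animal = token_embedding("The mouse held some food between its paws.", "mouse")
device = token_embedding("I bought an external mouse and keyboard for my computer.", "mouse")

# A type-based model conflates both contexts into one 'lumped' vector.
type_vector = torch.stack([animal, device]).mean(dim=0)

cos = torch.nn.functional.cosine_similarity
print(cos(animal, device, dim=0))       # token-token: relatively low
print(cos(animal, type_vector, dim=0))  # token-type: conflated middle ground
```

The cosine similarity between the two token vectors will typically be markedly lower than the similarity of either token to the averaged type vector, which is precisely the conflation problem at issue.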

Semantic vectors: count vs. predict
Using the terminology employed in Baroni et al. (2014), I wish to point out that a distinction can be made between 'count models' and 'predict(ive) models' (also see 'explicit' and 'implicit' models in Dubossarsky et al. 2017: 1136). Count models represent, in a sense, the most straightforward way of operationalizing the distributional hypothesis, in that they make use of numerical vectors that are essentially based on co-occurrence counts (for an accessible explanation of how such vectors are constructed for word types and word tokens, see, for instance, Heylen et al. (2015) and Hilpert & Correia Saavedra (2017)). Still, describing count models as such is a severe simplification, as more often than not the vectors are optimized in some way (e.g. by changing context window sizes, reweighting function words, leaving out function words, applying dimensionality reduction, etc.).
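As a minimal illustration of the count approach, the sketch below builds raw co-occurrence vectors over a toy corpus. The corpus and window size are invented for exposition; an actual count model would add the reweighting and dimensionality-reduction steps just mentioned.

```python
# Toy count model: raw co-occurrence vectors within a symmetric window.
from collections import Counter, defaultdict

corpus = [
    "the cat ate some food and purred",
    "the mouse held some food between its paws",
    "i bought an external mouse and keyboard",
]
window = 2  # words counted up to 2 positions to the left/right

counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                counts[target][words[j]] += 1

# counts["mouse"] is a sparse count vector for the *type* "mouse"; real
# count models would reweight (e.g. with PMI) and reduce these dimensions.
print(counts["mouse"].most_common(5))
```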
By contrast, context-predicting models (yet again a cover term for an extremely varied group of models, ranging from weighted bag-of-words models to more syntactically informed variations, with new types of model architectures being added continuously) are designed to approach the construction of semantic vectors from a training-based angle: "Instead of first collecting context vectors and then reweighting these vectors based on various criteria, the vector weights are directly set to optimally predict the contexts in which the corresponding words tend to appear" (Baroni et al. 2014: 238). In other words, predictive models construct vectors as part of a learning task, which, to some degree, eliminates the vector transformation and optimization process. This, in addition to the performance improvements observed in a range of NLP tasks compared to count models, is why the relatively recent emergence of predictive models is often portrayed as an attractive advancement (Baroni et al. 2014). This is, however, not to say that predictive models involve absolutely no parameter tuning, and it has been suggested that, given comparable settings and tuning, the vectors created with count models are as effective as the embeddings yielded by predictive models (Levy et al. 2015). Yet, what does make predictive neural language models particularly appealing for the present study, which focuses on prepositional semantics, is that count models are generally less successful in providing useful representations of function words (e.g. Bullinaria & Levy 2012: 7), whereas "recent neural network models do provide usable representations for them" (Boleda 2020: 7).
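For contrast with the count sketch above, the following sketch trains a small skip-gram model (word2vec, one well-known context-predicting architecture, though not one of the models examined in this study) with the gensim library on the same toy corpus; all settings are illustrative assumptions.

```python
# Toy predictive model: a skip-gram word2vec trained with gensim.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "ate", "some", "food", "and", "purred"],
    ["the", "mouse", "held", "some", "food", "between", "its", "paws"],
    ["i", "bought", "an", "external", "mouse", "and", "keyboard"],
]
# sg=1 selects skip-gram: vector weights are learned by predicting
# context words, rather than by counting and reweighting co-occurrences.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["mouse"][:5])  # a learned (type-level) embedding
```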

BERT
In the present study, the distributional approach is represented by a single model architecture: BERT (Devlin et al. 2019). In a nutshell, BERT is a deep contextualized model based on a particular type of neural architecture, called "the Transformer", which is entirely based on so-called "attention mechanisms" (Vaswani et al. 2017). A context-predicting model, BERT has been pre-trained on approximately 3.3 billion words (800 million words taken from the BooksCorpus, and 2.5 billion words from English Wikipedia) of unlabelled data over a masked word prediction task (in which the objective is to predict randomly masked input tokens based only on the context in which they occur) and a next sentence prediction task (so that the model will also understand sentence relationships).
Like other Transformers, BERT consists of multiple layers (or 'transformer blocks'), all of which contain multiple self-attention heads which behave similarly within their layer. The smallest pre-trained model, called BERT base, consists of 12 layers with 12 attention heads, whereas the larger model, called BERT large, consists of 24 layers with 16 attention heads. Each of these layers captures the n tokens in the input sentence (or rather 'sequence', as the input need not correspond with what linguists have traditionally defined as a sentence) in compressed numerical vector representations or 'embeddings'.
The attention heads within BERT's layers have been probed for the linguistic phenomena they capture. This revealed that particular heads capture syntactic relations (e.g. valency patterns and dependency relations), while others perform well at coreference resolution (Clark et al. 2019), which is remarkable given that the model has not received any explicit input about syntax or coreference. This "syntax-aware attention" (Clark et al. 2019) may be why BERT is successful in the downstream NLP tasks it has been employed in (cf. Peters et al. 2018b). Finally, it is important to note that the different layers (and accompanying attention heads) perform slightly differently on different tasks. In various sources, the second-to-last layer (or a concatenation of the last four layers) is suggested to perform best on token-level tasks such as word sense disambiguation (e.g. Devlin et al. 2019; Wiedemann et al. 2019), but many applications also operate with the final hidden layer (e.g. Huang et al. 2019; Blevins & Zettlemoyer 2020).
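To illustrate what operating with a particular layer amounts to in practice, the sketch below retrieves the second-to-last layer and a concatenation of the last four layers. This again assumes the HuggingFace transformers implementation of BERT base, which is not the wrapper used in the present study.

```python
# Illustration (assumed HuggingFace `transformers` API): retrieving
# specific hidden layers from BERT base.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased",
                                  output_hidden_states=True)
model.eval()

enc = tokenizer("The cat jumped over the fence.", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# out.hidden_states holds 13 tensors: the input embeddings plus the
# output of each of the 12 transformer blocks.
second_to_last = out.hidden_states[-2][0]                # (seq_len, 768)
last_four = torch.cat(out.hidden_states[-4:], dim=-1)[0] # (seq_len, 3072)
```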
With respect to linguistic investigation into polysemy and sense disambiguation, the contextualized embeddings produced by BERT have thus far not been explored. One reason may be that neural models have grown into increasingly opaque systems (Linzen et al. 2019: iii), making linguists more reluctant to rely on them for linguistic analysis. Still, it is worth investigating to what extent they could be employed as analytic tools in linguistic research, as neural language models similar to (but consistently outperformed by) BERT have already been shown to capture very nuanced aspects of meaning, and they even seem to provide usable representations for function words (see Boleda (2020: 18), in reference to Peters et al. (2018a)). Furthermore, a model such as BERT also unites the strengths of different types of token-based distributional methods. First, the fact that the model is syntax-aware agrees with the cognitive-linguistic (and constructionist) view that differences in syntactic structures reflect differences in meaning (Langacker 1991; Goldberg 1995). As such, its syntax-awareness sets BERT apart from bag-of-words approaches to contextualized vectors (e.g. Heylen et al. 2015), and thus makes it more akin to, for example, the Behavioural Profiles approach. Second, the application of BERT to the study of polysemy does not involve any manual annotation, and neither does it involve making an a priori selection of syntactic features to be included, making it a fully data-driven approach to the question at hand.

Data and Methodology
In the present study, BERT base has been used to create contextualized embeddings for all occurrences of over in the final decade of the Corpus of Historical American English (COHA, 2000-2010). In total, embeddings were created for 39,834 tokens of over using the spaCy implementation of BERT base (which, at the time the analysis was conducted, only offered access to the final hidden layer). The embeddings of the target tokens were created with a context window set to 20 words preceding and 20 words following over. In principle, the performance of the model in the task at hand could still be improved by experimenting with different hyperparameter settings or by fine-tuning the model to specific tasks or corpus data, but no such operations were undertaken.
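The embedding step can be approximated as in the sketch below. This is a reconstruction under stated assumptions rather than the study's actual code: it uses the HuggingFace transformers implementation of BERT base instead of the spaCy wrapper, the input text is invented, and the window slicing is simplified (e.g. punctuation is assumed to be whitespace-separated).

```python
# Reconstruction under stated assumptions (HuggingFace `transformers`
# instead of the spaCy wrapper; invented input text; naive tokenization).
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def over_embeddings(text, window=20):
    """Final-hidden-layer embeddings for every 'over' in `text`, each
    computed from up to `window` words on either side."""
    words = text.split()
    vectors = []
    for i, word in enumerate(words):
        if word.lower() == "over":
            context = words[max(0, i - window): i + window + 1]
            target = min(i, window)  # index of 'over' within the slice
            enc = tokenizer(context, is_split_into_words=True,
                            return_tensors="pt")
            with torch.no_grad():
                hidden = model(**enc).last_hidden_state[0]
            # map the word index to its (first) WordPiece position
            vectors.append(hidden[enc.word_ids().index(target)])
    return vectors

print(over_embeddings("The helicopter hovered over the hill .")[0].shape)
```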
Of the 39,834 tokens, 808 examples were manually annotated by two human annotators, following the sense descriptions in Tyler & Evans (2001; 2003). Note that the proposal by Tyler & Evans does not constitute a 'gold standard', as its cognitive reality remains to be tested against elicited, experimental data. Yet, their proposal was chosen as a point of comparison because (i) it is well-documented, (ii) it is firmly grounded in and motivated by linguistic theory, and (iii) it presents the most comprehensive assessment of all possible senses of over since Lakoff (1987). Furthermore, because their proposal focusses not only on motivating the number of distinct senses, but also on motivating the connections between those senses by foregrounding the importance of image schemas, it lends itself well to an assessment of whether such imagistic information is encoded in corpus data and captured by word embeddings.
In total, 16 different sense categories were distinguished. Following the example of Tyler & Evans (2001), the categories are given a label that corresponds with their status as a discrete sense (i.e., 1, 3, etc.) and subsense (i.e., A, B, etc.). Note that the 808 examples constitute a stratified sample: first, a random sample of 300 tokens was manually annotated. Subsequently, the sample was expanded with further examples until each sense category was represented by at least 10 tokens. The token frequencies per sense category are listed in Table 4.
Between the two human annotators, inter-rater agreement was found to be very good (Fleiss' Kappa = 0.867). In the statistical analyses presented below, 26 examples were excluded because they were considered indeterminate between multiple categories. Further information on the sense categories is provided in Section 3.1.
Ultimately, the sense categorization proposed by Tyler and Evans was created in response to models that are too fine-grained, and hence lack what Tyler & Evans consider to be meaningful, principled abstractions. Thus, the question we are in fact asking is to what extent these abstractions are also 'meaningful' to models such as BERT, which approach prepositional meaning by compressing contextual data. To address this question, I adopt a procedure based on the Varying Abstraction Model (Vanpaemel & Storms 2008). Originally, the Varying Abstraction Model (VAM) was designed in response to the debate in psychology on how the classification accuracy of exemplar-based models (involving no abstraction) compares to that of prototype models (involving complete abstraction) as well as models involving intermediate levels of abstraction. The procedure adopted here is based on the k-means variant of the VAM (Verbeemen et al. 2005), where a particular level of abstraction is operationalized by the degree to which category members are clustered.
The VAM conducts a series of evaluation tasks, in which it predicts the category label of unseen test tokens and compares the prediction against a manually assigned label. The series starts with the prediction of the category label of an unseen test token based on its nearest neighbour embedding in a labelled training set. At this level, none of the token embeddings in the training set have been clustered, which, one could argue, means that the number of 'clusters' k equals the number of tokens n in the training set (k = n). This is also called the 'exemplar level'. In other word sense disambiguation studies, the performance of models such as BERT is commonly assessed solely at this level (see, e.g. Wiedemann et al. 2019). In the present study, however, the assessment is also taken beyond the exemplar level: the VAM will subsequently attempt the same classification task again, but instead of using all the embeddings of the concrete tokens in the training set as a reference set, it will create a slightly higher level of 'abstraction' by merging the embeddings of some concrete tokens of the same sense category into a slightly more schematic, averaged representation. At every step of the VAM procedure, an increasing number of token embeddings are merged, until all embeddings of all training tokens that belong to the same sense category are merged into a single averaged embedding. A schematic representation of the different steps of abstraction is presented in Figure 1. At the highest level of abstraction, then, the classification task of the unseen test tokens is attempted by means of a clustered or averaged representation of all embeddings assigned to that category. This averaged embedding can hence be thought of as a 'contextualized sense embedding'. Because I consider the 16 sense categories of the principled polysemy model (described in Section 3.1) to represent the highest level of abstraction, the number of clustered representations by means of which classification is attempted at this level is 16 (k = 16). Note that the level of abstraction could be increased further by reducing the number of clusters to 8 (to attempt the classification task using clustered representations of, for instance, all tokens labelled as sense 5A, 5B and 5C), but no such actions were undertaken in this study.
The series of classification tasks (from k = n to k = 16) is evaluated against a test set (20% of the data; 100 iterations per level; see Figure 2), and the results are expressed as an accuracy score (F1-score, between 0 and 1, with 1 representing perfect accuracy). The resulting series of classification accuracy scores allows us to assess the following: if the sense classification task goes well at the lowest level of abstraction (the exemplar level), we find that the contextual information encoded in the BERT embeddings of over captures local similarities between concrete tokens of the same sense category. As the level of abstraction increases, the classification task will involve classifying unseen tokens not by means of other, concrete tokens, but by means of averaged contextual representations of multiple tokens that have been assigned the same label. In other words, the model will attempt the classification of unseen tokens by means of contextual representations that are decreasingly concrete and increasingly schematic (that is, representing abstractions over multiple tokens in the same sense category).
If classification accuracy of the unseen test tokens remains high when all training tokens of the same category are averaged into a 'sense embedding', we find that these abstract contextual representations are helpful tools to categorize new, unseen tokens. In that case, we could say that the 'meaningful abstractions' or sense categories proposed by Tyler and Evans also make sense in terms of the contextual information encoded in BERT embeddings.
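A schematic sketch of the procedure is given below. It is a simplified reconstruction, not the study's implementation: it assumes a matrix X of token embeddings and an array y of sense labels (here replaced by random stand-in data), and it operationalizes the abstraction continuum as a number of k-means clusters per sense category, with one cluster per category corresponding to the fully averaged 'sense embeddings'.

```python
# Simplified VAM-style evaluation: abstraction = k-means clusters per
# sense category; X and y below are random stand-ins for the real data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 768))                   # stand-in embeddings
y = rng.choice(["1", "2A", "2B", "3"], size=800)  # stand-in labels

def vam_step(X, y, clusters_per_sense):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
    refs, ref_labels = [], []
    for sense in np.unique(y_tr):
        members = X_tr[y_tr == sense]
        k = min(clusters_per_sense, len(members))  # k=1: full abstraction
        km = KMeans(n_clusters=k, n_init=10).fit(members)
        refs.append(km.cluster_centers_)
        ref_labels += [sense] * k
    refs, ref_labels = np.vstack(refs), np.array(ref_labels)
    # classify each test token by its most similar reference embedding
    nearest = cosine_similarity(X_te, refs).argmax(axis=1)
    return f1_score(y_te, ref_labels[nearest], average="weighted")

# decreasing clusters per category = increasing abstraction
print([round(vam_step(X, y, k), 2) for k in (100, 20, 5, 1)])
```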
Besides assessing to what extent BERT embeddings can be used to distinguish the sense categories proposed in the principled polysemy model, I will also discuss the global structure of the network proposed by Tyler & Evans (2001). To discuss distances between sense categories, I use the cosine similarity between the embeddings (see, e.g., Bullinaria & Levy (2007); Heylen et al. (2015); Peters et al. (2018b)).
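For reference, the cosine similarity between two (for BERT base, 768-dimensional) embeddings u and v is

\[
\cos(\mathbf{u},\mathbf{v}) \;=\; \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert} \;=\; \frac{\sum_{i=1}^{768} u_i v_i}{\sqrt{\sum_{i=1}^{768} u_i^{2}}\,\sqrt{\sum_{i=1}^{768} v_i^{2}}},
\]

with the corresponding cosine distance taken as 1 - cos(u, v).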

Sense categories
In what follows, I will describe the categories distinguished in Tyler & Evans, illustrating them with examples from the data set. For an in-depth description of the sense categories and a full argumentation as to why these (and only these) categories have been distinguished, I refer to Tyler & Evans (2001; 2003).

Sense 1: 'above'
The first category contains all examples in which over signals that the TR is located above the LM. This category is considered to be the primary sense or 'protoscene' from which all other senses can be derived (Tyler & Evans 2001: 735-737). The relation expressed is an atemporal, spatial relation, where the TR is typically in close proximity to the LM. In many cases, the TR is movable and smaller than the LM (as in (15)), but immovable (e.g. (16)) and larger (e.g. (17)) TRs occur as well.

Sense (group) 2: A-B-C trajectory
Besides the protoscene, Tyler & Evans also distinguish a number of derived senses. Four of these can be conceived of as a 'cluster' of senses where over marks a trajectory from a starting point (A), via a midpoint (B), to an endpoint (C). While not all senses in this cluster put equal focus on all points in the trajectory, the uniting factor seems to be that there is a certain linearity to the expressed relation.

ON-THE-OTHER-SIDE-OF (2A)
In examples (18) and (19), the TR is portrayed as being not above, but on the other side of the LM:

(18) God, let this be the peak. Let us be over the mountain. (COHA, 2007)

Note that the verb itself does not trigger the trajectory reading (when it does, the example will be assigned to the protoscene).
While the examples in (18) and (19) both function as prepositions, a large group of tokens in this category function as an adprep. In some cases, such as (20), the verb is combined with an adverbial phrase that indicates the endpoint of the trajectory. Thus, it could be argued that the combination of the verb and the endpoint adverbial already implies movement along a trajectory. The addition of over, then, seems to have a mere emphatic function.

TIME SPAN (2E)
In a fairly large number of examples, over "mediates a temporal relation of concurrence between a process or activity and the times during which the process or activity elapses" (Tyler & Evans 2001: 748-749), as illustrated in (36) and (37):

(36) The war on witchcraft intensified over the next 200 years, sending millions of cats, not to mention humans, to their deaths. (2001, COHA)
(37) Geologists and biologists before Darwin noted that the Earth and its inhabitants change over time. (2004, COHA)

Note that the question may be raised whether these examples do in fact constitute a distinct sense of over, as the temporal reading is inferable from the fact that the LM consistently involves a noun that refers to a time-related concept. The choice was made to create a separate label for these examples, but if the inferability criterion is adhered to more strictly, these examples could perhaps be classified as instances of Sense 2A.

Sense 3: Covering
Like Lakoff, Tyler & Evans also distinguish a category of examples such as (38), in which there is "an understood viewpoint from which the TR is blocking accessibility of vision to at least some part of the LM" (Lakoff 1987: 429). In these cases, the TR is not located above the LM from the vantage point of the viewer (Tyler & Evans 2001: 752):

(38) A ratty leather jacket gaped open to reveal a white button-front shirt over an ample but not outrageous bosom. (2005, COHA)

The 'covering' sense also includes examples where there is a "multiplex trajector" (Lakoff 1987: 428) that is scattered over the LM, as in (39), or where the TR has covered a path consisting of multiple points over the LM, as in (40).

Sense (group) 4: Proximity
As in sense 3, the TR is no longer necessarily positioned above the LM from the perspective of the viewer. Instead, over conveys that there is close proximity between the TR and LM. This proximity goes beyond the spatial realm, and is manifested in the attention paid by the TR to the LM. Examples such as (45) and (46) were classified as instances of sense 4A, but it must be noted here that the distinction between these examples and examples of sense 5B, which are discussed below, is difficult to maintain.

Sense (group) 5: Up
Four further senses fall under "the up cluster", which are suggested to derive "from construing a TR located physically higher than the LM as being vertically elevated or up relative to the LM" (Tyler & Evans 2001: 755).

MORE (5A)
The first (and most frequently occurring) sense is 5A. In all examples in this category, over indicates that a quantity is higher than the quantity expressed in the LM:

(47) Jerry and I were parents to over fifty foster kids in our thirty years of marriage. (2004, COHA)

Tyler & Evans distinguish one further sense, sense 5A.1, where the TR is understood as something that is contained by, but exceeds the capacity of, the LM. The only clear example discussed is overtired, in The child was overtired and thus had difficulty falling asleep. The data set in the present paper does not include compounds with over. In a footnote, Tyler & Evans (2001: 757) explain that it is often possible to "construct a 'more' conceptualization" alongside "an 'excess' interpretation". In practice, this seemed to apply to nearly all examples in the data set. As such, no distinction was made between sense 5A and 5A.1.
It is, in many cases, also extremely difficult to distinguish cases of 'excess' as crossing a target point from cases of excess as exceeding an amount or capacity. Consider, for instance, the example in (48):

(48) But for kids over age 5, as the portion size got larger, so did the amount they ate. (2003, COHA)

While suitable for the 'up' conceptualization, it is not inconceivable that examples such as (48) could also be classified under 2B (as time is a linear concept rather than a container). Tyler & Evans (2001: 758) also address this issue, stating that their network of senses "should be thought of as a semantic continuum, in which complex conceptualizations can draw on meanings from distinct nodes as well as the range of points between nodes, which provide nuanced semantic values". For simplicity's sake, the choice was made to assign all cases where a numeric threshold was exceeded to sense 5A.

Sense Distinctions
As a first point of enquiry, it is investigated whether BERT indeed recognises the sense categories proposed in the principled polysemy model in a relatively distinct and coherent manner. Following Kilgarriff (2003: 108), I define 'senses' as "abstractions over clusters of word usages". In other words, if the abstract, conceptual sense categories proposed in the principled polysemy model are recognized by BERT, we would expect to find that the geometrical distance (operationalized as the cosine distance) between the embeddings of all tokens labelled as Sense 1, for instance, is shorter than the distance between those tokens and tokens with a different category label, thus forming a cluster.
To visualize the local embedding clusters and the global positioning of those clusters relative to one another, a two-dimensional representation of the token embeddings was created based on the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm for dimensionality reduction of high-dimensional data (van der Maaten & Hinton 2008). In Figure 3, each token is represented by a dot, which has been coloured according to its manually assigned label. Note that the embeddings were created solely based on the contextual information surrounding over, and that the manual labels were assigned separately. The distributional model was hence not fed any human-defined knowledge about the number or nature of labelled sense categories. The two-dimensional plot therefore visualizes the overall correspondence between clustered token embeddings, which can be conceptualized as 'distributionally defined senses', and the proposed 'conceptually defined' sense labels. In some cases, all tokens of a sense category cluster together in a single area, but in others, such as 2D ('transfer'), tokens are grouped in separate, relatively distant areas. Thus, the question arises how we can assess the degree to which there is correspondence between the distributionally defined and conceptually defined senses.
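The visualization step can be sketched as follows; random stand-ins replace the actual embeddings and labels here, and the use of a cosine metric for t-SNE is an assumption rather than a documented setting of the study.

```python
# Sketch of the t-SNE visualization; random stand-ins replace the real
# embeddings and labels, and metric="cosine" is an assumption.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(808, 768))                        # stand-in embeddings
labels = rng.choice(["1", "2A", "3", "5A"], size=808)  # stand-in labels

coords = TSNE(n_components=2, metric="cosine",
              init="random", random_state=0).fit_transform(X)

for sense in np.unique(labels):
    idx = labels == sense
    plt.scatter(coords[idx, 0], coords[idx, 1], s=10, label=sense)
plt.legend()
plt.title("t-SNE projection of token embeddings of 'over'")
plt.show()
```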

From eyeballing to classification tasks
To address this question in a way that goes beyond eyeballing a visualization, this study uses a series of classification tasks (the VAM), which help quantify the extent to which there is correspondence between the sense categories emerging from the two models. Figure 4 presents the results of the VAM when applied to all categories (over 100 iterations). The x-axis represents the abstraction continuum, which starts at no abstraction (the exemplar level, where classification of unseen tokens (20% of the data) is attempted by means of the nearest neighbour embeddings of concrete tokens) and reaches up to the target level (the highest level of abstraction, where all items within the 16 labelled categories are clustered into averaged 'sense embeddings').
The y-axis represents the classification accuracy of the distributional model (F1-score, i.e. the harmonic mean of precision and recall when applied to the test set).
At the lowest level of abstraction, the model's classification accuracy is quite high at 0.95, and it remains relatively stable at this level until approximately 200 tokens have been clustered. Subsequently, its accuracy gradually drops to 0.8 (at approx. 300 tokens), after which it drops below 0.7 at the highest levels of abstraction. These findings imply that the BERT embeddings do encode the similarities between members of the categories proposed in the principled polysemy model, but only up to a certain point. Beyond that point, the proposed abstractions no longer optimally fit the output of the distributional model.
When we assess the model's classification accuracy per sense category (Figure 5), we also find that the model is more successful in 'recognizing' some sense abstractions than others.
Overall, the embeddings of over clearly encode the similarities between concrete tokens, and the performance of the model remains high at lower-intermediate levels of abstraction, where only some of the concrete token embeddings are merged into slightly more schematic representations. Yet, whether higher levels of abstraction are still meaningfully encoded in the BERT embeddings of over seems to depend on the sense category under scrutiny. Unsurprisingly, perhaps, it is precisely the sense categories that occur in fixed syntactic configurations and have clear collocational preferences (e.g. Sense 2C 'completion', in which over consistently functions as an adprep in combination with a form of BE) that are easier to group than more 'schematic' senses. There are, however, a number of notable cases where the model's classification accuracy drops sharply when more items of the same category are merged. This pertains to Sense 3, where the model does not recognize similarities between, for instance, Spread a tablecloth over the table and He received votes from all over the floor. Reassuringly, this ties in with other analyses that have argued that 'covering' does not adequately capture the interpretation of all over (Queller 2001; Taylor 2006; Pawelec 2010: 98-101). Furthermore, Sense 4A ('focus of attention') and 5B ('control') suffer from the high number of false positives from the other category that the model wishes to assign to them (see Section 3.3.4-3.3.5 on the difficulty of distinguishing 4A and 5B). Finally, drops in performance can also be witnessed for categories 2A and 2D, as the model seems to have difficulties in relating examples describing physical and non-physical scenes.

Relations between senses
Having established that there is some correspondence between clustered BERT embeddings and the proposed sense categories (up to a certain point), we can now turn to the question whether the geometrical distances between the various senses of over (as emergent from the distributional semantic model) correspond with the semantic relationships proposed in the prepositional polysemy network of over proposed by Tyler & Evans (2001: 746), reproduced in Figure 6.

The dark, full nodes in the suggested network representation in Figure 6 constitute what are considered to be separate senses, whereas the empty nodes are included as abstractions over a proposed cluster of related, derived senses. At the centre, we find the protoscene (Sense 1). Because the representation in Tyler & Evans has been constructed based on theoretical principles, it makes little sense to assess the representation based on the absolute geometrical distances between embeddings derived from the distributional model. It does make sense, however, to operationalize the directness of node linkage in the proposed network in terms of relative distances between embeddings. More specifically, we could hypothesize that, if Sense 4A is not directly derived from the protoscene (Sense 1), but emerged as a further extension derived from Sense 4, we would expect the relative distance between Sense 4A and Sense 1 to be greater than the relative distance between Sense 4A and Sense 4. Similarly, if Senses 2A, 2B, 2C and 2D form a cluster of related senses, we would expect the relative distance between, for instance, Sense 2A and 2B, or 2A and 2D, to be shorter than the relative distance between Sense 2A and Sense 3.
To compare the principled polysemy network to the output of the distributional model by means of cosine distances, two approaches can be taken. First, we may approach the comparison by taking the senses proposed by Tyler & Evans (2001; 2003) as given, and rely on manually assigned labels to create an averaged sense embedding, that is, a summary embedding similar to the clusters created at the highest level of abstraction in the VAM. Subsequently, we can calculate cosine similarities between these sense embeddings. If the result of this assessment turns out to be that the relative distance between the sense embeddings maps onto the suggested relative distances in the network in Figure 6, we could conclude that both models arrive at the same network representation in a relatively straightforward manner.
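This first approach can be sketched in a few lines (again with random stand-in data in place of the actual embeddings and labels):

```python
# Averaged 'sense embeddings' and their pairwise cosine similarities;
# X and y are random stand-ins for the embeddings and manual labels.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.normal(size=(808, 768))
y = rng.choice(["1", "2A", "2B", "3"], size=808)

senses = np.unique(y)
sense_embeddings = np.vstack([X[y == s].mean(axis=0) for s in senses])
similarities = cosine_similarity(sense_embeddings)  # one row per sense
print(dict(zip(senses, similarities[0].round(3))))  # similarities to '1'
```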
However, as explained in Section 4.1, the proposed sense categories do not always correspond with the way in which the tokens of a manually labelled category cluster, with categories such as 2A breaking up into multiple rather distinct groupings. A second approach, then, would be to adhere less strictly to the sense categories proposed by Tyler & Evans, and determine the geometrical distance between the token clusters proposed by the distributional model. In what follows, I restrict myself to this second approach. In order for a group of tokens to be considered a cluster, it was decided that there should be at least 5 neighbouring tokens of the same sense category. As such, the smallest clusters represented in Figure 7 consist of 5 tokens.

As was briefly pointed out in Section 4.1, seven of the conceptual categories are 'consistent', with all tokens clustering together. This includes 2E ('time span'), 2C ('completion'), 4 ('examining'), 5B ('control'), 6 ('reflexive'), 7 ('communication channel') and 8 ('hangover'). For the remaining nine categories, the model suggests that there are at least two different clusters. In some cases, the separate clusters are still part of the same higher-order branch (e.g. 1 'above', 4A 'focus-of-attention', 6A 'repetition'), whereas for others, the embeddings are less closely related (e.g. 2A 'other-side', 2D 'transfer', 5A 'excess'). All in all, the suggested distances between the sense groupings differ substantially from the proposal put forward by Tyler & Evans: not only does the distributional model suggest fairly large distances between token groupings that Tyler & Evans would have assigned to the same category (based on their shared underlying image schema), but the proposed relative distances in the sense network are also not reflected in the geometrical distances between the (groupings of) embeddings.
Given that BERT does not group tokens of over according to abstract similarities in spatial configurations between them, the question that remains is what kind of groupings the model does suggest, and whether any (other) meaningful abstractions can be made. To address this question, we can examine the token groupings illustrated in Figure 7.
GROUP 1 - SPACE

The first group that emerges appears to be a collection of spatial uses of over. Tokens grouped in Group 1 include all tokens assigned to Sense 1 ('above'), and some tokens of the spatial 'excess' sense, 2B, in which a spatial border or threshold is exceeded. Note, however, that this cluster does not correspond with Kreitzer's (1997) over 1. [...] Group 4, by contrast, brings together non-spatial uses of over that involve animate TRs (e.g. She was passed over for the job). Group 4 also contains all tokens of 4 ('examining'). Given that the latter group of examples frequently (but not exclusively) involves the phrasal verb combination look over, it is not surprising that these tokens are positioned relatively close to (but, notably, are not confused with) cases where a glance is cast (classified as 2A). Finally, we also find all tokens of Sense 7 ('means of communication') in Group 4. While perhaps more loosely related to the 'mind' and 'perception' relations, Sense 7 also involves animate TRs and a non-spatial interpretation (i.e. in a sentence like they spoke over the phone, the preposition does not capture a physical, spatial positioning of the TR and LM).
GROUP 5 - PATH

Related to Group 3, we find a cluster of tokens where over again has a spatial interpretation. Yet, unlike the tokens in Group 1, over functions as an adprep, and involves movement along a path (and hence partially overlaps with Kreitzer's (1997) dynamic over 2). These include examples of 2D ('transfer') as well as examples of 2A where the TR moves to a different location (e.g. I made my way over (to the computer)).

GROUP 6 - EXCESS (IN NUMBERS)
In the remaining set of tokens, then, a cluster of 5A tokens can be distinguished in which the LM is an amount or quantitative threshold (e.g. She mentioned that I donated over $100,000 to Katrina victims (COHA, 2006)); this cluster is also identified by Newman (2011).
GROUP 7 - NON-SPATIAL ADPREPS AND FIXED PHRASES While it is not easy to make sense of Group 7, it is interesting to note that all tokens classified as Sense 6 ('reflexive') and 6A ('repetition') are part of this group. Yet, rather than clustering together, they are positioned close to token groupings classified as Sense 3 (in the fixed combination all over, e.g. I feel pain all over), Sense 5A (non-literal adprep uses, as in their personal involvement will spill over into their workplace interaction), and Sense 8 ('hangover'). The relation between these groupings seems formal rather than semantic, as the majority of groupings present cases where over functions as an adprep and occurs in relatively fixed phrases.
If we approach the complex internal semantic structure of over by means of BERT embeddings, then, it appears that global clusters are formed on the basis of an intersection of similarities in conceptual domain (i.e. spatial, temporal, mental, etc.) on the one hand, and syntactic resemblance on the other. Overall, the lack of correspondence between the suggested network configuration in Figure 6 and the global distances between the grouped embeddings does not necessarily indicate that the global groupings are uninterpretable, or that no abstractions can be made; rather, it suggests that an embedding-based approach leads to different abstractions. Yet, the fact that BERT is adept at recognizing metaphorical uses of over does not necessarily imply that it also recognizes that they are, in fact, metaphorical extensions of a literal, spatial source. This becomes evident when we consider the model's output at the global level. When we examine the relations between the token groupings that emerge from the cluster analysis presented in Section 4.2, we find that the geometrical distance between the embedding of a particular spatial use of over and that of a non-spatial use it may have given rise to via metaphorical extension is no shorter than the geometrical distance between the embeddings of two literal spatial uses (e.g. within Group 1) or two metaphorical uses (e.g. within Group 4) with different underlying image schemas. As such, if there is indeed a close connection between a spatial configuration and the non-physical scene it embodies, there may be a discrepancy between the geometric distances between the token embeddings of over and the actual conceptual similarity between those tokens.
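The comparison of global distances described above can be made concrete as follows. Under the (hypothetical) assumption that each token embedding carries a group assignment, the sketch below computes cosine distances between group centroids; the group names and the placeholder data are assumptions for illustration.

```python
# Sketch: compare geometric distances between group centroids, assuming a
# matrix `embeddings` and a parallel list `groups` of group assignments.
import numpy as np
from scipy.spatial.distance import cosine

def centroid(embeddings: np.ndarray, groups: list[str], name: str) -> np.ndarray:
    """Average embedding of all tokens assigned to one group."""
    mask = np.array([g == name for g in groups])
    return embeddings[mask].mean(axis=0)

# Hypothetical data standing in for the annotated tokens.
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(782, 768))
groups = rng.choice(["group1_space", "group4_mind", "group5_path"], size=782).tolist()

g1 = centroid(embeddings, groups, "group1_space")
g4 = centroid(embeddings, groups, "group4_mind")
g5 = centroid(embeddings, groups, "group5_path")

# If image-schematic links were encoded, the spatial group might be expected
# to sit closer to its metaphorical extensions than to other groups.
print("space vs mind:", cosine(g1, g4))
print("space vs path:", cosine(g1, g5))
```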

Discussion and Conclusion
Within Cognitive Linguistics, there has been no shortage of proposals for modelling polysemy networks, in which syntactic configurations, collocations, and the notion of underlying image schemas are of central concern. To minimize the arguably subjective nature of further proposals, linguists are increasingly turning to the use of distributional, statistical methods, and, most recently, to deep contextualized neural language models. In the present study, I investigated to what extent the output of a fully unsupervised application of BERT (a 'meaning as context' model) corresponds with the sense network of over proposed by Tyler & Evans (2003; 2001) (a 'meaning as concept' model). The analyses reveal that, while there are interesting correspondences between the two approaches, they ultimately lead to different abstractions. Which of these abstractions most closely approximates the abstractions that emerge from elicited, experimental data (or the extent to which the 'context' and 'concept' models are complementary) remains an open question that needs to be addressed by means of behavioural studies. However, because Tyler & Evans' proposal foregrounds the importance of sense connections via image schemas, the analysis presented in this study does provide some insight into the extent to which such imagistic information may be encoded in BERT embeddings.
The observation that BERT does not immediately capture similarities in terms of image-schema resemblance can be understood in light of the fact that the model has been trained on linguistic data alone, and has no experience with non-linguistic, perceptual information (such as spatial configurations, but also, for example, visual properties such as colours: Sommerauer & Fokkens 2018). Hence, BERT embeddings pick up fine-grained semantic distinctions based on collocational and morpho-syntactic cues, and can be employed to successfully group senses into distinct domains. However, BERT embeddings seem less equipped to flag abstract configurational resemblances in image schemas across those domains, which helps highlight what sort of semantic information is (and is not) encoded in contextualized embeddings.
Note that, if the perceptual, imagistic information that motivates the abstractions and sense connections made in the cognitive-conceptual model is still somehow encoded in contextual information (as appears to be suggested by Gromann & Hedblom (2017)), such information could be brought to the fore by further experimentation with the model's hyperparameters (e.g. different context window sizes, different (combinations of) layers), or by fine-tuning the model on a sense classification task by exposing it to manually labelled examples during training. Yet, if such perceptual information is not represented in context embeddings and requires extralinguistic knowledge, an interesting avenue to pursue is, for instance, to train language models on coupled textual and visual input (Chrupała et al. 2015). Of course, whether such additional training and supervision is desirable depends entirely on the question the researcher wishes to address, and which facets of meaning they deem relevant within their study or theoretical framework. In Cognitive Linguistics, researchers may be inclined to say that a model of meaning representation should capture the global resemblance between the underlying image schemas of prepositions, as image schemas (and embodiment) are part of the core tenets of the framework (e.g. Oakley 2010; Gibbs & Matlock 2001: 233), and play an important role in, for instance, studies of semantic change and grammaticalization (e.g. Rhee 2002).
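To illustrate the kind of layer experimentation alluded to here, the sketch below extracts an embedding for over from different (combinations of) BERT layers using the HuggingFace transformers library. The checkpoint, the example sentence, and the choice of summing the last four layers are assumptions made for the sake of illustration, not the configuration used in this study.

```python
# Sketch: extract a contextualized embedding for 'over' from a chosen
# combination of BERT layers; checkpoint and layer choice are assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentence = "She mentioned that I donated over $100,000 to Katrina victims."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of (input embeddings + 12 layers), each (1, seq, 768).
hidden_states = outputs.hidden_states

# Locate the WordPiece position of 'over' in the tokenized sentence.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
position = tokens.index("over")

# One possible configuration: sum the last four layers (an assumption).
token_embedding = torch.stack(hidden_states[-4:]).sum(dim=0)[0, position]
print(token_embedding.shape)  # torch.Size([768])
```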
As a final concluding remark, I wish to add that the findings presented in this study have important implications for the integration of neural language models, and perhaps, more generally, the application of Semantic Vector Space Models, in theoretical linguistic research, and in particular, in research on semantic change. In a recent publication, Boleda (2020) surveys a number of studies that have applied either count or predictive models to historical and diachronic corpus data. Such studies, which involve the examination of nearest neighbours and cosine similarities between type- and/or token-vectors, have provided the key to detecting, as well as describing, the diachronic trajectory of lexical and, albeit less commonly, grammatical semantic change; by extension, I could add that they may also help assess competing hypotheses regarding the mechanisms of change at play in a particular diachronic development. Some steps have already been taken in this direction (e.g. the automated detection of semantic broadening and narrowing in Sagi et al. (2011) and Giulianelli et al. (2020)), and indeed, BERT could be an excellent tool for detecting metaphorical extensions of linguistic items in diachronic corpora (Giulianelli et al. 2020). However, it should be clear that, at least when left entirely unsupervised, BERT does not seem to pick up that there may be abstract, imagistic similarities between domains. As such, researchers interested in studying metaphorical extensions (of prepositions or otherwise) should take into consideration that unsupervised BERT will be great at indicating that a metaphorical extension has occurred from one domain to another, but it does not reveal which perceptual similarity pattern is the most likely source of the extension. It could be possible, however, to tackle these issues by experimenting with additional supervision and different model architectures, and, crucially, by accelerating the dialogue on how to integrate these models into theoretical linguistic research, and vice versa.
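By way of illustration of the diachronic application discussed above, the following sketch compares period-specific 'prototypes' of a form by averaging its token embeddings per time slice and measuring the cosine distance between the averages (broadly in the spirit of Giulianelli et al. 2020, though the procedure there differs in its details); all names and data are hypothetical.

```python
# Sketch: a crude change score between two time slices, assuming per-period
# matrices of token embeddings for the same form are available.
import numpy as np

def prototype(token_embeddings: np.ndarray) -> np.ndarray:
    """Average the token embeddings of one period into a single vector."""
    return token_embeddings.mean(axis=0)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings of 'over' sampled from two corpus slices.
rng = np.random.default_rng(2)
tokens_1900s = rng.normal(size=(200, 768))
tokens_2000s = rng.normal(size=(200, 768))

# A larger distance between period prototypes is one (coarse) signal that
# the form's contextual distribution, and possibly its sense inventory,
# has shifted between the two periods.
print(cosine_distance(prototype(tokens_1900s), prototype(tokens_2000s)))
```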

Figure 1
Figure 1 Schematic representation of the abstraction continuum. At the lowest level of abstraction, the items in the training set that can be used for classifying an unseen token are the embeddings of the concrete tokens in that set. At intermediate levels, the number of items that can be used to classify an unseen token is gradually reduced, as an increasing number of token embeddings are merged into an averaged embedding. When complete abstraction is reached, all items of the same category are merged into a single, averaged 'sense embedding'.

Figure 2
Figure 2 Schematic representation of the VAM procedure. Starting from no clustered items, the VAM uses different configurations of clustered or averaged embeddings (which represent different levels of abstraction) in a training set (80% of the data) to classify an unseen test set (20% of the data).
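As a concrete (and deliberately simplified) illustration of the procedure in Figures 1 and 2, the sketch below implements the maximal-abstraction end of the continuum: all training tokens of a category are merged into a single averaged 'sense embedding', and each held-out token receives the label of its most similar sense embedding. The 80/20 split follows the caption above; the variable names, the placeholder data, and cosine similarity as the decision criterion are assumptions.

```python
# Sketch: classify held-out tokens against fully averaged 'sense embeddings'
# (the maximal-abstraction end of the continuum in Figure 1).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(3)
X = rng.normal(size=(782, 768))                        # token embeddings
y = np.array([f"sense_{i % 16}" for i in range(782)])  # sense labels

# 80% train / 20% test, as in the VAM caption above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Merge all training tokens of each category into one averaged embedding.
senses = sorted(set(y_train))
sense_embeddings = np.stack([X_train[y_train == s].mean(axis=0) for s in senses])

# Assign each test token the label of its most similar sense embedding.
similarities = cosine_similarity(X_test, sense_embeddings)
predictions = np.array(senses)[similarities.argmax(axis=1)]
print("accuracy:", (predictions == y_test).mean())
```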
Example tokens shown in the figure: "… grown up only a few hours away, over the Kentucky line." (COHA, 2007); "… into the crawlspace from my closet and climb all over the house." (COHA, 2005)

Figure 3
As Figure 3 shows, the distributional model proposes a fair number of distinct token clusters that correspond relatively well with the suggested sense categories: in the majority of cases, the local clusters (or cluster areas) consist of tokens that were assigned the same label. At the bottom of Figure 3, for instance, a local cluster area was marked which consists entirely of examples of Sense 2A (more specifically, those cases where the TR has mentally overcome an obstacle or a past relationship). Yet, at the same time, the correspondence between the two models seems to become weaker at the global level. In some cases, such as Sense 2C ('completion'), 2E ('time span'), 4 ('examining'), 7 ('means of communication') and 8 ('hangover'), it appears that all tokens with the same label are assigned to the same local cluster.

Figure 4
Figure 4 VAM output over all data.

Figure 5
Figure 5 VAM output per sense category.

Figure 7
Figure 7 presents a hierarchical cluster tree of the annotated tokens. Of the 808 annotated examples, 26 examples were excluded because they were considered unclear or ambiguous between multiple readings (cf. example (42)). The clustering presented in Figure 7 is based on the cosine distance between the embeddings of the remaining 782 examples. The coloured areas represent clusters of neighbouring tokens that were assigned to the same category. In order for a group of tokens to be considered a cluster, it was decided that there should be at least 5 neighbouring tokens of the same sense category. As such, the smallest clusters represented in Figure 7 are based on at least 5 examples, and the largest (i.e. cluster 2E, 'time span') contains 51 examples. In total, 47 examples did not have at least 4 neighbours of the same type.
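The cluster criterion mentioned in the caption can be operationalized in a few lines. The sketch below, with hypothetical names and data, scans the leaf order of a dendrogram for runs of at least five neighbouring tokens that share a sense label.

```python
# Sketch: find runs of >= 5 neighbouring same-label tokens in the leaf order
# of a dendrogram, mirroring the cluster criterion in the Figure 7 caption.
from itertools import groupby

def labelled_clusters(leaf_labels: list[str], min_size: int = 5) -> list[tuple[str, int]]:
    """Return (label, run_length) for each run of identical neighbouring
    labels that reaches the minimum cluster size."""
    runs = [(label, len(list(group))) for label, group in groupby(leaf_labels)]
    return [(label, size) for label, size in runs if size >= min_size]

# Hypothetical leaf order produced by the hierarchical clustering.
leaves = ["2E"] * 51 + ["2A"] * 3 + ["2C"] * 7 + ["2A"] * 6
print(labelled_clusters(leaves))
# [('2E', 51), ('2C', 7), ('2A', 6)]
```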


Table 3
Examples such as (22), where a look is thrown at an explicit endpoint, were also assigned to category 2A:
(22) He looked over at the computer. (COHA, 2007)
Finally, a number of examples assigned to this category do not refer to a spatial relation. If we consider examples such as (23), where the LM represents an obstacle or hurdle, …
(27) … seldom-called violations in tennis - the foot fault. It occurs when a player's foot brushes or goes over the baseline when serving. (COHA, 2006)
(28) He's illegally parked. His ass is over the white line. (COHA, 2003)
With such an example, it is difficult to say whether the interpretation of excess is entirely 'context-free' and not evoked or supported by the lexeme limit. Similarly, the 'excess' implicature in examples (27) and (28) may be triggered by fault and illegally. Still, the decision to include a separate category of Sense 2B was maintained. Furthermore included in this category are cases such as (29), which portray a situation where the 'missed target' is a person in line for a reward (usually in the form of a job offer or promotion, as in (30)). The implicature here is that the reward was expected or deserved, but those expectations were not fulfilled. The only difference between these examples and examples of 2A is that some sort of transfer has taken place. However, one could argue that such a transaction is encoded by the verb used in the same construction. Consider, for instance, the example in (34):
(34) The clerk lifted the bill from Peterson's hand and took it over to the second clerk sitting at the desk. (COHA, 2003)
Here, an object is indeed moved by a clerk to another clerk, but there is no explicit indication that the object was given to the second clerk. While different in lexical material, the example in (34) is structurally identical to Tyler & Evans' examples of transfer (e.g. The teller handed the money over to the investigating officer). The key element that triggers the meaning of transfer is therefore perhaps not over, but the verb hand, which somewhat complicates the suggestion that 'transfer' constitutes an encoded sense. Still, examples such as (33) were placed in a separate category 2D.
The first category in Sense group 4 includes examples where the TR is examining the LM. The majority of cases involve the verb look (or near-synonyms such as glance), as in (41), but other verbs (e.g. read, go) occur as well:
After studying my folder and going over the exact sequence of what to speak on, I allow myself the pleasure of flipping on the news. (COHA, 2004)
In the majority of examples included in this category, the LM is the focus of the TR's attention. In these examples, over is equivalent to about, and in some cases, the LM can be considered the cause of the TR's actions. Note that, because the verb preside is used, it is also implied that the TR controls the LM. The same could also be said for examples such as (46), where the notion of control or authority is not encoded in the lexical verb.