The term ‘meaning’, as it is presently employed in Linguistics, is a polysemous concept, covering a broad range of operational definitions. Focussing on two of these definitions, meaning as ‘concept’ and meaning as ‘context’ (also known as ‘distributional semantics’), this paper explores to what extent these operational definitions lead to converging conclusions regarding the number and nature of distinct senses a polysemous form covers. More specifically, it investigates whether the sense network that emerges from the principled polysemy model of
The aim of the present study is to empirically investigate whether there is any correspondence between the generalizations that emerge from different operational models of meaning representation. More specifically, this study focuses on the operational definition of meaning as ‘concept’, as commonly employed in Cognitive Linguistics, and meaning defined as (or derived from) ‘context’, also known as distributional semantics, which has rapidly gained popularity in Corpus Linguistics and Computational Linguistics/NLP. As a case study, it will home in on the semantics of the English preposition
Following Brugman (
When it comes to their discussion of polysemy, such cognitive-conceptual accounts have faced substantial criticism, predominantly aimed at their apparent lack of principled and objective methods to determine how many senses can be distinguished, and how the global design of the polysemy networks is construed. One part of the problem appeared to be that the identification of the core node (or ‘prototype’) of the network seemed to rely solely on subjective, introspective judgements (
In response to the need for more well-defined and data-driven definitions of word senses (e.g
A similar (yet overall more cautious) enthusiasm has been expressed in Linguistics: being almost exclusively corpus-based, distributional approaches have provided a welcome bridge between the “well-established corpus-linguistic research tradition and Langacker’s idea that linguistic representations emerge from linguistic usage” (
Yet, while the distributional approach and the more traditional, cognitive-conceptual approach essentially share the same goal (that is, to capture the complex internal semantic structure of linguistic items in a rigorous, theoretically motivated, principled manner), it is also clear that there may be a non-trivial epistemological difference between the two approaches. The distributional approach to meaning and sense distinctions is based on a premise that essentially conflicts with one of the core criteria of, for instance, the principled polysemy approach (
Prioritizing depth over width, this study is set up as a detailed empirical comparison between two operational models of meaning representation, with one serving as a representative of the cognitive semantic (‘concept’) approach, and one representing the distributional (‘context’) approach. More specifically, this study focuses on one of the most well-developed cognitive-conceptual proposals – the principled polysemy model of
What emerges from the analysis is that BERT clearly captures fine-grained, local semantic similarities between tokens. Even with an entirely unsupervised application of BERT, discrete, coherent token groupings can be discerned that correspond relatively well with the sense categories proposed by means of the principled polysemy model. Furthermore, embeddings of
The interest in prepositional semantics in Cognitive Linguistics stems from the observation that language users are able to use a relatively small set of prepositions to refer to an indefinitely large number of relations and scenes because of their cognitive ability to categorize concepts schematically (
Some key publications in developing the notion of image schemas and embodiment, and integrating those notions into the discussion of meaning representation, are Brugman (
Yet, while they evoke different spatial configurations, the image schemas underlying these examples are still connected to one another, as humans are able to recognize general similarities between abstract image schemas (as demonstrated experimentally by, for instance,
As each small modification to an image schema is mapped onto a discrete sense category, the meticulous and comprehensive accounts set out by Brugman and Lakoff are sometimes called the “full-specification” approach. In the case of Lakoff (
Subsequent proposals, then, set out ways to tackle the apparent lack of a principled procedure to determine the number of distinct (sub)senses. Two notable examples are the proposal of Kreitzer (
(1) | The painting hung over the fireplace. |
(2) | The cat jumped over the fence. |
(3) | The mask is over my face. |
These three relational schemata are also applicable to non-spatial domains, which, Kreitzer explains, are consistently conceptualized in terms of spatial image schemata: the use of
Addressing these issues, Tyler & Evans (
An issue that remains, however, is that there is still no objective, measurable means of determining the global structure of the polysemy network. Besides further attempts to establish which sense constitutes the core node or prototype (
In their accounts, Tyler & Evans choose to remain agnostic on the matter, explaining that “language does not function like a logical calculus which would allow us to … establish absolutely a single, precise derivation for each sense” (
At its core, the distributional approach conceptualizes the meaning of a word (or, more generally, of constructions) as a function of its lexical and grammatical context, and as such, meaning can be approached statistically (
An interesting observation made by Heylen et al. (
Example data set adapted from Newman (
context variables of ‘over’ | dynamicity | TR_concrete | TR_animate | LM | … | |
---|---|---|---|---|---|---|
dynamic | PERSON | concrete | animate | PLACE | … | |
stative | THING | concrete | non-animate | PLACE | … | |
dynamic | EVENT | abstract | non-animate | TIME | … | |
… | … | … | … | … | … | … |
Such data frames can subsequently be converted into a table with numeric information (e.g. the relative frequency of each example with each label), which can then be used as input for statistical analysis (
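This conversion step can be sketched with a simple one-hot encoding, where each categorical label becomes a binary column; the variable names and values below are illustrative, following the example table above, and are not the study’s actual annotation scheme:

```python
import pandas as pd

# Hypothetical annotated examples of 'over' (illustrative labels only).
df = pd.DataFrame({
    "dynamicity":  ["dynamic", "stative", "dynamic", "stative"],
    "TR_concrete": ["concrete", "concrete", "abstract", "concrete"],
    "TR_animate":  ["animate", "non-animate", "non-animate", "animate"],
    "LM":          ["PLACE", "PLACE", "TIME", "PLACE"],
})

# One-hot encode the categorical labels so each example becomes a
# numeric vector usable as input for statistical analysis.
numeric = pd.get_dummies(df)

# Column sums give the frequency of each label across the examples.
print(numeric.sum())
```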
(4) | [ |
(5) |
(6) |
(7) |
(8) |
(9) |
(10) |
The appeal of this approach, according to Newman (
Still, the selection and annotation of the variables remain predominantly manual, with both the choice and the definition of the variables resting with the analyst. As a “logical extension of the statistical state-of-art” (
Because the distributional approach to meaning is based on a relatively simple, and concrete premise, one may be tempted to assume that studies adopting this approach are highly comparable, if not identical in how they operationalize and model meaning. This would, however, be a mistaken assumption. It would be far beyond the scope of the present paper to survey the many different ways in which ‘meaning as context’ has been operationalized (for such a survey, one may consult Turney & Pantel (
A first development of note is the gradual turn from models that produce vectors of word types, to models that are able to create token-based (or ‘contextualized’) vectors. The distinction between type-based and token-based is not so much one of whether or not the resulting vector representations include contextual information – this is the case for both type-level and token-level vectors – but whether or not all contextual occurrences of a single word are conflated into a single vector representation.
Type-based models work from the assumption that a word has a single, constant, ‘core’ meaning (which can be understood as a prototype, cf.
Example of contextual input (based on lexical co-occurrence frequencies) for cat and mouse using a word type representation.
TYPE representation | … | |||||
---|---|---|---|---|---|---|
1 | 2 | 1 | 0 | 0 | … | |
1 | 0 | 1 | 1 | 1 | … | |
(11) | The |
(12) | Do not pet the paws of a |
(13) | The |
(14) | I bought an external |
For a word such as
In response to this issue, models that generate token-specific vector representations were developed. These token-based distributional models – in which individual vectors are assigned to, for instance, the two different examples of
Example of contextual input (based on lexical co-occurrence frequencies) for
TOKEN representation | … | |||||
---|---|---|---|---|---|---|
1 | 1 | 0 | 0 | 0 | … | |
0 | 1 | 1 | 0 | 0 | … | |
1 | 0 | 1 | 0 | 0 | … | |
0 | 0 | 0 | 1 | 1 | … |
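The type/token contrast behind the two tables can be illustrated with a minimal co-occurrence counter; the toy corpus, the target word, and the window size below are my own illustrative choices:

```python
from collections import Counter

corpus = [
    "the cat chased the mouse",        # 'mouse' as animal
    "the mouse squeaked in the barn",  # 'mouse' as animal
    "click the left mouse button",     # 'mouse' as device
]

def token_vectors(target, sentences, window=2):
    """One co-occurrence count vector per occurrence (token) of `target`."""
    vectors = []
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            if w == target:
                ctx = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
                vectors.append(Counter(ctx))
    return vectors

def type_vector(target, sentences, window=2):
    """All occurrences conflated into a single vector for the word type."""
    total = Counter()
    for vec in token_vectors(target, sentences, window):
        total += vec
    return total

print(type_vector("mouse", corpus))    # one vector for the type
print(token_vectors("mouse", corpus))  # one vector per token
```

The type-level function collapses the animal and device uses into one representation, whereas the token-level function keeps one vector per occurrence, which is the property that matters for sense disambiguation.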
Using the terminology employed in Baroni et al. (
By contrast, context-predicting models (yet again a cover term for an extremely varied group of models, ranging from weighted bag-of-words approaches to more syntactically informed variations, with new types of model architectures being added continuously) are designed to approach the construction of semantic vectors from a training-based angle: “Instead of first collecting context vectors and then reweighting these vectors based on various criteria, the vector weights are directly set to optimally predict the contexts in which the corresponding words tend to appear” (
In the present study, the distributional approach is represented by a single model architecture: Devlin et al. (
In a nutshell, BERT is a deep contextualized model based on a particular type of neural architecture, called “the Transformer”, which is entirely based on so-called “attention mechanisms” (
Like other Transformers, BERT consists of multiple layers (or ‘transformer blocks’), all of which contain multiple self-attention heads which behave similarly within their layer. The smallest pre-trained model, called BERTbase, consists of 12 layers with 12 attention heads, whereas the larger model, called BERTlarge, consists of 24 layers with 16 attention heads. Each of these layers captures the
The attention heads within BERT’s layers have been probed for the linguistic phenomena they capture. This revealed that particular heads capture syntactic relations (e.g. valency patterns and dependency relations), while others perform well at coreference resolution (
With respect to linguistic investigation into polysemy and sense disambiguation, the contextualized embeddings produced by BERT have thus far not been explored. One reason may be that neural models have grown into increasingly opaque systems (
In the present study, BERTbase has been used to create contextualized embeddings for all occurrences of
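A minimal sketch of this extraction step, using the Hugging Face transformers library, is given below; the layer choice (last hidden layer) and the single-sentence input are illustrative assumptions, not necessarily the study’s exact settings:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the pre-trained BERT-base model with all hidden layers exposed.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased",
                                  output_hidden_states=True)
model.eval()

sentence = "The painting hung over the fireplace."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Locate the word-piece position of 'over' and take its vector from
# one of the hidden layers (here: the last one).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
idx = tokens.index("over")
embedding = outputs.hidden_states[-1][0, idx]  # 768-dimensional vector
print(embedding.shape)  # torch.Size([768])
```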
Of the 39,834 tokens, 808 examples were manually annotated by two human annotators, following the sense description in Tyler & Evans (
In total, 16 different sense categories were distinguished. Following the example of Tyler & Evans (
Token Frequencies per category.
Category | Tokens | Category | Tokens |
---|---|---|---|
152 | 62 | ||
122 | 39 | ||
24 | 50 | ||
37 | 33 | ||
35 | 31 | ||
50 | 32 | ||
49 | 23 | ||
32 | 11 | ||
26 | |||
Between the two human annotators, inter-rater agreement was found to be very good (Fleiss’ Kappa = 0.867). In the statistical analyses presented below, 26 examples were excluded because they were considered indeterminate between multiple categories. Further information on the sense categories is provided in Section 3.1.
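For reference, an agreement statistic of this kind can be computed with, for instance, statsmodels; the labels below are invented toy data, not the study’s annotations:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy sense labels from two annotators; integers stand in for sense
# categories (invented data, not the study's annotations).
rater_a = np.array([0, 0, 1, 1, 2, 2, 2, 0, 1, 2])
rater_b = np.array([0, 0, 1, 2, 2, 2, 2, 0, 1, 2])

# aggregate_raters expects one row per item, one column per rater;
# it returns a subjects-by-categories count table.
table, _ = aggregate_raters(np.column_stack([rater_a, rater_b]))
kappa = fleiss_kappa(table)
print(round(kappa, 3))
```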
Ultimately, the sense categorization proposed by Tyler and Evans was created in response to models that are too fine-grained, and hence lack what Tyler & Evans consider to be meaningful, principled abstractions. Thus, the question we are in fact asking is to what extent these abstractions are also ‘meaningful’ to models such as BERT, which approach prepositional meaning by compressing contextual data. To address this question, I adopt a procedure based on the Varying Abstraction Model (
The VAM conducts a series of evaluation tasks, in which it predicts the category label of unseen test tokens and evaluates the prediction against a manually assigned label. The series starts with the prediction of the category label of an unseen test token based on its nearest neighbour embedding in a labelled training set. At this level, none of the token embeddings in the training set have been clustered, which, one could argue, means that the number of ‘clusters’
Schematic representation of abstraction continuum. At the lowest level of abstraction, the items in the training set that can be used for classifying an unseen token are the embeddings of concrete tokens in that set. At intermediate levels, the number of items that can be used to classify an unseen token is gradually reduced, as an increasing number of token embeddings are merged into an averaged embedding. When complete abstraction is reached, all items of the same category are merged into a single, averaged ‘sense embedding’.
At the highest level of abstraction, then, the classification task of the unseen test tokens is attempted by means of a clustered or averaged representation of all embeddings assigned to that category. This averaged embedding can hence be thought of as a ‘contextualized sense embedding’. Because I consider the 16 sense categories of the principled polysemy model (described in Section 3.1) to represent the highest level of abstraction, the number of clustered representations by means of which classification is attempted at this level is 16 (
The series of classification tasks (from
Schematic representation of VAM procedure. Starting from no clustered items, the VAM uses different configurations of clustered or averaged embeddings (which represent different levels of abstraction) in a training set (80% of the data) of the data to classify an unseen test token set (20% of the data).
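The two extremes of this abstraction continuum can be sketched as follows; toy two-dimensional ‘embeddings’ stand in for the 768-dimensional BERT vectors, and the data are randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labelled token embeddings: 3 sense categories, 10 tokens each,
# centred at (0,0), (5,5) and (10,10).
train_X = rng.normal(size=(30, 2)) + np.repeat(np.arange(3)[:, None] * 5,
                                               10, axis=0)
train_y = np.repeat(np.arange(3), 10)
test_x = np.array([5.2, 4.8])  # unseen token near sense 1

def nearest(items, labels, x):
    """Return the label of the item closest to x."""
    dists = np.linalg.norm(items - x, axis=1)
    return labels[int(np.argmin(dists))]

# Lowest abstraction: classify against every individual token embedding.
pred_low = nearest(train_X, train_y, test_x)

# Highest abstraction: one averaged 'sense embedding' per category.
centroids = np.stack([train_X[train_y == c].mean(axis=0) for c in range(3)])
pred_high = nearest(centroids, np.arange(3), test_x)

print(pred_low, pred_high)
```

The intermediate VAM configurations lie between these two extremes, with only some of the token embeddings merged into averaged representations.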
Besides assessing to what extent BERT embeddings can be used to distinguish the sense categories proposed in the principled polysemy model, I will also discuss the global structure of the network proposed by Tyler & Evans (
In what follows, I will describe the categories distinguished in Tyler & Evans, illustrating them with examples from the data set. For an in-depth description of the sense categories and a full argumentation as to why these (and only these) categories have been distinguished, I refer to Tyler & Evans (
The first category contains all examples in which
(15) | I noticed a painting hanging |
(16) | He was bleeding from a cut |
(17) | And, so that’s how I got into the apartment |
Besides the protoscene, Tyler & Evans also distinguish a number of derived senses. Four of these can be conceived of as a ‘cluster’ of senses where
(18) | I’d grown up only a few hours away, |
(19) | God, let this be the peak. Let us be |
Note that the verb itself does not trigger the trajectory reading (when it does, the example will be assigned to the protoscene).
While the examples in (18) and (19) both function as prepositions, a large group of tokens in this category function as an adprep. In some cases, such as (20), the verb is combined with an adverbial phrase that indicates the endpoint of the trajectory. Thus, it could be argued that the combination of the verb and the endpoint adverbial already implies movement along a trajectory. The addition of
(20) | After he left us, he drove |
(21) | Smiling, she hurries |
Examples such as (22), where a look is thrown at an explicit endpoint, were also assigned to category 2A:
(22) | He looked |
Finally, a number of examples assigned to this category do not refer to a spatial relation. If we consider examples such as (23), where the LM represents an obstacle or hurdle, one can metaphorically extend the use of
(23) | … that old and painful relationship. But Mike had seemed okay with it, as if he was completely |
(24) | I had a thing with her a bunch of years ago, and I guess I never got |
(25) | Memphis had gotten |
(26) | Your article is |
With such an example, it is difficult to say whether the interpretation of excess is entirely ‘context-free’ and not evoked or supported by the lexeme
(27) | … seldom-called violations in tennis – the foot fault. It occurs when a player’s foot brushes or goes |
(28) | He’s illegally parked. His ass is |
Also included in this category are cases such as (29), which portray a situation where the ‘missed target’ is a person in line for a reward (usually in the form of a job offer or promotion, as in (30)). The implicature here is that the reward was expected or deserved, but those expectations were not fulfilled.
(29) | … his monumental 1957 paper on the origins of elements, for which – to his annoyance – he was passed |
(30) | … a lot of times he’ll pass |
(31) | My school days are finally officially |
(32) | All the decisions had been made, the story was |
This category solely contains examples where
(33) | The woodsman reached in his pocket, pulled out the thirty euros, and handed the two bills |
The only difference between these examples and examples of 2A is that some sort of transfer has taken place. However, one could argue that such a transaction is encoded by the verb used in the same construction. Consider, for instance, the example in (34):
(34) | The clerk lifted the bill from Peterson’s hand and took it |
Here, an object is indeed moved by a clerk to another clerk, but there is no explicit indication that the object was given to the second clerk. While different in lexical material, the example in (34) is structurally identical to Tyler & Evans’ examples of transfer (e.g.
Finally, Non-physical transfers, as illustrated in (35) are also considered as instances of 2D:
(35) | … the London office had grown considerably in the last eight years. Boyd wouldn’t half mind taking |
As non-physical transfers almost exclusively involve a transfer of control or authority, these examples are, at times, difficult to distinguish from examples of 5B (see below).
(36) | The war on witchcraft intensified |
(37) | Geologists and biologists before Darwin noted that the Earth and its inhabitants change |
Note that the question may be raised whether these examples do in fact constitute a distinct sense of
Like Lakoff, Tyler & Evans also distinguish a category of examples such as (38), in which there is “an understood viewpoint from which the TR is blocking accessibility of vision to at least some part of the LM” (
(38) | A ratty leather jacket gaped open to reveal a white button-front shirt |
The ‘covering’ sense also includes examples in which there is a “multiplex trajector” (
(39) | He searches through the papers scattered |
(40) | I can get into the crawlspace from my closet and climb all |
As in sense 3, the TR is no longer necessarily positioned above the LM from the perspective of the viewer. Instead,
(41) | Chad got out and walked around the truck, looking it |
(42) | After studying my folder and going |
(43) | That debate, |
(44) | Ben was pictured displaying great emotion by crying |
In discussing sense 4A, Tyler & Evans also mention examples such as (45):
(45) | John Stewart presides |
Note that, because the verb
(46) | Francis watched |
Examples such as (45) and (46) were classified as instances of sense 4A, but it must be noted here that the distinction between these examples and examples of sense 5B, which are discussed below, is difficult to maintain.
Four further senses fall under “the
(47) | Jerry and I were parents to |
Tyler & Evans distinguish one further sense, sense 5A.1, where the TR is understood as something that is contained by, but exceeds the capacity of, the LM. The only clear example discussed is
It is, in many cases, also extremely difficult to distinguish cases of ‘excess’ as crossing a target point, or excess as exceeding an amount or capacity. Consider, for instance, the example in (48):
(48) | But for kids |
While suitable for the ‘up’ conceptualization, it is not inconceivable that examples such as (48) could also be classified under 2B (as time is a linear concept rather than a container). Tyler & Evans (
(49) | She was moved by her power |
(50) | But Nolan has final say |
As noted earlier, there is also a sense of ‘control’ in examples classified as 2D, where control is transferred from one party to another, and examples classified under 5B.
(51) | We haven’t switched to a local pediatrician, believing irrationally in Manhattan doctors |
(52) | His name was Miguel Santiago, and he insisted on being called Miguel |
Two more senses distinguished by Tyler & Evans are the reflexive use of
(53) | Elaine pulls her leg back and kicks the grill. The coals fly up and out, the grill tips |
(54) | Sometimes even horrible memories play |
(55) | I hope someday soon we can begin again … start |
Examples such as (53) are categorized as Sense 6, whereas (54) and (55) are classified as Sense 6A. All instances of reflexive and repetitive
Besides the senses discerned by Tyler & Evans, a few further categories were distinguished.
(56) | She leaned |
Finally, 49 examples were not classifiable under the categories set out above. These examples seem to fall within two categories: ‘means of communication’ and ‘hangover’.
(57) | You could break the news |
(58) | In Napster’s case the transfers took place |
(59) | … that he had gotten drunk the night before and that he was still horribly hung |
As a first point of enquiry, it is investigated whether BERT indeed recognises the sense categories proposed in the principled polysemy model in a relatively distinct and coherent manner. Following Kilgarriff (
To visualize the local embedding clusters and the global positioning of those clusters relative to one another, a two-dimensional representation of the token embeddings was created based on the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm for dimensionality reduction of high-dimensional data (
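This dimensionality-reduction step can be sketched with scikit-learn; random vectors stand in for the actual BERT embeddings, and the perplexity value is a free parameter whose setting here is illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
# Toy stand-in for 768-dimensional BERT token embeddings of 'over'.
embeddings = rng.normal(size=(200, 768))

# Reduce the high-dimensional embeddings to two dimensions for plotting.
coords = TSNE(n_components=2, perplexity=30,
              random_state=1).fit_transform(embeddings)
print(coords.shape)  # (200, 2)
```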
t-SNE embeddings of
From eyeballing
To address this question in a way that goes beyond eyeballing a visualization, this study uses a series of classification tasks, which help quantify the extent to which there is correspondence between the sense categories emerging from the two models (i.e. the VAM).
VAM output over all data.
At the lowest level of abstraction, the model’s classification accuracy is quite high at 0.95, remaining relatively stable at this level until approximately 200 tokens have been clustered. Subsequently, its accuracy gradually drops to 0.8 (at approx. 300 tokens), after which it drops below 0.7 at the highest levels of abstraction. These findings imply that the BERT embeddings do encode the similarities between members of the categories proposed in the principled polysemy model, but only up to a certain point. Beyond that point, the proposed abstractions no longer optimally fit the output of the distributional model.
When we assess the model’s classification accuracy per sense category (
VAM output per sense category.
There are, however, a number of notable cases where the model’s classification accuracy drops sharply when more items of the same category are merged. This pertains to Sense 3, where the model does not recognize similarities between, for instance,
Having established that there is some correspondence between clustered BERT embeddings and the proposed sense categories (up to a certain point), we can now turn to the question whether the geometrical distances between the various senses of
Polysemy Network of
The dark, full nodes in the suggested network representation in
To compare the principled polysemy network to the output of the distributional model by means of cosine distances, two approaches can be taken. First, we may approach the comparison by taking the senses proposed by Tyler & Evans (
However, as explained in Section 4.1, the proposed sense categories do not always correspond with the way in which the tokens of a manually labelled category cluster, with categories such as 2A falling apart into multiple rather distinct groupings. A second approach, then, would be to adhere less strictly to the sense categories proposed by Tyler & Evans, and determine the geometrical distance between token clusters proposed by the distributional model. In what follows, I restrict myself to this second approach.
Cluster tree (distance = cosine) with representative examples.
As was briefly pointed out in Section 4.1, seven of the conceptual categories are ‘consistent’, with all tokens clustering together. This includes 2E (‘time span’), 2C (‘completion’), 4 (‘examining’), 5B (‘control’), 6 (‘reflexive’), 7 (‘communication channel’) and 8 (‘hangover’). For the remaining nine categories, the model suggests that there are at least two different clusters. In some cases, the separate clusters are still part of the same higher-order branch (e.g. 1 ‘above’, 4A ‘focus-of-attention’, 6A ‘repetition’), whereas for others, the embeddings are less closely related (e.g. 2A ‘other-side’, 2D ‘transfer’, 5A ‘excess’). All in all, the suggested distances between the sense groupings differ substantially from the proposal put forward by Tyler & Evans: not only does the distributional model suggest fairly large distances between token groupings that Tyler & Evans would have assigned to the same category (based on their shared underlying image schema), the proposed relative distances in the sense network are also not reflected in the geometrical distances between the (groupings of) embeddings.
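A cluster tree of this kind can be produced with SciPy’s hierarchical clustering over cosine distances; the random vectors below stand in for the actual averaged sense embeddings, and the choice of average linkage is an illustrative assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
# Toy averaged 'sense embeddings' (one row per sense category),
# standing in for the real 768-dimensional BERT-based vectors.
sense_embeddings = rng.normal(size=(16, 768))

# Build an average-linkage cluster tree over pairwise cosine distances,
# mirroring the distance measure used for the dendrogram in the text.
tree = linkage(pdist(sense_embeddings, metric="cosine"), method="average")

# Cutting the tree into (at most) two clusters gives a coarse grouping.
labels = fcluster(tree, t=2, criterion="maxclust")
print(tree.shape, labels)
```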
Given that BERT does not group tokens of
If we approach the complex internal semantic structure of
Within Cognitive Linguistics, there has been no shortage of proposals for modelling polysemy networks, in which the syntactic configurations, collocations, and the notion of underlying image schemas are of central concern. To minimize the arguably subjective nature of further proposals, linguists are increasingly turning to the use of distributional, statistical methods, and, most recently, to deep contextualized neural language models. In the present study, I investigated to what extent the output of a fully unsupervised application of BERT (a ‘meaning as context’ model) corresponds with the sense network of
What emerges from the preceding analyses is that the extent to which the models appear to converge varies depending on whether we rely on BERT embeddings to detect local (concrete, token-based) or global (schematic) similarities between examples. At the local level, BERT’s focus on collocational and syntactic patterns helps it in ‘recognizing’ similarities between tokens of the same category, resulting in relatively coherent local clusters. When the consistency of these local clusters does deviate from what is proposed by the principled polysemy model, it is furthermore reassuring that we can come up with a reasonable explanation for the divergence. For instance, when the embeddings of the tokens of the same principled polysemy category are split in separate clusters, the split can be motivated semantically: examples of Sense 2A (‘other-side’) that are used in a literal, physical sense (e.g.
Yet, the fact that BERT is apt at recognizing metaphorical uses of
The observation that BERT does not immediately capture similarities in terms of image-schema resemblances can be understood in light of the fact that the model has been trained on linguistic data alone, and has no experience with (physical) non-linguistic, perceptual information (such as spatial configurations, but also, for example, visual properties such as colours:
Note that, if the perceptual, imagistic information that motivates the abstractions and sense connections made in the cognitive-conceptual model is still somehow encoded in contextual information (as appears to be suggested by
As a final concluding remark, I wish to add that the findings presented in this study have important implications for the integration of neural language models – and perhaps, more generally, the application of Semantic Vector Space Models – in theoretical linguistic research, and in particular, to research on semantic change. In a recent publication, Boleda (
All data and code can be found at
I wish to thank Charlotte Maekelberghe for acting as the second annotator. I furthermore thank all anonymous reviewers as well as Folgert Karsdorp and Stefano De Pascale for their insightful feedback on earlier versions of this manuscript.
The author has no competing interests to declare.