In this paper, we illustrate a novel method to translate a derivational explanation of Universal 20 into vectorial representations. We exploit this vectorial representation to answer a number of theoretical questions. First, we use linear regression to automatically rank the costs of different syntactic movements within this proposal and investigate some proposals on partial and complete movement. This investigation of movement suggests that the nature of the movement is important, while the importance of harmonic specification of functional categories, i.e. whether the movement is partial or complete, is more context-dependent. We then evaluate whether the base order
Language universals, whether formal or statistical, absolute or implicational, are linguistic properties exhibited by all languages. They are one of the main topics in the study of language, and their existence, general nature and distribution are being investigated from formal and cognitive points of view (
We will concentrate on the quantitative properties of language universals (
Specifically, we set out to address the following three questions:
Can the ranking of Cinque’s different kinds of movements be obtained automatically?
Is movement always more costly than lack of movement?
Is the base structure proposed by Cinque the best predictor of the typological frequency facts?
Data-driven computational models can help cast light on linguistic issues in two main ways. First, through their formal nature, they can make the linguistic assumptions in the proposals explicit and operational. Second, computational models can be used to develop and test correlations between different aspects of the data on a large scale. Methodologically, computational models and machine learning techniques provide robust tools to test the predictive power of the proposed generalisations.
This paper uses a computational modelling methodology previously developed in Cysouw (
One of the most easily observable distinguishing features of human languages is the order of words: the position of the verb in the sentence or the respective order of the modifiers of a noun, for example. Word orders vary greatly cross-linguistically, but each language has very strong preferences for a few orders, and, across languages, not all orders are equally preferred (
When any or all of the items (demonstrative, numeral and descriptive adjective) precede the noun, they are always found in this order. If they follow, the order is exactly the same or its exact opposite.
A more explicit formulation is found in Cinque ( 

(a)  In prenominal position, the order of demonstrative, numeral, and adjective is Dem>Num>A. 
(b)  In postnominal position, the order is either Dem>Num>A or A>Num>Dem. 
Currently, we have access to larger samples of languages than Greenberg did. (See, for example, Dryer’s and Cinque’s large data collections in the cited work). These larger samples have confirmed that two of the three orders indicated by Greenberg as the only possible orders are indeed among the most frequent ones. Larger samples have also shown that many more orders are possible than stated in Greenberg’s universal, but with different frequencies (
Table
Attested word orders of Universal 20 and their estimated frequencies. (See text for more explanation).
Dryer’s Languages  Dryer’s Genera  Cinque’s 05 Languages  Cinque’s 13 Languages  

74  44  V. many  300  
3  2  0  0  
0  0  0  0  
0  0  0  0  
0  0  0  0  
0  0  0  0  
22  17  Many  114  
11  6  V. few (7)  35  
0  0  0  0  
4  3  V. few (8)  40  
0  0  0  0  
0  0  0  0  
28  22  Many  125  
3  3  V. few (4)  37  
5  3  0  0  
38  21  Few (2)  180  
4  2  V. few (3)  14  
2  1  V. few  15  
4  3  Few (8)  48  
6  4  V. few (3)  24  
1  1  0  0  
9  7  Few (7)  35  
19  11  Few (8)  69  
108  57  V. many (27)  411 
Many proposals have been put forth to identify the factors that could give rise to the distributions of different word orders of the noun phrase across languages of the world. These proposals range from general principles of symmetry and harmony (
The work on these proposals for Universal 20 was triggered by the generative, derivational account proposed in Cinque’s (
In this paper, the largescale quantitative typological observations and the underlying generative process that is specifically proposed in Cinque (
The proposal in Cinque (
A question of significant interest for syntacticians in the generative framework is which syntactic operations are possible and which are not, and, among those that are possible, which ones cost more than others. Cinque (
Having established the relative costs of movement to derive word orders in the DP, we turn to the category of some of the elements in the DP. Specifically, another question that has received considerable attention in theoretical linguistics in recent years is the syntactic category of cardinal numerals occurring in noun phrases, such as those shown in (1).
(1)  a.  Three apples 
b.  Many apples  
c.  Garden apples  
d.  Big apples 
Since numerals can appear in various syntactic contexts, it is not clear what syntactic category fits them best. It has been argued that numerals are quantifiers, adjectives, nouns, or a combination of categories. Cinque’s (
Experiments 3 and 4 of the current study compare two of these proposals from a typological perspective, namely merging numerals high versus merging them low, as well as a number of intermediate views where they merge high in some languages and low in others. The results suggest that, assuming numerals constitute a uniform syntactic category cross-linguistically, treating them as structurally high allows for a better prediction of the typological frequencies than treating them as adjectives, and hence structurally lower.
The rest of the paper will develop as follows. Section 3 presents the computational and experimental method. The following three sections each present an experimental question: section 4 presents Experiment 1, which tests the predictive power of Cinque’s (
The method used in this paper requires transforming the linguistic proposals concerning Universal 20 into a vectorial representation, as described below. This process transforms a derivational account into a non-derivational, fixed-length vector of the important properties of the account. This vectorial representation is very abstract and is compatible with many statistical methods. It is then possible to automatically find the relative weight of each element in the vector, a process of parameter fitting that best describes and predicts the typological frequencies. This method of transforming a derivational linguistic theory into vectors and then applying machine learning techniques was first proposed and methodologically justified in Merlo (
Formalise the properties and operations of a model of word order as simple primitive features with a set of associated values;
Encode each word order as a vector of instantiated primitives defined by the model;
Learn the parameters of the model through a learning algorithm on (a subset of) the data;
Run the model to test generalisation ability.
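The four steps above can be sketched as follows. This is a minimal illustration in Python with our own naming, not the paper's actual implementation; only three of the twenty-four orders are shown, with the frequency counts reported in the tables below.

```python
import numpy as np

# Steps 1-2: each word order becomes a vector of binary movement
# features (the six parameters developed later in this section).
FEATURES = ["no_pied_piping", "xp_np_move", "np_xp_move",
            "partial_move", "split_move", "npless_move"]

# Three of the 24 orders: the base order Dem-Num-A-N needs no
# movement, its mirror N-A-Num-Dem needs only [NP[XP]] roll-up
# movement, and Dem-Num-N-A needs partial [NP[XP]] movement.
orders = {
    "Dem Num A N": ([0, 0, 0, 0, 0, 0], 300),
    "N A Num Dem": ([0, 0, 1, 0, 0, 0], 411),
    "Dem Num N A": ([0, 0, 1, 1, 0, 0], 114),
}

# Step 3: learn the feature weights by least squares (with intercept).
X = np.array([v for v, _ in orders.values()], dtype=float)
X = np.column_stack([X, np.ones(len(X))])
y = np.array([f for _, f in orders.values()], dtype=float)
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 4: run the model to predict frequencies from feature vectors.
predicted = X @ weights
```

With only three training orders the fit is exact; the point of the real experiments is how well the fitted weights generalise to held-out orders.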
In the rest of the section, we briefly illustrate the feature-based formalisation of the linguistic proposals and describe the experimental method.
In Cinque (
The order of merger in (2) is assumed, where only the overt NP or phrases containing the overt NP can move. The allowed syntactic operations (movements) are given in (3), omitting intermediate projections for simplicity.
(2)  [_{WP} Dem [_{XP} Num [_{YP} Adj [_{NP} N ]]]] 
(3)  a.  NP movement without pied-piping: 
[_{WP} [_{NP} N]_{1} Dem [_{YP} Adj t_{1} ]]]  
b.  Movement of a constituent containing the NP with pied-piping of the 

[_{WP} [_{YP} Adj [_{NP} N] ]_{1} Dem t_{1} ]]  
c.  NP movement with pied-piping of the 

[_{WP} [_{YP} [_{NP} N ]_{1} Adj t_{1} ]_{2} Dem t_{2}]  
d.  Partial movement: 

[_{WP} Dem [_{XP} Num [_{YP} [_{NP} N ]_{1} Adj t_{1} ]]]  
e.  Splitting the NP out of a moved element to move it to a higher position:  
[_{WP} [_{NP} N ]_{3} Dem [_{XP} [_{YP} ( 
The 24 possible permutations of demonstrative, numeral, adjective and noun are derived using these movements. Some orders require no movements, others require derivations with different numbers of movement steps. These derivations are summarised in Tables
Movements necessary for each word order in Cinque’s proposal (continued in next table). The table shows two lines for each word order. The first line describes the movement operations of the derivation step by step; the second line gives the name of each movement type according to our formal encoding.
Word Order  Step 1  Step 2  Step 3  Step 4 

a. 
No movements  
b. 
NP above 
No more mov’ts  
[NP[XP]]Move  Partial mov’t  
c. 
NP above 
No more mov’ts  
NoPiedPiping  Partial mov’t  
d. 
NP above 

NoPiedPiping  
e. 
AP above 
NPless 

[XP[NP]]Move  NPlessMove  
f. 
NP above 
AP above 
NPless 

[NP[XP]]Move  [NP[XP]]Move  NPlessMove  
g. 
NP above 
AP above 
NP splits, above 
NPless 
[NP[XP]]Move  [NP[XP]]Move  Split  NPlessMove  
h. 
NP above 
AP above 
NPless 
NP splits, above 
[NP[XP]]Move  [NP[XP]]Move  NPlessMove  Split  
i. 
NP above 
NPlessAP moves  
[NP[XP]]Move  NPlessMove  
j. 
NP above 
NPlessAP moves  
NoPiedPiping  NPlessMove  
k. 
AP above 
AP above 

{ [XP[NP]]Move, NoPiedPiping }  
l. 
NP above 
AP above 

[NP[XP]]Move  NoPiedPiping 
Movements necessary for each word order in Cinque’s proposal (continued from Table
Word Order  Step 1  Step 2  Step 3  Step 4 

m. 
NP above 
NPless AP above 
No more mov’t  
[NP[XP]]Move  NPlessMove  Partial mov’t  
n. 
AP above 
No more mov’t  
[XP[NP]]Move  Partial mov’t  
o. 
NP above Adj  AP above 
No more mov’t  
[NP[XP]]Move  [NP[XP]]Move  Partial mov’t  
p. 
NP above 
AP above 
NP splits, above 

[NP[XP]]Move  [NP[XP]]Move  Split mov’t  
q. 
NP above 
NPless 

NoPiedPiping  NPlessMove  
r. 
NumP above 

[XP[NP]]Move  
s. 
NP above 
NumP above 

[NP[XP]]Move  [XP[NP]]Move  
t. 
NP above 
NumP above 

NoPiedPiping  [NP[XP]]Move  
u. 
NP above 
AP above 
NP splits to stay  NPlessNumP above 
[NP[XP]]Move  [NP[XP]]Move  Split  NPlessMove  
v. 
NP above 
AP above 
NP splits above 
NPless 
[NP[XP]]Move  [NP[XP]]Move  { Split, NPlessMove }  
w. 
AP above 
NumP above 

[XP[NP]]Move  [NP[XP]]Move  
x. 
NP above 
AP above 
NumP above 

[NP[XP]]Move  [NP[XP]]Move  [NP[XP]]Move 
Since each language in the sample has one dominant word order, and that word order is assumed to be the result of a derivation produced by some of the syntactic operations in (3), we treat these syntactic operations as binary parameters that a language either has or does not have.
In addition to the allowed movements, we define one parameter to describe movements like (4), which are argued to be impossible by Cinque (
(4)  NPlessMove: 
[_{WP} [_{YP} Adj t_{1} ]_{2} Dem [_{XP} [_{NP} N ]_{1} Num t_{2} ]] 
(5)  a.  Uses NP movement without pied-piping 
b.  Uses NP movement with pied-piping of the [XP[NP]] type  
c.  Uses NP movement with pied-piping of the [NP[XP]] type  
d.  Involves partial movement  
e.  Uses NPsplitting movement  
f.  Requires movement of a phrase not containing the NP 
Since these parameters are binary, we can now encode the different word orders as vectors of values, by assigning either 1 or 0 to each parameter. Importantly, since the frequency of a word order is correlated not with the number of movements necessary to reach it, but with the type of movement necessary to derive it, we do not count the different occurrences of a given movement. Rather, a parameter receives the value 1 if its corresponding movement is needed in the derivation, regardless of how many times it applies, and 0 otherwise.
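As a minimal sketch of this rule (the helper and feature names are ours, not the paper's), a derivation is reduced to the set of movement types it uses:

```python
# The six binary parameters of (5), in our own abbreviated naming.
PARAMETERS = ("no_pied_piping", "xp_np_move", "np_xp_move",
              "partial_move", "split_move", "npless_move")

def encode(derivation):
    """Map a derivation (the sequence of movement types used at each
    step, possibly with repetitions) to a 0/1 vector: a parameter is 1
    if its movement occurs at least once, regardless of how often."""
    return [1 if p in derivation else 0 for p in PARAMETERS]

# The mirror order N A Num Dem is derived by three successive
# [NP[XP]] movements, but the encoding records the type only once:
mirror = encode(["np_xp_move", "np_xp_move", "np_xp_move"])
```

Counting types rather than tokens is what keeps the most frequent order, derived by repeated roll-up movement, cheap in this encoding.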
To illustrate the encoding of three different word orders, step by step, consider the English word order
The order
Feature  Value  

a.  Uses NP movement without pied-piping  0 
b.  Uses NP movement with pied-piping of the [XP[NP]] type  0 
c.  Uses NP movement with pied-piping of the [NP[XP]] type  0 
d.  Involves partial movement  0 
e.  Uses NPsplitting movement  0 
f.  Requires movement of a phrase not containing the NP  0 
The mirror order of English
(6)  a.  [_{WP} Dem [_{XP} Num [_{YP} Adj [_{NP} N ]]]] 
b.  [_{WP} Dem [_{XP} Num [_{YP} [_{NP} N ]_{1} Adj t_{1}]]]  
c.  [_{WP} Dem [_{XP} [_{YP} [_{NP} N]_{1} Adj t_{1}]_{2} Num t_{2} ]]  
d.  [_{WP} [_{XP} [_{YP} [_{NP} N]_{1} Adj t_{1}]_{2} Num t_{2} ]_{3} Dem t_{3}] 
Since all of these movements are movements of the NP with pied-piping (i.e. movements of the [NP[XP]] type), only parameter (5c) is set to 1, and all the others are 0. This gives us Table
The encoding of
Feature  Value  

a.  Uses NP movement without pied-piping  0 
b.  Uses NP movement with pied-piping of the [XP[NP]] type  0 
c.  Uses NP movement with pied-piping of the [NP[XP]] type  1 
d.  Involves partial movement  0 
e.  Uses NPsplitting movement  0 
f.  Requires movement of a phrase not containing the NP  0 
Finally, to illustrate a less straightforward case, consider the word order
The order
Feature  Value  

a.  Uses NP movement without pied-piping  1 
b.  Uses NP movement with pied-piping of the [XP[NP]] type  0 
c.  Uses NP movement with pied-piping of the [NP[XP]] type  0 
d.  Involves partial movement  1 
e.  Uses NPsplitting movement  0 
f.  Requires movement of a phrase not containing the NP  0 
The orders and their encodings according to
NP moves w/o pp  [XP[NP]] moves  [NP[XP]] moves  Partial move  Split move  NPless move  Freq.  

a. 
0  0  0  0  0  0  300 
b. 
0  0  1  1  0  0  114 
c. 
1  0  0  1  0  0  37 
d. 
1  0  0  0  0  0  48 
e. 
0  1  0  0  0  1  0 
f. 
0  0  1  0  0  1  0 
g. 
0  0  1  0  1  1  0 
h. 
0  0  1  0  1  1  0 
i. 
0  0  1  0  0  1  0 
j. 
1  0  0  0  0  1  0 
k. 
1  1  0  0  0  0  14 
l. 
1  0  1  0  0  0  69 
m. 
0  0  1  1  0  1  0 
n. 
0  1  0  1  0  0  35 
o. 
0  0  1  1  0  0  125 
p. 
0  0  1  0  1  0  24 
q. 
1  0  0  0  0  1  0 
r. 
0  1  0  0  0  0  40 
s. 
0  1  1  0  0  0  180 
t. 
1  0  1  0  0  0  35 
u. 
0  0  1  0  1  1  0 
v. 
0  0  1  0  1  1  0 
w. 
0  1  1  0  0  0  23 
x. 
0  0  1  0  0  0  411 
Table
In this first experiment, we find the weights of the movement operations encoded vectorially, and compare them to Cinque’s proposal.
The primary data used in all four experiments is Cinque’s (
Zipfian distribution of word orders.
The different movement operations corresponding to each word order are shown in (5) and were discussed in the previous section. The features and the possible values of Cinque’s model are shown in Table
Using the encoded data in Table
Each syntactic operation is encoded as an indicator variable: a variable that takes the two values 0 and 1 and indicates whether the property is present. These are nominal variables, while the dependent variable is numeric. In this setting, our multivariable linear regression gives us the positive and negative coefficients that express the difference from the predicted frequency value of the control group, the base order
To avoid excessive dependence of the results on a specific partition of the data, we use cross-validation. Cross-validation is a training and testing protocol in which the data is randomly partitioned into
We used leave-one-out cross-validation, automatically eliminating collinear attributes, to generate a linear regression model of the data.
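A leave-one-out protocol of this kind can be sketched as follows. This is our own illustration on a toy one-feature dataset; the paper's model is fitted on the 24 encoded orders.

```python
import numpy as np

def leave_one_out_predictions(X, y):
    """For each row i, fit ordinary least squares (with intercept) on
    all the other rows and predict the held-out frequency."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xi = np.column_stack([X[keep], np.ones(n - 1)])
        w, *_ = np.linalg.lstsq(Xi, y[keep], rcond=None)
        preds[i] = np.append(X[i], 1.0) @ w
    return preds

# Toy data: one binary movement feature; orders with the feature
# are much rarer than orders without it.
X = np.array([[0.], [0.], [1.], [1.], [1.], [0.]])
y = np.array([300., 280., 40., 60., 50., 310.])
held_out = leave_one_out_predictions(X, y)
```

Each prediction is made by a model that never saw the held-out order, which is what lets the reported correlation coefficient measure generalisation rather than mere fit.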
The linear regression model that was generated was the function in (7), whose goodness of fit is indicated by the correlation coefficient of 0.52.
(7)  Frequency =  –129.0 × Uses NP movement without pied-piping 
–115.6 × Uses NP movement, pied-piping [XP[NP]]  
–37.8 × Uses NP movement, pied-piping [NP[XP]]  
–65.6 × Partial Move  
–91.6 × Uses NPsplitting movement  
–135.9 × Requires moving a phrase not containing NP  
+242.8 
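To see how (7) assigns a predicted frequency, the formula can be re-implemented directly (the feature names are our abbreviations). The base order, with all parameters at 0, receives the intercept, 242.8, while the mirror order, derived by [NP[XP]] movement alone, receives 242.8 - 37.8 = 205.0.

```python
# Coefficients of the fitted model in (7).
WEIGHTS = {
    "no_pied_piping": -129.0,
    "xp_np_move":     -115.6,
    "np_xp_move":      -37.8,
    "partial_move":    -65.6,
    "split_move":      -91.6,
    "npless_move":    -135.9,
}
INTERCEPT = 242.8

def predict(features):
    """Predicted frequency for a dict of parameter name -> 0/1."""
    return INTERCEPT + sum(WEIGHTS[k] * v for k, v in features.items())

base = predict({k: 0 for k in WEIGHTS})                         # order a.
mirror = predict({**{k: 0 for k in WEIGHTS}, "np_xp_move": 1})  # order x.
```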
The linear model in (7) can be read as a ranking of the different syntactic operations in terms of markedness. The weights are summarised in (8). Specifically, we note that the parameters (5a), (5b), (5e) and (5f) receive large negative weights, and the corresponding movements are therefore very costly.
(8)  Weights of the different syntactic operations  
a.  NP movement without pied-piping  –129  
b.  NP movement with pied-piping of the [XP[NP]] type  –115  
c.  NP movement with pied-piping of the [NP[XP]] type  –37  
d.  Partial movement  –65  
e.  NPsplitting movement  –91  
f.  Movement of a phrase not containing the NP  –135 
If we interpret the weights as costs, so that high negative weights are high costs, we can rank the different movements in a partial order, as in (9) (where the symbol “<” means “less costly” and we use abbreviated symbols):
(9)  [NP[XP]] < Partial < Split < [XP[NP]] < NPw/oPiedP < Movew/oNP. 
A system of weights can also be inferred from Cinque’s proposal (2005: 321), based on the markedness levels assigned to the movement types, as in (10):
(10)  {NoMove, [NP[XP]], Total} < Partial < NPw/oPiedP < [XP[NP]] < {Split, Movew/oNP}. 
Considering only the types of movements that are encoded in both accounts and simplifying labelling for readability, we have the two following partial orders:
(11)  a.  Cinque:  [NP[XP]] = Partial < NPw/oPiedP < [XP[NP]] < Split = Movew/oNP. 
b.  Us:  [NP[XP]] < Partial < Split < [XP[NP]] < NPw/oPiedP < Movew/oNP. 
We can see that the two orders are well correlated: operations tied in Cinque’s ranking are ranked but adjacent in ours, and in only one case ([XP[NP]] and Split) is the rank reversed between the two orders. Kendall’s
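The rank correlation between (11a) and (11b) can be computed with a tie-aware version of Kendall's tau (tau-b). The sketch below is our own implementation; the average ranks assigned to Cinque's tied operations reflect our reading of (11a).

```python
def kendall_tau_b(x, y):
    """Kendall's tau-b between two rank vectors, handling ties."""
    n = len(x)
    conc = disc = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    conc += 1
                else:
                    disc += 1
    n0 = n * (n - 1) / 2
    return (conc - disc) / ((n0 - ties_x) * (n0 - ties_y)) ** 0.5

# Operations: [NP[XP]], Partial, NPw/oPiedP, [XP[NP]], Split, Movew/oNP
cinque = [1.5, 1.5, 3, 4, 5.5, 5.5]  # (11a); ties get average ranks
ours = [1, 2, 5, 4, 3, 6]            # (11b)
tau = kendall_tau_b(cinque, ours)    # clearly positive (about 0.5)
```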
The low cost of partial movement does not confirm Cinque’s assumption that partial movement is penalised. Rather, partial movement appears to be less costly than other operations. Partial movement was considered more marked than complete movement in Cinque’s system in order to derive the fact that the mirror image of the base order, an order derived by complete movement, is very frequent. However, Cinque’s system does not predict that the mirror image of the base order is more frequent than the base order. In Cinque’s model,
In this threefold experiment, we test whether different definitions of partial and complete movement make a difference in predicting word orders. The first question we ask is: how do we define partial movement? Do we have partial movement when something of any category moves, but does not move all the way above
A second question we ask is: if partial movement is costly as Cinque proposes, is lack of movement free, as Cinque (
The model tested in this experiment replaces partial movement with a NoMove parameter, to test if it is really the case that the best explanation of the typological frequency data is based on the assumption that
The same set of data was used for this experiment as for experiment 1. The encoding, however, involved a new parameter, which replaced
(12)  a.  Uses NP movement without pied-piping 
b.  Uses NP movement with pied-piping of the [XP[NP]] type  
c.  Uses NP movement with pied-piping of the [NP[XP]] type  
d.  Involves lack of movement (partial or complete)  
e.  Uses NPsplitting movement  
f.  Requires movement of a phrase not containing the NP 
The materials are the same as those of the previous experiment, with modifications to the feature
(13)  Complete movement of anything is better than partial movement. 
1 if nothing moves above 

0 if nothing moved at all (no movement)  
0 if anything moves, 
In (13), partial movement is defined as involving any category, but receives a different value from no movement or complete movement. We also encode the two new models shown in (14) and (15). Both models consider that the distinction between complete and partial movement is relevant only for
Partial movement encodings.
Equation (13)  Equation (14)  Equation (15)  

a. 
0  0  1 
b. 
1  1  1 
c. 
1  1  1 
d. 
0  0  0 
e. 
0  1  1 
f. 
0  1  1 
g. 
0  0  0 
h. 
0  0  0 
i. 
0  1  1 
j. 
0  1  1 
k. 
0  0  0 
l. 
0  0  0 
m. 
1  1  1 
n. 
1  1  1 
o. 
1  1  1 
p. 
0  0  0 
q. 
0  1  1 
r. 
0  0  0 
s. 
0  0  0 
t. 
0  0  0 
u. 
0  1  1 
v. 
0  0  0 
w. 
0  0  0 
x. 
0  0  0 
Order
Feature  Value  

a.  Uses NP movement without pied-piping  0 
b.  Uses NP movement with pied-piping of the [XP[NP]] type  0 
c.  Uses NP movement with pied-piping of the [NP[XP]] type  0 
d.  Involves lack of movement (partial or complete)  1 
e.  Uses NPsplitting movement  0 
f.  Requires movement of a phrase not containing the NP  0 
(14)  Complete 
1 if 

0 if nothing moves at all (no movement)  
0 if 
(15)  Any lack of 
1 if 

1 if nothing moves at all (no movement)  
0 if 
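On our reading of the three encodings (a paraphrase; in (14) and (15) the moving category is assumed to be the phrase containing the NP), the feature values reduce to three small predicates over the movement status of a derivation. The expected values match the encoding table above for the base order a. (no movement: 0, 0, 1), a partial-movement order such as b. (1, 1, 1), and the complete roll-up order x. (0, 0, 0).

```python
# status is 'none' (nothing moves at all), 'partial' (movement that
# does not land above Dem), or 'complete' (movement above Dem).

def enc13(status):
    """(13): only partial movement, of any category, is penalised;
    no movement and complete movement are both free."""
    return 1 if status == "partial" else 0

def enc14(status):
    """(14): only partial movement of the NP is penalised; the status
    passed in is assumed to describe the NP-containing phrase."""
    return 1 if status == "partial" else 0

def enc15(status):
    """(15): anything short of complete NP movement is penalised,
    including derivations in which nothing moves at all."""
    return 0 if status == "complete" else 1
```

Note that (13) and (14) differ only in which category's movement status is evaluated, which is why their predicates coincide.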
Hall et al.’s (
The orders and their encodings according to Cinque’s (
No pied-piping  [XP[NP]] moves  [NP[XP]] moves  No move  Split move  NPless move  Freq.  

a. 
0  0  0  0  0  0  300 
b. 
0  0  1  1  0  0  114 
c. 
1  0  0  1  0  0  37 
d. 
1  0  0  0  0  0  48 
e. 
0  1  0  1  0  1  0 
f. 
0  0  1  1  0  1  0 
g. 
0  0  1  0  1  1  0 
h. 
0  0  1  0  1  1  0 
i. 
0  0  1  1  0  1  0 
j. 
1  0  0  1  0  1  0 
k. 
1  1  0  0  0  0  14 
l. 
1  0  1  0  0  0  69 
m. 
0  0  1  1  0  1  0 
n. 
0  1  0  1  0  0  35 
o. 
0  0  1  1  0  0  125 
p. 
0  0  1  0  1  0  24 
q. 
1  0  0  1  0  1  0 
r. 
0  1  0  0  0  0  40 
s. 
0  1  1  0  0  0  180 
t. 
1  0  1  0  0  0  35 
u. 
0  0  1  1  1  1  0 
v. 
0  0  1  0  1  1  0 
w. 
0  1  1  0  0  0  23 
x. 
0  0  1  0  0  0  411 
The three models derived by the linear regression are shown in examples (16), (17), and (18).
(16)  Frequency =  –151.9 × NP movement without pied-piping 
–139.3 × NP movement, pied-piping [XP[NP]]  
–42.5 × NP movement, pied-piping [NP[XP]]  
–92.2 × Involves lack of movement  
–106.7 × Uses NPsplitting movement  
–140.2 × Requires moving a phrase not containing NP  
+266.2  
Correlation coefficient 0.59 
(17)  Frequency =  –148.0 × Uses NP movement without pied-piping 
–138.3 × Uses NP movement, pied-piping [XP[NP]]  
–50.8 × Uses NP movement, pied-piping [NP[XP]]  
–90.0 × Involves lack of movement  
–138.5 × Uses NPsplitting movement  
–73.4 × Requires moving a phrase not containing NP  
+270.8  
Correlation coefficient 0.59 
(18)  Frequency =  –167.3 × Uses NP movement without pied-piping 
–157.2 × Uses NP movement, pied-piping [XP[NP]]  
–71.5 × Uses NP movement, pied-piping [NP[XP]]  
–81.9 × Involves lack of movement  
–136.0 × Uses NPsplitting movement  
–88.7 × Requires moving a phrase not containing NP  
+299.6  
Correlation coefficient 0.47 
The first observation about these models is that the
The comparison of the first encoding of complete movement to the second encoding of complete movement, anything moves above
The comparison of the first and second encodings of movement to the third encoding of movement (
The model in (14), where lack of movement is encoded in the same way as complete movement, as proposed in Cinque (
This conclusion is corroborated by the fact that the model in (13) is better than the one in (14), which shows that no movement and complete movement appear to pattern together. This is what we find in the distribution of word orders: the most frequent word order is generated by roll-up movement of different kinds of phrases, and the third and fourth most frequent word orders are generated by partial movement of the
The two experiments described in the last two sections use a novel method to address the question of which syntactic operations are possible and which are not, and, among those that are possible, which ones cost more than others. The results confirm previous theoretical proposals, but also explore new possibilities.
In particular, the ranking of costs among operations proposed by previous studies is mostly confirmed (
The modelling of movement operations in Cinque’s theory is based on some basic assumptions on the structural dominance of the syntactic categories that occur in the DP/NP. Cinque’s (
We move, now, to the question of the status of numerals, and where they merge in the DP. One possibility is that numerals merge higher than adjectives, and below the demonstrative (
This experiment compares two possibilities: that numerals are higher in the structure than all adjectives, and that numerals are themselves adjectives, and are therefore higher than some adjectives and lower than others. The second assumption, treating numerals as adjectives, is tantamount to assuming two base orders, (19a) and (19b), and that every word order can be derived from either of the two base orders in (19). Each word order would then correspond to two vectors: one encoding the movements needed to derive it from (19a), and one encoding the movements needed to derive it from (19b).
(19)  a.  [_{WP} Dem [_{YP2} Num [_{YP1} Adj [_{NP} N ]]]] 
b.  [_{WP} Dem [_{YP2} Adj [_{YP1} Num [_{NP} N ]]]] 
It is crucial at this point to distinguish parametric movements, which occur in order to license agreement or case-marking, from semantically motivated movements, like Quantifier Raising. A parametric movement occurs in a language regardless of the intended interpretation. In contrast, a semantically motivated movement only occurs when a special kind of interpretation (scope, collective/distributive, etc.) is intended, and will not typically appear in the dominant order of any language. While the former will be visible in typological data, the latter will not. For this reason, we limit ourselves from now on to parametric movements.
To test these two options, we need to encode different base orders, and the sequence of movements that would generate this order. The encodings of the derivations from (19b) are shown in Tables
Movements necessary for each word order in Cinque’s proposal (continued in next table), assuming
Word Order  Step 1  Step 2  Step 3 

a. 
NP above 
NPless NumP above 
No more mov’ts 
[NP[XP]]Move  NPless Move  Partial mov’t  
b. 
NumP above 
No more mov’ts  
[XP[NP]]Move  Partial movement  
c. 
NP above 
NumP above 
No more mov’ts 
NoPiedPiping  [NP[XP]]Move  Partial mov’t  
d. 
NP above 
NumP above 
NP splits above 
NoPiedPiping  [NP[XP]]Move  Split Move  
e. 
NP above 
NPless NumP above 

[NP[XP]]Move  NPlessMove  
f. 
NP above 
NPless NumP above 

NoPiedPiping  NPlessMove  
g. 
NumP above 

[XP[NP]]Move  
h. 
NP above 
NumP above 

NoPiedPiping  [NP[XP]]Move  
i. 
NumP above 
NPless 

[XP[NP]]Move  NPlessMove  
j. 
NP above 
NPless 

[NP[XP]]Move  NPlessMove  
k. 

NoPiedPiping  NPless Move  
l. 
NumP above 
NP above 

NoPiedPiping  NPless Move  [NP[XP]]Move 
Movements necessary for each word order in Cinque’s proposal (continued from Table
Word Order  Step 1  Step 2  Step 3  Step 4 

m. 
No mov’t  
n. 
NP above 
No more mov’t  
[NP[XP]]Move  Partial mov’t  
o. 
NP above 
No more mov’t  
NoPiedPiping  Partial mov’t  
p. 
NP above 

NoPiedPiping  
q. 
NumP above 
NPless AP abv 

[NP[XP]]Move  [NP[XP]]Move  {Split, NPless Move}  NPless Move  
r. 
NP above 
AdjP above 
NumP above AP  
[NP[XP]]Move  NPless Move  [XP[NP]]Move  [XP[NP]]Move  
s. 
AdjP above 

[NP[XP]]Move  [NP [XP]]Move  
t. 
NP above 
NumP abv 
AdjP abv 

[NP[XP]]Move  [NP[XP]]Move  [NP[XP]]Move  
u. 
NP above 
NPless AdjP abv 

NoPiedPiping  NPlessMove  
v. 
AdjP above 

[XP[NP]]Move  
w. 
NP above 
AdjP above 

[NP[XP]]Move  [XP[NP]]Move  
x. 
NP above 
AdjP above 

NoPiedPiping  [NP[XP]]Move 
Experiment 1 already details the encoding of the assumption that numerals always merge above adjectives. In order to encode the idea that a numeral can merge either below or above adjectives, we encode each word order as two vectors: one that corresponds to the movements assuming the base order [Dem [Num [Adj [N]]]], i.e. those in Table
To differentiate the two vectors for each word order, we add the binary parameter
(20)  a.  Uses NP movement without pied-piping 
b.  Uses NP movement with pied-piping of the [XP[NP]] type  
c.  Uses NP movement with pied-piping of the [NP[XP]] type  
d.  Involves lack of movement (partial or complete)  
e.  Uses NPsplitting movement  
f.  Requires movement of a phrase not containing the NP  
g.  Involves the numeral merging below the adjectives 
Now, instead of having one vector per word order, with this new encoding, any given
Encoding of first derivation of
Feature  Value  

g.  Involves the numeral merging below the adjectives  0 
a.  Uses NP movement without pied-piping  1 
b.  Uses NP movement with pied-piping of the [XP[NP]] type  0 
c.  Uses NP movement with pied-piping of the [NP[XP]] type  0 
d.  Involves lack of movement (partial or complete)  1 
e.  Uses NPsplitting movement  0 
f.  Requires movement of a phrase not containing the NP  0 
Encoding of second derivation of
Feature  Value  

g.  Involves the numeral merging below the adjectives  1 
a.  Uses NP movement without pied-piping  1 
b.  Uses NP movement with pied-piping of the [XP[NP]] type  0 
c.  Uses NP movement with pied-piping of the [NP[XP]] type  1 
d.  Involves lack of movement (partial or complete)  1 
e.  Uses NPsplitting movement  0 
f.  Requires movement of a phrase not containing the NP  0 
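The doubling of vectors can be sketched as follows (our own helper; the derivations passed in are hypothetical sets of movement types): each word order yields one vector with the new parameter g set to 0 for the derivation from the Num-above-Adj base, and one with g set to 1 for the derivation from the Adj-above-Num base.

```python
# The six movement parameters, prefixed by g (numeral merges low).
PARAMS = ("no_pied_piping", "xp_np_move", "np_xp_move",
          "no_move", "split_move", "npless_move")

def encode_pair(deriv_high, deriv_low):
    """Encode one word order as two vectors, one per base order.
    Each derivation is the set of movement types it uses."""
    return ([0] + [1 if p in deriv_high else 0 for p in PARAMS],
            [1] + [1 if p in deriv_low else 0 for p in PARAMS])

# Hypothetical order: the high-base derivation uses only [NP[XP]]
# movement; the low-base derivation also needs an NP-less movement.
high_vec, low_vec = encode_pair({"np_xp_move"},
                                {"np_xp_move", "npless_move"})
```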
Movements needed for each order with [Dem [Num [Adj [N]]]] as base order.
Num merge below  No pied-piping  [XP[NP]] moves  [NP[XP]] moves  Partial move  Split move  NPless move  

a. 
0  0  0  0  0  0  0 
b. 
0  0  0  1  1  0  0 
c. 
0  1  0  0  1  0  0 
d. 
0  1  0  0  0  0  0 
e. 
0  0  1  0  0  0  1 
f. 
0  0  0  1  0  0  1 
g. 
0  0  0  1  0  1  1 
h. 
0  0  0  1  0  1  1 
i. 
0  0  0  1  0  0  1 
j. 
0  1  0  0  0  0  1 
k. 
0  1  1  0  0  0  0 
l. 
0  1  0  1  0  0  0 
m. 
0  0  0  1  1  0  1 
n. 
0  0  1  0  1  0  0 
o. 
0  0  0  1  1  0  0 
p. 
0  0  0  1  0  1  0 
q. 
0  1  0  0  0  0  1 
r. 
0  0  1  0  0  0  0 
s. 
0  0  1  1  0  0  0 
t. 
0  1  1  0  0  0  0 
u. 
0  0  0  1  0  1  1 
v. 
0  0  0  1  0  1  1 
w. 
0  0  1  1  0  0  0 
x. 
0  0  0  1  0  0  0 
Movements needed for each order with [Dem [Adj [Num [N]]]] as base order.
Num merge below  No pied-piping  [XP[NP]] moves  [NP[XP]] moves  Partial move  Split move  NPless move  

a′. 
1  0  0  1  1  0  1 
b′. 
1  0  1  0  1  0  0 
c′. 
1  1  0  1  1  0  0 
d′. 
1  1  0  1  0  1  0 
e′. 
1  0  0  1  0  0  1 
f′. 
1  1  0  0  0  0  1 
g′. 
1  0  1  0  0  0  0 
h′. 
1  1  0  1  0  0  0 
i′. 
1  0  1  0  0  0  1 
j′. 
1  0  0  1  0  0  1 
k′. 
1  1  0  0  0  0  1 
l′. 
1  1  0  1  0  0  1 
m′. 
1  0  0  0  0  0  0 
n′. 
1  0  0  1  1  0  0 
o′. 
1  1  0  0  1  0  0 
p′. 
1  1  0  0  0  0  0 
q′. 
1  0  0  1  0  1  1 
r′. 
1  0  1  1  0  0  1 
s′. 
1  0  0  1  0  0  0 
t′. 
1  0  0  1  0  0  0 
u′. 
1  1  0  0  0  0  1 
v′. 
1  0  1  0  0  0  0 
w′. 
1  0  1  1  0  0  0 
x′. 
1  1  0  1  0  0  0 
If numerals are adjectives, there are two possible derivations for each word order, as detailed in Tables
We first sample proportions of derivations in which all the observed word orders are generated from the two base orders in the same fixed proportion. We test all proportions in increments (or decrements) of 10%. This gives eleven combinations (derivation1 100%, derivation2 0%; derivation1 90%, derivation2 10%; …; derivation1 0%, derivation2 100%). For example, the
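The eleven fixed-proportion combinations can be enumerated as in this sketch (our own naming; three toy frequencies stand in for the 24 orders). Each order's frequency is split between its two derivations in proportions 100-0, 90-10, ..., 0-100.

```python
def mixed_frequencies(freqs):
    """For each split p in 1.0, 0.9, ..., 0.0, assign p * f of every
    order's frequency f to the derivation from the Num-above-Adj base
    and (1 - p) * f to the derivation from the Adj-above-Num base."""
    proportions = [round(1 - i / 10, 1) for i in range(11)]
    return {p: [(p * f, (1 - p) * f) for f in freqs] for p in proportions}

samples = mixed_frequencies([300, 114, 411])
```

The mixed combinations described next assign different splits to different word orders, but they can reuse the same per-order splitting step.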
We also sample some mixed combinations of proportions where different observed word orders can be generated by the two derivations in different proportions. The eleven samples with fixed proportions of derivations described above range from a 0–100% combination in favour of one base order (high
Twenty-one files containing the data with both derivations with the weights in Table
Sampling of the space of possibilities when treating the merge position of the numeral as a parametric choice.
Assumed Base Order  

Word orders  
Distr 0:  100% of languages  0% of languages  all word orders 
Distr 0–1:  100% of languages  0% of languages  half word orders 
90% of languages  10% of languages  half word orders  
Distr 1:  90% of languages  10% of languages  all word orders 
Distr 1–2:  90% of languages  10% of languages  half word orders 
80% of languages  20% of languages  half word orders  
Distr 2:  80% of languages  20% of languages  all word orders 
…  
Distr 9:  10% of languages  90% of languages  all word orders 
Distr 9–10:  10% of languages  90% of languages  half word orders 
0% of languages  100% of languages  half word orders  
Distr 10:  0% of languages  100% of languages  all word orders 
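The enumeration of the 21 sampled distributions in the table above can be sketched as follows. This is an illustrative reconstruction, not the original experimental code: eleven "full" distributions (Distr 0 to Distr 10), where every word order mixes the two derivations in the same proportion, interleaved with ten "half" distributions (Distr 0–1 to Distr 9–10), where half of the word orders use one proportion and half use the next one.

```python
def sample_distributions(step=10):
    """Enumerate the 21 sampled mixtures of derivation 1 and derivation 2."""
    # (derivation 1 %, derivation 2 %) in 10% increments: (100,0) .. (0,100)
    mixtures = [(100 - p, p) for p in range(0, 101, step)]
    samples = []
    for i, mix in enumerate(mixtures):
        # "Full" sample: all word orders share the same mixture.
        samples.append(("Distr %d" % i, [(mix, "all word orders")]))
        # "Half" sample between two adjacent mixtures.
        if i + 1 < len(mixtures):
            samples.append(("Distr %d-%d" % (i, i + 1),
                            [(mix, "half word orders"),
                             (mixtures[i + 1], "half word orders")]))
    return samples

distributions = sample_distributions()
print(len(distributions))  # 21
```

The count matches the twenty-one files described above: eleven full distributions plus ten intermediate half/half samples.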
As it turns out, treating numerals as high-merging gives rise to better predictions than any of the formulas allowing numerals to merge either higher or lower than adjectives in the dominant base structure of a language. That is to say, the assumption that numerals merge higher than adjectives in the dominant merge order of all languages makes better predictions than the assumption that numerals merge higher than adjectives in the dominant merge order of some languages, and lower than adjectives in the dominant merge order of others. The results are detailed in the following two paragraphs.
The linear regression model that was generated by assuming numerals merge high is the function in (21), with a correlation coefficient of 0.751. Notice that while the features are the same as those illustrated in Table
(21)  Frequency =  –129.0 × Uses NP movement without pied-piping
–115.6 × Uses NP movement, pied-piping [XP[NP]]
–37.8 × Uses NP movement, pied-piping [NP[XP]]
–65.6 × Partial Move
–91.6 × Uses NP-splitting movement
–135.9 × Requires moving a phrase not containing NP
+242.8 
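Equation (21) is a linear scoring function over binary movement features. The sketch below applies the published coefficients to a feature vector; the dictionary keys are mnemonic labels of our own (assumptions), not identifiers used in the paper.

```python
# Coefficients and intercept taken from equation (21); the keys are our
# own mnemonic labels for the six movement features (assumptions).
WEIGHTS = {
    "np_move_without_pied_piping": -129.0,
    "pied_piping_xp_np":           -115.6,
    "pied_piping_np_xp":            -37.8,
    "partial_move":                 -65.6,
    "np_splitting_move":            -91.6,
    "move_without_np":             -135.9,
}
INTERCEPT = 242.8

def predicted_frequency(features):
    """Predicted typological frequency of a word order, where `features`
    maps feature names to 0/1 indicators of whether the derivation of
    that word order uses the corresponding movement type."""
    return INTERCEPT + sum(WEIGHTS[name] * features.get(name, 0)
                           for name in WEIGHTS)

# Since every coefficient is negative, a derivation that uses no movement
# at all receives the intercept, the highest predicted frequency.
print(predicted_frequency({}))  # 242.8
```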
The linear regression models that are generated by the assumption that numerals are adjectives are shown in Table
Linear regression models generated by the assumption that numerals are adjectives, different proportions (Int. = intercept; Corr. = correlation).
No pied piping  [XP[NP]] moves  [NP[XP]] moves  No move  Split move  NPless move  Int.  Corr.  

Distr 0    –120.8  –115.9  –33.5  –78.5  –92.2  –133.9  236.9  0.76
Distr 0–1  –100.9  –104.6  20.2  –70.9  –85.7  –126.3  213.0  0.69
Distr 1    –87.9  –97.1  –  –58.3  –91.8  –118.6  192.6  0.68
Distr 1–2  –76.9  –91.8  –  –52.7  –83.8  –113.4  181.6  0.64
Distr 2    –75.8  –91.9  –  –42.9  –85.5  –107.7  178.7  0.61
Distr 2–3  –65.4  –86.7  –  –37.6  –70.0  –78.2  168.1  0.57
Distr 3    –64.4  –86.3  –  –28.1  –79.3  –96.8  164.9  0.53
Distr 3–4  –40.0  –66.7  +25.9  –  –73.7  –86.7  124.9  0.49
Distr 4    –38.6  –64.9  +29.1  –  –76.9  –81.8  121.0  0.49
Distr 4–5  –27.7  –59.1  +34.6  –  –73.9  –77.3  109.0  0.47
Distr 5    –26.6  –56.9  +37.6  –  –76.6  –72.5  105.0  0.45
Distr 5–6  –  –43.5  +49.1  –  –71.4  –66.5  80.1  0.43
Distr 6    –  –40.7  +51.4  –  –74.0  –61.9  77.2  0.41
Distr 6–7  –  –41.3  +51.0  –  –72.9  –59.1  75.2  0.40
Distr 7    –  –35.1  +52.8  +28.8  –68.5  –50.1  62.5  0.41
Distr 7–8  –  –35.9  +52.5  +32.4  –67.1  –46.8  60.0  0.40
Distr 8    –  –32.1  +54.4  +40.2  –66.0  –40.9  53.3  0.41
Distr 8–9  –  –33.4  +54.0  +43.9  –64.6  –37.5  51.1  0.38
Distr 9    +26.9  –  +67.9  +55.4  –59.1  –23.3  15.9  0.39
Distr 9–10  +33.1  –  +68.3  +58.8  –58.9  –24.3  10.8  0.42
Distr 10   +32.8  –  +68.9  +65.9  –57.6  –18.6  –5.6  0.41
The results of this experiment show a sharp degradation in data correlation as the proportion of high-adjectival word orders increases, followed by fluctuation between 0.38 and 0.41 for the datasets that have a majority of high-adjectival word orders. This shows that the dominant merge order in any given language for a DP containing a demonstrative, a numeral, an adjective, and a noun must be
It is important to note, however, that the result showing that including the second merge order deteriorates the typological predictions does not automatically entail that it is not a possible merge order. As alluded to in section 5.4, it is possible for various merge orders and movements to take place for semantic reasons. For example, it is possible to merge the numeral lower than some adjectives for scope reasons. That, however, will not affect the dominant order in the language, as it will only occur in the (infrequent) cases of a “high adjective”, like
(22)  a.  the last three boxes 
b.  the three heavy boxes (each is heavy)  
c.  the heavy three boxes (they are heavy as a whole) 
Whether these are cases of numerals merging lower than their usual position (
The previous experiments attempt to predict the actual counts of languages per word order. Not all languages, however, have been documented in the typological literature and, for those that have, there is some debate about what the dominant word order is. This experiment addresses the possible criticism that the results of experiment 3 are too dependent on the actual counts of the different word orders, which have a tendency to change when new languages are attested and documented. For this reason, we group the exact frequencies into discrete equivalence classes, and we repeat the experiment as a classification task. The experiment concentrates on predicting the ranking of word order counts and confirms the results of experiment 3.
The same encoded data is used as in the previous experiments. The goal attribute, the attribute we are trying to predict, is a given word order’s frequency class. We can group the languages into different frequency groups by discretising the frequencies in different ways: either as simply possible or impossible (two values), as was the original goal in Cinque’s paper, or as having different levels of frequency.
We performed the classification task at two different levels of discretised granularity for the frequency. Given that the data is actually distributed according to a power law, as shown in Figure 8, we binned the languages into classes according to the magnitude of their frequency. Using averages or medians would not have properly represented the fundamental fact that the frequencies are distributed exponentially. For two levels of granularity, the cutoff point is whether the number of languages per word order was in the double or triple digits, or in the single digits or zero. So
For three levels of granularity, the same magnitude-based mapping is used. So
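The magnitude-based binning can be sketched as follows. The exact cutoffs and class names here are our reading of the description above (two levels: double or triple digits vs. single digits or zero; three levels additionally separating out unattested orders) and are assumptions, not the paper’s code.

```python
def frequency_class(n_languages, levels=2):
    """Bin a word order's language count by order of magnitude.

    levels=2: 'frequent' for double- or triple-digit counts,
    'infrequent' for single digits or zero.
    levels=3: the low bin is further split, treating a count of zero
    as 'unattested' (our assumption about the three-way mapping).
    """
    if n_languages >= 10:
        return "frequent"
    if levels == 3 and n_languages == 0:
        return "unattested"
    return "infrequent"

print(frequency_class(300))          # frequent
print(frequency_class(7))            # infrequent
print(frequency_class(0, levels=3))  # unattested
```

Binning by magnitude, rather than around a mean or median, respects the power-law shape of the frequency distribution noted above.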
Among the many available learning algorithms, we use a simple probabilistic learning algorithm, Naive Bayes, and
In the Naive Bayes algorithm, the objective of training is to learn the most probable word order type given the probability of each vector of features. This probability is decomposed, according to Bayes’ rule, into the probability of the attributes given the goal predicate and the prior probability of the goal predicate itself. In our setting, the attributes are the movement operations and the goal predicate is the typological frequency. This method is chosen because, despite its simplicity, it works well in practice. Results will be compared to a baseline that consists in assuming that all word orders belong to the most frequent class; the baseline tells us whether the model has learnt anything beyond class frequency. Concretely, the baseline always predicts that languages are not attested in the three-way classification, or that they are infrequent in the two-way classification. We used the WEKA Data Mining Software (
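The decomposition just described can be sketched in pure Python. The experiments themselves use WEKA’s implementation; this Bernoulli variant with Laplace smoothing, and the majority-class baseline, are only an illustration of the setup.

```python
import math
from collections import Counter

def train_nb(X, y, alpha=1.0):
    """Bernoulli Naive Bayes over binary movement-feature vectors X and
    frequency-class labels y, with add-alpha (Laplace) smoothing."""
    counts = Counter(y)
    n, d = len(y), len(X[0])
    prior = {c: counts[c] / n for c in counts}   # P(class)
    cond = {}                                    # cond[c][j] = P(x_j = 1 | c)
    for c in counts:
        rows = [x for x, label in zip(X, y) if label == c]
        cond[c] = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                   for j in range(d)]
    return prior, cond

def predict_nb(model, x):
    """Most probable class: argmax_c  log P(c) + sum_j log P(x_j | c)."""
    prior, cond = model
    scores = {}
    for c in prior:
        lp = math.log(prior[c])
        for j, v in enumerate(x):
            p = cond[c][j]
            lp += math.log(p if v else 1.0 - p)
        scores[c] = lp
    return max(scores, key=scores.get)

def baseline_predict(y_train):
    """Majority-class baseline: ignore the features entirely."""
    return Counter(y_train).most_common(1)[0][0]
```

On the actual data, `X` would hold the 0/1 movement features of each word order and `y` its discretised frequency class.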
The results in Table
Results of the classification tasks on Naive Bayes. Correctly classified instances in parentheses.
High merge  |  Low merge
3 classes  79.2% (38/48)  62.5% (30/48) 
2 classes  100% (48/48)  79.2% (38/48) 
Baseline  58.3% (28/48)  58.3% (28/48) 
Linguistic universals have been discussed from very many different points of view. We concentrate here on those that are directly related to some aspects of our proposal.
In this work, we have proposed using statistical models and classifiers to automatically model and evaluate some quantitative and qualitative aspects of proposals concerning Universal 20. Two previous pieces of work have also used linear regression modelling and classification to evaluate different explanations of linguistic proposals, specifically concerning Universal 20.
Cysouw (
Cinque’s (
To avoid developing too strong a theory, other methods set out to explain the frequency differentials between pairs of word orders. In contrast to Cinque’s single universal base order, Abels and Neeleman (
Since our account does not hinge on the LCA per se, which we do not explicitly represent, nor on any specific assumptions on the internal structure of the phrases, which differ in the two accounts, it is not clear if our method can say anything interesting about the difference of the two theories.
Abels and Neeleman’s account, however, is very restricted and makes interesting quantitative predictions. Specifically, it predicts, first, that all eight base orders should be roughly equivalent in frequency (counted by genera or by languages) and, second, that all six derived orders should be less frequent than, or at most as frequent as, the base order from which they are generated. Typological counts are shown in Table
Abels and Neeleman’s predictions (DGen = Dryer’s genera; C13Lg = Cinque’s (2013) languages).
Base order  DGen  C13Lg  Derived order  DGen  C13Lg 

44  300  3  37  
3  48  
2  14  
3  40  7  35  
6  35  
1  15  
17  114  ( 
11  69  
21  180  
22  125  4  24  
57  411 
A strict interpretation of the first prediction appears not to be fulfilled, especially if we look at the counts for genera. We can however notice that the base orders are all frequent, and none is rare, which seems instead to support the theory. Prediction two is largely confirmed. In only one case, the derived order does not appear to be convincingly more costly than its corresponding base order (
Future work will develop a systematic comparison of these two approaches. Since the two approaches have a very considerable difference in complexity, with Cinque’s allowing several types of movement, the comparison will have to include some measure of model complexity (for example by using Bayes factors) and is beyond the scope of the current article.
One of the clearest observations concerning Universal 20 above is that the two most frequently attested word orders are harmonic. This is a widespread observation in typology. While it is well-known that disharmonic patterns in the order of words exist, and, in fact, most languages are not fully harmonic (
A fundamental assumption of this paper is that frequencies within a language and across languages are an aspect of language that is systematically related to its formal properties and that requires explanation, both in its numerical magnitude and its distribution. Like Yang (
In this respect our work is also related to other computational probabilistic proposals for language universals. In a widely discussed, and controversial, paper, Dunn et al. (
This model is computationally sophisticated, but it aims to explain linguistic data that are relatively simple. Our approach shifts the focus of the investigation: it is computationally simpler, but linguistically more detailed. Although very simple, Naive Bayes belongs to the same class of models, as it is a Bayesian model with latent variables, but our encoding of word orders is more distributed and encodes a derivational theory. To a large extent, our model confirms Cinque’s approach and its differential cost of movement operations. These differential costs were set up to explain a pervasive universal asymmetry between prenominal and postnominal modifiers: the asymmetry between the very restricted choice of word orders among prenominal modifiers and the much larger set of options for postnominal modifiers. This asymmetry is not captured by any of the other models that allow symmetric base orders.
The syntactic category of numerals has been debated in recent years, with proposals claiming that numerals are quantifiers (Stavrou & Terzi 2007,
The idea that at least some numerals are adjectives is based on a number of properties of numerals that are typical of adjectives. Corbett (
(23)  a.  odin zurnal
          one. magazine.
          ‘one magazine’
      b.  odna gazeta
          one. newspaper.
          ‘one newspaper’
Stavrou & Terzi (
(24)  There are many/three books on the table. 
(25)  The many/three books are on the table.
Also, in negated contexts, both numerals and quantifiers have scope ambiguities, as in (26). The sentence can mean either ‘There are three/many women I didn’t see’ or ‘I didn’t see many/three women, I saw few/five’.
(26)  I didn’t see many/three women. 
In addition, they show that numerals and quantifiers share properties that adjectives do not have: cardinals and quantifiers are both able to license bare subjects in Greek (27); they can both appear without a noun, while an adjective cannot (28); they can head a partitive construction, which an adjective cannot (29); and they both allow split topicalization in Greek, which an adjective cannot (30).
(27)  Tris/ligi fitites parusiasan to arthro.
      three/few students presented the article
      ‘Three/few students presented the article.’
(28)  I met three/many/*tall. 
(29)  Many/three/*tall of the demonstrators caused trouble. 
(30)  Vivlia agorasa merika/lika/deka.
      books bought. several/few/ten
      ‘Books I bought several/few/ten.’
Many researchers view numerals as not having a uniform category, including Corbett (
There is much intralinguistic evidence for more than one position for numerals. Researchers, including Zabbal (
Our work does not rule out the possibility of a lower merge position that is motivated semantically. It specifically rules out the possibility that different languages have different merge positions. The confirmation that the high merge position for numerals corresponds to the dominant order in all the languages, combined with intralinguistic evidence of word orders that require a low merge position for numerals, lends support to the idea that the low merge position for numerals is semantically motivated.
In this paper, we set out to compare and test different syntactic proposals concerning Universal 20 using vectorial representations and machine learning methods. Specifically, we set out to answer the following three questions:
Can Cinque’s ranking of the different kinds of movements be predicted automatically using Universal 20?
Is movement always costly? (Is lack of movement always the less costly route?)
Is the base structure proposed by Cinque the best predictor of the typological facts?
Since the syntactic proposals we are modelling are fairly complex, we kept the modelling as simple and as faithful as possible. We modelled the 24 possible permutations of
We then used linear regression in order to determine the weights of the different syntactic movements proposed in Cinque (
Our largescale automatic investigation allows us to discover some facts that would not have been accessible by more traditional methods. Determining the weights of the movement operations and establishing the preferred merge sequence requires computations that exhaustively explore the space of options and calculate the optimal solutions over the space of all languages, computations that are too costly to be done by hand and that would not be informative if done on a small scale.
Specifically, there has been evidence that numerals can merge lower than adjectives in some contexts, Ouwayda (
The preference for recursive pied-piping of the NP among movement operations is also compatible with recent results on scope preferences for Universal 20 reported in Culbertson & Adger (
This is our interpretation of Cinque’s partial movement: partial movement occurs when there is movement of any category, but nothing has moved above the demonstrative.
We thank Guglielmo Cinque for giving us access to this invaluable database.
Collinearity refers to a linear relationship between two or more explanatory variables. Correlation between explanatory variables inflates the variance of the estimated coefficients and makes the predictions unstable.
This conclusion is drawn based on a systematic comparison of correlation coefficients of linear regression in different settings of crossvalidation (leave one out, or 10fold crossvalidation) and with or without attribute selection, as shown below.
Crossval method  |  Attribute selection?  |  Equation (13)  |  Equation (14)  |  Equation (15)
Leave one out    |  no                    |  0.59           |  0.59           |  0.47
Leave one out    |  yes                   |  0.55           |  0.53           |  0.42
10-fold          |  no                    |  0.56           |  0.60           |  0.53
10-fold          |  yes                   |  0.57           |  0.54           |  0.48
Average          |                        |  0.57           |  0.56           |  0.47
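For concreteness, the leave-one-out scheme used in this comparison can be sketched in pure Python for the one-feature case. This is illustrative only; the actual experiments use the full multi-feature models and WEKA’s cross-validation machinery.

```python
import math

def fit_simple_ols(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def pearson(us, vs):
    """Pearson correlation between two equal-length sequences."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    su = math.sqrt(sum((u - mu) ** 2 for u in us))
    sv = math.sqrt(sum((v - mv) ** 2 for v in vs))
    return cov / (su * sv)

def loo_cv_correlation(xs, ys):
    """Leave one point out, fit on the rest, predict the held-out point;
    report the correlation between held-out predictions and true values."""
    preds = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        a, b = fit_simple_ols(rest_x, rest_y)
        preds.append(a * xs[i] + b)
    return pearson(preds, ys)
```

Ten-fold cross-validation works the same way, holding out blocks of roughly a tenth of the data instead of single points.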
There are other proposals about numerals. For example, (
We are very grateful to Guglielmo Cinque for giving us access to his data, and to Giuseppe Samo for his very helpful and attentive reading and comments. All remaining errors are our own.
The research described in this paper was partially funded by the Swiss NSF under grant 144362.
The authors have no competing interests to declare.
The order of authors is alphabetical to indicate equivalent contributions of the two authors. Most of the work by Sarah Ouwayda was performed while at the university of Geneva.