Summary of:
Levshina, N. (2015). 19. Exemplars, categories, prototypes: Simple and Multiple Correspondence Analysis. In How to do Linguistics with R: Data exploration and statistical analysis (pp. 367-385). John Benjamins Publishing Company.
19.1 Register variation of Basic Colour Terms: Simple Correspondence Analysis
19.1.1 The data and hypothesis
Add-on packages
library(Rling); library(vcd); library(ca); library(rgl)
The dataset
colreg
- contains the counts of eleven Basic Colour Terms (BCTs) in different registers
- from the Corpus of Contemporary American English(COCA)
data(colreg)
colreg
## spoken fiction academic press
## black 20335 41118 26892 73080
## blue 4693 22093 3605 21210
## brown 1185 10914 1201 11539
## gray 1168 12140 1289 6559
## green 3860 14398 4477 26837
## orange 931 3496 474 5766
## pink 962 7312 584 6356
## purple 613 3366 429 3403
## red 7230 25111 5621 34596
## white 14474 40745 26336 54883
## yellow 1349 10553 1855 10382
mosaic plot
mosaicplot(colreg, las=2, shade=TRUE, main="Register variation of BCT")
- blue-shaded rectangles: overrepresented in a given register
- pink- and red-shaded rectangles: underrepresented in a given register
- However, the mosaic plot is not particularly convenient when the number of categories is large
- It does not show any common dimensions of variation
- A more appropriate method in this situation is Simple Correspondence Analysis(SCA)
19.1.2 Simple Correspondence Analysis
Correspondence Analysis (CA)
- Identification of systematic relationships between variables
- Capturing the main tendencies in several dimensions.
- Similar to MDS, PCA, and FA, it represents the objects of analysis as points in a low-dimensional space
ca.bc<-ca(colreg)
summary(ca.bc, rows=FALSE, columns=FALSE)
##
## Principal inertias (eigenvalues):
##
## dim value % cum% scree plot
## 1 0.043730 77.9 77.9 *******************
## 2 0.010787 19.2 97.1 *****
## 3 0.001650 2.9 100.0 *
## -------- -----
## Total: 0.056167 100.0
- Principal inertias are the counterpart of eigenvalues in PCA
- the first two dimensions together account for 97.1% of the variation (inertia)
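The principal inertias can be reproduced in base R, as a rough sketch of what happens behind ca(). The snippet re-enters the counts from the colreg table above (the name colreg_m and the intermediate objects are illustrative, not from the source) and takes the singular value decomposition of the standardized residuals. The squared singular values are the principal inertias, and their sum equals the \(\chi^2\)-statistic divided by the grand total.

```r
# Counts re-entered from the colreg table above; colreg_m is an illustrative name
colreg_m <- matrix(c(
  20335, 41118, 26892, 73080,
   4693, 22093,  3605, 21210,
   1185, 10914,  1201, 11539,
   1168, 12140,  1289,  6559,
   3860, 14398,  4477, 26837,
    931,  3496,   474,  5766,
    962,  7312,   584,  6356,
    613,  3366,   429,  3403,
   7230, 25111,  5621, 34596,
  14474, 40745, 26336, 54883,
   1349, 10553,  1855, 10382),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("black", "blue", "brown", "gray", "green", "orange",
      "pink", "purple", "red", "white", "yellow"),
    c("spoken", "fiction", "academic", "press")))

P <- colreg_m / sum(colreg_m)   # correspondence matrix (relative frequencies)
r_mass <- rowSums(P)            # row masses
c_mass <- colSums(P)            # column masses (the average row profile)

# Standardized residuals: (observed - expected) / sqrt(expected), in proportions
expected <- outer(r_mass, c_mass)
S <- (P - expected) / sqrt(expected)

inertia <- svd(S)$d^2           # principal inertias; the 4th is numerically zero
round(inertia[1:3], 6)
round(100 * inertia[1:3] / sum(inertia), 1)

# Total inertia equals the chi-squared statistic divided by the grand total
chi2 <- unname(chisq.test(colreg_m)$statistic)
all.equal(sum(inertia), chi2 / sum(colreg_m))
```

If the counts were entered correctly, the first two printed vectors should agree with the value and % columns of the summary above.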
Two-dimensional CA map
plot(ca.bc, col.lab="black")
- in SCA, row labels are located close to one another if they contain similar proportions of counts in each column
- in other words, the rows have similar profiles
- profiles are actual frequencies divided by the row total
- the profile of black in colreg is [0.13, 0.25, 0.17, 0.45]
- the profile of white is [0.11, 0.30, 0.19, 0.40]
- Those profiles are more similar to each other than to the profile of gray [0.06, 0.57, 0.06, 0.31]
- This explains why black and white are close on the map, and gray is far from both of them
- CA maps represent the difference between profiles as \(\chi^2\)-distance
- similar to the Euclidean distance, but each squared difference is weighted by the inverse of the corresponding average profile value
- the more strongly a row deviates from the average profile, the farther away from the other rows it will be located
- for columns
- Their labels are located close to one another if they contain similar proportions of counts in each row
- However, the interpretation of the mutual proximity of rows and columns is not straightforward.
- by default, plot() creates a so-called symmetric plot.
- the algorithm tries to overlay the BCT space on the register space in an optimal way (rescaling)
- therefore, the location of individual rows should be interpreted with regard to the dimensions formed by the columns
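The profiles and \(\chi^2\)-distances described above can be checked by hand. The following is a minimal sketch in base R, assuming the counts are re-entered from the colreg table; colreg_m, row_prof, avg_prof and chi_dist are illustrative names, not part of the source.

```r
# Counts re-entered from the colreg table above; colreg_m is an illustrative name
colreg_m <- matrix(c(
  20335, 41118, 26892, 73080,
   4693, 22093,  3605, 21210,
   1185, 10914,  1201, 11539,
   1168, 12140,  1289,  6559,
   3860, 14398,  4477, 26837,
    931,  3496,   474,  5766,
    962,  7312,   584,  6356,
    613,  3366,   429,  3403,
   7230, 25111,  5621, 34596,
  14474, 40745, 26336, 54883,
   1349, 10553,  1855, 10382),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("black", "blue", "brown", "gray", "green", "orange",
      "pink", "purple", "red", "white", "yellow"),
    c("spoken", "fiction", "academic", "press")))

# Row profiles: each count divided by its row total
row_prof <- prop.table(colreg_m, margin = 1)
round(row_prof[c("black", "white", "gray"), ], 2)

# Average profile: the share of each register in the whole table
avg_prof <- colSums(colreg_m) / sum(colreg_m)

# Chi-squared distance between two row profiles (hypothetical helper):
# a Euclidean distance with squared differences divided by the average profile
chi_dist <- function(a, b) {
  sqrt(sum((row_prof[a, ] - row_prof[b, ])^2 / avg_prof))
}

chi_dist("black", "white")
chi_dist("black", "gray")
chi_dist("white", "gray")
```

The distance between black and white comes out smaller than either term's distance to gray, which is the pattern visible on the map.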
Interpretation of the CA map
- The first dimension (the horizontal axis)
- the achromatic primary colours (black and white, on the left) vs. other terms (on the right)
- spoken and academic subcorpora vs. fiction
- The primary colours yellow and blue are close to the secondary BCTs
- The press subcorpus’s orientation is shared by red and green
- political connotations (Green Party, Red Army)
- proper names (Red Cross, Green Bay Packers)
- food terms (red wine, green beans)
- the secondary term orange is also found nearby
- ‘made of oranges’
Interpretation of symmetric (default) CA maps
It is easy to misinterpret a CA map. To be on the safe side, follow these rules:
- Row-to-row distances on the CA map represent the approximate \(\chi^2\)-distances between the row profiles
- Column-to-column distances on the CA map represent the approximate \(\chi^2\)-distances between the column profiles
- There is no direct interpretation of row-to-column or column-to-row distances
- Interpret the dimensions first
- and then examine how the profiles are located with regard to the dimensions of variation (Greenacre 2007: 72)
Plot all three dimensions
plot3d() in package rgl
- labels=c(1,1): both row and column profiles should be shown as text labels
plot3d(ca.bc, labels=c(1,1))
To summarize
- Most secondary BCTs cluster together in the same part of the plot where one finds fiction
- blue and yellow are the closest to the secondary terms
- in Berlin and Kay’s (1969) hierarchy
- Green and red are located relatively high on Dimension 2
- this corresponds to the position of newspaper and magazine texts (press)
19.2 Visualization of exemplars and prototypes of lexical categories
Multiple Correspondence Analysis of Stuhl and Sessel
Add-on packages
library(Rling); library(FactoMineR); library(ca); library(rms)
Prototype Theory (e.g. Rosch 1975, Rosch & Mervis 1975)
Categorization of a new item is performed by comparing it with the prototype of an existing category
- The prototype is the summary representation of a category
- contains all features of the category instances
- weighted according to their frequency of occurrence in the subject’s previous experience
- The category BIRD
- most members have the feature ‘can fly’, only some of them have ‘can swim’
- Different members of a category possess typical features to a different extent
- a robin is a more prototypical member than a penguin
- Features play a crucial role in establishing the similarity between two exemplars
- These features are highly intercorrelated
- a typical bird can fly, has wings, and makes nests in trees
Data Structure
- Focuses on two German lexical categories: Stuhl ‘chair’ and Sessel ‘armchair’
- According to the classic study by Gipper(1959), the boundaries between the categories were fuzzy
data(chairs)
str(chairs)
## 'data.frame': 188 obs. of 19 variables:
## $ Shop : Factor w/ 3 levels "ikea.de","Moebel-Profi.de",..: 2 1 1 2 1 3 1 3 1 1 ...
## $ WordDE : Factor w/ 44 levels "3-in-1-Sessel",..: 2 17 38 41 23 13 25 15 40 40 ...
## $ Category : Factor w/ 2 levels "Sessel","Stuhl": 2 2 1 2 2 2 2 1 2 2 ...
## $ Function : Factor w/ 5 levels "Eat","NotSpec",..: 1 1 2 1 1 5 2 4 1 1 ...
## $ Age : Factor w/ 2 levels "Adult","Children": 1 2 1 1 2 1 1 1 1 1 ...
## $ Back : Factor w/ 4 levels "Adjust","High",..: 3 4 4 2 2 2 4 2 4 4 ...
## $ Soft : Factor w/ 3 levels "No","Pad","Yes": 1 1 1 3 1 3 1 3 1 1 ...
## $ Arms : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 2 1 1 ...
## $ Upholst : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 2 1 2 ...
## $ MaterialSeat: Factor w/ 10 levels "Fabric","Leather",..: 6 10 8 1 6 1 10 2 10 1 ...
## $ SeatHeight : Factor w/ 3 levels "Adjust","High",..: 3 2 3 3 2 1 3 3 3 3 ...
## $ SeatDepth : Factor w/ 3 levels "Adjust","Deep",..: 3 3 3 3 3 2 3 2 3 3 ...
## $ Swivel : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ Roll : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ Rock : Factor w/ 2 levels "No","Rock": 1 1 1 1 1 1 1 1 1 1 ...
## $ AddFunctions: Factor w/ 3 levels "Bed","No","Table": 2 2 2 2 2 2 2 2 2 2 ...
## $ Recline : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ ReclineBack : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ SaveSpace : Factor w/ 3 levels "collapse","No",..: 2 2 3 2 2 2 1 2 2 2 ...
- Variables
- Shop: one of the three online stores
- WordDE: the exact lexical label of each chair or armchair
- Category: the lexical category ‘Stuhl’ or ‘Sessel’
- Many of the variables are intercorrelated
- if a chair can swivel, it can usually roll
swivelRoll<-xtabs(~ chairs$Swivel+chairs$Roll)
swivelRoll
## chairs$Roll
## chairs$Swivel No Yes
## No 133 1
## Yes 14 40
chisq.test(swivelRoll)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: swivelRoll
## X-squared = 117.1, df = 1, p-value < 2.2e-16
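The strength of the Swivel/Roll association can also be expressed as a proportion and as an effect size. The sketch below re-enters the 2×2 table printed above (swivel_roll is an illustrative name); the phi coefficient is an addition for illustration and is not reported in the source.

```r
# 2x2 table re-entered from the xtabs() output above (filled column by column)
swivel_roll <- matrix(c(133, 14, 1, 40), nrow = 2,
                      dimnames = list(Swivel = c("No", "Yes"),
                                      Roll   = c("No", "Yes")))

# Share of swivel chairs that can also roll (40 out of 54)
share_roll <- prop.table(swivel_roll, margin = 1)["Yes", "Yes"]
round(share_roll, 2)

# Same test as above, with Yates' continuity correction (the default for 2x2)
test_yates <- chisq.test(swivel_roll)
round(unname(test_yates$statistic), 1)   # 117.1, as in the output above

# Phi coefficient from the uncorrected statistic: sqrt(chi2 / n)
chi2_raw <- unname(chisq.test(swivel_roll, correct = FALSE)$statistic)
phi <- sqrt(chi2_raw / sum(swivel_roll))
round(phi, 2)
```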
19.2.2 Multiple Correspondence Analysis
- To analyse multivariate data with more than two categorical variables
- package FactoMineR
- The input data
- dataframe format
- the rows are individual observations
- the columns are categorical variables
- MCA can represent
- relationships between the values of categorical variables
- individual observations
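One common way to think about MCA is as a Simple Correspondence Analysis of the indicator (dummy) matrix built from the data frame. The sketch below illustrates this in base R with a small invented data frame (toy and all its values are hypothetical, not the chairs data); it is not how FactoMineR is implemented internally, only the same idea in miniature. It also shows why the total inertia depends only on the number of categories J and variables Q, namely (J - Q) / Q.

```r
# Hypothetical toy data: 6 seats described by 3 categorical variables
toy <- data.frame(
  Arms   = factor(c("No", "No",  "Yes", "Yes", "Yes", "No")),
  Soft   = factor(c("No", "Pad", "Yes", "Yes", "Pad", "No")),
  Swivel = factor(c("No", "No",  "No",  "Yes", "Yes", "No"))
)

# Indicator matrix: one 0/1 column per category of each variable
indicator <- do.call(cbind, lapply(names(toy), function(v) {
  z <- 1 * outer(as.character(toy[[v]]), levels(toy[[v]]), "==")
  colnames(z) <- paste(v, levels(toy[[v]]), sep = "_")
  z
}))
indicator

# Simple CA of the indicator matrix via SVD of the standardized residuals
P <- indicator / sum(indicator)
expected <- outer(rowSums(P), colSums(P))
S <- (P - expected) / sqrt(expected)
inertia <- svd(S)$d^2

Q <- ncol(toy)         # number of variables
J <- ncol(indicator)   # total number of categories
sum(inertia)           # total inertia of the indicator matrix
(J - Q) / Q            # the same quantity, from the table layout alone
```

Because the total grows with the number of categories rather than with the strength of the associations, the percentages of explained variance in a plain MCA tend to look low.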
Create a map
chairs.ca<-MCA(chairs[,-c(1:3)],graph=FALSE)
plot(chairs.ca, cex=0.7, col.var="black", col.ind="grey")
The contributions of different variables: Dimension 1
dimdesc(chairs.ca, axes=1)
## $`Dim 1`
## $`Dim 1`$quali
## R2 p.value
## Upholst 0.72940952 1.094774e-54
## MaterialSeat 0.74518860 3.215782e-48
## Function 0.69158437 1.158923e-45
## Soft 0.66568141 9.657154e-45
## Swivel 0.40875670 5.393205e-23
## Roll 0.38348403 2.728416e-21
## SeatHeight 0.39565748 5.870717e-21
## Back 0.36654364 3.802707e-18
## Arms 0.21473392 2.133731e-11
## SeatDepth 0.20909906 3.769585e-10
## SaveSpace 0.19444992 2.058545e-09
## Age 0.06521465 4.047690e-04
## ReclineBack 0.06368029 4.764098e-04
## Recline 0.04908474 2.246446e-03
##
## $`Dim 1`$category
## Estimate p.value
## Upholst_No 0.508362663 1.094774e-54
## Soft_No 0.388494836 5.840206e-41
## Swivel_No 0.402810868 5.393205e-23
## Roll_No 0.427504866 2.728416e-21
## Wood 0.293104780 4.464862e-16
## NotSpec 0.643951013 4.899545e-13
## Eat 0.234142965 5.684581e-13
## Back_Mid 0.358373579 1.779517e-12
## SeatHeight_Norm 0.143680221 2.417614e-12
## Arms_No 0.264219623 2.133731e-11
## SeatDepth_Norm 0.447005585 6.603795e-10
## Plastic 0.346541022 1.681653e-08
## SaveSpace_stack 0.267052677 4.199531e-07
## Children 0.236113583 4.047690e-04
## ReclineBack_No 0.163875900 4.764098e-04
## SaveSpace_collapse 0.252603082 1.181098e-03
## Rattan 0.124400133 1.696248e-03
## Recline_No 0.182956176 2.246446e-03
## SeatHeight_High 0.538230447 4.514663e-03
## Back_Low 0.737133741 6.215202e-03
## Recline_Yes -0.182956176 2.246446e-03
## ReclineBack_Yes -0.163875900 4.764098e-04
## Adult -0.236113583 4.047690e-04
## SeatDepth_Adjust -0.437445200 1.207452e-04
## Back_Adjust -0.875345607 1.096450e-05
## Relax -0.463574619 8.503520e-06
## SeatDepth_Deep -0.009560385 8.078739e-06
## Fabric -0.704604631 6.260860e-10
## SaveSpace_No -0.519655759 2.401817e-10
## Leather -0.835048584 8.166132e-11
## Back_High -0.220161713 5.595166e-11
## Arms_Yes -0.264219623 2.133731e-11
## Work -0.714944787 5.944443e-16
## SeatHeight_Adjust -0.681910668 7.980545e-21
## Roll_Yes -0.427504866 2.728416e-21
## Swivel_Yes -0.402810868 5.393205e-23
## Soft_Yes -0.575087143 9.455970e-46
## Upholst_Yes -0.508362663 1.094774e-54
- $`Dim 1`$quali shows the R2 statistics
- how strongly each variable is associated with the dimension
- $`Dim 1`$category provides information on the directionality of those associations
- estimates of simple linear regression coefficients
- if the estimate is positive, the category is located in the right-hand part of the plot
- if it is negative, then on the left
- The greater the deviation from zero, the stronger the effect
- Dimension 1: highly comfortable (left) vs. less comfortable (right)
The contributions of different variables : Dimension 2
dimdesc(chairs.ca, axes=2)
## $`Dim 2`
## $`Dim 2`$quali
## R2 p.value
## Function 0.78584122 4.217810e-60
## SeatDepth 0.64716353 1.414314e-42
## ReclineBack 0.57438079 2.422581e-36
## SeatHeight 0.52515707 1.204684e-30
## Roll 0.46545580 4.290655e-27
## Recline 0.27597436 9.940716e-15
## Swivel 0.26129650 6.598620e-14
## Back 0.14372484 2.682914e-06
## AddFunctions 0.11015232 2.049744e-05
## Upholst 0.05391569 1.343894e-03
## Arms 0.05093462 1.845045e-03
## Rock 0.05020680 1.993581e-03
## MaterialSeat 0.10623070 1.571677e-02
## Age 0.02829025 2.103903e-02
## Soft 0.03925557 2.461661e-02
##
## $`Dim 2`$category
## Estimate p.value
## ReclineBack_No 0.43818645 2.422581e-36
## SeatHeight_Adjust 0.42064264 1.263155e-28
## Roll_Yes 0.41932722 4.290655e-27
## Work 0.52198563 1.502005e-24
## SeatDepth_Norm 0.04767135 6.407920e-17
## Recline_No 0.38623766 9.940716e-15
## Swivel_Yes 0.28673582 6.598620e-14
## SeatDepth_Adjust 0.69693231 8.045886e-08
## Back_Adjust 0.78717617 1.046328e-07
## AddFunctions_No 0.17691154 4.792253e-04
## Upholst_No 0.12305293 1.343894e-03
## Arms_No 0.11456915 1.845045e-03
## Rock_No 0.39410033 1.993581e-03
## Wood 0.19020718 3.350552e-03
## Children 0.13845645 2.103903e-02
## Soft_No 0.01136842 2.286251e-02
## Plastic 0.20449930 3.735216e-02
## Fabric -0.15465893 2.219771e-02
## Adult -0.13845645 2.103903e-02
## Soft_Yes -0.17784951 8.232290e-03
## Rock_Rock -0.39410033 1.993581e-03
## Arms_Yes -0.11456915 1.845045e-03
## Upholst_Yes -0.12305293 1.343894e-03
## AddFunctions_Bed -0.69031884 5.405031e-06
## Swivel_No -0.28673582 6.598620e-14
## Recline_Yes -0.38623766 9.940716e-15
## Roll_No -0.41932722 4.290655e-27
## SeatHeight_Norm -0.46315824 3.007332e-30
## SeatDepth_Deep -0.74460367 2.505087e-36
## ReclineBack_Yes -0.43818645 2.422581e-36
## Relax -0.63329716 2.835996e-45
- functionality is still a distinctive feature
- chairs for relaxation vs. chairs for work
- there are three distinct categories of chairs
- comfortable chairs for relaxation
- comfortable adjustable chairs for work
- multifunctional chairs for the household
- In the middle part of the map
- comfortable chairs for the dining room
How are these differences related to the lexical categories under investigation, Stuhl and Sessel?
chairs.ca1<-MCA(chairs[,-c(1:2)],quali.sup=1,graph=FALSE)
plot(chairs.ca1, invisible="ind", col.var="darkgrey", col.quali.sup="black")
- Stuhl
- upper right quadrant of the map
- the features are associated with simple practical chairs
- relatively close to the centre
- most office chairs for work are also categorized as Stuhl
- Sessel
- located at the bottom
- the features are associated with comfortable chairs for relaxation
- The functional differentiation is crucial
95% confidence ellipses
- around the centroids of Stuhl and Sessel
- the centroids represent the prototypes of the categories
- Since the confidence ellipses do not overlap, the prototypes can be regarded as distinct
plotellipses(chairs.ca1, keepvar=1, label="quali")
- around all exemplars that represent each category
- these ellipses display significant overlap, which supports Gipper’s observation about fuzzy boundaries between the categories
plotellipses(chairs.ca1, means=FALSE, keepvar=1, label="quali")
Eigenvalues
- The proportion of explained variance tends to be very modest because the total variance is inflated
- the first two dimensions account for only 27.42% of the variance
head(chairs.ca$eig, 11)
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.32507257 15.297533 15.29753
## dim 2 0.25767552 12.125907 27.42344
## dim 3 0.13519020 6.361892 33.78533
## dim 4 0.12293223 5.785046 39.57038
## dim 5 0.10891028 5.125190 44.69557
## dim 6 0.09618531 4.526367 49.22193
## dim 7 0.09019392 4.244420 53.46635
## dim 8 0.08619851 4.056401 57.52275
## dim 9 0.08165427 3.842554 61.36531
## dim 10 0.07264654 3.418661 64.78397
## dim 11 0.07066398 3.325364 68.10933
Adjusted MCA
- Adjusted MCA estimates explained variation more realistically
chairs.ca2<-mjca(chairs[,-c(1:3)])
summary(chairs.ca2, rows=FALSE, columns=FALSE)
##
## Principal inertias (eigenvalues):
##
## dim value % cum% scree plot
## 1 0.078443 47.1 47.1 **************
## 2 0.043342 26.0 73.2 ********
## 3 0.006012 3.6 76.8 *
## 4 0.004155 2.5 79.3 *
## 5 0.002451 1.5 80.8
## 6 0.001291 0.8 81.5
## 7 0.000873 0.5 82.1
## 8 0.000639 0.4 82.4
## 9 0.000417 0.3 82.7
## 10 0.000117 0.1 82.8
## 11 7.6e-050 0.0 82.8
## 12 1e-05000 0.0 82.8
## -------- -----
## Total: 0.166428
- the first two dimensions account for 73.2% of the inertia
- But are the non-adjusted and adjusted solutions equivalent?
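The adjusted inertias can be recovered from the non-adjusted eigenvalues listed earlier. The sketch below assumes that mjca() was run with its default setting (lambda = "adjusted") and applies Greenacre's adjustment, (Q/(Q-1))^2 * (eigenvalue - 1/Q)^2, to every eigenvalue above 1/Q, where Q is the number of variables. The eigenvalues are copied from head(chairs.ca$eig, 11), so only the first 11 dimensions are covered; the adjusted total is taken from the summary above rather than recomputed.

```r
# Non-adjusted MCA eigenvalues, copied from head(chairs.ca$eig, 11) above
eig_ind <- c(0.32507257, 0.25767552, 0.13519020, 0.12293223, 0.10891028,
             0.09618531, 0.09019392, 0.08619851, 0.08165427, 0.07264654,
             0.07066398)

Q <- 16   # number of active variables: 19 columns minus the 3 excluded ones

# Greenacre's adjustment, applied only to eigenvalues above 1/Q
keep <- eig_ind > 1 / Q
eig_adj <- (Q / (Q - 1))^2 * (eig_ind[keep] - 1 / Q)^2
round(eig_adj[1:3], 6)     # 0.078443 0.043342 0.006012

# Percentages relative to the total reported by summary(chairs.ca2)
total_adj <- 0.166428
round(100 * eig_adj[1:2] / total_adj, 1)   # 47.1 26.0
```

The adjustment rescales the eigenvalues but leaves their order unchanged.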
A correlation analysis of the coordinates of features of the first two dimensions
cor(chairs.ca$var$coord[,1], chairs.ca2$colcoord[,1])
## [1] 1
cor(chairs.ca$var$coord[,2], chairs.ca2$colcoord[,2])
## [1] -1
- the solutions are practically identical, differing only in scale and in the orientation of the second dimension in chairs.ca2 (upside down)
- Therefore, our initial two-dimensional representation is not perfect, but it is acceptable for a pilot study
Reducing the correlated variables to a smaller set of underlying dimensions
- Intercorrelated features of categorical variables are pervasive in linguistic practice
- think, believe
- mental state, animate, first argument, complement clause…
- MCA dimensions as predictors in a logistic regression model
dim1 <- chairs.ca$ind$coord[,1] # coordinates of individual exemplars on the horizontal axis
dim2 <- chairs.ca$ind$coord[,2] # the same for the vertical axis
m <- lrm(chairs$Category ~ dim1 + dim2)
m
## Logistic Regression Model
##
## lrm(formula = chairs$Category ~ dim1 + dim2)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 188 LR chi2 118.82 R2 0.643 C 0.921
## Sessel 67 d.f. 2 g 2.667 Dxy 0.842
## Stuhl 121 Pr(> chi2) <0.0001 gr 14.394 gamma 0.844
## max |deriv| 2e-06 gp 0.386 tau-a 0.388
## Brier 0.094
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 0.9833 0.2448 4.02 <0.0001
## dim1 2.1780 0.5319 4.09 <0.0001
## dim2 3.9151 0.5377 7.28 <0.0001
##
- The two dimensions have high predictive power in the choice between the lexical categories (C = 0.921)
- The positive coefficients show that the higher the value of an exemplar on Dimension 1 and Dimension 2 of the MCA, the greater the chances of it being categorized as a Stuhl
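To make the coefficients more tangible, they can be turned into predicted probabilities. The sketch below copies the estimates from the lrm() output above and assumes, in line with the interpretation just given, that the model predicts the probability of Stuhl. The function p_stuhl and the example coordinates are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the lrm() output above
b0 <- 0.9833   # intercept
b1 <- 2.1780   # dim1
b2 <- 3.9151   # dim2

# Predicted probability of Stuhl for given MCA coordinates (hypothetical helper)
p_stuhl <- function(dim1, dim2) plogis(b0 + b1 * dim1 + b2 * dim2)

p_stuhl(0, 0)         # exemplar at the origin of the map
p_stuhl(0.5, 0.5)     # hypothetical exemplar in the upper right quadrant
p_stuhl(-0.5, -0.5)   # hypothetical exemplar in the lower left quadrant

# Odds ratios for a one-unit increase on each dimension
round(exp(c(dim1 = b1, dim2 = b2)), 2)
```

An exemplar at the origin already has a probability of about 0.73 of being a Stuhl, which reflects the larger number of Stuhl items in the data (121 vs. 67).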
How to report the results of Correspondence Analysis
- plots
- the number of dimensions with the proportions of explained variance (inertia)
- features that contribute to the orientation of dimensions
- For SCA, the \(\chi^2\)-statistic, the degrees of freedom and the p-value
- For MCA, it is recommended to add the results of a confirmatory logistic regression