19. Exemplars, categories, prototypes

A summary of:

Levshina, N. (2015). 19. Exemplars, categories, prototypes: Simple and multiple correspondence analysis. In How to do Linguistics with R: Data exploration and statistical analysis (pp. 367-385). John Benjamins Publishing Company.

19.1 Register variation of Basic Colour Terms: Simple Correspondence Analysis

19.1.1 The data and hypothesis

Add-on packages

library(Rling);library(vcd);library(ca);library(rgl)

The dataset colreg

contains the counts of eleven Basic Colour Terms (BCTs) in four registers
of the Corpus of Contemporary American English (COCA)

data(colreg)
colreg

##        spoken fiction academic press
## black   20335   41118    26892 73080
## blue     4693   22093     3605 21210
## brown    1185   10914     1201 11539
## gray     1168   12140     1289  6559
## green    3860   14398     4477 26837
## orange    931    3496      474  5766
## pink      962    7312      584  6356
## purple    613    3366      429  3403
## red      7230   25111     5621 34596
## white   14474   40745    26336 54883
## yellow   1349   10553     1855 10382

mosaic plot

mosaicplot(colreg, las=2, shade=TRUE, main="Register variation of BCT")

blue-shaded rectangles: overrepresented in a given register
pink- and red-shaded rectangles: underrepresented in a given register

However, the mosaic plot is not particularly convenient when the number of categories is large.
It also does not show any common dimensions of variation.
A more appropriate method in this situation is Simple Correspondence Analysis (SCA).
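
The shading of the mosaic plot can also be inspected numerically. The sketch below assumes that the shading follows the Pearson residuals of a test of independence (which is how the documentation of mosaicplot() describes it) and rebuilds the table by hand from the counts printed above, so that it runs without the Rling package. The name colreg_m is introduced here for illustration only.

```r
# Counts copied from the colreg table printed above
# (rows: Basic Colour Terms, columns: registers)
colreg_m <- matrix(
  c(20335, 41118, 26892, 73080,
     4693, 22093,  3605, 21210,
     1185, 10914,  1201, 11539,
     1168, 12140,  1289,  6559,
     3860, 14398,  4477, 26837,
      931,  3496,   474,  5766,
      962,  7312,   584,  6356,
      613,  3366,   429,  3403,
     7230, 25111,  5621, 34596,
    14474, 40745, 26336, 54883,
     1349, 10553,  1855, 10382),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("black", "blue", "brown", "gray", "green", "orange",
      "pink", "purple", "red", "white", "yellow"),
    c("spoken", "fiction", "academic", "press"))
)

ct  <- chisq.test(colreg_m)
res <- ct$residuals   # Pearson residuals: (observed - expected) / sqrt(expected)

round(res, 1)         # positive = overrepresented, negative = underrepresented
ct$statistic          # chi-squared statistic of the whole table
ct$parameter          # degrees of freedom: (11 - 1) * (4 - 1)
```

Large positive residuals correspond to the blue rectangles and large negative ones to the red rectangles. The statistic and degrees of freedom are also the values one would report for an SCA of this table.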

19.1.2 Simple Correspondence Analysis

Correspondence Analysis (CA)

Identifies systematic relationships between variables
Captures the main tendencies in a few dimensions
Like MDS, PCA, and FA, it represents the objects of analysis as points in a low-dimensional space

ca.bc<-ca(colreg)
summary(ca.bc,rows=FALSE,columns=FALSE)

## 
## Principal inertias (eigenvalues):
## 
##  dim    value      %   cum%   scree plot               
##  1      0.043730  77.9  77.9  *******************      
##  2      0.010787  19.2  97.1  *****                    
##  3      0.001650   2.9 100.0  *                        
##         -------- -----                                 
##  Total: 0.056167 100.0

Principal inertias play the same role as eigenvalues in PCA
- the first two dimensions together represent 97.1% of the variation (inertia)
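
The principal inertias can be reproduced in base R, which may help to see where they come from. This is a minimal sketch of the standard CA computation (a singular value decomposition of the standardized residuals of the correspondence matrix), not the internal code of ca(). The table is rebuilt by hand from the counts printed above; colreg_m, P, S and eig are illustrative names.

```r
# Counts copied from the colreg table printed above
colreg_m <- matrix(
  c(20335, 41118, 26892, 73080,
     4693, 22093,  3605, 21210,
     1185, 10914,  1201, 11539,
     1168, 12140,  1289,  6559,
     3860, 14398,  4477, 26837,
      931,  3496,   474,  5766,
      962,  7312,   584,  6356,
      613,  3366,   429,  3403,
     7230, 25111,  5621, 34596,
    14474, 40745, 26336, 54883,
     1349, 10553,  1855, 10382),
  ncol = 4, byrow = TRUE
)

P    <- colreg_m / sum(colreg_m)   # correspondence matrix (relative frequencies)
r    <- rowSums(P)                 # row masses
cm   <- colSums(P)                 # column masses (the average row profile)

# Standardized residuals: (P - r c') scaled by the square roots of the masses
S    <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))

eig  <- svd(S)$d^2                 # principal inertias; the 4th is numerically zero
eig3 <- eig[1:3]
pct  <- 100 * eig3 / sum(eig3)     # share of inertia per dimension

round(eig3, 6)
round(cumsum(pct), 1)
sum(eig3)                          # total inertia = chi-squared statistic / N
```

The squared singular values should agree with the "value" column of the summary, and their sum with the total inertia.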

Two-dimensional CA map

plot(ca.bc,col.lab ="black")

In SCA, row labels are located close to one another if the rows contain similar proportions of counts in each column
- that is, the rows have similar profiles
Profiles are the actual frequencies divided by the row total
- the profile of black in colreg is [0.13, 0.25, 0.17, 0.45]
- the profile of white is [0.11, 0.30, 0.19, 0.40]
- these profiles are more similar to each other than to the profile of gray [0.06, 0.57, 0.06, 0.31]
- this explains why black and white are close on the map, and gray is far from both of them
CA maps represent the difference between profiles as $\chi$²-distance
- a weighted version of the Euclidean distance
- the more strongly a row deviates from the average profile, the farther away from the other rows it will be located
For columns
- their labels are located close to one another if the columns contain similar proportions of counts in each row
- however, the interpretation of the mutual proximity of rows and columns is not straightforward
  - the function creates a so-called symmetric plot
  - the algorithm tries to overlay the BCT space on the register space in an optimal way (rescaling)
  - therefore, the location of individual rows should be interpreted with regard to the dimensions formed by the columns
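
The profiles and $\chi$²-distances mentioned above can be checked by hand. The sketch below rebuilds the table from the printed counts, computes the row profiles, and compares the distances between black, white and gray. The helper chi_dist() is a hypothetical name; it weights each squared difference by the inverse of the average profile, which is the usual definition of the $\chi$²-distance between row profiles.

```r
# Counts copied from the colreg table printed above
colreg_m <- matrix(
  c(20335, 41118, 26892, 73080,
     4693, 22093,  3605, 21210,
     1185, 10914,  1201, 11539,
     1168, 12140,  1289,  6559,
     3860, 14398,  4477, 26837,
      931,  3496,   474,  5766,
      962,  7312,   584,  6356,
      613,  3366,   429,  3403,
     7230, 25111,  5621, 34596,
    14474, 40745, 26336, 54883,
     1349, 10553,  1855, 10382),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("black", "blue", "brown", "gray", "green", "orange",
      "pink", "purple", "red", "white", "yellow"),
    c("spoken", "fiction", "academic", "press"))
)

profiles <- prop.table(colreg_m, margin = 1)      # each row divided by its total
avg      <- colSums(colreg_m) / sum(colreg_m)     # average row profile

# Chi-squared distance between two row profiles (hypothetical helper)
chi_dist <- function(a, b) {
  sqrt(sum((profiles[a, ] - profiles[b, ])^2 / avg))
}

round(profiles[c("black", "white", "gray"), ], 2)
d_bw <- chi_dist("black", "white")
d_bg <- chi_dist("black", "gray")
c(black_white = d_bw, black_gray = d_bg)
```

The distance between black and white comes out much smaller than the distance between black and gray, in line with their positions on the map.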

Interpretation of the CA map

The first dimension (the horizontal axis)
- the achromatic primary colours (black and white, on the left) vs. the other terms (on the right)
- the spoken and academic subcorpora vs. fiction
The primary colours yellow and blue are close to the secondary BCTs
The orientation of the press subcorpus is shared by red and green
- political connotations (Green Party, Red Army)
- proper names (Red Cross, Green Bay Packers)
- food terms (red wine, green beans)
- the secondary term orange is also found nearby
  - ‘made of oranges’

Interpretation of symmetric (default) CA maps

It is easy to misinterpret a CA map. To be on the safe side, follow these rules:

Row-to-row distances on the CA map represent the approximate $\chi$²-distances between the row profiles
Column-to-column distances on the CA map represent the approximate $\chi$²-distances between the column profiles
There is no direct interpretation of row-to-column or column-to-row distances
- interpret the dimensions first
- then examine how the profiles are located with regard to the dimensions of variation (Greenacre 2007: 72)

Plot all three dimensions

plot3d() in package rgl
labels=c(1,1) : both row and column profiles should be shown as text labels

plot3d(ca.bc,labels=c(1,1))


To summarize

Most secondary BCTs cluster together in the part of the plot where fiction is located
- blue and yellow are the closest to the secondary terms
  - in Berlin and Kay’s (1969) hierarchy
The location of green and red is relatively high on Dimension 2
- this matches the position of the press subcorpus (newspaper and magazine texts)

19.2 Visualization of exemplars and prototypes of lexical categories
Multiple Correspondence Analysis of Stuhl and Sessel

Add-on packages

library(Rling);library(FactoMineR);library(ca);library(rms)

Prototype Theory (e.g. Rosch 1975, Rosch & Mervis 1975)

Categorization of a new item is performed by comparing it with the prototype of an existing category
The prototype is the summary representation of a category
- it contains all features of the category instances
- the features are weighted according to their frequency of occurrence in the subject’s previous experience
The category BIRD
- most members have the feature ‘can fly’, but only some of them have ‘can swim’
- different members of a category possess typical features to a different extent
- a robin is a more prototypical member than a penguin
Features play a crucial role in establishing the similarity between two exemplars
These features are highly intercorrelated
- a typical bird can fly, has wings, and makes nests in trees

Data Structure

The case study focuses on two German lexical categories, Stuhl ‘chair’ and Sessel ‘armchair’
- according to the classic study by Gipper (1959), the boundaries between the categories were fuzzy

data(chairs)
str(chairs)

## 'data.frame':    188 obs. of  19 variables:
##  $ Shop        : Factor w/ 3 levels "ikea.de","Moebel-Profi.de",..: 2 1 1 2 1 3 1 3 1 1 ...
##  $ WordDE      : Factor w/ 44 levels "3-in-1-Sessel",..: 2 17 38 41 23 13 25 15 40 40 ...
##  $ Category    : Factor w/ 2 levels "Sessel","Stuhl": 2 2 1 2 2 2 2 1 2 2 ...
##  $ Function    : Factor w/ 5 levels "Eat","NotSpec",..: 1 1 2 1 1 5 2 4 1 1 ...
##  $ Age         : Factor w/ 2 levels "Adult","Children": 1 2 1 1 2 1 1 1 1 1 ...
##  $ Back        : Factor w/ 4 levels "Adjust","High",..: 3 4 4 2 2 2 4 2 4 4 ...
##  $ Soft        : Factor w/ 3 levels "No","Pad","Yes": 1 1 1 3 1 3 1 3 1 1 ...
##  $ Arms        : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 2 1 1 ...
##  $ Upholst     : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 2 1 2 ...
##  $ MaterialSeat: Factor w/ 10 levels "Fabric","Leather",..: 6 10 8 1 6 1 10 2 10 1 ...
##  $ SeatHeight  : Factor w/ 3 levels "Adjust","High",..: 3 2 3 3 2 1 3 3 3 3 ...
##  $ SeatDepth   : Factor w/ 3 levels "Adjust","Deep",..: 3 3 3 3 3 2 3 2 3 3 ...
##  $ Swivel      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ Roll        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ Rock        : Factor w/ 2 levels "No","Rock": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AddFunctions: Factor w/ 3 levels "Bed","No","Table": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Recline     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ReclineBack : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ SaveSpace   : Factor w/ 3 levels "collapse","No",..: 2 2 3 2 2 2 1 2 2 2 ...

Variables
- Shop: one of the three online stores
- WordDE: the exact lexical label of each chair or armchair
- Category: the lexical category ‘Stuhl’ or ‘Sessel’
Many of the variables are intercorrelated
- if a chair can swivel, it can usually roll

swivelRoll<-xtabs(~ chairs$Swivel+chairs$Roll)
swivelRoll

##              chairs$Roll
## chairs$Swivel  No Yes
##           No  133   1
##           Yes  14  40

chisq.test(swivelRoll)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  swivelRoll
## X-squared = 117.1, df = 1, p-value < 2.2e-16
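
For readers without the chairs data at hand, the same test can be rerun from the printed cross-tabulation alone. The sketch below rebuilds the 2x2 table by hand; the phi coefficient at the end is an addition that is not part of the original analysis and is included only as a rough indication of the strength of the association.

```r
# Cross-tabulation of Swivel and Roll, copied from the output above
swivel_roll <- matrix(
  c(133, 14,    # first column:  Roll = No  (Swivel = No, Yes)
      1, 40),   # second column: Roll = Yes (Swivel = No, Yes)
  nrow = 2,
  dimnames = list(Swivel = c("No", "Yes"), Roll = c("No", "Yes"))
)

# For 2x2 tables chisq.test() applies Yates' continuity correction by default
ct <- chisq.test(swivel_roll)
ct$statistic
round(ct$expected, 1)   # counts expected if Swivel and Roll were independent

# Phi coefficient (an addition, not in the original summary):
# square root of the uncorrected chi-squared statistic divided by N
chi2_raw <- unname(chisq.test(swivel_roll, correct = FALSE)$statistic)
phi      <- sqrt(chi2_raw / sum(swivel_roll))
phi
```

The observed count of chairs that both swivel and roll is far above the expected count, which is what the very small p-value reflects.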

19.2.2 Multiple Correspondence Analysis

MCA is used to analyse multivariate data with more than two categorical variables
It is implemented in the package FactoMineR
The input data
- data frame format
- the rows are individual observations
- the columns are categorical variables
MCA can represent
- relationships between the values of categorical variables
- individual observations
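
One way to picture what MCA works on is the indicator matrix, in which every category of every variable becomes a 0/1 column; MCA can be understood as a correspondence analysis of that matrix. The sketch below uses a small made-up data frame (toy, not the chairs data) and a hypothetical helper indicator() to show the structure. It does not reproduce the internals of FactoMineR.

```r
# Made-up example data: four pieces of furniture, three categorical variables
toy <- data.frame(
  Arms   = c("No", "Yes", "Yes", "No"),
  Soft   = c("No", "Yes", "Pad", "No"),
  Swivel = c("No", "No",  "Yes", "No"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: one 0/1 column per category of each factor
indicator <- function(df) {
  blocks <- lapply(names(df), function(v) {
    f <- df[[v]]
    z <- outer(as.integer(f), seq_len(nlevels(f)), "==") * 1
    colnames(z) <- paste(v, levels(f), sep = "_")
    z
  })
  do.call(cbind, blocks)
}

Z <- indicator(toy)
Z
rowSums(Z)            # every observation has exactly one 1 per variable

# Burt matrix: all pairwise cross-tabulations of the categories
burt <- t(Z) %*% Z
burt
```

Rows of the indicator matrix correspond to the individual observations on the MCA map, and its columns to the category points.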

Create a map

chairs.ca<-MCA(chairs[,-c(1:3)],graph=FALSE)
plot(chairs.ca, cex=0.7, col.var="black", col.ind="grey")

The contributions of different variables : Dimension 1

dimdesc(chairs.ca,axes=1)

## $`Dim 1`
## $`Dim 1`$quali
##                      R2      p.value
## Upholst      0.72940952 1.094774e-54
## MaterialSeat 0.74518860 3.215782e-48
## Function     0.69158437 1.158923e-45
## Soft         0.66568141 9.657154e-45
## Swivel       0.40875670 5.393205e-23
## Roll         0.38348403 2.728416e-21
## SeatHeight   0.39565748 5.870717e-21
## Back         0.36654364 3.802707e-18
## Arms         0.21473392 2.133731e-11
## SeatDepth    0.20909906 3.769585e-10
## SaveSpace    0.19444992 2.058545e-09
## Age          0.06521465 4.047690e-04
## ReclineBack  0.06368029 4.764098e-04
## Recline      0.04908474 2.246446e-03
## 
## $`Dim 1`$category
##                        Estimate      p.value
## Upholst_No          0.508362663 1.094774e-54
## Soft_No             0.388494836 5.840206e-41
## Swivel_No           0.402810868 5.393205e-23
## Roll_No             0.427504866 2.728416e-21
## Wood                0.293104780 4.464862e-16
## NotSpec             0.643951013 4.899545e-13
## Eat                 0.234142965 5.684581e-13
## Back_Mid            0.358373579 1.779517e-12
## SeatHeight_Norm     0.143680221 2.417614e-12
## Arms_No             0.264219623 2.133731e-11
## SeatDepth_Norm      0.447005585 6.603795e-10
## Plastic             0.346541022 1.681653e-08
## SaveSpace_stack     0.267052677 4.199531e-07
## Children            0.236113583 4.047690e-04
## ReclineBack_No      0.163875900 4.764098e-04
## SaveSpace_collapse  0.252603082 1.181098e-03
## Rattan              0.124400133 1.696248e-03
## Recline_No          0.182956176 2.246446e-03
## SeatHeight_High     0.538230447 4.514663e-03
## Back_Low            0.737133741 6.215202e-03
## Recline_Yes        -0.182956176 2.246446e-03
## ReclineBack_Yes    -0.163875900 4.764098e-04
## Adult              -0.236113583 4.047690e-04
## SeatDepth_Adjust   -0.437445200 1.207452e-04
## Back_Adjust        -0.875345607 1.096450e-05
## Relax              -0.463574619 8.503520e-06
## SeatDepth_Deep     -0.009560385 8.078739e-06
## Fabric             -0.704604631 6.260860e-10
## SaveSpace_No       -0.519655759 2.401817e-10
## Leather            -0.835048584 8.166132e-11
## Back_High          -0.220161713 5.595166e-11
## Arms_Yes           -0.264219623 2.133731e-11
## Work               -0.714944787 5.944443e-16
## SeatHeight_Adjust  -0.681910668 7.980545e-21
## Roll_Yes           -0.427504866 2.728416e-21
## Swivel_Yes         -0.402810868 5.393205e-23
## Soft_Yes           -0.575087143 9.455970e-46
## Upholst_Yes        -0.508362663 1.094774e-54

$`Dim 1`$quali shows the statistics
- R²: how strongly each variable is associated with the dimension
$`Dim 1`$category provides information on the directionality of those associations
- estimates of simple linear regression coefficients
  - if the estimate is positive, the category is located in the right-hand part of the plot
  - if it is negative, the category is located on the left
- the greater the deviation from zero, the stronger the effect
Dimension 1 contrasts highly comfortable chairs (left) with less comfortable ones (right)
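
The two kinds of statistics can be illustrated on made-up numbers. The sketch assumes that the R² corresponds to a one-way ANOVA of the coordinates on a variable and that the category estimates correspond to regression coefficients under sum-to-zero contrasts. This is consistent with the symmetric estimates in the output (for example Upholst_No and Upholst_Yes), but it is an assumption about dimdesc(), not a copy of its code. The coordinates below are hypothetical, not taken from chairs.ca.

```r
# Hypothetical coordinates of eight exemplars on one dimension
coord   <- c(0.9, 0.7, 0.8, 0.6, -0.5, -0.7, -0.6, -0.9)
upholst <- factor(c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"))

# Linear model with sum-to-zero contrasts: each estimate is the deviation
# of a category mean from the mean of the category means
fit <- lm(coord ~ upholst, contrasts = list(upholst = "contr.sum"))

r2      <- summary(fit)$r.squared   # strength of the association
est_no  <- unname(coef(fit)[2])     # estimate for the first level ("No")
est_yes <- -est_no                  # the last level is minus the sum of the others

c(R2 = r2, Upholst_No = est_no, Upholst_Yes = est_yes)
```

With two levels the estimates are mirror images of each other, and the positive one marks the category that lies on the positive side of the dimension.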

The contributions of different variables : Dimension 2

dimdesc(chairs.ca,axes=2)

## $`Dim 2`
## $`Dim 2`$quali
##                      R2      p.value
## Function     0.78584122 4.217810e-60
## SeatDepth    0.64716353 1.414314e-42
## ReclineBack  0.57438079 2.422581e-36
## SeatHeight   0.52515707 1.204684e-30
## Roll         0.46545580 4.290655e-27
## Recline      0.27597436 9.940716e-15
## Swivel       0.26129650 6.598620e-14
## Back         0.14372484 2.682914e-06
## AddFunctions 0.11015232 2.049744e-05
## Upholst      0.05391569 1.343894e-03
## Arms         0.05093462 1.845045e-03
## Rock         0.05020680 1.993581e-03
## MaterialSeat 0.10623070 1.571677e-02
## Age          0.02829025 2.103903e-02
## Soft         0.03925557 2.461661e-02
## 
## $`Dim 2`$category
##                      Estimate      p.value
## ReclineBack_No     0.43818645 2.422581e-36
## SeatHeight_Adjust  0.42064264 1.263155e-28
## Roll_Yes           0.41932722 4.290655e-27
## Work               0.52198563 1.502005e-24
## SeatDepth_Norm     0.04767135 6.407920e-17
## Recline_No         0.38623766 9.940716e-15
## Swivel_Yes         0.28673582 6.598620e-14
## SeatDepth_Adjust   0.69693231 8.045886e-08
## Back_Adjust        0.78717617 1.046328e-07
## AddFunctions_No    0.17691154 4.792253e-04
## Upholst_No         0.12305293 1.343894e-03
## Arms_No            0.11456915 1.845045e-03
## Rock_No            0.39410033 1.993581e-03
## Wood               0.19020718 3.350552e-03
## Children           0.13845645 2.103903e-02
## Soft_No            0.01136842 2.286251e-02
## Plastic            0.20449930 3.735216e-02
## Fabric            -0.15465893 2.219771e-02
## Adult             -0.13845645 2.103903e-02
## Soft_Yes          -0.17784951 8.232290e-03
## Rock_Rock         -0.39410033 1.993581e-03
## Arms_Yes          -0.11456915 1.845045e-03
## Upholst_Yes       -0.12305293 1.343894e-03
## AddFunctions_Bed  -0.69031884 5.405031e-06
## Swivel_No         -0.28673582 6.598620e-14
## Recline_Yes       -0.38623766 9.940716e-15
## Roll_No           -0.41932722 4.290655e-27
## SeatHeight_Norm   -0.46315824 3.007332e-30
## SeatDepth_Deep    -0.74460367 2.505087e-36
## ReclineBack_Yes   -0.43818645 2.422581e-36
## Relax             -0.63329716 2.835996e-45

Functionality is still a distinctive feature: chairs for relaxation vs. chairs for work
There are three distinct categories of chairs
- comfortable chairs for relaxation
- comfortable adjustable chairs for work
- multifunctional chairs for the household
In the middle of the plot
- comfortable chairs for the dining room
How are these differences related to the lexical categories under investigation, Stuhl and Sessel?

chairs.ca1<-MCA(chairs[,-c(1:2)],quali.sup=1,graph=FALSE)
plot(chairs.ca1, invis="ind", col.var="darkgrey",col.quali.sup="black")

Stuhl
- upper right quadrant of the map
- the features are associated with simple practical chairs
- relatively close to the centre
  - most office chairs for work are also categorized as Stuhl
Sessel
- located at the bottom
- the features are associated with comfortable chairs for relaxation
The functional differentiation is crucial

95% confidence ellipses

Ellipses around the centroids of Stuhl and Sessel
- the centroids can be treated as the prototypes of the categories
- since the confidence ellipses do not overlap, the prototypes can be regarded as distinct

plotellipses(chairs.ca1, keepvar=1, label="quali")

Ellipses around all exemplars that represent each category
- they display significant overlap, which supports Gipper’s observation about fuzzy boundaries between the categories

plotellipses(chairs.ca1, means=FALSE, keepvar=1, label="quali")

Eigenvalues

In MCA the proportion of explained variance tends to be very modest because the total variance is inflated
The first two dimensions represent only 27.42%

head(chairs.ca$eig,11)

##        eigenvalue percentage of variance cumulative percentage of variance
## dim 1  0.32507257              15.297533                          15.29753
## dim 2  0.25767552              12.125907                          27.42344
## dim 3  0.13519020               6.361892                          33.78533
## dim 4  0.12293223               5.785046                          39.57038
## dim 5  0.10891028               5.125190                          44.69557
## dim 6  0.09618531               4.526367                          49.22193
## dim 7  0.09019392               4.244420                          53.46635
## dim 8  0.08619851               4.056401                          57.52275
## dim 9  0.08165427               3.842554                          61.36531
## dim 10 0.07264654               3.418661                          64.78397
## dim 11 0.07066398               3.325364                          68.10933
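
The inflation can be made concrete. In an MCA of the indicator matrix the total inertia depends only on the number of variables Q and the total number of categories J, namely (J - Q) / Q. The sketch below takes the numbers of levels from the str(chairs) output above and the first two eigenvalues from the table above. The variable names are illustrative.

```r
# Number of levels of the 16 variables used in the MCA (from str(chairs))
n_levels <- c(Function = 5, Age = 2, Back = 4, Soft = 3, Arms = 2,
              Upholst = 2, MaterialSeat = 10, SeatHeight = 3, SeatDepth = 3,
              Swivel = 2, Roll = 2, Rock = 2, AddFunctions = 3,
              Recline = 2, ReclineBack = 2, SaveSpace = 3)

Q <- length(n_levels)            # number of variables
J <- sum(n_levels)               # total number of categories

total_inertia <- (J - Q) / Q     # fixed by the coding, not by the data
total_inertia

# First two eigenvalues copied from the output above
eig <- c(0.32507257, 0.25767552)
pct <- 100 * eig / total_inertia
round(pct, 4)
sum(pct)                         # share of the first two dimensions
```

Because the total grows with the number of categories, even strong dimensions receive a small percentage, which is the motivation for the adjusted solution.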

Adjusted MCA

Adjusted MCA estimates explained variation more realistically

chairs.ca2<-mjca(chairs[,-c(1:3)])
summary(chairs.ca2,rows=FALSE,columns=FALSE)

## 
## Principal inertias (eigenvalues):
## 
##  dim    value      %   cum%   scree plot               
##  1      0.078443  47.1  47.1  **************           
##  2      0.043342  26.0  73.2  ********                 
##  3      0.006012   3.6  76.8  *                        
##  4      0.004155   2.5  79.3  *                        
##  5      0.002451   1.5  80.8                           
##  6      0.001291   0.8  81.5                           
##  7      0.000873   0.5  82.1                           
##  8      0.000639   0.4  82.4                           
##  9      0.000417   0.3  82.7                           
##  10     0.000117   0.1  82.8                           
##  11     7.6e-050   0.0  82.8                           
##  12     1e-05000   0.0  82.8                           
##         -------- -----                                 
##  Total: 0.166428

The first two dimensions represent 73.2% of inertia
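
The adjusted values can be related to the unadjusted ones with a short calculation. The sketch assumes the adjustment proposed by Greenacre, which rescales each unadjusted eigenvalue that exceeds 1/Q, where Q is the number of variables. The eigenvalues are copied from the FactoMineR output and the adjusted total from the mjca() summary above; nothing is recomputed from the data.

```r
Q      <- 16                          # number of variables in the MCA
lambda <- c(0.32507257, 0.25767552)   # unadjusted eigenvalues (dimensions 1 and 2)

# Adjusted principal inertias; only meaningful where lambda > 1 / Q
adj <- (Q / (Q - 1))^2 * (lambda - 1 / Q)^2
round(adj, 6)

# Adjusted total inertia copied from the mjca() summary above
total_adj <- 0.166428
pct_adj   <- 100 * adj / total_adj
round(pct_adj, 1)
sum(pct_adj)                          # share of the first two dimensions
```

The results should be very close to the first two rows of the mjca() summary, which suggests that the two analyses differ in scaling rather than in substance.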
But are the non-adjusted and adjusted solutions equivalent?

A correlation analysis of the coordinates of features of the first two dimensions

cor(chairs.ca$var$coord[,1],chairs.ca2$colcoord[,1])

## [1] 1

cor(chairs.ca$var$coord[,2],chairs.ca2$colcoord[,2])

## [1] -1

The solutions are practically identical, differing only in scale and in the orientation of the second dimension in chairs.ca2 (upside down)
Therefore, our initial two-dimensional representation is not perfect, but it is acceptable for a pilot study

Reducing the correlated variables to a smaller set of underlying dimensions

Intercorrelated features of categorical variables are pervasive in linguistic practice
- think, believe
  - mental state, animate, first argument, complement clause…
The MCA dimensions can then be used as predictors in a logistic regression model

dim1 <- chairs.ca$ind$coord[,1] # coordinates of individual exemplars on the horizontal axis
dim2 <- chairs.ca$ind$coord[,2] # the same for the vertical axis
m<-lrm(chairs$Category ~dim1+dim2)
m

## Logistic Regression Model
##  
##  lrm(formula = chairs$Category ~ dim1 + dim2)
##  
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                           Ratio Test           Indexes           Indexes       
##  Obs           188    LR chi2     118.82    R2       0.643    C       0.921    
##   Sessel        67    d.f.             2    g        2.667    Dxy     0.842    
##   Stuhl        121    Pr(> chi2) <0.0001    gr      14.394    gamma   0.844    
##  max |deriv| 2e-06                          gp       0.386    tau-a   0.388    
##                                             Brier    0.094                     
##  
##            Coef   S.E.   Wald Z Pr(>|Z|)
##  Intercept 0.9833 0.2448 4.02   <0.0001 
##  dim1      2.1780 0.5319 4.09   <0.0001 
##  dim2      3.9151 0.5377 7.28   <0.0001 
##

The two dimensions have high predictive power for the choice between the lexical categories (C = 0.921)
The coefficients show that the higher the value of an exemplar on Dimension 1 and Dimension 2 of the MCA, the greater the chances of it being categorized as a Stuhl
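
To make the coefficients more tangible, they can be turned into odds ratios and predicted probabilities. The sketch below copies the coefficients from the model output above and assumes, as the interpretation above implies, that the model predicts the probability of Stuhl. The coordinates passed to the hypothetical helper p_stuhl() are invented for illustration and do not correspond to any exemplar in the data.

```r
# Coefficients copied from the lrm() output above
b <- c(Intercept = 0.9833, dim1 = 2.1780, dim2 = 3.9151)

# Odds ratios: change in the odds of Stuhl per one-unit increase on a dimension
odds_ratio <- exp(b[c("dim1", "dim2")])
round(odds_ratio, 1)

# Predicted probability of Stuhl for given MCA coordinates (hypothetical helper)
p_stuhl <- function(dim1, dim2) {
  unname(plogis(b["Intercept"] + b["dim1"] * dim1 + b["dim2"] * dim2))
}

p_origin <- p_stuhl(0, 0)          # an exemplar at the centre of the map
p_upper  <- p_stuhl(0.5, 0.5)      # invented point in the upper right quadrant
p_lower  <- p_stuhl(-0.5, -0.5)    # invented point in the lower left quadrant
round(c(origin = p_origin, upper_right = p_upper, lower_left = p_lower), 3)
```

The predicted probability of Stuhl rises towards the upper right of the map and falls towards the lower left, matching the positions of the two categories.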

How to report the results of Correspondence Analysis

The plots
The number of dimensions with the proportions of explained variance (inertia)
The features that contribute to the orientation of the dimensions
For SCA, the $\chi$²-statistic, the degrees of freedom and the p-value
For MCA, it is recommended to add the results of a confirmatory logistic regression

김소희

14 December 2018
