ANLY540 - Analysis of Human Language - Assignment 10: Correspondence Analysis

Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

library(ca)
library(dplyr)
library(FactoMineR)

Simple Correspondence Analysis

The Data

Women and metonymy in Ancient Chinese: the data concerns metonymic patterns that were used to refer to women in texts of the Ming dynasty in China (1368–1644). The rows are different types of female referents, namely, imperial woman (queen or emperor’s concubine), servant girl, beautiful woman, mother or grandmother, unchaste woman (prostitute or mistress), young girl, wife (or concubine). The columns are six metonymic patterns:

Action for agent or patient, e.g. “to ruin state” for “beautiful woman”
Body part for whole, e.g. “powder-heads” for “prostitutes”
Location for located, e.g. “the middle palace” for “queen”
A piece of clothing for person, e.g. “red dress” for “beautiful woman”
Characteristic for person, e.g. “respectable-kind” for “mother”
Possessed for possessor, e.g. “blusher and powder” for “beautiful woman”

Import the data and create a mosaic plot to visualize the differences in usage across women references.

A mosaic plot is a visualization of the standardized residuals from a chi-square type analysis.
The size of each box reflects the observed cell count.
The shading reflects the direction and strength of the residuals.

# load data
chinese_names <- read.csv("chinese_names.csv")
rownames(chinese_names) <- chinese_names[,1]
chinese_names <- chinese_names[,-1]

#mosaic plot
mosaicplot(chinese_names, #dataframe
           las = 2, #axis label style (perpendicular)
           shade = TRUE, #color in the boxes
           main = "Metonymic patterns")

The length and width of each box reflect the size of the category. Wife and beautiful are the most common female references, while mother and young are the least common. Location seems to be the most common pattern for almost all female references. An exception is the beautiful reference, where location is the least common.
Actual vs. expected occurrences:
- Action for agent occurrences seem to be more than expected for mother references.
- Body part occurrences seem to be more than expected for beautiful and unchaste female references and less than expected for imperial female references.
- Location occurrences seem to be more than expected for imperial and wife female references and less than expected for beautiful, unchaste, and young female references.
- Clothing occurrences seem to be more than expected for beautiful and young female references and less than expected for imperial and wife female references.
- Characteristic occurrences seem to be more than expected for beautiful and young female references and less than expected for imperial female references.
- Possessed occurrences seem to be more than expected for beautiful female references and less than expected for unchaste female references.
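
The more-than-expected and less-than-expected readings above correspond to the sign of the Pearson residuals from a chi-square test of independence. A minimal sketch on a small made-up table (the counts and the labels `r1`..`r3`, `p1`..`p3` are hypothetical, not the assignment data):

```r
# Hypothetical counts: rows stand in for referents, columns for metonymic patterns
toy <- matrix(c(30, 10,  5,
                 8, 20, 12,
                 6,  9, 25),
              nrow = 3, byrow = TRUE,
              dimnames = list(referent = c("r1", "r2", "r3"),
                              pattern  = c("p1", "p2", "p3")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(toy), colSums(toy)) / sum(toy)

# Pearson residuals: positive = more than expected, negative = less than expected
pearson <- (toy - expected) / sqrt(expected)
round(pearson, 2)

# chisq.test() reports the same residuals
test <- chisq.test(toy)
all.equal(unname(pearson), unname(test$residuals))
```

Cells with large positive residuals are the ones a shaded mosaic plot highlights as over-represented, and cells with large negative residuals as under-represented.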

The Analysis

Run a simple correspondence analysis on the data.

A simple correspondence analysis identifies systematic relationships between variables in low dimensional space.
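
As a rough illustration of what such an analysis computes, the sketch below carries out the standard SVD-based calculation in base R on a small made-up table. The table and all variable names (`N`, `r_mass`, `c_mass`, `S`) are hypothetical, and this is not necessarily how `ca()` is implemented internally.

```r
# Hypothetical contingency table (not the assignment data)
N <- matrix(c(30, 10,  5,
               8, 20, 12,
               6,  9, 25),
            nrow = 3, byrow = TRUE)

P      <- N / sum(N)   # correspondence matrix (proportions)
r_mass <- rowSums(P)   # row masses
c_mass <- colSums(P)   # column masses

# Standardized residuals from the independence model
S <- diag(1 / sqrt(r_mass)) %*% (P - r_mass %o% c_mass) %*% diag(1 / sqrt(c_mass))

decomp   <- svd(S)
inertias <- decomp$d^2  # principal inertias (eigenvalues)

# Row principal coordinates; the first two columns give a 2D map
row_coords <- diag(1 / sqrt(r_mass)) %*% decomp$u %*% diag(decomp$d)
row_coords[, 1:2]

# Total inertia equals the chi-square statistic divided by the table total
sum(inertias)
unname(chisq.test(N)$statistic) / sum(N)
```

The last two lines print the same number, which is why the inertias can be read as shares of the table's overall departure from independence.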

sca_model <- ca(chinese_names)
summary(sca_model)

## 
## Principal inertias (eigenvalues):
## 
##  dim    value      %   cum%   scree plot               
##  1      0.762754  79.1  79.1  ********************     
##  2      0.166546  17.3  96.4  ****                     
##  3      0.021321   2.2  98.6  *                        
##  4      0.012843   1.3 100.0                           
##  5      0.000249   0.0 100.0                           
##         -------- -----                                 
##  Total: 0.963714 100.0                                 
## 
## 
## Rows:
##     name   mass  qlt  inr    k=1 cor ctr     k=2 cor ctr  
## 1 | Impr |  133  982   57 | -638 981  71 |    20   1   0 |
## 2 | Btfl |  250  999  499 | 1363 965 608 |   254  34  97 |
## 3 | Mthr |   30   80   18 | -207  75   2 |   -53   5   1 |
## 4 | Unch |   64  998  194 |  824 231  57 | -1503 768 862 |
## 5 | Yong |   26  407   27 |  422 176   6 |   483 230  36 |
## 6 | Wife |  498  995  205 | -627 991 257 |    37   4   4 |
## 
## Columns:
##     name   mass  qlt  inr    k=1 cor ctr     k=2 cor ctr  
## 1 | Actn |   32   65   18 | -186  65   1 |    20   1   0 |
## 2 | Bdyp |  106 1000  282 | 1221 580 206 | -1039 420 684 |
## 3 | Lctn |  628  998  262 | -634 998 331 |    -2   0   0 |
## 4 | Clth |   79  795   65 |  687 589  49 |   406 206  78 |
## 5 | Chrc |   77  999  174 | 1397 900 198 |   462  98  99 |
## 6 | Psss |   78  976  199 | 1450 856 215 |   544 120 139 |

What do the inertia values tell you about the dimensionality of the data?

The inertias tell us how much of the variation each dimension captures and how many dimensions are needed to capture most of it. We want to represent the relationship between the variables in as few dimensions as possible.

In this case, the first two dimensions capture 96.4% of the variance (inertia).
Four dimensions already capture essentially all of it (100.0% after rounding).

We can therefore use just 2 dimensions, as they capture nearly all the variance.
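
The percentages can be reproduced from the principal inertias in the summary output above. A small sketch using the printed values (the variable names `eig`, `pct`, and `cum_pct` are illustrative):

```r
# Principal inertias copied from the summary(sca_model) output above
eig <- c(0.762754, 0.166546, 0.021321, 0.012843, 0.000249)

pct     <- 100 * eig / sum(eig)  # share of the total inertia per dimension
cum_pct <- cumsum(pct)           # running total across dimensions

data.frame(dim     = seq_along(eig),
           pct     = round(pct, 1),
           cum_pct = round(cum_pct, 1))
```

The second row of `cum_pct` is the 96.4% used to justify keeping two dimensions.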

Create a 2D plot of the data.

plot(sca_model)

What can you tell about the word usage from examining this plot?

Location and Action occurrences tend to appear with Mother, Wife, and Imperial female references.
Clothing occurrences tend to cluster with Young female references.
Characteristic and Possessed occurrences tend to be common with Beautiful female references.
Body part occurrences are usually common with Unchaste female references.

It makes sense that these six metonymic patterns occur with these specific female references, and nothing here is unexpected.

Multiple Correspondence Analysis

The data included is from a large project that examines the definitions of words and thereby explores their category requirements. The following columns are included:

Cue: the word participants saw in the study, what they gave a definition for.
POS_Cue: the part of speech of the cue word.
POS_Feature: the part of speech for the feature word they listed (e.g., for zebra-stripes, stripes would be the feature).
POS_Translated: these features were then translated into a root form, and this column denotes the part of speech for the translated word.
A1 and A2: the type of affix that was used in the feature. For example, ducks would be translated to duck, and the difference is a numerical marker for the affix of s.

Run a multiple correspondence analysis on the data, excluding the cue column.

mca_data <- read.csv("mca_data.csv")
mca_model <- MCA(mca_data[,-1], #dataset minus the first column
                 graph = FALSE)
summary(mca_model)

## 
## Call:
## MCA(X = mca_data[, -1], graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.410   0.377   0.350   0.286   0.258   0.234
## % of var.              5.855   5.381   4.999   4.082   3.680   3.340
## Cumulative % of var.   5.855  11.237  16.236  20.318  23.999  27.338
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.225   0.219   0.213   0.208   0.206   0.203
## % of var.              3.215   3.126   3.036   2.968   2.943   2.903
## Cumulative % of var.  30.554  33.679  36.715  39.683  42.626  45.529
##                       Dim.13  Dim.14  Dim.15  Dim.16  Dim.17  Dim.18
## Variance               0.201   0.201   0.200   0.200   0.200   0.200
## % of var.              2.875   2.872   2.864   2.862   2.858   2.855
## Cumulative % of var.  48.404  51.276  54.140  57.002  59.861  62.716
##                       Dim.19  Dim.20  Dim.21  Dim.22  Dim.23  Dim.24
## Variance               0.200   0.198   0.196   0.196   0.193   0.191
## % of var.              2.850   2.832   2.806   2.799   2.764   2.725
## Cumulative % of var.  65.566  68.399  71.205  74.003  76.767  79.492
##                       Dim.25  Dim.26  Dim.27  Dim.28  Dim.29  Dim.30
## Variance               0.188   0.183   0.173   0.169   0.152   0.138
## % of var.              2.681   2.615   2.478   2.416   2.165   1.966
## Cumulative % of var.  82.173  84.787  87.266  89.682  91.847  93.813
##                       Dim.31  Dim.32  Dim.33  Dim.34  Dim.35
## Variance               0.122   0.110   0.089   0.070   0.043
## % of var.              1.738   1.575   1.269   0.994   0.611
## Cumulative % of var.  95.551  97.126  98.395  99.389 100.000
## 
## Individuals (the 10 first)
##                             Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                        |  0.140  0.000  0.008 | -0.168  0.000  0.012 |
## 2                        | -0.406  0.001  0.057 |  0.785  0.005  0.214 |
## 3                        | -0.914  0.006  0.210 |  0.845  0.005  0.180 |
## 4                        | -0.307  0.001  0.017 |  0.709  0.004  0.091 |
## 5                        |  1.750  0.021  0.267 |  1.803  0.024  0.284 |
## 6                        |  0.090  0.000  0.008 | -0.331  0.001  0.108 |
## 7                        |  0.736  0.004  0.256 |  0.093  0.000  0.004 |
## 8                        | -0.130  0.000  0.009 | -0.794  0.005  0.321 |
## 9                        |  0.108  0.000  0.007 | -0.562  0.002  0.197 |
## 10                       |  1.750  0.021  0.267 |  1.803  0.024  0.284 |
##                           Dim.3    ctr   cos2  
## 1                        -0.339  0.001  0.049 |
## 2                         0.031  0.000  0.000 |
## 3                         0.124  0.000  0.004 |
## 4                         0.622  0.003  0.070 |
## 5                        -1.882  0.029  0.309 |
## 6                        -0.383  0.001  0.144 |
## 7                         0.524  0.002  0.129 |
## 8                        -0.525  0.002  0.140 |
## 9                        -0.111  0.000  0.008 |
## 10                       -1.882  0.029  0.309 |
## 
## Categories (the 10 first)
##                               Dim.1      ctr     cos2   v.test      Dim.2
## pos_cue_adjective        |    0.436    1.281    0.030   32.859 |    0.104
## pos_cue_noun             |   -0.111    0.436    0.032  -33.510 |   -0.107
## pos_cue_other            |    0.677    0.458    0.010   18.432 |    0.697
## pos_cue_verb             |    0.049    0.014    0.000    3.442 |    0.393
## pos_feature_adjective    |    0.928    9.803    0.262   96.383 |   -0.038
## pos_feature_noun         |   -0.097    0.208    0.008  -16.577 |   -0.780
## pos_feature_other        |    2.499   13.357    0.286  100.736 |    2.203
## pos_feature_verb         |   -1.039   14.316    0.403 -119.490 |    0.972
## pos_translated_adjective |    1.013    8.972    0.224   89.111 |   -0.018
## pos_translated_noun      |   -0.032    0.024    0.001   -5.727 |   -0.579
##                               ctr     cos2   v.test      Dim.3      ctr
## pos_cue_adjective           0.079    0.002    7.810 |    0.627    3.098
## pos_cue_noun                0.439    0.029  -32.253 |   -0.108    0.480
## pos_cue_other               0.528    0.010   18.971 |   -0.561    0.368
## pos_cue_verb                1.012    0.022   27.764 |    0.022    0.003
## pos_feature_adjective       0.018    0.000   -3.946 |    1.123   16.834
## pos_feature_noun           14.558    0.500 -133.062 |   -0.484    6.033
## pos_feature_other          11.299    0.223   88.821 |   -2.379   14.184
## pos_feature_verb           13.617    0.352  111.722 |    0.222    0.764
## pos_translated_adjective    0.003    0.000   -1.573 |    0.860    7.579
## pos_translated_noun         8.407    0.300 -103.194 |   -0.215    1.246
##                              cos2   v.test  
## pos_cue_adjective           0.063   47.210 |
## pos_cue_noun                0.030  -32.503 |
## pos_cue_other               0.007  -15.274 |
## pos_cue_verb                0.000    1.531 |
## pos_feature_adjective       0.384  116.710 |
## pos_feature_noun            0.192  -82.559 |
## pos_feature_other           0.260  -95.919 |
## pos_feature_verb            0.018   25.509 |
## pos_translated_adjective    0.162   75.680 |
## pos_translated_noun         0.041  -38.283 |
## 
## Categorical variables (eta2)
##                            Dim.1 Dim.2 Dim.3  
## pos_cue                  | 0.045 0.039 0.069 |
## pos_feature              | 0.772 0.744 0.662 |
## pos_translated           | 0.638 0.497 0.471 |
## a1                       | 0.542 0.491 0.372 |
## a2                       | 0.053 0.112 0.176 |

Plot the variables in a 2D graph. Use invis = "ind" rather than col.ind = "gray" so you can read the plot better.

plot(mca_model, cex = 0.7,
     col.var = "black", #color the variable names
     invis = "ind") #hide the individuals so the variable labels are readable

Use the dimdesc function to show the usefulness of the variables and to help you understand the results. Remember that the markdown preview doesn’t show you the whole output; use the console or knit to see the complete results.

dimdesc(mca_model)

## $`Dim 1`
## $`Dim 1`$quali
##                        R2 p.value
## pos_cue        0.04486592       0
## pos_feature    0.77226714       0
## pos_translated 0.63752821       0
## a1             0.54220339       0
## a2             0.05251250       0
## 
## $`Dim 1`$category
##                                            Estimate       p.value
## a1=a1_not                                0.90844500  0.000000e+00
## a1=a1_none                               0.40599205  0.000000e+00
## a1=a1_characteristic                     0.44374893  0.000000e+00
## pos_translated=pos_translated_other      1.18985308  0.000000e+00
## pos_translated=pos_translated_adjective  0.22085451  0.000000e+00
## pos_feature=pos_feature_other            1.23323912  0.000000e+00
## pos_feature=pos_feature_adjective        0.22733731  0.000000e+00
## a2=a2_characteristic                     0.75884682 7.624568e-275
## pos_cue=pos_cue_adjective                0.11113680 1.957209e-240
## a1=a1_magnitude                          1.08403991 2.555276e-229
## pos_cue=pos_cue_other                    0.26531878  3.192972e-76
## a1=a1_location                           0.62809919  3.253969e-28
## a2=a2_actions_process                    0.49732949  2.455727e-21
## a2=a2_past_tense                         0.31004858  4.340918e-15
## a2=a2_magnitude                          0.79477524  1.346548e-06
## a2=a2_not                                0.30411117  2.336047e-03
## pos_cue=pos_cue_verb                    -0.13693653  5.767695e-04
## a1=a1_opposites_wrong                   -0.43767077  1.663935e-07
## pos_translated=pos_translated_noun      -0.44810770  1.016708e-08
## a2=a2_third_person                      -0.79493852  4.336728e-21
## a1=a1_actions_process                   -0.07588789  3.209269e-27
## a1=a1_numbers                           -0.04384170  1.620015e-28
## a2=a2_none                              -0.01593500  8.436091e-36
## a1=a1_time                              -0.59287143  5.241091e-44
## a1=a1_person_object                     -0.15561204  2.577109e-55
## pos_feature=pos_feature_noun            -0.42871100  6.045533e-62
## a2=a2_numbers                           -0.39197800  6.024356e-85
## pos_cue=pos_cue_noun                    -0.23951905 3.982554e-250
## a1=a1_third_person                      -0.77024497  0.000000e+00
## a1=a1_present_participle                -0.63592049  0.000000e+00
## a1=a1_past_tense                        -0.65100137  0.000000e+00
## pos_translated=pos_translated_verb      -0.96259989  0.000000e+00
## pos_feature=pos_feature_verb            -1.03186543  0.000000e+00
## 
## 
## $`Dim 2`
## $`Dim 2`$quali
##                        R2      p.value
## pos_feature    0.74384040  0.00000e+00
## pos_translated 0.49728416  0.00000e+00
## a1             0.49145783  0.00000e+00
## a2             0.11215962  0.00000e+00
## pos_cue        0.03876337 1.57991e-303
## 
## $`Dim 2`$category
##                                        Estimate       p.value
## a2=a2_none                           0.46089226  0.000000e+00
## a1=a1_third_person                   0.51146419 1.618226e-310
## a1=a1_present_participle             0.33965787  0.000000e+00
## a1=a1_past_tense                     0.70251606  0.000000e+00
## a1=a1_none                           0.22524679  0.000000e+00
## pos_translated=pos_translated_other  1.14565556  0.000000e+00
## pos_feature=pos_feature_verb         0.23468643  0.000000e+00
## pos_feature=pos_feature_other        0.99056909  0.000000e+00
## pos_cue=pos_cue_verb                 0.07463552 1.685349e-171
## pos_cue=pos_cue_other                0.26098378  1.195717e-80
## a2=a2_characteristic                 0.05931649  6.502988e-67
## a1=a1_time                           0.47867566  1.532936e-27
## a2=a2_past_tense                     0.82693211  1.668765e-26
## a1=a1_not                            0.13369637  3.630435e-12
## a2=a2_third_person                   0.76640629  1.546967e-05
## a1=a1_opposites_wrong                0.37650503  2.149196e-05
## a2=a2_actions_process                0.22706357  1.598349e-04
## a2=a2_present_participle             0.51464745  3.204520e-02
## a2=a2_location                      -0.46664326  4.141076e-02
## a2=a2_magnitude                     -0.05803628  2.635872e-03
## pos_feature=pos_feature_adjective   -0.38500354  7.950096e-05
## a2=a2_time                          -1.06594529  1.312428e-06
## a2=a2_slang                         -1.28436035  5.451363e-10
## a1=a1_slang                         -0.75728119  2.233746e-13
## pos_cue=pos_cue_adjective           -0.10306155  5.565805e-15
## a2=a2_person_object                 -0.44031429  1.315954e-48
## a1=a1_characteristic                -0.20956663 5.073298e-164
## pos_cue=pos_cue_noun                -0.23255775 1.362376e-231
## a1=a1_actions_process               -0.43518917 1.178695e-297
## a2=a2_numbers                       -0.65840627  0.000000e+00
## a1=a1_person_object                 -0.78517764  0.000000e+00
## a1=a1_numbers                       -0.64618373  0.000000e+00
## pos_translated=pos_translated_verb  -0.01655096  0.000000e+00
## pos_translated=pos_translated_noun  -0.73664796  0.000000e+00
## pos_feature=pos_feature_noun        -0.84025199  0.000000e+00
## 
## 
## $`Dim 3`
## $`Dim 3`$quali
##                        R2 p.value
## pos_cue        0.06910587       0
## pos_feature    0.66167384       0
## pos_translated 0.47105205       0
## a1             0.37233929       0
## a2             0.17557189       0
## 
## $`Dim 3`$category
##                                            Estimate       p.value
## a2=a2_past_tense                         1.37281862  0.000000e+00
## a1=a1_not                                1.24512339  0.000000e+00
## a1=a1_magnitude                          0.96468972  0.000000e+00
## a1=a1_characteristic                     0.11159328  0.000000e+00
## pos_translated=pos_translated_noun       0.17816523  0.000000e+00
## pos_translated=pos_translated_adjective  0.81398871  0.000000e+00
## pos_feature=pos_feature_adjective        0.88895630  0.000000e+00
## pos_cue=pos_cue_adjective                0.37383401  0.000000e+00
## a2=a2_characteristic                     0.29589869 8.324527e-291
## a2=a2_actions_process                    0.97440704 1.682151e-183
## a1=a1_past_tense                         0.13687947 7.273683e-176
## pos_feature=pos_feature_verb             0.35574385 7.674028e-145
## a2=a2_present_participle                 0.65166458 7.341818e-139
## pos_translated=pos_translated_verb       0.40781191 1.994576e-107
## a2=a2_not                                0.62677356  1.293838e-29
## a1=a1_opposites_wrong                    0.75988965  1.924203e-28
## a2=a2_magnitude                          0.90747258  2.046611e-18
## a1=a1_time                               0.12088008  5.429633e-14
## a2=a2_third_person                       0.04277301  1.224576e-09
## a2=a2_opposites_wrong                    0.50126199  2.618666e-02
## a1=a1_location                          -0.33129161  7.810088e-03
## a2=a2_slang                             -1.58052637  1.335070e-05
## a1=a1_slang                             -0.82940872  2.110012e-10
## pos_cue=pos_cue_other                   -0.32882152  7.824926e-53
## a2=a2_numbers                           -0.95317928 1.870750e-180
## a1=a1_person_object                     -0.57595086 1.443391e-197
## pos_cue=pos_cue_noun                    -0.06090470 3.165924e-235
## a1=a1_none                              -0.36446676 7.539803e-258
## a2=a2_none                              -0.46049026 1.108574e-278
## a1=a1_numbers                           -0.61255479  0.000000e+00
## pos_translated=pos_translated_other     -1.39996585  0.000000e+00
## pos_feature=pos_feature_other           -1.18301264  0.000000e+00
## pos_feature=pos_feature_noun            -0.06168751  0.000000e+00

What are the largest predictors (i.e., R^2 over .25) of the first dimension?

The largest predictors of the first dimension in terms of their R^2 values are:

pos_feature - R^2 = 0.77226714
pos_translated - R^2 = 0.63752821
a1 - R^2 = 0.54220339
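
The same selection can be made programmatically. A minimal sketch using the dimension 1 values printed above (the vector `r2_dim1` is typed in by hand here; with the fitted model, the table sits under the `Dim 1` and `quali` elements shown in the dimdesc output):

```r
# R2 values for dimension 1, copied from the dimdesc(mca_model) output above
r2_dim1 <- c(pos_cue        = 0.04486592,
             pos_feature    = 0.77226714,
             pos_translated = 0.63752821,
             a1             = 0.54220339,
             a2             = 0.05251250)

# Keep predictors above the .25 cutoff, largest first
large_r2 <- sort(r2_dim1[r2_dim1 > 0.25], decreasing = TRUE)
large_r2
```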

Looking at the category output for dimension one, what types of features does this appear to represent? (Try looking at the largest positive estimates to help distinguish what is represented by this dimension).

Judging by the largest positive estimates, dimension one represents the following types of features:

pos_feature=pos_feature_other
pos_translated=pos_translated_other
a1=a1_magnitude
a1=a1_not
a2=a2_magnitude
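
This list can be checked by sorting the positive estimates. A short sketch with the positive dimension 1 estimates typed in from the category output above (the vector name `est_dim1` and the shortened labels are illustrative):

```r
# Positive category estimates for dimension 1, copied from the dimdesc output
est_dim1 <- c(a1_not                   = 0.90844500,
              a1_none                  = 0.40599205,
              a1_characteristic        = 0.44374893,
              pos_translated_other     = 1.18985308,
              pos_translated_adjective = 0.22085451,
              pos_feature_other        = 1.23323912,
              pos_feature_adjective    = 0.22733731,
              a2_characteristic        = 0.75884682,
              pos_cue_adjective        = 0.11113680,
              a1_magnitude             = 1.08403991,
              pos_cue_other            = 0.26531878,
              a1_location              = 0.62809919,
              a2_actions_process       = 0.49732949,
              a2_past_tense            = 0.31004858,
              a2_magnitude             = 0.79477524,
              a2_not                   = 0.30411117)

# Five largest positive estimates, in descending order
top5 <- head(sort(est_dim1, decreasing = TRUE), 5)
round(top5, 2)
```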

Simple Categories

To view simple categories like we did in the lecture, try picking a few words out of the dataset that might be considered similar. I’ve shown how to do this below with three words, but feel free to pick your own. Change the words and the DF to your dataframe name. We will overlay those as supplemental variables.

mca_data <- mca_data %>% rename(cue = 1) #rename the first (cue) column by position; its imported name carries encoding debris

#pick several interesting words
words <- c("mom", "family", "relative")

mca_model2 <- MCA(mca_data[mca_data$cue %in% words, ],
                  quali.sup = 1, #supplemental variable
                  graph = FALSE)

Create a 2D plot of your category analysis.

plot(mca_model2,
     invis = "ind",
     col.var = "darkgray",
     col.quali.sup = "black")

Add the prototype ellipses to the plot.

plotellipses(mca_model2, keepvar = 1, #use column 1 to label
             label = "quali")

Create a 95% CI type plot for the category.

plotellipses(mca_model2,
             means = FALSE,
             keepvar = 1, #use column 1 to label
             label = "quali")

What can you tell about the categories from these plots? Are they distinct or overlapping?

The confidence ellipses for family, relative, and mom all overlap, which reflects the fuzzy boundaries between these categories. We therefore can’t consider these prototypes as distinct entities.