Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report so that others know what they need for the report to compile correctly.

##r chunk

library(FactoMineR)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
library(ggplot2)
library(ca)
library(vcd)
## Loading required package: grid
library(mosaic)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: ggformula
## Loading required package: ggstance
## 
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh
## 
## New to ggformula?  Try the tutorials: 
##  learnr::run_tutorial("introduction", package = "ggformula")
##  learnr::run_tutorial("refining", package = "ggformula")
## Loading required package: mosaicData
## Loading required package: Matrix
## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.
## 
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
## 
##     mean
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## The following object is masked from 'package:vcd':
## 
##     mplot
## The following object is masked from 'package:ggplot2':
## 
##     stat
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
library(dplyr)

Simple Correspondence Analysis

The Data

Women and metonymy in Ancient Chinese: the data concerns metonymic patterns that were used to refer to women in texts of the Ming dynasty in China (1368 – 1644). The rows are different types of female referents, namely, imperial woman (queen or emperor’s concubine), servant girl, beautiful woman, mother or grandmother, unchaste woman (prostitute or mistress), young girl, wife (or concubine). The columns are six metonymic patterns:

  • Action for agent or patient, e.g. “to ruin state” for “beautiful woman”
  • Body part for whole, e.g. “powder-heads” for “prostitutes”
  • Location for located, e.g. “the middle palace” for “queen”
  • A piece of clothing for person, e.g. “red dress” for “beautiful woman”
  • Characteristic for person, e.g. “respectable-kind” for “mother”
  • Possessed for possessor, e.g. “blusher and powder” for “beautiful woman”

Import the data and create a mosaic plot to visualize the differences in usage across the types of female referents.

##r chunk

chinese_names = read.csv("chinese_names.csv")

head(chinese_names)
##         Name Action Bodypart Location Clothes Characteristic Possessed
## 1  Imperial      10        1      204       5              2         0
## 2 Beautiful       9       99        5      71            112       120
## 3    Mother       8        4       32       1              2         3
## 4  Unchaste       2       67       27       3              5         2
## 5     Young       1        2       16      14              7         3
## 6      Wife      24        3      763      37              1         2
str(chinese_names)
## 'data.frame':    6 obs. of  7 variables:
##  $ Name          : Factor w/ 6 levels "Beautiful ","Imperial ",..: 2 1 3 4 6 5
##  $ Action        : int  10 9 8 2 1 24
##  $ Bodypart      : int  1 99 4 67 2 3
##  $ Location      : int  204 5 32 27 16 763
##  $ Clothes       : int  5 71 1 3 14 37
##  $ Characteristic: int  2 112 2 5 7 1
##  $ Possessed     : int  0 120 3 2 3 2
rownames(chinese_names) = chinese_names[,1]
chinese_names = chinese_names[,-1]

#mosaic plot (shade = TRUE colors cells by their Pearson residuals)
mosaicplot(chinese_names, las = 2, shade = TRUE, main = "Metonymic patterns across female referent types")

The Analysis

Run a simple correspondence analysis on the data.

##r chunk 

chinesename_analysis= ca(chinese_names)

summary(chinesename_analysis)
## 
## Principal inertias (eigenvalues):
## 
##  dim    value      %   cum%   scree plot               
##  1      0.762754  79.1  79.1  ********************     
##  2      0.166546  17.3  96.4  ****                     
##  3      0.021321   2.2  98.6  *                        
##  4      0.012843   1.3 100.0                           
##  5      0.000249   0.0 100.0                           
##         -------- -----                                 
##  Total: 0.963714 100.0                                 
## 
## 
## Rows:
##     name   mass  qlt  inr    k=1 cor ctr     k=2 cor ctr  
## 1 | Impr |  133  982   57 | -638 981  71 |    20   1   0 |
## 2 | Btfl |  250  999  499 | 1363 965 608 |   254  34  97 |
## 3 | Mthr |   30   80   18 | -207  75   2 |   -53   5   1 |
## 4 | Unch |   64  998  194 |  824 231  57 | -1503 768 862 |
## 5 | Yong |   26  407   27 |  422 176   6 |   483 230  36 |
## 6 | Wife |  498  995  205 | -627 991 257 |    37   4   4 |
## 
## Columns:
##     name   mass  qlt  inr    k=1 cor ctr     k=2 cor ctr  
## 1 | Actn |   32   65   18 | -186  65   1 |    20   1   0 |
## 2 | Bdyp |  106 1000  282 | 1221 580 206 | -1039 420 684 |
## 3 | Lctn |  628  998  262 | -634 998 331 |    -2   0   0 |
## 4 | Clth |   79  795   65 |  687 589  49 |   406 206  78 |
## 5 | Chrc |   77  999  174 | 1397 900 198 |   462  98  99 |
## 6 | Psss |   78  976  199 | 1450 856 215 |   544 120 139 |

What do the inertia values tell you about the dimensionality of the data?

  • The first two dimensions capture 96.4% of the total inertia (79.1% + 17.3%).
  • The cumulative inertia reaches 100% by the fourth dimension, so two dimensions are enough to summarize the table (see the sketch below).
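A small sketch of where these percentages come from (assuming the chinesename_analysis object created above): the principal inertias are the squared singular values stored in the ca object, and their cumulative share gives the values reported in the summary.

##r chunk

#principal inertias = squared singular values of the ca solution
inertias = chinesename_analysis$sv^2

#percent and cumulative percent of total inertia (should match the summary above)
round(100 * inertias / sum(inertias), 1)
round(100 * cumsum(inertias) / sum(inertias), 1)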

Create a 2D plot of the data.

##r chunk

plot(chinesename_analysis)

What can you tell about the word usage from examining this plot?

  • Points that are plotted close together have similar profiles, i.e., similar relative frequencies across the table.

  • The distances on the map reflect the χ2 distances of each row/column profile from the average profile; the sketch below checks the related identity between total inertia and the χ2 statistic.
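This is only a hedged check (assuming the chinese_names table loaded above): the total inertia reported by ca() equals the Pearson chi-square statistic for the table divided by the total number of observations.

##r chunk

#Pearson chi-square test of independence on the frequency table
chi_test = chisq.test(as.matrix(chinese_names))

#chi-square divided by n should reproduce the total inertia (~0.96) from the ca summary
chi_test$statistic / sum(chinese_names)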

Multiple Correspondence Analysis

The data included is from a large project examining the definitions of words and, thus, exploring their category requirements. The following columns are included: cue, pos_cue, pos_feature, pos_translated, a1, and a2 (the str() output below shows their levels).

Run a multiple correspondence analysis on the data, excluding the cue column.

##r chunk

mca_data = read.csv("mca_data.csv")

head(mca_data)
##       cue pos_cue pos_feature pos_translated                 a1   a2
## 1 abandon    verb        noun           noun               none none
## 2 abandon    verb        verb           verb               none none
## 3 abandon    verb        verb           verb present_participle none
## 4 abandon    verb   adjective           verb         past_tense none
## 5 abandon    verb       other          other               none none
## 6 abdomen    noun        noun           noun               none none
str(mca_data)
## 'data.frame':    35447 obs. of  6 variables:
##  $ cue           : Factor w/ 4436 levels "abandon","abdomen",..: 1 1 1 1 1 2 2 2 2 3 ...
##  $ pos_cue       : Factor w/ 4 levels "adjective","noun",..: 4 4 4 4 4 2 2 2 2 4 ...
##  $ pos_feature   : Factor w/ 4 levels "adjective","noun",..: 2 4 4 1 3 2 1 2 2 3 ...
##  $ pos_translated: Factor w/ 4 levels "adjective","noun",..: 2 4 4 4 3 2 1 2 2 3 ...
##  $ a1            : Factor w/ 14 levels "actions_process",..: 5 5 11 9 5 5 5 7 2 5 ...
##  $ a2            : Factor w/ 14 levels "actions_process",..: 5 5 5 5 5 5 5 5 5 5 ...
#rownames(mca_data) = mca_data[,1] #cue values repeat, so they cannot be used as row names
mca_data = mca_data[,-1] #drop the cue column

#note: the extra [,-1] below also drops pos_cue, so the MCA uses
#pos_feature, pos_translated, a1, and a2 only (matching the eta2 output below)
mca_data_analysis = MCA(mca_data[,-1], graph = FALSE)

summary(mca_data_analysis)
## 
## Call:
## MCA(X = mca_data[, -1], graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.507   0.466   0.429   0.355   0.322   0.289   0.280
## % of var.              6.337   5.825   5.366   4.436   4.024   3.617   3.503
## Cumulative % of var.   6.337  12.162  17.528  21.964  25.988  29.606  33.108
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.271   0.263   0.258   0.254   0.252   0.251   0.250
## % of var.              3.393   3.293   3.221   3.175   3.150   3.140   3.130
## Cumulative % of var.  36.502  39.795  43.016  46.191  49.341  52.481  55.611
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.250   0.250   0.250   0.249   0.248   0.245   0.244
## % of var.              3.127   3.125   3.120   3.109   3.098   3.063   3.045
## Cumulative % of var.  58.738  61.863  64.983  68.092  71.190  74.253  77.298
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
## Variance               0.241   0.235   0.222   0.214   0.190   0.172   0.152
## % of var.              3.014   2.936   2.773   2.671   2.374   2.151   1.906
## Cumulative % of var.  80.312  83.249  86.022  88.692  91.066  93.217  95.123
##                       Dim.29  Dim.30  Dim.31  Dim.32
## Variance               0.138   0.111   0.087   0.054
## % of var.              1.726   1.392   1.090   0.669
## Cumulative % of var.  96.849  98.241  99.331 100.000
## 
## Individuals (the 10 first)
##                             Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                        |  0.187  0.000  0.030 | -0.302  0.001  0.078 | -0.394
## 2                        | -0.527  0.002  0.154 |  0.696  0.003  0.268 |  0.029
## 3                        | -1.112  0.007  0.387 |  0.710  0.003  0.158 |  0.088
## 4                        | -0.453  0.001  0.040 |  0.593  0.002  0.069 |  0.691
## 5                        |  1.912  0.020  0.291 |  2.134  0.028  0.363 | -1.929
## 6                        |  0.187  0.000  0.030 | -0.302  0.001  0.078 | -0.394
## 7                        |  0.827  0.004  0.268 |  0.203  0.000  0.016 |  0.722
## 8                        | -0.013  0.000  0.000 | -0.836  0.004  0.296 | -0.599
## 9                        |  0.209  0.000  0.023 | -0.570  0.002  0.171 | -0.100
## 10                       |  1.912  0.020  0.291 |  2.134  0.028  0.363 | -1.929
##                             ctr   cos2  
## 1                         0.001  0.133 |
## 2                         0.000  0.000 |
## 3                         0.000  0.002 |
## 4                         0.003  0.094 |
## 5                         0.024  0.296 |
## 6                         0.001  0.133 |
## 7                         0.003  0.204 |
## 8                         0.002  0.152 |
## 9                         0.000  0.005 |
## 10                        0.024  0.296 |
## 
## Categories (the 10 first)
##                               Dim.1      ctr     cos2   v.test      Dim.2
## pos_feature_adjective    |    0.870    8.715    0.231   90.401 |    0.001
## pos_feature_noun         |   -0.010    0.002    0.000   -1.669 |   -0.772
## pos_feature_other        |    2.444   12.918    0.274   98.543 |    2.496
## pos_feature_verb         |   -1.126   16.986    0.473 -129.469 |    0.878
## pos_translated_adjective |    0.968    8.284    0.205   85.173 |    0.038
## pos_translated_noun      |    0.026    0.016    0.001    4.681 |   -0.569
## pos_translated_other     |    2.484   11.252    0.237   91.640 |    2.814
## pos_translated_verb      |   -0.894   12.243    0.360 -113.000 |    0.508
## a1_actions_process       |   -0.152    0.082    0.002   -7.979 |   -0.769
## a1_characteristic        |    0.600    2.861    0.069   49.515 |   -0.291
##                               ctr     cos2   v.test      Dim.3      ctr
## pos_feature_adjective       0.000    0.000    0.098 |    1.202   19.639
## pos_feature_noun           14.408    0.489 -131.689 |   -0.530    7.373
## pos_feature_other          14.654    0.286  100.634 |   -2.159   11.904
## pos_feature_verb           11.223    0.287  100.904 |    0.195    0.603
## pos_translated_adjective    0.014    0.000    3.356 |    0.949    9.409
## pos_translated_noun         8.199    0.290 -101.381 |   -0.245    1.655
## pos_translated_other       15.710    0.304  103.825 |   -2.637   14.980
## pos_translated_verb         4.304    0.116   64.243 |    0.139    0.351
## a1_actions_process          2.282    0.046  -40.312 |   -0.069    0.020
## a1_characteristic           0.730    0.016  -23.984 |    0.563    2.977
##                              cos2   v.test  
## pos_feature_adjective       0.440  124.877 |
## pos_feature_noun            0.231  -90.413 |
## pos_feature_other           0.214  -87.051 |
## pos_feature_verb            0.014   22.444 |
## pos_translated_adjective    0.197   83.533 |
## pos_translated_noun         0.054  -43.714 |
## pos_translated_other        0.267  -97.304 |
## pos_translated_verb         0.009   17.605 |
## a1_actions_process          0.000   -3.618 |
## a1_characteristic           0.061   46.483 |
## 
## Categorical variables (eta2)
##                            Dim.1 Dim.2 Dim.3  
## pos_feature              | 0.783 0.751 0.679 |
## pos_translated           | 0.645 0.526 0.453 |
## a1                       | 0.556 0.470 0.398 |
## a2                       | 0.044 0.117 0.187 |

Plot the variables in a 2D graph. Use invis = "ind" rather than col.ind = "gray" so you can read the plot better.

##r chunk

plot(mca_data_analysis, cex = 0.5,
     col.var = "red", 
     invis= "ind")

Use the dimdesc function to show the usefulness of the variables and to help you understand the results. Remember that the markdown preview doesn’t show you the whole output; use the console or knit to see the complete results.

##r chunk

dimdesc(mca_data_analysis)
## $`Dim 1`
## $`Dim 1`$quali
##                        R2 p.value
## pos_feature    0.78315711       0
## pos_translated 0.64471297       0
## a1             0.55595804       0
## a2             0.04393332       0
## 
## $`Dim 1`$category
##                                            Estimate       p.value
## a1=a1_none                               0.45651602  0.000000e+00
## a1=a1_characteristic                     0.50067905  0.000000e+00
## pos_translated=pos_translated_other      1.30839265  0.000000e+00
## pos_translated=pos_translated_adjective  0.22922345  0.000000e+00
## pos_feature=pos_feature_other            1.35261577  0.000000e+00
## pos_feature=pos_feature_adjective        0.23166646  0.000000e+00
## a1=a1_not                                0.88373318 1.913170e-262
## a2=a2_characteristic                     0.82807322 5.318457e-262
## a1=a1_magnitude                          1.14982429 4.569105e-203
## a1=a1_location                           0.72953694  7.045299e-30
## a2=a2_actions_process                    0.47387019  7.270339e-16
## a2=a2_magnitude                          0.79718987  1.431242e-05
## a2=a2_past_tense                         0.18091510  4.705422e-05
## a1=a1_numbers                            0.05063490  2.189220e-02
## a2=a2_not                                0.22518129  4.512929e-02
## pos_translated=pos_translated_noun      -0.44132493  2.841210e-06
## a1=a1_opposites_wrong                   -0.54676209  3.325198e-09
## a1=a1_actions_process                   -0.03471943  1.433693e-15
## a1=a1_person_object                     -0.06906787  2.294673e-21
## a2=a2_third_person                      -0.91873160  7.648144e-23
## a2=a2_none                              -0.01471010  4.453613e-41
## a2=a2_numbers                           -0.30887041  3.354590e-44
## a1=a1_time                              -0.71670020  1.815025e-52
## a1=a1_third_person                      -0.88348633  0.000000e+00
## a1=a1_present_participle                -0.72936126  0.000000e+00
## a1=a1_past_tense                        -0.81293060  0.000000e+00
## pos_translated=pos_translated_verb      -1.09629116  0.000000e+00
## pos_feature=pos_feature_verb            -1.18952856  0.000000e+00
## 
## 
## $`Dim 2`
## $`Dim 2`$quali
##                       R2 p.value
## pos_feature    0.7509786       0
## pos_translated 0.5262071       0
## a1             0.4702413       0
## a2             0.1167293       0
## 
## $`Dim 2`$category
##                                            Estimate       p.value
## a2=a2_none                               0.56228112  0.000000e+00
## a1=a1_past_tense                         0.72086326  0.000000e+00
## a1=a1_none                               0.31530739  0.000000e+00
## pos_translated=pos_translated_other      1.44457983  0.000000e+00
## pos_feature=pos_feature_verb             0.15484159  0.000000e+00
## pos_feature=pos_feature_other            1.25985299  0.000000e+00
## a1=a1_third_person                       0.54617139 7.565060e-264
## a1=a1_present_participle                 0.33965769 5.340420e-231
## a2=a2_characteristic                     0.09453136  7.207779e-73
## a2=a2_past_tense                         0.86695131  8.368665e-17
## a1=a1_time                               0.40786894  2.267594e-15
## a2=a2_actions_process                    0.25379738  4.745612e-06
## a1=a1_not                                0.11399405  8.801548e-06
## a2=a2_third_person                       0.81682119  6.587179e-04
## a1=a1_magnitude                          0.12828675  7.697514e-04
## a1=a1_opposites_wrong                    0.31653910  2.581146e-03
## a2=a2_location                          -0.50233933  3.565496e-02
## a2=a2_magnitude                         -0.03076440  2.079612e-03
## pos_translated=pos_translated_adjective -0.45041427  7.907533e-04
## a2=a2_time                              -1.20021982  5.261517e-07
## a2=a2_slang                             -1.38962769  4.668725e-10
## a1=a1_slang                             -0.77187138  4.935940e-12
## a2=a2_person_object                     -0.47995103  2.849101e-52
## a1=a1_characteristic                    -0.18493008 3.915540e-128
## a2=a2_numbers                           -0.71298512  0.000000e+00
## a1=a1_person_object                     -0.85455684  0.000000e+00
## a1=a1_numbers                           -0.67985194  0.000000e+00
## a1=a1_actions_process                   -0.51139051  0.000000e+00
## pos_translated=pos_translated_verb      -0.12961452  0.000000e+00
## pos_translated=pos_translated_noun      -0.86455104  0.000000e+00
## pos_feature=pos_feature_noun            -0.97106479  0.000000e+00
## 
## 
## $`Dim 3`
## $`Dim 3`$quali
##                       R2 p.value
## pos_feature    0.6785920       0
## pos_translated 0.4532433       0
## a1             0.3978683       0
## a2             0.1874376       0
## 
## $`Dim 3`$category
##                                           Estimate       p.value
## a2=a2_past_tense                         1.6004423  0.000000e+00
## a2=a2_characteristic                     0.4114460 1.433708e-315
## a1=a1_not                                1.4031875  0.000000e+00
## a1=a1_magnitude                          1.1958762  0.000000e+00
## a1=a1_characteristic                     0.1414688  0.000000e+00
## pos_translated=pos_translated_noun       0.1332089  0.000000e+00
## pos_translated=pos_translated_adjective  0.9158582  0.000000e+00
## pos_feature=pos_feature_adjective        0.9990786  0.000000e+00
## a2=a2_actions_process                    1.1372599 3.014382e-185
## a1=a1_past_tense                         0.1132825 9.665706e-144
## a2=a2_present_participle                 0.7371994 7.436795e-131
## pos_feature=pos_feature_verb             0.3395428 2.398048e-112
## pos_translated=pos_translated_verb       0.3850374  1.154297e-69
## a1=a1_opposites_wrong                    0.8676726  7.119226e-30
## a2=a2_not                                0.6292867  2.378004e-24
## a2=a2_magnitude                          0.9873431  6.847873e-17
## a1=a1_time                               0.1159710  6.522110e-13
## a2=a2_third_person                       0.1226626  1.969468e-10
## a2=a2_time                              -1.1062469  3.734607e-02
## a1=a1_present_participle                -0.2623936  3.252439e-04
## a1=a1_actions_process                   -0.2726630  2.964159e-04
## a2=a2_slang                             -1.8119996  2.167998e-06
## a1=a1_slang                             -0.9562133  2.407952e-11
## a1=a1_none                              -0.3639707 5.084896e-151
## a2=a2_none                              -0.4568427 7.903856e-240
## a2=a2_numbers                           -1.1097715 1.030554e-251
## a1=a1_person_object                     -0.7122676 2.081277e-273
## a1=a1_numbers                           -0.7156212  0.000000e+00
## pos_translated=pos_translated_other     -1.4341045  0.000000e+00
## pos_feature=pos_feature_other           -1.2031432  0.000000e+00
## pos_feature=pos_feature_noun            -0.1354782  0.000000e+00

What are the largest predictors (i.e., R^2 over .25) of the first dimension?

  • pos_feature (R^2 = .78), pos_translated (R^2 = .64), and a1 (R^2 = .56) are the largest predictors of the first dimension; a2 (R^2 = .04) falls well below the .25 cutoff. The sketch below filters the dimdesc output for that cutoff.
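A minimal sketch of pulling those predictors out programmatically (assuming the mca_data_analysis object from above and the .25 cutoff named in the question):

##r chunk

#keep only the variables whose R-squared with dimension 1 exceeds .25
dim1_quali = dimdesc(mca_data_analysis)$`Dim 1`$quali
dim1_quali[dim1_quali[, "R2"] > 0.25, ]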

Looking at the category output for dimension one, what types of features does this appear to represent? (Try looking at the largest positive estimates to help distinguish what is represented by this dimension).

  • a1_not and a2_characteristic show among the highest positive estimates on the first dimension (along with the 'other' part-of-speech categories); see the sketch below for ranking the estimates.
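To see which category levels anchor the positive end of dimension 1, the estimates can be ranked (a sketch under the same assumptions as above):

##r chunk

#rank the dimension 1 category estimates from most positive to most negative
dim1_cats = dimdesc(mca_data_analysis)$`Dim 1`$category
head(dim1_cats[order(dim1_cats[, "Estimate"], decreasing = TRUE), ], 10)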

Simple Categories

To view simple categories like we did in the lecture, try picking a few words out of the dataset that might be considered similar. I’ve shown how to do this below with three words, but feel free to pick your own. Change the words and the DF to your dataframe name. We will overlay those as supplemental variables.

##r chunk
#pick several interesting words that might form a related category
words = c("mom", "family", "relative")


mca_data1 = read.csv("mca_data.csv")


mca_model2 = MCA(mca_data1[mca_data1$cue %in% words , ], 
                 quali.sup = 1, #supplemental variable
                 graph = TRUE)

Create a 2D plot of your category analysis.

##r chunk 

plot(mca_model2)

Add the prototype ellipses to the plot.

##r chunk

plotellipses(mca_model2, means=FALSE, keepvar=1, label="quali")

Create a 95% CI type plot for the category.

##r chunk

plotellipses(mca_model2, keepvar=1, label="quali")

What can you tell about the categories from these plots? Are they distinct or overlapping?

Run an MCA in Python

In this section, run the same MCA from above in Python. Include the MCA code and print out the inertia values for your analysis.

##python chunk 

#import prince

##set up the mca analysis
#mca = prince.MCA(
#  n_components=2,
#  n_iter=3,
#  copy=True,
#  check_input=True,
#  engine='auto',
#  random_state=42)

#pull the R data frame over with reticulate and drop the cue column
#mca_data_py = r.mca_data1
#mca_data_py = mca_data_py.drop(['cue'], axis=1)

#fit the MCA and print the inertia values
#mca = mca.fit(mca_data_py)
#mca.explained_inertia_

Plot the Results

Plot the results of your MCA using Python in the section below. I have included Python code below that will help if you are completing this assignment on the cloud.

##python chunk
#import matplotlib
#matplotlib.use('Agg')

Explore the differences

Do the R and Python results from the MCA show you the same answer? Do you detect any differences between the outputs?