Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
##r chunk
library(FactoMineR)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
library(ggplot2)
library(ca)
library(vcd)
## Loading required package: grid
library(mosaic)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: ggformula
## Loading required package: ggstance
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
##
## New to ggformula? Try the tutorials:
## learnr::run_tutorial("introduction", package = "ggformula")
## learnr::run_tutorial("refining", package = "ggformula")
## Loading required package: mosaicData
## Loading required package: Matrix
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.
##
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
##
## mean
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
## The following object is masked from 'package:vcd':
##
## mplot
## The following object is masked from 'package:ggplot2':
##
## stat
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
library(dplyr)
Women and metonymy in Ancient Chinese: the data concern metonymic patterns that were used to refer to women in texts of the Ming dynasty in China (1368–1644). The rows are different types of female referents: imperial woman (queen or emperor’s concubine), servant girl, beautiful woman, mother or grandmother, unchaste woman (prostitute or mistress), young girl, and wife (or concubine). The columns are six metonymic patterns: Action, Bodypart, Location, Clothes, Characteristic, and Possessed.
Import the data and create a mosaic plot to visualize the differences in usage across references to women.
##r chunk
chinese_names = read.csv("chinese_names.csv")
head(chinese_names)
## Name Action Bodypart Location Clothes Characteristic Possessed
## 1 Imperial 10 1 204 5 2 0
## 2 Beautiful 9 99 5 71 112 120
## 3 Mother 8 4 32 1 2 3
## 4 Unchaste 2 67 27 3 5 2
## 5 Young 1 2 16 14 7 3
## 6 Wife 24 3 763 37 1 2
str(chinese_names)
## 'data.frame': 6 obs. of 7 variables:
## $ Name : Factor w/ 6 levels "Beautiful ","Imperial ",..: 2 1 3 4 6 5
## $ Action : int 10 9 8 2 1 24
## $ Bodypart : int 1 99 4 67 2 3
## $ Location : int 204 5 32 27 16 763
## $ Clothes : int 5 71 1 3 14 37
## $ Characteristic: int 2 112 2 5 7 1
## $ Possessed : int 0 120 3 2 3 2
rownames(chinese_names) = chinese_names[,1]
chinese_names = chinese_names[,-1]
#mosaic plot
mosaicplot(chinese_names, las = 2, shade = TRUE, main = "Metonymic patterns for differences in usage across women references")
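With `shade = TRUE`, `mosaicplot()` colors the tiles by standardized residuals (Pearson residuals by default) from the independence model. As a minimal sketch, those residuals can be computed by hand. The counts below are retyped from the `head(chinese_names)` output above, and the object names (`counts`, `expected`, `pearson`) are illustrative rather than part of the original analysis.

```r
# Counts retyped from the head(chinese_names) output above
counts <- matrix(
  c(10,  1, 204,  5,   2,   0,
     9, 99,   5, 71, 112, 120,
     8,  4,  32,  1,   2,   3,
     2, 67,  27,  3,   5,   2,
     1,  2,  16, 14,   7,   3,
    24,  3, 763, 37,   1,   2),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("Imperial", "Beautiful", "Mother", "Unchaste", "Young", "Wife"),
    c("Action", "Bodypart", "Location", "Clothes", "Characteristic", "Possessed")
  )
)

# Expected counts if referent type and metonymic pattern were independent
n <- sum(counts)
expected <- outer(rowSums(counts), colSums(counts)) / n

# Pearson residuals: positive = used more often than expected, negative = less
pearson <- (counts - expected) / sqrt(expected)
round(pearson, 1)
```

Cells with large positive residuals are the referent/pattern pairings that occur more often than independence would predict; large negative residuals mark pairings that are avoided.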
Run a simple correspondence analysis on the data.
##r chunk
chinesename_analysis= ca(chinese_names)
summary(chinesename_analysis)
##
## Principal inertias (eigenvalues):
##
## dim value % cum% scree plot
## 1 0.762754 79.1 79.1 ********************
## 2 0.166546 17.3 96.4 ****
## 3 0.021321 2.2 98.6 *
## 4 0.012843 1.3 100.0
## 5 0.000249 0.0 100.0
## -------- -----
## Total: 0.963714 100.0
##
##
## Rows:
## name mass qlt inr k=1 cor ctr k=2 cor ctr
## 1 | Impr | 133 982 57 | -638 981 71 | 20 1 0 |
## 2 | Btfl | 250 999 499 | 1363 965 608 | 254 34 97 |
## 3 | Mthr | 30 80 18 | -207 75 2 | -53 5 1 |
## 4 | Unch | 64 998 194 | 824 231 57 | -1503 768 862 |
## 5 | Yong | 26 407 27 | 422 176 6 | 483 230 36 |
## 6 | Wife | 498 995 205 | -627 991 257 | 37 4 4 |
##
## Columns:
## name mass qlt inr k=1 cor ctr k=2 cor ctr
## 1 | Actn | 32 65 18 | -186 65 1 | 20 1 0 |
## 2 | Bdyp | 106 1000 282 | 1221 580 206 | -1039 420 684 |
## 3 | Lctn | 628 998 262 | -634 998 331 | -2 0 0 |
## 4 | Clth | 79 795 65 | 687 589 49 | 406 206 78 |
## 5 | Chrc | 77 999 174 | 1397 900 198 | 462 98 99 |
## 6 | Psss | 78 976 199 | 1450 856 215 | 544 120 139 |
What do the inertia values tell you about the dimensionality of the data?
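One way to reason about this question is to reproduce the principal inertias directly. The sketch below assumes the standard simple CA computation (a singular value decomposition of the standardized residuals) and uses counts retyped from the `head(chinese_names)` output; the variable names are illustrative.

```r
# Counts retyped from the head(chinese_names) output (rows: referents, columns: patterns)
counts <- matrix(
  c(10,  1, 204,  5,   2,   0,
     9, 99,   5, 71, 112, 120,
     8,  4,  32,  1,   2,   3,
     2, 67,  27,  3,   5,   2,
     1,  2,  16, 14,   7,   3,
    24,  3, 763, 37,   1,   2),
  nrow = 6, byrow = TRUE
)

P  <- counts / sum(counts)   # correspondence matrix
r  <- rowSums(P)             # row masses
cm <- colSums(P)             # column masses

# Standardized residuals from the independence model
S <- (P - outer(r, cm)) / sqrt(outer(r, cm))

# Squared singular values are the principal inertias;
# a 6 x 6 table has at most min(6, 6) - 1 = 5 non-trivial dimensions
inertia <- svd(S)$d[1:5]^2
pct <- 100 * inertia / sum(inertia)
round(cbind(inertia, pct, cum = cumsum(pct)), 3)
```

If the counts were retyped correctly, the values should agree with the principal inertias in the `summary()` table above, where the first two dimensions together account for more than 96% of the total inertia.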
Create a 2D plot of the data.
##r chunk
plot(chinesename_analysis)
What can you tell about the word usage from examining this plot?
-The distances on the map represent the χ² distances of each row or column profile from the average profile, which sits at the origin.
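To make the distance interpretation concrete, here is a minimal sketch that computes the χ² distance of each row profile from the average profile by hand. It assumes the counts shown in the `head(chinese_names)` output; `row_profiles`, `d2`, and the other names are illustrative.

```r
# Counts retyped from the head(chinese_names) output
counts <- matrix(
  c(10,  1, 204,  5,   2,   0,
     9, 99,   5, 71, 112, 120,
     8,  4,  32,  1,   2,   3,
     2, 67,  27,  3,   5,   2,
     1,  2,  16, 14,   7,   3,
    24,  3, 763, 37,   1,   2),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("Imperial", "Beautiful", "Mother", "Unchaste", "Young", "Wife"),
    c("Action", "Bodypart", "Location", "Clothes", "Characteristic", "Possessed")
  )
)

P        <- counts / sum(counts)
row_mass <- rowSums(P)
col_mass <- colSums(P)                     # this is also the average row profile

# Row profiles: each row rescaled to sum to 1
row_profiles <- sweep(P, 1, row_mass, "/")

# Squared chi-squared distance of each row profile from the average profile
d2 <- apply(row_profiles, 1, function(p) sum((p - col_mass)^2 / col_mass))
round(sqrt(d2), 3)

# Mass-weighted squared distances add up to the total inertia
sum(row_mass * d2)
```

Rows with a large distance have a usage profile that differs strongly from the overall pattern, which is why they plot far from the origin.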
The data included are from a large project that examines the definitions of words and thereby explores their category requirements. The following columns are included:
Run a multiple correspondence analysis on the data, excluding the cue column.
##r chunk
mca_data = read.csv("mca_data.csv")
head(mca_data)
## cue pos_cue pos_feature pos_translated a1 a2
## 1 abandon verb noun noun none none
## 2 abandon verb verb verb none none
## 3 abandon verb verb verb present_participle none
## 4 abandon verb adjective verb past_tense none
## 5 abandon verb other other none none
## 6 abdomen noun noun noun none none
str(mca_data)
## 'data.frame': 35447 obs. of 6 variables:
## $ cue : Factor w/ 4436 levels "abandon","abdomen",..: 1 1 1 1 1 2 2 2 2 3 ...
## $ pos_cue : Factor w/ 4 levels "adjective","noun",..: 4 4 4 4 4 2 2 2 2 4 ...
## $ pos_feature : Factor w/ 4 levels "adjective","noun",..: 2 4 4 1 3 2 1 2 2 3 ...
## $ pos_translated: Factor w/ 4 levels "adjective","noun",..: 2 4 4 4 3 2 1 2 2 3 ...
## $ a1 : Factor w/ 14 levels "actions_process",..: 5 5 11 9 5 5 5 7 2 5 ...
## $ a2 : Factor w/ 14 levels "actions_process",..: 5 5 5 5 5 5 5 5 5 5 ...
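#rownames(mca_data) = mca_data[,1]
mca_data = mca_data[,-1] #drop the cue column
#NOTE: cue is already gone at this point, so [,-1] below also drops pos_cue;
#the output that follows reflects this (pos_cue is not among the variables)
mca_data_analysis = MCA(mca_data[,-1], graph = FALSE)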
summary(mca_data_analysis)
##
## Call:
## MCA(X = mca_data[, -1], graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.507 0.466 0.429 0.355 0.322 0.289 0.280
## % of var. 6.337 5.825 5.366 4.436 4.024 3.617 3.503
## Cumulative % of var. 6.337 12.162 17.528 21.964 25.988 29.606 33.108
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.271 0.263 0.258 0.254 0.252 0.251 0.250
## % of var. 3.393 3.293 3.221 3.175 3.150 3.140 3.130
## Cumulative % of var. 36.502 39.795 43.016 46.191 49.341 52.481 55.611
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.250 0.250 0.250 0.249 0.248 0.245 0.244
## % of var. 3.127 3.125 3.120 3.109 3.098 3.063 3.045
## Cumulative % of var. 58.738 61.863 64.983 68.092 71.190 74.253 77.298
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27 Dim.28
## Variance 0.241 0.235 0.222 0.214 0.190 0.172 0.152
## % of var. 3.014 2.936 2.773 2.671 2.374 2.151 1.906
## Cumulative % of var. 80.312 83.249 86.022 88.692 91.066 93.217 95.123
## Dim.29 Dim.30 Dim.31 Dim.32
## Variance 0.138 0.111 0.087 0.054
## % of var. 1.726 1.392 1.090 0.669
## Cumulative % of var. 96.849 98.241 99.331 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.187 0.000 0.030 | -0.302 0.001 0.078 | -0.394
## 2 | -0.527 0.002 0.154 | 0.696 0.003 0.268 | 0.029
## 3 | -1.112 0.007 0.387 | 0.710 0.003 0.158 | 0.088
## 4 | -0.453 0.001 0.040 | 0.593 0.002 0.069 | 0.691
## 5 | 1.912 0.020 0.291 | 2.134 0.028 0.363 | -1.929
## 6 | 0.187 0.000 0.030 | -0.302 0.001 0.078 | -0.394
## 7 | 0.827 0.004 0.268 | 0.203 0.000 0.016 | 0.722
## 8 | -0.013 0.000 0.000 | -0.836 0.004 0.296 | -0.599
## 9 | 0.209 0.000 0.023 | -0.570 0.002 0.171 | -0.100
## 10 | 1.912 0.020 0.291 | 2.134 0.028 0.363 | -1.929
## ctr cos2
## 1 0.001 0.133 |
## 2 0.000 0.000 |
## 3 0.000 0.002 |
## 4 0.003 0.094 |
## 5 0.024 0.296 |
## 6 0.001 0.133 |
## 7 0.003 0.204 |
## 8 0.002 0.152 |
## 9 0.000 0.005 |
## 10 0.024 0.296 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2
## pos_feature_adjective | 0.870 8.715 0.231 90.401 | 0.001
## pos_feature_noun | -0.010 0.002 0.000 -1.669 | -0.772
## pos_feature_other | 2.444 12.918 0.274 98.543 | 2.496
## pos_feature_verb | -1.126 16.986 0.473 -129.469 | 0.878
## pos_translated_adjective | 0.968 8.284 0.205 85.173 | 0.038
## pos_translated_noun | 0.026 0.016 0.001 4.681 | -0.569
## pos_translated_other | 2.484 11.252 0.237 91.640 | 2.814
## pos_translated_verb | -0.894 12.243 0.360 -113.000 | 0.508
## a1_actions_process | -0.152 0.082 0.002 -7.979 | -0.769
## a1_characteristic | 0.600 2.861 0.069 49.515 | -0.291
## ctr cos2 v.test Dim.3 ctr
## pos_feature_adjective 0.000 0.000 0.098 | 1.202 19.639
## pos_feature_noun 14.408 0.489 -131.689 | -0.530 7.373
## pos_feature_other 14.654 0.286 100.634 | -2.159 11.904
## pos_feature_verb 11.223 0.287 100.904 | 0.195 0.603
## pos_translated_adjective 0.014 0.000 3.356 | 0.949 9.409
## pos_translated_noun 8.199 0.290 -101.381 | -0.245 1.655
## pos_translated_other 15.710 0.304 103.825 | -2.637 14.980
## pos_translated_verb 4.304 0.116 64.243 | 0.139 0.351
## a1_actions_process 2.282 0.046 -40.312 | -0.069 0.020
## a1_characteristic 0.730 0.016 -23.984 | 0.563 2.977
## cos2 v.test
## pos_feature_adjective 0.440 124.877 |
## pos_feature_noun 0.231 -90.413 |
## pos_feature_other 0.214 -87.051 |
## pos_feature_verb 0.014 22.444 |
## pos_translated_adjective 0.197 83.533 |
## pos_translated_noun 0.054 -43.714 |
## pos_translated_other 0.267 -97.304 |
## pos_translated_verb 0.009 17.605 |
## a1_actions_process 0.000 -3.618 |
## a1_characteristic 0.061 46.483 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## pos_feature | 0.783 0.751 0.679 |
## pos_translated | 0.645 0.526 0.453 |
## a1 | 0.556 0.470 0.398 |
## a2 | 0.044 0.117 0.187 |
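The eigenvalue table is easier to read with a little arithmetic. The sketch below assumes the usual indicator-matrix MCA results: the number of dimensions is the number of categories minus the number of variables, the total inertia is J/Q - 1, and the average eigenvalue is 1/Q. The level counts are read from the `str(mca_data)` output, and the variable names are illustrative.

```r
# Number of levels per active variable, read from str(mca_data) above
levels_per_var <- c(pos_feature = 4, pos_translated = 4, a1 = 14, a2 = 14)

Q <- length(levels_per_var)   # number of active variables
J <- sum(levels_per_var)      # total number of categories

n_dims        <- J - Q        # number of MCA dimensions
total_inertia <- J / Q - 1    # total inertia of the indicator matrix
avg_eigen     <- 1 / Q        # average eigenvalue, a common retention threshold

# Share of inertia for Dim.1, using the rounded eigenvalue from the summary
pct_dim1 <- 100 * 0.507 / total_inertia

c(n_dims = n_dims, total_inertia = total_inertia,
  avg_eigen = avg_eigen, pct_dim1 = pct_dim1)
```

Because the inertia is spread over many dimensions, the low percentage for each single dimension is expected in MCA; comparing each eigenvalue against the average of 1/Q is one rule of thumb for deciding which dimensions are worth interpreting.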
Plot the variables in a 2D graph. Use invis = "ind" rather than col.ind = "gray" so you can read the plot better.
##r chunk
plot(mca_data_analysis, cex = 0.5,
col.var = "red",
invis= "ind")
Use the dimdesc function to show the usefulness of the variables and to help you understand the results. Remember that the markdown preview doesn’t show you the whole output; use the console or knit to see the complete results.
##r chunk
dimdesc(mca_data_analysis)
## $`Dim 1`
## $`Dim 1`$quali
## R2 p.value
## pos_feature 0.78315711 0
## pos_translated 0.64471297 0
## a1 0.55595804 0
## a2 0.04393332 0
##
## $`Dim 1`$category
## Estimate p.value
## a1=a1_none 0.45651602 0.000000e+00
## a1=a1_characteristic 0.50067905 0.000000e+00
## pos_translated=pos_translated_other 1.30839265 0.000000e+00
## pos_translated=pos_translated_adjective 0.22922345 0.000000e+00
## pos_feature=pos_feature_other 1.35261577 0.000000e+00
## pos_feature=pos_feature_adjective 0.23166646 0.000000e+00
## a1=a1_not 0.88373318 1.913170e-262
## a2=a2_characteristic 0.82807322 5.318457e-262
## a1=a1_magnitude 1.14982429 4.569105e-203
## a1=a1_location 0.72953694 7.045299e-30
## a2=a2_actions_process 0.47387019 7.270339e-16
## a2=a2_magnitude 0.79718987 1.431242e-05
## a2=a2_past_tense 0.18091510 4.705422e-05
## a1=a1_numbers 0.05063490 2.189220e-02
## a2=a2_not 0.22518129 4.512929e-02
## pos_translated=pos_translated_noun -0.44132493 2.841210e-06
## a1=a1_opposites_wrong -0.54676209 3.325198e-09
## a1=a1_actions_process -0.03471943 1.433693e-15
## a1=a1_person_object -0.06906787 2.294673e-21
## a2=a2_third_person -0.91873160 7.648144e-23
## a2=a2_none -0.01471010 4.453613e-41
## a2=a2_numbers -0.30887041 3.354590e-44
## a1=a1_time -0.71670020 1.815025e-52
## a1=a1_third_person -0.88348633 0.000000e+00
## a1=a1_present_participle -0.72936126 0.000000e+00
## a1=a1_past_tense -0.81293060 0.000000e+00
## pos_translated=pos_translated_verb -1.09629116 0.000000e+00
## pos_feature=pos_feature_verb -1.18952856 0.000000e+00
##
##
## $`Dim 2`
## $`Dim 2`$quali
## R2 p.value
## pos_feature 0.7509786 0
## pos_translated 0.5262071 0
## a1 0.4702413 0
## a2 0.1167293 0
##
## $`Dim 2`$category
## Estimate p.value
## a2=a2_none 0.56228112 0.000000e+00
## a1=a1_past_tense 0.72086326 0.000000e+00
## a1=a1_none 0.31530739 0.000000e+00
## pos_translated=pos_translated_other 1.44457983 0.000000e+00
## pos_feature=pos_feature_verb 0.15484159 0.000000e+00
## pos_feature=pos_feature_other 1.25985299 0.000000e+00
## a1=a1_third_person 0.54617139 7.565060e-264
## a1=a1_present_participle 0.33965769 5.340420e-231
## a2=a2_characteristic 0.09453136 7.207779e-73
## a2=a2_past_tense 0.86695131 8.368665e-17
## a1=a1_time 0.40786894 2.267594e-15
## a2=a2_actions_process 0.25379738 4.745612e-06
## a1=a1_not 0.11399405 8.801548e-06
## a2=a2_third_person 0.81682119 6.587179e-04
## a1=a1_magnitude 0.12828675 7.697514e-04
## a1=a1_opposites_wrong 0.31653910 2.581146e-03
## a2=a2_location -0.50233933 3.565496e-02
## a2=a2_magnitude -0.03076440 2.079612e-03
## pos_translated=pos_translated_adjective -0.45041427 7.907533e-04
## a2=a2_time -1.20021982 5.261517e-07
## a2=a2_slang -1.38962769 4.668725e-10
## a1=a1_slang -0.77187138 4.935940e-12
## a2=a2_person_object -0.47995103 2.849101e-52
## a1=a1_characteristic -0.18493008 3.915540e-128
## a2=a2_numbers -0.71298512 0.000000e+00
## a1=a1_person_object -0.85455684 0.000000e+00
## a1=a1_numbers -0.67985194 0.000000e+00
## a1=a1_actions_process -0.51139051 0.000000e+00
## pos_translated=pos_translated_verb -0.12961452 0.000000e+00
## pos_translated=pos_translated_noun -0.86455104 0.000000e+00
## pos_feature=pos_feature_noun -0.97106479 0.000000e+00
##
##
## $`Dim 3`
## $`Dim 3`$quali
## R2 p.value
## pos_feature 0.6785920 0
## pos_translated 0.4532433 0
## a1 0.3978683 0
## a2 0.1874376 0
##
## $`Dim 3`$category
## Estimate p.value
## a2=a2_past_tense 1.6004423 0.000000e+00
## a2=a2_characteristic 0.4114460 1.433708e-315
## a1=a1_not 1.4031875 0.000000e+00
## a1=a1_magnitude 1.1958762 0.000000e+00
## a1=a1_characteristic 0.1414688 0.000000e+00
## pos_translated=pos_translated_noun 0.1332089 0.000000e+00
## pos_translated=pos_translated_adjective 0.9158582 0.000000e+00
## pos_feature=pos_feature_adjective 0.9990786 0.000000e+00
## a2=a2_actions_process 1.1372599 3.014382e-185
## a1=a1_past_tense 0.1132825 9.665706e-144
## a2=a2_present_participle 0.7371994 7.436795e-131
## pos_feature=pos_feature_verb 0.3395428 2.398048e-112
## pos_translated=pos_translated_verb 0.3850374 1.154297e-69
## a1=a1_opposites_wrong 0.8676726 7.119226e-30
## a2=a2_not 0.6292867 2.378004e-24
## a2=a2_magnitude 0.9873431 6.847873e-17
## a1=a1_time 0.1159710 6.522110e-13
## a2=a2_third_person 0.1226626 1.969468e-10
## a2=a2_time -1.1062469 3.734607e-02
## a1=a1_present_participle -0.2623936 3.252439e-04
## a1=a1_actions_process -0.2726630 2.964159e-04
## a2=a2_slang -1.8119996 2.167998e-06
## a1=a1_slang -0.9562133 2.407952e-11
## a1=a1_none -0.3639707 5.084896e-151
## a2=a2_none -0.4568427 7.903856e-240
## a2=a2_numbers -1.1097715 1.030554e-251
## a1=a1_person_object -0.7122676 2.081277e-273
## a1=a1_numbers -0.7156212 0.000000e+00
## pos_translated=pos_translated_other -1.4341045 0.000000e+00
## pos_feature=pos_feature_other -1.2031432 0.000000e+00
## pos_feature=pos_feature_noun -0.1354782 0.000000e+00
What are the largest predictors (i.e., R^2 over .25) of the first dimension?
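A minimal sketch of how the threshold could be applied: the R2 values are copied from the `$`Dim 1`$quali` output above into a named vector (the name `r2_dim1` is illustrative), then filtered at .25. In a live session the same column could presumably be taken from the `dimdesc()` result instead of being retyped.

```r
# R2 values copied from the Dim 1 quali table in the dimdesc() output above
r2_dim1 <- c(
  pos_feature    = 0.78315711,
  pos_translated = 0.64471297,
  a1             = 0.55595804,
  a2             = 0.04393332
)

# Keep only the variables whose R2 exceeds .25, largest first
strong <- sort(r2_dim1[r2_dim1 > 0.25], decreasing = TRUE)
strong
```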
Looking at the category output for dimension one, what types of features does this appear to represent? (Try looking at the largest positive estimates to help distinguish what is represented by this dimension).
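-The largest positive estimates on the first dimension are pos_feature_other (1.35) and pos_translated_other (1.31), followed by a1_magnitude (1.15), a1_not (0.88), and a2_characteristic (0.83).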
To view simple categories like we did in the lecture, try picking a few words out of the dataset that might be considered similar. I’ve shown how to do this below with three words, but feel free to pick your own. Change the words and the DF to your dataframe name. We will overlay those as supplemental variables.
##r chunk
#pick any several interesting words
words = c("mom", "family", "relative")
mca_data1 = read.csv("mca_data.csv")
mca_model2 = MCA(mca_data1[mca_data1$cue %in% words , ],
quali.sup = 1, #supplemental variable
graph = TRUE)
Create a 2D plot of your category analysis.
##r chunk
plot(mca_model2)
Add the prototype ellipses to the plot.
##r chunk
plotellipses(mca_model2, means=FALSE, keepvar=1, label="quali")
Create a 95% CI type plot for the category.
##r chunk
plotellipses(mca_model2, keepvar=1, label="quali")
What can you tell about the categories from these plots? Are they distinct or overlapping?
In this section, run the same MCA from above in Python. Include the MCA code and print out the inertia values for your analysis.
##python chunk
#import prince
#mca = prince.MCA( ##set up the mca analysis
# n_components=2,
# n_iter=3,
# copy=True,
# check_input=True,
# engine='auto',
# random_state=42)
#mca_data_py = r.mca_data1
#mca_data_py = mca_data_py.drop(['cue'], axis=1)
#mca = mca.fit(mca_data_py)
#mca.explained_inertia_
Plot the results of your MCA using Python in the section below. I have included Python code below that will help if you are completing this assignment on the cloud.
##python chunk
#import matplotlib
#matplotlib.use('Agg')
Do the R and Python results from the MCA show you the same answer? Do you detect any differences between the outputs?