Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

Note: If you are using the VM Server, you will need to install the ggplot2 package. Use the Install button in the Packages tab in the lower right-hand corner of the screen.

##r chunk
library(reticulate)
py_config()
## python:         /Users/hailunfeng/Library/r-miniconda/envs/r-reticulate/bin/python
## libpython:      /Users/hailunfeng/Library/r-miniconda/envs/r-reticulate/lib/libpython3.6m.dylib
## pythonhome:     /Users/hailunfeng/Library/r-miniconda/envs/r-reticulate:/Users/hailunfeng/Library/r-miniconda/envs/r-reticulate
## version:        3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 18:53:43)  [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
## numpy:          /Users/hailunfeng/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/numpy
## numpy_version:  1.19.1
library(FactoMineR)
library(ca)
library(ellipse)
## 
## Attaching package: 'ellipse'
## The following object is masked from 'package:graphics':
## 
##     pairs
library(ggplot2)

The Data

The purpose of the assignment is to understand how gendered pronouns have been used in the New York Times. Each line in the dataset represents an instance where a gendered pronoun was used in a New York Times article. The variables in the dataset are:

- Gender: masculine pronoun (he, him, his) versus feminine pronoun (she, her, hers)
- DictionaryWord: the specific pronoun (he, him, his, she, her, or hers)
- Decade: when the article was written (1990s, 2000s, or 2010s)
- ArticleType: what the article was about (Arts and Entertainment or Sports)
- PronounType: function of the pronoun (subject, object, or possessive)
- Career: was the pronoun used in reference to someone’s career (TRUE or FALSE)
- Family: was the pronoun used in reference to someone’s family (TRUE or FALSE)
- Great: was the word great used to describe the person (TRUE or FALSE)
- Beautiful: was the word beautiful used to describe the person (TRUE or FALSE)

Simple Correspondence Analysis

Using simple correspondence analysis, explore the dimensions underlying the use of different gendered pronouns over time. The R chunk below subsets the data and creates a frequency table to run the correspondence analysis on.

##r chunk
df<- read.csv('GenderPronounsNYT.csv')
df[,7:10]<- lapply(df[,7:10], factor)
freq<- df[,c('DictionaryWord', 'Decade')]
freq<- table(df$DictionaryWord,df$Decade)
freq
##       
##        1990s 2000s 2010s
##   he    1429  1457  1302
##   her    514   525   641
##   hers     4     4     2
##   him    331   348   226
##   his   1384  1350  1333
##   she    338   316   496

What patterns do you notice in the frequency table?

Answer: The pronouns “he,” “hers,” “him,” and “his” were used less often in the 2010s than in the 1990s. Over the same period, “her” and “she” were used more frequently by the New York Times (for example, “she” rose from 338 to 496 uses), which suggests that more of the articles in the 2010s referred to women.
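The pattern in the frequency table can be checked with a few lines of arithmetic. The sketch below copies the counts from the table by hand (it does not read the CSV) and computes the share of feminine pronouns in each decade. The variable names are illustrative only.

```python
# Counts copied from the frequency table above (rows: pronoun, columns: decade).
decades = ["1990s", "2000s", "2010s"]
freq = {
    "he":   [1429, 1457, 1302],
    "her":  [514, 525, 641],
    "hers": [4, 4, 2],
    "him":  [331, 348, 226],
    "his":  [1384, 1350, 1333],
    "she":  [338, 316, 496],
}
feminine = {"she", "her", "hers"}

# Total pronoun count per decade, and the part of it that is feminine.
totals = [sum(row[j] for row in freq.values()) for j in range(len(decades))]
fem_counts = [sum(freq[w][j] for w in feminine) for j in range(len(decades))]
fem_share = [f / t for f, t in zip(fem_counts, totals)]

for d, f, t, s in zip(decades, fem_counts, totals, fem_share):
    print(f"{d}: {f} of {t} pronouns are feminine ({s:.1%})")
```

With these counts every decade contributes exactly 4,000 pronouns, so the raw counts are directly comparable across decades, and the feminine share moves from roughly 21% in the 1990s and 2000s to roughly 28% in the 2010s.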

The Analysis

Run a simple correspondence analysis on the data.

##r chunk 

simple_model<- ca(freq)
summary(simple_model)
## 
## Principal inertias (eigenvalues):
## 
##  dim    value      %   cum%   scree plot               
##  1      0.008946  98.9  98.9  *************************
##  2      0.000096   1.1 100.0                           
##         -------- -----                                 
##  Total: 0.009042 100.0                                 
## 
## 
## Rows:
##     name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr  
## 1 |   he |  349 1000   90 |  -48 993  90 |    4   7  61 |
## 2 |  her |  140 1000  163 |  101 973 160 |   17  27 410 |
## 3 | hers |    1 1000    7 | -282 993   7 |  -24   7   5 |
## 4 |  him |   75 1000  267 | -179 998 269 |    8   2  47 |
## 5 |  his |  339 1000    9 |  -11 486   5 |  -11 514 446 |
## 6 |  she |   96 1000  464 |  209 999 468 |   -6   1  31 |
## 
## Columns:
##     name   mass  qlt  inr    k=1  cor ctr    k=2 cor ctr  
## 1 | 1990 |  333 1000  124 |  -57  954 120 |  -13  46 547 |
## 2 | 2000 |  333 1000  221 |  -77  979 218 |   11  21 448 |
## 3 | 2010 |  333 1000  655 |  133 1000 662 |    1   0   5 |

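What do the inertia values tell you about the dimensionality of the data?

Answer: The first dimension captures 98.9% of the inertia (variance), and the second dimension captures the remaining 1.1%, so the two dimensions together account for all of it. The data are therefore close to one-dimensional.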
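The principal inertias in the summary can be reproduced by hand, which may help show where the percentages come from. The sketch below is not how the ca package computes them internally (that is an assumption-free statement only about this sketch): it builds the standardized residuals from the table and, because the table has three columns, finds the two nonzero eigenvalues with a quadratic formula instead of a full SVD.

```python
import math

# Counts copied from the frequency table above (rows: pronoun, columns: decade).
table = [
    [1429, 1457, 1302],  # he
    [514, 525, 641],     # her
    [4, 4, 2],           # hers
    [331, 348, 226],     # him
    [1384, 1350, 1333],  # his
    [338, 316, 496],     # she
]

n = sum(map(sum, table))
row_mass = [sum(row) / n for row in table]
col_mass = [sum(row[j] for row in table) / n for j in range(3)]

# Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j).
S = [
    [(table[i][j] / n - row_mass[i] * col_mass[j])
     / math.sqrt(row_mass[i] * col_mass[j])
     for j in range(3)]
    for i in range(len(table))
]

# G = S'S is 3x3; its eigenvalues are the two principal inertias plus one zero.
G = [[sum(S[i][a] * S[i][b] for i in range(len(S))) for b in range(3)]
     for a in range(3)]

# Trace of G = lambda1 + lambda2 (the total inertia).
total_inertia = G[0][0] + G[1][1] + G[2][2]
# Sum of the 2x2 principal minors = lambda1 * lambda2, since the third eigenvalue is 0.
minors = (G[0][0] * G[1][1] - G[0][1] ** 2
          + G[0][0] * G[2][2] - G[0][2] ** 2
          + G[1][1] * G[2][2] - G[1][2] ** 2)

# Solve x^2 - trace * x + minors = 0 for the two nonzero eigenvalues.
disc = math.sqrt(max(total_inertia ** 2 - 4 * minors, 0.0))
inertias = [(total_inertia + disc) / 2, (total_inertia - disc) / 2]
shares = [lam / total_inertia for lam in inertias]

print([round(lam, 6) for lam in inertias], [round(100 * s, 1) for s in shares])
```

The total inertia equals the chi-square statistic of the table divided by the number of observations, which is why a table with nearly proportional rows has such a small total.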

Create a 2D plot of the data.

##r chunk

plot(simple_model)

What can you tell about the pronoun usage from examining this plot?

Answer: The graph above shows that “he” and “his” are used relatively more often in the 1990s and 2000s, while “her” and “she” are used relatively more often in the 2010s.
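What the plot displays can also be read off the row profiles, that is, how each pronoun's uses are spread across the three decades compared with the average spread. The sketch below uses the counts from the frequency table, copied by hand; the `leans` dictionary is just an illustrative name.

```python
# Counts copied from the frequency table above (rows: pronoun, columns: decade).
decades = ["1990s", "2000s", "2010s"]
freq = {
    "he":   [1429, 1457, 1302],
    "her":  [514, 525, 641],
    "hers": [4, 4, 2],
    "him":  [331, 348, 226],
    "his":  [1384, 1350, 1333],
    "she":  [338, 316, 496],
}

grand_total = sum(sum(counts) for counts in freq.values())
# Average profile: share of all pronouns that falls in each decade.
average = [sum(counts[j] for counts in freq.values()) / grand_total
           for j in range(3)]

leans = {}
for word, counts in freq.items():
    # Row profile: how this pronoun's uses are spread over the decades.
    profile = [c / sum(counts) for c in counts]
    # Decade where the pronoun is most over-represented relative to the average.
    best = max(range(3), key=lambda j: profile[j] - average[j])
    leans[word] = decades[best]
    print(word, [round(p, 3) for p in profile], "leans toward", leans[word])
```

Only “she” and “her” lean toward the 2010s in this check, which agrees with their position on the plot. The “hers” row rests on just 10 observations, so its profile should be read with caution.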

Multiple Correspondence Analysis

Using the full dataset, explore the clusters which may represent gendered pronouns (Gender variable). Be sure to remove the Filename and DictionaryWord columns before running the MCA and exclude the Gender column in your first MCA model. Use all the other variables in the model.

##r chunk

multiple_model<- MCA(df[4:10], graph = FALSE)
summary(multiple_model)
## 
## Call:
## MCA(X = df[4:10], graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.171   0.153   0.145   0.144   0.142   0.142   0.140
## % of var.             13.271  11.912  11.289  11.226  11.061  11.021  10.903
## Cumulative % of var.  13.271  25.183  36.472  47.698  58.759  69.780  80.683
##                        Dim.8   Dim.9
## Variance               0.132   0.116
## % of var.             10.278   9.038
## Cumulative % of var.  90.962 100.000
## 
## Individuals (the 10 first)
##                           Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                      |  0.059  0.000  0.006 | -0.173  0.002  0.049 |  0.311
## 2                      |  0.059  0.000  0.006 | -0.173  0.002  0.049 |  0.311
## 3                      |  0.059  0.000  0.006 | -0.173  0.002  0.049 |  0.311
## 4                      |  0.059  0.000  0.006 | -0.173  0.002  0.049 |  0.311
## 5                      |  0.059  0.000  0.006 | -0.173  0.002  0.049 |  0.311
## 6                      |  0.059  0.000  0.006 | -0.173  0.002  0.049 |  0.311
## 7                      |  0.059  0.000  0.006 | -0.173  0.002  0.049 |  0.311
## 8                      |  0.059  0.000  0.006 | -0.173  0.002  0.049 |  0.311
## 9                      |  0.059  0.000  0.006 | -0.173  0.002  0.049 |  0.311
## 10                     |  0.059  0.000  0.006 | -0.173  0.002  0.049 |  0.311
##                           ctr   cos2  
## 1                       0.006  0.158 |
## 2                       0.006  0.158 |
## 3                       0.006  0.158 |
## 4                       0.006  0.158 |
## 5                       0.006  0.158 |
## 6                       0.006  0.158 |
## 7                       0.006  0.158 |
## 8                       0.006  0.158 |
## 9                       0.006  0.158 |
## 10                      0.006  0.158 |
## 
## Categories (the 10 first)
##                            Dim.1     ctr    cos2  v.test     Dim.2     ctr
## 1990s                  |  -0.012   0.004   0.000  -0.908 |  -0.020   0.013
## 2000s                  |   0.059   0.096   0.002   4.551 |  -0.305   2.887
## 2010s                  |  -0.047   0.062   0.001  -3.643 |   0.325   3.281
## Arts and Entertainment |   0.745  23.238   0.555  81.611 |   0.025   0.030
## Sports                 |  -0.745  23.238   0.555 -81.611 |  -0.025   0.030
## Object                 |   1.263  28.749   0.438  72.466 |  -0.016   0.005
## Possessive             |  -0.108   0.329   0.006  -8.451 |   0.907  26.078
## Subject                |  -0.529  10.432   0.224 -51.894 |  -0.685  19.470
## Career_FALSE           |   0.040   0.130   0.070  28.966 |  -0.102   0.956
## Career_TRUE            |  -1.756   5.725   0.070 -28.966 |   4.516  42.175
##                           cos2  v.test     Dim.3     ctr    cos2  v.test  
## 1990s                    0.000  -1.562 |  -0.828  22.497   0.343 -64.141 |
## 2000s                    0.046 -23.601 |   0.102   0.341   0.005   7.892 |
## 2010s                    0.053  25.163 |   0.726  17.302   0.264  56.249 |
## Arts and Entertainment   0.001   2.787 |  -0.038   0.072   0.001  -4.179 |
## Sports                   0.001  -2.787 |   0.038   0.072   0.001   4.179 |
## Object                   0.000  -0.928 |   0.314   2.092   0.027  18.031 |
## Possessive               0.423  71.281 |  -0.456   6.957   0.107 -35.842 |
## Subject                  0.376 -67.168 |   0.196   1.686   0.031  19.243 |
## Career_FALSE             0.462 -74.488 |  -0.027   0.069   0.032 -19.480 |
## Career_TRUE              0.462  74.488 |   1.181   3.044   0.032  19.480 |
## 
## Categorical variables (eta2)
##                          Dim.1 Dim.2 Dim.3  
## Decade                 | 0.002 0.066 0.408 |
## ArticleType            | 0.555 0.001 0.001 |
## PronounType            | 0.472 0.488 0.109 |
## Career                 | 0.070 0.462 0.032 |
## Family                 | 0.020 0.032 0.173 |
## Great                  | 0.004 0.020 0.257 |
## Beautiful              | 0.071 0.003 0.036 |

Plot the variables in a 2D graph. Use invis = "ind" rather than col.ind = "gray" so you can read the plot better.

##r chunk

plot(multiple_model, cex = .7, col.var = "black", col.ind = "gray", invis = "ind")

Use the dimdesc function (specify the dimension and output, e.g. dimdesc(mca_model)[['Dim 1']]$quali) to show the usefulness of the variables and to help you understand the results.

##r chunk
dimdesc(multiple_model)[['Dim 1']]$quali
##                      R2
## ArticleType 0.555076817
## PronounType 0.471895900
## Beautiful   0.070990821
## Career      0.069926295
## Family      0.020036773
## Great       0.004484384
## Decade      0.001934022
##                                                                                                                                                                                                                p.value
## ArticleType 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## PronounType 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Beautiful   0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003875365
## Career      0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003760586697
## Family      0.00000000000000000000000000000000000000000000000000000094880385970060989939811547973777538019298595299507011401138485451102188919799082808460660804926317202189751499034708082187535461620922624600971140
## Great       0.00000000000020856693443996706141326654382693644163558090165455638498315238393843173980712890625000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Decade      0.00000905253969254840187657045608160544247766665648669004440307617187500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
dimdesc(multiple_model)[['Dim 1']]$category
##                                       Estimate
## PronounType=Object                  0.43534713
## ArticleType=Arts and Entertainment  0.30774601
## Beautiful=Beautiful_TRUE            0.97943255
## Career=Career_FALSE                 0.37095697
## Family=Family_TRUE                  0.29142677
## Great=Great_TRUE                    0.12171325
## Decade=2000s                        0.02426956
## Decade=2010s                       -0.01942983
## Great=Great_FALSE                  -0.12171325
## PronounType=Possessive             -0.13058081
## Family=Family_FALSE                -0.29142677
## Career=Career_TRUE                 -0.37095697
## Beautiful=Beautiful_FALSE          -0.97943255
## PronounType=Subject                -0.30476631
## ArticleType=Sports                 -0.30774601
##                                                                                                                                                                                                                                       p.value
## PronounType=Object                 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## ArticleType=Arts and Entertainment 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Beautiful=Beautiful_TRUE           0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003875365
## Career=Career_FALSE                0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003760586696
## Family=Family_TRUE                 0.00000000000000000000000000000000000000000000000000000094880385969127081378683780658001269550477053688281934231686875385265809679245990531674708125381275219361598674621261748703824169799643300712843619
## Great=Great_TRUE                   0.00000000000020856693444116522122143966716556243474350353095392307523070485331118106842041015625000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Decade=2000s                       0.00000529751946315421301823616814785644635321659734472632408142089843750000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Decade=2010s                       0.00026821237299664435352677949175870253384346142411231994628906250000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Great=Great_FALSE                  0.00000000000020856693444116522122143966716556243474350353095392307523070485331118106842041015625000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## PronounType=Possessive             0.00000000000000002595609492685777930618392376223026773375389346224829945075640580398612655699253082275390625000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Family=Family_FALSE                0.00000000000000000000000000000000000000000000000000000094880385969133862271178006100222060379002070466767309311048041971942169141275574235062070447803192180512827040046132868290369671918489625703219126
## Career=Career_TRUE                 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003760586696
## Beautiful=Beautiful_FALSE          0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003875365
## PronounType=Subject                0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## ArticleType=Sports                 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
dimdesc(multiple_model)[['Dim 2']]$quali
##                      R2
## PronounType 0.488378064
## Career      0.462414867
## Decade      0.066263619
## Family      0.031631089
## Great       0.020231405
## Beautiful   0.002538027
## ArticleType 0.000647125
##                                                                                                                                                                                                 p.value
## PronounType 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Career      0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Decade      0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002456674
## Family      0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000007416651380725288094579831304485420742502170254758546687161525030320826770552519253689890070519534547
## Great       0.00000000000000000000000000000000000000000000000000000028680991613964169097309218201748402055342231510386121447768666838535485210665569367741576774782727653387697907027546245402357064514
## Beautiful   0.00000003358201732701286314849528527419486589877806181902997195720672607421875000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## ArticleType 0.00532271735344243702298117071336491790134459733963012695312500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
dimdesc(multiple_model)[['Dim 2']]$category
##                                        Estimate
## Career=Career_TRUE                  0.903800885
## PronounType=Possessive              0.328146638
## Decade=2010s                        0.127139280
## Family=Family_TRUE                  0.346917511
## Great=Great_TRUE                    0.244936244
## Beautiful=Beautiful_TRUE            0.175458783
## ArticleType=Arts and Entertainment  0.009955511
## ArticleType=Sports                 -0.009955511
## Beautiful=Beautiful_FALSE          -0.175458783
## Great=Great_FALSE                  -0.244936244
## Family=Family_FALSE                -0.346917511
## Decade=2000s                       -0.119246224
## Career=Career_FALSE                -0.903800885
## PronounType=Subject                -0.294951973
##                                                                                                                                                                                    p.value
## Career=Career_TRUE                 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## PronounType=Possessive             0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Decade=2010s                       0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001801324
## Family=Family_TRUE                 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000007416651380670619007654800092086039815386725796667961832109150570
## Great=Great_TRUE                   0.00000000000000000000000000000000000000000000000000000028680991614885414196813154242949945771747395633585092165592799133527552122556575839483609219719
## Beautiful=Beautiful_TRUE           0.00000003358201732738857520016177086834285869798577550682239234447479248046875000000000000000000000000000000000000000000000000000000000000000000000000
## ArticleType=Arts and Entertainment 0.00532271735343326033579325340383547882083803415298461914062500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## ArticleType=Sports                 0.00532271735343325599898456346181774279102683067321777343750000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## Beautiful=Beautiful_FALSE          0.00000003358201732735112046202536977522468131240884758881293237209320068359375000000000000000000000000000000000000000000000000000000000000000000000000
## Great=Great_FALSE                  0.00000000000000000000000000000000000000000000000000000028680991614888290280488589606797866668034181596254636937501328123816563561045619354702394307328
## Family=Family_FALSE                0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000007416651380647946424014477064970061978976653977061033882474346830
## Decade=2000s                       0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000486730063770104133296956
## Career=Career_FALSE                0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## PronounType=Subject                0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

How would you describe the dimensions (based on the graph and usefulness of the variables)?

Answer: In the graph above, categories such as Family_TRUE, Great_TRUE, and Possessive sit very close to each other. Beautiful_FALSE, Great_FALSE, Career_FALSE, Arts and Entertainment, Object, and the three decades form another group. Meanwhile, Beautiful_TRUE sits by itself, and Family_FALSE and Sports are close together.

Looking at the usefulness of the variables, ArticleType (R2 = 0.56) and PronounType (R2 = 0.47) are the variables most strongly associated with the first dimension; every other variable has an R2 of about 0.07 or less. At the category level, the pronoun functioning as an object, the article being about arts and entertainment, the word beautiful being used to describe the person, and whether the pronoun refers to someone’s career contribute most to this dimension.

Similarly, PronounType (R2 = 0.49) and Career (R2 = 0.46) are the variables most strongly associated with the second dimension, driven by categories such as the pronoun referring to someone’s career or family and the pronoun functioning as a possessive.

Simple Categories

Use the Gender labels column (similar to how we used the party label in the example during lecture) to determine if the clusters map onto Masculine versus Feminine pronouns.

##r chunk

multiple_model2<- MCA(df[,-c(1,3)], quali.sup = 1, graph = FALSE)

Create a 2D plot of your category analysis.

##r chunk 

plot(multiple_model2, cex =1.2, col.var = "darkgray", col.quali.sup = "black")

Add the prototype ellipses to the plot. Note this is a relatively large dataset so the confidence ellipses may be small.

##r chunk

plotellipses(multiple_model2, keepvar = 1, label = "quali")

Create a 95% CI type plot for the category.

##r chunk

plotellipses(multiple_model2, means = FALSE, keepvar = 1, label = "quali")

What can you tell about the categories from these plots? Are they distinct or overlapping?

Answer: In the graphs above, the two categories overlap heavily, meaning they are not as divided as we would have expected. This may also be a sign that the New York Times treats both genders similarly in its content, at least on the variables measured here.

Run an MCA in Python

In this section, run the same MCA from above in Python. Include the MCA code and print out the inertia values for your analysis.

##python chunk 
import prince
mca=prince.MCA(n_components=2, n_iter=3, copy=True, check_input=True, engine='auto', random_state=42)

df=r.df
df=df.drop(['Filename', 'DictionaryWord'], axis=1)
mca=mca.fit(df)
mca.explained_inertia_
## [0.17145037907767557, 0.10888059745913085]

Plot the Results

Plot the results of your MCA using Python in the section below. I have included Python code below that will help if you are completing this assignment on the VM server.

##python chunk
import matplotlib
matplotlib.use('Agg')

ax=mca.plot_coordinates(X=df, ax=None, figsize=(10, 10), show_row_points=False, row_points_size=10, show_row_labels=True, show_column_points=True, column_points_size=30, show_column_labels=True, legend_n_cols=2)
ax.legend(loc='upper left')  # keep ax as the Axes object; move the legend to the upper left

ax.get_figure()

Explore the differences

Would you draw different conclusions from the R output versus the Python output? What are some of the differences between the R and Python models?

Answer: Though the two charts differ slightly, they look very similar and would likely lead to similar conclusions. Visually, however, Sports and Beautiful_FALSE are located near Gender_Masculine in the R graph, whereas in the Python graph only ArticleType_Sports is near Gender_Masculine.

Also, the Python inertia values suggest that dimensions 1 and 2 together explain about 28% of the variance (0.171 + 0.109), while the R output reports about 25%.
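The two percentages compared above come from simple sums of the printed values. The sketch below assumes that the numbers printed by mca.explained_inertia_ are proportions of total inertia, which is how they are read in this report; that reading is an assumption, not something the output itself confirms.

```python
# Python: values printed by mca.explained_inertia_ in the chunk above
# (assumed here to be proportions of total inertia).
python_inertia = [0.17145037907767557, 0.10888059745913085]

# R: "% of var." for Dim.1 and Dim.2 in the FactoMineR summary above.
r_pct = [13.271, 11.912]

python_total_pct = 100 * sum(python_inertia)
r_total_pct = sum(r_pct)

print(f"Python: {python_total_pct:.1f}%  R: {r_total_pct:.1f}%")
```

One difference visible in the code itself may contribute to the gap: the Python chunk drops only Filename and DictionaryWord, so Gender is an active variable in that model, whereas the first R model excludes Gender and the second treats it as supplementary.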

What other variables might be missing in our dataset that might lead to more distinct categories?

Answer: Variables such as the length of the article, or how frequently imperative words like “must,” “have to,” and “need to” are used in the surrounding context, might help to distinguish the two categories further.