Data Analysis in the Social Sciences

Principal component analysis

Let us load a file that contains data on ratings of judges provided by lawyers (expert ratings). Lawyers evaluated different traits and abilities of US Supreme court judges and then their ratings were scaled. Full description of the dataset can be found here.

Suppose we want to get several indicators of judges’ characteristics. At the same time we want to get less variables than we have originally (12) since we are not interested in working with many correlated variables that duplicate each other. In other words, we should create several aggregate indices based on original numeric 12 variables. To do so we need Principal Component Analysis (PCA) as a method of dimensionality reduction.

Firstly, let us skip non-numeric columns (here it is X, names of judges):

mat <- dat[,2:13]

Secondly, let us look at correlations so as to make sure that original indicators have something in common:

# correlation matrix
cor(mat)

##             CONT       INTG       DMNR      DILG      CFMG       DECI
## CONT  1.00000000 -0.1331909 -0.1536885 0.0123920 0.1369123 0.08653823
## INTG -0.13319089  1.0000000  0.9646153 0.8715111 0.8140858 0.80284636
## DMNR -0.15368853  0.9646153  1.0000000 0.8368510 0.8133582 0.80411683
## DILG  0.01239200  0.8715111  0.8368510 1.0000000 0.9587988 0.95616608
## CFMG  0.13691230  0.8140858  0.8133582 0.9587988 1.0000000 0.98113590
## DECI  0.08653823  0.8028464  0.8041168 0.9561661 0.9811359 1.00000000
## PREP  0.01146921  0.8777965  0.8558175 0.9785684 0.9579140 0.95708831
## FAMI -0.02563656  0.8688580  0.8412415 0.9573634 0.9354684 0.94280452
## ORAL -0.01199681  0.9113992  0.9067729 0.9544758 0.9505657 0.94825640
## WRIT -0.04381025  0.9088347  0.8930611 0.9592503 0.9422470 0.94610093
## PHYS  0.05424827  0.7419360  0.7886804 0.8129211 0.8794874 0.87176277
## RTEN -0.03364343  0.9372632  0.9437002 0.9299652 0.9270827 0.92499241
##            PREP        FAMI        ORAL        WRIT       PHYS        RTEN
## CONT 0.01146921 -0.02563656 -0.01199681 -0.04381025 0.05424827 -0.03364343
## INTG 0.87779650  0.86885798  0.91139915  0.90883469 0.74193597  0.93726315
## DMNR 0.85581749  0.84124150  0.90677295  0.89306109 0.78868038  0.94370017
## DILG 0.97856839  0.95736345  0.95447583  0.95925032 0.81292115  0.92996523
## CFMG 0.95791402  0.93546838  0.95056567  0.94224697 0.87948744  0.92708271
## DECI 0.95708831  0.94280452  0.94825640  0.94610093 0.87176277  0.92499241
## PREP 1.00000000  0.98986345  0.98310045  0.98679918 0.84867350  0.95029259
## FAMI 0.98986345  1.00000000  0.98133905  0.99069557 0.84374436  0.94164495
## ORAL 0.98310045  0.98133905  1.00000000  0.99342943 0.89116392  0.98213227
## WRIT 0.98679918  0.99069557  0.99342943  1.00000000 0.85594002  0.96755639
## PHYS 0.84867350  0.84374436  0.89116392  0.85594002 1.00000000  0.90654782
## RTEN 0.95029259  0.94164495  0.98213227  0.96755639 0.90654782  1.00000000

Yes, there are high values of Pearson’s correlation coefficient. Now we are ready to perform PCA. As we discussed, we should scale our date first so as to have a center of a distribution at 0 and a standard deviation equal to 1. So, we use the function prcomp() and add options for scaling:

pca <- prcomp(mat, center = TRUE, scale = TRUE)
pca

## Standard deviations (1, .., p=12):
##  [1] 3.18331647 1.05078398 0.57697626 0.50383231 0.29060762 0.19309598
##  [7] 0.14029545 0.12415832 0.08850690 0.07491146 0.05708042 0.04539134
## 
## Rotation (n x k) = (12 x 12):
##               PC1          PC2          PC3          PC4          PC5
## CONT  0.003075143 -0.932890644  0.334756548  0.058576867  0.093438368
## INTG -0.288550775  0.182040993  0.549360126  0.173977074 -0.014543880
## DMNR -0.286884206  0.197565743  0.556490386 -0.124412022 -0.228832817
## DILG -0.304354091 -0.036304667 -0.163629910  0.321395544 -0.301936920
## CFMG -0.302572733 -0.168393523 -0.207341904  0.012949223 -0.448430522
## DECI -0.301891969 -0.127877299 -0.297902771  0.030491779 -0.424003128
## PREP -0.309406446 -0.032230248 -0.151869345  0.213656069  0.202853400
## FAMI -0.306679527  0.001315183 -0.195290454  0.200651140  0.507470003
## ORAL -0.312708348  0.003625720 -0.002150634 -0.007441042  0.246059421
## WRIT -0.311061231  0.031378756 -0.056045596  0.137104995  0.305562842
## PHYS -0.280723624 -0.089037698 -0.154000444 -0.841266046  0.118424976
## RTEN -0.309790218  0.039381306  0.172869757 -0.184223629  0.006717911
##               PC6          PC7           PC8         PC9        PC10
## CONT -0.004064432  0.005214784 -6.006597e-02  0.02514533 -0.03038881
## INTG  0.369937339 -0.449810741  3.341645e-01  0.27537794  0.10897641
## DMNR -0.394724667  0.466747889 -2.470974e-01  0.19910004 -0.07241282
## DILG  0.598676072  0.209710731 -3.548587e-01 -0.03977180 -0.38339165
## CFMG -0.085728870  0.246903359  7.135261e-01 -0.14342471  0.09850310
## DECI -0.392609484 -0.536429933 -3.024227e-01  0.25823773  0.06743847
## PREP  0.083216652  0.335390036 -1.536754e-01  0.10876864  0.67986284
## FAMI -0.101538704 -0.036378004  2.038889e-02  0.22306628  0.04004599
## ORAL -0.150272440  0.057580177  9.062990e-02 -0.29951714 -0.25599455
## WRIT -0.238172386 -0.060899994  1.261203e-01 -0.02497324 -0.47478254
## PHYS  0.299281534  0.024959951 -1.364511e-05  0.26627286 -0.05900837
## RTEN  0.036126847 -0.256194180 -2.213898e-01 -0.75645893  0.24993250
##               PC11         PC12
## CONT  0.0145329260 -0.007940919
## INTG -0.1125535650  0.009848658
## DMNR  0.1343234234  0.059121657
## DILG  0.0709517642  0.053790339
## CFMG  0.1658680927  0.025082947
## DECI -0.1284999526  0.044141604
## PREP -0.3187612119 -0.273286884
## FAMI  0.5733628652  0.421739844
## ORAL -0.6386061655  0.494391025
## WRIT  0.0004056397 -0.696107204
## PHYS -0.0181381019 -0.053783960
## RTEN  0.2855143026 -0.080267574

What do we see in this output? Weights of older variables in newer variables that are called principal components (PC1, PC2, PC3, etc.). So, we can say that CONT has the weight of 0.003 in PC1, INTG has the weight of -0.288 in PC1, and so on. In other words, we can regard PC1 as a new integral index of jugde behaviour (one instead of twelve). Just recall your cumulative score that is combined from several grades taken from some weights.

Now we can try to choose several components and interpret what do they mean. It is a substantial thing, not a techinal one. If we look at the first component PC1, we see that only number of contacts with a lawyer contributes positively to PC1 (CONT has a positive weight and others have a negative one). Thus, we can think that the first component is a measure of sociability of a judge. In the second component the highest weight has the openess and good manners of a judge, so we can say that PC2 is responsible for manners. The forth component is also interesting, we can interpret it as an overall competence: preparation for the hearings and knowledge of laws. The fifth one can be regarded as a general performance: ability to provide written and oral rulings and good physical tenure. Let us stop here and decide which components are most informative.

summary(pca)

## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6
## Standard deviation     3.1833 1.05078 0.57698 0.50383 0.29061 0.19310
## Proportion of Variance 0.8445 0.09201 0.02774 0.02115 0.00704 0.00311
## Cumulative Proportion  0.8445 0.93647 0.96421 0.98537 0.99240 0.99551
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.14030 0.12416 0.08851 0.07491 0.05708 0.04539
## Proportion of Variance 0.00164 0.00128 0.00065 0.00047 0.00027 0.00017
## Cumulative Proportion  0.99715 0.99844 0.99909 0.99956 0.99983 1.00000

Usually the fisrt component is the most informative, then goes the second one, and so on (just by the logic of PCA). Often we are interested in those principal components that are interpretable substantially, but if we want to get some proof, we can create a scree plot (or an elbow plot as it called sometimes):

plot(pca, type = "l", main = "Scree plot")

It shows the decrease in the variances of principal components, so we should take such components that go before the bend (here we can take two approximately).

Now we can predict the values of our new indices (principal components) for every observation in data.

mat2 <- predict(pca, newdata = mat)

Look at the fisrt 3 rows:

View(head(mat2, 3))

For example, for the first judge the value of PC1 is 0.5857, for the second one is , and so on. In other words, we can say that the value of our ‘sociability’ index for the first judge is 0.5857 and for the second one is -2.38. The same logic is for other components.

And check that all correlation coefficients between principal components are zero (approximately) as expected:

cor(mat2)

##                PC1           PC2           PC3           PC4           PC5
## PC1   1.000000e+00 -3.144469e-16 -1.533744e-16 -5.415659e-16 -1.894434e-15
## PC2  -3.144469e-16  1.000000e+00  3.656561e-16 -3.736549e-17  1.004348e-16
## PC3  -1.533744e-16  3.656561e-16  1.000000e+00  1.235125e-16  9.875528e-16
## PC4  -5.415659e-16 -3.736549e-17  1.235125e-16  1.000000e+00  4.966328e-16
## PC5  -1.894434e-15  1.004348e-16  9.875528e-16  4.966328e-16  1.000000e+00
## PC6   3.269063e-16 -1.962836e-16  4.255859e-17  9.844700e-16  2.488013e-17
## PC7   1.668104e-15  3.115529e-16  2.824472e-15 -7.492448e-16  4.299730e-16
## PC8  -1.990301e-15  4.693585e-16 -1.846566e-15  3.567437e-16 -1.434178e-15
## PC9   4.392010e-15 -7.434676e-16 -1.761981e-15 -6.846485e-16  1.816207e-16
## PC10 -1.299895e-15 -6.423597e-17  1.134284e-15  1.060094e-15  9.588901e-16
## PC11  6.089771e-15 -6.559170e-16 -3.045706e-15 -1.473830e-15 -1.728038e-15
## PC12  1.186571e-15 -4.869973e-17  2.333852e-15  1.116960e-16  1.199785e-15
##                PC6           PC7           PC8           PC9          PC10
## PC1   3.269063e-16  1.668104e-15 -1.990301e-15  4.392010e-15 -1.299895e-15
## PC2  -1.962836e-16  3.115529e-16  4.693585e-16 -7.434676e-16 -6.423597e-17
## PC3   4.255859e-17  2.824472e-15 -1.846566e-15 -1.761981e-15  1.134284e-15
## PC4   9.844700e-16 -7.492448e-16  3.567437e-16 -6.846485e-16  1.060094e-15
## PC5   2.488013e-17  4.299730e-16 -1.434178e-15  1.816207e-16  9.588901e-16
## PC6   1.000000e+00 -1.902228e-16 -1.044185e-15 -1.812230e-15  8.156293e-16
## PC7  -1.902228e-16  1.000000e+00  9.807703e-16  1.354460e-15  3.514828e-16
## PC8  -1.044185e-15  9.807703e-16  1.000000e+00  1.209401e-15  1.845993e-16
## PC9  -1.812230e-15  1.354460e-15  1.209401e-15  1.000000e+00 -4.674462e-16
## PC10  8.156293e-16  3.514828e-16  1.845993e-16 -4.674462e-16  1.000000e+00
## PC11 -3.175632e-16 -6.032765e-18 -8.961713e-16  2.167627e-16 -1.171468e-15
## PC12 -4.097175e-16  8.988985e-16  1.548460e-15 -3.043606e-16 -1.003605e-15
##               PC11          PC12
## PC1   6.089771e-15  1.186571e-15
## PC2  -6.559170e-16 -4.869973e-17
## PC3  -3.045706e-15  2.333852e-15
## PC4  -1.473830e-15  1.116960e-16
## PC5  -1.728038e-15  1.199785e-15
## PC6  -3.175632e-16 -4.097175e-16
## PC7  -6.032765e-18  8.988985e-16
## PC8  -8.961713e-16  1.548460e-15
## PC9   2.167627e-16 -3.043606e-16
## PC10 -1.171468e-15 -1.003605e-15
## PC11  1.000000e+00 -7.467347e-16
## PC12 -7.467347e-16  1.000000e+00

Now we can join our original data frame with a new one so as make different comparisons and further analysis possible:

with_pc <- cbind(mat, mat2[,1:2])

Data Analysis in the Social Sciences

Alla Tambovtseva

Principal component analysis