Let us load a file that contains data on ratings of judges provided by lawyers (expert ratings). Lawyers evaluated different traits and abilities of US Supreme court judges and then their ratings were scaled. Full description of the dataset can be found here.
Suppose we want to get several indicators of judges’ characteristics. At the same time we want to get less variables than we have originally (12) since we are not interested in working with many correlated variables that duplicate each other. In other words, we should create several aggregate indices based on original numeric 12 variables. To do so we need Principal Component Analysis (PCA) as a method of dimensionality reduction.
Firstly, let us skip non-numeric columns (here it is X
, names of judges):
mat <- dat[,2:13]
Secondly, let us look at correlations so as to make sure that original indicators have something in common:
# correlation matrix
cor(mat)
## CONT INTG DMNR DILG CFMG DECI
## CONT 1.00000000 -0.1331909 -0.1536885 0.0123920 0.1369123 0.08653823
## INTG -0.13319089 1.0000000 0.9646153 0.8715111 0.8140858 0.80284636
## DMNR -0.15368853 0.9646153 1.0000000 0.8368510 0.8133582 0.80411683
## DILG 0.01239200 0.8715111 0.8368510 1.0000000 0.9587988 0.95616608
## CFMG 0.13691230 0.8140858 0.8133582 0.9587988 1.0000000 0.98113590
## DECI 0.08653823 0.8028464 0.8041168 0.9561661 0.9811359 1.00000000
## PREP 0.01146921 0.8777965 0.8558175 0.9785684 0.9579140 0.95708831
## FAMI -0.02563656 0.8688580 0.8412415 0.9573634 0.9354684 0.94280452
## ORAL -0.01199681 0.9113992 0.9067729 0.9544758 0.9505657 0.94825640
## WRIT -0.04381025 0.9088347 0.8930611 0.9592503 0.9422470 0.94610093
## PHYS 0.05424827 0.7419360 0.7886804 0.8129211 0.8794874 0.87176277
## RTEN -0.03364343 0.9372632 0.9437002 0.9299652 0.9270827 0.92499241
## PREP FAMI ORAL WRIT PHYS RTEN
## CONT 0.01146921 -0.02563656 -0.01199681 -0.04381025 0.05424827 -0.03364343
## INTG 0.87779650 0.86885798 0.91139915 0.90883469 0.74193597 0.93726315
## DMNR 0.85581749 0.84124150 0.90677295 0.89306109 0.78868038 0.94370017
## DILG 0.97856839 0.95736345 0.95447583 0.95925032 0.81292115 0.92996523
## CFMG 0.95791402 0.93546838 0.95056567 0.94224697 0.87948744 0.92708271
## DECI 0.95708831 0.94280452 0.94825640 0.94610093 0.87176277 0.92499241
## PREP 1.00000000 0.98986345 0.98310045 0.98679918 0.84867350 0.95029259
## FAMI 0.98986345 1.00000000 0.98133905 0.99069557 0.84374436 0.94164495
## ORAL 0.98310045 0.98133905 1.00000000 0.99342943 0.89116392 0.98213227
## WRIT 0.98679918 0.99069557 0.99342943 1.00000000 0.85594002 0.96755639
## PHYS 0.84867350 0.84374436 0.89116392 0.85594002 1.00000000 0.90654782
## RTEN 0.95029259 0.94164495 0.98213227 0.96755639 0.90654782 1.00000000
Yes, there are high values of Pearson’s correlation coefficient. Now we are ready to perform PCA. As we discussed, we should scale our date first so as to have a center of a distribution at 0 and a standard deviation equal to 1. So, we use the function prcomp()
and add options for scaling:
pca <- prcomp(mat, center = TRUE, scale = TRUE)
pca
## Standard deviations (1, .., p=12):
## [1] 3.18331647 1.05078398 0.57697626 0.50383231 0.29060762 0.19309598
## [7] 0.14029545 0.12415832 0.08850690 0.07491146 0.05708042 0.04539134
##
## Rotation (n x k) = (12 x 12):
## PC1 PC2 PC3 PC4 PC5
## CONT 0.003075143 -0.932890644 0.334756548 0.058576867 0.093438368
## INTG -0.288550775 0.182040993 0.549360126 0.173977074 -0.014543880
## DMNR -0.286884206 0.197565743 0.556490386 -0.124412022 -0.228832817
## DILG -0.304354091 -0.036304667 -0.163629910 0.321395544 -0.301936920
## CFMG -0.302572733 -0.168393523 -0.207341904 0.012949223 -0.448430522
## DECI -0.301891969 -0.127877299 -0.297902771 0.030491779 -0.424003128
## PREP -0.309406446 -0.032230248 -0.151869345 0.213656069 0.202853400
## FAMI -0.306679527 0.001315183 -0.195290454 0.200651140 0.507470003
## ORAL -0.312708348 0.003625720 -0.002150634 -0.007441042 0.246059421
## WRIT -0.311061231 0.031378756 -0.056045596 0.137104995 0.305562842
## PHYS -0.280723624 -0.089037698 -0.154000444 -0.841266046 0.118424976
## RTEN -0.309790218 0.039381306 0.172869757 -0.184223629 0.006717911
## PC6 PC7 PC8 PC9 PC10
## CONT -0.004064432 0.005214784 -6.006597e-02 0.02514533 -0.03038881
## INTG 0.369937339 -0.449810741 3.341645e-01 0.27537794 0.10897641
## DMNR -0.394724667 0.466747889 -2.470974e-01 0.19910004 -0.07241282
## DILG 0.598676072 0.209710731 -3.548587e-01 -0.03977180 -0.38339165
## CFMG -0.085728870 0.246903359 7.135261e-01 -0.14342471 0.09850310
## DECI -0.392609484 -0.536429933 -3.024227e-01 0.25823773 0.06743847
## PREP 0.083216652 0.335390036 -1.536754e-01 0.10876864 0.67986284
## FAMI -0.101538704 -0.036378004 2.038889e-02 0.22306628 0.04004599
## ORAL -0.150272440 0.057580177 9.062990e-02 -0.29951714 -0.25599455
## WRIT -0.238172386 -0.060899994 1.261203e-01 -0.02497324 -0.47478254
## PHYS 0.299281534 0.024959951 -1.364511e-05 0.26627286 -0.05900837
## RTEN 0.036126847 -0.256194180 -2.213898e-01 -0.75645893 0.24993250
## PC11 PC12
## CONT 0.0145329260 -0.007940919
## INTG -0.1125535650 0.009848658
## DMNR 0.1343234234 0.059121657
## DILG 0.0709517642 0.053790339
## CFMG 0.1658680927 0.025082947
## DECI -0.1284999526 0.044141604
## PREP -0.3187612119 -0.273286884
## FAMI 0.5733628652 0.421739844
## ORAL -0.6386061655 0.494391025
## WRIT 0.0004056397 -0.696107204
## PHYS -0.0181381019 -0.053783960
## RTEN 0.2855143026 -0.080267574
What do we see in this output? Weights of older variables in newer variables that are called principal components (PC1, PC2, PC3, etc.). So, we can say that CONT
has the weight of 0.003 in PC1, INTG
has the weight of -0.288 in PC1, and so on. In other words, we can regard PC1 as a new integral index of jugde behaviour (one instead of twelve). Just recall your cumulative score that is combined from several grades taken from some weights.
Now we can try to choose several components and interpret what do they mean. It is a substantial thing, not a techinal one. If we look at the first component PC1, we see that only number of contacts with a lawyer contributes positively to PC1 (CONT
has a positive weight and others have a negative one). Thus, we can think that the first component is a measure of sociability of a judge. In the second component the highest weight has the openess and good manners of a judge, so we can say that PC2 is responsible for manners. The forth component is also interesting, we can interpret it as an overall competence: preparation for the hearings and knowledge of laws. The fifth one can be regarded as a general performance: ability to provide written and oral rulings and good physical tenure. Let us stop here and decide which components are most informative.
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 3.1833 1.05078 0.57698 0.50383 0.29061 0.19310
## Proportion of Variance 0.8445 0.09201 0.02774 0.02115 0.00704 0.00311
## Cumulative Proportion 0.8445 0.93647 0.96421 0.98537 0.99240 0.99551
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.14030 0.12416 0.08851 0.07491 0.05708 0.04539
## Proportion of Variance 0.00164 0.00128 0.00065 0.00047 0.00027 0.00017
## Cumulative Proportion 0.99715 0.99844 0.99909 0.99956 0.99983 1.00000
Usually the fisrt component is the most informative, then goes the second one, and so on (just by the logic of PCA). Often we are interested in those principal components that are interpretable substantially, but if we want to get some proof, we can create a scree plot (or an elbow plot as it called sometimes):
plot(pca, type = "l", main = "Scree plot")
It shows the decrease in the variances of principal components, so we should take such components that go before the bend (here we can take two approximately).
Now we can predict the values of our new indices (principal components) for every observation in data.
mat2 <- predict(pca, newdata = mat)
Look at the fisrt 3 rows:
View(head(mat2, 3))
For example, for the first judge the value of PC1 is 0.5857, for the second one is , and so on. In other words, we can say that the value of our ‘sociability’ index for the first judge is 0.5857 and for the second one is -2.38. The same logic is for other components.
And check that all correlation coefficients between principal components are zero (approximately) as expected:
cor(mat2)
## PC1 PC2 PC3 PC4 PC5
## PC1 1.000000e+00 -3.144469e-16 -1.533744e-16 -5.415659e-16 -1.894434e-15
## PC2 -3.144469e-16 1.000000e+00 3.656561e-16 -3.736549e-17 1.004348e-16
## PC3 -1.533744e-16 3.656561e-16 1.000000e+00 1.235125e-16 9.875528e-16
## PC4 -5.415659e-16 -3.736549e-17 1.235125e-16 1.000000e+00 4.966328e-16
## PC5 -1.894434e-15 1.004348e-16 9.875528e-16 4.966328e-16 1.000000e+00
## PC6 3.269063e-16 -1.962836e-16 4.255859e-17 9.844700e-16 2.488013e-17
## PC7 1.668104e-15 3.115529e-16 2.824472e-15 -7.492448e-16 4.299730e-16
## PC8 -1.990301e-15 4.693585e-16 -1.846566e-15 3.567437e-16 -1.434178e-15
## PC9 4.392010e-15 -7.434676e-16 -1.761981e-15 -6.846485e-16 1.816207e-16
## PC10 -1.299895e-15 -6.423597e-17 1.134284e-15 1.060094e-15 9.588901e-16
## PC11 6.089771e-15 -6.559170e-16 -3.045706e-15 -1.473830e-15 -1.728038e-15
## PC12 1.186571e-15 -4.869973e-17 2.333852e-15 1.116960e-16 1.199785e-15
## PC6 PC7 PC8 PC9 PC10
## PC1 3.269063e-16 1.668104e-15 -1.990301e-15 4.392010e-15 -1.299895e-15
## PC2 -1.962836e-16 3.115529e-16 4.693585e-16 -7.434676e-16 -6.423597e-17
## PC3 4.255859e-17 2.824472e-15 -1.846566e-15 -1.761981e-15 1.134284e-15
## PC4 9.844700e-16 -7.492448e-16 3.567437e-16 -6.846485e-16 1.060094e-15
## PC5 2.488013e-17 4.299730e-16 -1.434178e-15 1.816207e-16 9.588901e-16
## PC6 1.000000e+00 -1.902228e-16 -1.044185e-15 -1.812230e-15 8.156293e-16
## PC7 -1.902228e-16 1.000000e+00 9.807703e-16 1.354460e-15 3.514828e-16
## PC8 -1.044185e-15 9.807703e-16 1.000000e+00 1.209401e-15 1.845993e-16
## PC9 -1.812230e-15 1.354460e-15 1.209401e-15 1.000000e+00 -4.674462e-16
## PC10 8.156293e-16 3.514828e-16 1.845993e-16 -4.674462e-16 1.000000e+00
## PC11 -3.175632e-16 -6.032765e-18 -8.961713e-16 2.167627e-16 -1.171468e-15
## PC12 -4.097175e-16 8.988985e-16 1.548460e-15 -3.043606e-16 -1.003605e-15
## PC11 PC12
## PC1 6.089771e-15 1.186571e-15
## PC2 -6.559170e-16 -4.869973e-17
## PC3 -3.045706e-15 2.333852e-15
## PC4 -1.473830e-15 1.116960e-16
## PC5 -1.728038e-15 1.199785e-15
## PC6 -3.175632e-16 -4.097175e-16
## PC7 -6.032765e-18 8.988985e-16
## PC8 -8.961713e-16 1.548460e-15
## PC9 2.167627e-16 -3.043606e-16
## PC10 -1.171468e-15 -1.003605e-15
## PC11 1.000000e+00 -7.467347e-16
## PC12 -7.467347e-16 1.000000e+00
Now we can join our original data frame with a new one so as make different comparisons and further analysis possible:
with_pc <- cbind(mat, mat2[,1:2])