To implement the procedure, we use the statistical package R. The USArrests data set is loaded by default when R starts. It contains arrests per 100,000 residents for assault (Assault), murder (Murder) and rape (Rape) in each of the 50 US states in 1973. The percentage of the population living in urban areas (UrbanPop) is also provided. Let’s look at the first 10 rows:
USArrests[1:10,]
##             Murder Assault UrbanPop Rape
## Alabama       13.2     236       58 21.2
## Alaska        10.0     263       48 44.5
## Arizona        8.1     294       80 31.0
## Arkansas       8.8     190       50 19.5
## California     9.0     276       91 40.6
## Colorado       7.9     204       78 38.7
## Connecticut    3.3     110       77 11.1
## Delaware       5.9     238       72 15.8
## Florida       15.4     335       80 31.9
## Georgia       17.4     211       60 25.8
To perform PCA on USArrests data:
prcomp(USArrests)
## Standard deviations (1, .., p=4):
## [1] 83.732400 14.212402  6.489426  2.482790
##
## Rotation (n x k) = (4 x 4):
##                 PC1         PC2         PC3         PC4
## Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
## Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
## UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
## Rape     0.07515550  0.20071807  0.97408059  0.07232502
The standard deviations are the square roots of the eigenvalues of the covariance matrix (the data have not been standardized yet), and they measure the variability captured by each component. The higher the value, the more relevant the corresponding component is for display purposes. To visualize the relative importance of each component:
plot(prcomp(USArrests))
Numerically:
summary(prcomp(USArrests))
## Importance of components:
##                            PC1      PC2    PC3     PC4
## Standard deviation     83.7324 14.21240 6.4894 2.48279
## Proportion of Variance  0.9655  0.02782 0.0058 0.00085
## Cumulative Proportion   0.9655  0.99335 0.9991 1.00000
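The numbers in this summary can be checked by hand. The following is a minimal sketch, assuming only base R and the built-in USArrests data; the variable names (`pca_raw`, `eig`, `sdev_manual`, `prop_var`) are illustrative and not part of the original analysis. It rebuilds the standard deviations and the proportions of variance from the eigenvalues of the sample covariance matrix.

```r
# Unscaled PCA, as in the text
pca_raw <- prcomp(USArrests)

# Eigen-decomposition of the sample covariance matrix
eig <- eigen(cov(USArrests))

# Component standard deviations are the square roots of the eigenvalues
sdev_manual <- sqrt(eig$values)

# Proportion of variance explained by each component
prop_var <- eig$values / sum(eig$values)

print(round(sdev_manual, 4))
print(round(prop_var, 4))

# Should agree with prcomp up to numerical tolerance
print(all.equal(sdev_manual, pca_raw$sdev))
```

Squaring the standard deviations and dividing by their sum is all that `summary()` does to obtain the "Proportion of Variance" row.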
The variability of the data is explained mainly by the first principal component, PC1, which, as the rotation matrix shows, gives a weight of 0.9952 to the Assault variable and weights close to zero to the rest. The summary table shows that PC1 explains 96.6% of the variance, so it is practically the only relevant component. What is happening? In the USArrests data, the Assault values are much larger in magnitude than those of the other variables: for Alabama, 236 versus 13.2 for Murder or 21.2 for Rape. Because unscaled PCA works with the raw variances, Assault has the most influence on the final result, as the preceding plot shows. The second component, PC2, depends mostly on UrbanPop, the variable with the next largest variance, and so on.
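A quick way to see the scale problem is to compare the variances of the four variables directly. This is a small sketch using base R only; the name `vars` is illustrative.

```r
# Variance of each variable on its original scale
vars <- sapply(USArrests, var)
print(round(vars, 1))

# Share of the total variance contributed by each variable
print(round(vars / sum(vars), 3))

# Variable with the largest variance
print(names(which.max(vars)))
```

Since unscaled PCA looks for directions of maximum variance, a variable that contributes most of the total variance will almost define the first component by itself.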
How do we keep a variable from dominating simply because of its larger magnitude? The answer is standardization. Let’s repeat the analysis with standardized data:
plot(prcomp(USArrests, scale. = TRUE))
summary(prcomp(USArrests, scale. = TRUE))
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
As can be seen, the first two components capture about 87% of the variability. This means that a plot of the USArrests data projected onto the first two principal components will be sufficiently representative.
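To make explicit what standardizing does, the sketch below (base R only, illustrative variable names) compares three routes that should give the same component standard deviations: the scaling option of `prcomp`, a plain PCA on data standardized beforehand with `scale()`, and the eigenvalues of the correlation matrix.

```r
# Route 1: let prcomp standardize the variables
pca_std <- prcomp(USArrests, scale. = TRUE)

# Route 2: standardize first (mean 0, sd 1), then run a plain PCA
pca_manual <- prcomp(scale(USArrests))

# Route 3: eigenvalues of the correlation matrix
eig_cor <- eigen(cor(USArrests))

print(all.equal(pca_std$sdev, pca_manual$sdev))
print(all.equal(pca_std$sdev, sqrt(eig_cor$values)))

# Cumulative proportion of variance explained
cum_prop <- cumsum(pca_std$sdev^2) / sum(pca_std$sdev^2)
print(round(cum_prop, 4))
```

After standardization every variable has variance 1, so the total variance equals the number of variables (4) and no single variable can dominate because of its units.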
Before moving on to the plot, let’s examine the rotation matrix, looking for a meaningful interpretation of the principal components:
prcomp(USArrests, scale. = TRUE)
## Standard deviations (1, .., p=4):
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
##
## Rotation (n x k) = (4 x 4):
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
PC1 assigns weights of the same sign to all the variables, so it behaves like a weighted average of the (standardized) original variables. It is a summary measure that ranks the 50 states by their overall level of crime. PC2 ranks the states by urban population in one direction, with the number of murders weighing in the opposite direction. In PC3, the order is given mainly by Rape. Bear in mind, in any case, that a PCA offers no guarantee of interpretability.
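The reading of each component as a weighted combination of the variables can be verified directly. In this sketch (base R, illustrative names `z` and `scores_manual`), the scores returned by `prcomp` are rebuilt by multiplying the standardized data by the rotation matrix.

```r
pca_std <- prcomp(USArrests, scale. = TRUE)

# Standardized data: each column has mean 0 and standard deviation 1
z <- scale(USArrests)

# Each score is a weighted sum of the standardized variables,
# with the weights taken from the columns of the rotation matrix
scores_manual <- z %*% pca_std$rotation

# Should match the scores stored by prcomp
print(all.equal(scores_manual, pca_std$x, check.attributes = FALSE))

# The weights of each component have unit length
print(round(colSums(pca_std$rotation^2), 6))
```

Note that the sign of a component is arbitrary: multiplying a column of the rotation matrix by -1 gives an equally valid solution, so only the relative signs within a column carry meaning.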
We plot the data projected onto the first two components:
plot(prcomp(USArrests, scale. = TRUE)$x[, 1:2])
It would be desirable to be able to see the names of the states, rather than just dots.
plot(prcomp(USArrests, scale. = TRUE)$x[, 1:2], type = "n")
text(prcomp(USArrests, scale. = TRUE)$x[, 1:2], rownames(USArrests))
Nearby points on the plot indicate similar crime profiles. From the plot we can infer that Florida, Nevada and California are three extreme points in terms of crime. But in what sense: very low or very high crime? Very high: in the rotation matrix all the weights of the first principal component are negative, so the higher the values of the original variables, the further to the left a state falls on PC1. As for PC2, there are extremes such as Hawaii at the bottom and Mississippi and North Carolina at the top. Hawaii has a large urban population and few murders; Mississippi and North Carolina show the opposite pattern.
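The extremes read off the plot can also be listed numerically. The sketch below uses base R only; the reorientation of PC1 is a convenience added here (an assumption for readability, not part of the original analysis) so that larger scores mean more arrests regardless of the sign returned on a given platform.

```r
pca_std <- prcomp(USArrests, scale. = TRUE)
scores <- pca_std$x

# The sign of a component is arbitrary. Orient PC1 so that
# higher scores correspond to higher values of the crime variables.
if (sum(pca_std$rotation[, "PC1"]) < 0) {
  scores[, "PC1"] <- -scores[, "PC1"]
}

# States with the highest overall crime level on PC1
top_pc1 <- names(head(sort(scores[, "PC1"], decreasing = TRUE), 3))
print(top_pc1)

# States at both ends of PC2; which end is "top" depends on the sign of PC2
pc2_sorted <- sort(scores[, "PC2"])
print(head(pc2_sorted, 3))
print(tail(pc2_sorted, 3))
```

Sorting the scores is a useful complement to the plot when labels overlap and the exact ordering of nearby states is hard to read.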
How can we improve the plot? One option is to add information about the variables using a biplot.
biplot(prcomp(USArrests, scale. = TRUE))
Component 1 lies roughly along the average direction of the four variables. Along PC2, the states are ordered mostly by UrbanPop in one direction and by Murder in the other, as deduced above from the rotation matrix.