So if you’re reading this, you probably have some idea what Principal Components Analysis (PCA) is. If not, PCA is a method of reducing a dataset of intercorrelated variables down to a small number of new variables that describe as much of the original variation as possible. These Principal Components (PCs) are linear combinations of the variables in the dataset. If any of this is unclear, stick with me, and hopefully I can show you by example. First, let’s load an exciting array of packages.
require(ggplot2)
require(reshape)
require(corrgram)
require(RColorBrewer)
require(FactoMineR)
require(igraph)
require(Matrix)
Now, we’re going to use the “state.x77” dataset, which is built into R. It contains 50 rows (one for each US state) and 8 columns measuring various socioeconomic and geographic variables.
state.dat <- as.data.frame(state.x77)
summary(state.dat)
## Population Income Illiteracy Life Exp
## Min. : 365 Min. :3098 Min. :0.500 Min. :67.96
## 1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625 1st Qu.:70.12
## Median : 2838 Median :4519 Median :0.950 Median :70.67
## Mean : 4246 Mean :4436 Mean :1.170 Mean :70.88
## 3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575 3rd Qu.:71.89
## Max. :21198 Max. :6315 Max. :2.800 Max. :73.60
## Murder HS Grad Frost Area
## Min. : 1.400 Min. :37.80 Min. : 0.00 Min. : 1049
## 1st Qu.: 4.350 1st Qu.:48.05 1st Qu.: 66.25 1st Qu.: 36985
## Median : 6.850 Median :53.25 Median :114.50 Median : 54277
## Mean : 7.378 Mean :53.11 Mean :104.46 Mean : 70736
## 3rd Qu.:10.675 3rd Qu.:59.15 3rd Qu.:139.75 3rd Qu.: 81163
## Max. :15.100 Max. :67.30 Max. :188.00 Max. :566432
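One thing worth noticing straight away: these variables live on wildly different scales (Area runs into the hundreds of thousands of square miles, while Illiteracy is a percentage below 3). Here is a quick base-R check of the spread of each column; this is exactly why we will standardise the variables before running the PCA later on.
#Standard deviation of each column - Area and Population dwarf the rest,
#which is why we will set scale.unit = TRUE in the PCA call later
round(apply(state.dat, 2, sd), 2)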
Let’s have a look at the distribution of each variable. The following is ggplot2 code; it doesn’t massively matter here if you don’t understand what it’s doing. Later on, for the more complex plots, I’ll break the code down.
ggplot(melt(state.dat),aes(x=value))+
geom_histogram(colour="black",aes(fill=variable))+
scale_fill_brewer(palette="Spectral",
name="Variable")+
facet_wrap(~variable,nrow=2,scales="free")+
labs(x="",y="Frequency")+
theme(legend.position = "none")
Variables are fairly self-descriptive. The dataset gives them as:
Population: Population estimate as of July 1, 1975
Income: Per capita income (1974)
Illiteracy: Illiteracy (1970, percent of population)
Life Exp: Life expectancy in years (1969-71)
Murder: Murder and non-negligent manslaughter rate per 100,000 population (1976)
HS Grad: Percent high-school graduates (1970)
Frost: Mean number of days with minimum temperature below freezing (1931-1960) in capital or large city
Area: Land area in square miles
The reader will have to forgive me this: I am British, therefore my knowledge of US geography (both physical and socioeconomic) is not the best.
Either way, let’s first take a look at how intercorrelated this dataset is.
cor.mat <- cor(state.dat,method = "spearman")
cor.mat[lower.tri(cor.mat)] <- NA
cor.mat <- melt(cor.mat)
cor.mat <- na.omit(cor.mat)
cor.mat$X1 <- factor(cor.mat$X1,levels = unique(cor.mat$X1))
cor.mat$X2 <- factor(cor.mat$X2,levels = unique(cor.mat$X2))
ggplot(data = cor.mat, aes(X2, X1, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradientn(colours = rev(brewer.pal(5,"RdYlBu")),
name = "Spearman\nCorrelation") +
theme_bw(base_size = 20) +
theme(
axis.text.x = element_text(
angle = 45, vjust = 1, hjust = 1
),
legend.title = element_text(size = 16),
legend.position = "top"
) +
coord_fixed() +
labs(x = "",y = "")+
geom_text(aes(label=round(value,2)))
Yes, that ugly bit of code is the best way I’ve found of doing a corrgram in ggplot. It involves making a correlation matrix, keeping only its upper triangle, melting it using reshape, turning each set of variable names into a factor to fix the plotting order, then drawing the tiles.
In terms of what it shows, bluer tiles represent stronger negative correlation, and redder ones stronger positive correlation. I’ve printed the coefficients on each tile to make it even clearer.
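As an aside, the corrgram package we loaded at the top can produce a similar picture in a single call, if you can live without the ggplot styling. A quick sketch (as far as I recall it defaults to Pearson rather than the Spearman correlations used above, and the panel choices are a matter of taste):
#One-call alternative using the corrgram package loaded earlier
corrgram(state.dat, order = TRUE,
         lower.panel = panel.shade, upper.panel = panel.pie,
         main = "Correlations between state variables")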
Many of the strongest correlations are fairly understandable. A higher murder rate goes with a lower life expectancy; more high-school graduates go with lower illiteracy. But it’s still a bit of a mess with a lot of correlation patterns in there, so let’s see if we can clean it up using PCA. We create a PCA object using the PCA function from the FactoMineR package. scale.unit rescales the variables before performing the PCA, and graph = TRUE will generate two graphs (one for the individuals, one for the variables).
pca.dat <- PCA(state.dat,graph = TRUE,scale.unit = TRUE)
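Before digging into the output, a quick aside that also makes the “linear combination” idea concrete: the same analysis can be run with base R’s prcomp. This is just a cross-check rather than part of the main workflow, and individual PCs may come out with flipped signs relative to FactoMineR, which is normal for PCA.
#Base-R equivalent: standardise the variables, then PCA via prcomp
pca.base <- prcomp(state.dat, scale. = TRUE)
pca.base$rotation[,1] #The weights that combine the scaled variables into PC1
head(pca.base$x[,1:2]) #PC scores for each state on the first two components
#Percentage of variance explained - should match FactoMineR's eigenvalue table
round(100 * pca.base$sdev^2 / sum(pca.base$sdev^2), 2)
Back to the two graphs that FactoMineR has just drawn.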
Now, the first of these two graphs is a bit of a mess. It plots each row (in this case, each state) against PC1 and PC2 (also called Dim 1 and Dim 2). It’s hard to interpret and I don’t like it, but it gives a rough idea of which states are “unusual” (unusual individuals tend to sit further from the rest). It also means that similar states are clustered together. If you look at the lower left of the graph, for example, most of the states clustered there are southern.
The second graph, the variables factor map, draws each variable as an arrow in the space of the first two PCs. Arrows pointing in roughly the same direction indicate variables that increase together, and arrows pointing in opposite directions indicate variables that move against each other. For example, states with larger populations tend to also be larger in area (the arrows point largely the same way), whereas states with a higher murder rate tend to have a lower life expectancy. These relationships should largely agree with the correlation matrix produced earlier. For some people (me included) the factor map is a more interpretable visualisation of the same information.
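If you want to redraw either map later without rerunning the analysis, FactoMineR’s plot method can do it. A quick sketch (the choix argument picks which map; check ?plot.PCA if your version behaves differently):
plot(pca.dat, choix = "ind") #Individuals (states) factor map
plot(pca.dat, choix = "var") #Variables factor map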
So what were the PCs generated? How well do they capture the overall variation in the dataset? And how are they stored in the object? Let’s take a look.
pca.eig <- as.data.frame(pca.dat$eig)
pca.eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.5988956 44.986195 44.98619
## comp 2 1.6319192 20.398990 65.38519
## comp 3 1.1119412 13.899264 79.28445
## comp 4 0.7075042 8.843803 88.12825
## comp 5 0.3846417 4.808021 92.93627
## comp 6 0.3074617 3.843271 96.77954
## comp 7 0.1444488 1.805610 98.58515
## comp 8 0.1131877 1.414846 100.00000
This table shows the eigenvalues of the PCA. Wikipedia’s definition goes roughly like this: “if T is a linear transformation from a vector space V over a field F into itself and v is a vector in V that is not the zero vector, then v is an eigenvector of T if T(v) is a scalar multiple of v”; the eigenvalue is that scalar multiple.
If that made no sense to you, you’re in the 99.99% of the population it makes no sense to (which also includes me). It’s part of the under-the-bonnet calculations done to generate the PCs, and a larger eigenvalue for a PC means that PC captures more of the original variance in the dataset (“percentage of variance” in that output). Because of the way PCA is performed, each PC explains no more of the overall variance than the previous one. The analysis generates as many PCs as you gave it variables, and their percentages of variance explained sum to 100%. This can be plotted in a bizarre bar-and-line combination known as a Scree plot.
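To make that bookkeeping concrete: because the variables were standardised, the eigenvalues sum to the number of variables, and each “percentage of variance” is just an eigenvalue divided by that total. Here is a quick sanity check, plus the common Kaiser rule of thumb (keep PCs whose eigenvalue exceeds 1), which is one of several rough guides for deciding how many PCs to retain:
#Eigenvalues of a standardised PCA sum to the number of variables
sum(pca.eig$eigenvalue) #Should be 8
#Percentage of variance is just each eigenvalue as a share of that total
round(100 * pca.eig$eigenvalue / sum(pca.eig$eigenvalue), 2)
#Kaiser rule of thumb: retain components with an eigenvalue greater than 1
which(pca.eig$eigenvalue > 1)
On to the scree plot itself.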
pca.eig$index <- as.factor(1:nrow(pca.eig))
pca.eig$index.cont <- 1:nrow(pca.eig)
#Scree plot
ggplot(pca.eig,aes(x=index,y=`percentage of variance`)) +
geom_bar(stat = "identity",aes(fill = index),colour = "black") +
geom_path(aes(x = index.cont),size = 1,colour = "Gray50") +
geom_point(size = 3) +
labs(x = "Principal Component",
y = "Percentage of Variance Explained") +
scale_fill_brewer(breaks = c(1:8),
palette = "YlGnBu",
direction = -1) +
theme_bw(base_size = 20) +
theme(legend.position = "none")
First, the ggplot2 code. We plot using the pca.eig dataset, with our x variable being simply the row numbers (but categorical, for the bars) and y being the variance explained. We then draw geom_bar and geom_point using these. For geom_path we override the x aesthetic with a continuous version of the row numbers so that the line joins the points up properly. The rest of the call is simply colour, labelling and other design aspects.
In terms of what we’re looking at, the Scree plot is often used as an important tool at this stage. If you’re attempting to reduce dimensionality in your dataset, you only want to take some PCs forward for analysis. Whilst there is no set statistical rule for how to make this decision, the Scree plot is a good starting point. Many people look for an “elbow” in the line: the point beyond which further PCs provide only marginal gains in variance explained and are mostly encapsulating noise. Here, I’ll stick to the first 4 PCs (for a total of 88.1% of the overall variance, which is pretty decent). Now, how is each PC structured? What variables make them up? These are our next important points. Calling pca.dat$var$contrib (note the two-layer nesting) will give you this data.
pca.contrib <- as.data.frame(pca.dat$var$contrib)[,1:4]
pca.contrib
## Dim.1 Dim.2 Dim.3 Dim.4
## Population 1.5984061 16.8817585 43.0763114 16.7596530
## Income 8.9299316 26.9339033 1.0071968 0.7822798
## Illiteracy 21.8714450 0.2805685 0.5026596 12.4487608
## Life Exp 16.9423099 0.6667720 12.9551741 19.5862312
## Murder 19.7364030 9.4217898 1.1765201 2.7423416
## HS Grad 18.0356857 8.9261493 0.2470935 5.3626574
## Frost 12.7743655 2.3588074 14.9857610 38.2729294
## Area 0.1114532 34.5302512 26.0492836 4.0451468
Here, “Dim” is the same as PC, and the numbers are the percentage contribution of each variable (the row names) to that PC. So PC1 is made up of a 1.6% contribution from Population, 8.9% from Income, and so on. Remember that PCs are combinations of variables; the contrib table shows you how the variables are combined to produce each PC. These things are much easier to see in a plot, so next I need to quickly clean up this data frame and prepare it for plotting.
pca.contrib$var <- row.names(pca.contrib) #Turn row names into a variable
pca.contrib <- melt(pca.contrib,id.vars = "var") #Melt the dataset
total.var <- paste("(",round(pca.eig$`percentage of variance`[1:4],1),
                   "% of Variance)",sep = "") #Extract the % of variance explained by each PC
pca.contrib$variable <- factor(pca.contrib$variable,
                               labels = paste("PC",1:4," ",total.var,sep = "")) #Append these values to the names of each PC
But we have a slight (albeit hidden) issue here. These values tell us how much each variable contributes to (more correctly, loads onto) each PC, but not the direction of the loading. Cast your mind back to the variables factor map: illiteracy and murder pointed almost opposite to life expectancy. Therefore the loading of those two is probably negative, although it could instead be life expectancy that has loaded negatively. Either way, one of them must have loaded negatively, since they are negatively correlated with each other. Always check the directions of your PCs! We can explore this using the co-ordinates of each variable on each PC.
pca.coord <- as.data.frame(pca.dat$var$coord)
pca.coord
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Population -0.23984363 0.52487776 -0.69208615 0.34434757 0.251765858
## Income 0.56690291 0.66297778 -0.10582738 0.07439531 -0.395428165
## Illiteracy -0.88720374 0.06766573 0.07476148 -0.29677518 0.002186803
## Life Exp 0.78085597 -0.10431289 -0.37954435 -0.37225450 0.202555453
## Murder -0.84278855 0.39211733 0.11437750 0.13929172 -0.079427577
## HS Grad 0.80565843 0.38166418 0.05241692 -0.19478457 -0.061563366
## Frost 0.67803840 -0.19619845 0.40820686 0.52036774 0.134807911
## Area 0.06333314 0.75067024 0.53819393 -0.16917324 0.309171079
So, as I suspected, illiteracy and murder have loaded negatively onto PC1. What does this all mean? Take PC1 (Dim.1). If one state scores higher on PC1 than another, it will tend to have a higher income (coordinate 0.57), a higher proportion of high-school graduates (0.81) and lower illiteracy (-0.89, thanks to the negative loading), with the size of each coordinate reflecting how strongly that variable tracks the PC.
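As an aside, the coordinates and the contributions from earlier are directly related: for this standardised PCA, a variable’s contribution to a PC works out as its squared coordinate divided by that PC’s eigenvalue, times 100. That is how FactoMineR appears to compute it (you can check it against the two tables above), so here is a quick verification sketch:
#Rebuild the contribution table from the coordinates and eigenvalues
contrib.check <- sweep(pca.coord^2, 2,
                       pca.eig$eigenvalue[1:ncol(pca.coord)], "/") * 100
round(contrib.check[,1:4], 4) #Compare with pca.dat$var$contrib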
The fact that the variables are rescaled before the PCA does make these differences a little harder to get one’s head around. Let’s get on with plot preparation to aid that. I wrote a little loop that detects negative loadings and appends that information to our table of contributions.
for (i in 1:nrow(pca.contrib))
{
pca.contrib$dir[i] <- pca.coord[which(row.names(pca.coord) == pca.contrib$var[i]),
grep(as.numeric(substr(pca.contrib$variable[i],3,3)),names(pca.coord))] >
0
}
pca.contrib
## var variable value dir
## 1 Population PC1 (45% of Variance) 1.5984061 FALSE
## 2 Income PC1 (45% of Variance) 8.9299316 TRUE
## 3 Illiteracy PC1 (45% of Variance) 21.8714450 FALSE
## 4 Life Exp PC1 (45% of Variance) 16.9423099 TRUE
## 5 Murder PC1 (45% of Variance) 19.7364030 FALSE
## 6 HS Grad PC1 (45% of Variance) 18.0356857 TRUE
## 7 Frost PC1 (45% of Variance) 12.7743655 TRUE
## 8 Area PC1 (45% of Variance) 0.1114532 TRUE
## 9 Population PC2 (20.4% of Variance) 16.8817585 TRUE
## 10 Income PC2 (20.4% of Variance) 26.9339033 TRUE
## 11 Illiteracy PC2 (20.4% of Variance) 0.2805685 TRUE
## 12 Life Exp PC2 (20.4% of Variance) 0.6667720 FALSE
## 13 Murder PC2 (20.4% of Variance) 9.4217898 TRUE
## 14 HS Grad PC2 (20.4% of Variance) 8.9261493 TRUE
## 15 Frost PC2 (20.4% of Variance) 2.3588074 FALSE
## 16 Area PC2 (20.4% of Variance) 34.5302512 TRUE
## 17 Population PC3 (13.9% of Variance) 43.0763114 FALSE
## 18 Income PC3 (13.9% of Variance) 1.0071968 FALSE
## 19 Illiteracy PC3 (13.9% of Variance) 0.5026596 TRUE
## 20 Life Exp PC3 (13.9% of Variance) 12.9551741 FALSE
## 21 Murder PC3 (13.9% of Variance) 1.1765201 TRUE
## 22 HS Grad PC3 (13.9% of Variance) 0.2470935 TRUE
## 23 Frost PC3 (13.9% of Variance) 14.9857610 TRUE
## 24 Area PC3 (13.9% of Variance) 26.0492836 TRUE
## 25 Population PC4 (8.8% of Variance) 16.7596530 TRUE
## 26 Income PC4 (8.8% of Variance) 0.7822798 TRUE
## 27 Illiteracy PC4 (8.8% of Variance) 12.4487608 FALSE
## 28 Life Exp PC4 (8.8% of Variance) 19.5862312 FALSE
## 29 Murder PC4 (8.8% of Variance) 2.7423416 TRUE
## 30 HS Grad PC4 (8.8% of Variance) 5.3626574 FALSE
## 31 Frost PC4 (8.8% of Variance) 38.2729294 TRUE
## 32 Area PC4 (8.8% of Variance) 4.0451468 FALSE
So this is what we’ll be plotting. var is each variable; variable (confusingly) is each PC, along with variance explained (from the eigenvalues); value is how much that variable contributed to that PC; and dir stands for direction, with dir = FALSE meaning negative loading. With that all said and done, it’s time for our first plot (and it’s a big one).
ggplot(pca.contrib,aes(x=var,y=value)) +
geom_hline(yintercept = 100 / length(unique(pca.contrib$var)),
linetype = "dashed") +
geom_bar(stat = "identity",aes(fill = var),colour = "black") +
facet_wrap(~ variable,nrow = 2,scales = "free") +
geom_label(aes(label = var,colour = dir)) +
scale_colour_manual(values = c("red","black"))+
scale_fill_brewer(palette = "Spectral") +
theme_bw(base_size = 20) +
theme(axis.text.x = element_blank()) +
labs(x = "",y = "% Contribution to Principal Component") +
theme(legend.position = "none")
Once again, the code. We plot the contribution table with variable names on the x axis and the contribution on the y. We then draw geom_bar for each variable to show this. Now come the clever parts, if I do say so myself. facet_wrap creates a sub-panel for each PC, and geom_hline draws a dashed line on each panel. This line is the contribution we would expect from each variable if all variables contributed evenly (here 100/8 = 12.5%). Therefore, variables above this line have a “greater-than-average” contribution to that PC. “Significant” contribution sounds a bit nicer, but the concept makes no statistical sense here.
So what can we see from this? PC1 is composed mostly of illiteracy, murder, life expectancy and HS grad. Taking the negative loadings into account, a higher PC1 means a state has a higher life expectancy and high-school graduation rate, and a lower illiteracy and murder rate. Sounds quite nice! Those states also tend to be a bit colder though… You can study the structure of each PC this way. States with a higher PC2 tend to be larger, richer and more populous (but maybe a tad more murdery? I guess that’s just a consequence of their size).
So you can see already how the structuring of a PCA can help you get your head around correlations hidden in a dataset. I’ll give some examples later on to help make this clearer. For now, let’s do another visualisation. This will tell us not only the structure of each PC, but also how the different PCs relate to one another in terms of shared variables. This is not immediately apparent from the above graph.
Worth noting: I came up with this visualisation myself. That’s not to say I invented it, as it may well have been done before. I think it makes sense and I quite like it. But the warning is this: the code and process by which it’s generated are extremely ugly. Unless you really want to know the mechanics of how it’s done, you may wish to skip ahead…
With that warning aside, let’s proceed! Firstly, we need to get our data into the right shape.
pca.grid <-
expand.grid(unique(pca.contrib$var),paste("PC",1:4,sep = ""))
levels(pca.contrib$variable) <- paste("PC",1:4,sep = "")
pca.grid$Var1 <- as.character(pca.grid$Var1)
pca.grid$Var2 <- as.character(pca.grid$Var2)
pca.grid
## Var1 Var2
## 1 Population PC1
## 2 Income PC1
## 3 Illiteracy PC1
## 4 Life Exp PC1
## 5 Murder PC1
## 6 HS Grad PC1
## 7 Frost PC1
## 8 Area PC1
## 9 Population PC2
## 10 Income PC2
## 11 Illiteracy PC2
## 12 Life Exp PC2
## 13 Murder PC2
## 14 HS Grad PC2
## 15 Frost PC2
## 16 Area PC2
## 17 Population PC3
## 18 Income PC3
## 19 Illiteracy PC3
## 20 Life Exp PC3
## 21 Murder PC3
## 22 HS Grad PC3
## 23 Frost PC3
## 24 Area PC3
## 25 Population PC4
## 26 Income PC4
## 27 Illiteracy PC4
## 28 Life Exp PC4
## 29 Murder PC4
## 30 HS Grad PC4
## 31 Frost PC4
## 32 Area PC4
So we start off with a two-column data frame: one column of variable names, the other just listing each PC. Next, we go through every row in that data frame and append two more columns. One is the % contribution of that variable to that PC. The other is whether that is a greater-than-average contribution (TRUE/FALSE).
for (i in 1:nrow(pca.grid))
{
df <- subset(pca.contrib,var == pca.grid[i,1] &
variable == pca.grid[i,2])
pca.grid$contrib[i] <- df$value
pca.grid$sig[i] <-
df$value > 100 / length(unique(pca.contrib$var))
}
pca.grid
## Var1 Var2 contrib sig
## 1 Population PC1 1.5984061 FALSE
## 2 Income PC1 8.9299316 FALSE
## 3 Illiteracy PC1 21.8714450 TRUE
## 4 Life Exp PC1 16.9423099 TRUE
## 5 Murder PC1 19.7364030 TRUE
## 6 HS Grad PC1 18.0356857 TRUE
## 7 Frost PC1 12.7743655 TRUE
## 8 Area PC1 0.1114532 FALSE
## 9 Population PC2 16.8817585 TRUE
## 10 Income PC2 26.9339033 TRUE
## 11 Illiteracy PC2 0.2805685 FALSE
## 12 Life Exp PC2 0.6667720 FALSE
## 13 Murder PC2 9.4217898 FALSE
## 14 HS Grad PC2 8.9261493 FALSE
## 15 Frost PC2 2.3588074 FALSE
## 16 Area PC2 34.5302512 TRUE
## 17 Population PC3 43.0763114 TRUE
## 18 Income PC3 1.0071968 FALSE
## 19 Illiteracy PC3 0.5026596 FALSE
## 20 Life Exp PC3 12.9551741 TRUE
## 21 Murder PC3 1.1765201 FALSE
## 22 HS Grad PC3 0.2470935 FALSE
## 23 Frost PC3 14.9857610 TRUE
## 24 Area PC3 26.0492836 TRUE
## 25 Population PC4 16.7596530 TRUE
## 26 Income PC4 0.7822798 FALSE
## 27 Illiteracy PC4 12.4487608 FALSE
## 28 Life Exp PC4 19.5862312 TRUE
## 29 Murder PC4 2.7423416 FALSE
## 30 HS Grad PC4 5.3626574 FALSE
## 31 Frost PC4 38.2729294 TRUE
## 32 Area PC4 4.0451468 FALSE
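(As an aside, the same two columns could be built without a loop by merging the grid with the contribution table. A quick sketch, relying on the column names created above; the row order may differ from the loop’s output:)
#Vectorised alternative to the loop above
pca.grid2 <- merge(pca.grid, pca.contrib,
                   by.x = c("Var1","Var2"), by.y = c("var","variable"))
pca.grid2$sig <- pca.grid2$value > 100 / length(unique(pca.contrib$var))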
Still with me? Excellent. Now, we make a blank matrix. For this to work, it has to, officially, be a square matrix. But we will end up leaving most of it empty.
nms <- c(unique(pca.grid$Var1),unique(pca.grid$Var2))
pca.mat <- Matrix(
0,dimnames = list(nms,nms),
nrow = length(unique(nms)),
ncol = length(unique(nms)),
sparse = TRUE,doDiag = TRUE
)
print(pca.mat)
## 12 x 12 sparse Matrix of class "dsCMatrix"
## [[ suppressing 12 column names 'Population', 'Income', 'Illiteracy' ... ]]
##
## Population . . . . . . . . . . . .
## Income . . . . . . . . . . . .
## Illiteracy . . . . . . . . . . . .
## Life Exp . . . . . . . . . . . .
## Murder . . . . . . . . . . . .
## HS Grad . . . . . . . . . . . .
## Frost . . . . . . . . . . . .
## Area . . . . . . . . . . . .
## PC1 . . . . . . . . . . . .
## PC2 . . . . . . . . . . . .
## PC3 . . . . . . . . . . . .
## PC4 . . . . . . . . . . . .
As you can see, the rows are named for each variable and each PC. The columns would have the same names, but the print method suppresses them by default. This is what’s called a “sparse matrix”, from the Matrix package; it only bothers storing the non-zero cells, so it has no problem being mostly empty. That is exactly what we need. Now, the following hideous loop fills in some of the cells with the data we need.
for (i in 1:nrow(pca.mat))
{
for (j in 1:ncol(pca.mat))
{
val <- subset(pca.grid,Var1 == row.names(pca.mat)[i] &
Var2 == colnames(pca.mat)[j])
if (nrow(val) == 0)
{
next
}
if (val[,4] == TRUE)
{
pca.mat[i,j] <- val[,3]
}
}
}
It works by looping through each row and column and taking a subset of the earlier data frame of contributions for that row-column combination. If that combination doesn’t exist (as most won’t, since the matrix has to be square and therefore contains variable-variable and PC-PC cells we never use), it moves on to the next one. If the combination exists and that variable’s contribution to that PC is greater than expected, the contribution is inserted into the matrix (how often do you legitimately get to use a phrase like that?). Either way, this leaves us with the following sad-looking matrix:
image(pca.mat)
Ignore most of the empty cells; the only ones that matter are rows 1:8 and columns 9:12. The rows represent variables, the columns represent PCs 1:4. A darker cell means that variable (though they aren’t labelled) contributes more to that PC. An empty cell means that variable contributes less than average to that PC.
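(Another aside: if the nested loop offends you, the same values can be dropped in with a single vectorised assignment on a plain base matrix, converting to a sparse Matrix afterwards. A sketch, not the approach I actually used:)
#Vectorised alternative: fill a base matrix by (row, column) index pairs
keep <- subset(pca.grid, sig) #Only the greater-than-average contributions
m <- matrix(0, length(nms), length(nms), dimnames = list(nms, nms))
m[cbind(match(keep$Var1, nms), match(keep$Var2, nms))] <- keep$contrib
pca.mat2 <- Matrix(m, sparse = TRUE) #Same non-zero cells as pca.mat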
“Cool” though this is, this is not our final destination. Next, we must use the igraph package to make a graph from the relationships in this matrix. This is the network sense of the word graph: a set of nodes connected by edges, rather than a plot of data. I mean technically yes, any plot shows relationships between things, but this kind of structure really is called a graph. Just take my word for it.
graph <-
graph_from_adjacency_matrix(pca.mat,mode = "undirected",weighted = NULL)
graph <- simplify(graph,remove.loops = TRUE)
So we create the graph from our adjacency matrix, and simplify it for ease of plotting. We then need to select a layout for the graph, and plot it. I chose the Fruchterman-Reingold layout algorithm as that tends to give a nice readable graph.
layout <- layout_with_fr(graph)
plot(graph,layout = layout)
It must be noted that the layout is not deterministic. Every time you run it, it is recalculated, and the result can come out quite differently depending on your data and the random starting positions. Observe.
layout <- layout_with_fr(graph)
plot(graph,layout = layout)
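Because the layout algorithm uses R’s random number generator, you can make it reproducible by setting a seed before computing it; for example:
set.seed(2023) #Any fixed seed will do; it just makes the layout repeatable
layout <- layout_with_fr(graph)
plot(graph,layout = layout)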
Anyway, we’re nearly there. This graph technically shows which variables contribute to which PC. But I want it to be both prettier and more informative! This calls for ggplot2. It also calls for some serious data wrangling to extract what I need from the extremely obtuse igraph object. First, we extract a data frame of the graph nodes.
coord <- data.frame(names = attr(V(graph),"names"),
x = layout[,1],y = layout[,2])
coord
## names x y
## 1 Population -0.1136986 11.528751
## 2 Income -3.1372125 12.712944
## 3 Illiteracy 5.0716546 9.166989
## 4 Life Exp 2.5928246 10.738258
## 5 Murder 6.0020231 10.203971
## 6 HS Grad 5.7320291 11.575549
## 7 Frost 2.7399309 11.464784
## 8 Area -0.2947086 12.769613
## 9 PC1 4.4252830 10.637158
## 10 PC2 -1.6198034 12.372688
## 11 PC3 1.2322274 11.877974
## 12 PC4 1.3936845 10.872730
This shows us the x-y co-ordinates of each node (point) on the graph. The edges are a bit trickier.
edges <- attr(E(graph),"vnames")
edges
## [1] "Population|PC2" "Population|PC3" "Population|PC4" "Income|PC2"
## [5] "Illiteracy|PC1" "Life Exp|PC1" "Life Exp|PC3" "Life Exp|PC4"
## [9] "Murder|PC1" "HS Grad|PC1" "Frost|PC1" "Frost|PC3"
## [13] "Frost|PC4" "Area|PC2" "Area|PC3"
All we can extract from the object is a set of edge names. First things first, we need to split each entry (say, Population|PC2) into a clear start and end point. To do this, I’ve created a vectorised wrapper around the strsplit function that will do exactly that. The code is hideous, I know.
strsplit.vect <- function(char,split,item)
{
char <- as.character(char)
output <- c()
for (i in 1:length(char))
{
output[i] <- strsplit(char[i],split)[[1]][item]
}
return(output)
}
edges <- data.frame(start = strsplit.vect(edges,"[|]",1),
end = strsplit.vect(edges,"[|]",2))
edges
## start end
## 1 Population PC2
## 2 Population PC3
## 3 Population PC4
## 4 Income PC2
## 5 Illiteracy PC1
## 6 Life Exp PC1
## 7 Life Exp PC3
## 8 Life Exp PC4
## 9 Murder PC1
## 10 HS Grad PC1
## 11 Frost PC1
## 12 Frost PC3
## 13 Frost PC4
## 14 Area PC2
## 15 Area PC3
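(For what it’s worth, base R can do the same split without a helper function; a quick sketch, working from the raw edge names again:)
#One-line alternative: split on the literal "|" and bind the pieces into columns
parts <- do.call(rbind, strsplit(attr(E(graph), "vnames"), "|", fixed = TRUE))
edges2 <- data.frame(start = parts[,1], end = parts[,2])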
We’re getting somewhere though. Now, we loop through this data frame and do a lot of stuff at each iteration.
for (i in 1:nrow(edges))
{
edges$start.x[i] <- coord[as.character(coord[,1]) == edges[i,1],2]
edges$start.y[i] <- coord[as.character(coord[,1]) == edges[i,1],3]
edges$end.x[i] <- coord[as.character(coord[,1]) == edges[i,2],2]
edges$end.y[i] <- coord[as.character(coord[,1]) == edges[i,2],3]
edges$weight[i] <-
pca.mat[which(row.names(pca.mat) == edges[i,1]),
which(colnames(pca.mat) == edges[i,2])]
edges$c.x[i] <- mean(c(edges$start.x[i],edges$end.x[i]))
edges$c.y[i] <- mean(c(edges$start.y[i],edges$end.y[i]))
edges$dir[i] <-
ifelse(pca.contrib[pca.contrib$var == edges$start[i] &
pca.contrib$variable == edges$end[i],4],
"+","-")
}
edges
## start end start.x start.y end.x end.y weight
## 1 Population PC2 -0.1136986 11.528751 -1.619803 12.37269 16.88176
## 2 Population PC3 -0.1136986 11.528751 1.232227 11.87797 43.07631
## 3 Population PC4 -0.1136986 11.528751 1.393684 10.87273 16.75965
## 4 Income PC2 -3.1372125 12.712944 -1.619803 12.37269 26.93390
## 5 Illiteracy PC1 5.0716546 9.166989 4.425283 10.63716 21.87145
## 6 Life Exp PC1 2.5928246 10.738258 4.425283 10.63716 16.94231
## 7 Life Exp PC3 2.5928246 10.738258 1.232227 11.87797 12.95517
## 8 Life Exp PC4 2.5928246 10.738258 1.393684 10.87273 19.58623
## 9 Murder PC1 6.0020231 10.203971 4.425283 10.63716 19.73640
## 10 HS Grad PC1 5.7320291 11.575549 4.425283 10.63716 18.03569
## 11 Frost PC1 2.7399309 11.464784 4.425283 10.63716 12.77437
## 12 Frost PC3 2.7399309 11.464784 1.232227 11.87797 14.98576
## 13 Frost PC4 2.7399309 11.464784 1.393684 10.87273 38.27293
## 14 Area PC2 -0.2947086 12.769613 -1.619803 12.37269 34.53025
## 15 Area PC3 -0.2947086 12.769613 1.232227 11.87797 26.04928
## c.x c.y dir
## 1 -0.8667510 11.950720 +
## 2 0.5592644 11.703362 -
## 3 0.6399930 11.200740 +
## 4 -2.3785079 12.542816 +
## 5 4.7484688 9.902074 -
## 6 3.5090538 10.687708 +
## 7 1.9125260 11.308116 -
## 8 1.9932545 10.805494 -
## 9 5.2136530 10.420565 -
## 10 5.0786560 11.106353 +
## 11 3.5826069 11.050971 +
## 12 1.9860791 11.671379 +
## 13 2.0668077 11.168757 +
## 14 -0.9572560 12.571151 +
## 15 0.4687594 12.323793 +
So start.x and start.y are the co-ordinates (from the coord frame) of the start point (the variable), and end.x and end.y are the same but for the PC point. Note that with this visualisation method two variables are never linked directly; a variable is only ever linked to a PC, though it can be linked to several PCs.
weight is the contribution of that variable to that PC. Finally, c.x and c.y are the midpoints along the edge, and dir is whether the loading was positive or negative. Now, the final bit of data prep before plotting. We create a variable called PC which is TRUE if that node is a PC, and FALSE if it is a variable.
coord$PC <- grepl("PC",coord$names)
coord
## names x y PC
## 1 Population -0.1136986 11.528751 FALSE
## 2 Income -3.1372125 12.712944 FALSE
## 3 Illiteracy 5.0716546 9.166989 FALSE
## 4 Life Exp 2.5928246 10.738258 FALSE
## 5 Murder 6.0020231 10.203971 FALSE
## 6 HS Grad 5.7320291 11.575549 FALSE
## 7 Frost 2.7399309 11.464784 FALSE
## 8 Area -0.2947086 12.769613 FALSE
## 9 PC1 4.4252830 10.637158 TRUE
## 10 PC2 -1.6198034 12.372688 TRUE
## 11 PC3 1.2322274 11.877974 TRUE
## 12 PC4 1.3936845 10.872730 TRUE
We then create a variable called “colour”, which for each PC will equal that PC’s name, and for all variables will equal PC5. This will be used to colour code our graph.
coord$names <- as.character(coord$names)
coord$colour <- factor(ifelse(coord$PC == TRUE,coord$names,"PC5"))
coord
## names x y PC colour
## 1 Population -0.1136986 11.528751 FALSE PC5
## 2 Income -3.1372125 12.712944 FALSE PC5
## 3 Illiteracy 5.0716546 9.166989 FALSE PC5
## 4 Life Exp 2.5928246 10.738258 FALSE PC5
## 5 Murder 6.0020231 10.203971 FALSE PC5
## 6 HS Grad 5.7320291 11.575549 FALSE PC5
## 7 Frost 2.7399309 11.464784 FALSE PC5
## 8 Area -0.2947086 12.769613 FALSE PC5
## 9 PC1 4.4252830 10.637158 TRUE PC1
## 10 PC2 -1.6198034 12.372688 TRUE PC2
## 11 PC3 1.2322274 11.877974 TRUE PC3
## 12 PC4 1.3936845 10.872730 TRUE PC4
Now, to plotting! The base plot is called on the nodes (coord) data frame. We then add the edges in using geom_segment to draw a line from the start to the end points, coloured by which PC it ends up at, and sized according to weight (that is, the contribution of that variable to that PC). We then write labels for each PC and each variable, coloured by the colour parameter. The colour scheme here is crucial; we select 4 colours, then white. This means that PC1-4 will have their own colour, and “PC5” will be white. Since PC5 is the label given to all variables, they will be in white text. Finally we write another set of labels on the midpoints of each edge with a “+” or “-” to denote positive or negative loading.
ggplot(coord,aes(x,y)) +
geom_segment(data = edges
,aes(
x = start.x,y = start.y,
xend = end.x,yend = end.y,
colour = end,size = weight
)) +
geom_label(size = 6,aes(fill = colour,label = names)) +
theme_bw() +
theme(
legend.position = "none",
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()
) +
labs(x = "",y = "") +
scale_colour_manual(values = c(brewer.pal(4,"Set1"))) +
scale_fill_manual(values = c(brewer.pal(4,"Set1"),"white")) +
geom_label(aes(x = c.x,y = c.y,label = dir),data = edges,size = 8)
I suppose it’s a matter of taste, but I really like this visualisation. Each PC has a colour and links to its contributing variables, with thicker lines indicating a larger contribution. You can, at a glance, see both the structure of the PCs and how they relate to one another. It provides a nice reference to quickly look back at when describing PC effects.
With that in mind, let’s have a look at how states vary in PCs. You can extract the actual PC scores for each state using the call pca.dat$ind$coord.
pca.stat <- as.data.frame(pca.dat$ind$coord)[,1:4]
pca.stat
## Dim.1 Dim.2 Dim.3 Dim.4
## Alabama -3.82836428 -0.23716257 0.23164558 -0.387160137
## Alaska 1.06382751 5.51156918 4.28364318 -0.581518252
## Arizona -0.87623537 0.75262575 0.07805313 -1.736293836
## Arkansas -2.40595872 -1.30142362 0.22505473 -0.629534491
## California -0.24383210 3.54515336 -2.83493329 0.071090007
## Colorado 2.08311775 0.51079765 0.51657601 0.111038575
## Connecticut 1.91871993 -0.24547359 -0.66939807 -0.171453869
## Delaware 0.42909658 -0.51307618 0.22094439 0.194608671
## Florida -1.18402345 1.14626187 -1.29485401 -0.493455278
## Georgia -3.32761584 0.11107318 0.39099842 0.460212968
## Hawaii 0.49198599 0.12653389 -1.39131861 -2.980183281
## Idaho 1.43788059 -0.61734785 0.43889922 -0.409728335
## Illinois -0.12017203 1.29540733 -0.81201670 1.599063843
## Indiana 0.47598579 -0.24769029 -0.30454473 0.663154499
## Iowa 2.34538438 -0.54230654 -0.29622393 -0.206823603
## Kansas 1.92082004 -0.07797440 -0.17890918 -0.620307462
## Kentucky -2.15097823 -1.07505720 0.25427250 0.352380672
## Louisiana -4.28406544 -0.34981662 0.23121601 -0.888269677
## Maine 0.96994215 -1.71970311 0.72880344 0.550797460
## Maryland 0.20549128 0.39275855 -0.34266902 0.668715496
## Massachusetts 1.20803512 -0.22087616 -1.02998153 -0.293016026
## Michigan -0.18371587 0.85571675 -0.55982049 1.194761816
## Minnesota 2.45832345 -0.36904451 -0.27515377 -0.068086037
## Mississippi -4.07302459 -1.06191343 0.92891890 -0.177695544
## Missouri -0.31441452 -0.14981157 0.00667219 0.559512234
## Montana 1.39287204 -0.03387927 1.35355575 0.247436744
## Nebraska 2.20315952 -0.55330929 0.05633810 -0.460491366
## Nevada 1.13852733 1.14441562 1.90984188 1.520116562
## New Hampshire 1.68825709 -1.32572231 0.44232732 0.485234407
## New Jersey 0.65617713 0.28432749 -0.98379015 0.622361283
## New Mexico -1.33587312 -0.29655089 1.37579210 -0.714194797
## New York -1.06101371 1.91293670 -2.22014430 1.279829922
## North Carolina -2.72168811 -0.52238917 0.14418326 0.566332475
## North Dakota 2.44221334 -0.78986053 0.27978746 0.137693336
## Ohio 0.27067753 0.42108548 -1.04342981 1.159210823
## Oklahoma -0.07466361 -0.65314783 -0.07191020 -0.613076010
## Oregon 1.33817792 0.22998659 -0.50402882 -1.363717422
## Pennsylvania -0.07816735 0.27214457 -1.08096169 1.302588254
## Rhode Island 0.74836879 -1.47613919 -0.23013703 -0.299896914
## South Carolina -3.74868247 -0.91908151 0.75633355 0.319026783
## South Dakota 2.03296648 -1.32844647 0.54199391 0.159024325
## Tennessee -2.24065364 -0.65763460 -0.01693749 -0.002769606
## Texas -2.43814743 2.35107064 -0.28894455 -0.812241135
## Utah 2.28581091 -0.53975620 0.22199848 -0.859252535
## Vermont 1.38316765 -1.52470730 0.45022716 0.362759114
## Virginia -1.00363500 0.18644420 -0.21296885 0.327713915
## Washington 1.35361753 0.51673796 -0.86045991 -1.250002832
## West Virginia -1.52191818 -1.61824796 0.49718832 0.345594712
## Wisconsin 1.77538396 -0.64218163 -0.42968914 0.113439275
## Wyoming 1.49885527 0.04268506 1.36796025 0.645470271
By default, this will hold information on the first 5 PCs. To get more, use the ncp argument when initially creating the PCA object (something like PCA(state.dat, scale.unit = TRUE, ncp = 8, graph = FALSE)). Now, we can just rename these variables and add them to the actual state dataset.
names(pca.stat) <- paste("PC",1:4,sep = "")
state.dat <- cbind(state.dat,pca.stat)
Now, I’ve created a quick function that will print the top 5 and bottom 5 scoring states for each PC to have a look at patterns.
toptail <- function(name,dat)
{
sort.dat <-
dat[order(dat[,which(names(dat) == name)],decreasing = TRUE),]
out <- rbind(cbind(row.names(head(sort.dat,5)),1:5),
cbind("--------","--------"),
cbind(row.names(tail(sort.dat,5)),46:50))
print(as.data.frame(out),row.names = FALSE)
}
Let’s start with PC1. I won’t include the mapping code because it’s not really part of this tutorial.
toptail("PC1",state.dat)
## V1 V2
## Minnesota 1
## North Dakota 2
## Iowa 3
## Utah 4
## Nebraska 5
## -------- --------
## Georgia 46
## South Carolina 47
## Alabama 48
## Mississippi 49
## Louisiana 50
So looking at the structure plot, a high PC1 indicates more education and a longer life expectancy, less illiteracy and murder, and, to a smaller extent, more days of frost. In general, the Frost variable is interesting, as it’s a good proxy for north vs south (which I’m told is a big thing in the US). Indeed, most of the high-PC1 states are northern, and all of the lowest-PC1 states are southern. Keep the PC structure plot from earlier in mind as a reference for the rest of these comparisons.
toptail("PC2",state.dat)
## V1 V2
## Alaska 1
## California 2
## Texas 3
## New York 4
## Illinois 5
## -------- --------
## South Dakota 46
## Rhode Island 47
## Vermont 48
## West Virginia 49
## Maine 50
PC2 picks out states that are big, populous and high-income. Looking at the top and bottom states, that makes a lot of sense.
toptail("PC3",state.dat)
## V1 V2
## Alaska 1
## Nevada 2
## New Mexico 3
## Wyoming 4
## Montana 5
## -------- --------
## Pennsylvania 46
## Florida 47
## Hawaii 48
## New York 49
## California 50
PC3 is interesting. It still loads on physically big states, but population loads negatively, so it picks out large, sparsely populated states. Frost also loads onto this PC, suggesting that these states tend to be cooler, or equivalently that the more densely populated states are warmer ones (see Hawaii, Florida and California). Weirdly, the high-PC3 states also tend to have a slightly lower life expectancy.
This is the real heart of PCA: something like population can load positively onto one PC and negatively onto another, and each combination reveals a different pattern in the data. Even the smaller correlations, such as those involving Frost, still get accounted for somewhere.
toptail("PC4",state.dat)
## V1 V2
## Illinois 1
## Nevada 2
## Pennsylvania 3
## New York 4
## Michigan 5
## -------- --------
## Louisiana 46
## Washington 47
## Oregon 48
## Arizona 49
## Hawaii 50
Finally, PC4 shows highly populated, colder states with a lower life expectancy. Here we see an issue with PCA: later PCs tend to capture more noise, so their structure is somewhat more difficult to explain.
Either way, hopefully I’ve shown you how to carry out and visualise this very helpful analysis technique, including my own home-brewed visualisation. Any questions, drop me a comment on here, or hit me up on Twitter @b_t_cooper. Thanks for your time!