So if you’re reading this, you probably have some idea what Principal Components Analysis (PCA) is. If not, PCA is a method of reducing a dataset of intercorrelated variables down to a small number of new variables that describe as much of the original variation as possible. These Principal Components (PCs) are linear combinations of the variables in the dataset. If any of this is unclear, stick with me, and hopefully I can show you by example. First, let’s load an exciting array of packages.
require(ggplot2)
require(reshape)
require(corrgram)
require(RColorBrewer)
require(FactoMineR)
require(igraph)
require(Matrix)
Now, we’re going to use the “state.x77” dataset, which is built into R. It contains 50 rows (one for each US state) and 8 columns measuring various socioeconomic and geographic variables.
state.dat <- as.data.frame(state.x77)
summary(state.dat)
## Population Income Illiteracy Life Exp
## Min. : 365 Min. :3098 Min. :0.500 Min. :67.96
## 1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625 1st Qu.:70.12
## Median : 2838 Median :4519 Median :0.950 Median :70.67
## Mean : 4246 Mean :4436 Mean :1.170 Mean :70.88
## 3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575 3rd Qu.:71.89
## Max. :21198 Max. :6315 Max. :2.800 Max. :73.60
## Murder HS Grad Frost Area
## Min. : 1.400 Min. :37.80 Min. : 0.00 Min. : 1049
## 1st Qu.: 4.350 1st Qu.:48.05 1st Qu.: 66.25 1st Qu.: 36985
## Median : 6.850 Median :53.25 Median :114.50 Median : 54277
## Mean : 7.378 Mean :53.11 Mean :104.46 Mean : 70736
## 3rd Qu.:10.675 3rd Qu.:59.15 3rd Qu.:139.75 3rd Qu.: 81163
## Max. :15.100 Max. :67.30 Max. :188.00 Max. :566432
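One thing worth noticing straight away: these variables live on wildly different scales (Area runs into the hundreds of thousands of square miles, while Illiteracy is a percentage below 3). Here is a quick base-R check of the spread of each column; this is exactly why we will standardise the variables before running the PCA later on.
#Standard deviation of each column - Area and Population dwarf the rest,
#which is why we will set scale.unit = TRUE in the PCA call later
round(apply(state.dat, 2, sd), 2)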
Let’s have a look at the distribution of each variable. The following is ggplot2 code; it doesn’t massively matter here if you don’t understand what it’s doing. Later on, for the more complex plots, I’ll break the code down.
ggplot(melt(state.dat),aes(x=value))+
geom_histogram(colour="black",aes(fill=variable))+
scale_fill_brewer(palette="Spectral",
name="Variable")+
facet_wrap(~variable,nrow=2,scales="free")+
labs(x="",y="Frequency")+
theme(legend.position = "none")
Variables are fairly self-descriptive. The dataset gives them as:
Population: Population estimate as of July 1, 1975
Income: Per capita income (1974)
Illiteracy: Illiteracy (1970, percent of population)
Life Exp: Life expectancy in years (1969-71)
Murder: Murder and non-negligent manslaughter rate per 100,000 population (1976)
HS Grad: Percent high-school graduates (1970)
Frost: Mean number of days with minimum temperature below freezing (1931-1960) in capital or large city
Area: Land area in square miles
The reader will have to forgive me this: I am British, therefore my knowledge of US geography (both physical and socioeconomic) is not the best.
Either way, let’s first take a look at how intercorrelated this dataset is.
cor.mat <- cor(state.dat,method = "spearman")
cor.mat[lower.tri(cor.mat)] <- NA
cor.mat <- melt(cor.mat)
cor.mat <- na.omit(cor.mat)
cor.mat$X1 <- factor(cor.mat$X1,levels = unique(cor.mat$X1))
cor.mat$X2 <- factor(cor.mat$X2,levels = unique(cor.mat$X2))
ggplot(data = cor.mat, aes(X2, X1, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradientn(colours = rev(brewer.pal(5,"RdYlBu")),
name = "Spearman\nCorrelation") +
theme_bw(base_size = 20) +
theme(
axis.text.x = element_text(
angle = 45, vjust = 1, hjust = 1
),
legend.title = element_text(size = 16),
legend.position = "top"
) +
coord_fixed() +
labs(x = "",y = "")+
geom_text(aes(label=round(value,2)))
Yes, that ugly bit of code is the best way I’ve found of doing a corrgram in ggplot. It involves making a correlation matrix, keeping only its upper triangle, melting it using reshape, turning each set of variable names into a factor to fix the plotting order, then drawing the tiles.
In terms of what it shows, bluer tiles represent stronger negative correlation, and redder ones stronger positive correlation. I’ve printed the coefficients on each tile to make it even clearer.
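As an aside, the corrgram package we loaded at the top can produce a similar picture in a single call, if you can live without the ggplot styling. A quick sketch (as far as I recall it defaults to Pearson rather than the Spearman correlations used above, and the panel choices are a matter of taste):
#One-call alternative using the corrgram package loaded earlier
corrgram(state.dat, order = TRUE,
         lower.panel = panel.shade, upper.panel = panel.pie,
         main = "Correlations between state variables")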
Many of the strongest correlations are fairly understandable. A higher murder rate goes with a lower life expectancy; more high-school graduates go with lower illiteracy. But it’s still a bit of a mess with a lot of correlation patterns in there, so let’s see if we can clean it up using PCA. We create a PCA object using the PCA function from the FactoMineR package. scale.unit rescales the variables before performing the PCA, and graph = TRUE will generate two graphs (one for the individuals, one for the variables).
pca.dat <- PCA(state.dat,graph = TRUE,scale.unit = TRUE)
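Before digging into the output, a quick aside that also makes the “linear combination” idea concrete: the same analysis can be run with base R’s prcomp. This is just a cross-check rather than part of the main workflow, and individual PCs may come out with flipped signs relative to FactoMineR, which is normal for PCA.
#Base-R equivalent: standardise the variables, then PCA via prcomp
pca.base <- prcomp(state.dat, scale. = TRUE)
pca.base$rotation[,1] #The weights that combine the scaled variables into PC1
head(pca.base$x[,1:2]) #PC scores for each state on the first two components
#Percentage of variance explained - should match FactoMineR's eigenvalue table
round(100 * pca.base$sdev^2 / sum(pca.base$sdev^2), 2)
Back to the two graphs that FactoMineR has just drawn.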
Now, the first of these two graphs is a bit of a mess. It plots each row (in this case, each state) against PC1 and PC2 (also called Dim 1 and Dim 2). It’s hard to interpret and I don’t like it, but it gives a rough idea of which states are “unusual” (unusual individuals tend to sit further from the rest). It also means that similar states are clustered together. If you look at the lower left of the graph, for example, most of the states clustered there are southern.
The second graph, the variables factor map, draws each variable as an arrow in the space of the first two PCs. Arrows pointing in roughly the same direction indicate variables that increase together, and arrows pointing in opposite directions indicate variables that move against each other. For example, states with larger populations tend to also be larger in area (the arrows point largely the same way), whereas states with a higher murder rate tend to have a lower life expectancy. These relationships should largely agree with the correlation matrix produced earlier. For some people (me included) the factor map is a more interpretable visualisation of the same information.
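If you want to redraw either map later without rerunning the analysis, FactoMineR’s plot method can do it. A quick sketch (the choix argument picks which map; check ?plot.PCA if your version behaves differently):
plot(pca.dat, choix = "ind") #Individuals (states) factor map
plot(pca.dat, choix = "var") #Variables factor map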
So what were the PCs generated? How well do they capture the overall variation in the dataset? And how are they stored in the object? Let’s take a look.
pca.eig <- as.data.frame(pca.dat$eig)
pca.eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.5988956 44.986195 44.98619
## comp 2 1.6319192 20.398990 65.38519
## comp 3 1.1119412 13.899264 79.28445
## comp 4 0.7075042 8.843803 88.12825
## comp 5 0.3846417 4.808021 92.93627
## comp 6 0.3074617 3.843271 96.77954
## comp 7 0.1444488 1.805610 98.58515
## comp 8 0.1131877 1.414846 100.00000
This table shows the eigenvalues of the PCA. Wikipedia’s definition goes roughly like this: “if T is a linear transformation from a vector space V over a field F into itself and v is a vector in V that is not the zero vector, then v is an eigenvector of T if T(v) is a scalar multiple of v”; the eigenvalue is that scalar multiple.
If that made no sense to you, you’re in the 99.99% of the population it makes no sense to (which also includes me). It’s part of the under-the-bonnet calculations done to generate the PCs, and a larger eigenvalue for a PC means that PC captures more of the original variance in the dataset (“percentage of variance” in that output). Because of the way PCA is performed, each PC explains no more of the overall variance than the previous one. The analysis generates as many PCs as you gave it variables, and their percentages of variance explained sum to 100%. This can be plotted in a bizarre bar-and-line combination known as a Scree plot.
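To make that bookkeeping concrete: because the variables were standardised, the eigenvalues sum to the number of variables, and each “percentage of variance” is just an eigenvalue divided by that total. Here is a quick sanity check, plus the common Kaiser rule of thumb (keep PCs whose eigenvalue exceeds 1), which is one of several rough guides for deciding how many PCs to retain:
#Eigenvalues of a standardised PCA sum to the number of variables
sum(pca.eig$eigenvalue) #Should be 8
#Percentage of variance is just each eigenvalue as a share of that total
round(100 * pca.eig$eigenvalue / sum(pca.eig$eigenvalue), 2)
#Kaiser rule of thumb: retain components with an eigenvalue greater than 1
which(pca.eig$eigenvalue > 1)
On to the scree plot itself.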
pca.eig$index <- as.factor(1:nrow(pca.eig))
pca.eig$index.cont <- 1:nrow(pca.eig)
#Scree plot
ggplot(pca.eig,aes(x=index,y=`percentage of variance`)) +
geom_bar(stat = "identity",aes(fill = index),colour = "black") +
geom_path(aes(x = index.cont),size = 1,colour = "Gray50") +
geom_point(size = 3) +
labs(x = "Principal Component",
y = "Percentage of Variance Explained") +
scale_fill_brewer(breaks = c(1:8),
palette = "YlGnBu",
direction = -1) +
theme_bw(base_size = 20) +
theme(legend.position = "none")
First, the ggplot2 code. We plot using the pca.eig dataset, with our x variable being simply the row numbers (but categorical, for the bars) and y being the variance explained. We then draw geom_bar and geom_point using these. For geom_path we override the x aesthetic with a continuous version of the row numbers so that the line joins the points up properly. The rest of the call is simply colour, labelling and other design aspects.
In terms of what we’re looking at, the Scree plot is often used as an important tool at this stage. If you’re attempting to reduce dimensionality in your dataset, you only want to take some PCs forward for analysis. Whilst there is no set statistical rule for how to make this decision, the Scree plot is a good starting point. Many people look for an “elbow” in the line: the point beyond which further PCs provide only marginal gains in variance explained and are mostly encapsulating noise. Here, I’ll stick to the first 4 PCs (for a total of 88.1% of the overall variance, which is pretty decent). Now, how is each PC structured? What variables make them up? These are our next important points. Calling pca.dat$var$contrib (note the two-layer nesting) will give you this data.
pca.contrib <- as.data.frame(pca.dat$var$contrib)[,1:4]
pca.contrib
## Dim.1 Dim.2 Dim.3 Dim.4
## Population 1.5984061 16.8817585 43.0763114 16.7596530
## Income 8.9299316 26.9339033 1.0071968 0.7822798
## Illiteracy 21.8714450 0.2805685 0.5026596 12.4487608
## Life Exp 16.9423099 0.6667720 12.9551741 19.5862312
## Murder 19.7364030 9.4217898 1.1765201 2.7423416
## HS Grad 18.0356857 8.9261493 0.2470935 5.3626574
## Frost 12.7743655 2.3588074 14.9857610 38.2729294
## Area 0.1114532 34.5302512 26.0492836 4.0451468
Here, “Dim” is the same as PC, and the numbers are the percentage contribution of each variable (the row names) to that PC. So PC1 is made up of a 1.6% contribution from Population, 8.9% from Income, and so on. Remember that PCs are combinations of variables; the contrib table shows you how the variables are combined to produce each PC. These things are much easier to see in a plot, so next I need to quickly clean up this data frame and prepare it for plotting.
pca.contrib$var <- row.names(pca.contrib) #Turn row names into a variable
pca.contrib <- melt(pca.contrib,id.vars = "var") #Melt the dataset
total.var <- paste("(",round(pca.eig$`percentage of variance`[1:4],1),
                   "% of Variance)",sep = "") #Extract the % of variance explained by each PC
pca.contrib$variable <- factor(pca.contrib$variable,
                               labels = paste("PC",1:4," ",total.var,sep = "")) #Append these values to the names of each PC
But we have a slight (albeit hidden) issue here. These values tell us how much each variable contributes to (more correctly, loads onto) each PC, but not the direction of the loading. Cast your mind back to the variables factor map: illiteracy and murder pointed almost opposite to life expectancy. Therefore the loading of those two is probably negative, although it could instead be life expectancy that has loaded negatively. Either way, one of them must have loaded negatively, since they are negatively correlated with each other. Always check the directions of your PCs! We can explore this using the co-ordinates of each variable on each PC.
pca.coord <- as.data.frame(pca.dat$var$coord)
pca.coord
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Population -0.23984363 0.52487776 -0.69208615 0.34434757 0.251765858
## Income 0.56690291 0.66297778 -0.10582738 0.07439531 -0.395428165
## Illiteracy -0.88720374 0.06766573 0.07476148 -0.29677518 0.002186803
## Life Exp 0.78085597 -0.10431289 -0.37954435 -0.37225450 0.202555453
## Murder -0.84278855 0.39211733 0.11437750 0.13929172 -0.079427577
## HS Grad 0.80565843 0.38166418 0.05241692 -0.19478457 -0.061563366
## Frost 0.67803840 -0.19619845 0.40820686 0.52036774 0.134807911
## Area 0.06333314 0.75067024 0.53819393 -0.16917324 0.309171079
So, as I suspected, illiteracy and murder have loaded negatively onto PC1. What does this all mean? Take PC1 (Dim.1). If one state scores higher on PC1 than another, it will tend to have a higher income (coordinate 0.57), a higher proportion of high-school graduates (0.81) and lower illiteracy (-0.89, thanks to the negative loading), with the size of each coordinate reflecting how strongly that variable tracks the PC.
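As an aside, the coordinates and the contributions from earlier are directly related: for this standardised PCA, a variable’s contribution to a PC works out as its squared coordinate divided by that PC’s eigenvalue, times 100. That is how FactoMineR appears to compute it (you can check it against the two tables above), so here is a quick verification sketch:
#Rebuild the contribution table from the coordinates and eigenvalues
contrib.check <- sweep(pca.coord^2, 2,
                       pca.eig$eigenvalue[1:ncol(pca.coord)], "/") * 100
round(contrib.check[,1:4], 4) #Compare with pca.dat$var$contrib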
The fact that the variables are rescaled before the PCA does make these differences a little harder to get one’s head around. Let’s get on with plot preparation to aid that. I wrote a little loop that detects negative loadings and appends that information to our table of contributions.
for (i in 1:nrow(pca.contrib))
{
pca.contrib$dir[i] <- pca.coord[which(row.names(pca.coord) == pca.contrib$var[i]),
grep(as.numeric(substr(pca.contrib$variable[i],3,3)),names(pca.coord))] >
0
}
pca.contrib
## var variable value dir
## 1 Population PC1 (45% of Variance) 1.5984061 FALSE
## 2 Income PC1 (45% of Variance) 8.9299316 TRUE
## 3 Illiteracy PC1 (45% of Variance) 21.8714450 FALSE
## 4 Life Exp PC1 (45% of Variance) 16.9423099 TRUE
## 5 Murder PC1 (45% of Variance) 19.7364030 FALSE
## 6 HS Grad PC1 (45% of Variance) 18.0356857 TRUE
## 7 Frost PC1 (45% of Variance) 12.7743655 TRUE
## 8 Area PC1 (45% of Variance) 0.1114532 TRUE
## 9 Population PC2 (20.4% of Variance) 16.8817585 TRUE
## 10 Income PC2 (20.4% of Variance) 26.9339033 TRUE
## 11 Illiteracy PC2 (20.4% of Variance) 0.2805685 TRUE
## 12 Life Exp PC2 (20.4% of Variance) 0.6667720 FALSE
## 13 Murder PC2 (20.4% of Variance) 9.4217898 TRUE
## 14 HS Grad PC2 (20.4% of Variance) 8.9261493 TRUE
## 15 Frost PC2 (20.4% of Variance) 2.3588074 FALSE
## 16 Area PC2 (20.4% of Variance) 34.5302512 TRUE
## 17 Population PC3 (13.9% of Variance) 43.0763114 FALSE
## 18 Income PC3 (13.9% of Variance) 1.0071968 FALSE
## 19 Illiteracy PC3 (13.9% of Variance) 0.5026596 TRUE
## 20 Life Exp PC3 (13.9% of Variance) 12.9551741 FALSE
## 21 Murder PC3 (13.9% of Variance) 1.1765201 TRUE
## 22 HS Grad PC3 (13.9% of Variance) 0.2470935 TRUE
## 23 Frost PC3 (13.9% of Variance) 14.9857610 TRUE
## 24 Area PC3 (13.9% of Variance) 26.0492836 TRUE
## 25 Population PC4 (8.8% of Variance) 16.7596530 TRUE
## 26 Income PC4 (8.8% of Variance) 0.7822798 TRUE
## 27 Illiteracy PC4 (8.8% of Variance) 12.4487608 FALSE
## 28 Life Exp PC4 (8.8% of Variance) 19.5862312 FALSE
## 29 Murder PC4 (8.8% of Variance) 2.7423416 TRUE
## 30 HS Grad PC4 (8.8% of Variance) 5.3626574 FALSE
## 31 Frost PC4 (8.8% of Variance) 38.2729294 TRUE
## 32 Area PC4 (8.8% of Variance) 4.0451468 FALSE
So this is what we’ll be plotting. var is each variable; variable (confusingly) is each PC, along with variance explained (from the eigenvalues); value is how much that variable contributed to that PC; and dir stands for direction, with dir = FALSE meaning negative loading. With that all said and done, it’s time for our first plot (and it’s a big one).
ggplot(pca.contrib,aes(x=var,y=value)) +
geom_hline(yintercept = 100 / length(unique(pca.contrib$var)),
linetype = "dashed") +
geom_bar(stat = "identity",aes(fill = var),colour = "black") +
facet_wrap(~ variable,nrow = 2,scales = "free") +
geom_label(aes(label = var,colour = dir)) +
scale_colour_manual(values = c("red","black"))+
scale_fill_brewer(palette = "Spectral") +
theme_bw(base_size = 20) +
theme(axis.text.x = element_blank()) +
labs(x = "",y = "% Contribution to Principal Component") +
theme(legend.position = "none")
Once again, the code. We plot the contribution table with variable names on the x axis and the contribution on the y. We then draw geom_bar for each variable to show this. Now come the clever parts, if I do say so myself. facet_wrap creates a sub-panel for each PC, and geom_hline draws a dashed line on each panel. This line is the contribution we would expect from each variable if all variables contributed evenly (here 100/8 = 12.5%). Therefore, variables above this line have a “greater-than-average” contribution to that PC. “Significant” contribution sounds a bit nicer, but the concept makes no statistical sense here.
So what can we see from this? PC1 is composed mostly of illiteracy, murder, life expectancy and HS grad. Taking the negative loadings into account, a higher PC1 means a state has a higher life expectancy and high-school graduation rate, and a lower illiteracy and murder rate. Sounds quite nice! Those states also tend to be a bit colder though… You can study the structure of each PC this way. States with a higher PC2 tend to be larger, richer and more populous (but maybe a tad more murdery? I guess that’s just a consequence of their size).
So you can see already how the structuring of a PCA can help you get your head around correlations hidden in a dataset. I’ll give some examples later on to help make this clearer. For now, let’s do another visualisation. This will tell us not only the structure of each PC, but also how the different PCs relate to one another in terms of shared variables. This is not immediately apparent from the above graph.
Worth noting: I came up with this visualisation myself. That’s not to say I invented it, as it may well have been done before. I think it makes sense and I quite like it. But the warning is this: the code and process by which it’s generated are extremely ugly. Unless you really want to know the mechanics of how it’s done, you may wish to skip ahead…
With that warning aside, let’s proceed! Firstly, we need to get our data into the right shape.
pca.grid <-
expand.grid(unique(pca.contrib$var),paste("PC",1:4,sep = ""))
levels(pca.contrib$variable) <- paste("PC",1:4,sep = "")
pca.grid$Var1 <- as.character(pca.grid$Var1)
pca.grid$Var2 <- as.character(pca.grid$Var2)
pca.grid
## Var1 Var2
## 1 Population PC1
## 2 Income PC1
## 3 Illiteracy PC1
## 4 Life Exp PC1
## 5 Murder PC1
## 6 HS Grad PC1
## 7 Frost PC1
## 8 Area PC1
## 9 Population PC2
## 10 Income PC2
## 11 Illiteracy PC2
## 12 Life Exp PC2
## 13 Murder PC2
## 14 HS Grad PC2
## 15 Frost PC2
## 16 Area PC2
## 17 Population PC3
## 18 Income PC3
## 19 Illiteracy PC3
## 20 Life Exp PC3
## 21 Murder PC3
## 22 HS Grad PC3
## 23 Frost PC3
## 24 Area PC3
## 25 Population PC4
## 26 Income PC4
## 27 Illiteracy PC4
## 28 Life Exp PC4
## 29 Murder PC4
## 30 HS Grad PC4
## 31 Frost PC4
## 32 Area PC4
So we start off with a two-column data frame: one column of variable names, the other just listing each PC. Next, we go through every row in that data frame and append two more columns. One is the % contribution of that variable to that PC. The other is whether that is a greater-than-average contribution (TRUE/FALSE).
for (i in 1:nrow(pca.grid))
{
df <- subset(pca.contrib,var == pca.grid[i,1] &
variable == pca.grid[i,2])
pca.grid$contrib[i] <- df$value
pca.grid$sig[i] <-
df$value > 100 / length(unique(pca.contrib$var))
}
pca.grid
## Var1 Var2 contrib sig
## 1 Population PC1 1.5984061 FALSE
## 2 Income PC1 8.9299316 FALSE
## 3 Illiteracy PC1 21.8714450 TRUE
## 4 Life Exp PC1 16.9423099 TRUE
## 5 Murder PC1 19.7364030 TRUE
## 6 HS Grad PC1 18.0356857 TRUE
## 7 Frost PC1 12.7743655 TRUE
## 8 Area PC1 0.1114532 FALSE
## 9 Population PC2 16.8817585 TRUE
## 10 Income PC2 26.9339033 TRUE
## 11 Illiteracy PC2 0.2805685 FALSE
## 12 Life Exp PC2 0.6667720 FALSE
## 13 Murder PC2 9.4217898 FALSE
## 14 HS Grad PC2 8.9261493 FALSE
## 15 Frost PC2 2.3588074 FALSE
## 16 Area PC2 34.5302512 TRUE
## 17 Population PC3 43.0763114 TRUE
## 18 Income PC3 1.0071968 FALSE
## 19 Illiteracy PC3 0.5026596 FALSE
## 20 Life Exp PC3 12.9551741 TRUE
## 21 Murder PC3 1.1765201 FALSE
## 22 HS Grad PC3 0.2470935 FALSE
## 23 Frost PC3 14.9857610 TRUE
## 24 Area PC3 26.0492836 TRUE
## 25 Population PC4 16.7596530 TRUE
## 26 Income PC4 0.7822798 FALSE
## 27 Illiteracy PC4 12.4487608 FALSE
## 28 Life Exp PC4 19.5862312 TRUE
## 29 Murder PC4 2.7423416 FALSE
## 30 HS Grad PC4 5.3626574 FALSE
## 31 Frost PC4 38.2729294 TRUE
## 32 Area PC4 4.0451468 FALSE
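(As an aside, the same two columns could be built without a loop by merging the grid with the contribution table. A quick sketch, relying on the column names created above; the row order may differ from the loop’s output:)
#Vectorised alternative to the loop above
pca.grid2 <- merge(pca.grid, pca.contrib,
                   by.x = c("Var1","Var2"), by.y = c("var","variable"))
pca.grid2$sig <- pca.grid2$value > 100 / length(unique(pca.contrib$var))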
Still with me? Excellent. Now, we make a blank matrix. For this to work, it has to, officially, be a square matrix. But we will end up leaving most of it empty.
nms <- c(unique(pca.grid$Var1),unique(pca.grid$Var2))
pca.mat <- Matrix(
0,dimnames = list(nms,nms),
nrow = length(unique(nms)),
ncol = length(unique(nms)),
sparse = TRUE,doDiag = TRUE
)
print(pca.mat)
## 12 x 12 sparse Matrix of class "dsCMatrix"
## [[ suppressing 12 column names 'Population', 'Income', 'Illiteracy' ... ]]
##
## Population . . . . . . . . . . . .
## Income . . . . . . . . . . . .
## Illiteracy . . . . . . . . . . . .
## Life Exp . . . . . . . . . . . .
## Murder . . . . . . . . . . . .
## HS Grad . . . . . . . . . . . .
## Frost . . . . . . . . . . . .
## Area . . . . . . . . . . . .
## PC1 . . . . . . . . . . . .
## PC2 . . . . . . . . . . . .
## PC3 . . . . . . . . . . . .
## PC4 . . . . . . . . . . . .
As you can see, the rows are named for each variable and each PC. The columns would have the same names, but the print method suppresses them by default. This is what’s called a “sparse matrix”, from the Matrix package; it only bothers storing the non-zero cells, so it has no problem being mostly empty. That is exactly what we need. Now, the following hideous loop fills in some of the cells with the data we need.
for (i in 1:nrow(pca.mat))
{
for (j in 1:ncol(pca.mat))
{
val <- subset(pca.grid,Var1 == row.names(pca.mat)[i] &
Var2 == colnames(pca.mat)[j])
if (nrow(val) == 0)
{
next
}
if (val[,4] == TRUE)
{
pca.mat[i,j] <- val[,3]
}
}
}
It works by looping through each row and column and taking a subset of the earlier data frame of contributions for that row-column combination. If that combination doesn’t exist (as most won’t, since the matrix has to be square and therefore contains variable-variable and PC-PC cells we never use), it moves on to the next one. If the combination exists and that variable’s contribution to that PC is greater than expected, the contribution is inserted into the matrix (how often do you legitimately get to use a phrase like that?). Either way, this leaves us with the following sad-looking matrix:
image(pca.mat)
Ignore most of the empty cells; the only ones that matter are rows 1:8 and columns 9:12. The rows represent variables, the columns represent PCs 1:4. A darker cell means that variable (though they aren’t labelled) contributes more to that PC. An empty cell means that variable contributes less than average to that PC.
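(Another aside: if the nested loop offends you, the same values can be dropped in with a single vectorised assignment on a plain base matrix, converting to a sparse Matrix afterwards. A sketch, not the approach I actually used:)
#Vectorised alternative: fill a base matrix by (row, column) index pairs
keep <- subset(pca.grid, sig) #Only the greater-than-average contributions
m <- matrix(0, length(nms), length(nms), dimnames = list(nms, nms))
m[cbind(match(keep$Var1, nms), match(keep$Var2, nms))] <- keep$contrib
pca.mat2 <- Matrix(m, sparse = TRUE) #Same non-zero cells as pca.mat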
“Cool” though this is, this is not our final destination. Next, we must use the igraph package to make a graph from the relationships in this matrix. This is the network sense of the word graph: a set of nodes connected by edges, rather than a plot of data. I mean technically yes, any plot shows relationships between things, but this kind of structure really is called a graph. Just take my word for it.
graph <-
graph_from_adjacency_matrix(pca.mat,mode = "undirected",weighted = NULL)
graph <- simplify(graph,remove.loops = TRUE)
So we create the graph from our adjacency matrix, and simplify it for ease of plotting. We then need to select a layout for the graph, and plot it. I chose the Fruchterman-Reingold layout algorithm as that tends to give a nice readable graph.
layout <- layout_with_fr(graph)
plot(graph,layout = layout)
It must be noted that the layout is not deterministic. Every time you run it, it is recalculated, and the result can come out quite differently depending on your data and the random starting positions. Observe.
layout <- layout_with_fr(graph)
plot(graph,layout = layout)
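Because the layout algorithm uses R’s random number generator, you can make it reproducible by setting a seed before computing it; for example:
set.seed(2023) #Any fixed seed will do; it just makes the layout repeatable
layout <- layout_with_fr(graph)
plot(graph,layout = layout)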
Anyway, we’re nearly there. This graph technically shows which variables contribute to which PC. But I want it to be both prettier and more informative! This calls for ggplot2. It also calls for some serious data wrangling to extract what I need from the extremely obtuse igraph object. First, we extract a data frame of the graph nodes.
coord <- data.frame(names = attr(V(graph),"names"),
x = layout[,1],y = layout[,2])
coord
## names x y
## 1 Population -0.1136986 11.528751
## 2 Income -3.1372125 12.712944
## 3 Illiteracy 5.0716546 9.166989
## 4 Life Exp 2.5928246 10.738258
## 5 Murder 6.0020231 10.203971
## 6 HS Grad 5.7320291 11.575549
## 7 Frost 2.7399309 11.464784
## 8 Area -0.2947086 12.769613
## 9 PC1 4.4252830 10.637158
## 10 PC2 -1.6198034 12.372688
## 11 PC3 1.2322274 11.877974
## 12 PC4 1.3936845 10.872730
This shows us the x-y co-ordinates of each node (point) on the graph. The edges are a bit trickier.
edges <- attr(E(graph),"vnames")
edges
## [1] "Population|PC2" "Population|PC3" "Population|PC4" "Income|PC2"
## [5] "Illiteracy|PC1" "Life Exp|PC1" "Life Exp|PC3" "Life Exp|PC4"
## [9] "Murder|PC1" "HS Grad|PC1" "Frost|PC1" "Frost|PC3"
## [13] "Frost|PC4" "Area|PC2" "Area|PC3"
All we can extract from the object is a set of edge names. First things first, we need to split each entry (say, Population|PC2) into a clear start and end point. To do this, I’ve created a vectorised wrapper around the strsplit function that will do exactly that. The code is hideous, I know.
strsplit.vect <- function(char,split,item)
{
char <- as.character(char)
output <- c()
for (i in 1:length(char))
{
output[i] <- strsplit(char[i],split)[[1]][item]
}
return(output)
}
edges <- data.frame(start = strsplit.vect(edges,"[|]",1),
end = strsplit.vect(edges,"[|]",2))
edges
## start end
## 1 Population PC2
## 2 Population PC3
## 3 Population PC4
## 4 Income PC2
## 5 Illiteracy PC1
## 6 Life Exp PC1
## 7 Life Exp PC3
## 8 Life Exp PC4
## 9 Murder PC1
## 10 HS Grad PC1
## 11 Frost PC1
## 12 Frost PC3
## 13 Frost PC4
## 14 Area PC2
## 15 Area PC3
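(For what it’s worth, base R can do the same split without a helper function; a quick sketch, working from the raw edge names again:)
#One-line alternative: split on the literal "|" and bind the pieces into columns
parts <- do.call(rbind, strsplit(attr(E(graph), "vnames"), "|", fixed = TRUE))
edges2 <- data.frame(start = parts[,1], end = parts[,2])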
We’re getting somewhere though. Now, we loop through this data frame and do a lot of stuff at each iteration.
for (i in 1:nrow(edges))
{
edges$start.x[i] <- coord[as.character(coord[,1]) == edges[i,1],2]
edges$start.y[i] <- coord[as.character(coord[,1]) == edges[i,1],3]
edges$end.x[i] <- coord[as.character(coord[,1]) == edges[i,2],2]
edges$end.y[i] <- coord[as.character(coord[,1]) == edges[i,2],3]
edges$weight[i] <-
pca.mat[which(row.names(pca.mat) == edges[i,1]),
which(colnames(pca.mat) == edges[i,2])]
edges$c.x[i] <- mean(c(edges$start.x[i],edges$end.x[i]))
edges$c.y[i] <- mean(c(edges$start.y[i],edges$end.y[i]))
edges$dir[i] <-
ifelse(pca.contrib[pca.contrib$var == edges$start[i] &
pca.contrib$variable == edges$end[i],4],
"+","-")
}
edges
## start end start.x start.y end.x end.y weight
## 1 Population PC2 -0.1136986 11.528751 -1.619803 12.37269 16.88176
## 2 Population PC3 -0.1136986 11.528751 1.232227 11.87797 43.07631
## 3 Population PC4 -0.1136986 11.528751 1.393684 10.87273 16.75965
## 4 Income PC2 -3.1372125 12.712944 -1.619803 12.37269 26.93390
## 5 Illiteracy PC1 5.0716546 9.166989 4.425283 10.63716 21.87145
## 6 Life Exp PC1 2.5928246 10.738258 4.425283 10.63716 16.94231
## 7 Life Exp PC3 2.5928246 10.738258 1.232227 11.87797 12.95517
## 8 Life Exp PC4 2.5928246 10.738258 1.393684 10.87273 19.58623
## 9 Murder PC1 6.0020231 10.203971 4.425283 10.63716 19.73640
## 10 HS Grad PC1 5.7320291 11.575549 4.425283 10.63716 18.03569
## 11 Frost PC1 2.7399309 11.464784 4.425283 10.63716 12.77437
## 12 Frost PC3 2.7399309 11.464784 1.232227 11.87797 14.98576
## 13 Frost PC4 2.7399309 11.464784 1.393684 10.87273 38.27293
## 14 Area PC2 -0.2947086 12.769613 -1.619803 12.37269 34.53025
## 15 Area PC3 -0.2947086 12.769613 1.232227 11.87797 26.04928
## c.x c.y dir
## 1 -0.8667510 11.950720 +
## 2 0.5592644 11.703362 -
## 3 0.6399930 11.200740 +
## 4 -2.3785079 12.542816 +
## 5 4.7484688 9.902074 -
## 6 3.5090538 10.687708 +
## 7 1.9125260 11.308116 -
## 8 1.9932545 10.805494 -
## 9 5.2136530 10.420565 -
## 10 5.0786560 11.106353 +
## 11 3.5826069 11.050971 +
## 12 1.9860791 11.671379 +
## 13 2.0668077 11.168757 +
## 14 -0.9572560 12.571151 +
## 15 0.4687594 12.323793 +
So start.x and start.y are the co-ordinates (from the coord frame) of the start point (the variable), and end.x and end.y are the same but for the PC point. Note that with this visualisation method two variables are never linked directly; a variable is only ever linked to a PC, though it can be linked to several PCs.
weight is the contribution of that variable to that PC. Finally, c.x and c.y are the midpoints along the edge, and dir is whether the loading was positive or negative. Now, the final bit of data prep before plotting. We create a variable called PC which is TRUE if that node is a PC, and FALSE if it is a variable.
coord$PC <- grepl("PC",coord$names)
coord
## names x y PC
## 1 Population -0.1136986 11.528751 FALSE
## 2 Income -3.1372125 12.712944 FALSE
## 3 Illiteracy 5.0716546 9.166989 FALSE
## 4 Life Exp 2.5928246 10.738258 FALSE
## 5 Murder 6.0020231 10.203971 FALSE
## 6 HS Grad 5.7320291 11.575549 FALSE
## 7 Frost 2.7399309 11.464784 FALSE
## 8 Area -0.2947086 12.769613 FALSE
## 9 PC1 4.4252830 10.637158 TRUE
## 10 PC2 -1.6198034 12.372688 TRUE
## 11 PC3 1.2322274 11.877974 TRUE
## 12 PC4 1.3936845 10.872730 TRUE
We then create a variable called “colour”, which for each PC will equal that PC’s name, and for all variables will equal PC5. This will be used to colour code our graph.
coord$names <- as.character(coord$names)
coord$colour <- factor(ifelse(coord$PC == TRUE,coord$names,"PC5"))
coord
## names x y PC colour
## 1 Population -0.1136986 11.528751 FALSE PC5
## 2 Income -3.1372125 12.712944 FALSE PC5
## 3 Illiteracy 5.0716546 9.166989 FALSE PC5
## 4 Life Exp 2.5928246 10.738258 FALSE PC5
## 5 Murder 6.0020231 10.203971 FALSE PC5
## 6 HS Grad 5.7320291 11.575549 FALSE PC5
## 7 Frost 2.7399309 11.464784 FALSE PC5
## 8 Area -0.2947086 12.769613 FALSE PC5
## 9 PC1 4.4252830 10.637158 TRUE PC1
## 10 PC2 -1.6198034 12.372688 TRUE PC2
## 11 PC3 1.2322274 11.877974 TRUE PC3
## 12 PC4 1.3936845 10.872730 TRUE PC4
Now, to plotting! The base plot is called on the nodes (coord) data frame. We then add the edges in using geom_segment to draw a line from the start to the end points, coloured by which PC it ends up at, and sized according to weight (that is, the contribution of that variable to that PC). We then write labels for each PC and each variable, coloured by the colour parameter. The colour scheme here is crucial; we select 4 colours, then white. This means that PC1-4 will have their own colour, and “PC5” will be white. Since PC5 is the label given to all variables, they will be in white text. Finally we write another set of labels on the midpoints of each edge with a “+” or “-” to denote positive or negative loading.
ggplot(coord,aes(x,y)) +
geom_segment(data = edges
,aes(
x = start.x,y = start.y,
xend = end.x,yend = end.y,
colour = end,size = weight
)) +
geom_label(size = 6,aes(fill = colour,label = names)) +
theme_bw() +
theme(
legend.position = "none",
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()
) +
labs(x = "",y = "") +
scale_colour_manual(values = c(brewer.pal(4,"Set1"))) +
scale_fill_manual(values = c(brewer.pal(4,"Set1"),"white")) +
geom_label(aes(x = c.x,y = c.y,label = dir),data = edges,size = 8)
I suppose it’s a matter of taste, but I really like this visualisation. Each PC has a colour and links to its contributing variables, with thicker lines indicating a larger contribution. You can, at a glance, see both the structure of the PCs and how they relate to one another. It provides a nice reference to quickly look back at when describing PC effects.
With that in mind, let’s have a look at how states vary in PCs. You can extract the actual PC scores for each state using the call pca.dat$ind$coord.
pca.stat <- as.data.frame(pca.dat$ind$coord)[,1:4]
pca.stat
## Dim.1 Dim.2 Dim.3 Dim.4
## Alabama -3.82836428 -0.23716257 0.23164558 -0.387160137
## Alaska 1.06382751 5.51156918 4.28364318 -0.581518252
## Arizona -0.87623537 0.75262575 0.07805313 -1.736293836
## Arkansas -2.40595872 -1.30142362 0.22505473 -0.629534491
## California -0.24383210 3.54515336 -2.83493329 0.071090007
## Colorado 2.08311775 0.51079765 0.51657601 0.111038575
## Connecticut 1.91871993 -0.24547359 -0.66939807 -0.171453869
## Delaware 0.42909658 -0.51307618 0.22094439 0.194608671
## Florida -1.18402345 1.14626187 -1.29485401 -0.493455278
## Georgia -3.32761584 0.11107318 0.39099842 0.460212968
## Hawaii 0.49198599 0.12653389 -1.39131861 -2.980183281
## Idaho 1.43788059 -0.61734785 0.43889922 -0.409728335
## Illinois -0.12017203 1.29540733 -0.81201670 1.599063843
## Indiana 0.47598579 -0.24769029 -0.30454473 0.663154499
## Iowa 2.34538438 -0.54230654 -0.29622393 -0.206823603
## Kansas 1.92082004 -0.07797440 -0.17890918 -0.620307462
## Kentucky -2.15097823 -1.07505720 0.25427250 0.352380672
## Louisiana -4.28406544 -0.34981662 0.23121601 -0.888269677
## Maine 0.96994215 -1.71970311 0.72880344 0.550797460
## Maryland 0.20549128 0.39275855 -0.34266902 0.668715496
## Massachusetts 1.20803512 -0.22087616 -1.02998153 -0.293016026
## Michigan -0.18371587 0.85571675 -0.55982049 1.194761816
## Minnesota 2.45832345 -0.36904451 -0.27515377 -0.068086037
## Mississippi -4.07302459 -1.06191343 0.92891890 -0.177695544
## Missouri -0.31441452 -0.14981157 0.00667219 0.559512234
## Montana 1.39287204 -0.03387927 1.35355575 0.247436744
## Nebraska 2.20315952 -0.55330929 0.05633810 -0.460491366
## Nevada 1.13852733 1.14441562 1.90984188 1.520116562
## New Hampshire 1.68825709 -1.32572231 0.44232732 0.485234407
## New Jersey 0.65617713 0.28432749 -0.98379015 0.622361283
## New Mexico -1.33587312 -0.29655089 1.37579210 -0.714194797
## New York -1.06101371 1.91293670 -2.22014430 1.279829922
## North Carolina -2.72168811 -0.52238917 0.14418326 0.566332475
## North Dakota 2.44221334 -0.78986053 0.27978746 0.137693336
## Ohio 0.27067753 0.42108548 -1.04342981 1.159210823
## Oklahoma -0.07466361 -0.65314783 -0.07191020 -0.613076010
## Oregon 1.33817792 0.22998659 -0.50402882 -1.363717422
## Pennsylvania -0.07816735 0.27214457 -1.08096169 1.302588254
## Rhode Island 0.74836879 -1.47613919 -0.23013703 -0.299896914
## South Carolina -3.74868247 -0.91908151 0.75633355 0.319026783
## South Dakota 2.03296648 -1.32844647 0.54199391 0.159024325
## Tennessee -2.24065364 -0.65763460 -0.01693749 -0.002769606
## Texas -2.43814743 2.35107064 -0.28894455 -0.812241135
## Utah 2.28581091 -0.53975620 0.22199848 -0.859252535
## Vermont 1.38316765 -1.52470730 0.45022716 0.362759114
## Virginia -1.00363500 0.18644420 -0.21296885 0.327713915
## Washington 1.35361753 0.51673796 -0.86045991 -1.250002832
## West Virginia -1.52191818 -1.61824796 0.49718832 0.345594712
## Wisconsin 1.77538396 -0.64218163 -0.42968914 0.113439275
## Wyoming 1.49885527 0.04268506 1.36796025 0.645470271
By default, this will hold information on the first 5 PCs. To get more, use the ncp argument when initially creating the PCA object (something like PCA(state.dat, scale.unit = TRUE, ncp = 8, graph = FALSE)). Now, we can just rename these variables and add them to the actual state dataset.
names(pca.stat) <- paste("PC",1:4,sep = "")
state.dat <- cbind(state.dat,pca.stat)
Now, I’ve created a quick function that will print the top 5 and bottom 5 scoring states for each PC to have a look at patterns.
toptail <- function(name,dat)
{
sort.dat <-
dat[order(dat[,which(names(dat) == name)],decreasing = TRUE),]
out <- rbind(cbind(row.names(head(sort.dat,5)),1:5),
cbind("--------","--------"),
cbind(row.names(tail(sort.dat,5)),46:50))
print(as.data.frame(out),row.names = FALSE)
}
Let’s start with PC1. I won’t include the mapping code because it’s not really part of this tutorial.
toptail("PC1",state.dat)
## V1 V2
## Minnesota 1
## North Dakota 2
## Iowa 3
## Utah 4
## Nebraska 5
## -------- --------
## Georgia 46
## South Carolina 47
## Alabama 48
## Mississippi 49
## Louisiana 50
So looking at the structure plot, a high PC1 indicates more education and a longer life expectancy, less illiteracy and murder, and, to a smaller extent, more days of frost. In general, the Frost variable is interesting, as it’s a good proxy for north vs south (which I’m told is a big thing in the US). Indeed, most of the high-PC1 states are northern, and all of the lowest-PC1 states are southern. Keep the PC structure plot from earlier in mind as a reference for the rest of these comparisons.
toptail("PC2",state.dat)
## V1 V2
## Alaska 1
## California 2
## Texas 3
## New York 4
## Illinois 5
## -------- --------
## South Dakota 46
## Rhode Island 47
## Vermont 48
## West Virginia 49
## Maine 50
PC2 picks out states that are big, populous and high-income. Looking at the top and bottom states, that makes a lot of sense.
toptail("PC3",state.dat)
## V1 V2
## Alaska 1
## Nevada 2
## New Mexico 3
## Wyoming 4
## Montana 5
## -------- --------
## Pennsylvania 46
## Florida 47
## Hawaii 48
## New York 49
## California 50
PC3 is interesting. It still loads on physically big states, but population loads negatively, so it picks out large, sparsely populated states. Frost also loads onto this PC, suggesting that these states tend to be cooler, or equivalently that the more densely populated states are warmer ones (see Hawaii, Florida and California). Weirdly, the high-PC3 states also tend to have a slightly lower life expectancy.
This is the real heart of PCA: something like population can load positively onto one PC and negatively onto another, and each combination reveals a different pattern in the data. Even the smaller correlations, such as those involving Frost, still get accounted for somewhere.
toptail("PC4",state.dat)
## V1 V2
## Illinois 1
## Nevada 2
## Pennsylvania 3
## New York 4
## Michigan 5
## -------- --------
## Louisiana 46
## Washington 47
## Oregon 48
## Arizona 49
## Hawaii 50
Finally, PC4 shows highly populated, colder states with a lower life expectancy. Here we see an issue with PCA: later PCs tend to capture more noise, so their structure is somewhat more difficult to explain.
Either way, hopefully I’ve shown you how to carry out and visualise this very helpful analysis technique, including my own home-brewed visualisation. Any questions, drop me a comment on here, or hit me up on Twitter @b_t_cooper. Thanks for your time!