Visualizing multicollinearity using igraph

Document history:
amended June 18 to change some of the default settings and colours
amended June 18 to further tidy the code and add some additional functionality
amended June 16 to tidy up the function code
amended June 16 to improve the visualization of circular (and grid) layouts
amended June 15 to correct an error in the plot function

Introduction

Recently I have been using a range of neighbourhood data to explore some of the predictors of Covid-19 fatality rates in London. However, there is a problem with the data: the variables I am using are highly collinear. I was therefore interested in using R’s igraph package to explore the relationships between the variables and the structure of the correlations. I explain why I prefer it over some other possible approaches towards the end.

Getting Started

To get started, load the data and source the (pretty simple) function I have written to interface with igraph. The function should work with other data sets but I am offering no guarantees. Primarily it was written for my benefit and no doubt contains plenty of examples of my bad/lazy coding (i.e. don’t be mean!).

pkg <- try(require(tidyverse))
if(!pkg) {
  install.packages("tidyverse")
  require(tidyverse)
}
df <- read_csv("https://www.dropbox.com/s/ylcipspefqgfrp0/neighbourhood_data.csv?dl=1")
source("https://www.dropbox.com/s/ht5k403adce4h6f/corr_graph.R?dl=1")

The data are similar to those described in this article, except that where the article has the variables as the difference between neighbouring locations, here the data are just the simple percentages (e.g. the percentage of the population in a neighbourhood aged in their twenties, the percentage of the population of an Indian ethnicity, etc.). Each row in the data is a Middle Layer Super Output Area (MSOA) in London, but I have stripped out the names/codes here so we are just left with numeric data.

Visualizing the multicollinearity

The correlations between the variables can be shown as a matrix/table of data but it is quite large and unwieldy to read in its entirety.

as_tibble(cor(df)) %>%
  print(n = 3)

## # A tibble: 36 x 36
##   Twenties Thirties Fourties Fifties Sixties Seventies Eightyplus  INDIAN
##      <dbl>    <dbl>    <dbl>   <dbl>   <dbl>     <dbl>      <dbl>   <dbl>
## 1    1       0.582   -0.514   -0.719  -0.660   -0.598     -0.544  -0.148 
## 2    0.582   1       -0.0136  -0.739  -0.749   -0.716     -0.685  -0.0866
## 3   -0.514  -0.0136   1        0.290   0.125    0.0788    -0.0101 -0.0958
## # ... with 33 more rows, and 28 more variables: PAKISTANI <dbl>, BANG <dbl>,
## #   AOTH <dbl>, CHNE <dbl>, ARAB <dbl>, BAFR <dbl>, BCRB <dbl>, BOTH <dbl>,
## #   WBRI <dbl>, WOTH <dbl>, IRISH <dbl>, Detached <dbl>, Semi <dbl>,
## #   Terraced <dbl>, Flats <dbl>, decile <dbl>, vhighocc <dbl>, IMD_rank <dbl>,
## #   Social_rented <dbl>, Group1 <dbl>, Group2 <dbl>, Group3 <dbl>,
## #   Group4 <dbl>, Group5 <dbl>, Group6 <dbl>, Group7 <dbl>, Group8 <dbl>,
## #   gender_ratio <dbl>
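One way to make a large correlation matrix easier to scan is to reshape it into a long table with one row per pair of variables, sorted by the strength of the relationship. The sketch below is only illustrative: it uses a small simulated data frame (`toy`, with made-up variables `x1` to `x4`) rather than the neighbourhood data, and base R only.

```r
# Simulated stand-in for the neighbourhood data (hypothetical values)
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  x1 = a,
  x2 = a + rnorm(n, sd = 0.3),   # strongly positive with x1
  x3 = -a + rnorm(n, sd = 0.5),  # strongly negative with x1
  x4 = rnorm(n)                  # unrelated to the rest
)

R <- cor(toy)

# Keep each pair once: upper triangle only, which also drops the diagonal
idx <- which(upper.tri(R), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(R)[idx[, "row"]],
  var2 = colnames(R)[idx[, "col"]],
  r    = R[idx]
)

# Strongest relationships first, regardless of sign
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs, row.names = FALSE)
```

With 36 variables this gives 630 rows, which is still a lot to read, and that is part of the motivation for drawing the correlations as a graph instead.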

Let’s jump in and visualize them using igraph:

corr_graph(df)

Here it is showing positive correlations in blue, on the left, and negative correlations in red, on the right. Correlations with a magnitude less than 0.5 are ignored.
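For anyone curious about the general idea, a thresholded correlation matrix can be turned into an igraph object in a few lines. This is a minimal sketch of the technique and not the code inside `corr_graph`: the data are simulated, the names `toy`, `r_cut` and the edge attribute `r` are my own, and the two signs are coloured in a single graph rather than drawn side by side.

```r
library(igraph)

# Simulated stand-in for the neighbourhood data (hypothetical values)
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  x1 = a,
  x2 = a + rnorm(n, sd = 0.3),
  x3 = -a + rnorm(n, sd = 0.5),
  x4 = rnorm(n)
)

R <- cor(toy)
r_cut <- 0.5   # same value as the default cut-off described above

# Zero out the diagonal and every correlation weaker than the cut-off
A <- R
diag(A) <- 0
A[abs(A) < r_cut] <- 0

# Each remaining non-zero cell becomes an undirected edge. The signed
# correlation is stored in an edge attribute called "r" (not "weight"),
# so that layout algorithms do not try to use negative weights.
g <- graph_from_adjacency_matrix(A, mode = "upper", weighted = "r", diag = FALSE)

# Drop variables that are left without any edge
g <- delete_vertices(g, which(degree(g) == 0))

# Colour edges by sign and scale their width by magnitude
E(g)$color <- ifelse(E(g)$r > 0, "blue", "red")
E(g)$width <- 1 + 5 * abs(E(g)$r)

set.seed(2)  # force-directed layouts are random, so fix the seed
plot(g, layout = layout_nicely(g),
     vertex.color = "white", vertex.label.color = "black")
```

In the toy data `x4` is unrelated to the others, so it is removed before plotting, which mirrors the way some variables are absent from the chart above.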

We can change that threshold, keeping all those with a correlation of magnitude 0.75 or greater, for example:

corr_graph(df, r.cut.off = 0.75)

This reveals the strong correlations between some of the age-related variables, between some of the ethnicity variables (specifically, the black groups), and between some of the measures of social class.

An alternative is not to use the magnitude of the correlation but whether it is statistically significant at, say, the 99% confidence level or greater. I am less keen on this because, with a reasonably sized data set, many or all of the correlations will be; however, to do it…

corr_graph(df, alpha.cut.off = 0.01, use.significance = TRUE)

… which might be clearer if another one of the layout options in igraph is used instead (see towards the end of this helpful article):

corr_graph(df, layout = "layout_in_circle",
           alpha.cut.off = 0.01, use.significance = TRUE)
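To see why a significance cut-off tends to keep more links than a magnitude cut-off, it helps to compute both on the same data. The sketch below uses `cor.test()` from base R on simulated data with 1000 rows; the variable names and effect sizes are made up, and I am assuming (not asserting) that `corr_graph` does something broadly similar when `use.significance = TRUE`.

```r
# Simulated data with one strong and one weak relationship (hypothetical)
set.seed(1)
n <- 1000
a <- rnorm(n)
toy <- data.frame(
  x1 = a,
  x2 = a + rnorm(n, sd = 0.3),  # strong: r is roughly 0.95
  x3 = 0.25 * a + rnorm(n),     # weak: r is roughly 0.25
  x4 = rnorm(n)                 # unrelated
)

vars <- names(toy)
R <- cor(toy)

# Matrix of p-values from pairwise Pearson correlation tests
P <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i != j) P[i, j] <- cor.test(toy[[i]], toy[[j]])$p.value
  }
}

alpha <- 0.01
significant <- P < alpha     # links kept by the significance rule
strong <- abs(R) >= 0.5      # links kept by the magnitude rule

# Count each pair once using the upper triangle
n_significant <- sum(significant[upper.tri(significant)])
n_strong <- sum(strong[upper.tri(strong)])
c(significant = n_significant, strong = n_strong)
```

The weak `x1` to `x3` relationship passes the significance test comfortably with this many rows even though it is nowhere near the 0.5 magnitude threshold.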

Here’s another alternative, this time going back to the default cut-off of a correlation of magnitude greater than 0.5. I don’t like it as much because, like the circle chart, it loses some of the structure of the inter-variable correlations. However, also like the circle, it spreads the variables out more across the chart, which may help with legibility.

corr_graph(df, layout = "layout_on_grid",
           scale.vertex = TRUE,
           vertex.cex = 3, margin = 0.5)

Note that in this chart (above), the size of the circles (each vertex) is scaled so that the larger the circle, the greater that variable’s total correlation with the others. Going back to the first chart type and applying the same principle:

corr_graph(df, vertex.cex = 5, scale.vertex = TRUE)
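One plausible reading of a variable’s “total correlation” is the sum of the absolute correlations that survive the cut-off. I have not reproduced the exact formula used inside `corr_graph`, so treat the sketch below as an assumption: the data are simulated and the scaling rule (`base_cex` times one plus the share of the maximum) is purely hypothetical.

```r
# Simulated stand-in for the neighbourhood data (hypothetical values)
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  x1 = a,
  x2 = a + rnorm(n, sd = 0.3),
  x3 = -a + rnorm(n, sd = 0.5),
  x4 = rnorm(n)
)

R <- cor(toy)
diag(R) <- 0
R[abs(R) < 0.5] <- 0          # keep only correlations at or above the cut-off

# Sum of absolute retained correlations for each variable
total <- rowSums(abs(R))

# Hypothetical scaling: sizes run from base_cex up to 2 * base_cex
base_cex <- 5
vertex_size <- base_cex * (1 + total / max(total))

round(cbind(total, vertex_size), 2)
```

A vector like `vertex_size` could then be passed to igraph’s `plot()` as `vertex.size`.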

It might be noted that some variables are missing from the charts. Those variables are the ones that have no correlation with any other variable at the chosen cut-off (here, the default that the magnitude of the correlation exceeds 0.5). To include those variables on the chart, just to make the point that they are not connected to the others in the same way, set omit.rest = FALSE:

corr_graph(df, vertex.cex = 5,
           scale.vertex = TRUE,
           border = "dark grey", omit.rest = FALSE)

This shows the ‘disconnect’ of the Arab and Bangladeshi ethnicity variables from the rest (at the given cut-off). Another possibility is to use the circle design again:

corr_graph(df, layout = "layout_in_circle",
           scale.vertex = TRUE,
           vertex.cex = 1.5, omit.rest = FALSE)
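Which variables drop out at a given cut-off can also be checked directly from the correlation matrix, without drawing anything. The helper below, `isolated_at()`, is a hypothetical function of my own and is not part of `corr_graph`; it is shown on simulated data.

```r
# Simulated stand-in for the neighbourhood data (hypothetical values)
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  x1 = a,
  x2 = a + rnorm(n, sd = 0.3),
  x3 = -a + rnorm(n, sd = 0.5),
  x4 = rnorm(n)
)

# Return the names of variables with no correlation at or above `cut`
isolated_at <- function(R, cut) {
  diag(R) <- 0                               # ignore self-correlations
  has_link <- apply(abs(R) >= cut, 1, any)   # TRUE if any link survives
  names(has_link)[!has_link]
}

R <- cor(toy)
isolated_at(R, 0.5)    # variables that would be omitted at the default cut-off
isolated_at(R, 0.99)   # a very strict cut-off isolates everything here
```

Run on the real data with `cor(df)`, the same function should list the variables that only appear when `omit.rest = FALSE`.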

One further embellishment is to bring in a ‘domain variable’, the idea being that the different variables measure different types of domain (different types of thing). First, create the domain variable:

names(df)

##  [1] "Twenties"      "Thirties"      "Fourties"      "Fifties"      
##  [5] "Sixties"       "Seventies"     "Eightyplus"    "INDIAN"       
##  [9] "PAKISTANI"     "BANG"          "AOTH"          "CHNE"         
## [13] "ARAB"          "BAFR"          "BCRB"          "BOTH"         
## [17] "WBRI"          "WOTH"          "IRISH"         "Detached"     
## [21] "Semi"          "Terraced"      "Flats"         "decile"       
## [25] "vhighocc"      "IMD_rank"      "Social_rented" "Group1"       
## [29] "Group2"        "Group3"        "Group4"        "Group5"       
## [33] "Group6"        "Group7"        "Group8"        "gender_ratio"

domains <- c(rep("age", 7), rep("ethnicity", 12), rep("housing", 4),
             rep("deprivation", 4), rep("class", 8),
             rep("other", 1))
length(domains) == length(names(df))

## [1] TRUE

Then plot the graph with it:

corr_graph(df, vertex.cex = 5, scale.vertex = TRUE,
           legend.pos = "bottomleft",
           domains = domains, omit.rest = FALSE)
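Because the domain vector is matched to the columns purely by position, it is worth checking the alignment before plotting. The sketch below rebuilds the same vector from a named set of counts and lists which variables fall in each domain. The variable names are copied from the `names(df)` output above; the colour mapping at the end is only an illustration of how domains might be turned into vertex colours, not a description of what `corr_graph` does internally.

```r
# Variable names as printed by names(df) above
vars <- c("Twenties", "Thirties", "Fourties", "Fifties", "Sixties",
          "Seventies", "Eightyplus", "INDIAN", "PAKISTANI", "BANG",
          "AOTH", "CHNE", "ARAB", "BAFR", "BCRB", "BOTH", "WBRI", "WOTH",
          "IRISH", "Detached", "Semi", "Terraced", "Flats", "decile",
          "vhighocc", "IMD_rank", "Social_rented", "Group1", "Group2",
          "Group3", "Group4", "Group5", "Group6", "Group7", "Group8",
          "gender_ratio")

# Number of consecutive columns belonging to each domain
domain_sizes <- c(age = 7, ethnicity = 12, housing = 4,
                  deprivation = 4, class = 8, other = 1)
domains <- rep(names(domain_sizes), times = domain_sizes)
stopifnot(length(domains) == length(vars))

# List the variables in each domain to confirm the grouping by eye
by_domain <- split(vars, factor(domains, levels = names(domain_sizes)))
print(by_domain)

# Illustrative only: one colour per domain, looked up for each variable
domain_cols <- setNames(rainbow(length(domain_sizes)), names(domain_sizes))
vertex_col <- unname(domain_cols[domains])
```

If a column is ever added or reordered, the printed grouping makes a misalignment obvious in a way that the length check alone does not.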

Another possibility is to focus only on the correlations with specific variables, in this example also fixing all the circles to be the same size and annotating the linkages with the correlation values (remember: correlations of magnitude less than 0.5 are omitted by default).

corr_graph(df, use.only = c("WBRI", "BCRB"),
            vertex.cex = 1, annotate = TRUE)

From this we might note that a higher percentage of White British residents in a neighbourhood is positively correlated with older age groups in London and with lower deprivation (higher IMD ranks), amongst other things, and negatively correlated with high overcrowding within properties. Black Caribbeans tend to live in the same areas as other black groups, and those places are more likely to exhibit deprivation.
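The same kind of focused view can be produced as a table. The sketch below keeps only the strong correlations that involve a chosen variable and rounds them for use as edge labels. It assumes that focusing on a variable means keeping every link with at least one end at that variable, which is my reading of the `use.only` argument rather than something I have confirmed from the function’s code; the data and the names `focus` and `edges` are made up.

```r
# Simulated stand-in for the neighbourhood data (hypothetical values)
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  x1 = a,
  x2 = a + rnorm(n, sd = 0.3),
  x3 = -a + rnorm(n, sd = 0.5),
  x4 = rnorm(n)
)

focus <- "x3"                 # variable(s) of interest
R <- cor(toy)
diag(R) <- 0

strong <- abs(R) >= 0.5       # same default cut-off as in the article
# TRUE where the row variable or the column variable is in `focus`
touches <- outer(rownames(R) %in% focus, colnames(R) %in% focus, "|")

# Keep each qualifying pair once (upper triangle)
keep <- strong & touches & upper.tri(R)
idx <- which(keep, arr.ind = TRUE)

edges <- data.frame(
  from  = rownames(R)[idx[, "row"]],
  to    = colnames(R)[idx[, "col"]],
  label = round(R[idx], 2)    # value that would annotate the link
)
print(edges, row.names = FALSE)
```

An edge list in this form could be handed to igraph’s `graph_from_data_frame()` if a chart is wanted.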

Other ways of visualizing correlations

Of course, there are other ways of visualizing the correlations, including using the excellent corrplot package. As a simple example,

pkg <- try(require(corrplot))
if(!pkg) {
  install.packages("corrplot")
  require(corrplot)
}
M <- cor(df)
diag(M) <- 0
corrplot(M, type = "upper", method = "circle")

Or the connections can be explored as a dendrogram:

pkg <- try(require(psych))
if(!pkg) {
  install.packages("psych")
  require(psych)
}
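# Convert the tibble to a plain data frame, then cluster on the distance sqrt(2 * (1 - r))
x <- as.data.frame(df)
d <- as.dist(sqrt(2 * (1 - cor(na.omit(x)))))
plot(hclust(d), xlab = "")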

Although useful, the reason I did not stick with these is that I am not sure either reveals the structure of the relationships as effectively as using igraph does.
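A short note on the distance used for the dendrogram, sqrt(2 * (1 - r)). It maps a correlation of 1 to a distance of 0 and a correlation of -1 to a distance of 2, and it equals the Euclidean distance between two variables once each has been centred and scaled to unit length. The sketch below checks both points on made-up numbers.

```r
# How the distance behaves across the range of correlations
r <- c(1, 0.5, 0, -0.5, -1)
d <- sqrt(2 * (1 - r))
data.frame(r, d = round(d, 3))

# Check the equivalence with Euclidean distance on simulated vectors
set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)

# Centre a vector and scale it to unit length
unit <- function(v) {
  v <- v - mean(v)
  v / sqrt(sum(v^2))
}

euclid <- sqrt(sum((unit(x) - unit(y))^2))
from_r <- sqrt(2 * (1 - cor(x, y)))
all.equal(euclid, from_r)
```

One consequence is that strongly negatively correlated variables are treated as far apart, so the dendrogram groups variables by positive association only, whereas the graph shows links of both signs.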

Conclusion

I will leave it there, but there are a few other options you can change and, of course, you can always hack the code as you wish.

head(corr_graph, n = 10)

##                                                                                     
## 1  function (df, layout = "layout_nicely", output = c("windows",                    
## 2      "quartz", "bmp", "jpeg", "png"), filename = "corr_graph",                    
## 3      domains = NULL, res = 150, height = 6, width = 12, r.cut.off = 0.5,          
## 4      alpha.cut.off = 0.05, use.significance = FALSE, set.seed = 19062020,         
## 5      max.line.width = 6, vertex.cex = 0, label.cex = 0.65, border = "light grey", 
## 6      margin = 0.1, asp = 1, omit.rest = TRUE, use.only = NULL,                    
## 7      scale.vertex = FALSE, legend.pos = c("none", "bottomleft",                   
## 8          "bottomright", "topleft", "topright"), annotate = FALSE,                 
## 9      send.to.output = TRUE)                                                       
## 10 {

If you do use it and/or this article was helpful to you, then I would appreciate it being attributed, especially in any published work:

Harris R, 2020, Visualizing multicollinearity using igraph. https://rpubs.com/profrichharris/visual-collinearity

In turn, I would like to acknowledge the following articles, which I found very helpful: https://www.r-graph-gallery.com/250-correlation-network-with-igraph.html and https://rstudio-pubs-static.s3.amazonaws.com/362044_903076131972463e8fdfcc00885fc9a6.html