Let’s say we have a 25-column matrix and want to analyse and visualize the correlations between its columns. There are several ways to do this; this post covers the corrplot package.

First, let’s look at the object briefly:

mat[1:5,1:5]
##            x_1        x_2       x_3       x_4       x_5
## [1,] -19.56761 -19.464597 -29.44343 24.298796 -25.57141
## [2,] -22.35271 -25.073553 -19.20713 16.564984 -23.71568
## [3,] -23.73841 -10.949853 -24.10458  9.953472 -22.21538
## [4,] -18.36704 -14.156476 -17.70389 19.239598 -15.83134
## [5,] -13.00010  -8.743272 -11.18862 15.272035 -14.60418
dim(mat)
## [1] 100  25
class(mat)
## [1] "matrix"
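The post does not show how mat was created. To follow along, a stand-in with the same shape can be simulated. This is only a sketch: the seed, the shared signal and the weights are my own assumptions, so the numbers will not match the output printed above.

```r
# Stand-in data: NOT the matrix used in this post, only the same shape.
set.seed(1)                          # arbitrary seed for reproducibility
n_obs <- 100
n_var <- 25
shared <- rnorm(n_obs)               # shared signal so that some columns correlate

mat <- sapply(seq_len(n_var), function(j) {
  weight <- if (j <= 10) 5 else 0    # only the first 10 columns load on the signal
  weight * shared + rnorm(n_obs, sd = 5)
})
colnames(mat) <- paste0("x_", seq_len(n_var))

dim(mat)
```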

The first step is calculating the correlations. Note that I use Spearman’s rank correlation coefficient. This is simply because the data are simulated and I am not sure whether there are outliers.

corr_mat <- cor(mat, method = "spearman")
corr_mat[1:5,1:5]
##             x_1         x_2         x_3         x_4         x_5
## x_1  1.00000000  0.22557456 -0.05596160 -0.02371437  0.57196520
## x_2  0.22557456  1.00000000 -0.06113411  0.04896490  0.38661866
## x_3 -0.05596160 -0.06113411  1.00000000 -0.35621962  0.04523252
## x_4 -0.02371437  0.04896490 -0.35621962  1.00000000 -0.12919292
## x_5  0.57196520  0.38661866  0.04523252 -0.12919292  1.00000000
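For a quick numeric check, the strongest pairs can also be listed directly from the correlation matrix. The sketch below uses random stand-in data (an assumption, not the data from this post), and the names idx and pair_tbl are only illustrative.

```r
set.seed(1)
mat <- matrix(rnorm(100 * 25), nrow = 100,
              dimnames = list(NULL, paste0("x_", 1:25)))  # stand-in data
corr_mat <- cor(mat, method = "spearman")

# Keep each pair once by looking only above the diagonal.
idx <- which(upper.tri(corr_mat), arr.ind = TRUE)
pair_tbl <- data.frame(
  var1 = rownames(corr_mat)[idx[, "row"]],
  var2 = colnames(corr_mat)[idx[, "col"]],
  rho  = corr_mat[idx]
)

# Sort by absolute correlation, strongest first.
pair_tbl <- pair_tbl[order(abs(pair_tbl$rho), decreasing = TRUE), ]
head(pair_tbl, 5)
```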

Manual inspection of a 25x25 matrix is time-consuming and uninformative. Thus, we need a better way to visualize the results.

library(corrplot)
corrplot(corr_mat)

Several options can make this plot nicer to look at and easier to read. Let’s try this one:

corrplot(corr_mat, title = "Correlation Plot", method = "square",
         outline = TRUE, addgrid.col = "darkgray", order = "hclust",
         mar = c(4, 0, 4, 0), addrect = 4, rect.col = "black", rect.lwd = 5,
         cl.pos = "b", tl.col = "indianred4", tl.cex = 1.5, cl.cex = 1.5)

Better, right? Let’s look at the options in more detail:

  1. method = to determine the shape of the correlation objects. Can take the values: “circle” (default), “square”, “ellipse”, “number”, “pie”, “shade” and “color”.
  2. outline = to draw the black outline of the correlation objects such as circles or squares.
  3. addgrid.col = to determine the color of the grid. The grid disappears if this is NA.
  4. order = the order of the columns. If not specified it is plotted as in the original matrix, but sometimes it is not so informative. Possible methods are: “AOE” (angular order of the eigenvectors), “FPC” (first principal component), “hclust”, “alphabet”. There is also hclust.method to determine the agglomeration method if the order is “hclust”.
  5. addrect = when the order is “hclust”, the number of rectangles drawn around the clusters of the hierarchical clustering. The rect.* arguments (here rect.col and rect.lwd) control the appearance of these rectangles.
  6. cl.* = the arguments for the color legend (here cl.pos and cl.cex).
  7. tl.* = the arguments for the text labels (here tl.col and tl.cex).
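To see which variables end up in each rectangle, the clustering can be reproduced by hand in base R. This is a sketch on random stand-in data, and it assumes that corrplot clusters on 1 - correlation with complete linkage (which I believe is its default); the names tree and groups are illustrative.

```r
set.seed(1)
mat <- matrix(rnorm(100 * 25), nrow = 100,
              dimnames = list(NULL, paste0("x_", 1:25)))  # stand-in data
corr_mat <- cor(mat, method = "spearman")

# Cluster on the dissimilarity 1 - correlation with complete linkage.
tree <- hclust(as.dist(1 - corr_mat), method = "complete")

# Cut the tree into 4 groups, mirroring addrect = 4.
groups <- cutree(tree, k = 4)

# Variable names per group, listed in dendrogram order.
split(names(groups)[tree$order], groups[tree$order])
```

Changing method here corresponds to changing hclust.method in the corrplot() call.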

Just to show a couple more options:

# colorRampPalette() ships with base R (grDevices), so RColorBrewer is not needed here
corrplot(corr_mat, method = "color", outline = TRUE, addgrid.col = "darkgray",
         order = "hclust", addrect = 4, rect.col = "black", rect.lwd = 5,
         cl.pos = "b", tl.col = "indianred4", tl.cex = 1.5, cl.cex = 1.5,
         addCoef.col = "white", number.digits = 2, number.cex = 0.75,
         col = colorRampPalette(c("darkred", "white", "midnightblue"))(100))

The other arguments used:

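  1. addCoef.col = the color of the correlation coefficients written on the plot; they are only added when this is set
  2. number.digits = the number of decimal digits shown for each coefficient
  3. number.cex = the text size of the coefficients
  4. col = the color spectrum used for the plot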
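The value passed to col is just a character vector of colors. A minimal sketch of what colorRampPalette() produces for the palette used above (the name pal is illustrative):

```r
# colorRampPalette() returns a function; calling it with 100 yields 100 hex colors.
pal <- colorRampPalette(c("darkred", "white", "midnightblue"))(100)

length(pal)
pal[c(1, 100)]   # the two end colors of the ramp
```

As far as I know, corrplot spreads these colors evenly over the range from -1 to 1, so the first color (dark red) marks strong negative correlations and the last one (midnight blue) strong positive ones.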

A few more arguments are useful for building more complicated plots. For example, suppose we have a second correlation matrix, corr_mat2, which gives the correlations between the same objects but is based on different variables.

We want to show corr_mat2 in the lower triangle of the plot while keeping corr_mat in the upper triangle:

ord <- hclust(1 - as.dist(corr_mat))$order
corrplot(corr_mat[ord, ord], outline = TRUE, addgrid.col = "darkgray",
         cl.pos = "r", tl.col = "indianred4", tl.cex = 1.5, cl.cex = 1.5,
         mar = c(4, 0, 4, 0), type = "upper", tl.pos = "tl", bg = "azure2")
corrplot(corr_mat2[ord, ord], outline = TRUE, addgrid.col = "darkgray",
         cl.pos = "r", tl.col = "indianred4", tl.cex = 1.5, cl.cex = 1.5,
         mar = c(4, 0, 4, 0), type = "lower", tl.pos = "tl", add = TRUE, bg = "azure")

Here, the order is determined by the hierarchical clustering of the first correlation matrix. We do not use the “hclust” option because each matrix would then be clustered on its own, so the order could differ between the two triangles. To make sure the order is the same in both plots, I determined it beforehand.
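The effect of fixing the order beforehand can be checked without plotting anything. The sketch below uses two random stand-in matrices; mat2 is a hypothetical second measure, and ord1, ord2 and aligned are illustrative names.

```r
set.seed(1)
vars <- paste0("x_", 1:25)
mat  <- matrix(rnorm(100 * 25), nrow = 100, dimnames = list(NULL, vars))
mat2 <- matrix(rnorm(100 * 25), nrow = 100, dimnames = list(NULL, vars))  # hypothetical second measure
corr_mat  <- cor(mat,  method = "spearman")
corr_mat2 <- cor(mat2, method = "spearman")

# Order each matrix by its own clustering.
ord1 <- hclust(as.dist(1 - corr_mat))$order
ord2 <- hclust(as.dist(1 - corr_mat2))$order

# Clustered separately, the label orders will generally differ ...
identical(vars[ord1], vars[ord2])

# ... whereas reusing ord1 for both keeps rows and columns aligned.
aligned <- identical(rownames(corr_mat[ord1, ord1]),
                     rownames(corr_mat2[ord1, ord1]))
aligned
```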

Using this method, we see that there is a strong correlation between some of the objects according to the first measure, whereas the correlation is generally low with respect to the second one.

If we try the same by ordering the matrices according to the second correlation matrix:

ord <- hclust(1 - as.dist(corr_mat2))$order
corrplot(corr_mat[ord, ord], outline = TRUE, addgrid.col = "darkgray",
         cl.pos = "r", tl.col = "indianred4", tl.cex = 1.5, cl.cex = 1.5,
         mar = c(4, 0, 4, 0), type = "upper", tl.pos = "tl", bg = "azure2")
corrplot(corr_mat2[ord, ord], outline = TRUE, addgrid.col = "darkgray",
         cl.pos = "r", tl.col = "indianred4", tl.cex = 1.5, cl.cex = 1.5,
         mar = c(4, 0, 4, 0), type = "lower", tl.pos = "tl", add = TRUE, bg = "azure")

Apparently, there is no clear clustering based on the second measure.

Note that I used a different background for each of the two measures. However, I don’t think this is good practice, as it may change how the colors of the circles are perceived. Nevertheless, it might be a useful option in some situations.