Corrplot
Code-Through

Introduction

For this code-through, we will explore the R package corrplot. With this package, we can visualize the correlations in our datasets and perhaps uncover patterns that we might not have seen otherwise. We used some of the package’s features in our labs, but it has other options that are worth exploring.

First, let’s load the required packages and the dataset we will be using. For visualization purposes, we will use the built-in dataset swiss, which has six socio-economic variables for 47 French-speaking provinces of Switzerland, collected around 1888. This dataset is convenient because every variable is already on the same scale (a percentage, 0-100), so we can focus on the visualization.

library(corrplot) #our learning package
library(dplyr) #for wrangling data if necessary
library(flextable) #for nice tables!
library(rstatix) #for statistical calculations

data(swiss)

swiss %>%
    head() %>%
    t() %>% 
    as.data.frame() %>% 
    tibble::rownames_to_column("Province:") %>% #add_rownames() is deprecated in dplyr
    flextable()

Province:	Courtelary	Delemont	Franches-Mnt	Moutier	Neuveville	Porrentruy
Fertility	80.20	83.10	92.5	85.80	76.90	76.10
Agriculture	17.00	45.10	39.7	36.50	43.50	35.30
Examination	15.00	6.00	5.0	12.00	17.00	9.00
Education	12.00	9.00	5.0	7.00	15.00	7.00
Catholic	9.96	84.84	93.4	33.77	5.16	90.57
Infant.Mortality	22.20	22.20	20.2	20.30	20.60	26.60
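
As a quick check that the variables really do share a scale, here is a minimal sketch using base R only. The object name var_ranges is just an illustrative choice, not something from the labs.

```r
data(swiss)

# Minimum and maximum of every variable, one column per variable
var_ranges <- sapply(swiss, range)
rownames(var_ranges) <- c("min", "max")
print(var_ranges)

# TRUE if every value lies between 0 and 100
all(var_ranges["min", ] >= 0 & var_ranges["max", ] <= 100)
```

If the variables were on very different scales, the correlations themselves would not change (correlation is scale-free), but tables of raw values like the one above would be harder to read side by side.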

A simple way to get started is to make a correlation matrix from the data:

S <- cor(swiss)
S2 <- as.data.frame(S) %>%
    tibble::rownames_to_column("Variable") #flextable drops row names, so keep them as a column
flextable(S2)

Variable	Fertility	Agriculture	Examination	Education	Catholic	Infant.Mortality
Fertility	1.0000000	0.35307918	-0.6458827	-0.66378886	0.4636847	0.41655603
Agriculture	0.3530792	1.00000000	-0.6865422	-0.63952252	0.4010951	-0.06085861
Examination	-0.6458827	-0.68654221	1.0000000	0.69841530	-0.5727418	-0.11402160
Education	-0.6637889	-0.63952252	0.6984153	1.00000000	-0.1538589	-0.09932185
Catholic	0.4636847	0.40109505	-0.5727418	-0.15385892	1.0000000	0.17549591
Infant.Mortality	0.4165560	-0.06085861	-0.1140216	-0.09932185	0.1754959	1.00000000

Here we can see the correlations between each pair of our six variables. While it is possible to analyze the data in this form, a table with more than six variables would quickly become unmanageable. Moreover, for readers without statistical training, it can be difficult to see what the numbers mean. corrplot can help in both cases, so let’s explore its visualization options.
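
To illustrate how tedious the table gets, the sketch below pulls each pair out of the matrix once and sorts the pairs by the strength of the relationship. It uses base R only; the names idx and pairs_ranked are illustrative.

```r
S <- cor(swiss)

# Positions of the cells above the diagonal, so each pair appears once
idx <- which(upper.tri(S), arr.ind = TRUE)

pairs_ranked <- data.frame(
  var1 = rownames(S)[idx[, "row"]],
  var2 = colnames(S)[idx[, "col"]],
  r    = S[idx]
)

# Strongest relationships first, regardless of sign
pairs_ranked <- pairs_ranked[order(-abs(pairs_ranked$r)), ]
head(pairs_ranked, 5)
```

Six variables already give 15 pairs to read through, which is the kind of work a plot can do for us at a glance.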

Corrplot: Methods

By default, corrplot shows positive correlations in shades of blue and negative correlations in shades of red. The more intense the color, the stronger the correlation between the variables. The package offers seven visualization methods, chosen with the argument ‘method = “method choice”’, which are explored here:

First up, the default option, circles:

corrplot(S, method = "circle")

If you prefer squares, they are colored and sized just like the circles to indicate the strength of the correlation:

corrplot(S, method = "square")

Others may prefer ellipses, where the strength and direction of the relationship are shown through the shape, color, and orientation of the ellipse (a narrower ellipse means a stronger correlation, and the tilt shows its sign):

corrplot(S, method = "ellipse")

In instances where the exact correlation strength is important, consider this numerical representation of the data:

corrplot(S, method = "number")

If you prefer full color, then perhaps shade or color is a better option. Shade stripes the negative correlations to make them easier to tell apart:

corrplot(S, method = "shade")

Or just go all in on color!

corrplot(S, method = "color")

For a more playful presentation, corrplot can draw a small pie chart in each cell, where the filled portion of the pie shows the strength of the relationship:

corrplot(S, method = "pie")
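
To compare the seven methods side by side, one possible approach is to loop over the method names and draw them in a grid. This is only a sketch: the 2 x 4 layout, the text size, and the names methods and old_par are my own choices.

```r
library(corrplot)

S <- cor(swiss)
methods <- c("circle", "square", "ellipse", "number", "shade", "color", "pie")

# Split the plotting device into a 2 x 4 grid and remember the old settings
old_par <- par(mfrow = c(2, 4))

for (m in methods) {
  corrplot(S, method = m,
           title = m,            # panel title = method name
           mar = c(0, 0, 2, 0),  # leave room above the plot for the title
           tl.cex = 0.6,         # smaller variable labels
           cl.pos = "n")         # hide the color legend to save space
}

par(old_par)  # restore the original layout
```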

Corrplot: Types

Besides different methods of presenting the data, corrplot also offers three layout types, which can simplify the presentation even further. These are set with the argument ‘type = “type choice”’. With this argument, you can choose to show the full matrix, only the upper triangle, or only the lower triangle. Because a correlation matrix is symmetric, each triangle on its own already contains every pairwise correlation.

corrplot(S, type="full") #this is the default

corrplot(S, type="upper")

corrplot(S, type="lower")

In many cases, “lower”, indicating the lower triangle of correlations, might be the most natural way to present the matrix. However, depending on the type of presentation, this might vary!
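
If you would rather not choose between the triangles, the package also includes corrplot.mixed(), which draws one method in the lower triangle and another in the upper triangle. A minimal sketch, with my own choice of the two methods:

```r
library(corrplot)

S <- cor(swiss)

# Exact coefficients below the diagonal, circles above it;
# the variable names are written on the diagonal by default
corrplot.mixed(S, lower = "number", upper = "circle")
```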

Corrplot: hclust

Another option corrplot offers is reordering the variables so that similar ones sit next to each other. Setting order = "hclust" does this with hierarchical clustering (hclust is short for -H-ierarchical -Clust-ering). For CPP529 students, we used this option in Lab 4 for our correlation plots!

corrplot(S, type = "lower", order = "hclust")
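
To see what the reordering does, the sketch below asks corrplot for the ordering itself with corrMatOrder() and then outlines the clusters on the full matrix with addrect. Splitting the variables into two clusters is an arbitrary choice for illustration.

```r
library(corrplot)

S <- cor(swiss)

# The permutation that order = "hclust" applies to the rows and columns
ord <- corrMatOrder(S, order = "hclust")
colnames(S)[ord]

# Same ordering, with rectangles drawn around 2 clusters of variables
corrplot(S, order = "hclust", addrect = 2)
```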

Corrplot: Aesthetics

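There are a couple of changes you can make to your correlation plot to tailor it to your dataset! First, let’s create the set of colors we want. We are using the colorRampPalette function to build a spectrum between the colors we name, and the number in the second set of parentheses sets how many colors the spectrum contains. corrplot spreads these colors over the range -1 to 1, so the first color (purple) marks the most negative correlations and the last color (blue) the most positive.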

colpractice<- colorRampPalette(c("purple", "white", "blue"))(10)

Now, let’s create our plot again! This time, add col = colpractice (without quotation marks, since it is an object and not a color name) to integrate our spectrum into our plot!

corrplot(S, type = "lower", col = colpractice)
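
It can help to look at the palette before handing it to corrplot. The sketch below prints the two ends of the spectrum and previews all ten colors as a strip. The 200-color version, colsmooth, is a hypothetical variation for a smoother gradient.

```r
# Same palette as above: 10 colors running from purple through white to blue
colpractice <- colorRampPalette(c("purple", "white", "blue"))(10)

length(colpractice)     # how many colors we asked for
colpractice[c(1, 10)]   # hex codes of the first and last colors

# Preview the spectrum as a strip of colored bars
barplot(rep(1, 10), col = colpractice, border = NA, axes = FALSE)

# More steps give a smoother transition between the colors
colsmooth <- colorRampPalette(c("purple", "white", "blue"))(200)
```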

You can also change the background color!

corrplot(S, bg="gray", type= "lower", col = colpractice)

Finally, you can also change the color of the text labels (tl.col), rotate them (tl.srt, in degrees), or drop the diagonal (diag = FALSE), where each variable is perfectly correlated with itself.

corrplot(S, tl.col="dark green", tl.srt=45, diag = FALSE)

An Extra Corrplot Function

For a more informative plot, you can print the correlation coefficient inside each cell with addCoef.col, which also sets the color of the numbers.

corrplot(S, method="square", addCoef.col = 'black',tl.col="black", diag = FALSE)
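
When the printed coefficients crowd the cells, their size and rounding can be adjusted. A small sketch, where the particular values (0.8 and 2 digits) are only suggestions:

```r
library(corrplot)

S <- cor(swiss)

corrplot(S, method = "square", type = "lower", diag = FALSE,
         addCoef.col = "black",  # print coefficients in black
         number.cex = 0.8,       # shrink the numbers slightly
         number.digits = 2,      # round to two decimal places
         tl.col = "black")
```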

Corrplot: Stats!

One final feature of interest lets us look at the significance of our correlations. The corrplot package provides cor.mtest() for this (the rstatix package loaded above has a similar helper, cor_pmat()). Remember that a significance test asks how surprising our result would be if there were no relationship at all: the p-value is the probability of seeing a correlation at least this strong purely by chance when the true correlation is zero. Most studies use a 0.05 significance level, which corresponds to a 95% confidence level; if the p-value falls below 0.05, we call the correlation statistically significant.
With this, we can not only compute the Pearson correlation coefficient and its p-value for each pair of variables, but also display the p-values on our plot! Note that cor.mtest() needs the original data (swiss), not the correlation matrix S, because the test depends on the number of observations.

s3 <- cor.mtest(swiss, conf.level = 0.95) #cor.mtest() comes from corrplot; it runs cor.test() on every pair of
                                          #columns and returns matrices of p-values and 95% confidence bounds
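
To see what cor.mtest() returns, the sketch below runs it on the raw swiss data and compares one cell with a single call to base R’s cor.test(). The names res, i, j, and single are illustrative.

```r
library(corrplot)

# One test per pair of columns in the raw data
res <- cor.mtest(swiss, conf.level = 0.95)
names(res)  # the pieces of the result: p-values and confidence bounds

# Pick one pair by position and test it directly with base R
i <- match("Examination", names(swiss))
j <- match("Education", names(swiss))
single <- cor.test(swiss$Examination, swiss$Education, conf.level = 0.95)

# Both comparisons should print TRUE
all.equal(res$p[i, j], single$p.value)
all.equal(c(res$lowCI[i, j], res$uppCI[i, j]), as.numeric(single$conf.int))
```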

corrplot(S, p.mat = s3$p, #correlation matrix, plus the matrix of p-values computed above
         method = 'circle', 
         type = 'lower',
         insig = 'p-value', #print the p-value on every cell that counts as insignificant
         sig.level = -1, #no p-value can be below -1, so every cell counts as insignificant and shows its p-value
         order = 'hclust', #reorder the variables by hierarchical clustering
         diag = FALSE)

You also have the option to cross out insignificant values, so that the focus is only on significant relationships!

corrplot(S, p.mat = s3$p, 
         method = 'circle', 
         type = 'lower',
         #with no "insig" argument, insignificant cells (p-value above sig.level, 0.05 by default) are crossed out
         order = 'hclust', 
         diag = FALSE)
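
A third option, following the package vignette, marks the significant cells with stars instead of crossing out the insignificant ones. In this sketch, p_res is an illustrative name, and the three cut-offs are the conventional ones rather than anything specific to this dataset.

```r
library(corrplot)

S <- cor(swiss)
p_res <- cor.mtest(swiss, conf.level = 0.95)  # p-values from the raw data

corrplot(S, p.mat = p_res$p,
         method = "color", type = "lower", diag = FALSE,
         insig = "label_sig",               # label cells by significance level
         sig.level = c(0.001, 0.01, 0.05),  # three stars, two stars, one star
         pch.cex = 0.9, pch.col = "grey20") # size and color of the stars
```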

To wrap up the statistical portion, we can also use corrplot to visualize confidence intervals:

corrplot(S, 
         lowCI.mat = s3$lowCI, #lower bounds of the confidence intervals from s3
         uppCI.mat = s3$uppCI, #upper bounds of the confidence intervals from s3
         order = 'hclust',
         tl.pos = 'd', tl.col="black", tl.cex = 0.6, #text labels: position ('d' = on the diagonal), color,
                                                     #and size
         plotCI = 'rect', cl.pos = 'n', #draw each interval as a rectangle; 'n' means no color legend
         rect.col = 'navy' #border color of cluster rectangles; only has an effect if addrect is also set
         )
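
The same intervals can be read as numbers. The sketch below collects the coefficient and its 95% bounds for each pair and flags the intervals that do not include zero. The table name ci_table and its column names are illustrative.

```r
library(corrplot)

S <- cor(swiss)
res <- cor.mtest(swiss, conf.level = 0.95)

# One row per pair of variables (cells above the diagonal)
idx <- which(upper.tri(S), arr.ind = TRUE)

ci_table <- data.frame(
  var1  = rownames(S)[idx[, "row"]],
  var2  = colnames(S)[idx[, "col"]],
  r     = S[idx],
  lower = res$lowCI[idx],
  upper = res$uppCI[idx]
)

# An interval that excludes zero points to a relationship with a clear direction
ci_table$excludes_zero <- ci_table$lower > 0 | ci_table$upper < 0

ci_table[order(-abs(ci_table$r)), ]
```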

Discussion and Conclusion

Before this code-through ends, I want to briefly look at the significant correlations. Examination correlates most strongly with Education (r = 0.70), which makes sense: Examination is the percentage of draftees who received the highest mark on the army examination, and Education is the percentage of draftees educated beyond primary school. Examination is negatively related to Catholic, Fertility, and Agriculture, and Education shows a similar negative relationship with Agriculture and Fertility. While the relationships between these variables were hard to see in the raw table at the beginning of this code-through, the different corrplot presentations let us discern them in a variety of ways.

In conclusion, corrplot offers many ways to visualize correlations and to communicate them to a wider audience. Its different methods, types, color options, and statistical overlays (coefficients, p-values, and confidence intervals) allow informative communication for a variety of needs.

References:

Kassambara, A. (n.d.). Compute Correlation Matrix with P-values—Cor_mat. Retrieved April 28, 2022, from https://rpkgs.datanovia.com/rstatix/reference/cor_mat.html

Visualize correlation matrix using correlogram. (n.d.). STHDA. Retrieved April 7, 2022, from http://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram

Wei, T., & Simko, V. (2021, November 28). An Introduction to corrplot package. https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
