Corrplot
Code-Through
Introduction
For this code-through, we will be exploring the R package corrplot. With this package, we can visualize the correlations in our datasets and perhaps uncover patterns that we might not have seen otherwise. Our labs covered some of the package's features, but it has others that are worth exploring.
First, let's load the required packages and the dataset we will be using. For visualization purposes, we will use the built-in swiss dataset, which has six variables measured for 47 provinces of Switzerland around 1888. This dataset is nice to use because all variables are already on the same scale (i.e. 0-100), so the focus can stay on our visualization.
library(corrplot) #our learning package
library(dplyr) #for wrangling data if necessary
library(flextable) #for nice tables!
library(rstatix) #for statistical calculations
data(swiss)
swiss %>%
head() %>%
t() %>%
as.data.frame() %>%
tibble::rownames_to_column("Province:") %>% #replaces the deprecated add_rownames()
flextable()
Province: | Courtelary | Delemont | Franches-Mnt | Moutier | Neuveville | Porrentruy |
Fertility | 80.20 | 83.10 | 92.5 | 85.80 | 76.90 | 76.10 |
Agriculture | 17.00 | 45.10 | 39.7 | 36.50 | 43.50 | 35.30 |
Examination | 15.00 | 6.00 | 5.0 | 12.00 | 17.00 | 9.00 |
Education | 12.00 | 9.00 | 5.0 | 7.00 | 15.00 | 7.00 |
Catholic | 9.96 | 84.84 | 93.4 | 33.77 | 5.16 | 90.57 |
Infant.Mortality | 22.20 | 22.20 | 20.2 | 20.30 | 20.60 | 26.60 |
A simple way to get started is to make a correlation matrix from the data:
S <- cor(swiss)
S2 <- as.data.frame(S)
flextable(S2)
Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality |
1.0000000 | 0.35307918 | -0.6458827 | -0.66378886 | 0.4636847 | 0.41655603 |
0.3530792 | 1.00000000 | -0.6865422 | -0.63952252 | 0.4010951 | -0.06085861 |
-0.6458827 | -0.68654221 | 1.0000000 | 0.69841530 | -0.5727418 | -0.11402160 |
-0.6637889 | -0.63952252 | 0.6984153 | 1.00000000 | -0.1538589 | -0.09932185 |
0.4636847 | 0.40109505 | -0.5727418 | -0.15385892 | 1.0000000 | 0.17549591 |
0.4165560 | -0.06085861 | -0.1140216 | -0.09932185 | 0.1754959 | 1.00000000 |
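The table above has no row labels, but because a correlation matrix is symmetric, its rows follow the same order as its columns. A wide matrix like this can also be hard to scan, so here is a minimal base R sketch that reshapes it into one row per variable pair and sorts the pairs by strength. The names upper and pair_tbl are only for illustration and are not part of any package.

```r
# Correlation matrix of the built-in swiss data (same object as S above)
S <- cor(swiss)

# Keep each pair only once by blanking the diagonal and the lower triangle
upper <- S
upper[lower.tri(upper, diag = TRUE)] <- NA

# Reshape to long form: one row per variable pair
pair_tbl <- as.data.frame(as.table(upper), stringsAsFactors = FALSE)
names(pair_tbl) <- c("var1", "var2", "r")
pair_tbl <- pair_tbl[!is.na(pair_tbl$r), ]

# Sort by absolute strength, strongest first
pair_tbl <- pair_tbl[order(-abs(pair_tbl$r)), ]
head(pair_tbl, 3)
```

This is the same information the plots below encode with color and size, just in list form.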
Corrplot: Methods
Corrplot shows positive correlations in shades of blue and negative correlations in shades of red. The more intense the color, the stronger the correlation between the variables. The package offers seven different visualization methods, selected with the argument method = "method choice", which are explored here:
First up, the default option, circles:
corrplot(S, method = "circle")
Perhaps squares are preferred, where just like circles, they are colored and sized to indicate the strength of the correlation:
corrplot(S, method = "square")
Others may prefer ellipses, where strength & direction of the relationship are demonstrated through the size, color, and orientation of the ellipse:
corrplot(S, method = "ellipse")
In instances where the exact correlation strength is important, consider this numerical representation of the data:
corrplot(S, method = "number")
If you prefer full color, then perhaps shade or color is a better option. Shade adds stripes to the negative correlations to make them easier to tell apart:
corrplot(S, method = "shade")
Or just go all in on color!
corrplot(S, method = "color")
For a more playful way to present the data, corrplot lets you use pie charts within your correlation matrix, where the filled portion of each pie shows the strength of the relationship.
corrplot(S, method = "pie")
Corrplot: Types
Besides different methods of presenting the data, corrplot also offers three layout types, which is convenient for simplifying the presentation even further. These are selected with the argument type = "type choice". With this argument, you can choose to show the full matrix, only the upper triangle, or only the lower triangle.
corrplot(S, type = "full") #this is the default
corrplot(S, type = "upper")
corrplot(S, type = "lower")
In many cases, "lower", indicating the lower triangle of correlations, might be the most natural way to present the matrix. However, this might vary depending on the type of presentation!
Corrplot: hclust
Another feature corrplot offers is the ability to reorder the variables hierarchically, selected with order = "hclust" (short for hierarchical clustering). This places variables with similar correlation patterns next to each other. For CPP529 students, we used this option in Lab 4 for our correlation plots!
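To get a feel for what this reordering does, here is a minimal base R sketch. It assumes the ordering comes from complete-linkage clustering on the distance 1 - r, which is my understanding of the package's default and may not match its internals exactly. The names d, hc, ord, and S_reordered are only for illustration.

```r
# Correlation matrix of the built-in swiss data (same object as S above)
S <- cor(swiss)

# Treat 1 - r as a distance: strongly positively correlated variables are "close"
d <- as.dist(1 - S)

# Cluster the variables (complete linkage is an assumption here)
hc <- hclust(d, method = "complete")

# hc$order is the left-to-right order of the variables in the dendrogram
ord <- hc$order
colnames(S)[ord]

# Apply the same permutation to rows and columns to get a reordered matrix
S_reordered <- S[ord, ord]
```

The plot call below lets corrplot handle the reordering for us: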
corrplot(S, type = "lower", order = "hclust")
Corrplot: Aesthetics
There are a couple of changes you can make to your correlation plot to make it more relevant to your dataset! First, let's create the set of colors we want. We are using the colorRampPalette function to create a spectrum, and the number in the second set of parentheses indicates how many colors we want in our spectrum.
colpractice<- colorRampPalette(c("purple", "white", "blue"))(10)
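As a quick check of what this call returns, the sketch below inspects the palette and builds a finer one. The name colsmooth and the choice of 200 steps are only illustrative assumptions, not something the package requires.

```r
# colorRampPalette() returns a function; calling it with n gives n hex colors
colpractice <- colorRampPalette(c("purple", "white", "blue"))(10)

length(colpractice)   # number of colors in the spectrum
colpractice           # hex codes running from purple through white to blue

# More steps give a smoother gradient between the same three anchor colors
colsmooth <- colorRampPalette(c("purple", "white", "blue"))(200)
length(colsmooth)
```

Since the vector runs from purple to blue, my understanding is that corrplot maps the first colors to the most negative correlations and the last colors to the most positive ones.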
Now, let's create our plot again! This time, add col = colpractice (without quotes, since it is an object and not a string) to integrate our spectrum into our plot!
corrplot(S, type = "lower", col = colpractice)
You can also change the background color!
corrplot(S, bg="gray", type= "lower", col = colpractice)
Finally, you can also change the label color (tl.col), rotate the labels (tl.srt), or drop the diagonal, where every variable correlates perfectly with itself (diag = FALSE).
corrplot(S, tl.col = "darkgreen", tl.srt = 45, diag = FALSE)
An Extra Corrplot Function
For a more informative plot, you can print the correlation coefficient inside each cell with addCoef.col, which also sets the color of the numbers.
corrplot(S, method = "square", addCoef.col = 'black', tl.col = "black", diag = FALSE)
Corrplot: Stats!
One final feature of interest lets us look at the significance of our correlations. The test used below, cor.mtest, ships with corrplot itself; rstatix offers comparable helpers such as cor_pmat. Remember that a significance test asks how likely it is that we would see a correlation at least this strong purely by chance if the two variables were actually unrelated. That probability is the p-value. If the p-value is below our significance level, we call the correlation statistically significant. Most studies use a 0.05 significance level, which corresponds to a confidence level of 95%.
We can not only calculate a p-value for each Pearson correlation coefficient and compare it to our significance level, but also have the result displayed on our plot!
s3 = cor.mtest(swiss, conf.level = 0.95) #cor.mtest (from corrplot) takes the raw data, not the correlation matrix S,
#and returns matrices of p-values and 95% confidence bounds
corrplot(S, p.mat = s3$p, #correlation matrix, with the p-values from s3 passed in through p.mat
         method = 'circle',
         type = 'lower',
         insig = 'p-value', #print the p-value on every cell that is treated as insignificant
         sig.level = -1, #no p-value is below -1, so every cell is treated as insignificant and shows its p-value
         order = 'hclust', #how we cluster our data
         diag = FALSE)
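To see where a single p-value comes from, here is a minimal sketch that tests two pairs by hand with base R's cor.test. The pairs and the names strong and weak are chosen only for illustration.

```r
# One strongly correlated pair and one weakly correlated pair from swiss
strong <- cor.test(swiss$Fertility, swiss$Education)
weak <- cor.test(swiss$Agriculture, swiss$Infant.Mortality)

# Pearson correlation, p-value, and 95% confidence interval for the strong pair
strong$estimate
strong$p.value
strong$conf.int

# Compare each p-value to the 0.05 significance level
strong$p.value < 0.05
weak$p.value < 0.05
```

The first comparison comes out TRUE and the second FALSE, so only the Fertility and Education correlation counts as significant at the 0.05 level.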
You also have the option to cross out insignificant values, so that the focus is only on significant relationships!
corrplot(S, p.mat = s3$p,
method = 'circle',
type = 'lower',
         #with no "insig" argument, cells with a p-value above the default 0.05 level are crossed out
order = 'hclust',
diag = FALSE)
To wrap up the statistical portion, we can also use corrplot to visualize confidence intervals:
corrplot(S,
         lowCI = s3$lowCI, #pulling the lower end of the confidence interval from s3
         uppCI = s3$uppCI, #pulling the upper end of the confidence interval from s3
         order = 'hclust',
         tl.pos = 'd', tl.col = "black", tl.cex = 0.6, #label position ('d' puts labels on the diagonal), label color,
         #and label size
         rect.col = 'navy', plotC = 'rect', cl.pos = 'n' #color of the CI border, shape of the CI (rectangle), and
         #position of the color legend ('n' means no legend is drawn)
)
Discussion and Conclusion
Before this code-through ends, I want to briefly look at the (significant) correlations. Examination correlates most strongly with Education (r of about 0.70), which makes sense: Examination is the percentage of draftees who received the highest mark on the army examination, and Education is the percentage of draftees educated beyond primary school. Examination is negatively related to Catholic, Fertility, and Agriculture. Similarly, Education is negatively related to Agriculture and Fertility. While the relationships between these variables were unclear at the beginning of this code-through, presenting the data in different ways with the corrplot package let us see them from several angles.
References:
Kassambara, A. (n.d.). Compute Correlation Matrix with P-values—Cor_mat. Retrieved April 28, 2022, from https://rpkgs.datanovia.com/rstatix/reference/cor_mat.html
Visualize correlation matrix using correlogram. (n.d.). STHDA. Retrieved April 7, 2022, from http://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram
Wei, T., & Simko, V. (2021, November 28). An Introduction to corrplot package. https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html