Correlation is simply the mutual relation of two or more things: the degree to which two or more attributes show a tendency to vary together. In general, it is a good starting point to guide a researcher toward where to take a closer look. However, while correlation appears to be very simple, there are particular steps one should take to ensure it is handled properly.
This tutorial introduces the concept of correlation for use in R. See below for the list of packages required to follow along and replicate the analysis. Once they are loaded, each section provides a definition of a component of correlation analysis along with examples and/or applications.
This tutorial leverages the following packages:
library(rmarkdown)  # document rendering
library(Hmisc)      # rcorr() for correlation matrices with p-values
library(plyr)       # ddply() for grouped summaries
library(dplyr)      # data manipulation (filter, left_join, etc.)
library(corrplot)   # correlation matrix visualizations
library(knitr)      # report generation
library(ggplot2)    # plotting
library(ppcor)      # partial and semi-partial correlations
library(car)        # scatterplot() with grouping
library(gridExtra)  # arranging multiple plots
library(tibble)     # rownames_to_column()
library(tidyr)      # gather() for reshaping data
data(mtcars)        # built-in Motor Trend car data
data(anscombe)      # built-in Anscombe's quartet data
You will use these packages and datasets to replicate the associated analysis and for additional practice to master the application. Some information in this tutorial is drawn from the AFIT Data Science Lab R Programming Guide; this data, along with additional information and resources, can be accessed here: AFIT Data Science Lab
Correlation measures the extent to which two variables are related to one another. It is a single-number measure of the relationship between two variables. This relationship can be a useful measure, but it may be difficult to picture exactly what the association is between two variables based on a single statistic. If you have taken a statistics course, you have likely heard of 'spurious correlations': scenarios in which two variables appear to be correlated even though no meaningful relationship connects them.
Figure 1. Spelling Bee and Venomous Spider Bites
As Figure 1 shows, just because two series can be cleverly graphed to imply a relationship does not mean the association is meaningful. The challenge is in verifying the association beyond a simple chart.
To verify association, begin with the correlation coefficient. Its value varies between +1 and -1 (near +1 implies a strong positive association; near -1 implies a strong negative association). The coefficient is simply a numerical representation of the strength and direction of the relationship. As the coefficient nears 0, the relationship between the two variables weakens, with a value near 0 implying no association at all. Using the previously loaded dataset "mtcars", we find the correlation coefficient between hp and mpg (horsepower and miles-per-gallon).
with(mtcars, cor(hp, mpg))
## [1] -0.7761684
Based on the coefficient, we may conclude the variables are negatively related: as a car's horsepower increases, its miles-per-gallon decreases.
At this point, you might want to plot your data. However, I would recommend further verifying the correlation before plotting. The cor() function we just used returns only the correlation coefficient, but we can check significance more adequately with the cor.test() function, which returns both the coefficient and the significance level (recall that a p-value below .05 is typically considered significant).
You can specify which method to use: Pearson, Kendall, or Spearman. cor.test() defaults to Pearson, as it is the most commonly used, but you may want to look at the application of Kendall or Spearman for your specific research.
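Before committing to one method, it can help to compare all three coefficients at a glance. Here is a minimal sketch using the same two variables:

# Compare all three correlation methods side by side
sapply(c("pearson", "spearman", "kendall"),
       function(m) cor(mtcars$hp, mtcars$mpg, method = m))

Starting with the default Pearson method: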
with(mtcars, cor.test(hp, mpg, method = "pearson"))
##
## Pearson's product-moment correlation
##
## data: hp and mpg
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8852686 -0.5860994
## sample estimates:
## cor
## -0.7761684
hist(mtcars$hp)
hist(mtcars$mpg)
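If you want a more formal check than eyeballing the histograms, the Shapiro-Wilk test from base R is one option:

# Formal normality checks to back up the histograms;
# small p-values indicate departure from normality
shapiro.test(mtcars$hp)
shapiro.test(mtcars$mpg)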
Even though the Pearson correlation coefficient appears to be significant, the histograms show that neither variable is normally distributed, so the variables should not be evaluated with the Pearson formula. Spearman's rank correlation, which assumes only a monotonic relationship rather than normality, is a better fit here (Figure 2).
Figure 2. Monotonic and Non-Monotonic correlations
with(mtcars, cor.test(hp, mpg, method = "spearman"))
##
## Spearman's rank correlation rho
##
## data: hp and mpg
## S = 10337, p-value = 5.086e-12
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.8946646
Imagine two interviewers evaluate the same candidates and rank them (Figure 3).
Figure 3. Kendall correlation example
Using Kendall, we first count the concordant pairs: for each rank in the second interviewer's column, how many ranks below it are larger. For example, the first rank in the second interviewer's column is a "1", so all 11 ranks below it are larger. Then we count the discordant pairs: for each rank, how many ranks below it are smaller. Finally, apply Kendall's formula, Tau = (C - D) / (C + D) = (61 - 5) / (61 + 5) = 56 / 66 ≈ .85. In this example, the correlation coefficient of .85 indicates a strong positive relationship.
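To make the counting concrete, here is a minimal sketch that brute-forces the concordant and discordant pairs. The rank_2 values are hypothetical, chosen only to reproduce the C = 61 and D = 5 counts above; Figure 3's actual ranks may differ.

# Hypothetical ranks from two interviewers for 12 candidates
rank_1 <- 1:12
rank_2 <- c(2, 1, 4, 3, 6, 5, 7, 9, 8, 11, 10, 12)

# Enumerate every pair of candidates; a pair is concordant when both
# interviewers order it the same way, discordant when they disagree
idx <- combn(12, 2)
agreement <- sign(rank_1[idx[1, ]] - rank_1[idx[2, ]]) *
  sign(rank_2[idx[1, ]] - rank_2[idx[2, ]])
C <- sum(agreement > 0)  # concordant pairs: 61
D <- sum(agreement < 0)  # discordant pairs: 5
(C - D) / (C + D)        # Kendall's Tau: ~.85

# The built-in implementation agrees
cor(rank_1, rank_2, method = "kendall")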
with(mtcars, cor.test(hp, mpg, method = "kendall"))
##
## Kendall's rank correlation tau
##
## data: hp and mpg
## z = -5.871, p-value = 4.332e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.7428125
Which cor.test() method you select will depend on your research area and the outcome of the tests above. Based on what we know so far, we can proceed using Spearman's coefficient.
Now we can visualize the correlation using a scatterplot. The scatterplot provides a vivid illustration of the relationship between two variables and may convey more information than a single correlation coefficient, or simply be easier for the reader to understand. The graph is a good quick reference to help the researcher see the shape of the relationship, but it is not sufficient by itself to determine correlation.
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Figure 4: Strong Negative Association")
Some correlations will not be linear. The built-in dataset ‘anscombe’ shows four data sets that have the same mean, variance, and correlation; however, there are significant differences in the variable relationships.
sapply(1:4, function(x) cor(anscombe[, x], anscombe[, x+4]))
## [1] 0.8164205 0.8162365 0.8162867 0.8165214
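The quartet is easiest to appreciate visually. Here is a minimal sketch that reshapes anscombe into long form and plots each of the four x-y sets with its fitted line:

# Stack the four x-y sets into one long data frame
anscombe_long <- data.frame(
  set = rep(paste("Set", 1:4), each = nrow(anscombe)),
  x   = unlist(anscombe[, 1:4]),
  y   = unlist(anscombe[, 5:8])
)

ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ set) +
  ggtitle("Anscombe's Quartet: Same Correlation, Different Relationships")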
Each x-y combination in the plot has a correlation of about .82 (strong positive), but there are significant differences in the association between these variables. This serves as a reminder to verify your graph with additional testing. From here, the linearity (or non-linearity) of the relationship would direct your analytical approach.
(The next tutorial on Regression will cover some of these concepts in more depth.)
Now we can look at some functions which allow us to better manipulate our data.
- pairs() The pairs() function produces a scatterplot matrix and allows you to quickly scan your dataset for general correlations. From there, you can determine which variables are of further interest.
pairs(mtcars[c("mpg", "hp", "cyl", "disp", "carb")],
      main = "mtcars Pairs Matrix", pch = 21)
- rcorr() The rcorr() function (from Hmisc) computes Pearson or Spearman correlations and is efficient because it returns both the coefficients and the p-values in one call. First, we turn our data (mtcars) into a matrix.
cor_2 <- rcorr(as.matrix(mtcars))
Then we can pull out the relevant piece, such as the p-values. However, this may return too much or irrelevant information.
cor_2$P
## mpg cyl disp hp drat
## mpg NA 6.112697e-10 9.380354e-10 1.787838e-07 1.776241e-05
## cyl 6.112697e-10 NA 1.803002e-12 3.477856e-09 8.244635e-06
## disp 9.380354e-10 1.803002e-12 NA 7.142686e-08 5.282028e-06
## hp 1.787838e-07 3.477856e-09 7.142686e-08 NA 9.988768e-03
## drat 1.776241e-05 8.244635e-06 5.282028e-06 9.988768e-03 NA
## wt 1.293956e-10 1.217567e-07 1.222311e-11 4.145833e-05 4.784268e-06
## qsec 1.708199e-02 3.660527e-04 1.314403e-02 5.766250e-06 6.195823e-01
## vs 3.415940e-05 1.843015e-08 5.235010e-06 2.940897e-06 1.167552e-02
## am 2.850209e-04 2.151208e-03 3.662112e-04 1.798309e-01 4.726797e-06
## gear 5.400949e-03 4.173297e-03 9.635927e-04 4.930119e-01 8.360117e-06
## carb 1.084446e-03 1.942341e-03 2.526789e-02 7.827805e-07 6.211834e-01
## wt qsec vs am gear
## mpg 1.293956e-10 1.708199e-02 3.415940e-05 2.850209e-04 5.400949e-03
## cyl 1.217567e-07 3.660527e-04 1.843015e-08 2.151208e-03 4.173297e-03
## disp 1.222311e-11 1.314403e-02 5.235010e-06 3.662112e-04 9.635927e-04
## hp 4.145833e-05 5.766250e-06 2.940897e-06 1.798309e-01 4.930119e-01
## drat 4.784268e-06 6.195823e-01 1.167552e-02 4.726797e-06 8.360117e-06
## wt NA 3.388682e-01 9.798495e-04 1.125438e-05 4.586600e-04
## qsec 3.388682e-01 NA 1.029669e-06 2.056622e-01 2.425345e-01
## vs 9.798495e-04 1.029669e-06 NA 3.570440e-01 2.579439e-01
## am 1.125438e-05 2.056622e-01 3.570440e-01 NA 5.834051e-08
## gear 4.586600e-04 2.425345e-01 2.579439e-01 5.834051e-08 NA
## carb 1.463861e-02 4.536940e-05 6.670496e-04 7.544526e-01 1.290291e-01
## carb
## mpg 1.084446e-03
## cyl 1.942341e-03
## disp 2.526789e-02
## hp 7.827805e-07
## drat 6.211834e-01
## wt 1.463861e-02
## qsec 4.536940e-05
## vs 6.670496e-04
## am 7.544526e-01
## gear 1.290291e-01
## carb NA
To further increase efficiency, we could create a function that pulls the variables of interest related to our independent variable, along with the associated coefficients and p-values, all at once.
flat_cor_mat <- function(cor_r, cor_p) {
  # Flatten the coefficient matrix into (row, column, cor) form
  cor_r <- rownames_to_column(as.data.frame(cor_r), var = "row")
  cor_r <- gather(cor_r, column, cor, -1)
  # Flatten the p-value matrix the same way, then join the two
  cor_p <- rownames_to_column(as.data.frame(cor_p), var = "row")
  cor_p <- gather(cor_p, column, p, -1)
  cor_p_matrix <- left_join(cor_r, cor_p, by = c("row", "column"))
  cor_p_matrix
}
cor_3 <- rcorr(as.matrix(mtcars[, 1:7]))
my_cor_matrix <- flat_cor_mat(cor_3$r, cor_3$P)
head(my_cor_matrix)
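From there you can filter for whatever you care about. For example, a minimal sketch that pulls only the correlations involving mpg, sorted by strength:

# All correlations involving mpg, strongest first
my_cor_matrix %>%
  filter(row == "mpg", column != "mpg") %>%
  arrange(desc(abs(cor)))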
- ggplot() If you wanted to assess correlation across multiple levels, you could use ggplot() to see how they relate: for example, checking the correlation between hp and mpg by cyl. However, this may not be the clearest way to show the relationships.
ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(name = "HP") +
  scale_y_continuous(name = "MPG") +
  ggtitle("HP and MPG by Cylinder")
- ddply() Instead, you could use ddply() to assess the correlation between variables by category in table form; a dplyr equivalent is sketched just after.
ddply(mtcars, "cyl", summarize, corr=cor(mpg, hp))
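Since dplyr has largely superseded plyr, the same grouped summary can also be written as follows:

# dplyr equivalent of the ddply() call above
mtcars %>%
  group_by(cyl) %>%
  summarize(corr = cor(mpg, hp))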
- scatterplot() Or use scatterplot() from the car package to plot the same grouping on a graph.
scatterplot(mpg ~ hp | cyl, data = mtcars,
            xlab = "HP", ylab = "MPG",
            main = "HP and MPG by Cylinder",
            labels = row.names(mtcars))
Now that we have a good basis for how to test for correlation and plot the relationships, we can begin to try different ways to present the information. You may find that presenting the data in different or unique ways makes it easier for the viewer to understand, or even reveals some previously unknown characteristic. We will spend more time in class looking at some visualization options, but you can review this site for the Top 50 GGPlot visuals.
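For instance, the corrplot package we loaded earlier can render an entire correlation matrix as a single graphic; here is a minimal sketch:

# Visualize the full mtcars correlation matrix at once
corrplot(cor(mtcars), method = "circle", type = "upper")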
What does all this mean? Based on your findings, you may want to go back to your theory and hypotheses.
Is the relationship between the predictors and the outcome really linear? Have you checked using both Pearson and Spearman, given the shape of the relationship?
Or is there an important variable you left out of your model? Variables you didn't include (e.g., age or gender) may play an important role in your model and data. Or are there confounding variables affecting the outcome?
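The ppcor package we loaded can help probe the confounding question: a partial correlation measures the association between two variables while holding a third constant. A minimal sketch, controlling for weight (wt) in the hp-mpg relationship:

# Correlation of hp and mpg after partialling out wt
pcor.test(mtcars$hp, mtcars$mpg, mtcars$wt)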
In class, we will discuss how we can use correlation analysis to guide our early research. Then, we will apply these concepts to different datasets and look at some coding applications to make our analysis both easier and more effective. To finish, we will look at some visualization options to better share our findings.
Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’.
Randall Munroe