Overview

      Correlation is simply the mutual relation of two or more things: the degree to which two or more attributes show a tendency to vary together. In general, it is a good starting point to guide a researcher toward where to take a closer look. However, while correlation appears to be very simple, there are particular steps one should take to ensure it is handled properly.

      This tutorial aims to introduce the concept of correlation for use in R. See below for a list of the packages required to follow along and replicate the analysis. Once the packages are loaded, read each section for a definition of each component of correlation analysis, along with examples and applications.

Requirements

      This tutorial leverages the following packages:

library(rmarkdown)
library(Hmisc)
library(plyr)
library(dplyr)
library(corrplot)
library(knitr)
library(ggplot2)
library(ppcor)
library(car)
library(gridExtra)
library(tibble)
library(tidyr)
data(mtcars)
data(anscombe)
attach(mtcars)  # so columns such as hp and mpg can be referenced by name below

      You will use these packages and datasets to replicate the analysis and for additional practice to master the application. Some information in this tutorial is drawn from the AFIT Data Science Lab R Programming Guide; the data, along with additional information and resources, can be accessed here: AFIT Data Science Lab

Correlation

      Correlation measures the extent to which two variables are related to one another: a single-number summary of the relationship between them. This can be a useful measure, but it may be difficult to picture exactly what the association between two variables looks like based on a single statistic. If you have taken a statistics course, you are likely to have heard of ‘spurious correlations’: pairs of variables that appear to be correlated even though there is no meaningful connection between them.

Figure 1. Spelling Bee and Venomous Spider Bites

      As can be seen in Figure 1, just because two series can be cleverly graphed to imply a relationship does not mean the association is meaningful. The challenge is in verifying the association beyond a simple chart.

Correlation Coefficient

      To verify association, begin with the correlation coefficient. Its value varies between +1 and -1: a value near +1 implies a strong positive association, and a value near -1 implies a strong negative association. The coefficient is simply a numerical representation of the strength and direction of the relationship. As the correlation coefficient nears 0, the relationship between the two variables weakens, with a value near 0 implying no association between them. Using the previously loaded dataset “mtcars”, we find the correlation coefficient between hp and mpg (horsepower and miles-per-gallon).

cor(hp, mpg)
## [1] -0.7761684

      Based on the coefficient, we may conclude the variables are negatively related: as a car’s horsepower increases, its miles-per-gallon decreases.
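
      As a quick check on what cor() computes, Pearson’s r is just the covariance of the two variables scaled by the product of their standard deviations; the one-liner below reproduces the value above.

# Pearson's r by hand: covariance divided by the product of the standard
# deviations; this matches cor(hp, mpg) above (-0.7761684)
cov(hp, mpg) / (sd(hp) * sd(mpg))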

Significance

      At this point, you might want to plot your data. However, I would recommend verifying the correlation further before plotting. The cor() function we just used only returns the correlation coefficient, but we can check for significance more rigorously using the cor.test() function, which returns both the coefficient and the significance level (recall that a p-value below .05 is typically considered significant).

      You can specify which method to use: Pearson, Kendall, or Spearman. cor.test() defaults to Pearson, as it is the most commonly used, but Kendall or Spearman may be more appropriate for your specific research.

  • Pearson is the most widely used correlation statistic for measuring the degree of the relationship between linearly related variables. Its assumptions are that both variables are normally distributed (normally distributed variables have a bell-shaped curve), that the relationship is linear (a straight-line relationship between the variables in the analysis), and that the data show homoscedasticity (the data are spread with roughly constant variance about the regression line). We will look at the correlation coefficient and then check whether the data meet these requirements.
cor.test(hp, mpg, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  hp and mpg
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor 
## -0.7761684
hist(hp) 

hist(mpg)

      Even though the Pearson correlation coefficient appears to be significant, the variables are not normally distributed and should therefore not be evaluated with the Pearson formula.
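
      If you would like a formal check to accompany the histograms, base R’s Shapiro-Wilk test is one option; a small p-value suggests a departure from normality.

# Shapiro-Wilk normality test: a small p-value suggests the data
# are not normally distributed
shapiro.test(hp)
shapiro.test(mpg)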

  • Spearman rank correlation measures the degree of association between two variables. The Spearman rank correlation test does not make any assumptions about the distribution, but the data must be at least ordinal, and the scores on one variable must be monotonically related to the other variable. A monotonic relationship is one in which either (1) as the value of one variable increases, so does the value of the other variable; or (2) as the value of one variable increases, the value of the other variable decreases. Examples of monotonic and non-monotonic relationships are presented in Figure 2 below. Spearman’s correlation therefore determines the strength and direction of the monotonic relationship between your two variables, rather than the strength and direction of the linear relationship as Pearson does.
Figure 2. Monotonic and Non-Monotonic correlations

cor.test(hp, mpg, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  hp and mpg
## S = 10337, p-value = 5.086e-12
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.8946646
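
      A helpful way to see what Spearman measures: rho is simply Pearson’s r computed on the ranks of the data, which you can verify directly.

# Spearman's rho equals Pearson's r applied to the ranked data;
# this matches the rho reported above
cor(rank(hp), rank(mpg))
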
  • Kendall rank correlation measures the strength of dependence between two variables. The Kendall tau-b correlation coefficient measures association based on the number of concordances and discordances in paired observations. Kendall is very similar to Spearman, but because Kendall’s tau estimates a population parameter (rather than merely describing the sample), many statisticians prefer the Kendall tau-b to the Spearman rank correlation coefficient.

      Imagine two interviewers evaluate the same candidates and rank them (Figure 3).

Figure 3. Kendall correlation example

      Using Kendall, we count the concordant pairs: for each rank in the second interviewer’s column, how many of the ranks below it are larger. For example, the first rank in the second interviewer’s column is a “1”, so all 11 ranks below it are larger. Then we count the discordant pairs: for each rank, how many of the ranks below it are smaller. Finally, apply Kendall’s tau = (C - D) / (C + D) = (61 - 5) / (61 + 5) = 56 / 66 ≈ .85. In this example, the correlation coefficient of .85 indicates a strong positive relationship.
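
      You can replicate this style of calculation in R by running cor() on two rank vectors directly. The ranks below are hypothetical, for illustration only (they are not the data from Figure 3).

# Hypothetical interviewer rankings (illustrative only, not Figure 3's data)
rank_a <- 1:6
rank_b <- c(1, 3, 2, 4, 6, 5)
# 13 concordant and 2 discordant pairs: tau = (13 - 2) / (13 + 2) = .73
cor(rank_a, rank_b, method = "kendall")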

cor.test(hp, mpg, method = "kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  hp and mpg
## z = -5.871, p-value = 4.332e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.7428125

      Which cor.test() method you select will depend on your research area and the outcome of the tests above. Based on what we know so far, we can proceed using Spearman’s coefficient.

Scatterplot

      Now we can visualize the correlation using a scatterplot. The scatterplot provides a vivid illustration of the relationship between two variables and may convey more information than a single correlation coefficient, or may simply be easier for the reader to understand. The graph is a good quick reference to help the researcher gauge the relationship, but it is not sufficient by itself to determine correlation.

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Figure 4: Strong Negative Association")

Non-Linear

      Some correlations will not be linear. The built-in dataset ‘anscombe’ contains four data sets that have nearly identical means, variances, and correlations; however, there are significant differences in the variable relationships.

sapply(1:4, function(x) cor(anscombe[, x], anscombe[, x+4]))
## [1] 0.8164205 0.8162365 0.8162867 0.8165214
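
      The quartet is easiest to appreciate visually; below is a minimal sketch using base graphics (the panel titles are my own labels).

# Plot the four Anscombe x-y pairs side by side: identical correlations,
# very different relationships
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[, i], anscombe[, i + 4],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Anscombe set", i))
}
par(op)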

      Each x-y combination in the plot has a correlation of .82 (strong positive), but there are significant differences in the association between these variables. This serves as a reminder to verify your graph with additional testing. From here, the linearity (or non-linearity) of the relationship would direct your analytical approach.

      (The next tutorial on Regression will cover some of these concepts in more depth.)

Additional Functions

      Now we can look at some functions which allow us to better manipulate our data.

      - pairs() The pairs() function produces a scatterplot matrix and allows you to quickly scan your dataset for general correlations. From here, you can determine which variables are of further interest.

pairs(mtcars[c("mpg", "hp", "cyl", "disp", "carb")], main = "mtcars Pairs Matrix", pch=21)
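
      pairs() also accepts a formula interface, which some find easier to read; the call below produces the same matrix.

# Same matrix via the formula interface
pairs(~ mpg + hp + cyl + disp + carb, data = mtcars,
      main = "mtcars Pairs Matrix", pch = 21)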

      - rcorr() The rcorr() function (from the Hmisc package) computes Pearson and Spearman correlations and is efficient because it returns both the coefficients and the p-values at once. First, we turn our data (mtcars) into a matrix.

cor_2 <- rcorr(as.matrix(mtcars))

      Then we can pull out the relevant data, such as the p-values. However, this may return too much or irrelevant information.

cor_2$P
##               mpg          cyl         disp           hp         drat
## mpg            NA 6.112697e-10 9.380354e-10 1.787838e-07 1.776241e-05
## cyl  6.112697e-10           NA 1.803002e-12 3.477856e-09 8.244635e-06
## disp 9.380354e-10 1.803002e-12           NA 7.142686e-08 5.282028e-06
## hp   1.787838e-07 3.477856e-09 7.142686e-08           NA 9.988768e-03
## drat 1.776241e-05 8.244635e-06 5.282028e-06 9.988768e-03           NA
## wt   1.293956e-10 1.217567e-07 1.222311e-11 4.145833e-05 4.784268e-06
## qsec 1.708199e-02 3.660527e-04 1.314403e-02 5.766250e-06 6.195823e-01
## vs   3.415940e-05 1.843015e-08 5.235010e-06 2.940897e-06 1.167552e-02
## am   2.850209e-04 2.151208e-03 3.662112e-04 1.798309e-01 4.726797e-06
## gear 5.400949e-03 4.173297e-03 9.635927e-04 4.930119e-01 8.360117e-06
## carb 1.084446e-03 1.942341e-03 2.526789e-02 7.827805e-07 6.211834e-01
##                wt         qsec           vs           am         gear
## mpg  1.293956e-10 1.708199e-02 3.415940e-05 2.850209e-04 5.400949e-03
## cyl  1.217567e-07 3.660527e-04 1.843015e-08 2.151208e-03 4.173297e-03
## disp 1.222311e-11 1.314403e-02 5.235010e-06 3.662112e-04 9.635927e-04
## hp   4.145833e-05 5.766250e-06 2.940897e-06 1.798309e-01 4.930119e-01
## drat 4.784268e-06 6.195823e-01 1.167552e-02 4.726797e-06 8.360117e-06
## wt             NA 3.388682e-01 9.798495e-04 1.125438e-05 4.586600e-04
## qsec 3.388682e-01           NA 1.029669e-06 2.056622e-01 2.425345e-01
## vs   9.798495e-04 1.029669e-06           NA 3.570440e-01 2.579439e-01
## am   1.125438e-05 2.056622e-01 3.570440e-01           NA 5.834051e-08
## gear 4.586600e-04 2.425345e-01 2.579439e-01 5.834051e-08           NA
## carb 1.463861e-02 4.536940e-05 6.670496e-04 7.544526e-01 1.290291e-01
##              carb
## mpg  1.084446e-03
## cyl  1.942341e-03
## disp 2.526789e-02
## hp   7.827805e-07
## drat 6.211834e-01
## wt   1.463861e-02
## qsec 4.536940e-05
## vs   6.670496e-04
## am   7.544526e-01
## gear 1.290291e-01
## carb           NA

      To further increase efficiency, we can create a function that pulls the variables of interest related to our independent variable, along with the associated coefficients and p-values, all at once.

flat_cor_mat <- function(cor_r, cor_p){
  # Flatten the coefficient matrix into long form: row, column, cor
  cor_r <- rownames_to_column(as.data.frame(cor_r), var = "row")
  cor_r <- gather(cor_r, column, cor, -1)
  # Do the same for the p-value matrix: row, column, p
  cor_p <- rownames_to_column(as.data.frame(cor_p), var = "row")
  cor_p <- gather(cor_p, column, p, -1)
  # Join so each variable pair carries both its coefficient and its p-value
  left_join(cor_r, cor_p, by = c("row", "column"))
}
cor_3 <- rcorr(as.matrix(mtcars[, 1:7]))
my_cor_matrix <- flat_cor_mat(cor_3$r, cor_3$P)
head(my_cor_matrix)
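
      With the matrix flattened, pulling out the pairs involving a single variable of interest is a one-liner; for example, the correlations with mpg, sorted by p-value (using the dplyr loaded above).

# Keep only the pairs involving mpg, strongest evidence first
my_cor_matrix %>%
  filter(row == "mpg") %>%
  arrange(p)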

      - ggplot() If you want to assess correlation across multiple levels, you can use ggplot() to see how the variables relate; for example, checking the correlation between hp and mpg by cyl. However, this may not be the clearest way to show the relationships.

ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(name = "HP") +
  scale_y_continuous(name = "MPG") +
  ggtitle("HP and MPG by Cylinder")

      - ddply() Instead, you can use ddply() (from the plyr package) to assess the correlation between the variables by category in table form.

ddply(mtcars, "cyl", summarize, corr=cor(mpg, hp))
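
      The same table can be produced with dplyr, if you prefer the pipe style.

# dplyr equivalent of the ddply() call above
mtcars %>%
  group_by(cyl) %>%
  summarize(corr = cor(mpg, hp))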

      - scatterplot() Or use scatterplot() (from the car package) to plot the groups on a single graph.

scatterplot(mpg ~ hp | cyl, data = mtcars,
            xlab = "HP", ylab = "MPG",
            main = "HP and MPG by Cylinder",
            labels = row.names(mtcars))

Visualizations

      Now that we have a good basis for how to test for correlation and plot the relationship, we can try different ways to present the information. You may find that presenting the data in different or unique ways makes it easier for the viewer to understand, or even reveals some previously unknown characteristic. We will spend more time in class looking at visualization options, but you can review this site for the Top 50 GGPlot visuals.
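
      As one example, the corrplot package loaded in the requirements can render an entire correlation matrix at a glance; a minimal sketch:

# Visualize the full mtcars correlation matrix; circle size and color
# encode the strength and direction of each pairwise correlation
corrplot(cor(mtcars), method = "circle", type = "upper")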

Next Steps

  • What does all this mean? Based on your findings, you may want to go back to your theory and hypotheses.

  • Is the relationship between the predictors and the outcome really linear? Have you checked with both Pearson and Spearman, as the shape of the relationship dictates?

  • Or is there an important variable that you left out of your model? Variables you didn’t include (e.g., age or gender) may play an important role in your model and data. Are there confounding variables affecting the outcome?

Summary

      In class, we will discuss how we can use correlation analysis to guide our early research. Then, we will apply these concepts to different datasets and look at some coding applications to make our analysis both easier and more effective. To finish, we will look at some visualization options to better share our findings.

Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’.

Randall Munroe

Exercise, Exercise, Exercise

Using the built-in datasets

  1. (faithful) Determine which method (Pearson or Spearman) should be used for the faithful data
  2. (mtcars) Determine the correlation between mpg and hp given: vs, am, and cyl
  3. (mtcars) Visually assess the correlation of mpg and hp given cyl