First, before proceeding with this tutorial, we will have to install the following required packages:
tseries
reshape
ggplot2
The following chunk will check each package’s availability and install them if necessary
required_packages <- c('tseries', 'reshape', 'ggplot2')
for (p in required_packages) {
if(!require(p,character.only = TRUE)) {
install.packages(p, dep = TRUE)
}
}
A correlation coefficient is a numerical measure of some type of correlation, meaning a statistical relationship between two variables. The variables may be two columns of a given data set of observations, often called a sample, or two components of a multivariate random variable with a known distribution.
Several types of correlation coefficient exist, each with their own definition and own range of usability and characteristics. They all assume values in the range from \(−1\) to \(+1\), where \(\pm 1\) indicates the strongest possible agreement and 0 the strongest possible disagreement. As tools of analysis, correlation coefficients present certain problems, including the propensity of some types to be distorted by outliers and the possibility of incorrectly being used to infer a causal relationship between the variables
Pearson correlation coefficient, also referred to as Pearson’s R, is a measure of the linear correlation between two variables X and Y. This is the best-known and most commonly used type of correlation coefficient.
According to the Cauchy-Schwarz inequality, it has a value between \(+1\) and \(-1\), where
Pearson’s correlation coefficient when applied to a sample is commonly represented by \(r_{xy}\) and may be referred to as the sample correlation coefficient or the sample Pearson correlation coefficient. Given paried data \(\{ (x_1, y_1), (x_2, y_2),..., (x_n, y_n) \}\) consisting of \(n\) paris, \(r_{xy}\) is defined as:
\[ r_{xy} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left( \sum x_i\right)^2} \sqrt{n \sum y_i^2 - \left( \sum y_i\right)^2}} \]
where
\(n\) is the sample size
\(x_i, y_i\) are the individual sample points indexed with \(i\)
There are some assumptions relating to the usage of Pearson R correlation coefficient:
Both variables should be normally distributed
Linearity: Linearity assumes a straight line relationship between each of the two variables
Homoscedasticity: Homoscedasticity assumes that data is equally distributed around the regression line
A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them. Some of the most popular rank correlation statistics include:
Within this the scope of this tutorial, we only consider two of the most popular rank correlations coefficient (Spearman’s \(\rho\) and Kendall’s \(\tau\)) for the following reasons:
R
to calculate the correlation (or covariance) between two variables, cor
, offer three different methods: Pearson (default), Kendall and SpearmanKendall rank correlation coefficient, commonly referred to as Kendall’s \(\tau\), is a statistic used to measure the ordinal association between two measured quantities. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities.
Let \((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\) be a set of observations of the joint random variables X and Y respectively, such that all the values of \((x_i)\) and \((y_i)\) are unique. Any pair of observations \((x_i, y_i)\) and \((x_j,y_j)\), where \(i < j\), are said to be
concordant if the ranks for both elements (more precisely, the sort order by \(x\) and by \(y\)) agree: that is, if
discordant if
neither concordant or discordant if \(x_i = x_j\) or \(y_i = y_j\)
The Kendall \(\tau\) coefficient is defined as:
\[ \tau = \frac{\text{(number of concordant pairs)} - \text{(number of discordant pairs)}}{\left( \begin{matrix} n \\ 2 \end{matrix} \right)} \]
An explicit expression for Kendall’s rank coefficient would be:
\[ \tau = \frac{2}{n(n-1)} \times \sum_{i < j} \text{sgn}(x_i - x_j) \text{sgn}(y_i - y_j) \]
In order to ensure that we obtain valid results (not just numbers on monitors), we need to make sure that our data satisfies the following assumptions
The Kendall rank coefficient is often used as a test statistic in a statistical hypothesis test to establish whether two variables may be regarded as statistically dependent. This test is non-parametric, as it does not rely on any assumptions on the distributions of \(X\) or \(Y\) or the distribution of \((X,Y)\).
Under the null hypothesis of independence of X and Y, the sampling distribution of τ has an expected value of zero. The precise distribution cannot be characterized in terms of common distributions, but may be calculated exactly for small samples; for larger samples, it is common to use an approximation to the normal distribution, with mean \(0\) and variance \(\frac{2(2n+5)}{9n(n-1)}\).
Spearman rank correlation coefficient (or Spearman’s \(\rho\)), is a non-parametric measure of rank correlation. It assesses how well the relationship between two variables can be described using a monotonic function.
The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson’s correlation assesses linear relationships, Spearman’s correlation assesses monotonic relationships (whether linear or not).
Spearman’s coefficient is appropriate for both continuous and discrete ordinal variables.
For a sample of size \(n\), the \(n\) raw scores \(X_i, Y_i\) are converted to ranks \(\text{rg}X, \text{rg}Y\) and \(r_s\) is computed from:
\[ r_s = \rho_{\text{rg}X, \text{rg}Y} = \frac{\text{cov}(\text{rg}X, \text{rg}Y)}{\rho_{\text{rg}X} \rho_{\text{rg}Y}} \]
where
R
Correlation coefficients can be computed in R
using the functions cor()
or cor.test()
.
cor()
computes the correlation coefficientcor.test()
returns a more complete analysis of the correlation between a paired set of samples. It returns both the correlation coefficient and the significant level (or p-value) of the correlation.The format of the function is as follow:
cor()
cor(x,y, na.rm, use, method = c('pearson', 'kendall', 'spearman'))
with
x
,y
: numeric, vector, matrix or data frame with the same lengthna.rm
: logical, should missing value be removed?use
: method in the presence of missing values, must be one of "everything"
, "all.obs"
, "complete.obs"
, "na.or.complete"
, or “pairwise.complete.obs”method
: which correlation to be computed, must be one of "pearson"
(default), "kendall"
, or "spearman"
cor.test()
cor.test(x, y,
alternative = c("two.sided", "less", "greater"),
method = c("pearson", "kendall", "spearman"),
exact, conf.level, continuity)
with
x
,y
: numeric, vector, matrix or data frame with the same lengthalternative
: alternative hypothesis, must be one of "two.sided"
, "less"
, or "greater"
. "greater"
and "less"
corresponds to positive and negative correlation, respectivelymethod
: which correlation to be computed, must be one of "pearson"
(default), "kendall"
, or "spearman"
exact
: logical, whether an exact p-value should be computed, used for Kendall’s \(\tau\) and Spearman’s \(\rho\)conf. level
: confidence level for the confidence intervalcontinuity
: logical; when True
, a continuity correction is used for Kendall’s \(\tau\) and Spearman’s \(\rho\) 1when not computed exactly.As an example of calculating correlation coefficients, we will use a built-in dataset of R
, which is mtcars
. The data was extracted from the 1964 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models)
We load and view the head of this dataset as follow
data("mtcars")
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
In this example, we will consider the correlation between two variables, mpg
and wt
. These are all continuous variables, which will be a perfect example for a correlation between two numerical variables.
Note: Throughout this example, we will use the library ggplot2
for all the visualization work. This ensures the coherence of all the tutorials produced.
First of all, we will investigate the correlation between these two variables by a scatterplot.
library(ggplot2)
ggplot(mtcars, aes(x = mpg, y = wt)) +
geom_point() +
geom_smooth(method = lm) +
labs(x = "Miles/(US) gallon", y = "Weight (1000 lbs)")
From the scatterplot, we can see that the relationship between these two variables are linear. Next, we will inspect the distribution of each column, first by visualizing it using the Q-Q plot (quantile - quantile plot).
## Q-Q plot for mpg
ggplot(mtcars, aes(sample = mpg)) +
stat_qq() +
stat_qq_line() +
labs(y = "MPG")
## Q-Q plot for mp
ggplot(mtcars, aes(sample = wt)) +
stat_qq() +
stat_qq_line() +
labs(y = "WT")
Here, we want to know whether these two variables follow a normal distribution. After inspecting the Q-Q plot, we can perform some statistics test as follow (which will be mentioned deeper in another publication):
R
as follow:shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756, p-value = 0.1229
shapiro.test(mtcars$wt)
##
## Shapiro-Wilk normality test
##
## data: mtcars$wt
## W = 0.94326, p-value = 0.09265
library(tseries)
jarque.bera.test(mtcars$mpg)
##
## Jarque Bera Test
##
## data: mtcars$mpg
## X-squared = 2.2412, df = 2, p-value = 0.3261
jarque.bera.test(mtcars$wt)
##
## Jarque Bera Test
##
## data: mtcars$wt
## X-squared = 1.09, df = 2, p-value = 0.5798
Since all the p-values of the tests are above \(\alpha = 0.05\), we do not reject the null hypothesis of normality for both of these variables.
Next, we will apply the correlation tests
cor.test(mtcars$wt, mtcars$mpg, method = "pearson", conf.level = 0.95)
##
## Pearson's product-moment correlation
##
## data: mtcars$wt and mtcars$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9338264 -0.7440872
## sample estimates:
## cor
## -0.8676594
Here, we can see that the p-value of the statistics test is \(1.294 \times 10^{-10}\), which is less than the significance level of \(\alpha = 0.05\). This means that the correlation between two variables wt
and mpg
is significance, and the correlation coefficient is approximately \(-0.87\).
We also kindly note that the conf.level
can only now be used for the Pearson product moment correlation when there are at least f4 complete pairs of observations. As a result, similarly, for the other correlation tests, we would also have:
cor.test(mtcars$wt, mtcars$mpg, method = "kendall")
## Warning in cor.test.default(mtcars$wt, mtcars$mpg, method = "kendall"):
## Cannot compute exact p-value with ties
##
## Kendall's rank correlation tau
##
## data: mtcars$wt and mtcars$mpg
## z = -5.7981, p-value = 6.706e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.7278321
cor.test(mtcars$wt, mtcars$mpg, method = "kendall")
## Warning in cor.test.default(mtcars$wt, mtcars$mpg, method = "kendall"):
## Cannot compute exact p-value with ties
##
## Kendall's rank correlation tau
##
## data: mtcars$wt and mtcars$mpg
## z = -5.7981, p-value = 6.706e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.7278321
Correlation matrix is a table showing the correlation between sets of variables. The correlation matrix can be easily computed, using again the cor
function. In this example, for the sake of simplicity and aesthetics, we will only consider the correlation between the first 7 variables (7 first columns) of the dataset mtcars
.
mtcars_new <- mtcars[,c(1:7)]
corrmat_pearson <- cor(mtcars_new, method = "pearson")
corrmat_pearson
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## qsec
## mpg 0.41868403
## cyl -0.59124207
## disp -0.43369788
## hp -0.70822339
## drat 0.09120476
## wt -0.17471588
## qsec 1.00000000
In order to create the correlation hearmap, first, we will have to melt the correlation matrix by using the melt
function from the package reshape
library(reshape)
melted_corrmat_pearson <- melt(corrmat_pearson)
head(melted_corrmat_pearson)
## X1 X2 value
## 1 mpg mpg 1.0000000
## 2 cyl mpg -0.8521620
## 3 disp mpg -0.8475514
## 4 hp mpg -0.7761684
## 5 drat mpg 0.6811719
## 6 wt mpg -0.8676594
To visualize the correlation matrix, we can simply use the function geom_tile()
in the package ggplot2
as follow:
ggplot(melted_corrmat_pearson, aes(x = X1, y = X2, fill = value)) +
geom_tile() +
labs(x = 'Variables', y = 'Variables') +
coord_fixed()
However, we can see that, this correlation matrix is not perfect. We can change the appearance of this hearmap by adding correlation coefficients and change the color gradient. Although this is a little bit complicated, it is totally achievable. After such modification, we will notice that our correlation hearmap is much clearer and more nice-looking.
To do it, first, we will have to round the correlation matrix to 2 significant figures (to add the values into the correlation heatmap).
corrmat_pearson_rounded <- round(corrmat_pearson, 2)
melted_corrmat_pearson_rounded <- melt(corrmat_pearson_rounded)
ggplot(melted_corrmat_pearson_rounded, aes(x = X1, y = X2, fill = value)) +
labs(title = "Correlation heatmap for first 7 variables of the dataset mrcars") +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0, limit = c(-1,1),
space = "Lab", name = "Correlation coefficient") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, size = 12, hjust = 1, vjust = 1),
axis.text.y = element_text(size = 12, hjust = 1, vjust = 0.5),
axis.title.x = element_blank(), axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank()) +
geom_text(aes(x = X1, y = X2, label = value), color = "black", size = 4) +
coord_fixed()
In this tutorial, we have discussed the most common correlation parameters between continuous/numerical variables. Next time, we will cover correlation coefficients for discrete/categorical variables.
“Correlation Test Between Two Variables in R.” n.d. Statistical Tools for High-Throughput Data Analysis (STHDA). http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r.
Daniels, H. E. 1944. “The Relation Between Measures of Correlation in the Universe of Sample Permutations.” Biometrika 33 (2): 171–91.
“Ggplot2 : Quick Correlation Matrix Heatmap - R Software and Data Visualization.” n.d. Statistical Tools for High-Throughput Data Analysis (STHDA). http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization.
Magiya, Joseph. 2019. “Kendall Rank Correlation Explained.” Towards Data Science. https://towardsdatascience.com/kendall-rank-correlation-explained-dee01d99c535.
Wikipedia contributors. 2019a. “Kendall Rank Correlation Coefficient — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Kendall_rank_correlation_coefficient&oldid=921851634.
———. 2019b. “Pearson Correlation Coefficient — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Pearson_correlation_coefficient&oldid=922293481.
———. 2019c. “Rank Correlation — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Rank_correlation&oldid=909175766.
———. 2019d. “Spearman’s Rank Correlation Coefficient — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Spearman%27s_rank_correlation_coefficient&oldid=925494275.