Correlation is a measure of the relationship between two variables. This tutorial covers three types of correlation: Pearson, Spearman, and Kendall’s tau.

Let’s examine each type of correlation.

1. Pearson Correlation:

There are several assumptions for Pearson’s Correlation. They are as follows:

Refer to the attached document for the Pearson correlation formula.
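
As a rough illustration (not a substitute for the attached document), the sample Pearson coefficient can be computed from its usual definition: the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations. The variable names below are chosen only for this sketch.

```r
# Sketch: compute the sample Pearson correlation from its definition
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Numerator: sum of cross-products of deviations from the means
# Denominator: square root of the product of the sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual
cor(x, y)  # built-in function, should agree with r_manual
```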

Coding

Let’s practice with some coding.

We will use the iris dataset to practice Pearson’s correlation.

Plot figure

Use the following code to plot the figure.

plot(iris$Sepal.Length, iris$Sepal.Width, col = "red")

Try to estimate the slope and intercept of the line. Also, try to think about what the residuals would look like if a regression line were fitted to the data.
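
One way to check your guesses is sketched below, assuming that an ordinary least-squares fit with lm() is the regression line meant here. The object name fit is used only for this illustration.

```r
# Sketch: fit a simple linear regression of sepal width on sepal length
fit <- lm(Sepal.Width ~ Sepal.Length, data = iris)

# Estimated intercept and slope
coef(fit)

# Scatter plot with the fitted line overlaid
plot(iris$Sepal.Length, iris$Sepal.Width, col = "red")
abline(fit)

# Residuals against fitted values, with a reference line at zero
plot(fitted(fit), resid(fit))
abline(h = 0, lty = 2)
```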

Measuring Correlation

There are two functions that can be used to compute the correlation.

cor(iris$Sepal.Length, iris$Sepal.Width, method = c("pearson"))
## [1] -0.1175698

As you can see, the above code gives only the point estimate of the population correlation coefficient ρ.

To get the confidence interval and the p-value (hypothesis test), use the following code.

cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  iris$Sepal.Length and iris$Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27269325  0.04351158
## sample estimates:
##        cor 
## -0.1175698

The above output gives you all the test statistics required. If you need to know which statistics are stored in the result, you can use the function attributes().

Cor.p <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("pearson"))

attributes(Cor.p)
## $names
## [1] "statistic"   "parameter"   "p.value"     "estimate"    "null.value" 
## [6] "alternative" "method"      "data.name"   "conf.int"   
## 
## $class
## [1] "htest"
Cor.p$conf.int  # To get the confidence interval
## [1] -0.27269325  0.04351158
## attr(,"conf.level")
## [1] 0.95

Confidence Interval

To calculate the confidence interval, the Fisher Z transformation is used. First, the correlation coefficient is converted into z’, and the confidence interval is calculated on that scale. The calculated confidence interval is then converted back to the correlation coefficient scale.

This is necessary because the sampling distribution of the correlation coefficient is skewed rather than normal, whereas the distribution of z’ is approximately normal.

Refer to the handout for the formula and more information.
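
The steps described above can be sketched in a few lines. This is a minimal illustration, assuming a 95% confidence level and the usual approximate standard error 1 / sqrt(n - 3) for z’; the variable names are chosen only for this example.

```r
# Sketch: 95% confidence interval for Pearson's r via the Fisher Z transformation
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)
r <- cor(x, y)

z      <- atanh(r)          # z' = 0.5 * log((1 + r) / (1 - r))
se     <- 1 / sqrt(n - 3)   # approximate standard error of z'
z_crit <- qnorm(0.975)      # two-sided critical value for a 95% interval

ci_z <- z + c(-1, 1) * z_crit * se  # interval on the z' scale
ci_r <- tanh(ci_z)                  # back-transform to the correlation scale

ci_r
```

The result can be compared with Cor.p$conf.int from cor.test().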

Hypothesis testing

The null hypothesis states that the correlation coefficient is zero, and the alternative hypothesis states that the correlation coefficient is not equal to zero.

Refer to the handout for the formula and more information.
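
As a hedged illustration of the test, the t statistic and two-sided p-value for the Pearson correlation can be computed by hand, assuming the standard t test with n - 2 degrees of freedom. The variable names are chosen only for this sketch.

```r
# Sketch: t test of H0: rho = 0 for the Pearson correlation
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)
r <- cor(x, y)

dof     <- n - 2                          # degrees of freedom
t_stat  <- r * sqrt(dof) / sqrt(1 - r^2)  # test statistic
p_value <- 2 * pt(-abs(t_stat), dof)      # two-sided p-value

c(t = t_stat, df = dof, p.value = p_value)
```

These values can be compared with the t, df, and p-value lines in the cor.test() output.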

Parameters into dataframe?

Sometimes, we want to store the test statistics and parameters in a data frame. The broom package, created by David Robinson, is a great tool for this.

library(broom)

Cor.p <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("pearson"))

Cor.p.dataframe <- tidy(Cor.p)

names(Cor.p.dataframe)
## [1] "estimate"    "statistic"   "p.value"     "parameter"   "conf.low"   
## [6] "conf.high"   "method"      "alternative"
# We can check whether the output created by the tidy() function is a data frame

is.data.frame(Cor.p.dataframe)
## [1] TRUE
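
For readers who do not have broom installed, a similar one-row data frame can be assembled in base R from the elements of the test result. This is only a sketch of the idea, not how tidy() is implemented; the name Cor.p.manual is hypothetical.

```r
# Sketch: build a one-row data frame from an htest object using base R only
Cor.p <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = "pearson")

Cor.p.manual <- data.frame(
  estimate    = unname(Cor.p$estimate),   # sample correlation
  statistic   = unname(Cor.p$statistic),  # t statistic
  p.value     = Cor.p$p.value,
  parameter   = unname(Cor.p$parameter),  # degrees of freedom
  conf.low    = Cor.p$conf.int[1],
  conf.high   = Cor.p$conf.int[2],
  method      = Cor.p$method,
  alternative = Cor.p$alternative
)

Cor.p.manual
```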

2. Spearman Correlation

Spearman’s correlation is a non-parametric measure of rank correlation, that is, of the statistical dependence between the rankings of two variables. There are some assumptions of Spearman rank correlation. They are:

Coding

We will use the iris dataset again. Although the dataset is better suited to Pearson correlation than to Spearman correlation, we will use it anyway for practice.

cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("spearman"))
## Warning in cor.test.default(iris$Sepal.Length, iris$Sepal.Width, method =
## c("spearman")): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  iris$Sepal.Length and iris$Sepal.Width
## S = 656283, p-value = 0.04137
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1667777
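
The warning appears because the data contain tied values, so an exact p-value cannot be computed and an approximation is used instead. One possible way to avoid the warning, sketched here, is to request the approximation directly with the exact argument of cor.test(). The name Cor.s.approx is used only for this example.

```r
# Sketch: ask for the approximate p-value directly, since the data contain ties
Cor.s.approx <- cor.test(iris$Sepal.Length, iris$Sepal.Width,
                         method = "spearman", exact = FALSE)

Cor.s.approx$estimate
Cor.s.approx$p.value
```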

The test is non-parametric, and cor.test() does not report a confidence interval for the Spearman method.

Please refer to the handout for the formula to compute the Spearman rank correlation.
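
As a quick check of the rank-based idea, the Spearman coefficient can be reproduced by applying the Pearson correlation to the ranks of the two variables. This is a minimal sketch; rho_manual is a name chosen only for this example.

```r
# Sketch: Spearman's rho as the Pearson correlation of the ranks
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# rank() assigns average ranks to tied values by default
rho_manual <- cor(rank(x), rank(y), method = "pearson")

rho_manual
cor(x, y, method = "spearman")  # should agree with rho_manual
```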

We can use the broom package to tidy the output of cor.test().

Cor.s <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("spearman"))
## Warning in cor.test.default(iris$Sepal.Length, iris$Sepal.Width, method =
## c("spearman")): Cannot compute exact p-value with ties
# Putting into dataframe
  
Cor.s.dataframe <- tidy(Cor.s)

# Check whether the output is a data frame

is.data.frame(Cor.s.dataframe)
## [1] TRUE

3. Kendall’s Tau Test

Commonly known as Kendall’s tau coefficient, it measures the rank correlation between two variables. It is a non-parametric test that measures the ordinal association between two quantities. The following are the properties of the correlation:

Coding

We will use the iris dataset again for the test.

cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("kendall"))
## 
##  Kendall's rank correlation tau
## 
## data:  iris$Sepal.Length and iris$Sepal.Width
## z = -1.3318, p-value = 0.1829
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##         tau 
## -0.07699679
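
The tau estimate can also be reproduced from pairwise comparisons. The sketch below assumes the tau-b variant, which adjusts for ties; that is an assumption about which variant R reports here, and all variable names are chosen only for this example.

```r
# Sketch: Kendall's tau-b from all pairs of observations
x <- iris$Sepal.Length
y <- iris$Sepal.Width

idx <- combn(length(x), 2)            # every pair of observation indices

sx <- sign(x[idx[1, ]] - x[idx[2, ]]) # direction of the difference in x
sy <- sign(y[idx[1, ]] - y[idx[2, ]]) # direction of the difference in y

# sum(sx * sy) is the number of concordant pairs minus the number of
# discordant pairs; the denominator counts the pairs not tied in x and in y
tau_manual <- sum(sx * sy) /
  sqrt(as.numeric(sum(sx != 0)) * as.numeric(sum(sy != 0)))

tau_manual
cor(x, y, method = "kendall")  # should agree with tau_manual
```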

We can use the broom package to tidy the output of the test.

Cor.k <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("kendall"))
  
# Putting into dataframe

Cor.k.dataframe <- tidy(Cor.k)

# Check whether the output is a data frame

is.data.frame(Cor.k.dataframe)
## [1] TRUE
# This is the end of the correlation tutorial
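
To compare the three coefficients side by side, the point estimates can be collected in one named vector. This is only a small sketch; the names cor_methods and estimates are hypothetical.

```r
# Sketch: point estimates from the three methods in one named vector
cor_methods <- c("pearson", "spearman", "kendall")

estimates <- sapply(cor_methods, function(m) {
  cor(iris$Sepal.Length, iris$Sepal.Width, method = m)
})

round(estimates, 4)
```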

Thanks,

Rajesh Sigdel