Understanding Correlation

Correlation is a measure of the relationship between two variables. Three commonly used types of correlation are:

Pearson Correlation
Spearman’s Rank Correlation
Kendall’s Tau Correlation

Let’s examine each type of correlation.

1. Pearson Correlation:

There are several assumptions for Pearson’s Correlation. They are as follows:

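Level of measurement: Each variable should be continuous. If one or both variables are ordinal or rank based, a Spearman correlation is recommended.
Related pairs: Each observation (row) in the sample data should contain a value for both variables. For example, if heartbeat is one of the variables, each row should hold the heartbeat and the other measurement for the same person.
Absence of outliers: Outliers can skew the distribution and pull the correlation coefficient toward or away from zero.
Homoscedasticity: The spread of the points around the line of best fit should be roughly constant across the range of the data. This can be judged from the shape formed by the scatterplot.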

Refer to the attached document for the Pearson correlation formula.
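As a quick illustration, here is a minimal sketch of the standard sample formula, r = sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2)), computed by hand on R's built-in iris dataset. The variable names (x, y, dx, dy, r_manual) are chosen for this example only; the attached document may present the formula in a different but equivalent form.

```r
# Two numeric columns from the built-in iris dataset
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Deviations of each observation from its variable's mean
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson's r: sum of cross-products divided by the square root
# of the product of the sums of squared deviations
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual

# Should agree with the built-in function
all.equal(r_manual, cor(x, y, method = "pearson"))
```

The hand calculation and cor() should return the same value, which is a useful check that the formula has been understood correctly.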

Coding

Let’s practice with some coding.

We will use the iris dataset to practice Pearson’s correlation.

Print column names.

You can use several other functions, such as str() and View(), to look up column names. names() is a handy function for a quick glance at the names of the columns.

names(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

The relationship between Sepal Length and Sepal Width will be examined using the Pearson correlation coefficient.

It is important to plot the variables to check whether the assumptions hold.

Plot figure

Use the following code to plot the figure.

plot(iris$Sepal.Length, iris$Sepal.Width, col = "red")

Try to estimate the slope and intercept of the line. Also, try to think about what the residuals would look like if a regression line were fitted to the data.
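To check your guess, one possible approach is to fit a simple linear regression with lm() and look at the fitted line and its residuals. This is only a sketch for exploration; the object name fit is chosen for this example.

```r
# Fit a simple linear regression of Sepal.Width on Sepal.Length
fit <- lm(Sepal.Width ~ Sepal.Length, data = iris)

# Intercept and slope of the fitted line
coef(fit)

# Scatterplot with the fitted regression line drawn on top
plot(iris$Sepal.Length, iris$Sepal.Width, col = "red")
abline(fit)

# Residuals against fitted values; a roughly even band around zero
# is what homoscedasticity would look like
plot(fitted(fit), resid(fit))
abline(h = 0, lty = 2)
```

Compare the printed coefficients with your own estimate, and compare the residual plot with what you expected.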

Measuring Correlation

Two different functions can be used to measure the correlation.

cor(iris$Sepal.Length, iris$Sepal.Width, method = c("pearson"))

## [1] -0.1175698

As you can see, the above code gives only the point estimate of the population correlation coefficient ρ.

To get the confidence interval and p value (hypothesis testing), use the following code.

cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("pearson"))

## 
##  Pearson's product-moment correlation
## 
## data:  iris$Sepal.Length and iris$Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27269325  0.04351158
## sample estimates:
##        cor 
## -0.1175698

The above output gives you all the test statistics required. If you need to know which statistics are stored, you can use the function attributes().

Cor.p <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("pearson"))

attributes(Cor.p)

## $names
## [1] "statistic"   "parameter"   "p.value"     "estimate"    "null.value" 
## [6] "alternative" "method"      "data.name"   "conf.int"   
## 
## $class
## [1] "htest"

Cor.p$conf.int  # To get the confidence interval

## [1] -0.27269325  0.04351158
## attr(,"conf.level")
## [1] 0.95
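The other components listed by attributes() can be pulled out with the $ operator in the same way. A small sketch, re-creating the Cor.p object so that the block runs on its own:

```r
# Run the Pearson test and keep the result
Cor.p <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = "pearson")

Cor.p$estimate   # sample correlation coefficient
Cor.p$statistic  # t statistic
Cor.p$parameter  # degrees of freedom
Cor.p$p.value    # p-value of the test
```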

Confidence Interval

To calculate the confidence interval, the Fisher Z transformation is used. First, the correlation coefficient is converted into z’, and then the confidence interval is calculated on that scale. The calculated confidence interval is then converted back to the correlation coefficient scale.

This is necessary because the sampling distribution of the correlation coefficient is skewed, particularly when the population correlation is far from zero, so it is not normally distributed. The distribution of z’ is approximately normal.

Refer to the handout for formula and more information.
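The steps described above can be sketched by hand as follows. This assumes the usual standard error of z’, 1 / sqrt(n - 3), and a 95 percent confidence level; the variable names are chosen for this example only.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)

# Step 1: convert r to Fisher's z'
r <- cor(x, y)
z <- atanh(r)   # same as 0.5 * log((1 + r) / (1 - r))

# Step 2: confidence interval on the z' scale
se <- 1 / sqrt(n - 3)
z_ci <- z + c(-1, 1) * qnorm(0.975) * se

# Step 3: convert the interval back to the correlation scale
r_ci <- tanh(z_ci)
r_ci

# Should agree with the interval reported by cor.test()
all.equal(r_ci, as.numeric(cor.test(x, y)$conf.int))
```

Note that the resulting interval is not symmetric around r, which reflects the skewness of the sampling distribution.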

Hypothesis testing

The null hypothesis states that the population correlation coefficient is zero, and the alternative hypothesis states that it is not equal to zero.

Refer to the handout for formula and more information.
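For illustration, the test statistic reported by cor.test() for the Pearson method can be reproduced by hand. This sketch assumes the standard t statistic, t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom and a two-sided alternative; the variable names are chosen for this example only.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)
r <- cor(x, y)

# t statistic under the null hypothesis of zero correlation
deg_free <- n - 2
t_stat <- r * sqrt(deg_free) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), deg_free)

t_stat
p_value
```

Both values should match the t and p-value lines in the cor.test() output shown earlier.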

Parameters into dataframe?

Sometimes we want to store the test statistics and parameters in a data frame. The broom package, created by David Robinson, is a great tool for this.

library(broom)

Cor.p <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("pearson"))

Cor.p.dataframe <- tidy(Cor.p)

names(Cor.p.dataframe)

## [1] "estimate"    "statistic"   "p.value"     "parameter"   "conf.low"   
## [6] "conf.high"   "method"      "alternative"

# We can check whether the output created by the tidy() function is a data frame or not

is.data.frame(Cor.p.dataframe)

## [1] TRUE

2. Spearman Correlation

It is a non-parametric measure of rank correlation, that is, of the statistical dependence between the rankings of two variables. Spearman rank correlation has the following assumptions:

The data must be at least ordinal.
The scores on one variable must be monotonically related to the scores on the other variable.

Coding

We will use the iris dataset again. Although the dataset is better suited to Pearson correlation than to Spearman correlation, we will use it anyway for practice.

cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("spearman"))

## Warning in cor.test.default(iris$Sepal.Length, iris$Sepal.Width, method =
## c("spearman")): Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  iris$Sepal.Length and iris$Sepal.Width
## S = 656283, p-value = 0.04137
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1667777

The test is non-parametric, and cor.test() does not report a confidence interval for Spearman’s rho.

Please refer to the handout for the formula to compute the Spearman rank correlation.
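One way to see what "rank correlation" means is to rank each variable and then compute the Pearson correlation of the ranks. A minimal sketch, relying on rank() assigning average ranks to ties by default; the variable names are chosen for this example only.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Pearson correlation computed on the ranks (ties get average ranks)
rho_ranks <- cor(rank(x), rank(y), method = "pearson")

# Built-in Spearman correlation
rho_builtin <- cor(x, y, method = "spearman")

rho_ranks
all.equal(rho_ranks, rho_builtin)
```

The two values should be identical, since Spearman's rho is the Pearson correlation applied to the ranked data.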

We can use the broom package to tidy the output of cor.test().

Cor.s <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("spearman"))

## Warning in cor.test.default(iris$Sepal.Length, iris$Sepal.Width, method =
## c("spearman")): Cannot compute exact p-value with ties

# Putting into dataframe
  
Cor.s.dataframe <- tidy(Cor.s)

# Checking whether the output is a data frame or not

is.data.frame(Cor.s.dataframe)

## [1] TRUE

3. Kendall’s Tau Test

Commonly known as Kendall’s tau coefficient, it measures the rank correlation between two variables. It is a non-parametric test of the ordinal association between two quantities. The coefficient has the following properties:

If the agreement between the two rankings is perfect, the coefficient is 1.
If the disagreement between the two rankings is perfect, the coefficient is -1.
If there is no correlation between the rankings, the coefficient is 0.
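These properties can be illustrated with a few small made-up rankings. The vectors below are hypothetical toy data, not part of the iris example.

```r
ranking <- 1:4

# Perfect agreement: both rankings are in the same order
tau_agree <- cor(ranking, c(1, 2, 3, 4), method = "kendall")

# Perfect disagreement: the second ranking is fully reversed
tau_disagree <- cor(ranking, c(4, 3, 2, 1), method = "kendall")

# Three concordant and three discordant pairs cancel out
tau_none <- cor(ranking, c(1, 4, 3, 2), method = "kendall")

c(agree = tau_agree, disagree = tau_disagree, none = tau_none)
```

In the last case, the six possible pairs of observations split evenly into concordant and discordant pairs, so the coefficient is 0.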

Coding

We will use the iris dataset again for the test.

cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("kendall"))

## 
##  Kendall's rank correlation tau
## 
## data:  iris$Sepal.Length and iris$Sepal.Width
## z = -1.3318, p-value = 0.1829
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##         tau 
## -0.07699679

We can use the broom package to tidy the output of the test.

Cor.k <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("kendall"))
  
# Putting into dataframe

Cor.k.dataframe <- tidy(Cor.k)

# Checking whether the output is a data frame or not

is.data.frame(Cor.k.dataframe)

## [1] TRUE
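To compare the three methods side by side, the results can be collected into a single data frame. The sketch below uses base R only rather than broom, and the names methods and results are chosen for this example. suppressWarnings() is used here to hide the ties warning from the Spearman test.

```r
methods <- c("pearson", "spearman", "kendall")

# Run cor.test() once per method and keep the main results
results <- do.call(rbind, lapply(methods, function(m) {
  test <- suppressWarnings(
    cor.test(iris$Sepal.Length, iris$Sepal.Width, method = m)
  )
  data.frame(
    method    = m,
    estimate  = unname(test$estimate),
    statistic = unname(test$statistic),
    p.value   = test$p.value
  )
}))

results
```

Each row holds the estimate, test statistic, and p-value for one method, which makes the differences between the three coefficients easy to see.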

# This is Correlation tutorial

Thanks,

Rajesh Sigdel


12/17/2019
