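Correlation is a measure of the strength and direction of the relationship between two variables. Three types of correlation are covered here: Pearson's correlation, Spearman's rank correlation, and Kendall's rank correlation.
Let's examine each type of correlation.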
There are several assumptions for Pearson’s Correlation. They are as follows:
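Level of measurement: Each variable should be continuous. If one or both variables happen to be ordinal or rank based, then a Spearman correlation is recommended.
Related pairs: Each observation (row) in the sample data should contain a pair of values, one for each variable. For example, if heartbeat is one of the variables, each heartbeat and its paired measurement should come from the same person.
Absence of outliers: Outliers can skew the distribution and pull the correlation coefficient away from the value that describes the rest of the data.
Homoscedasticity: The spread of the points around the line of best fit should be roughly constant across the range of the data. This can be judged from the shape formed by the scatterplot.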
Refer to the attached document for Pearson correlation formula.
Let’s practice with some coding.
We will use the iris dataset to practice Pearson's correlation.
You can use several other functions, such as str() and View(), to look up column names. names() is a handy function for quickly glancing at the names of the columns.
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
The relationship between sepal length and sepal width will be examined using the Pearson correlation coefficient.
It is important to plot the variables to check whether the assumptions hold.
Use the following code to plot the figure.
plot(iris$Sepal.Length, iris$Sepal.Width, col = "red")
Try to estimate the slope and intercept of the line. Also, try to think about what the residuals would look like if a regression line were fitted to the data.
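One way to check your estimate is to fit a simple linear regression with lm() and look at the residuals. The sketch below assumes Sepal.Width as the response and Sepal.Length as the predictor, matching the axes of the plot above. The object names fit and res are only illustrative.

```r
# Fit a simple linear regression of Sepal.Width on Sepal.Length
fit <- lm(Sepal.Width ~ Sepal.Length, data = iris)
coef(fit)  # intercept and slope of the fitted line

# Redraw the scatterplot and add the fitted line
plot(iris$Sepal.Length, iris$Sepal.Width, col = "red")
abline(fit, col = "blue")

# Residuals against fitted values: a roughly even band around zero
# is consistent with homoscedasticity
res <- resid(fit)
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

If the residual plot fans out or curves, the homoscedasticity assumption deserves a closer look before relying on Pearson's correlation.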
There are two functions that produce the result: cor() and cor.test().
cor(iris$Sepal.Length, iris$Sepal.Width, method = c("pearson"))
## [1] -0.1175698
As you can see, the above code gives only the point estimate of the population correlation coefficient ρ.
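The point estimate can also be reproduced by hand. This is a minimal sketch that assumes the standard product-moment definition of Pearson's r; the notation in the handout may differ. The names dx, dy, r_manual and r_cov are illustrative.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Deviations of each observation from its sample mean
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson's r: sum of cross-products divided by the square root of
# the product of the two sums of squares
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual

# Equivalent form: covariance divided by the product of the standard deviations
r_cov <- cov(x, y) / (sd(x) * sd(y))

# Both should agree with cor()
all.equal(r_manual, cor(x, y))
all.equal(r_cov, cor(x, y))
```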
To get the confidence interval and p value (hypothesis testing), use the following code.
cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("pearson"))
##
## Pearson's product-moment correlation
##
## data: iris$Sepal.Length and iris$Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.27269325 0.04351158
## sample estimates:
## cor
## -0.1175698
The above output gives you all the test statistics required. If you need to know which statistics are stored in the result, you can use the function attributes().
Cor.p <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("pearson"))
attributes(Cor.p)
## $names
## [1] "statistic" "parameter" "p.value" "estimate" "null.value"
## [6] "alternative" "method" "data.name" "conf.int"
##
## $class
## [1] "htest"
Cor.p$conf.int # To get the confidence interval
## [1] -0.27269325 0.04351158
## attr(,"conf.level")
## [1] 0.95
To calculate the confidence interval, the Fisher z transformation is used. First, the correlation coefficient is converted to z' and the confidence interval is calculated on that scale. The calculated confidence interval is then converted back to the correlation scale.
This is necessary because the sampling distribution of the correlation coefficient is skewed and not normally distributed, whereas the distribution of z' is approximately normal.
Refer to the handout for formula and more information.
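The three steps can be sketched in a few lines of R. This assumes the usual approximation in which z' has standard error 1 / sqrt(n - 3); the variable names are illustrative, and the result should match the interval reported by cor.test() above.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)
r <- cor(x, y)

z    <- atanh(r)                             # Fisher z: 0.5 * log((1 + r) / (1 - r))
se   <- 1 / sqrt(n - 3)                      # approximate standard error of z'
z_ci <- z + c(-1, 1) * qnorm(0.975) * se     # 95% interval on the z' scale
r_ci <- tanh(z_ci)                           # convert back to the correlation scale
r_ci
```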
The null hypothesis states that the population correlation coefficient is zero, and the alternative hypothesis states that it is not equal to zero.
Refer to handout for formula and more information.
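As a rough check of the test output, the t statistic and the two-sided p-value can be recomputed from r and n. The sketch assumes the usual statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom; the names t_stat and p_val are illustrative.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)
r <- cor(x, y)

df     <- n - 2                              # degrees of freedom
t_stat <- r * sqrt(df) / sqrt(1 - r^2)       # test statistic under H0: rho = 0
p_val  <- 2 * pt(-abs(t_stat), df)           # two-sided p-value
c(t = t_stat, df = df, p.value = p_val)
```

These values should agree with the t, df and p-value lines printed by cor.test().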
Sometimes we want to store the test statistics and parameters in a data frame. The broom package, created by David Robinson, is a great tool for this.
library(broom)
Cor.p <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("pearson"))
Cor.p.dataframe <- tidy(Cor.p)
names(Cor.p.dataframe)
## [1] "estimate" "statistic" "p.value" "parameter" "conf.low"
## [6] "conf.high" "method" "alternative"
# is.data.frame() checks whether the output created by tidy() is a data frame
is.data.frame(Cor.p.dataframe)
## [1] TRUE
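If broom is not available, a similar one-row data frame can be assembled by hand from the elements listed by attributes(). This is only a sketch of that idea in base R, not what tidy() does internally; the name Cor.p.manual is illustrative.

```r
Cor.p <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = "pearson")

# Collect selected elements of the "htest" object into a one-row data frame
Cor.p.manual <- data.frame(
  estimate    = unname(Cor.p$estimate),
  statistic   = unname(Cor.p$statistic),
  p.value     = Cor.p$p.value,
  parameter   = unname(Cor.p$parameter),
  conf.low    = Cor.p$conf.int[1],
  conf.high   = Cor.p$conf.int[2],
  method      = Cor.p$method,
  alternative = Cor.p$alternative
)
Cor.p.manual
```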
Spearman's rank correlation is a non-parametric measure of rank correlation, that is, of the statistical dependence between the rankings of two variables. It assumes that both variables are at least ordinal and that the relationship between them is monotonic.
We will use the iris dataset again. Although the dataset is better suited to Pearson correlation than to Spearman correlation, we will use it anyway for practice.
cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("spearman"))
## Warning in cor.test.default(iris$Sepal.Length, iris$Sepal.Width, method =
## c("spearman")): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: iris$Sepal.Length and iris$Sepal.Width
## S = 656283, p-value = 0.04137
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.1667777
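The warning appears because the iris measurements contain tied values, so an exact p-value cannot be computed. One way to avoid it, sketched below, is to request the asymptotic p-value explicitly through the exact argument of cor.test(). The name Cor.s.approx is illustrative.

```r
# exact = FALSE asks for the asymptotic approximation directly,
# so the warning about ties is not raised
Cor.s.approx <- cor.test(iris$Sepal.Length, iris$Sepal.Width,
                         method = "spearman", exact = FALSE)
Cor.s.approx$estimate
Cor.s.approx$p.value
```

The estimate and p-value should be the same as in the output above, because cor.test() falls back to this approximation when it finds ties.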
Unlike the Pearson output, this output does not include a confidence interval, because cor.test() does not compute one for Spearman's rho.
Please refer to handout for formula to compute Spearman rank correlation.
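Spearman's rho can also be reproduced as Pearson's correlation computed on the ranks. This is a minimal sketch under that common definition; the handout may instead write the formula in terms of rank differences, which gives the same value only when there are no ties. The names rx, ry and rho_manual are illustrative.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Rank each variable; tied values receive the average of their ranks
rx <- rank(x)
ry <- rank(y)

# Pearson correlation of the ranks
rho_manual <- cor(rx, ry, method = "pearson")
rho_manual

all.equal(rho_manual, cor(x, y, method = "spearman"))
```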
We can use the broom package to tidy the result of cor.test().
Cor.s <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("spearman"))
## Warning in cor.test.default(iris$Sepal.Length, iris$Sepal.Width, method =
## c("spearman")): Cannot compute exact p-value with ties
# Put the result into a data frame
Cor.s.dataframe <- tidy(Cor.s)
# is.data.frame() checks whether the output is a data frame
is.data.frame(Cor.s.dataframe)
## [1] TRUE
Kendall's rank correlation, commonly known as Kendall's tau coefficient, measures the rank correlation between two variables. It is a non-parametric test that measures the ordinal association between two quantities. The properties of the coefficient are as follows:
If the agreement between the two rankings is perfect, the coefficient is 1.
If the disagreement between the two rankings is perfect, the coefficient is -1.
If there is no correlation in the ranking, the coefficient is 0.
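These three properties can be illustrated on small made-up rankings. The helper below is a hypothetical sketch of Kendall's tau for data without ties (often called tau-a), computed as the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. It is not the routine that cor.test() uses.

```r
# Hypothetical helper: Kendall's tau for data without ties (tau-a)
kendall_tau_a <- function(x, y) {
  pairs <- combn(length(x), 2)               # every pair of observations (i, j)
  dx <- x[pairs[1, ]] - x[pairs[2, ]]
  dy <- y[pairs[1, ]] - y[pairs[2, ]]
  sum(sign(dx) * sign(dy)) / ncol(pairs)     # +1 if concordant, -1 if discordant
}

a <- 1:4                                     # made-up ranking, for illustration only
kendall_tau_a(a, a)                # identical rankings: 1
kendall_tau_a(a, rev(a))           # reversed rankings: -1
kendall_tau_a(a, c(2, 4, 1, 3))    # 3 concordant and 3 discordant pairs: 0
```

For data without ties, the helper should agree with cor(x, y, method = "kendall").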
We will use the iris dataset again for the test.
cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("kendall"))
##
## Kendall's rank correlation tau
##
## data: iris$Sepal.Length and iris$Sepal.Width
## z = -1.3318, p-value = 0.1829
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.07699679
We can use the broom package to tidy the output of the test.
Cor.k <- cor.test(iris$Sepal.Length, iris$Sepal.Width, method = c("kendall"))
# Put the result into a data frame
Cor.k.dataframe <- tidy(Cor.k)
# is.data.frame() checks whether the output is a data frame
is.data.frame(Cor.k.dataframe)
## [1] TRUE
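To compare the three coefficients side by side, the method argument of cor() can be looped over. This is a small sketch; the names methods and estimates are illustrative.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Compute the point estimate for each method; sapply() keeps the method names
methods   <- c("pearson", "spearman", "kendall")
estimates <- sapply(methods, function(m) cor(x, y, method = m))
round(estimates, 4)
```

The values should match the sample estimates printed by the three cor.test() calls above.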
# This is Correlation tutorial
Thanks,
Rajesh Sigdel