Correlation in Data

2024-03-20

Correlation

Correlation represents the linear dependency between two variables. The variable ‘r’ is used to denote correlation.

This presentation uses two methods of calculation: the Pearson and Spearman correlation coefficients.

Correlation

r ranges from -1 to +1. The closer r is to zero, the weaker the linear relationship. A value of +1 represents a perfect positive correlation and a value of -1 represents a perfect negative correlation.
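As a quick check of the two endpoints and of the zero case, here is a minimal sketch using base R's `cor()`. The vectors are made up for illustration and are not taken from the presentation's data.

```r
# Illustrative vector (made up for this sketch)
x <- c(1, 2, 3, 4, 5)

y_pos <- 2 * x + 1     # exact increasing linear function of x
y_neg <- -3 * x + 10   # exact decreasing linear function of x
y_sym <- (x - 3)^2     # symmetric around the mean of x, so no linear trend

r_pos <- cor(x, y_pos)  # perfect positive correlation
r_neg <- cor(x, y_neg)  # perfect negative correlation
r_sym <- cor(x, y_sym)  # no linear relationship, although y is a function of x

print(c(positive = r_pos, negative = r_neg, symmetric = r_sym))
```

The last case is worth noting: an r near zero only rules out a linear relationship, not every kind of dependency.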

Basic Example of Correlation Types

Strong Positive Correlation:

Basic Example of Correlation Types

Strong Negative Correlation:

Basic Example of Correlation Types

Weak Correlation:

Basic Example of Correlation Types

No Correlation:
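The four cases above can be approximated with simulated data. The sketch below is only an illustration: the sample size, the noise levels, and the seed are assumptions chosen for this example, not values from the presentation.

```r
set.seed(1)            # arbitrary seed, only for reproducibility
n <- 200               # assumed sample size
x <- rnorm(n)

strong_pos <- x + rnorm(n, sd = 0.3)    # small noise around an increasing line
strong_neg <- -x + rnorm(n, sd = 0.3)   # small noise around a decreasing line
weak       <- x + rnorm(n, sd = 3)      # large noise hides most of the trend
none       <- rnorm(n)                  # generated independently of x

# Pearson's r for each simulated case
r <- c(strong_pos = cor(x, strong_pos),
       strong_neg = cor(x, strong_neg),
       weak       = cor(x, weak),
       none       = cor(x, none))
print(round(r, 3))
```

Plotting each of these vectors against `x` gives scatter plots of the four types named on the slides.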

How to Calculate Pearson Correlation

\(r_{xy}=\frac{\sum_{i=1}^n ((x_i-\overline{x})(y_i-\overline{y}))}{\sqrt{\sum_{i=1}^n (x_i-\overline{x})^2}\sqrt{\sum_{i=1}^n (y_i-\overline{y})^2}}\)

\(\overline{x}\) and \(\overline{y}\) are the sample means of x and y

\(x_i\) and \(y_i\) are the sample points at index i
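The formula can be evaluated term by term and compared with base R's `cor()`. This is a minimal sketch: `pearson_r` is a hypothetical helper written for this example, and the sample points are made up.

```r
# Hypothetical helper: Pearson's r computed directly from the formula
pearson_r <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 1)
  dx <- x - mean(x)   # deviations of x from its sample mean
  dy <- y - mean(y)   # deviations of y from its sample mean
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

# Made-up sample points
x <- c(2, 4, 5, 7, 9)
y <- c(1, 3, 4, 8, 9)

r_manual  <- pearson_r(x, y)
r_builtin <- cor(x, y, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree up to floating-point rounding, since `cor()` uses the Pearson method by default.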

How to Calculate Spearman’s Correlation

In Spearman’s calculation, \(X_i\) and \(Y_i\) (the sample points at index i) are converted to ranks \(R(X_i)\) and \(R(Y_i)\)

\(r_s=\rho_{R(X),R(Y)}=\frac{\operatorname{cov}(R(X),R(Y))}{\sigma_{R(X)}\sigma_{R(Y)}}\)

\(\rho\) is Pearson’s correlation coefficient applied to the rank variables

cov(R(X),R(Y)) is the covariance of the rank variables

\(\sigma_{R(X)}\) and \(\sigma_{R(Y)}\) are the standard deviations of the rank variables
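The rank-based definition can be sketched in base R by ranking the data and then applying the covariance formula. The vectors below are made up for illustration; `y` is the cube of `x / 10`, so it increases with `x` but not linearly.

```r
# Made-up data: y increases with x, but not along a straight line
x <- c(10, 20, 30, 40, 50)
y <- c(1, 8, 27, 64, 125)

rx <- rank(x)   # R(X): ranks of the x values
ry <- rank(y)   # R(Y): ranks of the y values

# Spearman's coefficient from the definition: cov of ranks over product of sds
r_s_manual  <- cov(rx, ry) / (sd(rx) * sd(ry))
r_s_builtin <- cor(x, y, method = "spearman")
r_pearson   <- cor(x, y)   # Pearson on the raw values, for comparison

print(c(spearman_manual = r_s_manual,
        spearman_builtin = r_s_builtin,
        pearson = r_pearson))
```

Because the ranks of `x` and `y` are identical here, Spearman's coefficient reaches its maximum, while Pearson's r on the raw values stays below 1 because the relationship is not linear.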

Example of Correlation in a Data Set

r = 0.1037, so we conclude there is essentially no linear correlation between a movie’s rating and the number of votes it received

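library(plotly)  # provides plot_ly()

# Assumes `movies` is already loaded with `rating` and `votes` columns
p5 <- plot_ly(data = movies, x = ~rating, y = ~votes, type = 'scatter',
              mode = 'markers', height = 300, text = "Movie Rating vs Number of Votes")
p5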