Correlation represents the linear dependency between two variables. The variable ‘r’ is used to denote correlation.
The two methods of calculation used in this presentation are Pearson’s and Spearman’s equations for the correlation coefficient.
2024-03-20
r ranges from -1 to +1. The closer r is to zero, the weaker the linear relationship. A value of +1 represents a perfect positive correlation and a value of -1 represents a perfect negative correlation.
Strong Positive Correlation:
Strong Negative Correlation:
Weak Correlation:
No Correlation:
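The four cases named above can be illustrated with simulated data. The following is a minimal sketch in base R, not the code behind the original plots: the variable names, the seed, and the noise levels are assumptions chosen only to produce each kind of relationship.

```r
# Simulated data; the seed and noise scales are illustrative assumptions
set.seed(1)
n <- 200
x <- rnorm(n)
noise <- rnorm(n)

strong_pos <- x + 0.1 * noise    # y rises with x, little scatter
strong_neg <- -x + 0.1 * noise   # y falls as x rises, little scatter
weak       <- x + 3 * noise      # trend mostly hidden by noise
none       <- noise              # generated independently of x

# Pearson's r for each case
r_values <- c(
  strong_positive = cor(x, strong_pos),
  strong_negative = cor(x, strong_neg),
  weak            = cor(x, weak),
  none            = cor(x, none)
)
print(round(r_values, 3))
```

The first two values should land close to +1 and -1, the third well below them, and the last near zero.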
\(r_{xy}=\frac{\sum_{i=1}^n ((x_i-\overline{x})(y_i-\overline{y}))}{\sqrt{\sum_{i=1}^n (x_i-\overline{x})^2}\sqrt{\sum_{i=1}^n (y_i-\overline{y})^2}}\)
\(\overline{x}\) and \(\overline{y}\) are the sample means of x and y
\(x_i\) and \(y_i\) are the sample points at index i
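As a check on the formula, Pearson's r can be computed term by term and compared with base R's `cor()`. This is a small sketch; the vectors `x` and `y` are made-up illustrative values, not taken from the movies data.

```r
# Illustrative vectors (assumed values, not from the movies data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Deviations from the sample means
x_dev <- x - mean(x)
y_dev <- y - mean(y)

# Numerator: sum of products of deviations
# Denominator: product of the square roots of the summed squared deviations
r_manual <- sum(x_dev * y_dev) / (sqrt(sum(x_dev^2)) * sqrt(sum(y_dev^2)))

# Built-in estimate for comparison
r_builtin <- cor(x, y, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree up to floating-point error, since `cor()` uses the same definition by default.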
In Spearman’s calculation, \(X_i\) and \(Y_i\) (the sample points at i) are converted to ranks \(R(X_i)\) and \(R(Y_i)\)
\(r_s=\rho_{R(X),R(Y)}=\frac{\mathrm{cov}(R(X),R(Y))}{\sigma_{R(X)}\sigma_{R(Y)}}\)
\(\rho\) is Pearson’s correlation coefficient applied to the rank variables
\(\mathrm{cov}(R(X),R(Y))\) is the covariance of the rank variables
\(\sigma_{R(X)}\) and \(\sigma_{R(Y)}\) are the standard deviations of the rank variables
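The rank-based definition can be sketched the same way: rank both variables, then divide the covariance of the ranks by the product of their standard deviations. The data below are assumed illustrative values, chosen so that y increases with x but not linearly.

```r
# Illustrative data (assumed): y is monotone in x but not linear
x <- c(10, 20, 30, 40, 50)
y <- c(1, 8, 27, 64, 125)

# Convert the sample points to ranks
rx <- rank(x)
ry <- rank(y)

# Spearman's coefficient: Pearson's formula applied to the ranks
rs_manual <- cov(rx, ry) / (sd(rx) * sd(ry))

# Built-in estimates for comparison
rs_builtin <- cor(x, y, method = "spearman")
r_pearson  <- cor(x, y, method = "pearson")

print(c(spearman_manual = rs_manual, spearman_builtin = rs_builtin, pearson = r_pearson))
```

Because the ranks of x and y match exactly here, Spearman's coefficient is 1, while Pearson's r stays below 1 because the relationship is not a straight line.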
r = 0.1037: we conclude there is essentially no linear correlation between the rating of a movie and how many votes it received
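library(plotly)  # provides plot_ly()
# Scatter plot of rating (x) against votes (y); hover text corrected to match the plotted variables
p5 <- plot_ly(data = movies, x = ~rating, y = ~votes, type = 'scatter', mode = 'markers', height = 300, text = "Movie Rating vs Movie Votes")
p5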