The info is coming from:
library(ggplot2)  # plotting
library(dplyr)    # data manipulation
library(tidyr)    # reshaping (gather)
# colour-blind-friendly palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
In statistics you usually have samples from a population, and from those samples you make inferences about the population.
For these inferences, frequentists are to be distinguished from Bayesians, who adjust their probabilities when new information arrives.
Statistics are sample estimators of a population value (the estimand).
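The difference between an estimand and its estimate can be made concrete with a small simulation. This is only a sketch: the population, its parameters, and the names `population`, `smp`, `estimand` and `estimate` are assumptions made up for illustration.

```r
set.seed(42)

# Hypothetical population: BMI-like values for 100,000 subjects
population <- rnorm(100000, mean = 25, sd = 4)
estimand <- mean(population)            # population value, normally unknown

# In practice we only observe a sample
smp <- sample(population, size = 200)
estimate <- mean(smp)                   # the statistic computed from the sample

c(estimand = estimand, estimate = estimate)
```

The two numbers should be close but not identical; a different sample would give a slightly different estimate.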
Bayes' rule is often applied to diagnostic tests.
A general formula by Bayes is:
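For two events A and B with P(B) > 0 (the symbols A and B are chosen here for illustration), the standard statement of Bayes' theorem reads:

\[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}\]

A minimal sketch of how this applies to a diagnostic test follows. The prevalence, sensitivity and specificity are hypothetical numbers, not values from any real test.

```r
# Hypothetical diagnostic-test figures (assumptions for illustration only)
prevalence  <- 0.01   # P(D): prior probability of disease
sensitivity <- 0.95   # P(+ | D)
specificity <- 0.90   # P(- | not D)

# P(+) via the law of total probability
p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' rule: P(D | +) = P(+ | D) * P(D) / P(+)
ppv <- sensitivity * prevalence / p_positive
ppv
```

With these assumed numbers the probability of disease given a positive result stays below 10%, because the disease is rare and false positives dominate.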
The expected value of a random variable (RV) describes its central tendency: it is the probability-weighted average of the values the RV can take.
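As a small illustration, the expected value of a fair six-sided die can be computed directly and compared with the mean of many simulated rolls. The die is an example chosen here, and the variable names are arbitrary.

```r
# Fair six-sided die: an illustrative RV
die_values <- 1:6
die_probs  <- rep(1 / 6, 6)

# Expected value = sum of value * probability
expected_value <- sum(die_values * die_probs)
expected_value

# The sample mean of many rolls should be close to the expected value
set.seed(1)
rolls <- sample(die_values, size = 10000, replace = TRUE)
mean(rolls)
```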
NOTE: these parameters (covariance and correlation) always describe two paired distributions. For example, RV X is the BMI of a subject and RV Y is the blood pressure of the same subject, so x1 is the BMI of subject 1 and y1 is the blood pressure of subject 1.
Simplification of covariance equation
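Starting from the definition of the sample covariance and expanding the product:
\[Cov(X,Y) = \frac{1}{n-1} \sum{(x_{i} - mean(X)) \cdot (y_{i} - mean(Y))}\]
\[= \frac{1}{n-1} \sum{\left( x_{i} y_{i} - y_{i} \cdot mean(X) - x_{i} \cdot mean(Y) + mean(X) \cdot mean(Y) \right)}\]
\[= \frac{1}{n-1} \left( \sum{\left( x_{i} y_{i} - y_{i} \cdot mean(X) - x_{i} \cdot mean(Y) \right)} + n \cdot mean(X) \cdot mean(Y) \right)\]
Because the sum of all y_i equals n*mean(Y) and the sum of all x_i equals n*mean(X):
\[= \frac{1}{n-1} \left( \sum{x_{i} y_{i}} - n \cdot mean(Y) \cdot mean(X) - n \cdot mean(X) \cdot mean(Y) + n \cdot mean(X) \cdot mean(Y) \right)\]
\[= \frac{1}{n-1} \sum{x_{i} y_{i}} - \frac{n}{n-1} \cdot mean(Y) \cdot mean(X)\]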
But well, nothing is easier in R than cov(a,b).
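The simplified form can be checked numerically against cov(). The sketch below uses two made-up vectors `a` and `b`; `n_ab` and `shortcut` are names introduced only for this check.

```r
set.seed(1)

# Illustrative paired vectors (not data from the text)
a <- rnorm(20)
b <- 0.5 * a + rnorm(20)
n_ab <- length(a)

# Simplified covariance: sum(x*y)/(n-1) - n/(n-1) * mean(x) * mean(y)
shortcut <- sum(a * b) / (n_ab - 1) - n_ab / (n_ab - 1) * mean(a) * mean(b)

# Should agree with the built-in up to floating-point error
all.equal(shortcut, cov(a, b))
```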
Towards a demonstration that a perfect positive correlation is always 1
\[Cor(X,Y) = \frac{\frac{1}{n-1} \sum{(x_{i} - mean(X)) \cdot (y_{i} - mean(Y))}}{\sqrt{\frac{1}{n-1} \sum{(x_{i} - mean(X))^{2}}} \cdot \sqrt{\frac{1}{n-1} \sum{(y_{i} - mean(Y))^{2}}}}\]
\[Cor(X,Y) = \frac{\frac{1}{n-1} \sum{(x_{i} - mean(X)) \cdot (y_{i} - mean(Y))}}{\frac{1}{n-1} \cdot \sqrt{\sum{(x_{i} - mean(X))^{2}}} \cdot \sqrt{\sum{(y_{i} - mean(Y))^{2}}}}\]
\[Cor(X,Y) = \frac{\sum{(x_{i} - mean(X)) \cdot (y_{i} - mean(Y))}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} \cdot \sqrt{\sum{(y_{i} - mean(Y))^{2}}}}\]
Assume Y = 1.5*X:
\[Cor(X,Y) = \frac{\sum{(x_{i} - mean(X)) \cdot 1.5 (x_{i} - mean(X))}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} \cdot \sqrt{\sum{(1.5 x_{i} - 1.5 \cdot mean(X))^{2}}}}\]
\[Cor(X,Y) = \frac{1.5 \cdot \sum{(x_{i} - mean(X))^{2}}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} \cdot \sqrt{\sum{(1.5 x_{i} - 1.5 \cdot mean(X))^{2}}}}\]
\[Cor(X,Y) = \frac{1.5 \cdot \sum{(x_{i} - mean(X))^{2}}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} \cdot \sqrt{\sum{1.5^{2} \cdot (x_{i} - mean(X))^{2}}}}\]
\[Cor(X,Y) = \frac{1.5 \cdot \sum{(x_{i} - mean(X))^{2}}}{1.5 \cdot \sum{(x_{i} - mean(X))^{2}}} = 1\]
Note that in the case of Y = -1.5*X:
\[Cor(X,Y) = \frac{-1.5 \cdot \sum{(x_{i} - mean(X))^{2}}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} \cdot \sqrt{\sum{(-1.5)^{2} \cdot (x_{i} - mean(X))^{2}}}}\]
\[Cor(X,Y) = \frac{-1.5 \cdot \sum{(x_{i} - mean(X))^{2}}}{1.5 \cdot \sum{(x_{i} - mean(X))^{2}}} = -1\]
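The same steps should carry over to any linear relation Y = a + b*X with b different from 0, where a and b are symbols introduced here for illustration. Since mean(Y) = a + b*mean(X), the constant a cancels in every deviation:

\[y_{i} - mean(Y) = (a + b x_{i}) - (a + b \cdot mean(X)) = b \cdot (x_{i} - mean(X))\]

\[Cor(X,Y) = \frac{b \cdot \sum{(x_{i} - mean(X))^{2}}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} \cdot \sqrt{b^{2} \cdot \sum{(x_{i} - mean(X))^{2}}}} = \frac{b}{|b|}\]

So the correlation is 1 for b > 0 and -1 for b < 0, regardless of the intercept a or the size of b.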
Some examples in R
set.seed(123)
X <- sample(1:10, 10)
Y <- 2 + 1.5*X # so a RV Y totally correlated with X
n <- length(X) # which has to be also length(Y)
cov(X,Y)
## [1] 13.75
cov(Y,X)
## [1] 13.75
1/(n-1)*sum((X-mean(X))*(Y-mean(Y)))
## [1] 13.75
cor(X,Y)
## [1] 1
c2 <- rnorm(10)
Y2 <- Y-c2
cor(X,Y2)
## [1] 0.9829346
df <- data.frame(X = X, Y = Y, Y2 = Y2)
df <- gather(df, Vector, Value, -X)
df2 <- data.frame(XMinusMean = (X-mean(X)), YMinusMean = (Y-mean(Y)), Y2MinusMean = (Y2 - mean(Y2)))
df2
##    XMinusMean YMinusMean Y2MinusMean
## 1        -2.5      -3.75 -5.33327361
## 2         2.5       3.75  3.42087517
## 3        -1.5      -2.25 -0.85314739
## 4         1.5       2.25  3.06864423
## 5         0.5       0.75  1.32745335
## 6        -4.5      -6.75 -7.84229042
## 7         4.5       6.75  6.52197755
## 8         3.5       5.25  4.98101993
## 9        -3.5      -5.25 -5.22889134
## 10       -0.5      -0.75 -0.06236749
So even though Y2-mean(Y2) still has the same sign as X-mean(X) for every observation, the correlation is lower than for Y, because the deviations of Y2 are no longer exactly proportional to those of X.
Tr <- ggplot(df, aes(x = X, y = Value, colour = Vector)) +
  geom_point(size = 5) +
  scale_colour_manual(values = cbPalette[2:3]) +
  theme_bw(14)
Tr
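How strongly added noise lowers the correlation can be explored with a short simulation. This is a sketch under assumed settings: the vectors `x_demo` and `y_demo` and the noise levels in `noise_sd` are made up for illustration and use separate names so that X, Y and Y2 stay untouched.

```r
set.seed(2024)

# Perfectly linear relation, same coefficients as Y above
x_demo <- 1:50
y_demo <- 2 + 1.5 * x_demo

# Hypothetical noise levels (standard deviations of the added noise)
noise_sd <- c(0, 1, 5, 20)

# Correlation of x_demo with y_demo plus normal noise of each size
cors <- sapply(noise_sd, function(s) {
  cor(x_demo, y_demo + rnorm(length(x_demo), sd = s))
})
round(setNames(cors, noise_sd), 3)
```

With no noise the correlation is 1; it should shrink as the noise grows relative to the spread of y_demo.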
cor(X,Y)*sd(Y)/sd(X)
## [1] 1.5
cov(X,Y)/sd(X)^2
## [1] 1.5
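Both expressions return 1.5, the coefficient used to construct Y. They are two ways of writing the slope of the least-squares line of Y on X. A minimal check against lm() on made-up noisy data follows; `x_lm`, `y_lm` and the slope variables are names introduced here.

```r
set.seed(7)

# Illustrative noisy linear data (not from the text)
x_lm <- runif(30, 0, 10)
y_lm <- 2 + 1.5 * x_lm + rnorm(30)

slope_cor <- cor(x_lm, y_lm) * sd(y_lm) / sd(x_lm)  # correlation form
slope_cov <- cov(x_lm, y_lm) / var(x_lm)            # covariance form
slope_lm  <- unname(coef(lm(y_lm ~ x_lm))[2])       # fitted regression slope

c(slope_cor = slope_cor, slope_cov = slope_cov, slope_lm = slope_lm)
```

All three values should agree up to floating-point error, even though the noisy data no longer give a correlation of exactly 1.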