The info is coming from:

# library() stops with an error if a package is missing; require() only warns
library(ggplot2)
library(dplyr)
library(tidyr)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

Random Variables, statistics, PDF, CDF, and so on

In statistics you usually have samples from a population. From your samples you make inferences about the population.

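Conditional probabilities

Conditional probabilities often come up in diagnostic tests, for example the probability of having a disease given a positive test result.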

A general formula by Bayes is:

\[P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid B^{c})\,P(B^{c})}\]
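
For a diagnostic test, Bayes' rule turns the test's sensitivity and specificity into the probability of disease given a positive result. The snippet below is a minimal sketch; the sensitivity, specificity and prevalence values are made up for illustration and do not come from these notes.

```r
# Hypothetical test characteristics (illustrative numbers only)
sensitivity <- 0.95   # P(test positive | disease)
specificity <- 0.90   # P(test negative | no disease)
prevalence  <- 0.01   # P(disease)

# Bayes' rule: P(disease | test positive)
p_pos_given_d    <- sensitivity
p_pos_given_no_d <- 1 - specificity
ppv <- (p_pos_given_d * prevalence) /
  (p_pos_given_d * prevalence + p_pos_given_no_d * (1 - prevalence))
ppv   # roughly 0.09
```

With a rare disease, even a fairly accurate test yields mostly false positives, which is why the prevalence term matters so much.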

Expected Values (characterise the center of a distribution)

The expected value of a random variable (RV) is its central tendency: the probability-weighted average of the values it can take.
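
As a small sketch of this idea, the expected value of a discrete RV can be computed as a probability-weighted sum and compared with the mean of simulated draws. The fair six-sided die is an assumed example, not something from the notes above.

```r
# Fair six-sided die (illustrative example)
x <- 1:6
p <- rep(1/6, 6)

# Expected value: probability-weighted sum of the outcomes
ev <- sum(x * p)
ev   # 3.5

# The mean of many simulated rolls should land close to the expected value
set.seed(1)   # arbitrary seed, only for reproducibility
rolls <- sample(x, 10000, replace = TRUE)
mean(rolls)
```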

Variance (characterises the spread/dispersion of a distribution around the mean)
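
A minimal sketch for the same assumed example of a fair die: the variance is the expected squared distance from the mean, and it can also be written as E[X^2] - E[X]^2. Note that R's var() computes the sample variance with the 1/(n-1) factor, so it is not used here.

```r
# Fair six-sided die (illustrative example)
x <- 1:6
p <- rep(1/6, 6)
ev <- sum(x * p)               # expected value

# Variance: probability-weighted squared deviation from the expected value
v <- sum((x - ev)^2 * p)
v                              # about 2.92

# Equivalent shortcut: E[X^2] - E[X]^2
v_shortcut <- sum(x^2 * p) - ev^2
all.equal(v, v_shortcut)
```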

Covariance and Correlation, finally regression

NOTE: these parameters always look at two paired distributions. For example, RV X is the BMI of a subject and RV Y is the blood pressure, so x1 is the BMI of subject 1 and y1 is the blood pressure of subject 1.

Simplification of covariance equation

\[Cov(X,Y) = \frac{1}{n-1} \sum{( (x_{i} - mean(X))*(y_{i} - mean(Y)) )}\]
\[= \frac{1}{n-1} \sum{( x_{i}*y_{i} - y_{i}mean(X) - x_{i}mean(Y) + mean(X)*mean(Y) )}\]
\[= \frac{1}{n-1} * ( \sum{( x_{i}*y_{i} - y_{i}mean(X) - x_{i}mean(Y) )} + n*mean(X)*mean(Y))\]
\[= \frac{1}{n-1} * ( \sum{(x_{i}*y_{i})} - n*mean(Y)mean(X) - n*mean(X)mean(Y) + n*mean(X)*mean(Y))\]
\[= \frac{1}{n-1} * \sum{(x_{i}*y_{i})} - \frac{n}{n-1}*mean(Y)mean(X)\]

But well, nothing is easier in R than cov(a,b).
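
As a quick sanity check, the last expression of the simplification can be compared with cov() on some data. This is only a sketch: the vectors a and b are arbitrary simulated values chosen for illustration.

```r
set.seed(42)                 # arbitrary seed, only for reproducibility
a <- rnorm(20)
b <- 0.5 * a + rnorm(20)
n <- length(a)

# Simplified form: sum(x*y)/(n-1) - n/(n-1) * mean(X) * mean(Y)
shortcut <- sum(a * b) / (n - 1) - n / (n - 1) * mean(a) * mean(b)

all.equal(shortcut, cov(a, b))   # TRUE
```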

Towards a demonstration that a perfect positive correlation is always 1

\[Cor(X,Y) = \frac{\frac{1}{n-1} \sum{(x_{i} - mean(X))*(y_{i} - mean(Y))}}{ \sqrt{\frac{1}{n-1}*\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\frac{1}{n-1}*\sum{(y_{i} - mean(Y))^{2}}} }\]
\[Cor(X,Y) = \frac{\frac{1}{n-1} \sum{(x_{i} - mean(X))*(y_{i} - mean(Y))}}{ \frac{1}{n-1}*\sqrt{\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\sum{(y_{i} - mean(Y))^{2}}} }\]
\[Cor(X,Y) = \frac{ \sum{(x_{i} - mean(X))*(y_{i} - mean(Y))}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\sum{(y_{i} - mean(Y))^{2}}} }\]

Assume Y = 1.5*X, so that mean(Y) = 1.5*mean(X):

\[Cor(X,Y) = \frac{ \sum{(x_{i} - mean(X))*1.5(x_{i} - mean(X))}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\sum{(1.5x_{i} - 1.5mean(X))^{2}}} }\]
\[Cor(X,Y) = \frac{ 1.5*\sum{(x_{i} - mean(X))^{2}}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\sum{(1.5x_{i} - 1.5mean(X))^{2}}} }\]
\[Cor(X,Y) = \frac{ 1.5*\sum{(x_{i} - mean(X))^{2}}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\sum{1.5^{2}*(x_{i} - mean(X))^{2}}} }\]

Since \(\sqrt{1.5^{2}} = 1.5\), the two square roots in the denominator multiply to 1.5 times the sum of squares:

\[Cor(X,Y) = \frac{ 1.5*\sum{(x_{i} - mean(X))^{2}} }{ 1.5*\sum{(x_{i} - mean(X))^{2}} } = 1\]

Note that in the case of Y = -1.5*X the numerator keeps the negative sign, while the denominator does not because \(\sqrt{(-1.5)^{2}} = 1.5\):

\[Cor(X,Y) = \frac{ -1.5*\sum{(x_{i} - mean(X))^{2}}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\sum{(-1.5)^{2}*(x_{i} - mean(X))^{2}}} }\]
\[Cor(X,Y) = \frac{ -1.5*\sum{(x_{i} - mean(X))^{2}} }{ 1.5*\sum{(x_{i} - mean(X))^{2}} } = -1\]
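
The factor 1.5 cancels in the derivation, so any positive slope should give a correlation of 1 and any negative slope a correlation of -1. A minimal numerical sketch, with arbitrary illustrative values for x:

```r
x <- c(2, 5, 1, 9, 4)             # arbitrary illustrative values

r_pos   <- cor(x,  1.5 * x)       # positive slope
r_neg   <- cor(x, -1.5 * x)       # negative slope
r_small <- cor(x,  0.01 * x + 7)  # tiny positive slope plus an offset

c(r_pos, r_neg, r_small)          # 1, -1, 1 up to floating-point rounding
```

The size of the slope and any constant offset do not change the correlation; only the sign of the slope does.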

Some examples in R

set.seed(123)
X <- sample(1:10, 10)
Y <- 2 + 1.5*X # Y is a perfect linear function of X
n <- length(X) # must also equal length(Y)
cov(X,Y)
## [1] 13.75
cov(Y,X)
## [1] 13.75
1/(n-1)*sum((X-mean(X))*(Y-mean(Y)))
## [1] 13.75
cor(X,Y)
## [1] 1
c2 <- rnorm(10) # standard normal noise
Y2 <- Y-c2      # Y with the noise subtracted, so no longer perfectly linear in X
cor(X,Y2)
## [1] 0.9829346
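# Long format for plotting: one row per X value and vector (Y or Y2)
df <- data.frame(X = X, Y = Y, Y2 = Y2)
df <- gather(df, Vector, Value, -X)
# Deviations from the mean, i.e. the terms that enter the covariance sum
df2 <- data.frame(XMinusMean = (X-mean(X)), YMinusMean = (Y-mean(Y)), Y2MinusMean = (Y2 - mean(Y2)))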
df2
##    XMinusMean YMinusMean Y2MinusMean
## 1        -2.5      -3.75 -5.33327361
## 2         2.5       3.75  3.42087517
## 3        -1.5      -2.25 -0.85314739
## 4         1.5       2.25  3.06864423
## 5         0.5       0.75  1.32745335
## 6        -4.5      -6.75 -7.84229042
## 7         4.5       6.75  6.52197755
## 8         3.5       5.25  4.98101993
## 9        -3.5      -5.25 -5.22889134
## 10       -0.5      -0.75 -0.06236749

So even though Y2-mean(Y2) still has the same sign as X-mean(X) in every row, the correlation is lower than for Y (about 0.98 instead of 1), because the deviations are no longer exactly proportional.

Tr <- ggplot(df, aes(x = X, y = Value, colour = Vector)) +
        geom_point(size = 5) +
        scale_colour_manual(values = cbPalette[2:3]) +
        theme_bw(14)
Tr

Linear Regression

  • In least squares linear regression, the solution for the slope is: \[\beta = cor(X,Y)*sd(Y)/sd(X)\]
    • which is equal to the following, because cor(X,Y) = cov(X,Y)/(sd(X)*sd(Y)): \[\beta = cov(X,Y)/sd(X)^2\]
cor(X,Y)*sd(Y)/sd(X)
## [1] 1.5
cov(X,Y)/sd(X)^2
## [1] 1.5
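
Both expressions can also be checked against the slope that lm() fits. The snippet below is a sketch on noisy data, so the match is not trivial; the simulated x and y are assumptions for illustration and are not the X and Y used above.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- 1:10
y <- 2 + 1.5 * x + rnorm(10)     # linear trend plus noise

fit <- lm(y ~ x)                 # ordinary least squares fit
slope_lm  <- unname(coef(fit)["x"])
slope_cor <- cor(x, y) * sd(y) / sd(x)
slope_cov <- cov(x, y) / sd(x)^2

all.equal(slope_lm, slope_cor)   # TRUE
all.equal(slope_lm, slope_cov)   # TRUE
```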