150516_StatisticalInferenceARegressionBasics

The information comes from:

swirl and the Coursera course Statistical Inference

require(ggplot2)
require(dplyr)
require(tidyr)

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

Random Variables, statistics, PDF, CDF, and so on

In statistics you usually have samples of a population. From your samples you make inferences on the population.

for these inferences, frequentists are distinguished from Bayesians (Bayesians update probabilities whenever there is new information)
a statistic always refers to a sample
a random variable (RV) is the outcome of an experiment (or a calculation)
- independent and identically distributed (iid) RVs: independent means that the outcomes are statistically unrelated, identically distributed means that all outcomes have been drawn from the same population distribution
- RV can be discrete or continuous.
- a continuous RV has a probability density function (PDF)
- a discrete RV has a probability mass function (PMF)
- the cumulative distribution function (CDF) is the probability that X is less than or equal to x: F(x) = P(X <= x). For a continuous RV the PDF is the derivative of the CDF, and the CDF is obtained by integrating the PDF (e.g. with the integrate function in R), i.e. it is the area under the PDF up to x.
- the survivor function is the probability that X is greater than x: S(x) = P(X > x) = 1 - F(x).
statistics are sample estimators of the population value (the estimand)
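
The relation between PDF, CDF and survivor function can be checked numerically. A minimal sketch, assuming a standard normal distribution and the cut-off 1.96 (both chosen only for illustration, they are not from the course material):

```r
# dnorm() is the PDF of the standard normal, pnorm() its CDF
x <- 1.96  # illustrative cut-off (assumption)

# CDF as the area under the PDF from -Inf to x
cdf_by_integration <- integrate(dnorm, lower = -Inf, upper = x)$value
cdf_builtin <- pnorm(x)

# survivor function: P(X > x) = 1 - CDF
surv <- pnorm(x, lower.tail = FALSE)

c(cdf_by_integration, cdf_builtin, surv)
```

The first two values should agree up to the numerical error of integrate, and the CDF and the survivor function should add up to 1.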

conditional probabilities

Often in diagnostic tests.

sensitivity: P(+|D)
specificity: P(-|~D)
- both should be close to 1
odds: in general, a probability divided by 1 minus that probability, e.g. P(X=x)/(1-P(X=x))
positive predictive value P(D|+)
negative predictive value P(~D|-)
diagnostic likelihood ratio of positive test DLR_+: P(+|D)/P(+|~D)
- note the DLR is not an odds. But it relates the post-test odds of having disease to the pretest-odds of having disease: P(D|+)/P(~D|+) = DLR_+ * P(D)/P(~D)
- note the DLR_+ contains the sensitivity as numerator, and 1-specificity as denominator, so it should be big
diagnostic likelihood ratio of negative test DLR_-: P(-|D)/P(-|~D)
- It again relates the post-test odds of having disease to the pretest-odds of having disease: P(D|-)/P(~D|-) = DLR_- * P(D)/P(~D)
- note, the DLR_- has 1-sensitivity as numerator and specificity as denominator, so it should be small.

A general Formula by Bayes is:

\(P(D|+) = P(+|D) \times P(D) / ( P(+|D) \times P(D) + P(+|\sim{D}) \times P(\sim{D}) )\)
- so Bayes' formula is all about conditional probabilities, which explains why Bayesians update probabilities as soon as there is new information.
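
The quantities above can be computed directly. A minimal sketch, assuming a hypothetical test with sensitivity 0.90, specificity 0.95 and a disease prevalence of 0.01 (all three numbers are made up for illustration):

```r
# hypothetical test characteristics (assumptions, not from the course)
sens <- 0.90   # P(+|D)
spec <- 0.95   # P(-|~D)
prev <- 0.01   # P(D), the pre-test probability of disease

# Bayes' formula for the positive predictive value P(D|+)
ppv <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
# the same idea for the negative predictive value P(~D|-)
npv <- spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

# diagnostic likelihood ratios
dlr_pos <- sens / (1 - spec)
dlr_neg <- (1 - sens) / spec

# post-test odds = DLR * pre-test odds
pre_odds <- prev / (1 - prev)
post_odds_pos <- ppv / (1 - ppv)
all.equal(post_odds_pos, dlr_pos * pre_odds)
```

With these numbers the positive predictive value is only about 0.15 although sensitivity and specificity are both high, because the disease is rare.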

Expected Values (characterise the center of a distribution)

Expected value of a RV is the central tendency of the RV

E(aX + bY) = aE(X)+bE(Y)
- the expectation of a sum is the sum of the expectations
for a continuous RV you have to integrate while for a discrete you just have to sum.
- E(X) for continuous X is the area under the function t*f(t), where f(t) is the PDF
the expected value of the sample mean is the population mean, i.e. the sample mean is an unbiased estimator of the population mean
an estimator e of a parameter v is unbiased if its expected value equals v, i.e. E(e) = v
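
The difference between summing and integrating can be sketched as follows. The fair die and the exponential distribution with rate 2 are arbitrary examples, not taken from the course:

```r
# discrete RV: a fair die, E(X) = sum of x * P(X = x)
x <- 1:6
p <- rep(1 / 6, 6)
e_die <- sum(x * p)

# continuous RV: exponential with rate 2 (assumption), E(X) = 1/rate
# E(X) is the area under t * f(t), with f(t) the PDF dexp()
e_exp <- integrate(function(t) t * dexp(t, rate = 2),
                   lower = 0, upper = Inf)$value

c(e_die, e_exp)
```

The die has expected value 3.5 and the exponential RV has expected value 1/2, the latter up to the numerical error of integrate.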

Variance (characterises the spread/dispersion of a distribution around the mean)

sample variance should be an unbiased estimator of the population variance
- see for example this explanation of why the unbiased sample variance divides by n-1
defined as the expected value of the squared distance from the mean
- Var(X) = E( (X-mu)^2 ) = E(X^2) - E(X)^2
- the variance is also the covariance of a random variable with itself. \[ Var(X) = Cov(X,X) \]
- a is constant Var(aX) = a^2 Var(X)
- var(a+b) = var(a) + var(b) + 2cov(a,b)
- you find the proof in 150521_ProportionalityMeasureTheta, or on youtube.
- var(a-b) = var(a) + var(b) - 2cov(a,b)
population variance is sigma^2, sample variance is s^2
- sample variance divides by n-1, the degrees of freedom in the system (only n-1 deviations from the sample mean are independent, the last one can be calculated from the others because the deviations sum to zero)
- sample variance is again an unbiased estimator of the population variance (its expected value equals the estimand)
the standard deviation of a statistic is called standard error.
- for the sample mean, the standard error is sigma/sqrt(n), which is estimated by s/sqrt(n) with s^2 being the sample variance. So if you simulate many samples of n = 10 and take the sd of the means of these samples, it will be approximately equal to sigma/sqrt(n).
- in other words: the variance of the sample mean is the population variance divided by n (sample size)
- and again: the estimated standard error of the sample mean is the sample standard deviation s divided by sqrt(n)
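
Both the standard error and the variance of a sum can be checked by simulation. A minimal sketch, assuming a standard normal population (sigma = 1); the seed and the number of simulations are arbitrary:

```r
set.seed(1)      # arbitrary seed
n <- 10          # sample size
sigma <- 1       # population sd (assumption: standard normal population)
n_sim <- 10000   # number of simulated samples

# one sample mean per simulated sample
sample_means <- replicate(n_sim, mean(rnorm(n, mean = 0, sd = sigma)))

sd(sample_means)   # empirical standard error of the mean
sigma / sqrt(n)    # theoretical standard error

# var(a+b) = var(a) + var(b) + 2cov(a,b) holds exactly for sample values
a <- rnorm(n)
b <- rnorm(n)
all.equal(var(a + b), var(a) + var(b) + 2 * cov(a, b))
```

The empirical value only approximates sigma/sqrt(n); the agreement improves with more simulated samples.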

Covariance and Correlation, finally regression

NOTE: these parameters always look at two paired distributions. For example, RV X is the BMI of a subject and RV Y is the blood pressure. So x1 is the BMI of subject 1 and y1 the blood pressure of subject 1.

Covariance Cov(X,Y): \(\frac{1}{n-1} \sum{(x_{i} - mean(X))*(y_{i} - mean(Y))}\) (see below, can be simplified)
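- so if xi and yi are both above or both below their respective means you get a positive number, if one is above its mean and the other below you get a negative number!
- the covariance carries the units of X times the units of Y, so its absolute value can be arbitrarily big depending on the scale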
Correlation Cor(X,Y): \(Cov(X,Y)/(sd(X)*sd(Y))\)
- the correlation ranges from -1 to 1, and it is unitless
- correlation measures the strength of a linear relationship between X and Y
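
That the covariance depends on the units while the correlation does not can be sketched with simulated data. The height and weight values below are hypothetical and only serve as an illustration:

```r
set.seed(42)                                    # arbitrary seed
height_m  <- rnorm(50, mean = 1.75, sd = 0.1)   # hypothetical heights in metres
weight_kg <- 70 + 40 * (height_m - 1.75) + rnorm(50, sd = 5)  # hypothetical weights

height_cm <- 100 * height_m                     # same data, different unit

cov(height_m, weight_kg)
cov(height_cm, weight_kg)   # 100 times the covariance in metres
cor(height_m, weight_kg)
cor(height_cm, weight_kg)   # same correlation as in metres
```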

Simplification of covariance equation

for this just note that \(\sum{x_{i}} = n*mean(X)\)

\[Cov(X,Y) = \frac{1}{n-1} \sum{( (x_{i} - mean(X))*(y_{i} - mean(Y)) )}\]
\[Cov(X,Y) = \frac{1}{n-1} \sum{( x_{i}*y_{i} - y_{i}mean(X) - x_{i}mean(Y) + mean(X)*mean(Y) )}\]
\[Cov(X,Y) = \frac{1}{n-1} * ( \sum{( x_{i}*y_{i} - y_{i}mean(X) - x_{i}mean(Y) )} + n*mean(X)*mean(Y))\]
\[Cov(X,Y) = \frac{1}{n-1} *( \sum{(x_{i}*y_{i})} - n*mean(Y)mean(X) - n*mean(X)mean(Y) + n*mean(X)*mean(Y))\]
\[Cov(X,Y) = \frac{1}{n-1} * \sum{(x_{i}*y_{i})} - \frac{n}{n-1}*mean(Y)mean(X) \]

But well, nothing is easier in R than cov(a,b).
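
The simplified formula can be compared against the definition and against cov. A minimal sketch with simulated vectors (seed, length and slope are arbitrary):

```r
set.seed(7)                  # arbitrary seed
x <- rnorm(20)
y <- 0.5 * x + rnorm(20)     # hypothetical paired data
n <- length(x)

# covariance by definition
cov_definition <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
# covariance by the simplified formula
cov_simplified <- sum(x * y) / (n - 1) - n / (n - 1) * mean(x) * mean(y)

all.equal(cov_definition, cov_simplified)
all.equal(cov_simplified, cov(x, y))
```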

Towards a demonstration that a perfect positive correlation is always 1

\[Cor(X,Y) = \frac{\frac{1}{n-1} \sum{(x_{i} - mean(X))*(y_{i} - mean(Y))}}{ \sqrt{\frac{1}{n-1}*\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\frac{1}{n-1}*\sum{(y_{i} - mean(Y))^{2}}} }\]
\[Cor(X,Y) = \frac{\frac{1}{n-1} \sum{(x_{i} - mean(X))*(y_{i} - mean(Y))}}{ \frac{1}{n-1}*\sqrt{\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\sum{(y_{i} - mean(Y))^{2}}} }\]
\[Cor(X,Y) = \frac{ \sum{(x_{i} - mean(X))*(y_{i} - mean(Y))}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\sum{(y_{i} - mean(Y))^{2}}} }\]

Assume Y = 1.5*X, so that mean(Y) = 1.5*mean(X):
\[Cor(X,Y) = \frac{ \sum{(x_{i} - mean(X))*1.5(x_{i} - mean(X))}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\sum{(1.5x_{i} - 1.5mean(X))^{2}}} }\]

\[Cor(X,Y) = \frac{ 1.5*\sum{(x_{i} - mean(X))^{2}}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\sum{(1.5x_{i} - 1.5mean(X))^{2}}} }\]
\[Cor(X,Y) = \frac{ 1.5*\sum{(x_{i} - mean(X))^{2}}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\sum{1.5^{2}*(x_{i} - mean(X))^{2}}} }\]
\[Cor(X,Y) = \frac{ 1.5*\sum{(x_{i} - mean(X))^{2}} }{ 1.5*\sum{(x_{i} - mean(X))^{2}} } = 1\]

Note that in the case Y = -1.5*X the square root in the denominator removes the sign, while the numerator keeps it:
\[Cor(X,Y) = \frac{ -1.5*\sum{(x_{i} - mean(X))^{2}}}{\sqrt{\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\sum{(-1.5)^{2}*(x_{i} - mean(X))^{2}}} }\]
\[Cor(X,Y) = \frac{ -1.5*\sum{(x_{i} - mean(X))^{2}} }{ 1.5*\sum{(x_{i} - mean(X))^{2}} } = -1\]

You still have to show it for Y = 2 + 1.5*X.
You also still have to show that the absolute value of the numerator is at most the denominator. Then it is clear that Cor is always between -1 and 1.
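
Both open points can be sketched briefly. For Y = 2 + 1.5*X the constant cancels in the deviations from the mean:

\[mean(Y) = 2 + 1.5*mean(X) \Rightarrow y_{i} - mean(Y) = 1.5*(x_{i} - mean(X))\]

so the calculation is identical to the case Y = 1.5*X and again Cor(X,Y) = 1. For the bound, one standard route (not part of the original notes) is the Cauchy-Schwarz inequality, applied to \(a_{i} = x_{i} - mean(X)\) and \(b_{i} = y_{i} - mean(Y)\):

\[\left(\sum{a_{i}*b_{i}}\right)^{2} \leq \sum{a_{i}^{2}} * \sum{b_{i}^{2}}\]

Taking square roots on both sides gives

\[\left|\sum{(x_{i} - mean(X))*(y_{i} - mean(Y))}\right| \leq \sqrt{\sum{(x_{i} - mean(X))^{2}}} * \sqrt{\sum{(y_{i} - mean(Y))^{2}}}\]

which is exactly the statement that the absolute value of the numerator of Cor(X,Y) is at most its denominator.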

Some examples in R

set.seed(123)
X <- sample(1:10, 10)
Y <- 2 + 1.5*X # so a RV Y totally correlated with X
n <- length(X) # which has to be also length(Y)
cov(X,Y)

## [1] 13.75

cov(Y,X)

## [1] 13.75

1/(n-1)*sum((X-mean(X))*(Y-mean(Y)))

## [1] 13.75

cor(X,Y)

## [1] 1

c2 <- rnorm(10)
Y2 <- Y-c2
cor(X,Y2)

## [1] 0.9829346

df <- data.frame(X = X, Y = Y, Y2 = Y2)
df <- gather(df, Vector, Value, -X)
df2 <- data.frame(XMinusMean = (X-mean(X)), YMinusMean = (Y-mean(Y)), Y2MinusMean = (Y2 - mean(Y2)))
df2

##    XMinusMean YMinusMean Y2MinusMean
## 1        -2.5      -3.75 -5.33327361
## 2         2.5       3.75  3.42087517
## 3        -1.5      -2.25 -0.85314739
## 4         1.5       2.25  3.06864423
## 5         0.5       0.75  1.32745335
## 6        -4.5      -6.75 -7.84229042
## 7         4.5       6.75  6.52197755
## 8         3.5       5.25  4.98101993
## 9        -3.5      -5.25 -5.22889134
## 10       -0.5      -0.75 -0.06236749

So even though Y2-mean(Y2) still has the same sign as X-mean(X) in every row, the correlation decreased compared to Y, because the deviations of Y2 are no longer exactly proportional to those of X.

Tr <- ggplot(df, aes(x = X, y = Value, colour = Vector)) +
        geom_point(size = 5) +
        scale_colour_manual(values = cbPalette[2:3]) +
        theme_bw(14)

Tr

Linear Regression

In least squares linear regression of Y on X, the solution for the slope is: \[\beta = cor(X,Y)*sd(Y)/sd(X)\]
- which is equal to: \[\beta = cov(X,Y)/sd(X)^2\]

cor(X,Y)*sd(Y)/sd(X)

## [1] 1.5

cov(X,Y)/sd(X)^2

## [1] 1.5
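
The slope formula can also be compared with the fit returned by lm. A minimal sketch with a hypothetical noisy relationship (seed, intercept, slope and noise level are arbitrary):

```r
set.seed(123)                    # arbitrary seed
x <- 1:10
y <- 2 + 1.5 * x + rnorm(10)     # hypothetical noisy linear relationship

fit <- lm(y ~ x)
slope_lm <- unname(coef(fit)["x"])
slope_formula <- cor(x, y) * sd(y) / sd(x)

all.equal(slope_lm, slope_formula)

# the fitted line passes through (mean(x), mean(y)), which gives the intercept
intercept_formula <- mean(y) - slope_formula * mean(x)
all.equal(unname(coef(fit)["(Intercept)"]), intercept_formula)
```

Unlike in the noise-free example, the estimated slope here is only close to 1.5, but it still agrees with the formula.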