Part I. Deviation, Variance and Standard Deviation

Variability measures how spread out the data values are.

1.1 Deviation

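What is it? Deviation = the difference between an observed value and the estimate of location (for example, the mean). Because the signed deviations from the mean always sum to zero, we average their absolute values instead.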

Formula:

\[ \text{Mean absolute deviation} = \frac{\sum_{i=1}^n |x_i - \overline{x}|}{n} \]

Calculation:
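As a worked example, assuming the same five values used in the code (1, 3, 5, 7, 9), the mean and the mean absolute deviation come out as:

\[ \overline{x} = \frac{1+3+5+7+9}{5} = 5 \]

\[ \text{Mean absolute deviation} = \frac{|1-5| + |3-5| + |5-5| + |7-5| + |9-5|}{5} = \frac{4+2+0+2+4}{5} = \frac{12}{5} = 2.4 \]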

Programming:

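x <- c(1,3,5,7,9)
x_bar <- mean(x)   #mean of the observed values
n <- length(x)     #number of observed values

#average of the absolute deviations from the mean
mean_absolute_deviation <- sum(abs(x - x_bar))/n
mean_absolute_deviation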

## [1] 2.4

1.2 Variance

What is it? Variance = the average of the squared deviations from the mean.

Formula:

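\[ \sigma^2 = \frac{\sum_{i=1}^n (x_i - \overline{x})^2}{n-1} \]

Where:

\(x_i\) = the i-th observed value
\(\overline{x}\) = the mean of the observed values
\(n\) = the number of observed values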

Why do we use n-1 instead of n?

If we calculate the variance of a sample (a small part of the population), use n-1; if we have the whole population, use n.
Sample vs. population: In most cases, all we have is a sample. For example, suppose we want the variance of student heights in a school of about 1000 students. To get the true value we would need to measure every student’s height, so the population is 1000 students. That is usually not practical, so we measure a sample of about 100 students instead.
Why not n? The deviations are measured from the sample mean, which is computed from the same sample and therefore sits closer to the sample values than the true population mean does. As a result, dividing by n tends to underestimate the population variance. Dividing by n-1 increases the value slightly to compensate. The correction matters most for small samples and becomes negligible for large ones, and on average it puts the sample variance at the population variance.
Why n-1? If we repeatedly draw random samples from a population and compute the variance of each, the average of the sample variances computed with n-1 matches the population variance, while the average computed with n falls short by a factor of (n-1)/n. This shows up in simulation and can also be proved mathematically (it is known as Bessel’s correction).
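The simulation argument above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the population is taken to be normal with mean 0 and standard deviation 2 (so the true variance is 4), and the seed, sample size and number of repetitions are arbitrary choices, not values from the text.

```r
# Sketch: compare the n and n-1 denominators over many random samples.
set.seed(42)                  # arbitrary seed, for reproducibility only
true_variance <- 4            # assumed population: Normal(mean = 0, sd = 2)
n <- 5                        # small sample size, where the bias is most visible
n_samples <- 10000            # number of repeated samples (arbitrary)

# Each column of `samples` is one random sample of size n
samples <- matrix(rnorm(n * n_samples, mean = 0, sd = 2), nrow = n)

var_n_minus_1 <- apply(samples, 2, var)   # var() divides by n-1
var_n <- var_n_minus_1 * (n - 1) / n      # same sums of squares, divided by n

mean(var_n_minus_1)   # should land close to the true variance, 4
mean(var_n)           # should land close to 4 * (n-1)/n = 3.2
```

With a small n the gap between the two averages is easy to see; increasing n makes the factor (n-1)/n approach 1, so the two versions converge.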

Calculation:
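As a worked example, assuming the same five values used in the code (1, 3, 5, 7, 9), whose mean is 5, the sample variance is:

\[ \sigma^2 = \frac{(1-5)^2 + (3-5)^2 + (5-5)^2 + (7-5)^2 + (9-5)^2}{5-1} = \frac{16+4+0+4+16}{4} = \frac{40}{4} = 10 \]

If the five values were treated as the whole population, the same sum would be divided by n instead:

\[ \frac{40}{5} = 8 \]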

Programming:

#manually (sample variance, denominator n-1)
x <- c(1,3,5,7,9)
x_bar <- mean(x)   #mean of the observed values
n <- length(x)     #number of observed values

variance <- sum((x-x_bar)^2) / (n-1)
variance

## [1] 10

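#built-in function: var() divides by n-1, so var(x) alone returns 10;
#multiplying by (n-1)/n gives the population variance (denominator n)
var(x)*(n-1)/n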

## [1] 8

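1.3 Standard Deviation

What is it? Standard deviation = the square root of the variance, expressed in the same units as the data.

Formula: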

\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^n (x_i - \overline{x})^2}{n-1}} \]

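Empirical rule: a rough guide to how far data points fall from the mean in a normal distribution. About 68% of values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.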

Programming:

#calculate standard deviation
sqrt(variance)

## [1] 3.162278

#built-in function
sd(x)

## [1] 3.162278
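The empirical rule mentioned above can be checked against the normal distribution itself. The snippet below is a small sketch using the base R function pnorm() (the cumulative distribution function of the standard normal); the variable names k and coverage are illustrative only.

```r
# Share of a normal distribution within k standard deviations of the mean
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)   # P(-k < Z < k) for a standard normal Z
round(coverage, 4)
# 0.6827 0.9545 0.9973
```

These are the exact proportions that the 68 / 95 / 99.7 rule rounds off; for data that are not close to normal, the rule is only a loose approximation.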