Part 1. Deviation, Variance and Standard Deviation

Variability measures how spread out the data values are.

1.1 Deviation

What is it? Deviation = the difference between an observed value and the estimate of location (for example, the mean).

Formula:

\[ Mean\ absolute\ deviation = \frac{\sum_{i=1}^n |x_i - \overline{x}|}{n} \]

Calculation:
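
As a worked example, assume the five values 1, 3, 5, 7, 9 (the same vector used in the code of this section). Their mean is 5, so:

\[ Mean\ absolute\ deviation = \frac{|1-5| + |3-5| + |5-5| + |7-5| + |9-5|}{5} = \frac{4 + 2 + 0 + 2 + 4}{5} = \frac{12}{5} = 2.4 \]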

Programming:

# sample data
x <- c(1,3,5,7,9)
x_head <- mean(x)   # sample mean
n <- length(x)      # number of observations

mean_absolute_deviation <- sum(abs(x - x_head))/n
mean_absolute_deviation
## [1] 2.4
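
A possible source of confusion: base R's `mad()` is the median absolute deviation, not the mean absolute deviation computed above, and by default it multiplies the result by 1.4826. The short sketch below compares the two on the same vector; the variable names are only illustrative.

```r
x <- c(1, 3, 5, 7, 9)

# mean absolute deviation around the mean (the formula in this section)
mean_abs_dev <- mean(abs(x - mean(x)))
mean_abs_dev
## [1] 2.4

# median absolute deviation around the median, without the scaling constant
median_abs_dev <- mad(x, constant = 1)
median_abs_dev
## [1] 2

# default mad(): the same quantity multiplied by 1.4826
mad(x)
## [1] 2.9652
```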

1.2 Variance

Formula:

\[ \sigma^2 = \frac{\sum_{i=1}^n (x_i - \overline{x})^2}{n-1} \]

Where:

  • \(\sigma^2\) = sample variance
  • \(x_i\) = value of one observation
  • \(\overline{x}\) = mean value of all observations
  • n = number of observations

Why do we use n-1 instead of n?

  • If we calculate the variance of a sample (a small part of the population), use n-1; if the data covers the whole population, use n.
  • Sample vs. population: In most cases, all we have is a sample. For example, suppose we want the variance of student heights in a school of about 1000 students. To get the true value we would need to measure every student's height, so the population is the 1000 students. That is often not practical, so we measure a sample of, say, 100 students instead.
  • Why not n? The deviations are measured from the sample mean, which is computed from the same data and therefore sits closer to the sample values than the true population mean does. Dividing by n thus tends to underestimate the population variance. Dividing by n-1 increases the value slightly: the correction is noticeable for small samples and negligible for large ones, and on average it puts the sample variance at the population variance.
  • Why n-1? When population variance and sample variance are compared by simulation (with samples picked at random), the mean of the sample variances over many samples tends toward the population variance when the denominator is n-1. This is known as Bessel's correction, and it can also be shown mathematically that the n-1 version is unbiased.
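
The simulation argument above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the population (1000 simulated heights), the sample size of 10 and the number of repetitions are all made up for the example, and samples are drawn with replacement so that the draws are independent.

```r
# Simulation sketch: compare dividing by n with dividing by n - 1.
# Population, sample size and repetition count are illustrative assumptions.
set.seed(42)
population <- rnorm(1000, mean = 165, sd = 10)   # hypothetical heights in cm
pop_var <- sum((population - mean(population))^2) / length(population)

sample_size <- 10
n_rep <- 20000
var_n <- numeric(n_rep)          # sample variances with denominator n
var_n_minus_1 <- numeric(n_rep)  # sample variances with denominator n - 1

for (i in seq_len(n_rep)) {
  s <- sample(population, sample_size, replace = TRUE)
  ss <- sum((s - mean(s))^2)     # sum of squared deviations from the sample mean
  var_n[i] <- ss / sample_size
  var_n_minus_1[i] <- ss / (sample_size - 1)
}

# Average of each estimator relative to the true population variance
ratio_n <- mean(var_n) / pop_var
ratio_n_minus_1 <- mean(var_n_minus_1) / pop_var
c(n = ratio_n, n_minus_1 = ratio_n_minus_1)
```

With these settings the first ratio should come out near (n-1)/n = 0.9 and the second near 1, which is the underestimation the bullets describe.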

Calculation:
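
As a worked example, assume the five values 1, 3, 5, 7, 9 (the same vector used in the code of this section). Their mean is 5, so:

\[ \sigma^2 = \frac{(1-5)^2 + (3-5)^2 + (5-5)^2 + (7-5)^2 + (9-5)^2}{5-1} = \frac{16 + 4 + 0 + 4 + 16}{4} = \frac{40}{4} = 10 \]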

Programming:

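# manually (sample variance, denominator n - 1)
x <- c(1,3,5,7,9)
x_head <- mean(x)
n <- length(x)

variance <- sum((x-x_head)^2) / (n-1)
variance
## [1] 10
# built-in function: var() also uses the n - 1 denominator
var(x)
## [1] 10
# population variance (denominator n): rescale the sample variance
var(x)*(n-1)/n
## [1] 8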

1.3 Standard deviation

Formula:

\[ \sigma = \sqrt{Variance} \]

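Empirical rule: in a normal distribution, about 68%, 95% and 99.7% of the values lie within 1, 2 and 3 standard deviations of the mean. This gives a rough estimate of how far a data point is from the mean.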

Programming:

#calculate standard deviation
sqrt(variance)
## [1] 3.162278
#built-in function
sd(x)
## [1] 3.162278
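
The empirical rule can be checked against the normal distribution function `pnorm()`. This is a minimal sketch; the variable names are only illustrative.

```r
# Share of a normal distribution lying within k standard deviations of the mean
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd * 100, 1)
## [1] 68.3 95.4 99.7
```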

Part 2. Quantile, Percentile, Quartile

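Quantile = a cut point that divides the sorted data into equal-sized groups

Percentile = a quantile that divides the data into 100 equal parts

Quartile = a quantile that divides the data into 4 equal parts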

2.1 Percentiles

https://www.mathsisfun.com/

If we say X is the 80th percentile, we mean that in our data set 80% of the values are less than or equal to X and 20% of the values are greater than or equal to X.

Formula 1:

\[ Percentile = \frac{Number\ of\ values\ below\ x} {Total\ number\ of\ values - 1} \times 100\% \]

Formula 2:

\[ Percentile = \frac{Number\ of\ values\ below\ or\ equal\ to\ x} {Total\ number\ of\ values} \times 100\% \]

Calculate:

Programming:

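# input data
data_a <- c(1:10)
data_b <- c(1:1000)
# calculate the percentile of the value xi within data
# formula 1: values strictly below xi, divided by (n - 1)
formula_1 <- function(xi,data){
  sum(data < xi)/(length(data)-1)*100
}

# formula 2: values below or equal to xi, divided by n
formula_2 <- function(xi,data){
  sum(data <= xi)/length(data)*100
}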
formula_1(4,data_a)
## [1] 33.33333
formula_2(4,data_a)
## [1] 40
formula_1(400,data_b)
## [1] 39.93994
formula_2(400,data_b)
## [1] 40
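
Formula 2 is the empirical cumulative distribution function expressed as a percentage, so base R's `ecdf()` can serve as a cross-check. A minimal sketch, where `percentile_rank` is just an illustrative name:

```r
data_a <- 1:10

# ecdf() returns a function giving the share of values <= its argument
percentile_rank <- ecdf(data_a)
rank_of_4 <- percentile_rank(4) * 100
rank_of_4
## [1] 40
```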

2.2 Interquartile range (IQR)

Using percentiles, we can look at the spread of the sorted data to estimate dispersion. To reduce sensitivity to outliers, we can drop some values at each end and describe variability with a few cut points instead. Typical choices are the 5th, 25th, 50th, 75th and 95th percentiles. The IQR is the difference between the 75th percentile and the 25th percentile.

Calculate:

Programming:

interquartile <- quantile(data_b,c(0.25,0.75))
interquartile
##    25%    75% 
## 250.75 750.25
interquartile[2] - interquartile[1]
##   75% 
## 499.5
IQR(data_b)
## [1] 499.5
summary(data_b)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   250.8   500.5   500.5   750.2  1000.0
quantile(data_b,c(0.05,0.25,0.5,0.75,0.95))
##     5%    25%    50%    75%    95% 
##  50.95 250.75 500.50 750.25 950.05
boxplot(data_b)
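
To illustrate why the IQR is less sensitive to outliers than the standard deviation, the sketch below adds one extreme value to a small vector and compares both measures. The vector and the value 1000 are made up for the example.

```r
x <- c(1, 3, 5, 7, 9)
x_outlier <- c(x, 1000)   # same data plus one hypothetical extreme value

# the standard deviation reacts strongly to the single extreme value
sd(x)
## [1] 3.162278
sd_outlier <- sd(x_outlier)
sd_outlier   # several hundred, far above the value for x

# the IQR barely moves
iqr_x <- IQR(x)
iqr_x
## [1] 4
iqr_outlier <- IQR(x_outlier)
iqr_outlier
## [1] 5
```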

Part 3. Application on EDA