Variability measures how spread out the data values are.
What is it? Deviation = the difference between an observed value and the estimate of location.
Formula:
\[ Mean\ absolute\ deviation = \frac{\sum_{i=1}^n |x_i - \overline{x}|}{n} \]
Calculation:
Programming:
x <- c(1,3,5,7,9)
x_head <- mean(x)
n <- length(x)
mean_absolute_deviation <- sum(abs(x - x_head))/n
mean_absolute_deviation
## [1] 2.4
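As a hedged aside, the calculation above can be wrapped in a small helper. The name `mean_abs_dev` is hypothetical and introduced only for illustration. Note that base R's `mad()` is a different statistic: it computes the median absolute deviation from the median, scaled by 1.4826 by default.

```r
# Hypothetical helper (not part of base R): mean absolute deviation from the mean
mean_abs_dev <- function(x) {
  mean(abs(x - mean(x)))
}

x <- c(1, 3, 5, 7, 9)
mean_abs_dev(x)        # 2.4, same as the manual calculation

# Base R's mad() is the *median* absolute deviation, scaled by 1.4826 by default
mad(x)                 # 2.9652
mad(x, constant = 1)   # 2, the unscaled median absolute deviation
```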
Formula:
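\[ \sigma^2 = \frac{\sum_{i=1}^n (x_i - \overline{x})^2}{n-1} \]
Where: \(\sigma^2\) is the variance, \(x_i\) is each observed value, \(\overline{x}\) is the mean of the values, and \(n\) is the number of values.
Why do we use n-1 instead of n? The deviations are measured from the sample mean \(\overline{x}\), which is itself computed from the same data, so only n-1 of them are free to vary. Dividing by n would underestimate the population variance on average; dividing by n-1 corrects this bias.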
Calculation:
Programming:
#manually
x <- c(1,3,5,7,9)
x_head <- mean(x)
n <- length(x)
variance <- sum((x-x_head)^2) / (n-1)
variance
## [1] 10
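#built-in var() divides by n-1 (it returns 10 here); rescale by (n-1)/n to get the population variance (dividing by n)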
var(x)*(n-1)/n
## [1] 8
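The question of n-1 versus n raised above can be explored with a small simulation. This is only a sketch under stated assumptions: the population (normal with standard deviation 2), the sample size, the number of repetitions, and the seed are all arbitrary choices made for illustration.

```r
set.seed(1)                  # arbitrary seed, only for reproducibility
true_variance <- 4           # assumed population: normal with sd = 2
n <- 5                       # small samples make the bias easy to see
reps <- 20000

# Draw many samples and compute both versions of the variance for each
samples <- replicate(reps, rnorm(n, mean = 0, sd = 2))   # n x reps matrix
var_n_minus_1 <- apply(samples, 2, var)                  # divides by n-1
var_n <- var_n_minus_1 * (n - 1) / n                     # divides by n

mean(var_n_minus_1)   # close to the true variance of 4
mean(var_n)           # systematically smaller, close to 4 * (n-1)/n = 3.2
```

Averaged over many samples, the n-1 version lands near the true variance, while the n version is too small by a factor of about (n-1)/n.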
Formula:
\[ \sigma = \sqrt{Variance} \]
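Empirical rule: a rough guide to how far data points lie from the mean in a normal distribution. About 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.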
Programming:
#calculate standard deviation
sqrt(variance)
## [1] 3.162278
#built-in function
sd(x)
## [1] 3.162278
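The empirical rule mentioned above can be checked against the normal distribution itself. A minimal sketch using base R's `pnorm()`, the cumulative distribution function of the standard normal; the variable names are illustrative only.

```r
# Share of a normal distribution within k standard deviations of the mean
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
round(within_k_sd, 4)
# 0.6827 0.9545 0.9973
```

These values hold only for normally distributed data, so for skewed or heavy-tailed data the rule is a rough guide at best.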
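Quantile = a cut point that splits the sorted data into equal-sized parts
Percentile = a quantile that splits the data into 100 parts
Quartile = a quantile that splits the data into 4 parts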
If we say X is the 80th percentile, we mean that in our data set, 80% of the values are less than or equal to X and 20% of the values are greater than or equal to X.
Formula 1:
\[ Percentile = \frac{Number\ of\ values\ below\ x} {Total\ number\ of\ values - 1} \times 100\% \]
Formula 2:
\[ Percentile = \frac{Number\ of\ values\ below\ or\ equal\ to\ x} {Total\ number\ of\ values} \times 100\% \]
Calculate:
Programming:
#input data
data_a <- c(1:10)
data_b <- c(1:1000)
#calculate percentile
formula_1 <- function(xi, data) {
  sum(data < xi) / (length(data) - 1) * 100
}
formula_2 <- function(xi, data) {
  sum(data <= xi) / length(data) * 100
}
formula_1(4,data_a)
## [1] 33.33333
formula_2(4,data_a)
## [1] 40
formula_1(400,data_b)
## [1] 39.93994
formula_2(400,data_b)
## [1] 40
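As a cross-check on Formula 2, base R's `ecdf()` builds the empirical cumulative distribution function, which returns the proportion of values less than or equal to a given point. A minimal sketch; the names `percentile_rank_a` and `percentile_rank_b` are hypothetical and used only for illustration.

```r
data_a <- 1:10
data_b <- 1:1000

# ecdf() returns a function giving the proportion of values <= its argument
percentile_rank_a <- ecdf(data_a)
percentile_rank_b <- ecdf(data_b)

percentile_rank_a(4) * 100     # 40, matches formula_2(4, data_a)
percentile_rank_b(400) * 100   # 40, matches formula_2(400, data_b)
```

The outputs above also show that the two formulas disagree noticeably on small data (33.3 versus 40 for 10 values) and nearly agree on large data (39.94 versus 40 for 1000 values).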
Using percentiles, we can look at the spread of the sorted data to estimate dispersion. To reduce sensitivity to outliers, we can drop some values at each end and summarize the variability of the data set with a few key percentiles. Typical choices are the 5th, 25th, 50th, 75th, and 95th percentiles. The interquartile range (IQR) is the difference between the 75th percentile and the 25th percentile.
Calculate:
Programming:
interquartile <- quantile(data_b,c(0.25,0.75))
interquartile
##    25%    75%
## 250.75 750.25
interquartile[2] - interquartile[1]
##   75%
## 499.5
IQR(data_b)
## [1] 499.5
summary(data_b)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0   250.8   500.5   500.5   750.2  1000.0
quantile(data_b,c(0.05,0.25,0.5,0.75,0.95))
##      5%     25%     50%     75%     95%
##   50.95  250.75  500.50  750.25  950.05
boxplot(data_b)
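To illustrate the point about sensitivity to outliers, the sketch below compares the IQR and the standard deviation on a small data set with and without one extreme value. The data are made up for illustration only.

```r
clean <- 1:10
with_outlier <- c(1:9, 1000)   # made-up data: largest value replaced by an outlier

IQR(clean)          # 4.5
IQR(with_outlier)   # 4.5, unchanged by the outlier

sd(clean)           # about 3.03
sd(with_outlier)    # far larger, dominated by the single outlier
```

The IQR ignores everything outside the middle 50% of the data, which is why a single extreme value leaves it unchanged here while the standard deviation grows substantially.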