We’ll discuss the normal distribution (also called the Gaussian distribution) and its features.
The normal distribution is a fundamental idea in statistics and is used as an assumption in many analytical methods. Its main features are that 1) the mean, mode, and median are the same value, and 2) it is symmetric about the central peak.
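Mathematically, a normal distribution with mean μ and standard deviation σ is described by the following probability density function.
$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$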
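As a quick sanity check, that density can be written out by hand in R and compared against the built-in dnorm(). This is only a minimal sketch: the helper name normal_pdf and the values chosen for mu and sigma are illustrative assumptions, not part of the original analysis.

```r
# Normal density written out term by term (normal_pdf is a hypothetical helper name).
normal_pdf <- function(x, mu = 0, sigma = 1) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

x <- seq(-4, 4, by = 0.5)

# Largest absolute difference from R's built-in density; it should be essentially 0.
max_diff <- max(abs(normal_pdf(x, mu = 1, sigma = 2) - dnorm(x, mean = 1, sd = 2)))
print(max_diff)

# For the standard normal (mu = 0), the density peaks at the mean.
peak_at_mean <- x[which.max(normal_pdf(x))]
print(peak_at_mean)
```

The peak sitting at the mean, together with equal density at equal distances on either side of it, is what makes the mean, mode, and median coincide.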
Let’s draw 1000 samples from a standard normal distribution.
*Standard normal distribution: a normal distribution with mean 0 and standard deviation 1.
dist_norm <- rnorm(1000)
hist(dist_norm, breaks = 10)
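The two features listed earlier (mean and median coincide, and the shape is symmetric) can also be checked numerically on such a sample. The sketch below is illustrative only: the seed and the variable names are assumptions, and with random data the values will be close to 0 rather than exactly 0.

```r
set.seed(42)  # arbitrary seed, only so this sketch is reproducible
sample_norm <- rnorm(1000)

sample_mean   <- mean(sample_norm)
sample_median <- median(sample_norm)

# Symmetry check: the lower and upper quartiles should lie at roughly equal
# distances on either side of 0, so their sum should be near 0.
quartiles <- quantile(sample_norm, probs = c(0.25, 0.75))
quartile_asymmetry <- sum(quartiles)

print(c(mean = sample_mean, median = sample_median,
        quartile_asymmetry = quartile_asymmetry))
```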
Now we’ll go over methods for assessing the “normality” of a distribution (= its closeness to a normal distribution).
One of the most widely used methods for assessing normality is the Shapiro-Wilk test.
The null hypothesis is “the samples come from a normal distribution,” and the alternative hypothesis is “the samples don’t come from a normal distribution.” Thus, if the p-value is below the significance level (we use 0.05 this time), we reject the null hypothesis and conclude “the distribution doesn’t follow a normal distribution.”
Let’s first create distributions to test.
We’ll use dist_1 (a sample from the standard normal distribution) and dist_2 (a distorted distribution).
#standard normal distribution
dist_1 <- rnorm(3000)
hist(dist_1, breaks = 20)
#distorted distribution
dist_2 <- c(0,0,1,1,2,2,2,2,3,3,3,4,4,4,5,7,8,9,10,10,12)
hist(dist_2, breaks = 20)
Below are the results of the Shapiro-Wilk test for dist_1 and dist_2.
shapiro.test(x=dist_1)
##
## Shapiro-Wilk normality test
##
## data: dist_1
## W = 0.99926, p-value = 0.2646
shapiro.test(x=dist_2)
##
## Shapiro-Wilk normality test
##
## data: dist_2
## W = 0.89796, p-value = 0.03195
Since the p-value for dist_1 (0.2646) is above the significance level of 0.05, we fail to reject the null hypothesis: the data are consistent with a normal distribution, although this does not prove normality. On the other hand, the p-value for dist_2 (0.03195) is below 0.05, so we reject the null hypothesis and conclude that it doesn’t follow a normal distribution.
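The same decision can be scripted by reading the p-value from the object that shapiro.test() returns. This is a minimal sketch; is_rejected is a hypothetical helper name introduced here for illustration, and 0.05 is simply the significance level used in this article.

```r
# is_rejected is a hypothetical helper: TRUE when the Shapiro-Wilk test
# rejects the null hypothesis of normality at significance level alpha.
is_rejected <- function(x, alpha = 0.05) {
  result <- shapiro.test(x)  # base R (stats); needs between 3 and 5000 values
  unname(result$p.value) < alpha
}

# Same values as dist_2 in the article, repeated so the sketch is self-contained.
dist_2 <- c(0,0,1,1,2,2,2,2,3,3,3,4,4,4,5,7,8,9,10,10,12)

rejected_dist_2 <- is_rejected(dist_2)
print(rejected_dist_2)
```

For dist_2 this should agree with the p-value of 0.03195 reported above, i.e. normality is rejected at the 0.05 level.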
Alternatives to the Shapiro-Wilk test include the Anderson-Darling test, which puts more emphasis on the tails.
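To give a feel for how the Anderson-Darling test weighs the data, the sketch below computes only its test statistic in base R, with the mean and standard deviation estimated from the sample. It is a rough illustration rather than a full test: ad_statistic is a hypothetical helper name, no p-value is computed, and in practice a packaged implementation such as ad.test() in the nortest package would be the usual choice.

```r
# ad_statistic is a hypothetical helper: Anderson-Darling statistic for normality,
# with mean and sd estimated from the data. It returns the statistic only.
ad_statistic <- function(x) {
  x <- sort(x)
  n <- length(x)
  z <- (x - mean(x)) / sd(x)                 # standardise with sample mean and sd
  log_lower <- pnorm(z, log.p = TRUE)        # log F(z_i)
  log_upper <- pnorm(-rev(z), log.p = TRUE)  # log(1 - F(z_(n+1-i)))
  a2 <- -n - mean((2 * seq_len(n) - 1) * (log_lower + log_upper))
  a2 * (1 + 0.75 / n + 2.25 / n^2)           # small-sample adjustment
}

# Same values as dist_2 in the article, plus an idealised near-normal "sample"
# made of evenly spaced normal quantiles (an assumption for comparison only).
dist_2 <- c(0,0,1,1,2,2,2,2,3,3,3,4,4,4,5,7,8,9,10,10,12)
near_normal <- qnorm(ppoints(21))

ad_dist_2      <- ad_statistic(dist_2)
ad_near_normal <- ad_statistic(near_normal)
print(c(dist_2 = ad_dist_2, near_normal = ad_near_normal))
```

A larger statistic means a larger departure from normality. The log terms grow quickly for observations far out in either tail, which is where the emphasis on the tails comes from.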
Skewness and kurtosis describe how a distribution departs from the shape of a normal distribution. Skewness indicates the degree of asymmetry, and kurtosis represents the concentration of data on the central peak.
Let’s check the skewness and kurtosis of the above distributions, using the package “e1071.”
When skewness is close to 0, we conclude the distribution is close to symmetric. If the data is concentrated on the left (with a longer tail to the right), the skewness goes above 0, and it goes below 0 if the data is concentrated on the right side (with a longer tail to the left).
library(e1071)
skewness(dist_1, type=2)
## [1] 0.04576676
skewness(dist_2, type=2)
## [1] 0.8032202
In the above examples, the skewness of dist_2 is above 0 since its data is concentrated on the left side, while dist_1 has skewness close to 0.
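To see what the skewness value is made of, it can be recomputed by hand from the central moments. The sketch below assumes that type = 2 corresponds to the moment-based skewness multiplied by a small-sample adjustment, which reproduces the value printed above for dist_2; skewness_type2 is a hypothetical helper name and uses base R only.

```r
# skewness_type2 is a hypothetical helper written in base R.
skewness_type2 <- function(x) {
  n  <- length(x)
  d  <- x - mean(x)
  m2 <- mean(d^2)                    # second central moment
  m3 <- mean(d^3)                    # third central moment
  g1 <- m3 / m2^(3 / 2)              # moment-based skewness
  g1 * sqrt(n * (n - 1)) / (n - 2)   # small-sample adjustment
}

# Same values as dist_2 in the article.
dist_2 <- c(0,0,1,1,2,2,2,2,3,3,3,4,4,4,5,7,8,9,10,10,12)

skew_dist_2 <- skewness_type2(dist_2)
print(skew_dist_2)
```

Because the deviations are cubed, the few large values on the right of dist_2 (10, 10, 12) outweigh the many small values on the left, which pushes the result above 0.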
Let’s check out their kurtosis now.
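Just as with skewness, when the kurtosis is close to 0, we conclude the distribution has a data concentration like that of a normal distribution. (The kurtosis() function in e1071 returns excess kurtosis, i.e. kurtosis minus 3, which is why the reference value for a normal distribution is 0 rather than 3.) Kurtosis goes above 0 when the data is concentrated on the peak, and it’s below 0 when the data is more dispersed.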
kurtosis(dist_1, type=2)
## [1] 0.09240571
kurtosis(dist_2, type=2)
## [1] -0.4790917
From the above results, dist_1 has a shape close to that of a normal distribution, as its kurtosis is close to 0. However, dist_2 has data less concentrated on its peak than a normal distribution and therefore has a kurtosis well below 0.
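The kurtosis value can be recomputed by hand in the same spirit. The sketch below assumes that type = 2 takes the moment-based excess kurtosis and applies a small-sample adjustment, which reproduces the value printed above for dist_2; kurtosis_type2 is a hypothetical helper name and uses base R only.

```r
# kurtosis_type2 is a hypothetical helper written in base R.
kurtosis_type2 <- function(x) {
  n  <- length(x)
  d  <- x - mean(x)
  m2 <- mean(d^2)                    # second central moment
  m4 <- mean(d^4)                    # fourth central moment
  g2 <- m4 / m2^2 - 3                # excess kurtosis (0 for a normal distribution)
  ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3))  # small-sample adjustment
}

# Same values as dist_2 in the article.
dist_2 <- c(0,0,1,1,2,2,2,2,3,3,3,4,4,4,5,7,8,9,10,10,12)

kurt_dist_2 <- kurtosis_type2(dist_2)
print(kurt_dist_2)
```

Subtracting 3 in the g2 step is what centres the scale on the normal distribution, so a negative result for dist_2 means it is flatter than a normal distribution with the same variance.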