## Check and install the necessary packages, along with dependencies, if any
needpckgs <- c("energy", "ICS", "mvtnorm", "nortest", "DescTools")
DoIHaveIt <- needpckgs %in% rownames(installed.packages())
if (any(!DoIHaveIt)) install.packages(needpckgs[!DoIHaveIt], dependencies = TRUE)
The normal distribution is immensely useful because of the central limit theorem, which states that, under mild conditions, the mean of many random variables independently drawn from the same distribution is approximately normally distributed, irrespective of the form of the original distribution.
A simple example of the central limit theorem is rolling a large number of identical, unbiased dice. The distribution of the sum (or average) of the rolled numbers will be well approximated by a normal distribution.
Since real-world quantities are often the balanced sum of many unobserved random events, the central limit theorem also provides a partial explanation for the prevalence of the normal probability distribution.
It also justifies the approximation of large-sample statistics to the normal distribution in controlled experiments.
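To make this concrete, here is a minimal simulation sketch of the dice example; the number of dice, the number of trials, and the seed are arbitrary choices.
# Simulate the mean of many fair dice rolls and compare with a fitted normal curve
set.seed(123)                                    # arbitrary seed, for reproducibility
n.dice   <- 30                                   # dice rolled per trial
n.trials <- 10000                                # number of trials
dice.means <- replicate(n.trials, mean(sample(1:6, n.dice, replace = TRUE)))
hist(dice.means, breaks = 40, freq = FALSE,
     main = "Means of 30 dice rolls", xlab = "Mean of rolls")
curve(dnorm(x, mean = mean(dice.means), sd = sd(dice.means)),
      add = TRUE, col = "red", lwd = 2)          # normal curve fitted to the simulated means
Even though a single die roll is uniform on 1 to 6, the trial means pile up in a bell shape around 3.5.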
For this tutorial I am going to use the following dataset:
“Six-point board thickness”
Description: Thickness of 2x6 SPF boards from a sawmill. The thickness is measured with a laser, and the units of measurement are mils (one thousandth of an inch).
Data source: This is a subset of a larger industrial data set. No further adjustments were made to the data. It is available at http://datasets.connectmv.com/info/six-point-board-thickness
Data shape: 5000 rows, with six thickness measurements (Pos1 to Pos6) plus a date-time stamp
# Set the working directory
setwd("E:/TestData/")
# Load the data
board.data <- read.csv("six-point-board-thickness.csv", stringsAsFactors = FALSE)
str(board.data)
## 'data.frame': 5000 obs. of 7 variables:
## $ Date.Time: chr "2/18/2010 3:04" "2/18/2010 3:37" "2/18/2010 3:37" "2/18/2010 3:37" ...
## $ Pos1 : int 1761 1801 1697 1679 1699 1778 1705 1695 1678 1671 ...
## $ Pos2 : int 1739 1688 1682 1712 1688 1752 1748 1665 1617 1694 ...
## $ Pos3 : int 1758 1753 1663 1672 1699 1763 1760 1715 1632 1717 ...
## $ Pos4 : int 1677 1741 1671 1703 1678 1693 1677 1742 1711 1693 ...
## $ Pos5 : int 1684 1692 1685 1683 1688 1678 1678 1726 1709 1675 ...
## $ Pos6 : int 1692 1675 1651 1674 1705 1706 1692 1745 1683 1709 ...
# Rename the first column
names(board.data)[1] <- "DateTime"
# Drop any pre-existing Time/Date columns (a no-op here, since none exist yet)
board.data$Time <- NULL
board.data$Date <- NULL
# Split the date-time stamp into separate Date and Time columns
newset <- do.call(rbind, strsplit(as.character(board.data$DateTime), " "))
board.data <- cbind(board.data, Date = newset[, 1], Time = newset[, 2])
# Reposition columns for readability
board.data <- board.data[, c(1, 8, 9, 2:7)]
knitr::kable(head(board.data), format = "markdown", padding = 2)
DateTime | Date | Time | Pos1 | Pos2 | Pos3 | Pos4 | Pos5 | Pos6 |
---|---|---|---|---|---|---|---|---|
2/18/2010 3:04 | 2/18/2010 | 3:04 | 1761 | 1739 | 1758 | 1677 | 1684 | 1692 |
2/18/2010 3:37 | 2/18/2010 | 3:37 | 1801 | 1688 | 1753 | 1741 | 1692 | 1675 |
2/18/2010 3:37 | 2/18/2010 | 3:37 | 1697 | 1682 | 1663 | 1671 | 1685 | 1651 |
2/18/2010 3:37 | 2/18/2010 | 3:37 | 1679 | 1712 | 1672 | 1703 | 1683 | 1674 |
2/18/2010 3:37 | 2/18/2010 | 3:37 | 1699 | 1688 | 1699 | 1678 | 1688 | 1705 |
2/18/2010 3:37 | 2/18/2010 | 3:37 | 1778 | 1752 | 1763 | 1693 | 1678 | 1706 |
library(lubridate)
# Parse the Date column (month/day/year format)
board.data$Date <- mdy(board.data$Date)
str(board.data)
## 'data.frame': 5000 obs. of 9 variables:
## $ DateTime: chr "2/18/2010 3:04" "2/18/2010 3:37" "2/18/2010 3:37" "2/18/2010 3:37" ...
## $ Date : POSIXct, format: "2010-02-18" "2010-02-18" ...
## $ Time : Factor w/ 283 levels "10:02","10:03",..: 126 127 127 127 127 127 127 127 127 127 ...
## $ Pos1 : int 1761 1801 1697 1679 1699 1778 1705 1695 1678 1671 ...
## $ Pos2 : int 1739 1688 1682 1712 1688 1752 1748 1665 1617 1694 ...
## $ Pos3 : int 1758 1753 1663 1672 1699 1763 1760 1715 1632 1717 ...
## $ Pos4 : int 1677 1741 1671 1703 1678 1693 1677 1742 1711 1693 ...
## $ Pos5 : int 1684 1692 1685 1683 1688 1678 1678 1726 1709 1675 ...
## $ Pos6 : int 1692 1675 1651 1674 1705 1706 1692 1745 1683 1709 ...
# Drop the DateTime and Time columns, keeping Date and the six position measurements
board.data1 <- board.data[, -c(1, 3)]
# Boxplots of all six thickness measurements
boxplot(board.data1[2:7])
Now, we will work with NORMALITY tests.
We will first perform a few univariate checks.
As in much applied work, we will start with a normal Q-Q plot (strictly speaking a plot rather than a test).
Let us use the dataset already loaded in our R environment.
We will use only Pos1 for this purpose.
POS1 <- board.data1$Pos1
# summary
summary(POS1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 880 1670 1685 1689 1705 1902
qqnorm(POS1, pch = 1, cex = 0.5)
qqline(POS1, col = "red", lwd = 1)
So what do you say about this kind of data?
Does it look normal to you?
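As a complement to the Q-Q plot, a histogram with a fitted normal curve is another quick visual check; this is a minimal base-R sketch, and the number of bins is an arbitrary choice.
# Histogram of Pos1 with a normal density overlay (mean and sd estimated from the data)
hist(POS1, breaks = 50, freq = FALSE,
     main = "Pos1 with fitted normal curve", xlab = "Thickness (mils)")
curve(dnorm(x, mean = mean(POS1), sd = sd(POS1)),
      add = TRUE, col = "red", lwd = 2)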
The Anderson-Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution.
In its basic form, the test assumes that there are no parameters to be estimated in the distribution being tested, in which case the test and its set of critical values is distribution-free.
However, the test is most often used in contexts where a family of distributions is being tested, in which case the parameters of that family need to be estimated and account must be taken of this in adjusting either the test-statistic or its critical values.
When applied to testing if a normal distribution adequately describes a set of data, it is one of the most powerful statistical tools for detecting most departures from normality.
Definition
The Anderson-Darling test is defined as:
H0: The data follow a specified distribution.
Ha: The data do not follow the specified distribution.
Significance level: alpha = 0.05
Critical value: 0.752 (for the normality case at alpha = 0.05)
Critical region: Reject H0 if A² > 0.752
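As a quick sanity check of this decision rule, here is a minimal sketch, with an arbitrary seed and sample size, applying nortest's ad.test to simulated normal data and comparing the resulting A-squared statistic with the 0.752 critical value quoted above.
library(nortest)
set.seed(100)
z <- rnorm(200, mean = 1700, sd = 25)   # simulated "thickness-like" normal data
res <- ad.test(z)                       # Anderson-Darling normality test
res$statistic                           # the A-squared statistic
res$statistic > 0.752                   # TRUE would mean: reject H0 at alpha = 0.05
res$p.value < 0.05                      # the equivalent decision via the p-value
For truly normal data like this, both comparisons should usually come out FALSE.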
There are two R packages I can recall for this test:
"DescTools" and "nortest".
Since we have already installed these packages (see above), let's use both and compare their results.
# Using the DescTools package
library(DescTools)
# Test a variable (note: with no "null" argument, this tests against a uniform distribution)
AndersonDarlingTest(board.data1$Pos1)
##
## Anderson-Darling test of goodness-of-fit
## Null hypothesis: uniform distribution
##
## data: board.data1$Pos1
## An = Inf, p-value = 1.2e-07
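Note what the output says: the null hypothesis tested was a uniform distribution. When no null argument is supplied, DescTools' AndersonDarlingTest defaults to a uniform null, so the call above did not actually test normality. As far as I recall the interface, extra arguments are passed on to the null distribution function, so a normal null with parameters estimated from the data could be specified as sketched below; because the mean and sd come from the same sample, the reported p-value is only approximate (nortest's ad.test, used next, handles this case properly).
# Test Pos1 against a normal null; mean and sd are estimated from the data,
# so treat the resulting p-value as approximate
AndersonDarlingTest(board.data1$Pos1, null = "pnorm",
                    mean = mean(board.data1$Pos1),
                    sd   = sd(board.data1$Pos1))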
# Using nortest Package
library(nortest)
# Re-test the same variable
ad.test(board.data1$Pos1)
##
## Anderson-Darling normality test
##
## data: board.data1$Pos1
## A = 115.7614, p-value < 2.2e-16
In both tests above the A-squared statistic far exceeds the critical value, so we reject the null hypothesis (note, though, that the DescTools call rejected its default uniform null, while the nortest call rejected normality).
Or do we??
One thing to note is that many researchers have found that the Anderson-Darling normality test is
not quite as powerful as the Shapiro-Wilk test, but it is surely better than many other tests, and
variants of it can be used for both univariate and multivariate data.
The Shapiro-Wilk test uses the null hypothesis principle to check
whether a sample X1, ..., Xn came from a normally distributed population.
Interpretation
The null-hypothesis of this test is that the population is normally distributed.
Thus if the p-value is less than the chosen alpha level, then the null hypothesis is rejected
and there is evidence that the data tested are not from a normally distributed population.
In other words, the data are not normal.
On the contrary, if the p-value is greater than the chosen alpha level, then the null hypothesis
that the data came from a normally distributed population cannot be rejected.
e.g. for an alpha level of 0.05, a data set with a p-value of 0.02 rejects the null hypothesis
that the data are from a normally distributed population.
However, since the test's power grows with sample size, even small and practically irrelevant
departures from normality can turn out statistically significant in large samples.
Thus a Q-Q plot is recommended for verification in addition to the test.
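For the two variables tested just below (Pos1 and Pos2), a side-by-side pair of Q-Q plots makes that visual verification easy; a minimal base-R sketch:
# Normal Q-Q plots of Pos1 and Pos2, side by side
op <- par(mfrow = c(1, 2))
qqnorm(board.data1$Pos1, main = "Pos1", pch = 1, cex = 0.5)
qqline(board.data1$Pos1, col = "red")
qqnorm(board.data1$Pos2, main = "Pos2", pch = 1, cex = 0.5)
qqline(board.data1$Pos2, col = "red")
par(op)  # restore the plotting layout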
# Test one of the variables in the dataset
shapiro.test(POS1)
##
## Shapiro-Wilk normality test
##
## data: POS1
## W = 0.8771, p-value < 2.2e-16
# Test another variable in the dataset
shapiro.test(board.data1$Pos2)
##
## Shapiro-Wilk normality test
##
## data: board.data1$Pos2
## W = 0.946, p-value < 2.2e-16
Interpreting the results from the Shapiro-Wilk tests above:
In R, the NULL hypothesis of the Shapiro-Wilk test is that the sample came from a normal distribution.
This means that if the test's p-value is <= 0.05, we reject the NULL hypothesis that the sample
came from a normal distribution.
For the two variables tested above (POS1 and Pos2), the p-values are <= 0.05 in both cases, hence we
reject the null hypothesis that the samples come from a normal distribution.
Put differently, data like these would be very unlikely to arise if the samples really did come from a normal distribution.
AM I RIGHT??
Using this test, we rarely end up with "UNDOUBTED" results.
To illustrate, take for example:
set.seed(450)
x <- runif(50, min = 2, max = 4)
shapiro.test(x)
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.9601, p-value = 0.08995
# But for the same range and a larger sample, we get a very different p-value
x <- runif(500, min = 2, max = 4)
shapiro.test(x)
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.9429, p-value = 6.106e-13
Wow! According to this test we cannot reject normality for the sample runif(50, min = 2, max = 4),
because the p-value > 0.05, even though the data are uniform by construction.
Unlike above, the sample runif(500, min = 2, max = 4) is clearly flagged as non-normal by the same test,
because the p-value < 0.05.
"What to say???"
In my personal opinion, the "Shapiro-Wilk test" may be applicable to "smaller" sample sizes.
But I will always doubt its capabilities when "large" samples are to be tested.
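To show the other side of the coin, here is a small sketch (the seed, sample sizes and distributions are arbitrary choices): with genuinely normal data the test usually does not reject even at n = 4000, whereas mildly heavy-tailed data, a t distribution with 10 degrees of freedom, is almost always flagged at that size.
set.seed(2021)
# Genuinely normal data: even with n = 4000 the test usually does not reject
shapiro.test(rnorm(4000))$p.value
# Mildly heavy-tailed data (t with 10 df): a small departure from normality,
# but at n = 4000 it is almost always detected
shapiro.test(rt(4000, df = 10))$p.value
With large samples the test becomes sensitive to departures that may be too small to matter in practice, which is exactly why the Q-Q plot check remains useful.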
So we will look for an alternative "Normality Test".