## Check and install the necessary packages, along with dependencies, if any
needpckgs <- c("energy", "ICS", "mvtnorm", "nortest", "DescTools")
DoIHaveIt <- needpckgs %in% rownames(installed.packages())
if (any(!DoIHaveIt)) install.packages(needpckgs[!DoIHaveIt], dependencies = TRUE)
The normal distribution is immensely useful because of the central limit theorem, which states that, under mild conditions, the mean of many random variables independently drawn from the same distribution is approximately normally distributed, irrespective of the form of the original distribution.
A simple example of the central limit theorem is rolling a large number of identical, unbiased dice. The distribution of the sum (or average) of the rolled numbers will be well approximated by a normal distribution.
Since real-world quantities are often the balanced sum of many unobserved random events, the central limit theorem also provides a partial explanation for the prevalence of the normal probability distribution.
It also justifies the approximation of large-sample statistics to the normal distribution in controlled experiments.
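To make this concrete, here is a minimal simulation sketch of the dice example; the number of dice, the number of trials, and the seed are arbitrary choices.
# Simulate the mean of many fair dice rolls and compare with a fitted normal curve
set.seed(123)                                    # arbitrary seed, for reproducibility
n.dice   <- 30                                   # dice rolled per trial
n.trials <- 10000                                # number of trials
dice.means <- replicate(n.trials, mean(sample(1:6, n.dice, replace = TRUE)))
hist(dice.means, breaks = 40, freq = FALSE,
     main = "Means of 30 dice rolls", xlab = "Mean of rolls")
curve(dnorm(x, mean = mean(dice.means), sd = sd(dice.means)),
      add = TRUE, col = "red", lwd = 2)          # normal curve fitted to the simulated means
Even though a single die roll is uniform on 1 to 6, the trial means pile up in a bell shape around 3.5.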
For this tutorial I am going to use the following dataset:
“Six-point board thickness”
Description: Thickness of 2x6 SPF boards from a sawmill. The thickness is measured with a laser, and the units of measurement are mils (one thousandth of an inch).
Data source: This is a subset of a larger industrial data set. No further adjustments were made to the data. It is available at http://datasets.connectmv.com/info/six-point-board-thickness
Data shape: 5000 rows, with six thickness measurements (Pos1 to Pos6) plus a date-time stamp
# Set the working directory
setwd("E:/TestData/")
# Load the data
board.data <- read.csv("six-point-board-thickness.csv", stringsAsFactors = FALSE)
str(board.data)
## 'data.frame': 5000 obs. of 7 variables:
## $ Date.Time: chr "2/18/2010 3:04" "2/18/2010 3:37" "2/18/2010 3:37" "2/18/2010 3:37" ...
## $ Pos1 : int 1761 1801 1697 1679 1699 1778 1705 1695 1678 1671 ...
## $ Pos2 : int 1739 1688 1682 1712 1688 1752 1748 1665 1617 1694 ...
## $ Pos3 : int 1758 1753 1663 1672 1699 1763 1760 1715 1632 1717 ...
## $ Pos4 : int 1677 1741 1671 1703 1678 1693 1677 1742 1711 1693 ...
## $ Pos5 : int 1684 1692 1685 1683 1688 1678 1678 1726 1709 1675 ...
## $ Pos6 : int 1692 1675 1651 1674 1705 1706 1692 1745 1683 1709 ...
# Rename the first column
names(board.data)[1] <- "DateTime"
# Drop any pre-existing Time/Date columns (a no-op here, since none exist yet)
board.data$Time <- NULL
board.data$Date <- NULL
# Split the date-time stamp into separate Date and Time columns
newset <- do.call(rbind, strsplit(as.character(board.data$DateTime), " "))
board.data <- cbind(board.data, Date = newset[, 1], Time = newset[, 2])
# Reposition columns for readability
board.data <- board.data[, c(1, 8, 9, 2:7)]
knitr::kable(head(board.data), format = "markdown", padding = 2)
DateTime | Date | Time | Pos1 | Pos2 | Pos3 | Pos4 | Pos5 | Pos6 |
---|---|---|---|---|---|---|---|---|
2/18/2010 3:04 | 2/18/2010 | 3:04 | 1761 | 1739 | 1758 | 1677 | 1684 | 1692 |
2/18/2010 3:37 | 2/18/2010 | 3:37 | 1801 | 1688 | 1753 | 1741 | 1692 | 1675 |
2/18/2010 3:37 | 2/18/2010 | 3:37 | 1697 | 1682 | 1663 | 1671 | 1685 | 1651 |
2/18/2010 3:37 | 2/18/2010 | 3:37 | 1679 | 1712 | 1672 | 1703 | 1683 | 1674 |
2/18/2010 3:37 | 2/18/2010 | 3:37 | 1699 | 1688 | 1699 | 1678 | 1688 | 1705 |
2/18/2010 3:37 | 2/18/2010 | 3:37 | 1778 | 1752 | 1763 | 1693 | 1678 | 1706 |
library(lubridate)
# Parse the Date column (month/day/year format)
board.data$Date <- mdy(board.data$Date)
str(board.data)
## 'data.frame': 5000 obs. of 9 variables:
## $ DateTime: chr "2/18/2010 3:04" "2/18/2010 3:37" "2/18/2010 3:37" "2/18/2010 3:37" ...
## $ Date : POSIXct, format: "2010-02-18" "2010-02-18" ...
## $ Time : Factor w/ 283 levels "10:02","10:03",..: 126 127 127 127 127 127 127 127 127 127 ...
## $ Pos1 : int 1761 1801 1697 1679 1699 1778 1705 1695 1678 1671 ...
## $ Pos2 : int 1739 1688 1682 1712 1688 1752 1748 1665 1617 1694 ...
## $ Pos3 : int 1758 1753 1663 1672 1699 1763 1760 1715 1632 1717 ...
## $ Pos4 : int 1677 1741 1671 1703 1678 1693 1677 1742 1711 1693 ...
## $ Pos5 : int 1684 1692 1685 1683 1688 1678 1678 1726 1709 1675 ...
## $ Pos6 : int 1692 1675 1651 1674 1705 1706 1692 1745 1683 1709 ...
# Drop the DateTime and Time columns, keeping Date and the six position measurements
board.data1 <- board.data[, -c(1, 3)]
# Boxplots of all six thickness measurements
boxplot(board.data1[2:7])
Now, we will work with NORMALITY tests.
We will first perform a few univariate checks.
As in much applied work, we will start with a normal Q-Q plot (strictly speaking a plot rather than a test).
Let us use the dataset already loaded in our R environment.
We will use only Pos1 for this purpose.
POS1 <- board.data1$Pos1
# summary
summary(POS1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 880 1670 1685 1689 1705 1902
qqnorm(POS1, pch = 1, cex = 0.5)
qqline(POS1, col = "red", lwd = 1)
So what do you say about this kind of data?
Does it look normal to you?
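As a complement to the Q-Q plot, a histogram with a fitted normal curve is another quick visual check; this is a minimal base-R sketch, and the number of bins is an arbitrary choice.
# Histogram of Pos1 with a normal density overlay (mean and sd estimated from the data)
hist(POS1, breaks = 50, freq = FALSE,
     main = "Pos1 with fitted normal curve", xlab = "Thickness (mils)")
curve(dnorm(x, mean = mean(POS1), sd = sd(POS1)),
      add = TRUE, col = "red", lwd = 2)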
The Anderson-Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution.
In its basic form, the test assumes that there are no parameters to be estimated in the distribution being tested, in which case the test and its set of critical values is distribution-free.
However, the test is most often used in contexts where a family of distributions is being tested, in which case the parameters of that family need to be estimated and account must be taken of this in adjusting either the test-statistic or its critical values.
When applied to testing if a normal distribution adequately describes a set of data, it is one of the most powerful statistical tools for detecting most departures from normality.
Definition
The Anderson-Darling test is defined as:
H0: The data follow a specified distribution.
Ha: The data do not follow the specified distribution.
Significance level: alpha = 0.05
Critical value: 0.752 (for the normality case at alpha = 0.05)
Critical region: Reject H0 if A² > 0.752
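As a quick sanity check of this decision rule, here is a minimal sketch, with an arbitrary seed and sample size, applying nortest's ad.test to simulated normal data and comparing the resulting A-squared statistic with the 0.752 critical value quoted above.
library(nortest)
set.seed(100)
z <- rnorm(200, mean = 1700, sd = 25)   # simulated "thickness-like" normal data
res <- ad.test(z)                       # Anderson-Darling normality test
res$statistic                           # the A-squared statistic
res$statistic > 0.752                   # TRUE would mean: reject H0 at alpha = 0.05
res$p.value < 0.05                      # the equivalent decision via the p-value
For truly normal data like this, both comparisons should usually come out FALSE.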
There are two R packages I can recall for this test:
"DescTools" and "nortest".
Since we have already installed these packages (see above), let's use both and compare their results.
# Using the DescTools package
library(DescTools)
# Test a variable (note: with no "null" argument, this tests against a uniform distribution)
AndersonDarlingTest(board.data1$Pos1)
##
## Anderson-Darling test of goodness-of-fit
## Null hypothesis: uniform distribution
##
## data: board.data1$Pos1
## An = Inf, p-value = 1.2e-07
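Note what the output says: the null hypothesis tested was a uniform distribution. When no null argument is supplied, DescTools' AndersonDarlingTest defaults to a uniform null, so the call above did not actually test normality. As far as I recall the interface, extra arguments are passed on to the null distribution function, so a normal null with parameters estimated from the data could be specified as sketched below; because the mean and sd come from the same sample, the reported p-value is only approximate (nortest's ad.test, used next, handles this case properly).
# Test Pos1 against a normal null; mean and sd are estimated from the data,
# so treat the resulting p-value as approximate
AndersonDarlingTest(board.data1$Pos1, null = "pnorm",
                    mean = mean(board.data1$Pos1),
                    sd   = sd(board.data1$Pos1))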
# Using nortest Package
library(nortest)
# Re-test the same variable
ad.test(board.data1$Pos1)
##
## Anderson-Darling normality test
##
## data: board.data1$Pos1
## A = 115.7614, p-value < 2.2e-16
In both tests above the A-squared statistic far exceeds the critical value, so we reject the null hypothesis (note, though, that the DescTools call rejected its default uniform null, while the nortest call rejected normality).
Or do we??
One thing to note is that many researchers have found that the Anderson-Darling normality test is
not quite as powerful as the Shapiro-Wilk test, but it is surely better than many other tests, and
variants of it can be used for both univariate and multivariate data.
The Shapiro-Wilk test uses the null hypothesis principle to check
whether a sample X1, ..., Xn came from a normally distributed population.
Interpretation
The null-hypothesis of this test is that the population is normally distributed.
Thus if the p-value is less than the chosen alpha level, then the null hypothesis is rejected
and there is evidence that the data tested are not from a normally distributed population.
In other words, the data are not normal.
On the contrary, if the p-value is greater than the chosen alpha level, then the null hypothesis
that the data came from a normally distributed population cannot be rejected.
e.g. for an alpha level of 0.05, a data set with a p-value of 0.02 rejects the null hypothesis
that the data are from a normally distributed population.
However, since the test's power grows with sample size, even small and practically irrelevant
departures from normality can turn out statistically significant in large samples.
Thus a Q-Q plot is recommended for verification in addition to the test.
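For the two variables tested just below (Pos1 and Pos2), a side-by-side pair of Q-Q plots makes that visual verification easy; a minimal base-R sketch:
# Normal Q-Q plots of Pos1 and Pos2, side by side
op <- par(mfrow = c(1, 2))
qqnorm(board.data1$Pos1, main = "Pos1", pch = 1, cex = 0.5)
qqline(board.data1$Pos1, col = "red")
qqnorm(board.data1$Pos2, main = "Pos2", pch = 1, cex = 0.5)
qqline(board.data1$Pos2, col = "red")
par(op)  # restore the plotting layout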
# Test one of the variables in the dataset
shapiro.test(POS1)
##
## Shapiro-Wilk normality test
##
## data: POS1
## W = 0.8771, p-value < 2.2e-16
# Test another variable in the dataset
shapiro.test(board.data1$Pos2)
##
## Shapiro-Wilk normality test
##
## data: board.data1$Pos2
## W = 0.946, p-value < 2.2e-16
Interpreting the results from the Shapiro-Wilk tests above:
In R, the NULL hypothesis of the Shapiro-Wilk test is that the sample came from a normal distribution.
This means that if the test's p-value is <= 0.05, we reject the NULL hypothesis that the sample
came from a normal distribution.
For the two variables tested above (POS1 and Pos2), the p-values are <= 0.05 in both cases, hence we
reject the null hypothesis that the samples come from a normal distribution.
Put differently, data like these would be very unlikely to arise if the samples really did come from a normal distribution.
AM I RIGHT??
Using this test, we rarely end up with "UNDOUBTED" results.
To illustrate, take for example:
set.seed(450)
x <- runif(50, min = 2, max = 4)
shapiro.test(x)
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.9601, p-value = 0.08995
# But for the same range and a larger sample, we get a very different p-value
x <- runif(500, min = 2, max = 4)
shapiro.test(x)
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.9429, p-value = 6.106e-13
Wow! According to this test we cannot reject normality for the sample runif(50, min = 2, max = 4),
because the p-value > 0.05, even though the data are uniform by construction.
Unlike above, the sample runif(500, min = 2, max = 4) is clearly flagged as non-normal by the same test,
because the p-value < 0.05.
"What to say???"
In my personal opinion, the "Shapiro-Wilk test" may be applicable to "smaller" sample sizes.
But I will always doubt its capabilities when "large" samples are to be tested.
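To show the other side of the coin, here is a small sketch (the seed, sample sizes and distributions are arbitrary choices): with genuinely normal data the test usually does not reject even at n = 4000, whereas mildly heavy-tailed data, a t distribution with 10 degrees of freedom, is almost always flagged at that size.
set.seed(2021)
# Genuinely normal data: even with n = 4000 the test usually does not reject
shapiro.test(rnorm(4000))$p.value
# Mildly heavy-tailed data (t with 10 df): a small departure from normality,
# but at n = 4000 it is almost always detected
shapiro.test(rt(4000, df = 10))$p.value
With large samples the test becomes sensitive to departures that may be too small to matter in practice, which is exactly why the Q-Q plot check remains useful.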
So we will look for an alternative "Normality Test".