Goodness-of-fit tests

Title: Main procedures to check normality

Synopsis: This document is aimed at helping to check normality in a data set.

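Theoretical background

Hypotheses

H0: The sample is Gaussian (normal distribution).
H1: The sample is not Gaussian (non-normal distribution).

If p <= 0.05, H0 is rejected at the 5% significance level and the sample is treated as non-Gaussian. A larger p-value means only that no significant departure from normality was detected.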
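The decision rule above can be sketched in a few lines of R. This is only an illustration: the helper name rejects_normality and its alpha argument are hypothetical, and Shapiro-Wilk is used here simply because it ships with base R.

```r
# Hypothetical helper: TRUE when H0 (normality) is rejected at level alpha
rejects_normality <- function(x, alpha = 0.05) {
  shapiro.test(x)$p.value <= alpha
}

set.seed(1)
rejects_normality(rexp(200))   # strongly right-skewed sample
rejects_normality(rnorm(200))  # sample drawn from a normal distribution
```

The same comparison works with any test object that exposes a p.value element.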

Types of tests

KOLMOGOROV-SMIRNOV (K-S): One of the oldest tests and the most “lenient” in its tendency to accept a given sample as Gaussian.

LILLIEFORS: A more accurate variant of the K-S test that corrects for estimating the mean and standard deviation from the sample, which makes it less “lenient”.

SHAPIRO-WILK W: The most widely used test, owing to its statistical power, and also the “strictest” about accepting the hypothesis of Gaussianity. It should be combined with the histogram and the Q-Q plot. It works well for samples of up to 5000 observations, a limit imposed by the function shapiro.test(). It is usually run together with the Anderson-Darling test.

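ANDERSON-DARLING: A test based on the empirical distribution function that gives more weight to the tails than K-S. It is usually run together with the Shapiro-Wilk test.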
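How "lenient" or "strict" a test is can be explored by simulation. The sketch below, using base R only, estimates how often Shapiro-Wilk and K-S (with mean and standard deviation estimated from the sample, as done later in this document) reject H0 for heavy-tailed samples. The number of replicates, the sample size and the choice of a t distribution with 3 degrees of freedom are arbitrary assumptions.

```r
# Estimate how often each test rejects H0 for heavy-tailed samples (t with 3 df)
set.seed(123)
n_rep <- 300   # number of simulated samples (arbitrary choice)
n     <- 100   # size of each sample (arbitrary choice)

p_values <- replicate(n_rep, {
  x <- rt(n, df = 3)
  c(shapiro = shapiro.test(x)$p.value,
    ks      = ks.test(x, "pnorm", mean(x), sd(x))$p.value)
})

# Share of samples declared non-Gaussian at the 5% level
rejection_rate <- rowMeans(p_values <= 0.05)
rejection_rate
```

A higher rejection rate on non-normal data means a more powerful, less lenient test.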

Central limit theorem

If random samples of size n are taken from a population with finite variance, the distribution of the sample means (the sampling distribution) is approximately normal.
The approximation improves as the sample size grows.
Imagine taking repeated samples of size n from a population, calculating the mean of a variable each time, and then drawing a histogram of these means: that histogram approximates the sampling distribution.
The sampling distribution has the same mean as the population, but its standard deviation is smaller: the population standard deviation divided by the square root of n.

When sampling from a normally distributed population, the sampling distribution is normally distributed.
When sampling from a population that is not normally distributed, the sampling distribution is approximately normally distributed, provided n is large enough.

The Central Limit Theorem (CLT) shows that the normal distribution can be relevant even if the variable is not normally distributed.
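A small simulation can make the theorem concrete. This is a minimal sketch assuming an exponential population with rate 1 (mean 1, standard deviation 1); the number of replicates and the sample size are arbitrary choices.

```r
set.seed(42)
n_rep <- 5000        # number of repeated samples (arbitrary)
n     <- 50          # size of each sample (arbitrary)
se    <- 1 / sqrt(n) # theoretical standard deviation of the sample mean

# Population: exponential with rate 1, clearly not normal
sample_means <- replicate(n_rep, mean(rexp(n, rate = 1)))

mean(sample_means)   # close to the population mean, 1
sd(sample_means)     # close to se
se

# Histogram of the sample means with the limiting normal density overlaid
hist(sample_means, prob = TRUE, main = "Sampling distribution of the mean")
curve(dnorm(x, mean = 1, sd = se), add = TRUE)
```

Although the population is strongly skewed, the histogram of the means is close to the bell-shaped curve.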

Goodness-of-fit tests

Shapiro-Wilk test

# Normal distribution
set.seed(777)
x <- rnorm(250,10,1)
shapiro.test(x)

## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.9978, p-value = 0.9851

# Non-normal distribution: Student's t with 3 degrees of freedom (heavy tails)

x <- rt(100, df=3)
shapiro.test(x)

## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.8221, p-value = 1.254e-09

Anderson-Darling test

Usually run alongside the Shapiro-Wilk test.

# Normal distribution
#install.packages("nortest")
library(nortest)
set.seed(777)
x <- rnorm(250,10,1)
ad.test(x)

## 
##  Anderson-Darling normality test
## 
## data:  x
## A = 0.1394, p-value = 0.9745

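# Non-normal distribution

x <- rt(100, df=3)
ad.test(x)       # Anderson-Darling on the heavy-tailed sample (output not shown)
shapiro.test(x)  # Shapiro-Wilk on the same sample, for comparison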

## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.8221, p-value = 1.254e-09

Lilliefors test

# Normal distribution
set.seed(777)
x <- rnorm(250,10,1)
#install.packages("nortest")
library(nortest)
lillie.test(x)

## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  x
## D = 0.0243, p-value = 0.9743

# Non Normal distribution
x = rt(100, df=3)
lillie.test(x)

## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  x
## D = 0.1459, p-value = 1.821e-05

Kolmogorov-Smirnov test

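set.seed(777)
x <- rnorm(250,10,1)
# The mean and standard deviation are estimated from the sample itself, so this
# p-value is too lenient; lillie.test() applies the appropriate correction
ks.test(x,"pnorm",mean(x),sd(x))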

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.0243, p-value = 0.9984
## alternative hypothesis: two-sided

# Shapiro-Wilk on the same sample, for comparison
shapiro.test(x)

## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.9978, p-value = 0.9851

Measures of central tendency

Mean

set.seed(777)
x <- rnorm(250,10,1)
mean(x)

## [1] 10.04907

Mode

# Return the most frequent value(s) of x as character strings; ties are all returned
statmod <- function(x) {
  z <- table(as.vector(x))
  names(z)[z == max(z)]
}

statmod(mtcars$drat)

Median

set.seed(777)
x <- rnorm(250,10,1)
median(x)

## [1] 10.0096

Measures of dispersion

Standard deviation

set.seed(777)
x <- rnorm(250,10,1)
sd(x)

## [1] 1.016281

Variance

set.seed(777)
x <- rnorm(250,10,1)
var(x)

## [1] 1.032827

Miscellaneous measures

Skewness (should be around 0 for a normal distribution)

Intuitively, skewness is a measure of asymmetry. As a rule of thumb, negative skewness indicates that the mean of the data values is less than the median and that the distribution is left-skewed. Positive skewness indicates that the mean is larger than the median and that the distribution is right-skewed. A perfectly symmetric distribution has a skewness of 0.

Skewness quantifies how symmetrical the distribution is.

- A symmetrical distribution has a skewness of zero.
- An asymmetrical distribution with a long tail to the right (higher values) has a positive skew.
- An asymmetrical distribution with a long tail to the left (lower values) has a negative skew.
- The skewness is unitless.
- Any threshold or rule of thumb is arbitrary, but here is one: if the skewness is greater than 1.0 (or less than -1.0), the skewness is substantial and the distribution is far from symmetrical.
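The sign conventions listed above can be checked with a moment-based formula: the third central moment divided by the 1.5 power of the second. This is a sketch in base R; the function name skew_moment is hypothetical, and packaged functions such as skewness() in e1071 apply a small-sample adjustment by default, so their values may differ slightly.

```r
# Moment-based skewness (hypothetical helper, no small-sample correction)
skew_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_moment(c(1, 2, 3, 4, 5))        # symmetric values: zero
skew_moment(c(1, 1, 1, 2, 10))       # long tail to the right: positive
skew_moment(c(-10, -2, -1, -1, -1))  # long tail to the left: negative
```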

# Both moments and e1071 provide skewness() and kurtosis();
# e1071 is loaded last, so its versions mask those from moments.
#install.packages("moments")
#install.packages("e1071")

set.seed(777)
x <- rnorm(250,10,1)
# skewness should be close to 0 for normal data

library(moments)
library(e1071)

## 
## Attaching package: 'e1071'
## 
## The following object is masked from 'package:moments':
## 
##     kurtosis, moment, skewness

skewness(x)

## [1] 0.1004742

Kurtosis (excess kurtosis should be around 0 for a normal distribution; raw kurtosis around 3)

Intuitively, kurtosis is a measure of the peakedness of the data distribution. Negative excess kurtosis indicates a flat distribution, which is said to be platykurtic. Positive excess kurtosis indicates a peaked distribution, which is said to be leptokurtic. The normal distribution has zero excess kurtosis and is said to be mesokurtic.

We use kurtosis as a measure of peakedness (or flatness). Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution.

Kurtosis quantifies whether the shape of the data distribution matches the Gaussian distribution.

- A Gaussian distribution has a kurtosis of 0.
- A flatter distribution has a negative kurtosis.
- A distribution more peaked than a Gaussian distribution has a positive kurtosis.
- Kurtosis has no units.
- The value that Prism reports is sometimes called the excess kurtosis, since the expected kurtosis for a Gaussian distribution is 0.0.
- An alternative definition of kurtosis is computed by adding 3 to the value reported by Prism. With this definition, a Gaussian distribution is expected to have a kurtosis of 3.0.
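The two definitions in the list above differ only by the constant 3, which a moment-based calculation makes explicit. This is a sketch in base R; the helper names kurt_raw and kurt_excess are hypothetical and apply no small-sample correction, so packaged functions may return slightly different values.

```r
# Raw kurtosis: fourth central moment divided by the squared second central moment
kurt_raw <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}
# Excess kurtosis: raw kurtosis minus 3, the value expected for a Gaussian
kurt_excess <- function(x) kurt_raw(x) - 3

set.seed(777)
x <- rnorm(250, 10, 1)
kurt_raw(x)      # near 3 for Gaussian data
kurt_excess(x)   # near 0 for Gaussian data

u <- runif(10000)
kurt_excess(u)   # a uniform sample is flatter than a Gaussian: negative
```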

set.seed(777)
x <- rnorm(250,10,1)

#install.packages("moments")
library(moments)

#install.packages("e1071")
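library(e1071)
# e1071 is loaded last, so this calls e1071::kurtosis(), which returns the excess
# kurtosis (about 0 for normal data); moments::kurtosis() returns the raw value (about 3)
kurtosis(x)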

## [1] 0.06239684

Graphical methods

Histogram

set.seed(777)
x <- rnorm(250,10,1)
hist(x)

Histogram with fitted normal density

# Normal distribution
set.seed(777)
x <- rnorm(250,10,1)
hist(x,prob=TRUE, ylim=c(0,0.4), main="Normally distributed")
xbar <- mean(x)
S <- sd(x)
# Overlay the normal density with the sample mean and standard deviation
curve(dnorm(x,xbar,S), add=TRUE)

# Non-normal distribution
x <- rt(100, df=3)
hist(x,prob=TRUE, ylim=c(0,0.4),main="Not normally distributed")
xbar <- mean(x)
S <- sd(x)
curve(dnorm(x,xbar,S), add=TRUE)

Q-Q plot: for normal data the points should lie close to the straight line

In statistics, a Q-Q plot (“Q” stands for quantile) is a probability plot, a graphical method for comparing two probability distributions by plotting their quantiles against each other.
First, the set of intervals for the quantiles is chosen.
A point (x, y) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate).
The plot is thus a parametric curve whose parameter is the index of the quantile interval.
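The construction described above can be reproduced by hand for the normal case. The sketch below uses base R only; the variable names are illustrative. It pairs the sorted sample with standard normal quantiles taken at the plotting positions returned by ppoints(), which are the same coordinates qqnorm() computes.

```r
set.seed(777)
x <- rnorm(250, 10, 1)

n           <- length(x)
probs       <- ppoints(n)    # plotting positions, one per observation
theoretical <- qnorm(probs)  # quantiles of the standard normal (x-coordinates)
observed    <- sort(x)       # sample quantiles (y-coordinates)

plot(theoretical, observed,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")

# qqnorm() returns the same coordinates, in the original order of x
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), observed)
```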

# Normal distribution
set.seed(777)
x <- rnorm(250,10,1)
qqnorm(x,main = "Normally distributed Q-Q Plot")
qqline(x)

# Non Normal distribution
x = rt(100, df=3)
qqnorm(x, main = "Not normally distributed Q-Q Plot")
qqline(x)

Probability plot: for normal data the points should lie close to the straight line

Possibly less useful than the Q-Q plot.

# Normal distribution
set.seed(777)
x <- rnorm(250,10,1)
#install.packages("e1071")
library(e1071)
probplot(x, qdist=qnorm)

# Non Normal distribution
x = rt(100, df=3)
#install.packages("e1071")
library(e1071)
probplot(x, qdist=qnorm)