Hypotheses
H0: The sample is Gaussian (normal distribution).
H1: The sample is not Gaussian (non-normal distribution).
If p <= 0.05, reject H0: the sample is not Gaussian.
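As a sketch of this decision rule (the is_gaussian helper name is mine, not from any package; the Shapiro-Wilk test used here is introduced below):
# Hypothetical helper: TRUE when the Shapiro-Wilk test does not reject normality
is_gaussian <- function(x, alpha = 0.05) {
  shapiro.test(x)$p.value > alpha
}
set.seed(777)
is_gaussian(rnorm(250, 10, 1))   # TRUE expected for normal data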
Types of tests
KOLMOGOROV-SMIRNOV (K-S): One of the oldest tests, and the most "lenient" in its tendency to accept a given sample as Gaussian.
LILLIEFORS: A more accurate variant of the K-S test, less "lenient" than it.
SHAPIRO-WILK W: The most widely used test, given its statistical power, and also the most "strict" about accepting the hypothesis of Gaussianity. It should be combined with the histogram and the Q-Q plot. It works well for up to 5000 observations, a limit imposed by the R function. It works in tandem with the Anderson-Darling test.
ANDERSON-DARLING: A test based on the empirical distribution function that gives extra weight to the tails; it goes along with the Shapiro-Wilk test.
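A quick sketch applying the four tests to one sample (this assumes the nortest package, used again below, is installed; ad.test and lillie.test come from it, the other two from base R):
#install.packages("nortest")
library(nortest)
set.seed(777)
x <- rnorm(250,10,1)
shapiro.test(x)$p.value                      # Shapiro-Wilk
ad.test(x)$p.value                           # Anderson-Darling
lillie.test(x)$p.value                       # Lilliefors
ks.test(x, "pnorm", mean(x), sd(x))$p.value  # Kolmogorov-Smirnov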
Central limit theorem
If random samples are taken from any population, the distribution of the sample means (the sampling distribution) is approximately normally distributed.
This approximation improves as the samples get larger.
Imagine taking repeated samples of size n from a population, calculating the mean of a variable each time, and then drawing a histogram of those means: that is what is called a sampling distribution.
A sampling distribution has the same mean as the population, but its standard deviation is smaller: it is the standard error, the population standard deviation divided by sqrt(n).
When sampling from a normally distributed population, the sampling distribution will be normally distributed.
When sampling from a non-normally distributed population, the sampling distribution will be approximately normally distributed.
The Central Limit Theorem (CLT) shows that the normal distribution can be relevant even if the variable itself is not normally distributed.
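A minimal simulation of the CLT (the exponential population and n = 30 are my choices for illustration): sample means from a skewed exponential population, with mean 1 and standard deviation 1, pile up around a normal curve with standard deviation 1/sqrt(n).
set.seed(777)
n <- 30
means <- replicate(10000, mean(rexp(n, rate = 1)))
hist(means, prob = TRUE, main = "Sampling distribution of the mean")
curve(dnorm(x, mean = 1, sd = 1/sqrt(n)), add = TRUE)
c(sd = sd(means), theory = 1/sqrt(n))   # sampling sd should be close to 1/sqrt(n)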
# Normal distribution
set.seed(777)
x <- rnorm(250,10,1)
shapiro.test(x)
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.9978, p-value = 0.9851
# Non Normal distribution
x = rt(100, df=3)
shapiro.test(x)
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.8221, p-value = 1.254e-09
The Anderson-Darling test goes along with the Shapiro-Wilk test.
# Normal distribution
#install.packages("nortest")
library(nortest)
set.seed(777)
x <- rnorm(250,10,1)
ad.test(x)
##
## Anderson-Darling normality test
##
## data: x
## A = 0.1394, p-value = 0.9745
# Non Normal distribution
x = rt(100, df=3)
ad.test(x)
# Normal distribution
set.seed(777)
x <- rnorm(250,10,1)
#install.packages("nortest")
library(nortest)
lillie.test(x)
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: x
## D = 0.0243, p-value = 0.9743
# Non Normal distribution
x = rt(100, df=3)
lillie.test(x)
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: x
## D = 0.1459, p-value = 1.821e-05
set.seed(777)
x <- rnorm(250,10,1)
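# Note: estimating the mean and sd from the sample itself makes the plain
# K-S p-value too optimistic; the Lilliefors test above corrects for this.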
ks.test(x,"pnorm",mean(x),sqrt(var(x)))
##
## One-sample Kolmogorov-Smirnov test
##
## data: x
## D = 0.0243, p-value = 0.9984
## alternative hypothesis: two-sided
shapiro.test(x)
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.9978, p-value = 0.9851
set.seed(777)
x <- rnorm(250,10,1)
mean(x)
## [1] 10.04907
statmod <- function(x) {
  z <- table(as.vector(x))   # frequency of each distinct value
  names(z)[z == max(z)]      # most frequent value(s): the mode
}
statmod(mtcars$drat)
set.seed(777)
x <- rnorm(250,10,1)
median(x)
## [1] 10.0096
set.seed(777)
x <- rnorm(250,10,1)
sd(x)
## [1] 1.016281
set.seed(777)
x <- rnorm(250,10,1)
var(x)
## [1] 1.032827
Intuitively, skewness is a measure of symmetry. As a rule, negative skewness indicates that the mean of the data values is less than the median and the distribution is left-skewed; positive skewness indicates that the mean is larger than the median and the distribution is right-skewed. If the skewness is 0, the distribution is perfectly symmetric.
Skewness quantifies how symmetrical the distribution is.
- A symmetrical distribution has a skewness of zero.
- An asymmetrical distribution with a long tail to the right (higher values) has a positive skew.
- An asymmetrical distribution with a long tail to the left (lower values) has a negative skew.
- Skewness is unitless.
- Any threshold or rule of thumb is arbitrary, but here is one: if the skewness is greater than 1.0 (or less than -1.0), the skewness is substantial and the distribution is far from symmetrical.
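As a quick check of the sign convention (the exponential sample is my example of a right-skewed population; its theoretical skewness is 2):
#install.packages("e1071")
library(e1071)
set.seed(777)
y <- rexp(250, rate = 1)   # long right tail
skewness(y)                # clearly positive expected, near 2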
I have not yet explored the moments package, but it is loaded below.
set.seed(777)
x <- rnorm(250,10,1)
# moments and e1071 both provide skewness and kurtosis;
# for a normal sample, skewness should be close to 0
#install.packages("moments")
library(moments)
#install.packages("e1071")
library(e1071)
##
## Attaching package: 'e1071'
##
## The following objects are masked from 'package:moments':
##
## kurtosis, moment, skewness
skewness(x)
## [1] 0.1004742
Intuitively, kurtosis is a measure of the peakedness of the data distribution. Negative kurtosis indicates a flat distribution, said to be platykurtic. Positive kurtosis indicates a peaked distribution, said to be leptokurtic. Incidentally, the normal distribution has zero kurtosis in the excess-kurtosis convention used below, and is said to be mesokurtic.
We use kurtosis as a measure of peakedness (or flatness). Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution.
Kurtosis quantifies whether the shape of the data distribution matches the Gaussian distribution.
- A Gaussian distribution has a kurtosis of 0.
- A flatter distribution has a negative kurtosis.
- A distribution more peaked than a Gaussian distribution has a positive kurtosis.
- Kurtosis has no units.
- The value that Prism reports is sometimes called the excess kurtosis, since the expected kurtosis for a Gaussian distribution is 0.0.
- An alternative definition of kurtosis is computed by adding 3 to the value reported by Prism. With this definition, a Gaussian distribution is expected to have a kurtosis of 3.0.
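The analogous check for kurtosis (a t distribution with 5 df is my example of a heavy-tailed population; its theoretical excess kurtosis is 6/(df - 4) = 6):
#install.packages("e1071")
library(e1071)
set.seed(777)
y <- rt(250, df = 5)   # heavier tails than the normal
kurtosis(y)            # positive expected (excess-kurtosis convention)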
set.seed(777)
x <- rnorm(250,10,1)
#install.packages("moments")
library(moments)
#install.packages("e1071")
library(e1071)
kurtosis(x)
## [1] 0.06239684
set.seed(777)
x <- rnorm(250,10,1)
hist(x)
# Normal distribution
set.seed(777)
x <- rnorm(250,10,1)
hist(x,prob=T, ylim=c(0,0.4), main="Normally distributed")
xbar = mean(x)
S = sd(x)
curve(dnorm(x,xbar,S), add=TRUE)
# Non Normal distribution
x = rt(100, df=3)
hist(x,prob=T, ylim=c(0,0.4),main="Not normally distributed")
xbar = mean(x)
S = sd(x)
curve(dnorm(x,xbar,S), add=TRUE)
In statistics, a Q-Q plot ("Q" stands for quantile) is a probability plot: a graphical method for comparing two probability distributions by plotting their quantiles against each other.
First, a set of intervals for the quantiles is chosen.
A point (x, y) on the plot corresponds to one of the quantiles of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate).
Thus the line is a parametric curve whose parameter is the (number of the) interval for the quantile.
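To make this concrete, a hand-rolled version of the normal Q-Q plot that qqnorm() draws below: sample quantiles plotted against the corresponding standard-normal quantiles.
set.seed(777)
x <- rnorm(250,10,1)
p <- ppoints(length(x))   # evenly spaced probabilities in (0,1)
plot(qnorm(p), quantile(x, p),
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")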
# Normal distribution
set.seed(777)
x <- rnorm(250,10,1)
qqnorm(x,main = "Normally distributed Q-Q Plot")
qqline(x)
# Non Normal distribution
x = rt(100, df=3)
qqnorm(x, main = "Not normally distributed Q-Q Plot")
qqline(x)
Maybe not so useful.
# Normal distribution
set.seed(777)
x <- rnorm(250,10,1)
#install.packages("e1071")
library(e1071)
probplot(x, qdist = qnorm)
# Non Normal distribution
x = rt(100, df=3)
#install.packages("e1071")
library(e1071)
probplot(x, qdist=qnorm)