The bootstrapping approach works broadly because it does not assume any particular underlying distribution of the data; it only requires that the sample be representative of the population.
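As a minimal sketch of the idea (illustrative data; the percentile method is one of several bootstrap CI constructions):
set.seed(1)
x <- rexp(50) # skewed sample: no normality here
boot_means <- replicate(2000, mean(sample(x, replace = TRUE))) # resample with replacement
quantile(boot_means, c(0.025, 0.975)) # 95% percentile bootstrap CI for the mean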
How sensitive is the two-sample t-test to violations of the normality assumption? We can test this with the rpois() function, trying different lambda values.
set.seed(123)
hist(rpois(100, 10))
hist(rpois(100, 5))
hist(rpois(100, 2)) # clearly non-normal: right-skewed count data
psum <- c()
for (r in 2:21) { # lambda values 2 through 21, as in the original loop
  pval <- c()
  for (count in 1:1000) {
    sample1 <- rpois(100, r) # two independent samples from the same Poisson
    sample2 <- rpois(100, r)
    pval <- c(pval, t.test(sample1, sample2)$p.value)
  }
  psum <- c(psum, mean(pval < 0.05)) # empirical Type I error rate at alpha = 0.05
}
psum
## [1] 0.048 0.050 0.059 0.039 0.039 0.047 0.061 0.044 0.039 0.046 0.051 0.053
## [13] 0.046 0.045 0.052 0.047 0.051 0.046 0.043 0.045
The two-sample t-test is fairly robust to violations of the normality assumption: across these lambda values the rejection rate stays close to the nominal 0.05. It can still break down under extreme conditions, such as very small samples from a strongly skewed distribution.
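To probe one such extreme case (an assumed setting, not from the simulation above: n = 15 per group from a strongly skewed Poisson with lambda = 0.5), the same check can be rerun and the rejection rate compared with the nominal level:
set.seed(123)
pval <- replicate(1000, t.test(rpois(15, 0.5), rpois(15, 0.5))$p.value)
mean(pval < 0.05) # may drift from the nominal 0.05 with tiny, skewed samples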
The purpose of the t-test is to compare groups through a summary statistic, and the mean is a good representative of a group when the population is normally distributed. This is why the normality assumption matters for the t-test.
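A quick illustration of why the mean stops being representative under skew (illustrative data):
set.seed(1)
x <- rexp(1000) # strongly right-skewed
c(mean = mean(x), median = median(x)) # the mean is pulled toward the long tail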
The bootstrap testing recipe: first consider what the null hypothesis implies; sample with replacement to create an empirical sampling distribution under that null; run the statistical test on each bootstrap sample; the p-value is then the proportion of bootstrap statistics at least as extreme as the observed statistic, and quantiles of the bootstrap distribution give a confidence interval.
# example 1
hist(iris$Petal.Width) # clearly non-normal: multimodal across the three species
one.way <- aov(Petal.Width ~ Species, data = iris)
anov <- summary(one.way)[[1]]$`F value`[1] # observed F statistic
summary(one.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 80.41 40.21 960 <2e-16 ***
## Residuals 147 6.16 0.04
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
set.seed(123)
boot <- c()
for (count in 1:1000) {
  # resample from the pooled data (ignoring Species) and attach arbitrary group
  # labels: this simulates the null hypothesis of no group differences
  sample1 <- iris[sample(1:nrow(iris), replace = TRUE, size = table(iris$Species)[1]), ]
  sample1$group <- 1
  sample2 <- iris[sample(1:nrow(iris), replace = TRUE, size = table(iris$Species)[2]), ]
  sample2$group <- 2
  sample3 <- iris[sample(1:nrow(iris), replace = TRUE, size = table(iris$Species)[3]), ]
  sample3$group <- 3
  sample <- rbind(sample1, sample2, sample3)
  test <- aov(Petal.Width ~ as.factor(group), data = sample)
  boot <- c(boot, summary(test)[[1]]$`F value`[1])
}
mean(boot > anov) # bootstrap p-value: share of null F values exceeding the observed F
## [1] 0
hist(boot)
quantile(boot,.025)
## 2.5%
## 0.02379108
quantile(boot,.975)
## 97.5%
## 3.848376
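As a sanity check (an added comparison, not part of the original analysis), the upper quantile of this bootstrap null distribution should be in the same ballpark as the theoretical F quantile with the same degrees of freedom:
qf(0.975, df1 = 2, df2 = 147) # theoretical counterpart of quantile(boot, .975)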
By the central limit theorem, the sampling distribution of the mean will be approximately normally distributed regardless of the underlying population distribution, as long as the sample size is reasonably large.
set.seed(123)
hist(rpois(1000, 2))
for (r in 2:11) { # lambda values 2 through 11, as in the original loop
  means <- c()
  for (count in 1:100) {
    means <- c(means, mean(rpois(30, r))) # mean of a fresh sample of size 30
  }
  hist(means, prob = TRUE)
  lines(density(means), col = 4, lwd = 2)
}
The means follow a normal distribution, but other statistics, such as the median, need not; how good the approximation is also depends on the sample size and the original distribution (a sample-size sketch follows the median demo below).
# if median
for (r in 2:11) { # same lambda values, now tracking the sample median
  medians <- c()
  for (count in 1:1000) {
    medians <- c(medians, median(rpois(30, r)))
  }
  hist(medians, prob = TRUE) # lumpy, non-normal: medians of count data are discrete
  lines(density(medians), col = 4, lwd = 2)
}
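To see the sample-size dependence mentioned above, here is a small sketch (assumed settings: n = 5 versus n = 100, lambda = 2) comparing sampling distributions of the mean:
set.seed(123)
for (n in c(5, 100)) {
  means <- replicate(1000, mean(rpois(n, 2)))
  hist(means, prob = TRUE, main = paste("sample size n =", n)) # larger n looks more normal
  lines(density(means), col = 4, lwd = 2)
}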
The critical normality assumption for this type of test concerns the residuals, not the dependent variable itself. Also, with a large sample size the Shapiro-Wilk test will often reject the null even for mild departures from normality.
hist(residuals(one.way))
# Create a normal Q-Q plot of the residuals from the ANOVA model
qqnorm(residuals(one.way), main = "Normal Q-Q Plot of Residuals")
qqline(residuals(one.way))
shapiro.test(residuals(one.way))
##
## Shapiro-Wilk normality test
##
## data: residuals(one.way)
## W = 0.97217, p-value = 0.003866
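As a caution when reading this p-value: with large n, Shapiro-Wilk flags even mild departures. A sketch with assumed near-normal data (a t distribution with 20 degrees of freedom):
set.seed(123)
x <- rt(5000, df = 20) # close to normal, slightly heavier tails
shapiro.test(x) # with n = 5000 (the test's maximum), even this is often rejected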
Since the residuals fail the normality check, a rank-based alternative such as the Kruskal-Wallis test is a reasonable fallback:
kruskal.test(iris$Petal.Width, iris$Species)
##
## Kruskal-Wallis rank sum test
##
## data: iris$Petal.Width and iris$Species
## Kruskal-Wallis chi-squared = 131.19, df = 2, p-value < 2.2e-16
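If the omnibus Kruskal-Wallis test rejects, one possible follow-up (not in the original analysis) is pairwise Wilcoxon tests with a multiplicity correction:
pairwise.wilcox.test(iris$Petal.Width, iris$Species, p.adjust.method = "BH")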