Recipe 7: Resampling Methods: ANOVA Testing with Boostrap

Design of Experiments

Caroline Hsia

Rensselaer Polytechnic Institute

November 5, 2014 V1

1. Setting

System under test

This study takes a dataset from “100+ Interesting data Sets for Statistics”. The dataset was actually in the “Ecdat” package within R. It was collected in order to record the number of accidents based on certain types of ships.

#Retreive the dataset
remove(list=ls())
require("Ecdat")
## Loading required package: Ecdat
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
library("Ecdat", lib.loc="~/R/win-library/3.1")
data = Accident
attach(data)

Factors and Levels

The data being analyzed has 40 observations on 5 variables.

#Brief look at the dataset
head(data)
##   type constr operate months acc
## 1    A  C6064   O6074    127   0
## 2    A  C6064   O7579     63   0
## 3    A  C6569   O6074   1095   3
## 4    A  C6569   O7579   1095   4
## 5    A  C7074   O6074   1512   6
## 6    A  C7074   O7579   3353  18
tail(data)
##    type constr operate months acc
## 35    E  C6569   O6074    789   7
## 36    E  C6569   O7579    437   7
## 37    E  C7074   O6074   1157   5
## 38    E  C7074   O7579   2161  12
## 39    E  C7579   O6074     NA  NA
## 40    E  C7579   O7579    542   1
str(data)
## 'data.frame':    40 obs. of  5 variables:
##  $ type   : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ constr : Factor w/ 4 levels "C6064","C6569",..: 1 1 2 2 3 3 4 4 1 1 ...
##  $ operate: Factor w/ 2 levels "O6074","O7579": 1 2 1 2 1 2 1 2 1 2 ...
##  $ months : int  127 63 1095 1095 1512 3353 NA 2244 44882 17176 ...
##  $ acc    : int  0 0 3 4 6 18 NA 11 39 29 ...
summary (data)
##  type    constr    operate       months           acc      
##  A:8   C6064: 9   O6074:19   Min.   :   45   Min.   : 0.0  
##  B:8   C6569:10   O7579:21   1st Qu.:  371   1st Qu.: 1.0  
##  C:8   C7074:10              Median : 1095   Median : 4.0  
##  D:8   C7579:11              Mean   : 4811   Mean   :10.5  
##  E:8                         3rd Qu.: 2223   3rd Qu.:11.8  
##                              Max.   :44882   Max.   :58.0  
##                              NA's   :6       NA's   :6
#Graphical representation of data variables
boxplot(acc~type, data = data)

plot of chunk unnamed-chunk-2

#Histogram of Response variable
hist(acc)

plot of chunk unnamed-chunk-2

Continuous variables (if any)

The continuous variables in this dataset include type, operate, and constr. Operate is the year operators which is based on a number of levels. The variable constr describes the year constructed, also based on a level system. type is the number of service amounts for that one ship.

Response variables

The response variable for this dataset is acc, which represents accidents.

The Data: How is it organized and what does it look like?

The data has 40 observations across 5 different variables.

Randomization

This data was collected by another source and it is unknown if it was collected through a randomly designed experiment.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

This data was publically available for anyone to use and perform analysis on. In this analysis, we will be using a 1 factor ANOVA testing and then performing resampling within the data to further perform ANOVA testing.

The null hypothesis that will be tested is:

The variance in type of service has no significant impact on the variance of accidents.

This makes the alternate hypothesis:

The variance in type of service has a significant impact on the variance of accidents.

What is the rationale for this design?

The rationale for this design comes from the use of the type of analysis we are performing - ANCOVA. It is used to analyse the effects of a categorical variable on a continuous dependent variable. We want to see the affect that the neighboring states and demographics has on the cigarette sales in packs per capita.

Randomize: What is the Randomization Scheme?

This data was collected with no intention, just for data collection.

Replicate: Are there replicates and/or repeated measures?

There were no replicates collected for this dataset.

Block: Did you use blocking in the design?

No blocking was used in the design

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

# Initial ANOVA testing
model = lm (acc~type, data = data)
anova(model)
## Analysis of Variance Table
## 
## Response: acc
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## type       4   5901    1475    18.9 9.8e-08 ***
## Residuals 29   2269      78                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
shapiro.test(acc)
## 
##  Shapiro-Wilk normality test
## 
## data:  acc
## W = 0.6894, p-value = 3.396e-07
qqnorm(residuals(model))
qqline(residuals(model))

plot of chunk unnamed-chunk-3

plot(fitted(model), residuals(model))

plot of chunk unnamed-chunk-3

From this ANOVA test, the p-value of 9.81e-08 returned was less than 0.05. We can reject the null hypothesis that the variance in type of service has no significant effect on the variance of accidents.

The results of the Shapiro-Wilk test gives us a p-value of 3.39e-07, which indicates that the data is not really normally distributed.

The qq plot is fairly normal for most observations. However, many points still deviate from the normal line quite a bit. With the Shapiro-Wilk test and the residuals test, we can conclude that the data is not assumed normal.

Based on the residuals plot, there is no trend in the scatter plot, but the residuals are fairly well distributed about 0. This tells us taht the model we have chosen was a good choice.

Resampling

A resampling ANOVA using bootstrap will be performed. The bootstrapping method allows us to sample from a larger parent distribution, knowing nothing about it for certain. The sample we will be pulling from is the orginal dataset.

databootstrap <-subset (data, acc >=10)
with(databootstrap,tapply(acc,type,mean))
##     A     B     C     D     E 
## 14.50 36.14    NA 11.00 12.00
with(databootstrap,tapply(acc,type,var))
##     A     B     C     D     E 
##  24.5 299.1    NA    NA    NA
with(databootstrap,tapply(acc,type,length))
##  A  B  C  D  E 
##  2  7 NA  1  1
summary(aov(acc~type, data = databootstrap))
##             Df Sum Sq Mean Sq F value Pr(>F)
## type         3   1373     458    1.76   0.24
## Residuals    7   1819     260
meanstar = with(databootstrap, tapply(acc, type, mean))
meanstar
##     A     B     C     D     E 
## 14.50 36.14    NA 11.00 12.00
meanacc = mean(acc)
sdstar = sqrt(78.3)
simtype = type


grpA = acc[type=='A'] - meanstar[1]
grpB = acc[type=='B'] - meanstar[2]
grpC = acc[type=='C'] - meanstar[3]
grpD = acc[type=='D'] - meanstar[4]
grpE = acc[type=='E'] - meanstar[5]

Runs = 1000
Fstar = numeric(Runs)

for (i in 1:Runs) 
  {A = sample(grpA, size = 8, replace = T)
  B = sample(grpB, size = 8, replace = T)
  C = sample(grpC, size = 8, replace = T)
  D = sample(grpD, size = 8, replace = T)
  E = sample(grpE, size = 8, replace = T)
  simacc = c(A,B,C,D,E)
  simdata = data.frame(simacc,simtype)
  Fstar[i] = oneway.test(simacc~simtype, var.equal=T, data=simdata)$statistic}

hist(Fstar, prob=T)
x=seq(0.05, 6, 0.25)
points(x, y = df(x,1,604), type = "b", col = "red")

plot of chunk unnamed-chunk-5

#Alpha level from Boostrap resampling
print(realFstar<-oneway.test(acc~type, var.equal=T, data = data)$statistic)
##     F 
## 18.85
mean(Fstar>=realFstar)
## [1] 0.026
par(mfrow=c(1,2))

hist(Fstar, prob=TRUE, main = "PDF of Theoretical F Distribution")
x = seq(0.25, 6, 0.25)
points (x, y=df(x,1,85), type = "b", col = "red")

hist(realFstar, , prob= TRUE, main = "PDF of Actual F Distribution")
new1=seq(0.025, 2.5, 0.25)
points(new1, y=df(new1, 1, 95), type = "b", col = "red")

plot of chunk unnamed-chunk-6

#Generate quantiles of analytic F distribution
qf(0.95,3,22)
## [1] 3.049
#Confidence interval estimation
quantile(realFstar, 0.95)
##   95% 
## 18.85
#Estimate standard error
sd(Fstar)
## [1] 5.945
#Bias
mean(Fstar)-mean(realFstar)
## [1] -15.5

After performing the bootstrapping method, it can be seen on the two side by side plots of the theoretical and actual F distribution that they are still signifantly different from one another. We can continue and perform parameter estimation and find confidence intervals, standard error, and bias in the bootstrapped data. With the large number of bootstrap replicates, it can be one of the factors we consider in order to explain the large difference in results.

ANOVA on Bootstrap

##ANOVA of resampling
model2=aov(simacc~simtype, data = data)
anova(model2)
## Analysis of Variance Table
## 
## Response: simacc
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## simtype    3   1513     504    20.3 8.9e-07 ***
## Residuals 24    595      25                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An ANOVA is performed on the bootstrapped data, however, the p-value returned is 0.6767, which makes us fail to reject the null hypothesis that the variance in type of service has no significant impact on the variance of accidents. Before we confirm this result, we will perform ad-hoc analysis on the data.

shapiro.test(simacc)
## 
##  Shapiro-Wilk normality test
## 
## data:  simacc
## W = 0.8135, p-value = 0.0001866
par(mfrow= c(2,2))
plot(model2)

plot of chunk unnamed-chunk-9

The results from the Shapiro-Wilk NOrmality test show us that the data is normally distributed because of the p-value being less than 0.1.

The QQ plot shows us that the values of our boostraped data is normal. However, scale-location plot shows us that the data is pretty scattered over the dynamic range. However, for the other plots, the points look evenly dispersed across the Cartesian plan for the residual error and the fitted model

4. References to the literature

No literature was used.

5. Appendices

A summary of, or pointer to, the raw data