This study takes a dataset from “100+ Interesting Data Sets for Statistics”. The dataset itself ships with the “Ecdat” package in R and records the number of accidents for several types of ships.
#Retrieve the dataset
remove(list=ls())
require("Ecdat")
## Loading required package: Ecdat
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
library("Ecdat", lib.loc="~/R/win-library/3.1")
data = Accident
attach(data)
The data being analyzed has 40 observations on 5 variables.
#Brief look at the dataset
head(data)
## type constr operate months acc
## 1 A C6064 O6074 127 0
## 2 A C6064 O7579 63 0
## 3 A C6569 O6074 1095 3
## 4 A C6569 O7579 1095 4
## 5 A C7074 O6074 1512 6
## 6 A C7074 O7579 3353 18
tail(data)
## type constr operate months acc
## 35 E C6569 O6074 789 7
## 36 E C6569 O7579 437 7
## 37 E C7074 O6074 1157 5
## 38 E C7074 O7579 2161 12
## 39 E C7579 O6074 NA NA
## 40 E C7579 O7579 542 1
str(data)
## 'data.frame': 40 obs. of 5 variables:
## $ type : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ constr : Factor w/ 4 levels "C6064","C6569",..: 1 1 2 2 3 3 4 4 1 1 ...
## $ operate: Factor w/ 2 levels "O6074","O7579": 1 2 1 2 1 2 1 2 1 2 ...
## $ months : int 127 63 1095 1095 1512 3353 NA 2244 44882 17176 ...
## $ acc : int 0 0 3 4 6 18 NA 11 39 29 ...
summary (data)
## type constr operate months acc
## A:8 C6064: 9 O6074:19 Min. : 45 Min. : 0.0
## B:8 C6569:10 O7579:21 1st Qu.: 371 1st Qu.: 1.0
## C:8 C7074:10 Median : 1095 Median : 4.0
## D:8 C7579:11 Mean : 4811 Mean :10.5
## E:8 3rd Qu.: 2223 3rd Qu.:11.8
## Max. :44882 Max. :58.0
## NA's :6 NA's :6
#Graphical representation of data variables
boxplot(acc~type, data = data)
#Histogram of Response variable
hist(acc)
The categorical variables in this dataset are type, constr, and operate. The variable operate gives the period of operation and constr gives the construction period, each coded as a set of factor levels. The variable type identifies the type of ship, and months records the aggregate months of service.
The response variable for this dataset is acc, which represents accidents.
The data has 40 observations across 5 different variables.
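As a quick check on the variable descriptions above, the classes of each column and the group sizes can be confirmed directly from the data frame (a minimal sketch added here; output not shown):
#Check which variables are factors and how many ships fall in each type (sketch, not part of the original output)
sapply(data, class)
table(data$type)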
This data was collected by another source, and it is unknown whether it was collected through a randomized, designed experiment.
This data was publicly available for anyone to use and analyze. In this analysis, we will use a one-factor ANOVA and then resample the data (bootstrapping) to perform the ANOVA again.
The null hypothesis that will be tested is:
Ship type has no significant effect on the number of accidents.
This makes the alternate hypothesis:
Ship type has a significant effect on the number of accidents.
The rationale for this design comes from the type of analysis we are performing: a one-way ANOVA. It is used to analyze the effect of a categorical variable on a continuous dependent variable. We want to see the effect that ship type has on the number of accidents.
This data was not collected with any particular experiment in mind; it is an observational record.
There were no replicates collected for this dataset.
No blocking was used in the design.
# Initial ANOVA testing
model = lm (acc~type, data = data)
anova(model)
## Analysis of Variance Table
##
## Response: acc
## Df Sum Sq Mean Sq F value Pr(>F)
## type 4 5901 1475 18.9 9.8e-08 ***
## Residuals 29 2269 78
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
shapiro.test(acc)
##
## Shapiro-Wilk normality test
##
## data: acc
## W = 0.6894, p-value = 3.396e-07
qqnorm(residuals(model))
qqline(residuals(model))
plot(fitted(model), residuals(model))
From this ANOVA test, the p-value of 9.8e-08 is less than 0.05, so we can reject the null hypothesis that ship type has no significant effect on the number of accidents.
The Shapiro-Wilk test gives a p-value of 3.4e-07, which indicates that the response is not normally distributed.
The QQ plot looks fairly normal for most observations, but several points deviate noticeably from the reference line. Together with the Shapiro-Wilk test, this leads us to conclude that the normality assumption does not hold.
The residuals plot shows no obvious trend, and the residuals are fairly well distributed about 0. This suggests that the model we have chosen was a reasonable choice.
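The Shapiro-Wilk test above was applied to the raw response. As an additional check that was not part of the original output, the same test could be applied to the model residuals, which is what the normality assumption actually concerns (a minimal sketch):
#Normality check on the residuals of the fitted model (sketch; output not shown)
shapiro.test(residuals(model))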
A resampling ANOVA using the bootstrap will be performed. Bootstrapping lets us approximate sampling from the unknown parent distribution by resampling, with replacement, from the sample we do have: the original dataset.
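As a minimal illustration of the idea, separate from the group-wise resampling below, a simple bootstrap of the mean accident count resamples the observed values with replacement many times (a sketch, assuming the attached acc vector; output not shown):
#Illustrative bootstrap of the mean of acc (sketch only)
set.seed(1)
accclean = acc[!is.na(acc)]
bootmeans = replicate(1000, mean(sample(accclean, replace = TRUE)))
quantile(bootmeans, c(0.025, 0.975))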
databootstrap <-subset (data, acc >=10)
with(databootstrap,tapply(acc,type,mean))
## A B C D E
## 14.50 36.14 NA 11.00 12.00
with(databootstrap,tapply(acc,type,var))
## A B C D E
## 24.5 299.1 NA NA NA
with(databootstrap,tapply(acc,type,length))
## A B C D E
## 2 7 NA 1 1
summary(aov(acc~type, data = databootstrap))
## Df Sum Sq Mean Sq F value Pr(>F)
## type 3 1373 458 1.76 0.24
## Residuals 7 1819 260
meanstar = with(databootstrap, tapply(acc, type, mean))
meanstar
## A B C D E
## 14.50 36.14 NA 11.00 12.00
meanacc = mean(acc)
sdstar = sqrt(78.3)
simtype = type
#Center each group's accident counts at the group means from the databootstrap subset
grpA = acc[type=='A'] - meanstar[1]
grpB = acc[type=='B'] - meanstar[2]
grpC = acc[type=='C'] - meanstar[3]
grpD = acc[type=='D'] - meanstar[4]
grpE = acc[type=='E'] - meanstar[5]
#Resample each centered group with replacement and store the F statistic from each run
Runs = 1000
Fstar = numeric(Runs)
for (i in 1:Runs) {
  A = sample(grpA, size = 8, replace = T)
  B = sample(grpB, size = 8, replace = T)
  C = sample(grpC, size = 8, replace = T)
  D = sample(grpD, size = 8, replace = T)
  E = sample(grpE, size = 8, replace = T)
  simacc = c(A, B, C, D, E)
  simdata = data.frame(simacc, simtype)
  Fstar[i] = oneway.test(simacc~simtype, var.equal=T, data=simdata)$statistic
}
hist(Fstar, prob=T)
x=seq(0.05, 6, 0.25)
points(x, y = df(x,1,604), type = "b", col = "red")
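For reference, the analytic F density with the degrees of freedom from the ANOVA table above (4 and 29) could be overlaid as well; this is an added sketch, not part of the original plot:
#Overlay the F density using the ANOVA degrees of freedom (sketch)
curve(df(x, 4, 29), from = 0.05, to = 6, add = TRUE, col = "blue")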
#Empirical p-value from bootstrap resampling
print(realFstar<-oneway.test(acc~type, var.equal=T, data = data)$statistic)
## F
## 18.85
mean(Fstar>=realFstar)
## [1] 0.026
par(mfrow=c(1,2))
hist(Fstar, prob=TRUE, main = "PDF of Theoretical F Distribution")
x = seq(0.25, 6, 0.25)
points (x, y=df(x,1,85), type = "b", col = "red")
hist(realFstar, prob = TRUE, main = "PDF of Actual F Distribution")
new1=seq(0.025, 2.5, 0.25)
points(new1, y=df(new1, 1, 95), type = "b", col = "red")
#Generate quantiles of analytic F distribution
qf(0.95,3,22)
## [1] 3.049
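For comparison, and as an added sketch rather than part of the original output, the same quantile could be taken with the degrees of freedom from the initial ANOVA table (4 and 29):
#Analytic F quantile using the ANOVA degrees of freedom (sketch; not run here)
qf(0.95, 4, 29)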
#Confidence interval estimation
quantile(realFstar, 0.95)
## 95%
## 18.85
#Estimate standard error
sd(Fstar)
## [1] 5.945
#Bias
mean(Fstar)-mean(realFstar)
## [1] -15.5
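Since realFstar is a single observed statistic, percentile summaries are arguably more informative when taken from the full vector of bootstrap statistics; a minimal sketch, assuming the Fstar vector from the loop above (output not shown):
#Percentile summary of the bootstrap F statistics (sketch)
quantile(Fstar, c(0.05, 0.95))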
After performing the bootstrap, the two side-by-side plots of the theoretical and actual F distributions show that they are still significantly different from one another. We can continue with parameter estimation and obtain a confidence interval, the standard error, and the bias of the bootstrapped statistic. The large number of bootstrap replicates is one factor to consider when explaining the large difference in results.
#ANOVA of the resampled data
model2 = aov(simacc~simtype, data = simdata)
anova(model2)
## Analysis of Variance Table
##
## Response: simacc
## Df Sum Sq Mean Sq F value Pr(>F)
## simtype 3 1513 504 20.3 8.9e-07 ***
## Residuals 24 595 25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An ANOVA is performed on the bootstrapped data; the p-value returned is 8.9e-07, which again leads us to reject the null hypothesis that ship type has no significant effect on the number of accidents. Before we confirm this result, we will run additional diagnostic checks on the data.
shapiro.test(simacc)
##
## Shapiro-Wilk normality test
##
## data: simacc
## W = 0.8135, p-value = 0.0001866
par(mfrow= c(2,2))
plot(model2)
The results from the Shapiro-Wilk normality test show that the bootstrapped data is not normally distributed, since the p-value of 0.0001866 is well below 0.05.
The QQ plot shows that the bootstrapped values follow the normal line reasonably well in the middle of the range. The scale-location plot shows the data is fairly scattered over the dynamic range, while in the remaining plot the residuals look evenly dispersed around the fitted values.
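As a further step that was not part of the original analysis, pairwise post-hoc comparisons of the ship types could be sketched with Tukey's HSD on the fitted aov object (a sketch only; output not shown):
#Pairwise post-hoc comparisons on the resampled model (sketch)
TukeyHSD(model2)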
No literature was used.