This recipe is examining Ship Accidents from the Ecdat package.Using this data set, we are testing the effect of one factor and 2 explanatory variables on the single response variable, accidents.
#installing package
install.packages("Ecdat", repos='http://cran.us.r-project.org')
## package 'Ecdat' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\tranc3\AppData\Local\Temp\Rtmpm2qduk\downloaded_packages
library("Ecdat", lib.loc="C:/Program Files/R/R-3.1.1/library")
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
#Pulling the Accident data from the library
Ship<-Accident
We are examining the effect of the factor year the ship was constructed (constr) on the response variable accidents (acc). We know when the ships were constructed however cannot control this because it was constructed in the past.
To reduce variation and improve the precision of our factor of interest in our analysis of covariance, we will look at two explanatory variables. The two explanatory variables are the type of ship (type) and the year operated (operate). By looking at the type of ship, that can help explain the accidents if there are certain ships that are more prone to accidents. Also if we look at the year operated, different years could explain why accidents happened, such as bad weather.
The year constructed can be broken into four levels : C6064, C6569, C7074, and C7579.
The explanatory variable type can be broken into 5 types: A, B, C, D, E. Operated year can be broken into two levels: O6074 and O7579.
head(Ship)
## type constr operate months acc
## 1 A C6064 O6074 127 0
## 2 A C6064 O7579 63 0
## 3 A C6569 O6074 1095 3
## 4 A C6569 O7579 1095 4
## 5 A C7074 O6074 1512 6
## 6 A C7074 O7579 3353 18
tail(Ship)
## type constr operate months acc
## 35 E C6569 O6074 789 7
## 36 E C6569 O7579 437 7
## 37 E C7074 O6074 1157 5
## 38 E C7074 O7579 2161 12
## 39 E C7579 O6074 NA NA
## 40 E C7579 O7579 542 1
The continuous variables in the data set are the months of service by the ships and the number of accidents.
In this experiment, we are looking at the number of accidents as the response variable and how it is effected by the year of construction and being explained by the type of ship and year it was operated.
The dataset has 40 observations. It is organized by five variables: Type, constr, operate, months, and acc. Type is the ship type and can range from A-E. Constr is the year that the ship was constructed. There are four different years when the ships were constructed: C6064, C6569, C7074, and C7579. Operate is the year the hsip operated and the two years under observation are O6074 and O7579. Months is the to measure the amount of service of the ship. acc is the number of accidents for the ships.
It is unclear if the data collected was randomly designed.
This experiment is using an analysis of covariance test (ANCOVA) and looking at a single factor, the year constructed. The two explanatory variables, type of ship and year operated are used to predict the response variable accidents.
We know that the type of ship and the year operated have an effect on the number of accidents so these variables are included to improve the precision of our factor of interest. Covariance measures how much two variables change together and the relationship between them.
The dataset is counting the number of accidents so it was not randomized
There are no replicates or repeated measures
There was no blocking used in the design.
summary(Ship)
## type constr operate months acc
## A:8 C6064: 9 O6074:19 Min. : 45 Min. : 0.0
## B:8 C6569:10 O7579:21 1st Qu.: 371 1st Qu.: 1.0
## C:8 C7074:10 Median : 1095 Median : 4.0
## D:8 C7579:11 Mean : 4811 Mean :10.5
## E:8 3rd Qu.: 2223 3rd Qu.:11.8
## Max. :44882 Max. :58.0
## NA's :6 NA's :6
#removing NAs in data
Ship2<-Ship[!is.na(Ship$acc),]
summary(Ship2)
## type constr operate months acc
## A:7 C6064: 8 O6074:14 Min. : 45 Min. : 0.0
## B:7 C6569:10 O7579:20 1st Qu.: 371 1st Qu.: 1.0
## C:7 C7074:10 Median : 1095 Median : 4.0
## D:7 C7579: 6 Mean : 4811 Mean :10.5
## E:6 3rd Qu.: 2223 3rd Qu.:11.8
## Max. :44882 Max. :58.0
boxplot(acc~type, data=Ship2, xlab="Type of Ship", ylab="Number of Accidents")
boxplot(acc~operate, data=Ship2, xlab="Year Operated", ylab="Number of Accidents")
boxplot(acc~constr, data=Ship2, xlab="year constructed", ylab="Number of Accidents")
When looking at the boxplot of the type of ships,it appears that ships that were type B had more accidents than other types of ships. Also, the range of values of accidents for Type B was the largest. Type C has the smallest range of values for accidents and seem to be less prone to having accidents. When looking at years that the ships were operated, Ships operated in O7579 had a larger range of values for the number of accidents. The medians for the years seem about the same but O6074 seems more normally distributed while O7579 is skewed.When looking at the year constructed, half of the ship constructed during the year of C6064 experienced zero accidents. However the upper quartile for C6064 is larger than the interquartile ranges of the other years.
model1=lm(acc~constr+type+operate, data=Ship2)
anova(model1)
## Analysis of Variance Table
##
## Response: acc
## Df Sum Sq Mean Sq F value Pr(>F)
## constr 3 250 83 1.07 0.38
## type 4 5906 1476 18.94 2.8e-07 ***
## operate 1 66 66 0.85 0.37
## Residuals 25 1948 78
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis for the analysis of covariance is that the variance in the number of accidents cannot be explained by anything other than randomization. The alternative hypothesis is that the variation in the number of accidents can be explained by something other than randomization.
From our model and looking at the year constructed, we would fail to reject the null hypothesis and the number of accidents cannot be explained by anything other than randomization with a p-value of .3 and an F value of 1.07. When looking at the type of ship, a p-value of 2.8e-07 would lead us to believe that the variation in accidents can be explained by the type of ship. We could see this from the boxplots when ship type B had values higher than any other type of ship. However, year operated may not have been a good explanatory variable with such a high p value. ### Diagnostics/Model Adequacy Checking
qqnorm(residuals(model1))
qqline(residuals(model1))
plot(fitted(model1), residuals(model1))
A Q-Q plot can be used to compare the shape of the distribution of the dataset. The Q-Q plot and Q-Q line of the residuals appear to be normal. However at each of the ends, the residuals seem to stray from the line. When looking at the plot of the fitted model and the residuals, a good fit would be if the plot was symmetric at zero and had a high variation. The points seem to be symmetric at zero however the points at the beginning don’t have as much variation and the variation increases towards the right of the plot.
tukey1<-TukeyHSD(aov(acc~constr, data=Ship2), ordered=FALSE, conf.level=.95)
tukey1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = acc ~ constr, data = Ship2)
##
## $constr
## diff lwr upr p adj
## C6569-C6064 4.550 -16.41 25.51 0.9342
## C7074-C6064 3.050 -17.91 24.01 0.9786
## C7579-C6064 -2.917 -26.78 20.94 0.9871
## C7074-C6569 -1.500 -21.26 18.26 0.9968
## C7579-C6569 -7.467 -30.28 15.35 0.8101
## C7579-C7074 -5.967 -28.78 16.85 0.8919
plot(tukey1)
When running a Tukey test, the null hypothesis is that there is no difference between the means of a pair of data, while the alternative states that there is a significant difference between the means.The tukey test creates a set of confidence intervals on the differences between the means of the levels of a factor with the specified family-wise probability of coverage. We would fail to reject the null hypothesis. When looking at the different combinations, all of them include 0 in the 95% confidence levels.
Since the dataset was counting the number of accidents for different types of ships, it is possible that all of the values were from chance.There may be other nuisance factors that were not recorded in real life that came into play. Also when looking at the boxplots of the variables we were testing, the distributions werent normal. The Kruskal Wallis test does not assume a normal distrubtion of the residuals and could be used to test if the variation can be attributed to anything other than randomization without assuming a normal distribution.
kruskal.test(acc~type, data=Ship2)
##
## Kruskal-Wallis rank sum test
##
## data: acc by type
## Kruskal-Wallis chi-squared = 18.21, df = 4, p-value = 0.001124
kruskal.test(acc~operate, data=Ship2)
##
## Kruskal-Wallis rank sum test
##
## data: acc by operate
## Kruskal-Wallis chi-squared = 0.1647, df = 1, p-value = 0.6848
kruskal.test(acc~constr, data=Ship2)
##
## Kruskal-Wallis rank sum test
##
## data: acc by constr
## Kruskal-Wallis chi-squared = 4.859, df = 3, p-value = 0.1824
When looking at the results of the Kruskal-Wallis test, we would reject the null hypothesis and the type of ship can help explain the variation in the number of accidents. However we would fail to reject the null hypothesis when looking at the year operated or year constructed.