Cheryl Tran

RPI

10/30/2014 Version 1

1. Setting

System under test

This recipe is examining Ship Accidents from the Ecdat package.Using this data set, we are testing the effect of one factor and 2 explanatory variables on the single response variable, accidents.

#installing package
install.packages("Ecdat", repos='http://cran.us.r-project.org')
## package 'Ecdat' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\tranc3\AppData\Local\Temp\Rtmpm2qduk\downloaded_packages
library("Ecdat", lib.loc="C:/Program Files/R/R-3.1.1/library")
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
#Pulling the Accident data from the library
Ship<-Accident

Factors and Levels

We are examining the effect of the factor year the ship was constructed (constr) on the response variable accidents (acc). We know when the ships were constructed however cannot control this because it was constructed in the past.

To reduce variation and improve the precision of our factor of interest in our analysis of covariance, we will look at two explanatory variables. The two explanatory variables are the type of ship (type) and the year operated (operate). By looking at the type of ship, that can help explain the accidents if there are certain ships that are more prone to accidents. Also if we look at the year operated, different years could explain why accidents happened, such as bad weather.

The year constructed can be broken into four levels : C6064, C6569, C7074, and C7579.

The explanatory variable type can be broken into 5 types: A, B, C, D, E. Operated year can be broken into two levels: O6074 and O7579.

head(Ship)
##   type constr operate months acc
## 1    A  C6064   O6074    127   0
## 2    A  C6064   O7579     63   0
## 3    A  C6569   O6074   1095   3
## 4    A  C6569   O7579   1095   4
## 5    A  C7074   O6074   1512   6
## 6    A  C7074   O7579   3353  18
tail(Ship)
##    type constr operate months acc
## 35    E  C6569   O6074    789   7
## 36    E  C6569   O7579    437   7
## 37    E  C7074   O6074   1157   5
## 38    E  C7074   O7579   2161  12
## 39    E  C7579   O6074     NA  NA
## 40    E  C7579   O7579    542   1

Continuous variables (if any)

The continuous variables in the data set are the months of service by the ships and the number of accidents.

Response variables

In this experiment, we are looking at the number of accidents as the response variable and how it is effected by the year of construction and being explained by the type of ship and year it was operated.

The Data: How is it organized and what does it look like?

The dataset has 40 observations. It is organized by five variables: Type, constr, operate, months, and acc. Type is the ship type and can range from A-E. Constr is the year that the ship was constructed. There are four different years when the ships were constructed: C6064, C6569, C7074, and C7579. Operate is the year the hsip operated and the two years under observation are O6074 and O7579. Months is the to measure the amount of service of the ship. acc is the number of accidents for the ships.

Randomization

It is unclear if the data collected was randomly designed.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

This experiment is using an analysis of covariance test (ANCOVA) and looking at a single factor, the year constructed. The two explanatory variables, type of ship and year operated are used to predict the response variable accidents.

What is the rationale for this design?

We know that the type of ship and the year operated have an effect on the number of accidents so these variables are included to improve the precision of our factor of interest. Covariance measures how much two variables change together and the relationship between them.

Randomize: What is the Randomization Scheme?

The dataset is counting the number of accidents so it was not randomized

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures

Block: Did you use blocking in the design?

There was no blocking used in the design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

summary(Ship)
##  type    constr    operate       months           acc      
##  A:8   C6064: 9   O6074:19   Min.   :   45   Min.   : 0.0  
##  B:8   C6569:10   O7579:21   1st Qu.:  371   1st Qu.: 1.0  
##  C:8   C7074:10              Median : 1095   Median : 4.0  
##  D:8   C7579:11              Mean   : 4811   Mean   :10.5  
##  E:8                         3rd Qu.: 2223   3rd Qu.:11.8  
##                              Max.   :44882   Max.   :58.0  
##                              NA's   :6       NA's   :6
#removing NAs in data
Ship2<-Ship[!is.na(Ship$acc),]
summary(Ship2)
##  type    constr    operate       months           acc      
##  A:7   C6064: 8   O6074:14   Min.   :   45   Min.   : 0.0  
##  B:7   C6569:10   O7579:20   1st Qu.:  371   1st Qu.: 1.0  
##  C:7   C7074:10              Median : 1095   Median : 4.0  
##  D:7   C7579: 6              Mean   : 4811   Mean   :10.5  
##  E:6                         3rd Qu.: 2223   3rd Qu.:11.8  
##                              Max.   :44882   Max.   :58.0
boxplot(acc~type, data=Ship2, xlab="Type of Ship", ylab="Number of Accidents")

plot of chunk unnamed-chunk-3

boxplot(acc~operate, data=Ship2, xlab="Year Operated", ylab="Number of Accidents")

plot of chunk unnamed-chunk-3

boxplot(acc~constr, data=Ship2, xlab="year constructed", ylab="Number of Accidents")

plot of chunk unnamed-chunk-3 When looking at the boxplot of the type of ships,it appears that ships that were type B had more accidents than other types of ships. Also, the range of values of accidents for Type B was the largest. Type C has the smallest range of values for accidents and seem to be less prone to having accidents. When looking at years that the ships were operated, Ships operated in O7579 had a larger range of values for the number of accidents. The medians for the years seem about the same but O6074 seems more normally distributed while O7579 is skewed.When looking at the year constructed, half of the ship constructed during the year of C6064 experienced zero accidents. However the upper quartile for C6064 is larger than the interquartile ranges of the other years.

Testing

model1=lm(acc~constr+type+operate, data=Ship2)
anova(model1)
## Analysis of Variance Table
## 
## Response: acc
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## constr     3    250      83    1.07    0.38    
## type       4   5906    1476   18.94 2.8e-07 ***
## operate    1     66      66    0.85    0.37    
## Residuals 25   1948      78                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis for the analysis of covariance is that the variance in the number of accidents cannot be explained by anything other than randomization. The alternative hypothesis is that the variation in the number of accidents can be explained by something other than randomization.

From our model and looking at the year constructed, we would fail to reject the null hypothesis and the number of accidents cannot be explained by anything other than randomization with a p-value of .3 and an F value of 1.07. When looking at the type of ship, a p-value of 2.8e-07 would lead us to believe that the variation in accidents can be explained by the type of ship. We could see this from the boxplots when ship type B had values higher than any other type of ship. However, year operated may not have been a good explanatory variable with such a high p value. ### Diagnostics/Model Adequacy Checking

qqnorm(residuals(model1))
qqline(residuals(model1))

plot of chunk unnamed-chunk-5

plot(fitted(model1), residuals(model1))

plot of chunk unnamed-chunk-5 A Q-Q plot can be used to compare the shape of the distribution of the dataset. The Q-Q plot and Q-Q line of the residuals appear to be normal. However at each of the ends, the residuals seem to stray from the line. When looking at the plot of the fitted model and the residuals, a good fit would be if the plot was symmetric at zero and had a high variation. The points seem to be symmetric at zero however the points at the beginning don’t have as much variation and the variation increases towards the right of the plot.

tukey1<-TukeyHSD(aov(acc~constr, data=Ship2), ordered=FALSE, conf.level=.95)
tukey1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = acc ~ constr, data = Ship2)
## 
## $constr
##               diff    lwr   upr  p adj
## C6569-C6064  4.550 -16.41 25.51 0.9342
## C7074-C6064  3.050 -17.91 24.01 0.9786
## C7579-C6064 -2.917 -26.78 20.94 0.9871
## C7074-C6569 -1.500 -21.26 18.26 0.9968
## C7579-C6569 -7.467 -30.28 15.35 0.8101
## C7579-C7074 -5.967 -28.78 16.85 0.8919
plot(tukey1)

plot of chunk unnamed-chunk-6

When running a Tukey test, the null hypothesis is that there is no difference between the means of a pair of data, while the alternative states that there is a significant difference between the means.The tukey test creates a set of confidence intervals on the differences between the means of the levels of a factor with the specified family-wise probability of coverage. We would fail to reject the null hypothesis. When looking at the different combinations, all of them include 0 in the 95% confidence levels.

4. Contingencies

Since the dataset was counting the number of accidents for different types of ships, it is possible that all of the values were from chance.There may be other nuisance factors that were not recorded in real life that came into play. Also when looking at the boxplots of the variables we were testing, the distributions werent normal. The Kruskal Wallis test does not assume a normal distrubtion of the residuals and could be used to test if the variation can be attributed to anything other than randomization without assuming a normal distribution.

kruskal.test(acc~type, data=Ship2)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  acc by type
## Kruskal-Wallis chi-squared = 18.21, df = 4, p-value = 0.001124
kruskal.test(acc~operate, data=Ship2)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  acc by operate
## Kruskal-Wallis chi-squared = 0.1647, df = 1, p-value = 0.6848
kruskal.test(acc~constr, data=Ship2)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  acc by constr
## Kruskal-Wallis chi-squared = 4.859, df = 3, p-value = 0.1824

When looking at the results of the Kruskal-Wallis test, we would reject the null hypothesis and the type of ship can help explain the variation in the number of accidents. However we would fail to reject the null hypothesis when looking at the year operated or year constructed.

5. References to the literature

http://www.prenhall.com/greene/greene1.html

6. Appendices

A summary of, or pointer to, the raw data

complete and documented R code