This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
as of August 28, 2014, superceding the version of August 24. Always use the most recent version.
This analysis uses drinking laws and demographic statistics by state to test the effect 6 different factors may have on traffic fatality rate in deaths per 10,000.
Below is the installation and initial examination of the dataset:
#Installing data package
library("Ecdat", lib.loc="~/R/win-library/3.1")
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
data<-Fatality
head(data)
## state year mrall beertax mlda jaild comserd vmiles unrate perinc
## 1 1 1982 2.128 1.539 19.00 no no 7.234 14.4 10544
## 2 1 1983 2.348 1.789 19.00 no no 7.836 13.7 10733
## 3 1 1984 2.336 1.714 19.00 no no 8.263 11.1 11109
## 4 1 1985 2.193 1.653 19.67 no no 8.727 8.9 11333
## 5 1 1986 2.669 1.610 21.00 no no 8.953 9.8 11662
## 6 1 1987 2.719 1.560 21.00 no no 9.166 7.8 11944
attach(data)
For this two level, multi-factor analysis (2^6-1) we will be examining the following factors “mlda” which represents the minimum legal drinking age in that state “jaild” which represents which represents whether or not a mandatory jail sentence exists for that state “comserd” which represents whether or not there is mandatory community service “unrate” which shows the unemployment rate “vmiles” which is the average miles per dirve “perinc” which represents the per capita personal income for that state.
All factors are broken down into two levels, high and low or 1 and -1 respectively.
#Subset data to create two levels of all factors
#Note jaild and comserd are already two level factors
data$mlda[data$mlda >=18 & data$mlda < 20] = -1
data$mlda[data$mlda>=20] = 1
data$unrate[as.numeric(data$unrate)>= 2.400 & as.numeric(data$unrate) < 7.400] = -1
data$unrate[as.numeric(data$unrate)>=7.400] = 1
data$perinc[data$perinc>= 9000 & data$perinc <13880] = -1
data$perinc[data$perinc>= 13880] = 1
levels(data$comserd)<-c(-1,1)
levels(data$jaild)<-c(-1,1)
data$vmiles[as.numeric(data$vmiles)>= 4.500 & as.numeric(data$vmiles) < 7.891] = -1
data$vmiles[as.numeric(data$vmiles)>= 7.891] = 1
#Examine the Data
head(data)
## state year mrall beertax mlda jaild comserd vmiles unrate perinc
## 1 1 1982 2.128 1.539 -1 -1 -1 -1 1 -1
## 2 1 1983 2.348 1.789 -1 -1 -1 -1 1 -1
## 3 1 1984 2.336 1.714 -1 -1 -1 1 1 -1
## 4 1 1985 2.193 1.653 -1 -1 -1 1 1 -1
## 5 1 1986 2.669 1.610 1 -1 -1 1 1 -1
## 6 1 1987 2.719 1.560 1 -1 -1 1 1 -1
tail(data)
## state year mrall beertax mlda jaild comserd vmiles unrate perinc
## 331 56 1983 3.353 0.05161 -1 1 -1 1 1 -1
## 332 56 1984 3.060 0.04945 -1 1 -1 1 -1 -1
## 333 56 1985 2.986 0.04767 -1 1 -1 1 -1 -1
## 334 56 1986 3.314 0.04644 -1 1 -1 1 1 -1
## 335 56 1987 2.633 0.04500 -1 1 -1 1 1 -1
## 336 56 1988 3.236 0.04331 -1 1 -1 1 -1 -1
summary(data)
## state year mrall beertax
## Min. : 1.0 Min. :1982 Min. :0.821 Min. :0.0433
## 1st Qu.:18.8 1st Qu.:1983 1st Qu.:1.624 1st Qu.:0.2088
## Median :30.5 Median :1985 Median :1.956 Median :0.3526
## Mean :30.2 Mean :1985 Mean :2.040 Mean :0.5133
## 3rd Qu.:42.5 3rd Qu.:1987 3rd Qu.:2.418 3rd Qu.:0.6516
## Max. :56.0 Max. :1988 Max. :4.218 Max. :2.7208
## mlda jaild comserd vmiles unrate
## Min. :-1.000 -1:242 -1:274 Min. :-1.000 Min. :-1.0000
## 1st Qu.: 1.000 1 : 94 1 : 62 1st Qu.:-1.000 1st Qu.:-1.0000
## Median : 1.000 Median :-1.000 Median :-1.0000
## Mean : 0.554 Mean :-0.113 Mean :-0.0774
## 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.0000
## Max. : 1.000 Max. : 1.000 Max. : 1.0000
## perinc
## Min. :-1.0000
## 1st Qu.:-1.0000
## Median :-1.0000
## Mean :-0.0476
## 3rd Qu.: 1.0000
## Max. : 1.0000
All factors except mandatory jailtime and mandatory community service were continuous but they have been set to two levels.
For this experiment we will be focusing on the traffic fataility rate for the response variable being effected by the factors.
The data set is organized by the following variables: state, year, mrall, beertax, mlda, jaild, comserd, vmiles, unrate, and perinc.
State and year indicate the state ID number in which this data is observed and what year the data was recorded. Mrall represents the traffic fatality rate in deaths per 10,000 people.
beertax, mlda, jaild, and comserd all have to do with alcohol related laws in that state. Beer tax is the tax rate on a case of beer, mlda is the minimum legal drinking age, and jaild and comserd represent whether or not there is a mandatory jail sentence or community service time for that state in the event of alcohol related driving offenses.
vmiles is the average miles per drive.
Lastly unrate and perinc represent demographic information for that state. unrate is the unemployment rate and perinc is the per capita personal income for that observation.
The data itself is in order by state ID and then by year.
It is unknown whether or not the data collected for this study was collected by a randomly designed experiment.
To perform the experiment the factors will first be subset into two levels if necessary. Then a linear model will be created from the 6 factors and an analysis of variance will be performed. From these analysis we will be able to test the null hypothesis, that the traffic fatality rate is independant of the 6 factors.
After this we will generate an array for a 2^6-1 design. From this design we will create a new dataframe with the 32 runs. We will run another analysis of variance with the new data frame to test the null hypothesis.
The rationale for using an analysis of variance test is used when multiple factors are considered. It checks whether the means of several groups are equal. The alternative would be to use multiple two-sample t-tests however there is more likely chance of the test resulting in a false hypothesis.
We use a 2^k factor design so we can examine all possible combinations of factors and their individual as well as interaction effects on the response variable.
The data was collected in an unknown way so we do not know if there was any randomization to it.
There are no replicated or repeated measures in the data. For our 2^6-1 factor analysis we remove all duplicates
There was no blocking performed in the design of this experiment.
To start our statistical analysis we will make our variables factors for the analysis of variance and look at some boxplots of each factor.
#Defining Minimum Legal Drinking Age as a factor
data$mlda = as.factor(data$mlda)
#Defining Mandatory Jail Sentence as a factor
data$jaild = as.factor(data$jaild)
#Defining Unemployment Rate as a factor
data$unrate = as.factor(data$unrate)
#Defining Average Miles Driven as a factor
data$vmiles = as.factor(data$vmiles)
#Defining Per Capital Personal Income as a factor
data$perinc = as.factor(data$perinc)
#Defining Mandatory Community Service Sentence as a factor
data$comserd = as.factor(data$comserd)
#Boxplots of of the means of each variable against response variable
boxplot(mrall~mlda, data=data, xlab="Minimum Drinking Age", ylab="Fatality Rate")
boxplot(mrall~jaild, data=data, xlab="Mandatory Jail Time", ylab="Fatality Rate")
boxplot(mrall~perinc, data=data, xlab="Per Capita Personal Income", ylab="Fatality Rate")
boxplot(mrall~unrate, data=data, xlab="Unemployment Rate", ylab="Fatality Rate")
boxplot(mrall~comserd, data=data, xlab="Mandatory Community Service", ylab="Fatality Rate")
boxplot(mrall~vmiles, data=data, xlab="Average Miles Driven", ylab="Fatality Rate")
From the boxplots we examine positive correlation between the response variable and the factors mandatory jail time, unemployment rate, mandatory community service and average miles driven and we see negative relationships between minimum legal drinking age and income. However all of these relationships do not appear extremely significant.
To test the hypotheses we perform an ANOVA test on a linear model containing all the factors
The null hypothesis of the test is that the factors do not have an effect on the response variable of traffic fatality rate.
#Create of Linear Model
model1=lm(mrall~mlda+jaild+perinc+unrate+comserd+vmiles, data=data)
#Analysis of Variance
anova(model1)
## Analysis of Variance Table
##
## Response: mrall
## Df Sum Sq Mean Sq F value Pr(>F)
## mlda 1 2.5 2.54 12.89 0.00038 ***
## jaild 1 7.5 7.52 38.18 1.9e-09 ***
## perinc 1 12.1 12.13 61.54 6.1e-14 ***
## unrate 1 0.0 0.04 0.21 0.64355
## comserd 1 1.3 1.28 6.51 0.01120 *
## vmiles 1 20.6 20.58 104.47 < 2e-16 ***
## Residuals 329 64.8 0.20
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our analysis of variance showed significant p-values for the factors of minimum legal drinking age (mlda), mandatory jail time (jaild), income (perinc), and average miles driven (vmiles). We also saw a slightly less significant p-value for mandatory community services (comserd). Lastly our p-value for unemployment rate appeared to be insignificant. Overall we reject the null hypothesis for all factors except the unemployment rate.
First we have to install the data package required to generate our array of possible factor combinations.
library("FrF2", lib.loc="~/R/win-library/3.1")
## Warning: package 'FrF2' was built under R version 3.1.2
## Loading required package: DoE.base
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
##
## Attaching package: 'DoE.base'
##
## The following objects are masked from 'package:stats':
##
## aov, lm
##
## The following object is masked from 'package:graphics':
##
## plot.design
Next we generate our factor combinations:
#Creates all possible factor combinations
factdata = FrF2(32, nfactors=6, estimable=formula("~mlda+jaild+comserd+unrate+vmiles+perinc+mlda:(jaild+comserd+unrate+vmiles+perinc)"),factor.names=c("mlda","jaild","comserd","unrate","vmiles","perinc"), res4=TRUE, clear = FALSE)
factdata
## mlda jaild comserd unrate vmiles perinc
## 1 -1 -1 -1 1 1 -1
## 2 1 1 -1 1 -1 1
## 3 -1 -1 -1 -1 1 1
## 4 1 1 -1 -1 1 1
## 5 -1 -1 1 1 -1 -1
## 6 -1 1 1 -1 -1 -1
## 7 -1 1 -1 1 1 1
## 8 1 -1 1 -1 1 1
## 9 -1 1 1 1 1 -1
## 10 1 1 1 -1 1 -1
## 11 1 -1 1 1 1 -1
## 12 -1 -1 1 1 1 1
## 13 1 1 -1 -1 -1 -1
## 14 1 1 1 1 1 1
## 15 -1 -1 1 -1 1 -1
## 16 1 1 1 -1 -1 1
## 17 1 1 -1 1 1 -1
## 18 1 1 1 1 -1 -1
## 19 1 -1 1 1 -1 1
## 20 1 -1 -1 -1 1 -1
## 21 1 -1 -1 -1 -1 1
## 22 1 -1 -1 1 1 1
## 23 -1 1 -1 1 -1 -1
## 24 -1 -1 -1 1 -1 1
## 25 -1 1 -1 -1 1 -1
## 26 1 -1 1 -1 -1 -1
## 27 -1 1 1 -1 1 1
## 28 -1 -1 -1 -1 -1 -1
## 29 -1 1 -1 -1 -1 1
## 30 1 -1 -1 1 -1 -1
## 31 -1 1 1 1 -1 1
## 32 -1 -1 1 -1 -1 1
## class=design, type= FrF2.estimable
aliasprint(factdata)
## $legend
## [1] A=mlda B=jaild C=comserd D=unrate E=vmiles F=perinc
##
## [[2]]
## [1] no aliasing among main effects and 2fis
Now that we have our factor combinations we merge it with our original data to add the response variables and get a complete dataframe. We also remove all duplicate entrees of the unique combinations of factors.
factdata2 = merge(factdata, data, by=c("mlda","jaild","comserd","vmiles","unrate","perinc"), all = FALSE)
head(factdata2)
## mlda jaild comserd vmiles unrate perinc state year mrall beertax
## 1 -1 -1 -1 -1 -1 -1 19 1984 1.447 0.3462
## 2 -1 -1 -1 -1 -1 -1 55 1984 1.726 0.1593
## 3 -1 -1 -1 -1 -1 -1 16 1984 2.422 0.3709
## 4 -1 -1 -1 -1 -1 -1 50 1982 2.058 0.7115
## 5 -1 -1 -1 -1 1 1 36 1983 1.174 0.1329
## 6 -1 -1 -1 -1 1 1 36 1982 1.229 0.1193
factdata3 = unique(factdata2[ , 1:6])
factdata3
## mlda jaild comserd vmiles unrate perinc
## 1 -1 -1 -1 -1 -1 -1
## 5 -1 -1 -1 -1 1 1
## 8 -1 -1 -1 1 -1 1
## 13 -1 -1 -1 1 1 -1
## 21 -1 -1 1 -1 -1 1
## 22 -1 -1 1 -1 1 -1
## 23 -1 1 -1 -1 1 -1
## 31 -1 1 -1 1 -1 -1
## 34 -1 1 1 -1 -1 -1
## 35 1 -1 -1 -1 -1 1
## 74 1 -1 -1 -1 1 -1
## 112 1 -1 -1 1 -1 -1
## 133 1 -1 -1 1 1 1
## 137 1 -1 1 -1 1 1
## 138 1 -1 1 1 -1 1
## 141 1 1 -1 -1 1 1
## 146 1 1 -1 1 -1 1
## 149 1 1 -1 1 1 -1
## 151 1 1 1 -1 -1 1
## 157 1 1 1 -1 1 -1
## 162 1 1 1 1 -1 -1
## 166 1 1 1 1 1 1
Now that we have our unique combination of factors we also have row numbers which correspond to response variables in our new data frame (fractdata2). We pull these response variable entries and bind them to our new matrix in order to get our final data frame in which to perform our new analysis of variance on
mralldata = factdata2$mrall[index=c(1,5,8,13,21,22,23,31,34,35,74,112,133,137,138,141,146,149,151,157,162,166)]
mralldata
## [1] 1.447 1.174 2.247 2.261 2.547 2.532 2.295 3.060 2.829 1.487 1.856
## [12] 1.956 1.736 2.174 1.792 1.687 2.116 2.892 1.984 1.774 2.841 2.767
finaldata = cbind(factdata3,mralldata)
finaldata
## mlda jaild comserd vmiles unrate perinc mralldata
## 1 -1 -1 -1 -1 -1 -1 1.447
## 5 -1 -1 -1 -1 1 1 1.174
## 8 -1 -1 -1 1 -1 1 2.247
## 13 -1 -1 -1 1 1 -1 2.261
## 21 -1 -1 1 -1 -1 1 2.547
## 22 -1 -1 1 -1 1 -1 2.532
## 23 -1 1 -1 -1 1 -1 2.295
## 31 -1 1 -1 1 -1 -1 3.060
## 34 -1 1 1 -1 -1 -1 2.829
## 35 1 -1 -1 -1 -1 1 1.487
## 74 1 -1 -1 -1 1 -1 1.856
## 112 1 -1 -1 1 -1 -1 1.956
## 133 1 -1 -1 1 1 1 1.736
## 137 1 -1 1 -1 1 1 2.174
## 138 1 -1 1 1 -1 1 1.792
## 141 1 1 -1 -1 1 1 1.687
## 146 1 1 -1 1 -1 1 2.116
## 149 1 1 -1 1 1 -1 2.892
## 151 1 1 1 -1 -1 1 1.984
## 157 1 1 1 -1 1 -1 1.774
## 162 1 1 1 1 -1 -1 2.841
## 166 1 1 1 1 1 1 2.767
Using our new data frame we create a new linear model and perform an analysis of variance on it.
model2 = lm(mralldata~mlda+jaild+comserd+vmiles+unrate+perinc, data=finaldata)
anova(model2)
## Analysis of Variance Table
##
## Response: mralldata
## Df Sum Sq Mean Sq F value Pr(>F)
## mlda 1 0.180 0.180 1.40 0.2549
## jaild 1 1.582 1.582 12.33 0.0031 **
## comserd 1 0.457 0.457 3.56 0.0786 .
## vmiles 1 1.325 1.325 10.32 0.0058 **
## unrate 1 0.028 0.028 0.22 0.6484
## perinc 1 0.242 0.242 1.89 0.1899
## Residuals 15 1.925 0.128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see from this new anova we get different results than our original. Here we see that our p-value for mandatory jail time is the most significant followed by income and then mandatory community service and average miles driven. Based on the p-values we got from the analysis for the other variables we fail to reject the null hypothesis.
Next we graph Q-Q plots to check our data in our model for normality. If the data is not normal the results of the analysis may not be valid due to assumptions made during analysis of variance. We can see below that the residuals of our original linear model have a slight deviation from normality but our residuals from the fractional factorial model seem to appear normal.
#QQ plots for residuals of the original linear model
qqnorm(residuals(model1))
qqline(residuals(model1))
#QQ plots for residuals of the fractional factorial model
qqnorm(residuals(model2))
qqline(residuals(model2))
Next we plot a fitted model against the residuals. We can observe a large amount of variation among the plot which leads us to gain confidence in our model.
#Plot of Fitted vs Residuals of the general linear model
plot(fitted(model1), residuals(model1))
#Plot of Fitted vs Residuals of the fractional factorial model
plot(fitted(model2), residuals(model2))
Lastly we perform a shaprio wilke test on the data, which also tests for normality. The p-values from our test lead us to believe the data is in fact normally distributed.
#Shaprio Wilke test on numeric variables
shapiro.test(Fatality$mrall)
##
## Shapiro-Wilk normality test
##
## data: Fatality$mrall
## W = 0.9667, p-value = 5.869e-07
shapiro.test(Fatality$mlda)
##
## Shapiro-Wilk normality test
##
## data: Fatality$mlda
## W = 0.639, p-value < 2.2e-16
shapiro.test(Fatality$vmiles)
##
## Shapiro-Wilk normality test
##
## data: Fatality$vmiles
## W = 0.7019, p-value < 2.2e-16
shapiro.test(Fatality$unrate)
##
## Shapiro-Wilk normality test
##
## data: Fatality$unrate
## W = 0.9701, p-value = 2.071e-06
shapiro.test(Fatality$perinc)
##
## Shapiro-Wilk normality test
##
## data: Fatality$perinc
## W = 0.968, p-value = 9.169e-07
Overrall the results of our models lead us to believe our models are adequate and explain the effect of the factors on the traffic fatality rate.
No literature was used for this recipe
The R package in which this data was found can be located at http://cran.r-project.org/web/packages/Ecdat/index.html. The title of the dataset is “Fatality”
It is possible that the conclusions of our analysis are the results of chance. One concern is that the data is collected over several years. There are a large number of factors that fluctuate year to year that could effect the traffic fatality rate. In addition effects of factors like a mandatory jail sentence could be misleading since having a harsh versus a a light sentence could have different effects.
Different states also have many different types and amounts of roadways. It would be raitonal to hypothesize that a more rural state, with less large cities and areas with a large number of roads, despite different laws, drinking ages and income would have less traffic fatalities in general.
We can also perform the Kruskal-Wallis test which is a non-parametric test of two different groups of means. As we can see from our following tests our p-value is close to 0 for all factors and is therefore significant.
kruskal.test(mrall~mlda, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: mrall by mlda
## Kruskal-Wallis chi-squared = 8.507, df = 1, p-value = 0.003538
kruskal.test(mrall~jaild, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: mrall by jaild
## Kruskal-Wallis chi-squared = 28.15, df = 1, p-value = 1.12e-07
kruskal.test(mrall~comserd, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: mrall by comserd
## Kruskal-Wallis chi-squared = 15.62, df = 1, p-value = 7.735e-05
kruskal.test(mrall~vmiles, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: mrall by vmiles
## Kruskal-Wallis chi-squared = 54.54, df = 1, p-value = 1.526e-13
kruskal.test(mrall~unrate, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: mrall by unrate
## Kruskal-Wallis chi-squared = 9.609, df = 1, p-value = 0.001936
kruskal.test(mrall~perinc, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: mrall by perinc
## Kruskal-Wallis chi-squared = 55.37, df = 1, p-value = 1.001e-13