as of August 28, 2014, superceding the version of August 24. Always use the most recent version.
This analysis uses drinking laws and demographic statistics by state to test the effect 4 different factors may have on the traffic fatality rate in deaths per 10,000.
Below is the installation and initial examination of the dataset:
#Installing data package
library("Ecdat", lib.loc="~/R/win-library/3.1")
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
#Attaching Cigarrette Consumption dataset from library
data<-Fatality
attach(data)
For this 4 factor analysis we will be examining the following factors:
“beertax” which represents the tax on a case of beer in that state “mlda” which represents the minimum legal age to drink in the state “unrate” which shows the unemployment rate “vmiles” which is the average miles per dirve
All factors are broken down into 3 levels based on their range.
#Setting 3 levels of Beer Tax
data$beertax[as.numeric(data$beertax)>= 0 & as.numeric(data$beertax) <= .25 ] = "<.25"
data$beertax[as.numeric(data$beertax)>= .25 & as.numeric(data$beertax) <= 1 ] = ".25-1"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
data$beertax[as.numeric(data$beertax)>= 1] = ">1"
## Warning: NAs introduced by coercion
#Setting 3 levels of Minimum Legal Drinking Age
data$mlda[as.numeric(data$mlda)>= 18 & as.numeric(data$mlda)<20] = "18-19"
data$mlda[as.numeric(data$mlda)>= 20 & as.numeric(data$mlda)<21] = "20"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
data$mlda[as.numeric(data$mlda)>= 21] = "21+"
## Warning: NAs introduced by coercion
#Setting 3 levels of Average Miles traveled
data$vmiles[as.numeric(data$vmiles)>= 0 & as.numeric(data$vmiles) <9] = "0-9"
data$vmiles[as.numeric(data$vmiles)>= 9 & as.numeric(data$vmiles) <18] = "9-18"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
data$vmiles[as.numeric(data$vmiles)>= 18] = "18+"
## Warning: NAs introduced by coercion
#Setting 3 levels of unemployment rate
data$unrate[as.numeric(data$unrate)>= 0 & as.numeric(data$unrate) <5] = "0-5"
data$unrate[as.numeric(data$unrate)>= 5 & as.numeric(data$unrate) <10] = "5-10"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
data$unrate[as.numeric(data$unrate)>= 10 ] = ">10"
## Warning: NAs introduced by coercion
#Display Levels of Factors
unique(data$beertax)
## [1] ">1" "<.25" ".25-1"
unique(data$mlda)
## [1] "18-19" "21+" "20"
unique(data$vmiles)
## [1] "0-9" "9-18" "18+"
unique(data$unrate)
## [1] ">10" "5-10" "0-5"
#Overall Look at Data
head(data)
## state year mrall beertax mlda jaild comserd vmiles unrate perinc
## 1 1 1982 2.128 >1 18-19 no no 0-9 >10 10544
## 2 1 1983 2.348 >1 18-19 no no 0-9 >10 10733
## 3 1 1984 2.336 >1 18-19 no no 0-9 >10 11109
## 4 1 1985 2.193 >1 18-19 no no 0-9 5-10 11333
## 5 1 1986 2.669 >1 21+ no no 0-9 5-10 11662
## 6 1 1987 2.719 >1 21+ no no 9-18 5-10 11944
tail(data)
## state year mrall beertax mlda jaild comserd vmiles unrate perinc
## 331 56 1983 3.353 <.25 18-19 yes no 9-18 5-10 13575
## 332 56 1984 3.060 <.25 18-19 yes no 9-18 5-10 13456
## 333 56 1985 2.986 <.25 18-19 yes no 9-18 5-10 13595
## 334 56 1986 3.314 <.25 18-19 yes no 9-18 5-10 13127
## 335 56 1987 2.633 <.25 18-19 yes no 9-18 5-10 12719
## 336 56 1988 3.236 <.25 18-19 yes no 9-18 5-10 13098
summary(data)
## state year mrall beertax
## Min. : 1.0 Min. :1982 Min. :0.821 Length:336
## 1st Qu.:18.8 1st Qu.:1983 1st Qu.:1.624 Class :character
## Median :30.5 Median :1985 Median :1.956 Mode :character
## Mean :30.2 Mean :1985 Mean :2.040
## 3rd Qu.:42.5 3rd Qu.:1987 3rd Qu.:2.418
## Max. :56.0 Max. :1988 Max. :4.218
## mlda jaild comserd vmiles
## Length:336 no :242 no :274 Length:336
## Class :character yes: 94 yes: 62 Class :character
## Mode :character Mode :character
##
##
##
## unrate perinc
## Length:336 Min. : 9514
## Class :character 1st Qu.:12086
## Mode :character Median :13763
## Mean :13880
## 3rd Qu.:15175
## Max. :22193
All factors and variables in this data are continuous, however the factors have been set into 3 levels for analysis.
For this experiment we will be focusing on the traffic fatality rate in deaths per 10,000 as the response variable being effected by the chosen factors.
The data set is organized by the following variables: state, year, mrall, beertax, mlda, jaild, comserd, vmiles, unrate, and perinc.
State and year indicate the state ID number in which this data is observed and what year the data was recorded. Mrall represents the traffic fatality rate in deaths per 10,000 people.
beertax, mlda, jaild, and comserd all have to do with alcohol related laws in that state. Beer tax is the tax rate on a case of beer, mlda is the minimum legal drinking age, and jaild and comserd represent whether or not there is a mandatory jail sentence or community service time for that state in the event of alcohol related driving offenses.
vmiles is the average miles per drive.
Lastly unrate and perinc represent demographic information for that state. unrate is the unemployment rate and perinc is the per capita personal income for that observation.
The data itself is in order by state ID and then by year.
It is unknown whether or not the data collected for this study was collected by a randomly designed experiment.
To perform the experiment the data will first be altered so that our factors have 3 levels. Then we will create a linear model using the factor and use a response surface method to perform an analysis of variance one the model. From these analysis we will be able to test the null hypothesis, that the traffic fatality rate is independant of the 4 factors. Our response surface method will include second order effects.
The rationale for using an analysis of variance test to check whether the means of several groups are equal. The alternative would be to use a two-sample t-tests however there is more likely chance of the test resulting in a false positive of the hypothesis. By using a response surface method we will also find an “optimal” set of inputs for a minimum or maximum response variable.
The data was collected in an unknown way so we do not know if there was any randomization to it.
There are no replicated or repeated measures in the data.
There was no blocking in the design of this experiment
To start our statistical analysis we will make our variables factors for the analysis of variance and look at some boxplots of those factors.
#Defining Minimum Legal Drinking Age as a factor
data$mlda = as.factor(data$mlda)
#Defining Tax on a Case of Beer as a factor
data$beertax = as.factor(data$beertax)
#Defining Unemployment Rate as a factor
data$unrate = as.factor(data$unrate)
#Defining Average Miles Driven as a factor
data$vmiles = as.factor(data$vmiles)
#Boxplots of of the means of each variable against response variable
boxplot(mrall~mlda, data=data, xlab="Minimum Drinking Age", ylab="Fatality Rate")
boxplot(mrall~beertax, data=data, xlab="Tax on Beer", ylab="Fatality Rate")
boxplot(mrall~unrate, data=data, xlab="Unemployment Rate", ylab="Fatality Rate")
boxplot(mrall~vmiles, data=data, xlab="Average Miles Driven", ylab="Fatality Rate")
There are no discernable relationships between the factors minimum legal drinking age and average miles and the repsonse variable. We do see a slight positive relationship between the tax on beer and the unemployment rate on the respones variable.
First we generate a linear model with our factors. Then to test the hypotheses we perform an ANOVA test via the response surface method package.
The null hypothesis of the tests is that the factors do not have an effect on the response variable of the traffic fatality rate.
#Loading response surface methods package
library(rsm)
## Warning: package 'rsm' was built under R version 3.1.2
#Set factors to binary operators for Response Surface method
data$mlda = as.integer(data$mlda)
data$beertax = as.integer(data$beertax)
data$unrate = as.integer(data$unrate)
data$vmiles = as.integer(data$vmiles)
#Generation of Linear Model using Response Surface Methods
model=rsm(mrall~SO(beertax, mlda, vmiles, unrate), data=data)
summary(model)
##
## Call:
## rsm(formula = mrall ~ SO(beertax, mlda, vmiles, unrate), data = data)
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.07134 1.45081 3.50 0.00054 ***
## beertax -1.75611 0.29024 -6.05 4.0e-09 ***
## mlda -1.18777 0.40100 -2.96 0.00328 **
## vmiles 0.27957 1.87623 0.15 0.88164
## unrate -1.30648 0.33183 -3.94 0.00010 ***
## beertax:mlda 0.11860 0.04278 2.77 0.00589 **
## beertax:vmiles -0.05276 0.05344 -0.99 0.32425
## beertax:unrate -0.00331 0.04643 -0.07 0.94314
## mlda:vmiles -0.19996 0.05060 -3.95 9.5e-05 ***
## mlda:unrate 0.00797 0.03957 0.20 0.84060
## vmiles:unrate 0.01431 0.09021 0.16 0.87408
## beertax^2 0.44057 0.06154 7.16 5.6e-12 ***
## mlda^2 0.29341 0.09322 3.15 0.00180 **
## vmiles^2 0.14955 0.47472 0.32 0.75295
## unrate^2 0.31579 0.08282 3.81 0.00016 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Multiple R-squared: 0.386, Adjusted R-squared: 0.359
## F-statistic: 14.4 on 14 and 321 DF, p-value: <2e-16
##
## Analysis of Variance Table
##
## Response: mrall
## Df Sum Sq Mean Sq F value Pr(>F)
## FO(beertax, mlda, vmiles, unrate) 4 19.3 4.83 23.17 < 2e-16
## TWI(beertax, mlda, vmiles, unrate) 6 4.2 0.70 3.38 0.003
## PQ(beertax, mlda, vmiles, unrate) 4 18.4 4.61 22.11 3.9e-16
## Residuals 321 66.9 0.21
## Lack of fit 19 5.1 0.27 1.32 0.170
## Pure error 302 61.8 0.20
##
## Stationary point of response surface:
## beertax mlda vmiles unrate
## 1.7875 1.8009 0.4869 2.0443
##
## Eigenanalysis:
## $values
## [1] 0.47327 0.31629 0.31179 0.09796
##
## $vectors
## [,1] [,2] [,3] [,4]
## beertax 0.89298 -0.08914 0.4410 0.01114
## mlda 0.40421 0.18867 -0.7687 -0.45835
## vmiles -0.19780 -0.05714 0.4114 -0.88791
## unrate -0.00816 0.97632 0.2129 0.03763
As we can see from the results of our model the tax on beer and unemployment rate are most significant and minimum legal drinking age is aslo significant. This is true because of our near zero p-values. It is also worth noting average miles driven had a very high p-value. Overall we reject the null hypothesis that the response variable mrall is independent of beertax, mlda and unrate.
Our stationary point of the response surface was: beertax mlda vmiles unrate 1.7874528 1.8009469 0.4868878 2.0442522
This should be an optimal point on the response surface.
Moving forward we will examine contour plots to examine two factor surfaces.
#Contour Plots for factor interaction surfaces
contour(model, ~beertax + mlda + vmiles + unrate, image = TRUE, at=summary(model$canonical$xs))
Next we graph Q-Q plots to check our data in our model for normality. If the data is not normal the results of the analysis may not be valid since normality within the data is an assumption of ANOVA tests.
We see slight deviation from normality in our Q-Q plot
#QQ plots for residuals of the model
qqnorm(residuals(model))
qqline(residuals(model))
Second we plot a fitted model against the residuals. We don’t see a very large degree of variation among the plot which indicates the data is not normal.
#Plot of Fitted vs Residuals of model
plot(fitted(model), residuals(model))
Lastly we perform Shapiro-Wilke tests on our data to test for normality
#Shaprio Wilke tests on numeric variables
shapiro.test(Fatality$mrall)
##
## Shapiro-Wilk normality test
##
## data: Fatality$mrall
## W = 0.9667, p-value = 5.869e-07
shapiro.test(Fatality$mlda)
##
## Shapiro-Wilk normality test
##
## data: Fatality$mlda
## W = 0.639, p-value < 2.2e-16
shapiro.test(Fatality$vmiles)
##
## Shapiro-Wilk normality test
##
## data: Fatality$vmiles
## W = 0.7019, p-value < 2.2e-16
shapiro.test(Fatality$unrate)
##
## Shapiro-Wilk normality test
##
## data: Fatality$unrate
## W = 0.9701, p-value = 2.071e-06
shapiro.test(Fatality$beertax)
##
## Shapiro-Wilk normality test
##
## data: Fatality$beertax
## W = 0.7553, p-value < 2.2e-16
Overrall the results of our model adequacy testing lead us to believe our model is not adequate and does not explain the effect of the factors on the response variable of fatality rate.
No literature was used
The R package in which this data was found can be located in the R library called Ecdat. The dataset is named “Fatality”
It is possible that the conclusions of our analysis are the results of chance. One concern is that the data is collected over several years. There are a large number of factors that fluctuate year to year that could effect the traffic fatality rate. In addition effects of factors like a mandatory jail sentence could be misleading since having a harsh versus a a light sentence could have different effects.
Different states also have many different types and amounts of roadways. It would be rational to hypothesize that a more rural state, with less large cities and areas with a large number of roads, despite different laws, drinking ages and income would have less traffic fatalities in general.
We can also perform the Kruskal-Wallis test which is a non-parametric test of two different groups of means. As we can see from our following tests our p-value is close to 0 for all factors and is therefore significant.
kruskal.test(mrall~mlda, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: mrall by mlda
## Kruskal-Wallis chi-squared = 15.26, df = 2, p-value = 0.0004868
kruskal.test(mrall~beertax, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: mrall by beertax
## Kruskal-Wallis chi-squared = 59.43, df = 2, p-value = 1.244e-13
kruskal.test(mrall~vmiles, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: mrall by vmiles
## Kruskal-Wallis chi-squared = 43.56, df = 2, p-value = 3.483e-10
kruskal.test(mrall~unrate, data=data)
##
## Kruskal-Wallis rank sum test
##
## data: mrall by unrate
## Kruskal-Wallis chi-squared = 19.85, df = 2, p-value = 4.903e-05