This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

# Recipes for the Design of Experiments: Recipe Outline

as of August 28, 2014, superceding the version of August 24. Always use the most recent version.

## 1. Setting

### Traffic Fatality and Drunk Driving Laws

This analysis uses drinking laws and demographic statistics by state to test the effect 6 different factors may have on traffic fatality rate in deaths per 10,000.

Below is the installation and initial examination of the dataset:

#Installing data package
library("Ecdat", lib.loc="~/R/win-library/3.1")
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
##     Orange
data<-Fatality
head(data)
##   state year mrall beertax  mlda jaild comserd vmiles unrate perinc
## 1     1 1982 2.128   1.539 19.00    no      no  7.234   14.4  10544
## 2     1 1983 2.348   1.789 19.00    no      no  7.836   13.7  10733
## 3     1 1984 2.336   1.714 19.00    no      no  8.263   11.1  11109
## 4     1 1985 2.193   1.653 19.67    no      no  8.727    8.9  11333
## 5     1 1986 2.669   1.610 21.00    no      no  8.953    9.8  11662
## 6     1 1987 2.719   1.560 21.00    no      no  9.166    7.8  11944
attach(data)

### Factors and Levels

For this two level, multi-factor analysis (2^6-1) we will be examining the following factors “mlda” which represents the minimum legal drinking age in that state “jaild” which represents which represents whether or not a mandatory jail sentence exists for that state “comserd” which represents whether or not there is mandatory community service “unrate” which shows the unemployment rate “vmiles” which is the average miles per dirve “perinc” which represents the per capita personal income for that state.

All factors are broken down into two levels, high and low or 1 and -1 respectively.

#Subset data to create two levels of all factors
#Note jaild and comserd are already two level factors
data$mlda[data$mlda >=18 & data$mlda < 20] = -1 data$mlda[data$mlda>=20] = 1 data$unrate[as.numeric(data$unrate)>= 2.400 & as.numeric(data$unrate) < 7.400] = -1
data$unrate[as.numeric(data$unrate)>=7.400] = 1

data$perinc[data$perinc>= 9000 & data$perinc <13880] = -1 data$perinc[data$perinc>= 13880] = 1 levels(data$comserd)<-c(-1,1)
levels(data$jaild)<-c(-1,1) data$vmiles[as.numeric(data$vmiles)>= 4.500 & as.numeric(data$vmiles) < 7.891] = -1
data$vmiles[as.numeric(data$vmiles)>= 7.891] = 1

#Examine the Data
head(data)
##   state year mrall beertax mlda jaild comserd vmiles unrate perinc
## 1     1 1982 2.128   1.539   -1    -1      -1     -1      1     -1
## 2     1 1983 2.348   1.789   -1    -1      -1     -1      1     -1
## 3     1 1984 2.336   1.714   -1    -1      -1      1      1     -1
## 4     1 1985 2.193   1.653   -1    -1      -1      1      1     -1
## 5     1 1986 2.669   1.610    1    -1      -1      1      1     -1
## 6     1 1987 2.719   1.560    1    -1      -1      1      1     -1
tail(data)
##     state year mrall beertax mlda jaild comserd vmiles unrate perinc
## 331    56 1983 3.353 0.05161   -1     1      -1      1      1     -1
## 332    56 1984 3.060 0.04945   -1     1      -1      1     -1     -1
## 333    56 1985 2.986 0.04767   -1     1      -1      1     -1     -1
## 334    56 1986 3.314 0.04644   -1     1      -1      1      1     -1
## 335    56 1987 2.633 0.04500   -1     1      -1      1      1     -1
## 336    56 1988 3.236 0.04331   -1     1      -1      1     -1     -1
summary(data)
##      state           year          mrall          beertax
##  Min.   : 1.0   Min.   :1982   Min.   :0.821   Min.   :0.0433
##  1st Qu.:18.8   1st Qu.:1983   1st Qu.:1.624   1st Qu.:0.2088
##  Median :30.5   Median :1985   Median :1.956   Median :0.3526
##  Mean   :30.2   Mean   :1985   Mean   :2.040   Mean   :0.5133
##  3rd Qu.:42.5   3rd Qu.:1987   3rd Qu.:2.418   3rd Qu.:0.6516
##  Max.   :56.0   Max.   :1988   Max.   :4.218   Max.   :2.7208
##       mlda        jaild    comserd      vmiles           unrate
##  Min.   :-1.000   -1:242   -1:274   Min.   :-1.000   Min.   :-1.0000
##  1st Qu.: 1.000   1 : 94   1 : 62   1st Qu.:-1.000   1st Qu.:-1.0000
##  Median : 1.000                     Median :-1.000   Median :-1.0000
##  Mean   : 0.554                     Mean   :-0.113   Mean   :-0.0774
##  3rd Qu.: 1.000                     3rd Qu.: 1.000   3rd Qu.: 1.0000
##  Max.   : 1.000                     Max.   : 1.000   Max.   : 1.0000
##      perinc
##  Min.   :-1.0000
##  1st Qu.:-1.0000
##  Median :-1.0000
##  Mean   :-0.0476
##  3rd Qu.: 1.0000
##  Max.   : 1.0000

### Continuous variables (if any)

All factors except mandatory jailtime and mandatory community service were continuous but they have been set to two levels.

### Response variables

For this experiment we will be focusing on the traffic fataility rate for the response variable being effected by the factors.

### The Data: How is it organized and what does it look like?

The data set is organized by the following variables: state, year, mrall, beertax, mlda, jaild, comserd, vmiles, unrate, and perinc.

State and year indicate the state ID number in which this data is observed and what year the data was recorded. Mrall represents the traffic fatality rate in deaths per 10,000 people.

beertax, mlda, jaild, and comserd all have to do with alcohol related laws in that state. Beer tax is the tax rate on a case of beer, mlda is the minimum legal drinking age, and jaild and comserd represent whether or not there is a mandatory jail sentence or community service time for that state in the event of alcohol related driving offenses.

vmiles is the average miles per drive.

Lastly unrate and perinc represent demographic information for that state. unrate is the unemployment rate and perinc is the per capita personal income for that observation.

The data itself is in order by state ID and then by year.

### Randomization

It is unknown whether or not the data collected for this study was collected by a randomly designed experiment.

## 2. (Experimental) Design

### How will the experiment be organized and conducted to test the hypothesis?

To perform the experiment the factors will first be subset into two levels if necessary. Then a linear model will be created from the 6 factors and an analysis of variance will be performed. From these analysis we will be able to test the null hypothesis, that the traffic fatality rate is independant of the 6 factors.

After this we will generate an array for a 2^6-1 design. From this design we will create a new dataframe with the 32 runs. We will run another analysis of variance with the new data frame to test the null hypothesis.

### What is the rationale for this design?

The rationale for using an analysis of variance test is used when multiple factors are considered. It checks whether the means of several groups are equal. The alternative would be to use multiple two-sample t-tests however there is more likely chance of the test resulting in a false hypothesis.

We use a 2^k factor design so we can examine all possible combinations of factors and their individual as well as interaction effects on the response variable.

### Randomize: What is the Randomization Scheme?

The data was collected in an unknown way so we do not know if there was any randomization to it.

### Replicate: Are there replicates and/or repeated measures?

There are no replicated or repeated measures in the data. For our 2^6-1 factor analysis we remove all duplicates

### Block: Did you use blocking in the design?

There was no blocking performed in the design of this experiment.

## 3. Statistical Analysis

### Exploratory Data Analysis: Graphics and Descriptive summary

To start our statistical analysis we will make our variables factors for the analysis of variance and look at some boxplots of each factor.

#Defining Minimum Legal Drinking Age as a factor
data$mlda = as.factor(data$mlda)

#Defining Mandatory Jail Sentence as a factor
data$jaild = as.factor(data$jaild)

#Defining Unemployment Rate as a factor
data$unrate = as.factor(data$unrate)

#Defining Average Miles Driven as a factor
data$vmiles = as.factor(data$vmiles)

#Defining Per Capital Personal Income as a factor
data$perinc = as.factor(data$perinc)

#Defining Mandatory Community Service Sentence as a factor
data$comserd = as.factor(data$comserd)

#Boxplots of of the means of each variable against response variable
boxplot(mrall~mlda, data=data, xlab="Minimum Drinking Age", ylab="Fatality Rate")

boxplot(mrall~jaild, data=data, xlab="Mandatory Jail Time", ylab="Fatality Rate")

boxplot(mrall~perinc, data=data, xlab="Per Capita Personal Income", ylab="Fatality Rate")

boxplot(mrall~unrate, data=data, xlab="Unemployment Rate", ylab="Fatality Rate")

boxplot(mrall~comserd, data=data, xlab="Mandatory Community Service", ylab="Fatality Rate")

boxplot(mrall~vmiles, data=data, xlab="Average Miles Driven", ylab="Fatality Rate")

From the boxplots we examine positive correlation between the response variable and the factors mandatory jail time, unemployment rate, mandatory community service and average miles driven and we see negative relationships between minimum legal drinking age and income. However all of these relationships do not appear extremely significant.

### Linear Model Anova

To test the hypotheses we perform an ANOVA test on a linear model containing all the factors

The null hypothesis of the test is that the factors do not have an effect on the response variable of traffic fatality rate.

#Create of Linear Model
model1=lm(mrall~mlda+jaild+perinc+unrate+comserd+vmiles, data=data)
#Analysis of Variance
anova(model1)
## Analysis of Variance Table
##
## Response: mrall
##            Df Sum Sq Mean Sq F value  Pr(>F)
## mlda        1    2.5    2.54   12.89 0.00038 ***
## jaild       1    7.5    7.52   38.18 1.9e-09 ***
## perinc      1   12.1   12.13   61.54 6.1e-14 ***
## unrate      1    0.0    0.04    0.21 0.64355
## comserd     1    1.3    1.28    6.51 0.01120 *
## vmiles      1   20.6   20.58  104.47 < 2e-16 ***
## Residuals 329   64.8    0.20
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our analysis of variance showed significant p-values for the factors of minimum legal drinking age (mlda), mandatory jail time (jaild), income (perinc), and average miles driven (vmiles). We also saw a slightly less significant p-value for mandatory community services (comserd). Lastly our p-value for unemployment rate appeared to be insignificant. Overall we reject the null hypothesis for all factors except the unemployment rate.

### Factorial Design Model Anova

First we have to install the data package required to generate our array of possible factor combinations.

library("FrF2", lib.loc="~/R/win-library/3.1")
## Warning: package 'FrF2' was built under R version 3.1.2
## Loading required package: DoE.base
## Warning: package 'DoE.base' was built under R version 3.1.2
## Loading required package: grid
## Loading required package: conf.design
## Warning: package 'conf.design' was built under R version 3.1.2
##
## Attaching package: 'DoE.base'
##
## The following objects are masked from 'package:stats':
##
##     aov, lm
##
## The following object is masked from 'package:graphics':
##
##     plot.design

Next we generate our factor combinations:

#Creates all possible factor combinations
factdata = FrF2(32, nfactors=6, estimable=formula("~mlda+jaild+comserd+unrate+vmiles+perinc+mlda:(jaild+comserd+unrate+vmiles+perinc)"),factor.names=c("mlda","jaild","comserd","unrate","vmiles","perinc"), res4=TRUE, clear = FALSE)

factdata
##    mlda jaild comserd unrate vmiles perinc
## 1    -1    -1      -1      1      1     -1
## 2     1     1      -1      1     -1      1
## 3    -1    -1      -1     -1      1      1
## 4     1     1      -1     -1      1      1
## 5    -1    -1       1      1     -1     -1
## 6    -1     1       1     -1     -1     -1
## 7    -1     1      -1      1      1      1
## 8     1    -1       1     -1      1      1
## 9    -1     1       1      1      1     -1
## 10    1     1       1     -1      1     -1
## 11    1    -1       1      1      1     -1
## 12   -1    -1       1      1      1      1
## 13    1     1      -1     -1     -1     -1
## 14    1     1       1      1      1      1
## 15   -1    -1       1     -1      1     -1
## 16    1     1       1     -1     -1      1
## 17    1     1      -1      1      1     -1
## 18    1     1       1      1     -1     -1
## 19    1    -1       1      1     -1      1
## 20    1    -1      -1     -1      1     -1
## 21    1    -1      -1     -1     -1      1
## 22    1    -1      -1      1      1      1
## 23   -1     1      -1      1     -1     -1
## 24   -1    -1      -1      1     -1      1
## 25   -1     1      -1     -1      1     -1
## 26    1    -1       1     -1     -1     -1
## 27   -1     1       1     -1      1      1
## 28   -1    -1      -1     -1     -1     -1
## 29   -1     1      -1     -1     -1      1
## 30    1    -1      -1      1     -1     -1
## 31   -1     1       1      1     -1      1
## 32   -1    -1       1     -1     -1      1
## class=design, type= FrF2.estimable
aliasprint(factdata)
## $legend ## [1] A=mlda B=jaild C=comserd D=unrate E=vmiles F=perinc ## ## [[2]] ## [1] no aliasing among main effects and 2fis Now that we have our factor combinations we merge it with our original data to add the response variables and get a complete dataframe. We also remove all duplicate entrees of the unique combinations of factors. factdata2 = merge(factdata, data, by=c("mlda","jaild","comserd","vmiles","unrate","perinc"), all = FALSE) head(factdata2) ## mlda jaild comserd vmiles unrate perinc state year mrall beertax ## 1 -1 -1 -1 -1 -1 -1 19 1984 1.447 0.3462 ## 2 -1 -1 -1 -1 -1 -1 55 1984 1.726 0.1593 ## 3 -1 -1 -1 -1 -1 -1 16 1984 2.422 0.3709 ## 4 -1 -1 -1 -1 -1 -1 50 1982 2.058 0.7115 ## 5 -1 -1 -1 -1 1 1 36 1983 1.174 0.1329 ## 6 -1 -1 -1 -1 1 1 36 1982 1.229 0.1193 factdata3 = unique(factdata2[ , 1:6]) factdata3 ## mlda jaild comserd vmiles unrate perinc ## 1 -1 -1 -1 -1 -1 -1 ## 5 -1 -1 -1 -1 1 1 ## 8 -1 -1 -1 1 -1 1 ## 13 -1 -1 -1 1 1 -1 ## 21 -1 -1 1 -1 -1 1 ## 22 -1 -1 1 -1 1 -1 ## 23 -1 1 -1 -1 1 -1 ## 31 -1 1 -1 1 -1 -1 ## 34 -1 1 1 -1 -1 -1 ## 35 1 -1 -1 -1 -1 1 ## 74 1 -1 -1 -1 1 -1 ## 112 1 -1 -1 1 -1 -1 ## 133 1 -1 -1 1 1 1 ## 137 1 -1 1 -1 1 1 ## 138 1 -1 1 1 -1 1 ## 141 1 1 -1 -1 1 1 ## 146 1 1 -1 1 -1 1 ## 149 1 1 -1 1 1 -1 ## 151 1 1 1 -1 -1 1 ## 157 1 1 1 -1 1 -1 ## 162 1 1 1 1 -1 -1 ## 166 1 1 1 1 1 1 Now that we have our unique combination of factors we also have row numbers which correspond to response variables in our new data frame (fractdata2). We pull these response variable entries and bind them to our new matrix in order to get our final data frame in which to perform our new analysis of variance on mralldata = factdata2$mrall[index=c(1,5,8,13,21,22,23,31,34,35,74,112,133,137,138,141,146,149,151,157,162,166)]
mralldata
##  [1] 1.447 1.174 2.247 2.261 2.547 2.532 2.295 3.060 2.829 1.487 1.856
## [12] 1.956 1.736 2.174 1.792 1.687 2.116 2.892 1.984 1.774 2.841 2.767
finaldata = cbind(factdata3,mralldata)
finaldata
##     mlda jaild comserd vmiles unrate perinc mralldata
## 1     -1    -1      -1     -1     -1     -1     1.447
## 5     -1    -1      -1     -1      1      1     1.174
## 8     -1    -1      -1      1     -1      1     2.247
## 13    -1    -1      -1      1      1     -1     2.261
## 21    -1    -1       1     -1     -1      1     2.547
## 22    -1    -1       1     -1      1     -1     2.532
## 23    -1     1      -1     -1      1     -1     2.295
## 31    -1     1      -1      1     -1     -1     3.060
## 34    -1     1       1     -1     -1     -1     2.829
## 35     1    -1      -1     -1     -1      1     1.487
## 74     1    -1      -1     -1      1     -1     1.856
## 112    1    -1      -1      1     -1     -1     1.956
## 133    1    -1      -1      1      1      1     1.736
## 137    1    -1       1     -1      1      1     2.174
## 138    1    -1       1      1     -1      1     1.792
## 141    1     1      -1     -1      1      1     1.687
## 146    1     1      -1      1     -1      1     2.116
## 149    1     1      -1      1      1     -1     2.892
## 151    1     1       1     -1     -1      1     1.984
## 157    1     1       1     -1      1     -1     1.774
## 162    1     1       1      1     -1     -1     2.841
## 166    1     1       1      1      1      1     2.767

Using our new data frame we create a new linear model and perform an analysis of variance on it.

model2 = lm(mralldata~mlda+jaild+comserd+vmiles+unrate+perinc, data=finaldata)
anova(model2)
## Analysis of Variance Table
##
## Response: mralldata
##           Df Sum Sq Mean Sq F value Pr(>F)
## mlda       1  0.180   0.180    1.40 0.2549
## jaild      1  1.582   1.582   12.33 0.0031 **
## comserd    1  0.457   0.457    3.56 0.0786 .
## vmiles     1  1.325   1.325   10.32 0.0058 **
## unrate     1  0.028   0.028    0.22 0.6484
## perinc     1  0.242   0.242    1.89 0.1899
## Residuals 15  1.925   0.128
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see from this new anova we get different results than our original. Here we see that our p-value for mandatory jail time is the most significant followed by income and then mandatory community service and average miles driven. Based on the p-values we got from the analysis for the other variables we fail to reject the null hypothesis.

Next we graph Q-Q plots to check our data in our model for normality. If the data is not normal the results of the analysis may not be valid due to assumptions made during analysis of variance. We can see below that the residuals of our original linear model have a slight deviation from normality but our residuals from the fractional factorial model seem to appear normal.

#QQ plots for residuals of the original linear model
qqnorm(residuals(model1))
qqline(residuals(model1))

#QQ plots for residuals of the fractional factorial model
qqnorm(residuals(model2))
qqline(residuals(model2))

Next we plot a fitted model against the residuals. We can observe a large amount of variation among the plot which leads us to gain confidence in our model.

#Plot of Fitted vs Residuals of the general linear model
plot(fitted(model1), residuals(model1))

#Plot of Fitted vs Residuals of the fractional factorial model
plot(fitted(model2), residuals(model2))

Lastly we perform a shaprio wilke test on the data, which also tests for normality. The p-values from our test lead us to believe the data is in fact normally distributed.

#Shaprio Wilke test on numeric variables
shapiro.test(Fatality$mrall) ## ## Shapiro-Wilk normality test ## ## data: Fatality$mrall
## W = 0.9667, p-value = 5.869e-07
shapiro.test(Fatality$mlda) ## ## Shapiro-Wilk normality test ## ## data: Fatality$mlda
## W = 0.639, p-value < 2.2e-16
shapiro.test(Fatality$vmiles) ## ## Shapiro-Wilk normality test ## ## data: Fatality$vmiles
## W = 0.7019, p-value < 2.2e-16
shapiro.test(Fatality$unrate) ## ## Shapiro-Wilk normality test ## ## data: Fatality$unrate
## W = 0.9701, p-value = 2.071e-06
shapiro.test(Fatality$perinc) ## ## Shapiro-Wilk normality test ## ## data: Fatality$perinc
## W = 0.968, p-value = 9.169e-07

Overrall the results of our models lead us to believe our models are adequate and explain the effect of the factors on the traffic fatality rate.

## 4. References to the literature

No literature was used for this recipe

## 5. Appendices

The R package in which this data was found can be located at http://cran.r-project.org/web/packages/Ecdat/index.html. The title of the dataset is “Fatality”

## 6. Contingencies

It is possible that the conclusions of our analysis are the results of chance. One concern is that the data is collected over several years. There are a large number of factors that fluctuate year to year that could effect the traffic fatality rate. In addition effects of factors like a mandatory jail sentence could be misleading since having a harsh versus a a light sentence could have different effects.

Different states also have many different types and amounts of roadways. It would be raitonal to hypothesize that a more rural state, with less large cities and areas with a large number of roads, despite different laws, drinking ages and income would have less traffic fatalities in general.

We can also perform the Kruskal-Wallis test which is a non-parametric test of two different groups of means. As we can see from our following tests our p-value is close to 0 for all factors and is therefore significant.

kruskal.test(mrall~mlda, data=data)
##
##  Kruskal-Wallis rank sum test
##
## data:  mrall by mlda
## Kruskal-Wallis chi-squared = 8.507, df = 1, p-value = 0.003538
kruskal.test(mrall~jaild, data=data)
##
##  Kruskal-Wallis rank sum test
##
## data:  mrall by jaild
## Kruskal-Wallis chi-squared = 28.15, df = 1, p-value = 1.12e-07
kruskal.test(mrall~comserd, data=data)
##
##  Kruskal-Wallis rank sum test
##
## data:  mrall by comserd
## Kruskal-Wallis chi-squared = 15.62, df = 1, p-value = 7.735e-05
kruskal.test(mrall~vmiles, data=data)
##
##  Kruskal-Wallis rank sum test
##
## data:  mrall by vmiles
## Kruskal-Wallis chi-squared = 54.54, df = 1, p-value = 1.526e-13
kruskal.test(mrall~unrate, data=data)
##
##  Kruskal-Wallis rank sum test
##
## data:  mrall by unrate
## Kruskal-Wallis chi-squared = 9.609, df = 1, p-value = 0.001936
kruskal.test(mrall~perinc, data=data)
##
##  Kruskal-Wallis rank sum test
##
## data:  mrall by perinc
## Kruskal-Wallis chi-squared = 55.37, df = 1, p-value = 1.001e-13