Recipes for the Design of Experiments: Recipe Outline

as of August 28, 2014, superceding the version of August 24. Always use the most recent version.

Design of Experiments: Recipe 9 - Response Surface Methods

Kevin Toth

RPI

12/01/2014 V2.0

1. Setting

Traffic Fatality and Drunk Driving Laws

This analysis uses drinking laws and demographic statistics by state to test the effect 4 different factors may have on the traffic fatality rate in deaths per 10,000.

Below is the installation and initial examination of the dataset:

#Installing data package
library("Ecdat", lib.loc="~/R/win-library/3.1")
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
#Attaching Cigarrette Consumption dataset from library
data<-Fatality
attach(data)

Factors and Levels

For this 4 factor analysis we will be examining the following factors:

“beertax” which represents the tax on a case of beer in that state “mlda” which represents the minimum legal age to drink in the state “unrate” which shows the unemployment rate “vmiles” which is the average miles per dirve

All factors are broken down into 3 levels based on their range.

#Setting 3 levels of Beer Tax
data$beertax[as.numeric(data$beertax)>= 0 & as.numeric(data$beertax) <= .25 ] = "<.25"
data$beertax[as.numeric(data$beertax)>= .25 & as.numeric(data$beertax) <= 1 ] = ".25-1"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
data$beertax[as.numeric(data$beertax)>= 1] = ">1"
## Warning: NAs introduced by coercion
#Setting 3 levels of Minimum Legal Drinking Age
data$mlda[as.numeric(data$mlda)>= 18 & as.numeric(data$mlda)<20] = "18-19"
data$mlda[as.numeric(data$mlda)>= 20 & as.numeric(data$mlda)<21] = "20"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
data$mlda[as.numeric(data$mlda)>= 21] = "21+"
## Warning: NAs introduced by coercion
#Setting 3 levels of Average Miles traveled
data$vmiles[as.numeric(data$vmiles)>= 0 & as.numeric(data$vmiles) <9] = "0-9"
data$vmiles[as.numeric(data$vmiles)>= 9 & as.numeric(data$vmiles) <18] = "9-18"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
data$vmiles[as.numeric(data$vmiles)>= 18] = "18+"
## Warning: NAs introduced by coercion
#Setting 3 levels of unemployment rate
data$unrate[as.numeric(data$unrate)>= 0 & as.numeric(data$unrate) <5] = "0-5"
data$unrate[as.numeric(data$unrate)>= 5 & as.numeric(data$unrate) <10] = "5-10"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
data$unrate[as.numeric(data$unrate)>= 10 ] = ">10"
## Warning: NAs introduced by coercion
#Display Levels of Factors
unique(data$beertax)
## [1] ">1"    "<.25"  ".25-1"
unique(data$mlda)
## [1] "18-19" "21+"   "20"
unique(data$vmiles)
## [1] "0-9"  "9-18" "18+"
unique(data$unrate)
## [1] ">10"  "5-10" "0-5"
#Overall Look at Data
head(data)
##   state year mrall beertax  mlda jaild comserd vmiles unrate perinc
## 1     1 1982 2.128      >1 18-19    no      no    0-9    >10  10544
## 2     1 1983 2.348      >1 18-19    no      no    0-9    >10  10733
## 3     1 1984 2.336      >1 18-19    no      no    0-9    >10  11109
## 4     1 1985 2.193      >1 18-19    no      no    0-9   5-10  11333
## 5     1 1986 2.669      >1   21+    no      no    0-9   5-10  11662
## 6     1 1987 2.719      >1   21+    no      no   9-18   5-10  11944
tail(data)
##     state year mrall beertax  mlda jaild comserd vmiles unrate perinc
## 331    56 1983 3.353    <.25 18-19   yes      no   9-18   5-10  13575
## 332    56 1984 3.060    <.25 18-19   yes      no   9-18   5-10  13456
## 333    56 1985 2.986    <.25 18-19   yes      no   9-18   5-10  13595
## 334    56 1986 3.314    <.25 18-19   yes      no   9-18   5-10  13127
## 335    56 1987 2.633    <.25 18-19   yes      no   9-18   5-10  12719
## 336    56 1988 3.236    <.25 18-19   yes      no   9-18   5-10  13098
summary(data)
##      state           year          mrall         beertax         
##  Min.   : 1.0   Min.   :1982   Min.   :0.821   Length:336        
##  1st Qu.:18.8   1st Qu.:1983   1st Qu.:1.624   Class :character  
##  Median :30.5   Median :1985   Median :1.956   Mode  :character  
##  Mean   :30.2   Mean   :1985   Mean   :2.040                     
##  3rd Qu.:42.5   3rd Qu.:1987   3rd Qu.:2.418                     
##  Max.   :56.0   Max.   :1988   Max.   :4.218                     
##      mlda           jaild     comserd      vmiles         
##  Length:336         no :242   no :274   Length:336        
##  Class :character   yes: 94   yes: 62   Class :character  
##  Mode  :character                       Mode  :character  
##                                                           
##                                                           
##                                                           
##     unrate              perinc     
##  Length:336         Min.   : 9514  
##  Class :character   1st Qu.:12086  
##  Mode  :character   Median :13763  
##                     Mean   :13880  
##                     3rd Qu.:15175  
##                     Max.   :22193

Continuous variables (if any)

All factors and variables in this data are continuous, however the factors have been set into 3 levels for analysis.

Response variables

For this experiment we will be focusing on the traffic fatality rate in deaths per 10,000 as the response variable being effected by the chosen factors.

The Data: How is it organized and what does it look like?

The data set is organized by the following variables: state, year, mrall, beertax, mlda, jaild, comserd, vmiles, unrate, and perinc.

State and year indicate the state ID number in which this data is observed and what year the data was recorded. Mrall represents the traffic fatality rate in deaths per 10,000 people.

beertax, mlda, jaild, and comserd all have to do with alcohol related laws in that state. Beer tax is the tax rate on a case of beer, mlda is the minimum legal drinking age, and jaild and comserd represent whether or not there is a mandatory jail sentence or community service time for that state in the event of alcohol related driving offenses.

vmiles is the average miles per drive.

Lastly unrate and perinc represent demographic information for that state. unrate is the unemployment rate and perinc is the per capita personal income for that observation.

The data itself is in order by state ID and then by year.

Randomization

It is unknown whether or not the data collected for this study was collected by a randomly designed experiment.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

To perform the experiment the data will first be altered so that our factors have 3 levels. Then we will create a linear model using the factor and use a response surface method to perform an analysis of variance one the model. From these analysis we will be able to test the null hypothesis, that the traffic fatality rate is independant of the 4 factors. Our response surface method will include second order effects.

What is the rationale for this design?

The rationale for using an analysis of variance test to check whether the means of several groups are equal. The alternative would be to use a two-sample t-tests however there is more likely chance of the test resulting in a false positive of the hypothesis. By using a response surface method we will also find an “optimal” set of inputs for a minimum or maximum response variable.

Randomize: What is the Randomization Scheme?

The data was collected in an unknown way so we do not know if there was any randomization to it.

Replicate: Are there replicates and/or repeated measures?

There are no replicated or repeated measures in the data.

Block: Did you use blocking in the design?

There was no blocking in the design of this experiment

3. Statistical Analysis

Exploratory Data Analysis: Graphics and Descriptive summary

To start our statistical analysis we will make our variables factors for the analysis of variance and look at some boxplots of those factors.

#Defining Minimum Legal Drinking Age as a factor
data$mlda = as.factor(data$mlda)

#Defining Tax on a Case of Beer as a factor
data$beertax = as.factor(data$beertax)

#Defining Unemployment Rate as a factor
data$unrate = as.factor(data$unrate)

#Defining Average Miles Driven as a factor
data$vmiles = as.factor(data$vmiles)


#Boxplots of of the means of each variable against response variable
boxplot(mrall~mlda, data=data, xlab="Minimum Drinking Age", ylab="Fatality Rate")

plot of chunk unnamed-chunk-3

boxplot(mrall~beertax, data=data, xlab="Tax on Beer", ylab="Fatality Rate")

plot of chunk unnamed-chunk-3

boxplot(mrall~unrate, data=data, xlab="Unemployment Rate", ylab="Fatality Rate")

plot of chunk unnamed-chunk-3

boxplot(mrall~vmiles, data=data, xlab="Average Miles Driven", ylab="Fatality Rate")

plot of chunk unnamed-chunk-3 There are no discernable relationships between the factors minimum legal drinking age and average miles and the repsonse variable. We do see a slight positive relationship between the tax on beer and the unemployment rate on the respones variable.

Testing

First we generate a linear model with our factors. Then to test the hypotheses we perform an ANOVA test via the response surface method package.

The null hypothesis of the tests is that the factors do not have an effect on the response variable of the traffic fatality rate.

#Loading response surface methods package
library(rsm)
## Warning: package 'rsm' was built under R version 3.1.2
#Set factors to binary operators for Response Surface method
data$mlda = as.integer(data$mlda)
data$beertax = as.integer(data$beertax)
data$unrate = as.integer(data$unrate)
data$vmiles = as.integer(data$vmiles)

#Generation of Linear Model using Response Surface Methods
model=rsm(mrall~SO(beertax, mlda, vmiles, unrate), data=data)
summary(model)
## 
## Call:
## rsm(formula = mrall ~ SO(beertax, mlda, vmiles, unrate), data = data)
## 
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.07134    1.45081    3.50  0.00054 ***
## beertax        -1.75611    0.29024   -6.05  4.0e-09 ***
## mlda           -1.18777    0.40100   -2.96  0.00328 ** 
## vmiles          0.27957    1.87623    0.15  0.88164    
## unrate         -1.30648    0.33183   -3.94  0.00010 ***
## beertax:mlda    0.11860    0.04278    2.77  0.00589 ** 
## beertax:vmiles -0.05276    0.05344   -0.99  0.32425    
## beertax:unrate -0.00331    0.04643   -0.07  0.94314    
## mlda:vmiles    -0.19996    0.05060   -3.95  9.5e-05 ***
## mlda:unrate     0.00797    0.03957    0.20  0.84060    
## vmiles:unrate   0.01431    0.09021    0.16  0.87408    
## beertax^2       0.44057    0.06154    7.16  5.6e-12 ***
## mlda^2          0.29341    0.09322    3.15  0.00180 ** 
## vmiles^2        0.14955    0.47472    0.32  0.75295    
## unrate^2        0.31579    0.08282    3.81  0.00016 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Multiple R-squared:  0.386,  Adjusted R-squared:  0.359 
## F-statistic: 14.4 on 14 and 321 DF,  p-value: <2e-16
## 
## Analysis of Variance Table
## 
## Response: mrall
##                                     Df Sum Sq Mean Sq F value  Pr(>F)
## FO(beertax, mlda, vmiles, unrate)    4   19.3    4.83   23.17 < 2e-16
## TWI(beertax, mlda, vmiles, unrate)   6    4.2    0.70    3.38   0.003
## PQ(beertax, mlda, vmiles, unrate)    4   18.4    4.61   22.11 3.9e-16
## Residuals                          321   66.9    0.21                
## Lack of fit                         19    5.1    0.27    1.32   0.170
## Pure error                         302   61.8    0.20                
## 
## Stationary point of response surface:
## beertax    mlda  vmiles  unrate 
##  1.7875  1.8009  0.4869  2.0443 
## 
## Eigenanalysis:
## $values
## [1] 0.47327 0.31629 0.31179 0.09796
## 
## $vectors
##             [,1]     [,2]    [,3]     [,4]
## beertax  0.89298 -0.08914  0.4410  0.01114
## mlda     0.40421  0.18867 -0.7687 -0.45835
## vmiles  -0.19780 -0.05714  0.4114 -0.88791
## unrate  -0.00816  0.97632  0.2129  0.03763

As we can see from the results of our model the tax on beer and unemployment rate are most significant and minimum legal drinking age is aslo significant. This is true because of our near zero p-values. It is also worth noting average miles driven had a very high p-value. Overall we reject the null hypothesis that the response variable mrall is independent of beertax, mlda and unrate.

Our stationary point of the response surface was: beertax mlda vmiles unrate 1.7874528 1.8009469 0.4868878 2.0442522

This should be an optimal point on the response surface.

Moving forward we will examine contour plots to examine two factor surfaces.

#Contour Plots for factor interaction surfaces

contour(model, ~beertax + mlda + vmiles + unrate, image = TRUE, at=summary(model$canonical$xs))

plot of chunk unnamed-chunk-5plot of chunk unnamed-chunk-5plot of chunk unnamed-chunk-5plot of chunk unnamed-chunk-5plot of chunk unnamed-chunk-5plot of chunk unnamed-chunk-5

Diagnostics/Model Adequacy Checking

Next we graph Q-Q plots to check our data in our model for normality. If the data is not normal the results of the analysis may not be valid since normality within the data is an assumption of ANOVA tests.

We see slight deviation from normality in our Q-Q plot

#QQ plots for residuals of the model
qqnorm(residuals(model))
qqline(residuals(model))

plot of chunk unnamed-chunk-6

Second we plot a fitted model against the residuals. We don’t see a very large degree of variation among the plot which indicates the data is not normal.

#Plot of Fitted vs Residuals of model
plot(fitted(model), residuals(model))

plot of chunk unnamed-chunk-7

Lastly we perform Shapiro-Wilke tests on our data to test for normality

#Shaprio Wilke tests on numeric variables
shapiro.test(Fatality$mrall)
## 
##  Shapiro-Wilk normality test
## 
## data:  Fatality$mrall
## W = 0.9667, p-value = 5.869e-07
shapiro.test(Fatality$mlda)
## 
##  Shapiro-Wilk normality test
## 
## data:  Fatality$mlda
## W = 0.639, p-value < 2.2e-16
shapiro.test(Fatality$vmiles)
## 
##  Shapiro-Wilk normality test
## 
## data:  Fatality$vmiles
## W = 0.7019, p-value < 2.2e-16
shapiro.test(Fatality$unrate)
## 
##  Shapiro-Wilk normality test
## 
## data:  Fatality$unrate
## W = 0.9701, p-value = 2.071e-06
shapiro.test(Fatality$beertax)
## 
##  Shapiro-Wilk normality test
## 
## data:  Fatality$beertax
## W = 0.7553, p-value < 2.2e-16

Overrall the results of our model adequacy testing lead us to believe our model is not adequate and does not explain the effect of the factors on the response variable of fatality rate.

4. References to the literature

No literature was used

5. Appendices

The R package in which this data was found can be located in the R library called Ecdat. The dataset is named “Fatality”

6. Contingencies

It is possible that the conclusions of our analysis are the results of chance. One concern is that the data is collected over several years. There are a large number of factors that fluctuate year to year that could effect the traffic fatality rate. In addition effects of factors like a mandatory jail sentence could be misleading since having a harsh versus a a light sentence could have different effects.

Different states also have many different types and amounts of roadways. It would be rational to hypothesize that a more rural state, with less large cities and areas with a large number of roads, despite different laws, drinking ages and income would have less traffic fatalities in general.

We can also perform the Kruskal-Wallis test which is a non-parametric test of two different groups of means. As we can see from our following tests our p-value is close to 0 for all factors and is therefore significant.

kruskal.test(mrall~mlda, data=data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  mrall by mlda
## Kruskal-Wallis chi-squared = 15.26, df = 2, p-value = 0.0004868
kruskal.test(mrall~beertax, data=data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  mrall by beertax
## Kruskal-Wallis chi-squared = 59.43, df = 2, p-value = 1.244e-13
kruskal.test(mrall~vmiles, data=data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  mrall by vmiles
## Kruskal-Wallis chi-squared = 43.56, df = 2, p-value = 3.483e-10
kruskal.test(mrall~unrate, data=data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  mrall by unrate
## Kruskal-Wallis chi-squared = 19.85, df = 2, p-value = 4.903e-05