Anthony D’Amato

RPI

11/30/2014

1. Setting

Crime Rates in the United States in 2012

The following experimental analysis looks at a data set containing crime rate data from all 50 of the United States in the year 2012. Below are the first 6 lines of this data set, and below that is a figure showing the structure of the data set.

CSV<-read.csv("C:\\Users\\Anthony\\Desktop\\School\\RPI Year 1\\DoE\\CrimeRMS.csv",header=TRUE)
head(CSV)
##   Population Violent Murder Robbery Inmates
## 1          3       2      3       3   650.5
## 2          1       3      2       2   405.2
## 3          3       2      3       3   582.8
## 4          2       2      3       2   494.2
## 5          2       2      2       3   351.2
## 6          3       2      2       2   392.2
str(CSV)
## 'data.frame':    50 obs. of  5 variables:
##  $ Population: int  3 1 3 2 2 3 2 1 3 3 ...
##  $ Violent   : int  2 3 2 2 2 2 1 3 3 2 ...
##  $ Murder    : int  3 2 3 3 2 2 2 3 2 2 ...
##  $ Robbery   : int  3 2 3 2 3 2 3 3 3 3 ...
##  $ Inmates   : num  650 405 583 494 351 ...
library(rsm)
## Warning: package 'rsm' was built under R version 3.1.2
Inmates<-as.integer(CSV$Inmates)
Population<-as.integer(CSV$Population)
Violent<-as.integer(CSV$Violent)
Murder<-as.integer(CSV$Murder)
Robbery<-as.integer(CSV$Robbery)

Factors and Levels

This data set consists of four factors: Population, Violent, Murder, and Robbery. The first factor is the population of the state, and the other three are related to crime rates of violent crimes, murder, and robbery. Each factor is split into three levels: 1, 2, and 3 which correspond to low, medium, and high values respectively. The data was manipulated this way to make it suitable for a 3^k design analysis.

The response variable for this experimental analysis is Inmates. This variable is a continuous variable and is a number that denotes the number of inmates per 100,000 citizens in the state.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?
In this experimental data analysis I will use a Response Surface Methods approach to determine if any of the four factors in the data set cause variations in the response variable.

The null hypothesis (H0) for all four factors is that each factor does not have an effect on the response variable “Criminals” that can be accredited to something other than randomization.

What is the rationale for this design?
I have chosen to use this experimental design to demonstrate proper usage of response surface methods in a 3^k design to analyze the effects of four factors on a response variable.

3. Statistical Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Below are boxplots showing any evident trends in the response variable with respect to each of the four factors.

par(mfrow=c(2,2))
boxplot(Inmates~Population,data=CSV,xlab="Population",ylab="Inmates")
boxplot(Inmates~Violent,data=CSV,xlab="Violent",ylab="Inmates")
boxplot(Inmates~Murder,data=CSV,xlab="Murder",ylab="Inmates")
boxplot(Inmates~Robbery,data=CSV,xlab="Robbery",ylab="Inmates")

plot of chunk unnamed-chunk-2

Each boxplot seems to show that as each factor increases so does the value of the response variable.

Testing

Response Surface Methods (RSM)

The response surface method below was used to analyze the changes in inmates as a response to changes in each of the four factors.

model<-rsm(Inmates~SO(Population,Violent,Murder,Robbery),data=CSV)
summary(model)
## 
## Call:
## rsm(formula = Inmates ~ SO(Population, Violent, Murder, Robbery), 
##     data = CSV)
## 
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         -88.642    274.372   -0.32     0.75
## Population          299.647    231.380    1.30     0.20
## Violent             -28.209    169.228   -0.17     0.87
## Murder              307.019    194.754    1.58     0.12
## Robbery            -102.221    231.074   -0.44     0.66
## Population:Violent    0.387     63.143    0.01     1.00
## Population:Murder   -39.050     49.478   -0.79     0.44
## Population:Robbery   33.588     74.364    0.45     0.65
## Violent:Murder      -63.899     59.521   -1.07     0.29
## Violent:Robbery      80.791     80.278    1.01     0.32
## Murder:Robbery       92.561     83.371    1.11     0.27
## Population^2        -77.849     65.494   -1.19     0.24
## Violent^2            -7.701     62.429   -0.12     0.90
## Murder^2            -45.478     57.484   -0.79     0.43
## Robbery^2           -70.168     71.373   -0.98     0.33
## 
## Multiple R-squared:  0.445,  Adjusted R-squared:  0.223 
## F-statistic:    2 on 14 and 35 DF,  p-value: 0.048
## 
## Analysis of Variance Table
## 
## Response: Inmates
##                                           Df Sum Sq Mean Sq F value Pr(>F)
## FO(Population, Violent, Murder, Robbery)   4 388988   97247    5.73 0.0012
## TWI(Population, Violent, Murder, Robbery)  6  44862    7477    0.44 0.8467
## PQ(Population, Violent, Murder, Robbery)   4  42131   10533    0.62 0.6510
## Residuals                                 35 594308   16980               
## Lack of fit                               14 189628   13545    0.70 0.7484
## Pure error                                21 404680   19270               
## 
## Stationary point of response surface:
## Population    Violent     Murder    Robbery 
##      1.698      3.682      5.746      5.588 
## 
## Eigenanalysis:
## $values
## [1]   15.583   -9.587  -70.219 -136.973
## 
## $vectors
##               [,1]     [,2]    [,3]    [,4]
## Population -0.1147 -0.07194  0.9197  0.3687
## Violent    -0.9045  0.05799 -0.2460  0.3436
## Murder      0.2876  0.78212 -0.1193  0.5397
## Robbery    -0.2933  0.61623  0.2819 -0.6743

Below I create contour plots for all factor interactions.

par(mfrow=c(2,3))
contour(model,~Population+Violent+Murder+Robbery,image=TRUE,at=summary(model$canonical$xs))

plot of chunk unnamed-chunk-4

RSM Results

Looking at the above results, because all of the P values that are returned for each factor interaction are higher than the alpha level (0.05), the null hypothesis can not be rejected for any of the four factors in this data set. This means that the variation in the response variable can not be explained by anything other than randomization in this analysis.

The stationary points for each factor using response surface methods are: 1.698 (Population), 3.682 (Violent), 5.746 (Murder), 5.588 (Robbery).

Looking at the second order response surfaces in the contour plots above it appears that there is a maximum at the stationary points in the first and third plots. There also seem to be maximums at the top of the second plot and top right of the sixth plot. The fifth plot looks as if it may be the location of a saddle point. These contour plots may be somewhat misleading because they only look at the reponse variable with regards to two factors per plot. Thus, below I create three dimensional perspective plots which are more representative of the response surfaces of the data.

library(rgl)
## Warning: package 'rgl' was built under R version 3.1.2
par(mfrow=c(1,1))
persp(model,~Population+Violent,image=TRUE,at=c(summary(model)$canonical$xs,block="B2"),contour="colors",zlab="Inmates",theta=30)
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-5

persp(model,~Population+Murder,image=TRUE,at=c(summary(model)$canonical$xs,block="B2"),contour="colors",zlab="Inmates",theta=30)
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-5

persp(model,~Population+Robbery,image=TRUE,at=c(summary(model)$canonical$xs,block="B2"),contour="colors",zlab="Inmates",theta=30)
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-5

persp(model,~Violent+Murder,image=TRUE,at=c(summary(model)$canonical$xs,block="B2"),contour="colors",zlab="Inmates",theta=30)
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-5

persp(model,~Violent+Robbery,image=TRUE,at=c(summary(model)$canonical$xs,block="B2"),contour="colors",zlab="Inmates",theta=30)
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-5

persp(model,~Murder+Robbery,image=TRUE,at=c(summary(model)$canonical$xs,block="B2"),contour="colors",zlab="Inmates",theta=30)
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-5

Looking at the 3D perspective plots, there is evidence of a number of maximums on the Population/Violent plot, Population/Murder plot, Population/Robbery plot, and the Murder/Robbery plot. All plots seem to resemble a ridge, and the only plot which seems like it may be part of a saddle point is the Population/Violent plot.

Diagnostics/Model Adequacy Checking

Shapiro-Wilk Test

Below I perform a Shapiro-Wilk Test to determine if the data set is normally distributed.

shapiro.test(CSV$Inmates)
## 
##  Shapiro-Wilk normality test
## 
## data:  CSV$Inmates
## W = 0.9592, p-value = 0.08197

The null hypothesis for a Shapiro-Wilk test is that the data does not deviate from normality. Thus the resulting P-value which is greater than the alpha level of 0.05 indicates that we cannot reject the null hypothesis and thus the data is considered to follow a normal distribution.

Below I create two plots to further check the model adequacy of this experimental analysis. The first plot is a quantile-quantile (QQ) plot that is used to determine if the residuals of the data set follow a normal distribution. Since the residuals are nearly linear it seems that this is an adequate model.

After the QQ plot I created a Residuals vs. Fits plot which is used to identify the linearity of the residual values and to determine if there are any outlying values. Because the residual values seem to be scattered around zero for this model it can be concluded that the model used in this analysis is accurate for determining the effect of the four factors on the response variable of inmates.

par(mfrow=c(1,2))
qqnorm(residuals(model))
qqline(residuals(model))
plot(fitted(model),residuals(model))

plot of chunk unnamed-chunk-7

4.Appendices

The data for this experimental analysis was taken from the website http://www.disastercenter.com/crime/uscrime.htm.