The following experimental analysis looks at a data set containing crime rate data from all 50 of the United States in the year 2012. Below are the first 6 lines of this data set, and below that is a figure showing the structure of the data set.
CSV<-read.csv("C:\\Users\\Anthony\\Desktop\\School\\RPI Year 1\\DoE\\CrimeRMS.csv",header=TRUE)
head(CSV)
## Population Violent Murder Robbery Inmates
## 1 3 2 3 3 650.5
## 2 1 3 2 2 405.2
## 3 3 2 3 3 582.8
## 4 2 2 3 2 494.2
## 5 2 2 2 3 351.2
## 6 3 2 2 2 392.2
str(CSV)
## 'data.frame': 50 obs. of 5 variables:
## $ Population: int 3 1 3 2 2 3 2 1 3 3 ...
## $ Violent : int 2 3 2 2 2 2 1 3 3 2 ...
## $ Murder : int 3 2 3 3 2 2 2 3 2 2 ...
## $ Robbery : int 3 2 3 2 3 2 3 3 3 3 ...
## $ Inmates : num 650 405 583 494 351 ...
library(rsm)
## Warning: package 'rsm' was built under R version 3.1.2
Inmates<-as.integer(CSV$Inmates)
Population<-as.integer(CSV$Population)
Violent<-as.integer(CSV$Violent)
Murder<-as.integer(CSV$Murder)
Robbery<-as.integer(CSV$Robbery)
This data set consists of four factors: Population, Violent, Murder, and Robbery. The first factor is the population of the state, and the other three are related to crime rates of violent crimes, murder, and robbery. Each factor is split into three levels: 1, 2, and 3 which correspond to low, medium, and high values respectively. The data was manipulated this way to make it suitable for a 3^k design analysis.
The response variable for this experimental analysis is Inmates. This variable is a continuous variable and is a number that denotes the number of inmates per 100,000 citizens in the state.
How will the experiment be organized and conducted to test the hypothesis?
In this experimental data analysis I will use a Response Surface Methods approach to determine if any of the four factors in the data set cause variations in the response variable.
The null hypothesis (H0) for all four factors is that each factor does not have an effect on the response variable “Criminals” that can be accredited to something other than randomization.
What is the rationale for this design?
I have chosen to use this experimental design to demonstrate proper usage of response surface methods in a 3^k design to analyze the effects of four factors on a response variable.
Below are boxplots showing any evident trends in the response variable with respect to each of the four factors.
par(mfrow=c(2,2))
boxplot(Inmates~Population,data=CSV,xlab="Population",ylab="Inmates")
boxplot(Inmates~Violent,data=CSV,xlab="Violent",ylab="Inmates")
boxplot(Inmates~Murder,data=CSV,xlab="Murder",ylab="Inmates")
boxplot(Inmates~Robbery,data=CSV,xlab="Robbery",ylab="Inmates")
Each boxplot seems to show that as each factor increases so does the value of the response variable.
The response surface method below was used to analyze the changes in inmates as a response to changes in each of the four factors.
model<-rsm(Inmates~SO(Population,Violent,Murder,Robbery),data=CSV)
summary(model)
##
## Call:
## rsm(formula = Inmates ~ SO(Population, Violent, Murder, Robbery),
## data = CSV)
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -88.642 274.372 -0.32 0.75
## Population 299.647 231.380 1.30 0.20
## Violent -28.209 169.228 -0.17 0.87
## Murder 307.019 194.754 1.58 0.12
## Robbery -102.221 231.074 -0.44 0.66
## Population:Violent 0.387 63.143 0.01 1.00
## Population:Murder -39.050 49.478 -0.79 0.44
## Population:Robbery 33.588 74.364 0.45 0.65
## Violent:Murder -63.899 59.521 -1.07 0.29
## Violent:Robbery 80.791 80.278 1.01 0.32
## Murder:Robbery 92.561 83.371 1.11 0.27
## Population^2 -77.849 65.494 -1.19 0.24
## Violent^2 -7.701 62.429 -0.12 0.90
## Murder^2 -45.478 57.484 -0.79 0.43
## Robbery^2 -70.168 71.373 -0.98 0.33
##
## Multiple R-squared: 0.445, Adjusted R-squared: 0.223
## F-statistic: 2 on 14 and 35 DF, p-value: 0.048
##
## Analysis of Variance Table
##
## Response: Inmates
## Df Sum Sq Mean Sq F value Pr(>F)
## FO(Population, Violent, Murder, Robbery) 4 388988 97247 5.73 0.0012
## TWI(Population, Violent, Murder, Robbery) 6 44862 7477 0.44 0.8467
## PQ(Population, Violent, Murder, Robbery) 4 42131 10533 0.62 0.6510
## Residuals 35 594308 16980
## Lack of fit 14 189628 13545 0.70 0.7484
## Pure error 21 404680 19270
##
## Stationary point of response surface:
## Population Violent Murder Robbery
## 1.698 3.682 5.746 5.588
##
## Eigenanalysis:
## $values
## [1] 15.583 -9.587 -70.219 -136.973
##
## $vectors
## [,1] [,2] [,3] [,4]
## Population -0.1147 -0.07194 0.9197 0.3687
## Violent -0.9045 0.05799 -0.2460 0.3436
## Murder 0.2876 0.78212 -0.1193 0.5397
## Robbery -0.2933 0.61623 0.2819 -0.6743
Below I create contour plots for all factor interactions.
par(mfrow=c(2,3))
contour(model,~Population+Violent+Murder+Robbery,image=TRUE,at=summary(model$canonical$xs))
Looking at the above results, because all of the P values that are returned for each factor interaction are higher than the alpha level (0.05), the null hypothesis can not be rejected for any of the four factors in this data set. This means that the variation in the response variable can not be explained by anything other than randomization in this analysis.
The stationary points for each factor using response surface methods are: 1.698 (Population), 3.682 (Violent), 5.746 (Murder), 5.588 (Robbery).
Looking at the second order response surfaces in the contour plots above it appears that there is a maximum at the stationary points in the first and third plots. There also seem to be maximums at the top of the second plot and top right of the sixth plot. The fifth plot looks as if it may be the location of a saddle point. These contour plots may be somewhat misleading because they only look at the reponse variable with regards to two factors per plot. Thus, below I create three dimensional perspective plots which are more representative of the response surfaces of the data.
library(rgl)
## Warning: package 'rgl' was built under R version 3.1.2
par(mfrow=c(1,1))
persp(model,~Population+Violent,image=TRUE,at=c(summary(model)$canonical$xs,block="B2"),contour="colors",zlab="Inmates",theta=30)
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model,~Population+Murder,image=TRUE,at=c(summary(model)$canonical$xs,block="B2"),contour="colors",zlab="Inmates",theta=30)
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model,~Population+Robbery,image=TRUE,at=c(summary(model)$canonical$xs,block="B2"),contour="colors",zlab="Inmates",theta=30)
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model,~Violent+Murder,image=TRUE,at=c(summary(model)$canonical$xs,block="B2"),contour="colors",zlab="Inmates",theta=30)
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model,~Violent+Robbery,image=TRUE,at=c(summary(model)$canonical$xs,block="B2"),contour="colors",zlab="Inmates",theta=30)
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model,~Murder+Robbery,image=TRUE,at=c(summary(model)$canonical$xs,block="B2"),contour="colors",zlab="Inmates",theta=30)
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
Looking at the 3D perspective plots, there is evidence of a number of maximums on the Population/Violent plot, Population/Murder plot, Population/Robbery plot, and the Murder/Robbery plot. All plots seem to resemble a ridge, and the only plot which seems like it may be part of a saddle point is the Population/Violent plot.
Below I perform a Shapiro-Wilk Test to determine if the data set is normally distributed.
shapiro.test(CSV$Inmates)
##
## Shapiro-Wilk normality test
##
## data: CSV$Inmates
## W = 0.9592, p-value = 0.08197
The null hypothesis for a Shapiro-Wilk test is that the data does not deviate from normality. Thus the resulting P-value which is greater than the alpha level of 0.05 indicates that we cannot reject the null hypothesis and thus the data is considered to follow a normal distribution.
Below I create two plots to further check the model adequacy of this experimental analysis. The first plot is a quantile-quantile (QQ) plot that is used to determine if the residuals of the data set follow a normal distribution. Since the residuals are nearly linear it seems that this is an adequate model.
After the QQ plot I created a Residuals vs. Fits plot which is used to identify the linearity of the residual values and to determine if there are any outlying values. Because the residual values seem to be scattered around zero for this model it can be concluded that the model used in this analysis is accurate for determining the effect of the four factors on the response variable of inmates.
par(mfrow=c(1,2))
qqnorm(residuals(model))
qqline(residuals(model))
plot(fitted(model),residuals(model))
The data for this experimental analysis was taken from the website http://www.disastercenter.com/crime/uscrime.htm.