The following experimental analysis examines a data set containing the crime rates in the United States from the 1960s through the 2000s. Below are the first 6 lines of this data set, followed by output from str() showing the structure of the data set.
Crime<-read.csv("C:\\Users\\Anthony\\Desktop\\School\\RPI Year 1\\DoE\\Crime5.csv",header=TRUE)
# Typecast variables for the analysis (stored as standalone vectors used throughout)
Decade<-as.factor(Crime$Decade)
Total<-as.integer(Crime$Total)
head(Crime)
## Decade Total
## 1 1960s 3384200
## 2 1960s 3488000
## 3 1960s 3752200
## 4 1960s 4109500
## 5 1960s 4564600
## 6 1960s 4739400
str(Crime)
## 'data.frame': 50 obs. of 2 variables:
## $ Decade: Factor w/ 5 levels "1960s","1970s",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Total : int 3384200 3488000 3752200 4109500 4564600 4739400 5223500 5903400 6720200 7410900 ...
In this experiment I will perform an analysis of variance (ANOVA) to help determine the effect of the factor “Decade” on the total crime rate in the United States.
The categorical variable “Decade” has five levels; each level is a decade from the 1960s to the 2000s inclusive.
The response variable is the total crime rate in the United States, and it is a continuous variable.
The original data source records decade, total population, and total crime rate, as well as the rates of specific types of crime such as violent crime, property crime, murder, and rape; this analysis uses only the decade and total crime columns.
How will the experiment be organized and conducted to test the hypothesis?
In this experimental data analysis I will perform an analysis of variance (ANOVA) to determine the effect of changes in decade on the response variable of total crime.
After this ANOVA is complete I will use resampling techniques to determine the actual distribution of the data set and perform another ANOVA. This second ANOVA will differ from the first in that the first ANOVA assumes the data follow a normal distribution, while the resampled distribution does not rely on that assumption.
What is the rationale for this design?
I have chosen this experimental design to demonstrate proper usage of resampling techniques to perform statistical inference via the analysis of variance more accurately.
Below is a boxplot representing a clear trend in the total crime rates in the United States from the 1960s to 2000s.
boxplot(Total~Decade,data=Crime,main="United States Crime",xlab="Decade",ylab="Total Crime Rate")
This boxplot shows that the crime rate in the United States increased each decade from the 1960s until the 1990s, and then began to decrease.
Below I perform an analysis of variance (ANOVA) to determine the effect of decade on the response variable of total crime rate.
model<-aov(Total~Decade,data=Crime)
anova(model)
## Analysis of Variance Table
##
## Response: Total
## Df Sum Sq Mean Sq F value Pr(>F)
## Decade 4 4.89e+14 1.22e+14 102 <2e-16 ***
## Residuals 45 5.37e+13 1.19e+12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis for this statistical analysis states that there is no change in total crime rate across decades that can be attributed to anything other than randomization. In the results of the ANOVA above we see a very low p-value (< 2e-16), which allows us to reject this null hypothesis. We therefore accept the alternative hypothesis, which states that there is a change in total crime rate attributable to something other than randomization; in this case, the change in decade. The mean square of the residuals in this ANOVA is 1.19e12.
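As a small sketch beyond the original output, the same quantities can be pulled from the ANOVA table programmatically instead of being read off by eye (the name tab is an illustrative choice):
# Extract the Decade p-value and the residual mean square from the fitted model
tab <- anova(model)
tab["Decade", "Pr(>F)"]      # p-value for the Decade factor
tab["Residuals", "Mean Sq"]  # residual mean square, used later in the simulation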
To expand on the results of the ANOVA, I perform Tukey’s Honest Significant Difference (HSD) test to determine which levels of the decade factor result in total crime rates that are statistically different from each other.
TukeyHSD(model,which="Decade",conf.level=0.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Total ~ Decade, data = Crime)
##
## $Decade
## diff lwr upr p adj
## 1970s-1960s 5169570 3781166 6557974 0.0000
## 1980s-1960s 8182747 6794343 9571151 0.0000
## 1990s-1960s 8728625 7340220 10117029 0.0000
## 2000s-1960s 6571614 5183210 7960019 0.0000
## 1980s-1970s 3013177 1624773 4401581 0.0000
## 1990s-1970s 3559055 2170650 4947459 0.0000
## 2000s-1970s 1402044 13640 2790449 0.0467
## 1990s-1980s 545878 -842527 1934282 0.7967
## 2000s-1980s -1611133 -2999537 -222728 0.0156
## 2000s-1990s -2157010 -3545414 -768606 0.0006
Above, Tukey’s HSD test compares the mean total crime rates between each pair of decades. For individual comparisons with an adjusted p-value < 0.05, there is a statistical difference between the total crime rates of those two factor levels that can be attributed to something other than randomization, at the 95% family-wise confidence level.
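As a supplementary sketch (not part of the original code), the statistically different pairs can be extracted directly from the TukeyHSD result, whose $Decade component is a matrix with a "p adj" column:
# List the pairs of decades whose adjusted p-values fall below 0.05
tuk <- TukeyHSD(model, which="Decade")$Decade
rownames(tuk)[tuk[, "p adj"] < 0.05]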
Here I perform a Monte Carlo simulation, a simulation method based on a known theoretical distribution. It uses repeated random sampling to approximate the actual distribution of the F statistic for this data set under the null hypothesis.
Below I create arrays summarizing each level of the factor of interest (means, variances, and counts).
#creates an array of the means of the levels, stored for later use
meanstar=tapply(Total,Decade,mean)
meanstar
## 1960s 1970s 1980s 1990s 2000s
## 4929590 10099160 13112337 13658215 11501204
#creates an array of the variances of the levels
tapply(Total,Decade,var)
## 1960s 1970s 1980s 1990s 2000s
## 1.899e+12 2.363e+12 5.901e+11 9.863e+11 1.300e+11
#creates an array of the lengths of the levels
tapply(Total,Decade,length)
## 1960s 1970s 1980s 1990s 2000s
## 10 10 10 10 10
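The level variances above span roughly an order of magnitude, so as an optional check not included in the original analysis, the equal-variance assumption used later (var.equal=T) could be examined with Bartlett's test:
# Bartlett's test of homogeneity of variances across decades
bartlett.test(Total ~ Decade)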
Below is a summary of the ANOVA analysis, along with calculated means and parameters required for the Monte Carlo simulation.
# Summary of ANOVA results
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Decade 4 4.89e+14 1.22e+14 102 <2e-16 ***
## Residuals 45 5.37e+13 1.19e+12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Mean total crime rate in the data set
MeanTotal=mean(Total)
# Reuse the decade labels from the original data for the simulated data sets
simDecade=Crime$Decade
# Square root of the mean square of the ANOVA residuals (the pooled error SD),
# extracted from the fitted model rather than hard-coded
srMS=sqrt(anova(model)["Residuals","Mean Sq"])
# Assign the number of random simulations to be run in the simulation (R)
R=1000
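Because the simulation depends on random draws, the resulting histogram will differ slightly from run to run. For reproducibility one could fix the random seed first; the seed value here is an arbitrary choice, not part of the original analysis.
# Fix the random number generator seed so the simulation is reproducible
set.seed(42)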
Below is the code used to perform the Monte Carlo simulation.
fstar=numeric(R)
for (i in 1:R) {
  # Draw five groups of 10 from a single normal distribution, i.e. under the null hypothesis
  groupA = rnorm(10, mean=MeanTotal, sd=srMS)
  groupB = rnorm(10, mean=MeanTotal, sd=srMS)
  groupC = rnorm(10, mean=MeanTotal, sd=srMS)
  groupD = rnorm(10, mean=MeanTotal, sd=srMS)
  groupE = rnorm(10, mean=MeanTotal, sd=srMS)
  simTotal=c(groupA,groupB,groupC,groupD,groupE)
  simdata=data.frame(simDecade,simTotal)
  # Record the F statistic for this simulated data set
  fstar[i]=oneway.test(simTotal~simDecade,var.equal=T,data=simdata)$statistic
}
# Plot the empirical distribution of F under the null hypothesis, obtained from the Monte Carlo simulation
hist(fstar,ylim=c(0,0.75), prob=T, main="Histogram of Empirical F Distribution Using Monte Carlo Simulation")
# x values from 0 to 5.5 in 0.25 increments for overlaying the analytic F distribution
x=seq(0,5.5,.25)
# Overlay the analytic F distribution as red dots (degrees of freedom known)
points(x,y=df(x,4,45),type="b",col="red")
# Degrees of freedom: 4 = 5 levels - 1; 45 = 50 observations - 5 group means
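One natural use of this simulated null distribution, sketched here as an addition to the original analysis, is to locate the observed F statistic within it:
# Observed F statistic for the real data, taken from the fitted ANOVA model
Fobs <- anova(model)["Decade", "F value"]
# Proportion of simulated null F values at least as extreme: a Monte Carlo p-value
mean(fstar >= Fobs)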
Below is the code used to perform a Bootstrap simulation.
meanstar = tapply(Total,Decade,mean)
# Center each decade's values by subtracting its group mean, so that the
# resampled data are generated under the null hypothesis of no decade effect
grpAA = Crime$Total[Decade=="1960s"] - meanstar[1]
grpBB = Crime$Total[Decade=="1970s"] - meanstar[2]
grpCC = Crime$Total[Decade=="1980s"] - meanstar[3]
grpDD = Crime$Total[Decade=="1990s"] - meanstar[4]
grpEE = Crime$Total[Decade=="2000s"] - meanstar[5]
simDecade = Crime$Decade
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
  # Resample the centered values with replacement within each decade
  groupAA = sample(grpAA, size=10, replace=T)
  groupBB = sample(grpBB, size=10, replace=T)
  groupCC = sample(grpCC, size=10, replace=T)
  groupDD = sample(grpDD, size=10, replace=T)
  groupEE = sample(grpEE, size=10, replace=T)
  simTotal2 = c(groupAA,groupBB,groupCC,groupDD,groupEE)
  simdata2 = data.frame(simTotal2,simDecade)
  # Record the F statistic for this bootstrap sample
  Fstar[i] = oneway.test(simTotal2~simDecade, var.equal=T, data=simdata2)$statistic
}
# Plot the empirical distribution of F under the null hypothesis, obtained from the bootstrap simulation
hist(Fstar,ylim=c(0,0.75), prob=T,main="Histogram of Empirical F Distribution Using Bootstrap Simulation")
x=seq(0,5,.25)
points(x,y=df(x,4,45),type="b",col="red")
# F statistic of the final simulated (Monte Carlo) data set, evaluated against the bootstrapped distribution
print(realFstar<-oneway.test(simTotal~simDecade, var.equal=T, data=simdata)$statistic)
## F
## 1.531
# Proportion of bootstrap F values at least this large: the simulated p-value for that data set
mean(Fstar>=realFstar)
## [1] 0.224
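Note that realFstar above is computed from simdata, the last data set generated by the Monte Carlo simulation, not from the observed crime data. As a sketch of the complementary check, not part of the original code, the F statistic of the actual data could be compared against the same bootstrap distribution:
# F statistic of the observed crime data (matches the F value in the original ANOVA table)
Fdata <- oneway.test(Total ~ Decade, var.equal = TRUE)$statistic
# Proportion of bootstrap F values at least as large: an empirical p-value for the real data
mean(Fstar >= Fdata)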
Below I perform a new ANOVA on the simulated data set produced by the bootstrap resampling.
model2=aov(simTotal2~simDecade,data=simdata2)
anova(model2)
## Analysis of Variance Table
##
## Response: simTotal2
## Df Sum Sq Mean Sq F value Pr(>F)
## simDecade 4 8.10e+11 2.02e+11 0.19 0.94
## Residuals 45 4.88e+13 1.08e+12
As seen above, the p-value in this new ANOVA is > 0.05, indicating that the differences in the resampled crime totals across decades are consistent with random variation alone. For this bootstrapped data set we therefore fail to reject the null hypothesis. This is the expected outcome, because the group means were subtracted out before resampling, so the bootstrap data were generated under the null hypothesis of no decade effect.
To check the adequacy of using ANOVA as a means of analyzing this data set, I produced Quantile-Quantile (Q-Q) plots of the residual errors to determine whether the residuals follow a normal distribution.
The nearly linear fit of the residuals in the Q-Q plot is an indication that the ANOVA model may be adequate for this analysis. A perfectly linear fit in the Q-Q plot would mean that the model fully satisfies the normality assumption.
The second plot is a residuals vs. fitted values plot, which is used to check for patterns in the residuals and to identify any outlying values. Because the residuals appear centered around zero for this model, the model used in this analysis appears adequate for determining the effect of decade on the total crime rate in the United States.
# Q-Q plot of the residuals from the crime rate ANOVA model
qqnorm(residuals(model))
qqline(residuals(model))
# Residuals vs. fitted values
plot(fitted(model),residuals(model),xlab="Fitted values",ylab="Residuals")
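As a formal complement to the Q-Q plot, a normality test could also be applied to the residuals; this Shapiro-Wilk check is a suggested addition rather than part of the original analysis.
# Shapiro-Wilk test of normality on the ANOVA residuals
shapiro.test(residuals(model))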
The data for this experimental analysis was taken from the website http://www.disastercenter.com/crime/uscrime.htm.