The following experimental analysis examines a data set containing the crime rates in the United States from the 1960s through the 2000s. Below are the first 6 lines of this data set, followed by output from str() showing the structure of the data set.
Crime<-read.csv("C:\\Users\\Anthony\\Desktop\\School\\RPI Year 1\\DoE\\Crime5.csv",header=TRUE)
# Typecast variables for the analysis (stored as standalone vectors used throughout)
Decade<-as.factor(Crime$Decade)
Total<-as.integer(Crime$Total)
head(Crime)
## Decade Total
## 1 1960s 3384200
## 2 1960s 3488000
## 3 1960s 3752200
## 4 1960s 4109500
## 5 1960s 4564600
## 6 1960s 4739400
str(Crime)
## 'data.frame': 50 obs. of 2 variables:
## $ Decade: Factor w/ 5 levels "1960s","1970s",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Total : int 3384200 3488000 3752200 4109500 4564600 4739400 5223500 5903400 6720200 7410900 ...
In this experiment I will perform an analysis of variance (ANOVA) to help determine the effect of the factor “Decade” on the total crime rate in the United States.
The categorical variable “Decade” has five levels; each level is a decade from the 1960s to the 2000s inclusive.
The response variable is the total crime rate in the United States, and it is a continuous variable.
The original data source records decade, total population, and total crime rate, as well as the rates of specific types of crime such as violent crime, property crime, murder, and rape; this analysis uses only the decade and total crime columns.
How will the experiment be organized and conducted to test the hypothesis?
In this experimental data analysis I will perform an analysis of variance (ANOVA) to determine the effect of changes in decade on the response variable of total crime.
After this ANOVA is complete I will use resampling techniques to determine the actual distribution of the data set and perform another ANOVA. This second ANOVA will differ from the first in that the first ANOVA assumes the data follow a normal distribution, while the resampled distribution does not rely on that assumption.
What is the rationale for this design?
I have chosen this experimental design to demonstrate proper usage of resampling techniques to perform statistical inference via the analysis of variance more accurately.
Below is a boxplot representing a clear trend in the total crime rates in the United States from the 1960s to 2000s.
boxplot(Total~Decade,data=Crime,main="United States Crime",xlab="Decade",ylab="Total Crime Rate")
This boxplot shows that the crime rate in the United States increased each decade from the 1960s until the 1990s, and then began to decrease.
Below I perform an analysis of variance (ANOVA) to determine the effect of decade on the response variable of total crime rate.
model<-aov(Total~Decade,data=Crime)
anova(model)
## Analysis of Variance Table
##
## Response: Total
## Df Sum Sq Mean Sq F value Pr(>F)
## Decade 4 4.89e+14 1.22e+14 102 <2e-16 ***
## Residuals 45 5.37e+13 1.19e+12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis for this statistical analysis states that there is no change in total crime rate across decades that can be attributed to anything other than randomization. In the results of the ANOVA above we see a very low p-value (< 2e-16), which allows us to reject this null hypothesis. We therefore accept the alternative hypothesis, which states that there is a change in total crime rate attributable to something other than randomization; in this case, the change in decade. The mean square of the residuals in this ANOVA is 1.19e12.
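As a small sketch beyond the original output, the same quantities can be pulled from the ANOVA table programmatically instead of being read off by eye (the name tab is an illustrative choice):
# Extract the Decade p-value and the residual mean square from the fitted model
tab <- anova(model)
tab["Decade", "Pr(>F)"]      # p-value for the Decade factor
tab["Residuals", "Mean Sq"]  # residual mean square, used later in the simulation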
To expand on the results of the ANOVA, I perform Tukey’s Honest Significant Difference (HSD) test to determine which levels of the decade factor result in total crime rates that are statistically different from each other.
TukeyHSD(model,which="Decade",conf.level=0.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Total ~ Decade, data = Crime)
##
## $Decade
## diff lwr upr p adj
## 1970s-1960s 5169570 3781166 6557974 0.0000
## 1980s-1960s 8182747 6794343 9571151 0.0000
## 1990s-1960s 8728625 7340220 10117029 0.0000
## 2000s-1960s 6571614 5183210 7960019 0.0000
## 1980s-1970s 3013177 1624773 4401581 0.0000
## 1990s-1970s 3559055 2170650 4947459 0.0000
## 2000s-1970s 1402044 13640 2790449 0.0467
## 1990s-1980s 545878 -842527 1934282 0.7967
## 2000s-1980s -1611133 -2999537 -222728 0.0156
## 2000s-1990s -2157010 -3545414 -768606 0.0006
Above, Tukey’s HSD test compares the mean total crime rates between each pair of decades. For individual comparisons with an adjusted p-value < 0.05, there is a statistical difference between the total crime rates of those two factor levels that can be attributed to something other than randomization, at the 95% family-wise confidence level.
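As a supplementary sketch (not part of the original code), the statistically different pairs can be extracted directly from the TukeyHSD result, whose $Decade component is a matrix with a "p adj" column:
# List the pairs of decades whose adjusted p-values fall below 0.05
tuk <- TukeyHSD(model, which="Decade")$Decade
rownames(tuk)[tuk[, "p adj"] < 0.05]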
Here I perform a Monte Carlo simulation, a simulation method based on a known theoretical distribution. It uses repeated random sampling to approximate the actual distribution of the F statistic for this data set under the null hypothesis.
Below I create arrays summarizing each level of the factor of interest (means, variances, and counts).
#creates an array of the means of the levels, stored for later use
meanstar=tapply(Total,Decade,mean)
meanstar
## 1960s 1970s 1980s 1990s 2000s
## 4929590 10099160 13112337 13658215 11501204
#creates an array of the variances of the levels
tapply(Total,Decade,var)
## 1960s 1970s 1980s 1990s 2000s
## 1.899e+12 2.363e+12 5.901e+11 9.863e+11 1.300e+11
#creates an array of the lengths of the levels
tapply(Total,Decade,length)
## 1960s 1970s 1980s 1990s 2000s
## 10 10 10 10 10
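The level variances above span roughly an order of magnitude, so as an optional check not included in the original analysis, the equal-variance assumption used later (var.equal=T) could be examined with Bartlett's test:
# Bartlett's test of homogeneity of variances across decades
bartlett.test(Total ~ Decade)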
Below is a summary of the ANOVA analysis, along with calculated means and parameters required for the Monte Carlo simulation.
# Summary of ANOVA results
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Decade 4 4.89e+14 1.22e+14 102 <2e-16 ***
## Residuals 45 5.37e+13 1.19e+12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Mean total crime rate in the data set
MeanTotal=mean(Total)
# Reuse the decade labels from the original data for the simulated data sets
simDecade=Crime$Decade
# Square root of the mean square of the ANOVA residuals (the pooled error SD),
# extracted from the fitted model rather than hard-coded
srMS=sqrt(anova(model)["Residuals","Mean Sq"])
# Assign the number of random simulations to be run in the simulation (R)
R=1000
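Because the simulation depends on random draws, the resulting histogram will differ slightly from run to run. For reproducibility one could fix the random seed first; the seed value here is an arbitrary choice, not part of the original analysis.
# Fix the random number generator seed so the simulation is reproducible
set.seed(42)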
Below is the code used to perform the Monte Carlo simulation.
fstar=numeric(R)
for (i in 1:R) {
  # Draw five groups of 10 from a single normal distribution, i.e. under the null hypothesis
  groupA = rnorm(10, mean=MeanTotal, sd=srMS)
  groupB = rnorm(10, mean=MeanTotal, sd=srMS)
  groupC = rnorm(10, mean=MeanTotal, sd=srMS)
  groupD = rnorm(10, mean=MeanTotal, sd=srMS)
  groupE = rnorm(10, mean=MeanTotal, sd=srMS)
  simTotal=c(groupA,groupB,groupC,groupD,groupE)
  simdata=data.frame(simDecade,simTotal)
  # Record the F statistic for this simulated data set
  fstar[i]=oneway.test(simTotal~simDecade,var.equal=T,data=simdata)$statistic
}
# Plot the empirical distribution of F under the null hypothesis, obtained from the Monte Carlo simulation
hist(fstar,ylim=c(0,0.75), prob=T, main="Histogram of Empirical F Distribution Using Monte Carlo Simulation")
# x values from 0 to 5.5 in 0.25 increments for overlaying the analytic F distribution
x=seq(0,5.5,.25)
# Overlay the analytic F distribution as red dots (degrees of freedom known)
points(x,y=df(x,4,45),type="b",col="red")
# Degrees of freedom: 4 = 5 levels - 1; 45 = 50 observations - 5 group means
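One natural use of this simulated null distribution, sketched here as an addition to the original analysis, is to locate the observed F statistic within it:
# Observed F statistic for the real data, taken from the fitted ANOVA model
Fobs <- anova(model)["Decade", "F value"]
# Proportion of simulated null F values at least as extreme: a Monte Carlo p-value
mean(fstar >= Fobs)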
Below is the code used to perform a Bootstrap simulation.
meanstar = tapply(Total,Decade,mean)
# Center each decade's values by subtracting its group mean, so that the
# resampled data are generated under the null hypothesis of no decade effect
grpAA = Crime$Total[Decade=="1960s"] - meanstar[1]
grpBB = Crime$Total[Decade=="1970s"] - meanstar[2]
grpCC = Crime$Total[Decade=="1980s"] - meanstar[3]
grpDD = Crime$Total[Decade=="1990s"] - meanstar[4]
grpEE = Crime$Total[Decade=="2000s"] - meanstar[5]
simDecade = Crime$Decade
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
  # Resample the centered values with replacement within each decade
  groupAA = sample(grpAA, size=10, replace=T)
  groupBB = sample(grpBB, size=10, replace=T)
  groupCC = sample(grpCC, size=10, replace=T)
  groupDD = sample(grpDD, size=10, replace=T)
  groupEE = sample(grpEE, size=10, replace=T)
  simTotal2 = c(groupAA,groupBB,groupCC,groupDD,groupEE)
  simdata2 = data.frame(simTotal2,simDecade)
  # Record the F statistic for this bootstrap sample
  Fstar[i] = oneway.test(simTotal2~simDecade, var.equal=T, data=simdata2)$statistic
}
# Plot the empirical distribution of F under the null hypothesis, obtained from the bootstrap simulation
hist(Fstar,ylim=c(0,0.75), prob=T,main="Histogram of Empirical F Distribution Using Bootstrap Simulation")
x=seq(0,5,.25)
points(x,y=df(x,4,45),type="b",col="red")
# F statistic of the final simulated (Monte Carlo) data set, evaluated against the bootstrapped distribution
print(realFstar<-oneway.test(simTotal~simDecade, var.equal=T, data=simdata)$statistic)
## F
## 1.531
# Proportion of bootstrap F values at least this large: the simulated p-value for that data set
mean(Fstar>=realFstar)
## [1] 0.224
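Note that realFstar above is computed from simdata, the last data set generated by the Monte Carlo simulation, not from the observed crime data. As a sketch of the complementary check, not part of the original code, the F statistic of the actual data could be compared against the same bootstrap distribution:
# F statistic of the observed crime data (matches the F value in the original ANOVA table)
Fdata <- oneway.test(Total ~ Decade, var.equal = TRUE)$statistic
# Proportion of bootstrap F values at least as large: an empirical p-value for the real data
mean(Fstar >= Fdata)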
Below I perform a new ANOVA on the simulated data set produced by the bootstrap resampling.
model2=aov(simTotal2~simDecade,data=simdata2)
anova(model2)
## Analysis of Variance Table
##
## Response: simTotal2
## Df Sum Sq Mean Sq F value Pr(>F)
## simDecade 4 8.10e+11 2.02e+11 0.19 0.94
## Residuals 45 4.88e+13 1.08e+12
As seen above, the p-value in this new ANOVA is > 0.05, indicating that the differences in the resampled crime totals across decades are consistent with random variation alone. For this bootstrapped data set we therefore fail to reject the null hypothesis. This is the expected outcome, because the group means were subtracted out before resampling, so the bootstrap data were generated under the null hypothesis of no decade effect.
To check the adequacy of using ANOVA as a means of analyzing this data set, I produced Quantile-Quantile (Q-Q) plots of the residual errors to determine whether the residuals follow a normal distribution.
The nearly linear fit of the residuals in the Q-Q plot is an indication that the ANOVA model may be adequate for this analysis. A perfectly linear fit in the Q-Q plot would mean that the model fully satisfies the normality assumption.
The second plot is a residuals vs. fitted values plot, which is used to check for patterns in the residuals and to identify any outlying values. Because the residuals appear centered around zero for this model, the model used in this analysis appears adequate for determining the effect of decade on the total crime rate in the United States.
# Q-Q plot of the residuals from the crime rate ANOVA model
qqnorm(residuals(model))
qqline(residuals(model))
# Residuals vs. fitted values
plot(fitted(model),residuals(model),xlab="Fitted values",ylab="Residuals")
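As a formal complement to the Q-Q plot, a normality test could also be applied to the residuals; this Shapiro-Wilk check is a suggested addition rather than part of the original analysis.
# Shapiro-Wilk test of normality on the ANOVA residuals
shapiro.test(residuals(model))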
The data for this experimental analysis was taken from the website http://www.disastercenter.com/crime/uscrime.htm.