Recipe 6: Analysis of Covariance on Crime Dataset

Design of Experiments

Trevor Manzanares

Rensselaer Polytechnic Institute

10/30/14

Dataset under analysis includes data on United States Crime Rates 1960 - 2012. This experiment will seek to understand which factors have an effect on total crime rates for these years. Since all the variables are already factors, some of them will need to be typecasted as numbers for use in the ANCOVA model.

remove(list=ls())
crime = read.csv("C:/Users/Trevor/Documents/crime.csv", header=TRUE) 

#view first few lines
head(crime)

##   Year   Population      Total  Violent   Property Murder ForcibleRape
## 1 1960 179,323,175  3,384,200  288,460  3,095,700   9,000      17,190 
## 2 1961 182,992,000  3,488,000  289,390  3,198,600   8,000      17,220 
## 3 1962 185,771,000  3,752,200  301,510  3,450,700   8,000      17,550 
## 4 1963  188,483,000 4,109,500  316,970  3,792,500   8,000       17,650
## 5 1964 191,141,000  4,564,600  364,220  4,200,400   9,000      21,420 
## 6 1965 193,526,000  4,739,400  387,390  4,352,000   9,000      23,410 
##   Robbery AggravatedAssault   Burglary LarcenyTheft VehicleTheft
## 1 110,000          154,320    912,100    1,855,400      328,200 
## 2 110,000          156,760    949,600    1,913,000      336,000 
## 3 110,000          164,570    994,300    2,089,600      366,800 
## 4 120,000          174,210  1,086,400    2,297,800      408,300 
## 5 130,000          203,050  1,213,200    2,514,400      472,800 
## 6 140,000          215,330  1,282,500    2,572,600      496,900

#observe the structure of the data
str(crime)

## 'data.frame':    53 obs. of  12 variables:
##  $ Year             : num  1960 1961 1962 1963 1964 ...
##  $ Population       : Factor w/ 53 levels "179,323,175 ",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Total            : Factor w/ 53 levels "10,189,902","10,253,400 ",..: 40 41 42 43 44 45 46 47 48 49 ...
##  $ Violent          : Factor w/ 53 levels "1,004,210 ","1,029,580 ",..: 39 40 41 42 43 44 45 46 47 48 ...
##  $ Property         : Factor w/ 53 levels "10,123,400 ",..: 31 32 33 34 35 36 37 38 39 40 ...
##  $ Murder           : Factor w/ 17 levels " 16,000  ","11,000",..: 17 16 16 16 17 17 2 3 4 5 ...
##  $ ForcibleRape     : Factor w/ 53 levels "102,220 ","102,560 ",..: 6 7 8 9 10 11 12 13 14 15 ...
##  $ Robbery          : Factor w/ 34 levels "110,000","120,000",..: 1 1 1 2 3 4 5 6 7 8 ...
##  $ AggravatedAssault: Factor w/ 53 levels "1,023,201 ","1,037,050 ",..: 9 10 11 12 13 14 15 16 17 18 ...
##  $ Burglary         : Factor w/ 53 levels "1,086,400 ","1,213,200 ",..: 51 52 53 1 2 3 4 5 6 7 ...
##  $ LarcenyTheft     : Factor w/ 53 levels "1,855,400 ","1,913,000 ",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ VehicleTheft     : Factor w/ 53 levels "1,004,100 ","1,007,900 ",..: 32 33 34 35 36 37 38 39 43 45 ...

#53 observations of 12 variables
attach(crime)
#typecast variables for use in ANCOVA
crime$Total=as.integer(crime$Total)
crime$Population=as.integer(crime$Population)
crime$Violent=as.integer(crime$Violent)
crime$Robbery=as.factor(crime$Robbery)

#reobserve structure of data, apparent trends
str(crime)

## 'data.frame':    53 obs. of  12 variables:
##  $ Year             : num  1960 1961 1962 1963 1964 ...
##  $ Population       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Total            : int  40 41 42 43 44 45 46 47 48 49 ...
##  $ Violent          : int  39 40 41 42 43 44 45 46 47 48 ...
##  $ Property         : Factor w/ 53 levels "10,123,400 ",..: 31 32 33 34 35 36 37 38 39 40 ...
##  $ Murder           : Factor w/ 17 levels " 16,000  ","11,000",..: 17 16 16 16 17 17 2 3 4 5 ...
##  $ ForcibleRape     : Factor w/ 53 levels "102,220 ","102,560 ",..: 6 7 8 9 10 11 12 13 14 15 ...
##  $ Robbery          : Factor w/ 34 levels "110,000","120,000",..: 1 1 1 2 3 4 5 6 7 8 ...
##  $ AggravatedAssault: Factor w/ 53 levels "1,023,201 ","1,037,050 ",..: 9 10 11 12 13 14 15 16 17 18 ...
##  $ Burglary         : Factor w/ 53 levels "1,086,400 ","1,213,200 ",..: 51 52 53 1 2 3 4 5 6 7 ...
##  $ LarcenyTheft     : Factor w/ 53 levels "1,855,400 ","1,913,000 ",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ VehicleTheft     : Factor w/ 53 levels "1,004,100 ","1,007,900 ",..: 32 33 34 35 36 37 38 39 43 45 ...

plot(crime)

plot of chunk unnamed-chunk-1 The data are tabluated into 12 columns, with detailed information on population, total crime, violent crime, property crime, murder, forcible rape, robbery, aggravated assault, burglary, larceny-theft, and vehicle theft by year.The project will test the effects of the factor “Robbery” (34 levels), and the continuous variables “Population” and “# of Violent Crimes” on the response, “# of Total Crimes”. Summary statistics are as follows:

summary(crime)

##       Year        Population     Total       Violent          Property 
##  Min.   :1960   Min.   : 1   Min.   : 1   Min.   : 1   10,123,400 : 1  
##  1st Qu.:1973   1st Qu.:14   1st Qu.:14   1st Qu.:14   10,174,754 : 1  
##  Median :1986   Median :27   Median :27   Median :27   10,182,586 : 1  
##  Mean   :1986   Mean   :27   Mean   :27   Mean   :27   10,208,334 : 1  
##  3rd Qu.:1999   3rd Qu.:40   3rd Qu.:40   3rd Qu.:40   10,252,700 : 1  
##  Max.   :2012   Max.   :53   Max.   :53   Max.   :53   10,319,386 : 1  
##                                                        (Other)    :47  
##      Murder     ForcibleRape    Robbery    AggravatedAssault
##  16,000 : 8   102,220 : 1    410,000: 5   1,023,201 : 1     
##  18,000 : 5   102,560 : 1    110,000: 3   1,037,050 : 1     
##  19,000 : 5   106,010 : 1    350,000: 3   1,054,860 : 1     
##  20,000 : 5   106,590 : 1    420,000: 3   1,092,740 : 1     
##  14,000 : 4   109,060 : 1    450,000: 3   1,099,210 : 1     
##  21,000 : 4   17,190  : 1    540,000: 3   1,113,180 : 1     
##  (Other):22   (Other) :47    (Other):33   (Other)   :47     
##        Burglary      LarcenyTheft     VehicleTheft
##  1,086,400 : 1   1,855,400 : 1    1,004,100 : 1   
##  1,213,200 : 1   1,913,000 : 1    1,007,900 : 1   
##  1,282,500 : 1   2,089,600 : 1    1,009,600 : 1   
##  1,410,100 : 1   2,297,800 : 1    1,032,200 : 1   
##  1,632,100 : 1   2,514,400 : 1    1,062,400 : 1   
##  1,858,900 : 1   2,572,600 : 1    1,087,800 : 1   
##  (Other)   :47   (Other)   :47    (Other)   :47

Randomization is not applicable here due to the nature of the data. The data were collected by the Federal Bureau of Investigation from 1960 to 2012, containing the above listed variables. Each year has complete data.

The hypothesis is that population directly effects the number of total crimes committed, along with other factors such as # of violent crimes and robbery. Analysis of Covariance (ANCOVA) will be used to understand the effects of these variables on the total number of crimes for the years 1960-2012. The rationale for using ANCOVA is because there is a numerical outcome (Total number of crimes), an intuitive numeric predictor variable (Population), and two other variables (# Violent Crimes, Robbery). The former is numeric while the latter is a factor.

Given the lack of background information on the data, it is hard to say if there are replicates or repeated measures. If replication is defined as the same measurements of categories of crimes committed over 52 years, then there is replication. If there are repeat offenders and the same individual is incarcerated multiple times, that could count as a repeated measure.

Blocking will be utilized here to hold the factor “Robbery” constant since it does not seem to directly affect the response variable directly. Therefore, variation in the response will be controlled for more accurately and the overall model precision will increase.

Graphs showing trends in Total # of Crimes committed, violent crimes, and population with associated Robbery levels:

par(mfrow=c(1,1))
plot(Total~Violent,pch=as.character(Robbery),xlab="# Violent Crimes",ylab="Total Number of Crimes Committed")

plot of chunk unnamed-chunk-3

plot(Total~Population,pch=as.character(Robbery),xlab="Population",ylab="Total Number of Crimes Committed")

plot of chunk unnamed-chunk-3

Create ANCOVA model:

as.factor(crime$Robbery)

##  [1] 110,000 110,000 110,000 120,000 130,000 140,000 160,000 200,000
##  [9] 260,000 300,000 350,000 390,000 380,000 380,000 440,000 470,000
## [17] 430,000 410,000 430,000 480,000 560,000 590,000 550,000 510,000
## [25] 490,000 500,000 540,000 520,000 540,000 580,000 640,000 690,000
## [33] 670,000 660,000 620,000 580,000 540,000 500,000 450,000 410,000
## [41] 410,000 420,000 420,000 410,000 400,000 420,000 450,000 450,000
## [49] 440,000 410,000 370,000 350,000 350,000
## 34 Levels: 110,000 120,000 130,000 140,000 160,000 200,000 ... 690,000

model=lm(Total~Population*Violent+Robbery, data=crime)
anova(model)

## Analysis of Variance Table
## 
## Response: Total
##                    Df Sum Sq Mean Sq F value  Pr(>F)    
## Population          1   5670    5670  140.94 2.4e-09 ***
## Violent             1   3547    3547   88.17 6.5e-08 ***
## Robbery            33   2410      73    1.82    0.10    
## Population:Violent  1    131     131    3.25    0.09 .  
## Residuals          16    644      40                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this particular model, we reject the H0 that population and number of violent crimes have no effect on total number of crimes committed from 1960-2012. Thus, variation in the total number of crimes committed can be attributed to variation in the population and number of violent crimes. However, we fail to reject the H0 that there is a significant interaction effect between population and number of violent crimes on the total number of crimes committed. While the output for Robbery appears to be significant, we ignore it because it is our blocking variable.

Estimate Parameters:

# Shapiro-Wilk test of normality.  Adequate if p < 0.
shapiro.test(crime$Total)

## 
##  Shapiro-Wilk normality test
## 
## data:  crime$Total
## W = 0.9555, p-value = 0.04639

# We reject the H0 that the response variable is normally distributed.
qqnorm(crime$Total)
qqline(crime$Total)

plot of chunk unnamed-chunk-5

# Thus, we attempt to transform the data
shapiro.test(crime$Total^2) #still not normally distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  crime$Total^2
## W = 0.8979, p-value = 0.0002736

shapiro.test(sqrt(crime$Total)) #still not normally distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  sqrt(crime$Total)
## W = 0.948, p-value = 0.02221

shapiro.test(log(crime$Total)) #still not normallly distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  log(crime$Total)
## W = 0.862, p-value = 2.058e-05

shapiro.test(log10(crime$Total)) #still not normallly distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  log10(crime$Total)
## W = 0.862, p-value = 2.058e-05

shapiro.test(1/crime$Total) #still not normallly distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  1/crime$Total
## W = 0.4372, p-value = 5.857e-13

shapiro.test(1/sqrt(crime$Total)) #still not normallly distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  1/sqrt(crime$Total)
## W = 0.6621, p-value = 8.877e-10

# It appears that the data cannot be readily transformed into a normal distribution.

shapiro.test(crime$Population)

## 
##  Shapiro-Wilk normality test
## 
## data:  crime$Population
## W = 0.9555, p-value = 0.04639

# We reject the H0 that the predictor variable "population" is normally distributed
# Thus, we attempt to transform the data
shapiro.test(crime$Population^2) #still not normally distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  crime$Population^2
## W = 0.8979, p-value = 0.0002736

shapiro.test(sqrt(crime$Population)) #still not normally distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  sqrt(crime$Population)
## W = 0.948, p-value = 0.02221

shapiro.test(log(crime$Population)) #still not normallly distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  log(crime$Population)
## W = 0.862, p-value = 2.058e-05

shapiro.test(log10(crime$Population)) #still not normallly distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  log10(crime$Population)
## W = 0.862, p-value = 2.058e-05

shapiro.test(1/crime$Population) #still not normallly distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  1/crime$Population
## W = 0.4372, p-value = 5.857e-13

shapiro.test(1/sqrt(crime$Population)) #still not normallly distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  1/sqrt(crime$Population)
## W = 0.6621, p-value = 8.877e-10

# It appears for this sampling subset the data cannot be readily transformed into a normal distribution.

shapiro.test(crime$Violent)

## 
##  Shapiro-Wilk normality test
## 
## data:  crime$Violent
## W = 0.9555, p-value = 0.04639

# We reject the H0 that the predictor variable "Violent" is normally distributed
# Thus, we attempt to transform the data
shapiro.test(crime$Violent^2) #still not normally distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  crime$Violent^2
## W = 0.8979, p-value = 0.0002736

shapiro.test(sqrt(crime$Violent)) #still not normally distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  sqrt(crime$Violent)
## W = 0.948, p-value = 0.02221

shapiro.test(log(crime$Violent)) #still not normallly distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  log(crime$Violent)
## W = 0.862, p-value = 2.058e-05

shapiro.test(log10(crime$Violent)) #still not normallly distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  log10(crime$Violent)
## W = 0.862, p-value = 2.058e-05

shapiro.test(1/crime$Violent) #still not normallly distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  1/crime$Violent
## W = 0.4372, p-value = 5.857e-13

shapiro.test(1/sqrt(crime$Violent)) #still not normallly distributed

## 
##  Shapiro-Wilk normality test
## 
## data:  1/sqrt(crime$Violent)
## W = 0.6621, p-value = 8.877e-10

# It appears that the data cannot be readily transformed into a normal distribution.

One of the primary assumptions of the analysis of covariance is that the data are normally distributed for the vaiables of interest, and since they are (and cannot be readily transformed into such), the results from the analysis of variance can potentially be invalid.

Diagnostics and Model Adequacy Checking: Describe

qqnorm(residuals(model))
qqline(residuals(model))

plot of chunk unnamed-chunk-6

plot(fitted(model),residuals(model))

plot of chunk unnamed-chunk-6

#The model appears to have a decent fit to the data at the lower end since the points are evenly dispersed across the Cartesian plane of resiual error and the fitted model.However, toward the higher end the data are essentially a perfect fit. This means that the model does not do its job as well at this end.

References:

Source: FBI, Uniform Crime Reports

“http://www.disastercenter.com/crime/uscrime.htm”