By Srihari Mohan and Jack Sather

Analysis of Crime Rates by Region and Race

Introduction

With the recent news out of Ferguson, Missouri, in which a black teenager was shot and killed by a white police officer, questions about race have been brought to the forefront of our community. In fact, as journalist John Blake says in a CNN article following the Grand Jury's decision in Ferguson on November 27th, "leaders [are] calling once again for a 'national conversation on race'" (Blake). These conversations have led to discussions on the relationship between races in a community (and in its police force) and the extent to which this relationship correlates with crime rates. An article from the online journal livescience states, "based on assumptions that immigrants are more likely to commit crimes and settle in poor, disorganized communities, prevailing wisdom holds that the concentration of immigrants and an influx of foreigners drive up crime rates" (livescience). With racial tensions in Ferguson and throughout the nation heightened by the shooting, we wanted to apply our understanding of R to a real problem: exploring the relationship between the racial composition of U.S. communities and that of their police forces, and seeing how crime rates within those communities were affected by this relationship.

Our research project analyzes data from the University of California, Irvine's Machine Learning Repository, specifically the 'Communities and Crime Unnormalized' dataset (taken from the 1990 Census). Our initial hypothesis was that as the 'racial match' between a community's police force and its population decreased, the magnitude of crimes in that community would increase. We also predicted that the magnitude of crimes in a community would be highest when the percentage of minorities, like blacks and Hispanics, was highest. Although we won't try to generalize why one minority might be involved in crimes of greater magnitude than another (outside of differences in education and socioeconomic stratification), we feel that our hypotheses are sound: it makes sense to assume that as the 'racial match' between a police force and its community decreases, racial tensions become more prevalent and crimes become more likely. Likewise, it makes sense to assume that greater minority populations might lead to higher crime rates because of increased tensions between the minorities and the rest of the population (whites). We decided to analyze this data by separating communities by region of the United States. For instance, we grouped all the data on communities in the South together, as we did with communities in the Northeast, West, and Midwest. In doing so, we wanted to see the extent to which these racial tensions and 'police-community racial matches' affected crime rates across different parts of the country. We felt that the magnitude of crimes resulting from a low 'racial match' between a community and its police force would be higher in the South (like Ferguson) than in the North and the West because of longer-standing racial conflicts there.

Custom CrimeScore Algorithm

For our experiment, we needed a way to quantify the magnitude of crimes in a community. The UCI dataset gave us data on the numbers of different types of crimes (like murder, rape, burglary, etc.) committed annually within each community. We decided to create a custom CrimeScore algorithm that assigns each type of crime a score on our CrimeScale. For example, murder is scored as the highest crime and is assigned a key of 100 on the CrimeScale. Likewise, rape is scored at a key of 90. The CrimeScore is calculated by multiplying the number of each type of crime in a community by its key, summing these products, and then dividing by the population of the community. The CrimeScore needs to account for the population of each community because otherwise the algorithm would be biased toward communities with large populations (and thus more crimes) rather than reflecting their more directly comparable crime rates.
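The calculation described above can be sketched on a small example. This is only an illustration under stated assumptions: the two communities and their counts are made up, and the weights are a subset of the keys used in the script later in this report (only murder = 100 and rape = 90 are given in the text).

```r
# Hypothetical weights keyed by crime type; murder = 100 and rape = 90 come
# from the text, the remaining values mirror the script later in this report.
crime_weights <- c(murders = 100, rapes = 90, assaults = 70,
                   robberies = 60, burglaries = 40)

# Toy data: two made-up communities with identical crime counts but
# different populations.
toy <- data.frame(
  community  = c("A", "B"),
  population = c(10000, 100000),
  murders    = c(1, 1),
  rapes      = c(2, 2),
  assaults   = c(10, 10),
  robberies  = c(5, 5),
  burglaries = c(20, 20)
)

# Weighted sum of the counts, divided by population.
weighted_total <- as.matrix(toy[, names(crime_weights)]) %*% crime_weights
toy$crimeScore <- as.vector(weighted_total) / toy$population
print(toy[, c("community", "population", "crimeScore")])
```

Both toy communities have the same weighted total, so the tenfold difference in population produces a tenfold difference in CrimeScore, which is the reason for dividing by population.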

Getting and Cleaning Data

The UCI dataset was filled with missing data, denoted by '?' marks. In order to clean the data, we first replaced all of the '?' marks with NAs. Since the explanatory variable of our analysis was RacialMatchCommPol (racial match between community and police), we removed every row in the data set in which the RacialMatchCommPol value was NA. Next, since the dataset was unnormalized and did not have any column headers, we picked the columns that we felt were needed for the CrimeScore algorithm and our analysis (by mapping each to a key in the data description from UCI). We then gave names to these columns in a new data frame. To filter by region, we consulted the U.S. Census Bureau's regional divisions and created four vectors, one per region, each storing that region's states. We also needed to create bins to separate communities by the percentage of each race in their population. In other words, we segmented the black race percent of population using the cut() function to create the bins and facet_grid() to plot each bin in its own panel, so that communities with a black race percent of 0%-10% would fall in one bin while those with a black race percent of 20% or higher would fall in another. We created similar bins for each of the other racial communities: whites, Hispanics, and Asians.
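The cleaning and binning steps can be sketched on a toy data frame. The values below are made up for illustration, and the short bin labels are hypothetical. One detail worth noting as an assumption about intent: cut() uses right-closed intervals by default, so a value of exactly 0 would fall outside the first bin unless include.lowest = TRUE is set, as it is here.

```r
# Toy input: "?" marks a missing value, as in the UCI file.
raw <- data.frame(
  racepctblack       = c("0.00", "4.5", "12.0", "35.2", "18.1"),
  RacialMatchCommPol = c("89.3", "?", "72.1", "55.0", "?"),
  stringsAsFactors   = FALSE
)

# Replace "?" with NA, then convert the columns to numeric.
raw[raw == "?"] <- NA
raw[] <- lapply(raw, as.numeric)

# Keep only rows where the explanatory variable is present.
cleaned <- raw[!is.na(raw$RacialMatchCommPol), ]

# Bin the percentage. Without include.lowest = TRUE, the 0.00 value
# would be assigned NA instead of the first bin.
cleaned$bin <- cut(cleaned$racepctblack,
                   breaks = c(0, 10, 15, 20, 100),
                   labels = c("0-10", "10-15", "15-20", "20-100"),
                   include.lowest = TRUE)
print(cleaned)
```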

R code for Cleaning Data and Calculating CrimeScore

library(ggplot2)

library(grid)
library(gridExtra)

#reading in the data
#setInternet2() is Windows-only and deprecated or unavailable in later R releases, so it is commented out
#setInternet2(use = TRUE)
#stringsAsFactors is set explicitly because default.stringsAsFactors() is deprecated in newer versions of R
#raw copy: "?" placeholders are kept as-is
crimedataraw <-read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00211/CommViolPredUnnormalizedData.txt",
               header = FALSE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "",
               stringsAsFactors = FALSE)
#cleaned copy: "?" placeholders are read as NA
crimedatacleaned <-read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00211/CommViolPredUnnormalizedData.txt",
               header = FALSE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "",
               na.strings="?",strip.white=TRUE,stringsAsFactors = FALSE)

#x-axis RacialMatchCommPol - remove all NA rows
names(crimedatacleaned)[112] <- "RacialMatchCommPol"
#logical indexing is safe even when no rows are NA (-which() would then drop every row)
crimedatacleaned <- crimedatacleaned[!is.na(crimedatacleaned$RacialMatchCommPol),]

# Field 2 = state code
names(crimedatacleaned)[2] <- "statecode"

#Filter by region: input regional code to produce graphic for that region
userInput <-"West"

#Field 130 = murders, field 132 = rapes, field 134 = robberies, field 136 = assaults, field 138 = burglaries,
#field 142 = auto thefts, field 146 = violent crimes, field 6 = population, fields 8-11 = race percentages
names(crimedatacleaned)[130] <- "murders"
names(crimedatacleaned)[132] <- "rapes"
names(crimedatacleaned)[134] <- "robberies"
names(crimedatacleaned)[136] <- "assaults"
names(crimedatacleaned)[138] <- "burglaries"
names(crimedatacleaned)[142] <- "autoTheft"
names(crimedatacleaned)[146] <- "violentcrimes"
names(crimedatacleaned)[6] <- "population"
names(crimedatacleaned)[8] <- "racepctblack"
names(crimedatacleaned)[9] <- "racepctWhite"
names(crimedatacleaned)[10] <- "racepctAsian"
names(crimedatacleaned)[11] <- "racepctHisp"

# Region codes
crimedatacleaned$region <- NA
names(crimedatacleaned)[148] <- "region"
west <- c("AZ","CO","ID","NM","MT","UT","NV","WY","AK","CA","HI","OR","WA")
south <- c("DE","DC","FL","GA","MD","NC","SC","VA","WV","AL","KY","MS","TN","AR","LA","OK","TX")
midwest <- c("IN","IL","MI","OH","WI","IA","NE","KS","ND","MN","SD","MO")
northeast <- c("CT","ME","MA","NH","RI","VT","NJ","NY","PA")

crimedatacleaned[which(crimedatacleaned$statecode %in% west),148] <- "West"
crimedatacleaned[which(crimedatacleaned$statecode %in% south),148] <- "South"
crimedatacleaned[which(crimedatacleaned$statecode %in% midwest),148] <- "MidWest"
crimedatacleaned[which(crimedatacleaned$statecode %in% northeast),148] <- "NorthEast"

#CrimeScore algorithm
#weights: murders 100, rapes 90, violent crimes 80, assaults 70, robberies 60, auto thefts 50, burglaries 40
crimedatacleaned$crimeScore <- (((crimedatacleaned$murders * 100) + (crimedatacleaned$rapes * 90)+ (crimedatacleaned$assaults * 70)+
                                    (crimedatacleaned$robberies *60)+ (crimedatacleaned$autoTheft *50)+ (crimedatacleaned$violentcrimes *80)+
                                    (crimedatacleaned$burglaries * 40))/(crimedatacleaned$population))

#Creating bins for percent population of each race
crimedatacleaned$racepctblackbin<-cut(crimedatacleaned$racepctblack,breaks=c(0,10,15,20,100),
                                      labels=c("0-10 \n% of pop","10-15 \n% of pop","15-20 \n% of pop", "20-100\n% of pop"))
crimedatacleaned$racepctWhitebin<-cut(crimedatacleaned$racepctWhite,breaks=c(0,40,60,80,90,100),
                                      labels=c("0-40 \n% of pop","40-60 \n% of pop","60-80 \n% of pop", "80-90\n% of pop","90-100\n% of pop"))
crimedatacleaned$racepctHispbin<-cut(crimedatacleaned$racepctHisp,breaks=c(0,10,15,20,100),
                                     labels=c("0-10 \n% of pop","10-15 \n% of pop","15-20 \n% of pop", "20-100\n% of pop"))
crimedatacleaned$racepctAsianbin<-cut(crimedatacleaned$racepctAsian,breaks=c(0,2,5,10,100),
                                      labels=c("0-2 \n% of pop","2-5 \n% of pop","5-10 \n% of pop", "10-100\n% of pop"))

#subsetted dataframe for just region (userInput)
crimedatasubset <- crimedatacleaned[which(crimedatacleaned[148]==userInput),]

#create graphics plots
firstPart <- ggplot(crimedatasubset,aes(RacialMatchCommPol,crimeScore))+geom_point(na.rm=TRUE)

finalPart <-  theme(axis.text.x=element_text(colour="slateblue4",size=12,face="bold"))+
  theme(axis.text.y=element_text(colour="slateblue4",size=12,face="bold"))+
  theme(axis.title.x=element_text(colour="slateblue4",size=16,face="bold"))+
  theme(axis.title.y=element_text(colour="slateblue4",size=16,face="bold"))+
  theme(plot.title=element_text(colour="slateblue4", face="bold", size=20))+
  theme(axis.text.x = element_text(angle=90,vjust=0.5, hjust=1,face="bold"))+
  theme(axis.ticks = element_line(colour = "slateblue4"))+
  theme(strip.text = element_text(size=12,face="bold"))+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

p1<- firstPart + facet_grid(racepctblackbin ~ .)+geom_jitter(na.rm=TRUE)+
  xlab("Racial match between the community \n and the police force")+ylab("CrimeScore")+
  ggtitle("Black community")+theme_bw()+xlim(0, 100)+finalPart

p2<-firstPart+facet_grid(racepctWhitebin ~ .)+geom_jitter(na.rm=TRUE)+
  xlab("Racial match between the community \n and the police force")+ylab("CrimeScore")+
  ggtitle("White community")+theme_bw() +xlim(0, 100)+finalPart

p3 <- firstPart+facet_grid(racepctHispbin ~ .)+geom_jitter(na.rm=TRUE)+
  xlab("Racial match between the community \n and the police force")+ylab("CrimeScore")+
  ggtitle("Hispanic community")+theme_bw() +xlim(0, 100)+finalPart

p4 <- firstPart +facet_grid(racepctAsianbin ~ .)+geom_jitter(na.rm=TRUE)+
  xlab("Racial match between the community\n and the police force")+ylab("CrimeScore")+
  ggtitle("Asian community")+theme_bw()+xlim(0, 100)+finalPart

CrimeScore and Racial Match Plots

#gridExtra 2.0 and later name the title argument top (earlier versions used main)
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2, top = textGrob(paste("Crime Rate in", userInput), vjust = 1, gp = gpar(fontface = "bold", cex = 1.5, col="slateblue4")))

[Plots not reproduced: four figures (chunks unnamed-chunk-2, -4, -6, and -8), each a 2x2 grid titled "Crime Rate in <region>" with panels for the Black, White, Hispanic, and Asian communities, plotting CrimeScore against the racial match between the community and the police force, faceted by race percent bin. Before the last plot, R repeated the warnings "no non-missing arguments to min; returning Inf" and "no non-missing arguments to max; returning -Inf".]

Analysis

The results point to the general trend that CrimeScores are in fact higher in communities in which there is dissonance between the racial composition of the community and that of the police force. As the graphics show, the highest CrimeScores of 7.5 to 11 tended to occur at 'racial match' values of only 50% to 60%. However, there certainly were exceptions, such as a Midwest community with a CrimeScore of nearly 8 whose police force had a 'racial match' of nearly 100% with its community. Still, the data points to the general trend, and supports our original hypothesis, that the better a police force's racial composition matches that of its community, the fewer crimes tend to be committed.
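The trend above was read off the plots. One way to check it numerically, which is not part of the original analysis, would be a per-region correlation between racial match and CrimeScore. The helper name cor_by_region is hypothetical and the values are made up; the sketch only assumes a data frame with region, RacialMatchCommPol, and crimeScore columns like the one built in the script.

```r
# Pearson correlation between racial match and CrimeScore within each region.
# Assumes columns named region, RacialMatchCommPol and crimeScore.
cor_by_region <- function(df) {
  sapply(split(df, df$region), function(d) {
    cor(d$RacialMatchCommPol, d$crimeScore, use = "complete.obs")
  })
}

# Made-up values, for illustration only (not taken from the dataset).
toy <- data.frame(
  region             = rep(c("South", "West"), each = 4),
  RacialMatchCommPol = c(50, 60, 80, 95, 55, 70, 85, 90),
  crimeScore         = c(9, 8, 4, 2, 3, 5, 4, 6)
)
print(round(cor_by_region(toy), 2))
```

A negative coefficient for a region would be consistent with the hypothesis that a better racial match goes with lower CrimeScores; a correlation alone would not show that one causes the other.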

Our second hypothesis concerned whether or not CrimeScores would be highest when the race percent of minorities in a community was highest. Surprisingly, there was variation among the racial communities with respect to which race percents led to the highest CrimeScores. Blacks ended up having the highest average CrimeScores in their highest race percent bin (20% or greater) in every region. Asians, on the other hand, tended to have their highest average CrimeScores in their lowest race percent bin in every region but the West, suggesting that the more Asians in a community, the less they tended to commit crimes. In the Northeast, South, and Midwest, Hispanics tended to have lower average CrimeScores as the race percent of Hispanics in the community increased, suggesting that in these regions, the more Hispanics there are in a community, the less they tend to commit crimes. However, in the West, Hispanics had their highest CrimeScores in their largest race percent bin of 20% or higher. Lastly, whites had their lowest average CrimeScores in communities in which whites are the clear majority, 80% to 90% or higher. However, it is really interesting to see how this trend deviates in the white population in the South, where they had significantly higher CrimeScores when they were only 40% to 60% of a community's population.
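The report does not show how the average CrimeScore per bin was computed. One possible approach, sketched here on made-up values, uses aggregate() with a formula; the column names follow the script, but the data are purely illustrative.

```r
# Made-up CrimeScores for communities in three race percent bins.
toy <- data.frame(
  racepctblackbin = factor(c("0-10", "0-10", "20-100", "20-100", "10-15")),
  crimeScore      = c(1.0, 2.0, 6.0, 8.0, NA)
)

# Mean CrimeScore per bin. The formula interface drops rows with a
# missing crimeScore, so a bin with only NA values does not appear.
bin_means <- aggregate(crimeScore ~ racepctblackbin, data = toy, FUN = mean)
print(bin_means)
```

Running the same call once per region subset would give the per-region, per-bin averages that the comparison above relies on.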

This is so interesting because it ties in so well with what happened in Ferguson. Today, blacks account for approximately 50% to 60% of Ferguson's population, and most of the remaining 40% to 50% are white. The CrimeScore data from our research suggests that in the South, whites tended to commit crimes in both greater magnitude and frequency when they were barely the majority or a slight minority in the community. Since this trend is consistent with what happened in Ferguson, where roughly 40% to 50% of the community is white, our results raise the question of whether increased white hostility in Southern communities without predominantly white populations, like Ferguson, could lead to tragedies like that of Michael Brown more so than in any other place or racial composition in the country.

Discussion

One problem that we had was that once we cleaned the dataset, we did not have enough data points for our initial hypothesis. This was because in the 'racial match' column, which was our explanatory variable, many of the values were NA. We decided to remove any rows with a 'racial match' value equal to NA, but in doing so, we eliminated many rows of the data set from our new data frame and had only about three data points per facet. Thus, we needed to expand our hypothesis. Instead of only looking at Missouri, we decided to break the country up into four regional groups, the Northeast, South, West, and Midwest, and then compare CrimeScores to the Racial Match within these new groups. By creating these groups, we were able to combine all of the community data within each region (for example, all of the communities in the South), adding many more points to each faceted plot and making our analysis more nuanced. By expanding our hypothesis, we were able to gather the data necessary to analyze it.
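A quick tabulation makes this kind of sparsity visible before plotting. The following is a minimal sketch on made-up rows, assuming the same column names as the script; it counts how many usable observations remain in each region and bin once rows with a missing 'racial match' are dropped.

```r
# Made-up rows, for illustration only.
toy <- data.frame(
  region             = c("South", "South", "West", "MidWest", "West", "South"),
  racepctblackbin    = c("0-10", "20-100", "0-10", "0-10", "0-10", "20-100"),
  RacialMatchCommPol = c(80, NA, 65, 90, NA, 55)
)

# Drop rows without the explanatory variable, then count per region and bin.
usable <- toy[!is.na(toy$RacialMatchCommPol), ]
counts <- table(usable$region, usable$racepctblackbin)
print(counts)
```

Cells with very small counts correspond to facets that would hold too few points to support a conclusion.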

One thing that we would change would be to have more recent data. The population in 1990 is vastly different from the current population, so we cannot be certain that this data is truly representative of the modern population. Also, it would have been interesting to be able to analyze communities individually (i.e., just look at the Ferguson community and spot data trends there). Unfortunately, as stated above, we didn't have enough data to make this analysis and had to pivot to looking at regional data as a whole.

One idea we had was to add more to our CrimeScore algorithm. Right now, it's simple in that all it does is multiply the number of each type of crime by an assigned numeric key, sum the results, and then divide by the population of the community. It would be more interesting to make CrimeScore a more representative measure of the magnitude of crime by factoring other variables into the algorithm, like socioeconomic differences within the population, education levels, age, etc.

Another direction that would be interesting to research in the future is how these graphs change over time. The dataset we chose was taken from the 1990 Census. It would be interesting to see how these graphs changed in the 2000 and 2010 Censuses. There has been a massive amount of immigration to the United States since the 1990s, and it would be fascinating to track how the change in population dynamics affects the crimes committed.