1 Introduction

Every major violent crime that makes the news brings new notions and beliefs that alter public opinion about the safety of neighborhoods and the socio-economic factors that shape the normal functioning of society. For example, with the increase in the number of race-related crimes in our society, it has become imperative to understand whether the racial profile of a region has anything to do with its crime rate. Other concerns besides race include socio-economic status, religious beliefs, and the percentage of the population composed of immigrants.

Understanding where violent crimes occur in terms of the socio-economic, environmental, and demographic characteristics of the reporting regions can be crucial to deciphering why they occur. These characteristics may also help predict in advance where violent crimes are likely to occur, through predictive models that quantify the risk associated with a region. This would greatly help city planning by allowing police forces to be deployed effectively where danger is imminent. Determining the key drivers behind the rise in violent crimes would provide invaluable input for urban development and the design of policing practices.

The ‘Communities and Crime Unnormalized Data Set’ made available by the University of California, Irvine’s Machine Learning Repository provides an excellent opportunity to test some of the preconceived notions prevalent in our society today with regard to race and crime. It can also be used to build predictive models that support city planning and crime reduction. Although the sources in this dataset date back to 1990 and 1995, the directional insights we get from it should still hold true.

A lot of prior research has been conducted using this dataset. In ‘Using machine learning algorithms to analyze crime data’, Lawrence McClendon and Natarajan Meghanathan compare the violent crime patterns in this dataset with actual crime statistics for the state of Mississippi provided by neighborhoodscout.com, to determine how effective and accurate machine learning algorithms used in data mining can be at predicting violent crime patterns. Syahid Anuar, Ali Selamat, and Roselina Sallehuddin predicted crime categories by applying a hybrid classification model, built by combining an Artificial Neural Network (ANN) with the Artificial Bee Colony (ABC) algorithm, in ‘Hybrid Artificial Neural Network with Artificial Bee Colony Algorithm for Crime Classification’. Another paper by the same authors, ‘Particle Swarm Optimization Feature Selection for Violent Crime Classification’, proposed a crime classification model combining an Artificial Neural Network (ANN) with Particle Swarm Optimization (PSO) and used this dataset to evaluate its performance.

With regard to research on crime more generally, P J Brantingham and P L Brantingham focus on ghetto crime and how ethnic diversity impacts crime rates in certain areas. Their paper, ‘Environmental Criminology’, also examines the environmental and behavioral factors that criminals use to select a crime location. By narrowing the focus down to burglary, it assesses the social-ecological factors that influence crime rates in the US. Chapter 13, ‘Toward a Theory of Race, Crime, and Urban Inequality’, from the book ‘Race, Crime, and Justice’, focuses on the controversial discussion of the impact of race on crime rates in the US. It acknowledges the belief, and prior research, that blacks are not only more involved in violent crimes but are also more often their victims. It further explores whether the causes of black crime are unique or whether there are other explanations for it. The author concludes that cultural relevance exists and cannot be denied, but that there is more depth to the explanation of these crimes than race alone.

With this motivation, our primary focus is to analyze the number of violent crimes per 100,000 population in a community to determine the socio-economic factors, besides race, that impact crime. The first part of the project focuses on exploratory data analysis, in which we explore the regions of the United States with the highest levels of violent and non-violent crime. The second part deals with building regression models to determine the key factors that influence violent crime rates in US communities.

2 Data and Methods

2.1 Data

The Communities and Crime Unnormalized Data Set available at the UCI Machine Learning Repository is one of the most detailed datasets available on the socio-economic profile of the communities in which crimes were committed in the United States, including information on race, age, employment, marital status, immigration, and home ownership. Combining socio-economic data from the 1990 Census, law enforcement data from the 1990 Law Enforcement Management and Administrative Statistics (LEMAS) survey, and crime data from the 1995 FBI Uniform Crime Reports, the dataset contains socio-economic and crime-related information on 2,215 communities across the United States. A dataset this large, with 147 variables, required thorough cleaning to arrive at the optimal set of variables for the analysis. A full description of all the variables can be found on the UCI repository page.

The dataset contains no duplicate records, but it contains many missing values, represented by “?”. These were first replaced by NAs: all values of the predictive variables were converted to character and then to numeric, which coerces the “?” entries to NA.

The first 5 non-predictive variables represent the name of the community, the state in which it is located, the county code, the community code, and the fold number for non-random 10-fold cross-validation. A summary of the remaining variables is below:

library("ggplot2")
library("datasets")
library("dplyr")
library("missForest")
library("moments")
library("data.table")
library("Hmisc")
library("rms")
library("corrgram")
library("hydroGOF")
library("e1071")
library("rpart")
library("rminer")
library("randomForest")
library("fBasics")

#Reading the dataset
crime <- read.csv("crimedata.csv")

#Observing the dataset
glimpse(crime)
str(crime)
summary(crime)

#Renaming the first variable
names(crime)[1] <- "communityname"   #the first column name is read in with a garbled prefix, so rename it by position
names(crime)
dim(crime)
#Dataset has 2215 observations and 147 variables

#Removing duplicate records
crime <- unique(crime)

#Transforming the class of all variables starting population to numeric
crime[,6:ncol(crime)] <- sapply(crime[,6:ncol(crime)], as.character)
crime[,6:ncol(crime)] <- sapply(crime[,6:ncol(crime)], as.numeric)
str(crime)
glimpse(crime)
summary(crime)
basicStats(crime[,6:ncol(crime)])[c("NAs", "Minimum", "Maximum", "1. Quartile", "3. Quartile", "Mean", "Stdev", "Median"),]

2.2 Primary Data Preparation

To reduce the number of variables in the dataset, variables with a high proportion of NAs (greater than 50%) were removed, as these were deemed practically useless for prediction or data exploration. Most of these variables came from the 1990 Law Enforcement Management and Administrative Statistics (LEMAS) survey. Removing them reduces the number of variables to 125. We also observed that householdsize and PersPerOccupHous represent the same information, mean people per household; PersPerOccupHous was removed, which reduced the number of variables to 124.

#Check the proportion of non-missing values for each variable
na_table <- as.data.frame(!is.na(crime))
colMeans(na_table)

#Remove variables with more than 50% NAs:

#LemasSwornFT
#LemasSwFTPerPop     
#LemasSwFTFieldOps  
#LemasSwFTFieldPerPop         
#LemasTotalReq
#LemasTotReqPerPop      
#PolicReqPerOffic           
#PolicPerPop    
#RacialMatchCommPol
#PctPolicWhite         
#PctPolicBlack          
#PctPolicHisp         
#PctPolicAsian
#PctPolicMinor   
#OfficAssgnDrugUnits     
#NumKindsDrugsSeiz      
#PolicAveOTWorked
#PolicCars
#PolicOperBudg   
#LemasPctPolicOnPatr   
#LemasGangUnitDeploy
#PolicBudgPerPop

#Dropping these variables
drop.cols <- c('LemasSwornFT','LemasSwFTPerPop','LemasSwFTFieldOps','LemasSwFTFieldPerPop',
               'LemasTotalReq','LemasTotReqPerPop','PolicReqPerOffic','PolicPerPop','RacialMatchCommPol',
               'PctPolicWhite','PctPolicBlack','PctPolicHisp','PctPolicAsian','PctPolicMinor',
               'OfficAssgnDrugUnits','NumKindsDrugsSeiz','PolicAveOTWorked','PolicCars',
               'PolicOperBudg','LemasPctPolicOnPatr','LemasGangUnitDeploy','PolicBudgPerPop')
crime <- crime %>% select(-all_of(drop.cols))
#Dataset now has 125 variables - 22 variables got removed

#householdsize and PersPerOccupHous represents the same info - remove one
crime <- crime %>% select(-PersPerOccupHous)

2.3 Visual Data Exploration

We begin the data exploration by plotting choropleth maps showing violent and non-violent crime rates in each US state. Choropleth maps provide an excellent way to visualize multiple aspects of regions; here we show the violent and non-violent crime rates by state, with an option to hover over a state to view its name and total population. The violent (non-violent) crime rate was defined as the ratio of the total number of violent (non-violent) crimes in a state to the state's total population.

2.3.1 Choropleth map of Violent Crime Rates in US States

The District of Columbia leads the United States in violent crime rate. Maryland has the second highest rate, probably driven by Baltimore, one of the most crime-prone cities in the country, followed by New York, Louisiana, Georgia, Florida, South Carolina, and Minnesota. Maryland and DC have far higher crime rates than the other states, which have roughly similar rates, while New York and Florida stand out for their very large populations.

library(dplyr)
library(plotly)

#Violent Crime Rates by States

crime_EDA_1 <- crime %>% select(state, population, ViolentCrimesPerPop)
crime_EDA_1 <- crime_EDA_1[complete.cases(crime_EDA_1),]
crime_EDA_1 <- crime_EDA_1 %>% mutate(total_violent_crimes = ((ViolentCrimesPerPop/100000) * population))
crime_EDA_1 <- crime_EDA_1 %>% group_by(state) %>% summarise(tot_pop = sum(population), tot_vio = sum(total_violent_crimes)) %>%
  mutate(crime_rate = tot_vio/tot_pop)

crime_EDA_1$hover <- with(crime_EDA_1, paste(state, '<br>', "Population", tot_pop))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('blue')
)

p <- plot_geo(crime_EDA_1, locationmode = 'USA-states') %>%
  add_trace(
    z = ~crime_rate, text = ~hover, locations = ~state,
    color = ~crime_rate, colors = 'Reds'
  ) %>%
  colorbar(title = "Crime Rate") %>%
  layout(
    title = 'Violent Crime Rates in the United States',
    geo = g
  )

p

2.3.2 Choropleth map of Non-Violent Crime Rates in US States

The District of Columbia again leads in non-violent crime rate, followed by Florida, Maryland, Georgia, Louisiana, and, surprisingly, Washington, all of which have roughly the same non-violent crime rate. It is interesting that New York ranks only 23rd for non-violent crime rate while ranking 3rd for violent crime rate; non-violent offences such as larcenies and burglaries appear to be relatively less common there.

crime_EDA_2 <- crime %>% select(state, population, nonViolPerPop)
crime_EDA_2 <- crime_EDA_2[complete.cases(crime_EDA_2),]
crime_EDA_2 <- crime_EDA_2 %>% mutate(total_nonviolent_crimes = ((nonViolPerPop/100000) * population))
crime_EDA_2 <- crime_EDA_2 %>% group_by(state) %>% summarise(tot_pop = sum(population), tot_nonvio = sum(total_nonviolent_crimes)) %>%
  mutate(crime_rate = tot_nonvio/tot_pop)

crime_EDA_2$hover <- with(crime_EDA_2, paste(state, '<br>', "Population", tot_pop))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('blue')
)

p <- plot_geo(crime_EDA_2, locationmode = 'USA-states') %>%
  add_trace(
    z = ~crime_rate, text = ~hover, locations = ~state,
    color = ~crime_rate, colors = 'Oranges'
  ) %>%
  colorbar(title = "Crime Rate") %>%
  layout(
    title = 'Non-Violent Crime Rates in the United States',
    geo = g
  )

p
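The state rankings quoted in this section (for example, New York placing 3rd for violent crime but 23rd for non-violent crime) can be checked directly from the two aggregated data frames built above. A minimal sketch, assuming crime_EDA_1 and crime_EDA_2 are still in memory and that the state column holds the two-letter abbreviations used by the choropleths:

#Sketch: compare a state's rank on violent vs non-violent crime rate
violent_rank <- crime_EDA_1 %>%
  arrange(desc(crime_rate)) %>%
  mutate(violent_rank = row_number()) %>%
  select(state, violent_rank)

nonviolent_rank <- crime_EDA_2 %>%
  arrange(desc(crime_rate)) %>%
  mutate(nonviolent_rank = row_number()) %>%
  select(state, nonviolent_rank)

#Example: look up New York (assumes the abbreviation "NY")
inner_join(violent_rank, nonviolent_rank, by = "state") %>%
  filter(state == "NY")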

3 Choropleth map of percentage of black population in the US

More than 65% of the population of DC belongs to the African-American community. Other states with a high percentage of African-Americans are Maryland, Georgia, Louisiana, and Mississippi; these are also some of the states with the highest violent and non-violent crime rates, which helps explain why public perception so readily links crime and race. We investigate this relationship further below.

#Percentage of African-American population by state

crime_EDA_3 <- crime %>% select(state, population, racepctblack)
crime_EDA_3 <- crime_EDA_3[complete.cases(crime_EDA_3),]
crime_EDA_3 <- crime_EDA_3 %>% mutate(total_black_pop = (racepctblack * population))
crime_EDA_3 <- crime_EDA_3 %>% group_by(state) %>% summarise(tot_pop = sum(population), tot_state_black_pop = sum(total_black_pop)) %>%
  mutate(blk_rate = tot_state_black_pop/tot_pop)

crime_EDA_3$hover <- with(crime_EDA_3, paste(state, '<br>', "Population", tot_pop))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('black')
)

p <- plot_geo(crime_EDA_3, locationmode = 'USA-states') %>%
  add_trace(
    z = ~blk_rate, text = ~hover, locations = ~state,
    color = ~blk_rate, colors = 'Blues'
  ) %>%
  colorbar(title = "% Black") %>%
  layout(
    title = 'Percentage of population that is African-American',
    geo = g
  )

p

3.0.1 Word Cloud of most dangerous communities

We conclude our initial visual data exploration by looking at the communities with the highest violent crime rates. Interestingly, Baltimore ranks only 15th among the most dangerous communities in the US by violent crime rate. The communities with the highest violent crime rates are Chester City in Pennsylvania, Atlanta, Newark, Alexandria in Louisiana, Miami, and Atlantic City. Chester City has an extremely high violent crime rate, while Atlanta, Miami, and Newark have very large populations in comparison to the others.

#Word Cloud of Communities with the highest crime rates

library("wordcloud")
library("RColorBrewer")

crime_EDA_3 <- crime %>% select(communityname,state, population, ViolentCrimesPerPop)
crime_EDA_3 <- crime_EDA_3 %>% mutate(Place = paste(communityname,state, sep = ", "))
crime_EDA_3 <- crime_EDA_3[complete.cases(crime_EDA_3),]
crime_EDA_3 <- crime_EDA_3 %>% mutate(total_violent_crimes = ((ViolentCrimesPerPop/100000) * population))
crime_EDA_3 <- crime_EDA_3 %>% group_by(Place) %>% summarise(tot_pop = sum(population), tot_vio = sum(total_violent_crimes)) %>%
  mutate(crime_rate = (tot_vio/tot_pop)*100000)
crime_EDA_3 <- arrange(crime_EDA_3, desc(crime_rate)) 
crime_EDA_3 <- head(crime_EDA_3, 15)

par(mfrow = c(1,1))
par(mar = rep(0,4))
set.seed(1)
wordcloud(words = crime_EDA_3$Place, freq = crime_EDA_3$crime_rate, scale=c(3,0.01),
          random.order=FALSE,
          colors=brewer.pal(8, "Dark2"))

3.1 Data Preparation and Exploration

3.1.1 Cleaning for model building

To build a model for predicting violent crimes per 100K population, we drop the location-related non-predictive variables, leaving 119 variables. Next, we remove all observations with missing values for the response variable, ViolentCrimesPerPop, which reduces the number of observations from 2,215 to 2,118. We also drop the other crime outcome variables, namely murders, rapes, robberies, assaults, burglaries, larcenies, auto thefts, arsons, and non-violent crimes, which reduces the number of variables to 102.

#Let's remove State, County, Community, Community name, and fold from the dataset
crime_model <- crime %>% select(-communityname, -state, -countyCode, -communityCode, -fold)

#Let's remove observations that contain NAs for the response variable, ViolentCrimesPerPop
crime_ViolentModel <- crime_model[complete.cases(crime_model[,119]),]

#Let's also remove the other crime outcome variables
drop.OtherPredictors <- c('murders','murdPerPop','rapes','rapesPerPop','robberies','robbbPerPop',
                          'assaults','assaultPerPop','burglaries','burglPerPop','larcenies','larcPerPop',
                          'autoTheft','autoTheftPerPop','arsons','arsonsPerPop','nonViolPerPop')
crime_ViolentModel <- crime_ViolentModel %>% select(-all_of(drop.OtherPredictors))

3.1.2 Missing Value Treatment

For missing value treatment, we used the missForest package, a non-parametric imputation method based on random forests that handles mixed variable types. It builds a random forest model for each variable and uses it to predict that variable's missing values from the observed values, and it reports an out-of-bag (OOB) imputation error estimate. Our run yielded an OOB NRMSE of about 1.8%, which is small, so we proceeded with this imputation.

#Missing value treatment - using missForest package
crime_MissingValueTreatment <- missForest(crime_ViolentModel)

#Check imputation error
crime_MissingValueTreatment$OOBerror
#The NRMSE is the normalized root mean squared error - 1.81% is pretty small - we can go ahead with this
crime_Violent_Imp <- as.data.frame(crime_MissingValueTreatment$ximp)

3.1.3 Tests

3.1.3.1 Test for normality

Before building a predictive model, it is useful to test the variables for normality. For this we chose the Shapiro-Wilk test, whose null hypothesis is that the observations in a variable come from a normal distribution; a p-value of less than 0.05 therefore leads us to reject normality, i.e. the variable fails the test. All the variables left in the dataset, except PctWorkMomYoungKids, failed the normality test.

#Shapiro-Wilk Test for normality
lshap <- lapply(crime_Violent_Imp, shapiro.test)
sshap <- sapply(lshap, `[`, c("statistic","p.value"))
Shapiro_Wilk <- as.data.frame(t(sshap))
Shapiro_Wilk <- Shapiro_Wilk %>% select(Shapiro_Wilk_pValue = p.value)
Shapiro_Wilk <- setDT(Shapiro_Wilk, keep.rownames = TRUE)[]
par(mfrow=c(2,2))
hist(crime_Violent_Imp$PctWorkMomYoungKids, main = "Histogram")
plot(density(crime_Violent_Imp$PctWorkMomYoungKids), main = "Density Plot")
boxplot(crime_Violent_Imp$PctWorkMomYoungKids, main = "Boxplot")
qqnorm(crime_Violent_Imp$PctWorkMomYoungKids, main = "Normality Plot")
qqline(crime_Violent_Imp$PctWorkMomYoungKids)

3.1.3.2 Test for skewness

Next, the skewness of each remaining variable was calculated; skewness indicates the direction and degree of asymmetry in a variable's distribution.

#Calculating Skewness
lskew <- lapply(crime_Violent_Imp, skewness)
sskew <- sapply(lskew, `[`)
Skewness <- as.data.frame(sskew)
Skewness <- Skewness %>% select(Skewness = sskew)
Skewness <- setDT(Skewness, keep.rownames = TRUE)[]

PctHousOccup (occupancy rate of houses) is the most left-skewed variable: houses in most communities are nearly fully occupied (close to 100%), with a minority of communities having low occupancy rates.

par(mfrow=c(2,2))
hist(crime_Violent_Imp$PctHousOccup, main = "Histogram")
plot(density(crime_Violent_Imp$PctHousOccup), main = "Density Plot")
boxplot(crime_Violent_Imp$PctHousOccup, main = "Box Plot")
qqnorm(crime_Violent_Imp$PctHousOccup, main = "Normality Plot")
qqline(crime_Violent_Imp$PctHousOccup)

NumStreet (number of homeless people counted on the streets) is the most right-skewed variable, so much so that its distribution is difficult to visualize. Most communities have zero or very few homeless people on the streets, while large and expensive cities such as New York and San Jose have very high counts.

par(mfrow=c(2,2))
hist(crime_Violent_Imp$NumStreet, main = "Histogram")
plot(density(crime_Violent_Imp$NumStreet), main = "Density Plot")
boxplot(crime_Violent_Imp$NumStreet, main = "Box Plot")
qqnorm(crime_Violent_Imp$NumStreet, main = "Normality Plot")
qqline(crime_Violent_Imp$NumStreet)

3.1.3.3 Test for kurtosis

Kurtosis describes how heavy the tails of a distribution are and how sharply peaked its center is. Platykurtic distributions are flatter with lighter tails, leptokurtic distributions are more sharply peaked with heavier tails, and mesokurtic distributions are closest to the normal distribution.

#Calculating Kurtosis
lkurt <- lapply(crime_Violent_Imp, kurtosis)
skurt <- sapply(lkurt, `[`)
Kurtosis <- as.data.frame(skurt)
Kurtosis <- Kurtosis %>% select(Kurtosis = skurt)
Kurtosis <- setDT(Kurtosis, keep.rownames = TRUE)[]

pctUrban (percentage of people living in urban areas) has the most platykurtic distribution; in fact it is bimodal, with two distinct peaks, which makes sense as communities tend to be either almost entirely urban or almost entirely non-urban.

par(mfrow=c(2,2))
hist(crime_Violent_Imp$pctUrban, main = "Histogram")
plot(density(crime_Violent_Imp$pctUrban), main = "Density Plot")
boxplot(crime_Violent_Imp$pctUrban, main = "Box Plot")
qqnorm(crime_Violent_Imp$pctUrban, main = "Normality Plot")
qqline(crime_Violent_Imp$pctUrban)

MedOwnCostPctIncNoMtg (median owner cost as a percentage of household income, for owners without a mortgage) has the most mesokurtic distribution, which is roughly what one would expect for such a ratio. It still does not pass the Shapiro-Wilk normality test, however.

par(mfrow=c(2,2))
hist(crime_Violent_Imp$MedOwnCostPctIncNoMtg, main = "Histogram")
plot(density(crime_Violent_Imp$MedOwnCostPctIncNoMtg), main = "Density Plot")
boxplot(crime_Violent_Imp$MedOwnCostPctIncNoMtg, main = "Box Plot")
qqnorm(crime_Violent_Imp$MedOwnCostPctIncNoMtg, main = "Normality Plot")
qqline(crime_Violent_Imp$MedOwnCostPctIncNoMtg)

NumStreet again has the most leptokurtic distribution. For interpretation, however, we show the distribution of ViolentCrimesPerPop, which is also leptokurtic.

par(mfrow=c(2,2))
hist(crime_Violent_Imp$ViolentCrimesPerPop, main = "Histogram")
plot(density(crime_Violent_Imp$ViolentCrimesPerPop), main = "Density Plot")
boxplot(crime_Violent_Imp$ViolentCrimesPerPop, main = "Box Plot")
qqnorm(crime_Violent_Imp$ViolentCrimesPerPop, main = "Normality Plot")
qqline(crime_Violent_Imp$ViolentCrimesPerPop)

3.1.3.4 Summary of Normality, Skewness and Kurtosis

The table below shows the p-value from the Shapiro-Wilk test, the skewness, and the kurtosis for each of the variables.

#Combining all tests into a table
tests_df <- merge(x = Shapiro_Wilk, y = Skewness, by = "rn", all = FALSE)
tests_df <- merge(x = tests_df, y = Kurtosis, by = "rn", all = FALSE)
tests_df <- tests_df %>% rename(Variable = rn)
tests_df

3.1.4 Variable Removal

We still have a very large number of variables in the dataset. To reduce it further, we measure the correlation of each independent variable with the dependent variable and remove any variable whose absolute correlation with the dependent variable is less than 0.25. This reduces the number of variables in the dataset to 49. We removed these variables because we felt they would contribute very little to the final model, and working with fewer variables also aids interpretation.

#Running a correlation matrix
res2 <- rcorr(as.matrix(crime_Violent_Imp))

flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor  =(cormat)[ut],
    p = pmat[ut]
  )
}

#Finding variables with low correlation to dependent variable
correlation <- flattenCorrMatrix(res2$r, res2$P)
correlation <- correlation %>% filter(column == "ViolentCrimesPerPop") %>% arrange(cor)
remove <- correlation %>% filter(cor >= -0.25 & cor <= 0.25) %>% select(row)

low.corr.dep <- c('RentMedian','MedRent','HispPerCap','RentHighQ','PctSpeakEnglOnly','blackPerCap','whitePerCap',
                  'OwnOccLowQuart','OwnOccMedVal','PctWorkMom','OwnOccHiQuart','pctWFarmSelf','PctSameHouse85','AsianPerCap','MedYrHousBuilt',
                  'RentQrange','OtherPerCap','PersPerOwnOccHous','PctBornSameState','OwnOccQrange','pctWRetire','PctEmplProfServ','PctEmplManu',
                  'indianPerCap','PctWorkMomYoungKids','householdsize','PctSameState85','racePctAsian','agePct12t21','PctVacMore6Mos',
                  'agePct16t24','agePct65up','MedOwnCostPctInc','pctUrban','LandArea','PctSameCity85','MedOwnCostPctIncNoMtg','pctWSocSec',
                  'agePct12t29','NumStreet','PctImmigRecent','PersPerFam','NumImmig','PctImmigRec5','PctUsePubTrans','PctForeignBorn',
                  'NumInShelters','PctRecentImmig','PctImmigRec8','population','numbUrban','PctRecImmig5','PctRecImmig8')

crime_violent_dep <- crime_Violent_Imp %>% select(-all_of(low.corr.dep))

3.1.5 Multi-collinearity

To address multi-collinearity, we identified pairs of independent variables with absolute correlations greater than 0.9. To decide which variable to remove from each pair, we regressed the dependent variable on the remaining independent variables, computed the variance inflation factor (VIF) of each variable using the vif function from the rms package, and then removed from each multi-collinear pair the variable with the higher VIF. This reduced the number of variables in the dataset to 39. The rationale is that the two variables in such a pair carry largely the same information, so only the one with the lower VIF is needed for modeling.

#Now let's check for multicollinearity
res2 <- rcorr(as.matrix(crime_violent_dep))
correlation <- flattenCorrMatrix(res2$r, res2$P)
correlation <- correlation %>% mutate(abs_cor = abs(cor)) %>% arrange(desc(abs_cor)) %>% filter(abs_cor > 0.9) %>% select(row, column)

#Variables with high correlation
# 1
# PctFam2Par
# PctKids2Par
# 2
# PctLargHouseFam
# PctLargHouseOccup
# 3
# FemalePctDiv
# TotalPctDiv
# 4
# NumUnderPov
# NumKidsBornNeverMar
# 5
# PctPersOwnOccup
# PctHousOwnOcc
# 6
# medIncome
# medFamInc
# 7
# MalePctDivorce
# TotalPctDiv
# 8 
# PctBSorMore
# PctOccupMgmtProf
# 9  
# PctFam2Par
# PctYoungKids2Par
# 10 
# medFamInc
# perCapInc
# 11 
# PctKids2Par
# PctYoungKids2Par
# 12 
# PctLess9thGrade
# PctNotHSGrad
# 13 
# MalePctDivorce
# FemalePctDiv
# 14 
# PctFam2Par
# PctTeen2Par
# 15 
# NumUnderPov
# HousVacant
# 16
# PctKids2Par
# PctTeen2Par

#Let's run a simple linear regression to see if any more variables can be removed based on VIF

LR_test <- lm(ViolentCrimesPerPop ~ ., crime_violent_dep)

vif(LR_test)

#Among multi-collinear variables remove the one with higher VIF

remove.multicollinear <- c('PctFam2Par','PctLargHouseOccup','TotalPctDiv','NumUnderPov','PctPersOwnOccup','medFamInc','PctBSorMore',
                           'PctKids2Par','PctNotHSGrad','FemalePctDiv')

crime_violent <- crime_violent_dep %>% select(-all_of(remove.multicollinear))

#We are left with 39 variables - we could perform transformations, but the goal is to find the most important variables rather than build a true prediction model
#Hence we won't split the data into test and training datasets
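The pair-wise elimination above was done by hand from the printed VIF values. Below is a minimal sketch of how the higher-VIF member of each highly correlated pair could be picked programmatically, assuming the correlation data frame (pairs with absolute correlation above 0.9) and the LR_test fit from this chunk are still in memory; the helper pick_higher_vif is ours, not part of any package:

#Sketch: automatically pick the higher-VIF member of each highly correlated pair
vif_vals <- vif(LR_test)
pick_higher_vif <- function(a, b) if (vif_vals[a] >= vif_vals[b]) a else b
to_remove <- unique(mapply(pick_higher_vif,
                           as.character(correlation$row),
                           as.character(correlation$column)))
to_remove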

3.1.6 Correlation plots

As a final step before modeling, we created correlation plots to assess in advance which independent variables will be most important in predicting violent crime rates.

corrgram(crime_violent[,c(1:13,39)], order=FALSE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="Correlation between each of the variables")

corrgram(crime_violent[,c(14:26,39)], order=FALSE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="Correlation between each of the variables")

corrgram(crime_violent[,27:39], order=FALSE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="Correlation between each of the variables")

Even at this early stage, using just the correlation plots, we can say that the following socio-economic factors in a community influence violent crime rates (a numerical version of this ranking is sketched after the list):

  • Percentage of the population that is African-American
  • Percentage of the population that is Caucasian
  • Percentage of households with investment/rent income
  • Percentage of households with public assistance income
  • Percentage of people under the poverty level
  • Percentage of people 16 and over, in the labor force, and unemployed
  • Percentage of males who are divorced
  • Percentage of kids, 4 and under, in two-parent households
  • Percentage of kids, age 12-17, in two-parent households
  • Percentage of kids born to never married
  • Percentage of vacant housing that is boarded up
  • Percentage of occupied housing units without a phone
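The list above was read off the correlation plots; the same ranking can be pulled out numerically. A minimal sketch, reusing the flattenCorrMatrix helper defined earlier on the final 39-variable data frame crime_violent:

#Sketch: variables with the strongest absolute correlation to the dependent variable
res_dep <- rcorr(as.matrix(crime_violent))
corr_dep <- flattenCorrMatrix(res_dep$r, res_dep$P) %>%
  filter(row == "ViolentCrimesPerPop" | column == "ViolentCrimesPerPop") %>%
  mutate(abs_cor = abs(cor)) %>%
  arrange(desc(abs_cor))
head(corr_dep, 12)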

4 Analysis

4.1 Introduction

Before building models, it is important to note that none of the variables used for modeling passed the normality test. However, since the aim here is to identify the most important socio-economic factors influencing crime rates in US communities rather than to build a model for actual prediction, we did not go through the tedious process of transforming each independent variable. For the same reason, the models are fit on the entire prepared dataset, accepting the over-fitting this entails, in order to extract as much information from it as possible.

4.2 Linear Regression

Linear regression is a statistical method that allows us to summarize and study relationships between continuous (quantitative) variables. It is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).

We regressed the dependent variable on all 38 remaining independent variables in the prepared dataset. The multiple R-squared of the model was 0.6694 and the adjusted R-squared was 0.6633. Comparing the fitted values against the actual values of the dependent variable, the mean squared error (MSE) was 118165.5.

Simple_Linear_Regression <- lm(ViolentCrimesPerPop ~ ., crime_violent)
summary(Simple_Linear_Regression)

mse(Simple_Linear_Regression$fitted.values, crime_violent$ViolentCrimesPerPop)

#Multiple R-squared:  0.6721,   Adjusted R-squared:  0.6661 
#MSE: 117990.1
par(mfrow=c(2,2))
plot(Simple_Linear_Regression)

The following variables were deemed highly significant in the determination of violent crime rates by the model (a sketch for extracting this list from the fitted coefficients follows the list):

  • Percentage of the population that is African-American
  • Median household income
  • Percentage of households with wage or salary income
  • Percentage of households with investment/rent income
  • Percentage of households with public assistance income
  • Percentage of people 25 and over with less than a 9th grade education
  • Percentage of people 16 and over, in the labor force, and unemployed
  • Percentage of people 16 and over who are employed in manufacturing
  • Percentage of males who are divorced
  • Percentage of kids born to never married
  • Percentage of population who have immigrated within the last 10 years
  • Percentage of persons in dense housing (more than 1 person per room)
  • Number of vacant households
  • Percentage of households owner occupied
  • Percentage of vacant housing that is boarded up
  • Rental housing in lower quartile rent
  • Percentage of officers assigned to drug units
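The "highly significant" label above comes from the coefficient table of the fitted model, so the variable list can be reproduced with a short filter on the p-values. A minimal sketch, assuming the Simple_Linear_Regression fit from the chunk above and a 1% significance cut-off (the cut-off is our assumption):

#Sketch: list coefficients significant at the 1% level (cut-off is an assumption)
coefs <- summary(Simple_Linear_Regression)$coefficients
signif_vars <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.01]
setdiff(signif_vars, "(Intercept)")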

4.3 Step-wise Linear Regression

Step-wise linear regression adds and removes variables at each step (here in both directions), assessing how each candidate addition or removal changes the overall fit of the model (the AIC, in R's step function).

Building a step-wise regression model from the 38 independent variables in the prepared dataset left only 21 variables in the model. The multiple R-squared of the model was 0.6673 and the adjusted R-squared was 0.6639. Comparing the fitted values against the actual values of the dependent variable, the mean squared error (MSE) was 118882.2.

Stepwise_LR <- step(lm(ViolentCrimesPerPop ~ ., crime_violent), direction = "both", trace = FALSE)
summary(Stepwise_LR)

mse(Stepwise_LR$fitted.values, crime_violent$ViolentCrimesPerPop)

#Multiple R-squared:  0.6701,   Adjusted R-squared:  0.6666
#MSE: 118707
par(mfrow=c(2,2))
plot(Stepwise_LR)

The following variables were deemed highly significant in the determination of violent crime rates by the step-wise linear regression model:

  • Percentage of the population that is African-American
  • Median household income
  • Percentage of households with wage or salary income
  • Percentage of households with investment/rent income
  • Percentage of households with public assistance income
  • Percentage of people 25 and over with less than a 9th grade education
  • Percentage of people 16 and over, in the labor force, and unemployed
  • Percentage of people 16 and over who are employed in manufacturing
  • Percentage of males who are divorced
  • Percentage of kids born to never married
  • Percentage of persons in dense housing (more than 1 person per room)
  • Number of vacant households
  • Percentage of households owner occupied
  • Percentage of vacant housing that is boarded up
  • Rental housing in lower quartile rent
  • Percentage of officers assigned to drug units

4.4 Ridge Regression

Ridge regression adds a tuning parameter, lambda, to the linear regression model in order to shrink the regression coefficients. Since the linear regression model we built already performs reasonably well, ridge regression may help by shrinking the less important coefficients towards zero.

Using the glmnet package, we tested lambda values ranging from 10^-2 to 10^3 on a logarithmic grid (exponent steps of 0.1) and used cross-validation to find the lambda that minimizes the mean squared error. Using this lambda, a ridge regression model was built. The multiple R-squared of the model was 0.6675859 and the mean squared error (MSE) was 118797.4.

library(glmnet)

y <- crime_violent$ViolentCrimesPerPop
x <- crime_violent %>% select(-ViolentCrimesPerPop) %>% data.matrix()
lambdas <- 10^seq(3, -2, by = -.1)

fit <- glmnet(x, y, alpha = 0, lambda = lambdas)
summary(fit)

cv_fit <- cv.glmnet(x, y, alpha = 0, lambda = lambdas)

opt_lambda <- cv_fit$lambda.min
opt_lambda

fit <- cv_fit$glmnet.fit
summary(fit)

y_predicted <- predict(fit, s = opt_lambda, newx = x)

# Sum of Squares Total and Error
sst <- sum((y - mean(y))^2)
sse <- sum((y_predicted - y)^2)

# R squared
rsq <- 1 - sse / sst
rsq

mse(as.vector(y_predicted), crime_violent$ViolentCrimesPerPop)

#Multiple R-squared:  0.6707256
#MSE: 118473
par(mfrow=c(1,1))
plot(cv_fit)

par(mfrow=c(2,2))
plot(fit)
plot(fit$a0)
plot(fit$beta)
plot(fit$lambda)

4.5 Lasso Regression

Similar to Ridge Regression, the Lasso shrinks the coefficient estimates towards zero. However, the regularization penalty in Lasso has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter lambda is sufficiently large. Hence, much like best subset selection, the lasso performs variable selection.

Using the glmnet package, we again tested lambda values ranging from 10^-2 to 10^3 on a logarithmic grid (exponent steps of 0.1) and used cross-validation to find the lambda that minimizes the mean squared error. Using this lambda, a lasso regression model was built. The multiple R-squared of the model was 0.6692694 and the mean squared error (MSE) was 118195.8.

y <- crime_violent$ViolentCrimesPerPop
x <- crime_violent %>% select(-ViolentCrimesPerPop) %>% data.matrix()
lambdas <- 10^seq(3, -2, by = -.1)

fit <- glmnet(x, y, alpha = 1, lambda = lambdas)
summary(fit)

cv_fit <- cv.glmnet(x, y, alpha = 1, lambda = lambdas)

opt_lambda <- cv_fit$lambda.min
opt_lambda

fit <- cv_fit$glmnet.fit
summary(fit)

y_predicted <- predict(fit, s = opt_lambda, newx = x)

# Sum of Squares Total and Error
sst <- sum((y - mean(y))^2)
sse <- sum((y_predicted - y)^2)

# R squared
rsq <- 1 - sse / sst
rsq

mse(as.vector(y_predicted), crime_violent$ViolentCrimesPerPop)

#Multiple R-squared:  0.6674572
#MSE: 119648.9
par(mfrow=c(1,1))
plot(cv_fit)

par(mfrow=c(2,2))
plot(fit)
plot(fit$a0)
plot(fit$beta)
plot(fit$lambda)
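Because the lasso sets some coefficients exactly to zero, it is worth inspecting which variables survive at the selected lambda. A minimal sketch, assuming the cv_fit and opt_lambda objects from the chunk above are still in memory:

#Sketch: variables with non-zero lasso coefficients at the selected lambda
lasso_coefs <- coef(cv_fit, s = opt_lambda)
nonzero_vars <- rownames(lasso_coefs)[as.vector(lasso_coefs != 0)]
setdiff(nonzero_vars, "(Intercept)")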

4.6 Support Vector Machine Regression

Support Vector Machines (SVMs) can accommodate non-linear relationships between the predictors and the response. SVMs can be used with different kernels; for this dataset a radial kernel was chosen to see whether a non-linear model could produce a better result.

First, we built an SVM model without hyper-parameter tuning; it achieved a multiple R-squared of 0.7528284 and a mean squared error (MSE) of 88333.65, already a clear improvement over the linear models. To test whether a tuned model could do better, we ran a cross-validated grid search over the epsilon and cost parameters. The plot of the cross-validated MSE for the different parameter values shows that the minimum is reached at a cost of 2, with gamma left at its default of 1/38 ≈ 0.02631579. With these values, the tuned model achieved a multiple R-squared of 0.8059841 and an MSE of 69336.99, the best result so far.

y <- crime_violent$ViolentCrimesPerPop

model <- svm(ViolentCrimesPerPop ~ . ,crime_violent)
summary(model)

y_predicted <- predict(model, crime_violent)
par(mfrow=c(1,1))
plot(crime_violent$ViolentCrimesPerPop, y_predicted)

# Sum of Squares Total and Error
sst <- sum((y - mean(y))^2)
sse <- sum((y_predicted - y)^2)

# R squared
rsq <- 1 - sse / sst
rsq

mse(as.vector(y_predicted), crime_violent$ViolentCrimesPerPop)

#Multiple R-squared:  0.7531458
#MSE: 88818.17

#Hyperparameter Optimization

# perform a grid search
tuneResult <- tune(svm, ViolentCrimesPerPop ~ .,  data = crime_violent,
                   ranges = list(epsilon = seq(0,1,0.25), cost = 2^(2:7))
)
print(tuneResult)
# Draw the tuning graph
plot(tuneResult)

tuneResult <- tune(svm, ViolentCrimesPerPop ~ .,  data = crime_violent,
                   ranges = list(epsilon = seq(0,0.6,0.2), cost = 2^(2:4))
) 

print(tuneResult)
plot(tuneResult)

tuneResult <- tune(svm, ViolentCrimesPerPop ~ .,  data = crime_violent,
                   ranges = list(epsilon = seq(0,0.2,0.04), cost = 2^(0:2))
) 

print(tuneResult)
plot(tuneResult)

tunedModel <- tuneResult$best.model
y_predicted <- predict(tunedModel, crime_violent) 

# Sum of Squares Total and Error
sst <- sum((y - mean(y))^2)
sse <- sum((y_predicted - y)^2)

# R squared
rsq <- 1 - sse / sst
rsq

mse(as.vector(y_predicted), crime_violent$ViolentCrimesPerPop)

#Multiple R-squared:  0.8060834
#MSE: 69771.2

4.7 Decision Tree Regression

Decision tree learning uses a decision tree as a predictive model to go from observations about an item (represented by the splits along the branches) to a conclusion about the item's target value (represented in the leaves). The advantage of a decision tree is the interpretability of its results; its predictive performance, however, is in most cases not as good as that of other models.

The fitted decision tree is shown below, along with the importance of each variable. The multiple R-squared of the model was 0.6268348 and its MSE was 133361.

fit <- rpart(ViolentCrimesPerPop ~ ., method="anova", data = crime_violent)
printcp(fit)
par(mfrow=c(1,1))
plotcp(fit)

summary(fit)
rsq.rpart(fit)

par(mfrow=c(1,1))
plot(fit, uniform=TRUE, 
     main="Regression Tree for Violent Crimes")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

y_predicted <- predict(fit, crime_violent) 

# Sum of Squares Total and Error
sst <- sum((y - mean(y))^2)
sse <- sum((y_predicted - y)^2)

# R squared
rsq <- 1 - sse / sst
rsq

mse(as.vector(y_predicted), crime_violent$ViolentCrimesPerPop)

importance <- as.data.frame(fit$variable.importance)
varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[ ,'fit$variable.importance'],2))
varImportance <- varImportance %>% arrange(desc(Importance))
varImportance <- head(varImportance, 10)
rankImportance <- varImportance %>% mutate(Rank = paste0('#',dense_rank(desc(Importance))))
ggplot(rankImportance, aes(x = reorder(Variables, Importance), y = Importance, fill = Importance)) +
  geom_bar(stat='identity') +
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
            hjust=0, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') +
  ggtitle("Variable Importance in Decision Tree Model") +
  coord_flip()

#Multiple R-squared:  0.6285281
#MSE: 133655.6

4.8 Random Forest Regression

Although the decision tree is easy to interpret, it tends to over-fit the data, which reduces its predictive power when extrapolated to new samples. As a final approach we employed a random forest, an ensemble model that averages the predictions of many decision trees (500 by default in the randomForest package) built on bootstrap samples of the dataset. Despite being more of a black-box model, the random forest is less prone to over-fitting and hence has better predictive power. It is also more appropriate for building a platform around and scaling for commercial purposes.

The random forest proved to be the best model we built, with a multiple R-squared of 0.9457556 and an MSE of 19385.75 (both computed on the training data). The most important variables according to the model are shown below.

RF <- randomForest(ViolentCrimesPerPop ~ ., crime_violent, importance = TRUE, proximity = TRUE)

y_predicted <- predict(RF, crime_violent) 

# Sum of Squares Total and Error
sst <- sum((y - mean(y))^2)
sse <- sum((y_predicted - y)^2)

# R squared
rsq <- 1 - sse / sst
rsq

mse(as.vector(y_predicted), crime_violent$ViolentCrimesPerPop)

importance <- importance(RF)
varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[ ,'IncNodePurity'],2))
varImportance <- varImportance %>% arrange(desc(Importance))
varImportance <- head(varImportance, 10)
rankImportance <- varImportance %>% mutate(Rank = paste0('#',dense_rank(desc(Importance))))
ggplot(rankImportance, aes(x = reorder(Variables, Importance), y = Importance, fill = Importance)) +
  geom_bar(stat='identity') +
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
            hjust=0, vjust=0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') +
  ggtitle("Variable Importance in Random Forest Model") +
  coord_flip()

#Multiple R-squared:  0.9469094
#MSE: 19101.98

5 Discussion

5.1 Comparison of Models

The below table shows the R-squared and MSE values of each of the models built:

compare <- data.frame("Model" = c("Linear Regression","Stepwise Regression","Ridge Regression","Lasso Regression","Radial SVM","Tuned Radial SVM","Decision Tree","Random Forest"),
                "Multiple R-sq" = c(0.6694,0.6673,0.6676,0.6693,0.7528,0.8060,0.6268,0.9458),
                "Adjusted R-sq" = c(0.6633,0.6639,NA,NA,NA,NA,NA,NA),
                "MSE" = c(118166,118882,118797,118196,88334,69337,133361,19386))
compare

The linear regression model is itself a reasonably good model, with a multiple R² of 66.94%. Step-wise regression eliminates some of the variables that are not important for prediction, but at the cost of a slightly higher MSE. Both ridge and lasso regression slightly under-performed linear regression, with the lasso out-performing ridge; all of these models have predictive power in the same range. The real difference comes with the radial SVM, which reaches an R² of 75.28% and a much lower MSE, and hyper-parameter tuning improves it further to an R² of 80.6%. The decision tree is great for interpretation but clearly under-performs all the other models. The final model, the random forest, is the best at predicting violent crime rates, with an R² of 94.58% and an MSE roughly one sixth that of linear regression.

5.2 Interpretation of Decision Tree Model

The decision tree model provides excellent interpretability, pinpointing the socio-economic characteristics of the communities most prone to violent crimes. Below are the community profiles, sorted from most to least prone to violent crime (a sketch for printing these splits directly from the fitted tree follows the list):

  1. Communities with the percentage of kids born to people who were never married greater than 14.04%
  2. Communities with the percentage of kids born to people who were never married between 7.245% and 14.04%, with the percentage of male divorcees greater than 12.57%, and with the percentage of people who don’t speak English well greater than 1.985%
  3. Communities with the percentage of kids born to people who were never married between 7.245% and 14.04%, with the percentage of male divorcees greater than 12.57%, and with the percentage of people who don’t speak English well less than 1.985%
  4. Communities with the percentage of kids born to people who were never married between 7.245% and 14.04%, and with the percentage of male divorcees less than 12.57%.
  5. Communities with the percentage of kids born to people who were never married between 4.55% and 7.245%.
  6. Communities with the percentage of kids born to people who were never married between 2.75% and 4.55%, and with the number of vacant houses exceeding 2172.
  7. Communities with the percentage of kids born to people who were never married between 2.75% and 4.55%, and with the number of vacant houses less than 2172.
  8. Communities with the percentage of kids born to people who were never married less than 2.75%, and with the percentage of male divorcees greater than 8.775%.
  9. Communities with the percentage of kids born to people who were never married less than 2.75%, and with the percentage of male divorcees less than 8.775%.
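The rules above were read off the plotted tree; the same splits can be printed directly from the fitted rpart object. A minimal sketch, assuming the fit object created in Section 4.7 is still in memory (the rpart.plot package mentioned in the comment is an extra dependency not used elsewhere in this report):

#Sketch: print the splits of the fitted regression tree as text
print(fit)
#rpart.plot::rpart.rules(fit)   #optional: renders the same splits as readable rules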

5.3 The Relationship between independent variables and dependent variables

Below is a summary of the most important factors that influence violent crime rates and the direction of their effects; a quick correlation-based sanity check follows the list. Violent crime rates in a community increase with:

  • Increase in the percentage/number of kids born to people who never married
  • Decrease in the percentage of Caucasian population
  • Increase in the percentage of African-American population
  • Decrease in the percent of kids of age 12-17 in two-parent households
  • Decrease in the percent of kids 4 and under in two-parent households
  • Increase in the percentage of male divorcees
  • Increase in the percentage of persons in dense housing (more than 1 person per room)
  • Decrease in the percentage of households with investment/rent income
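The directions listed above can be sanity-checked against the sign of each variable's correlation with the dependent variable. A minimal sketch, assuming crime_violent is still in memory; the variable spellings follow the UCI attribute list and are intersected with the columns actually present in case any name differs:

#Sketch: sign of the correlation of selected variables with ViolentCrimesPerPop
key_vars <- c("PctKidsBornNeverMar", "racePctWhite", "racepctblack",
              "PctTeen2Par", "PctYoungKids2Par", "MalePctDivorce",
              "PctPersDenseHous", "pctWInvInc")
key_vars <- intersect(key_vars, names(crime_violent))
sapply(key_vars, function(v) cor(crime_violent[[v]], crime_violent$ViolentCrimesPerPop))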

5.4 Conclusion

News of violent crimes shakes the foundations of our society from time to time and makes us question our beliefs about the professed cultural, economic, and racial harmony of our society. With the increase in the number of recent violent crime incidents, it has become imperative to re-evaluate the causes of these crimes in terms of the socio-economic fabric of our society. Increased computational power enables us to examine a large number of factors at once to determine how the relationships among them influence crime rates. A data-driven approach is paramount if we are to plan for the future and build our communities on principles of trust and harmony.

The Communities and Crime Unnormalized Data Set from the UCI Machine Learning Repository provides an excellent opportunity to examine how a myriad of socio-economic factors influence violent crime rates in our society. Using machine learning approaches, predictive models were built with high accuracy that pinpoint several factors influencing these crimes. One theme that arises when assessing the most important factors is marriage: stable marriages contribute to the healthy development of children, who in turn become responsible citizens. Areas with a high percentage of kids born to never-married parents, dense housing, and little-to-no schooling tend to exhibit high violent crime rates. Although the models point to racial factors, we believe this has more to do with the relatively poor economic conditions of African-Americans compared with those of Caucasians.

Despite these contributions, the models developed here suffer from a serious case of over-fitting: prioritizing the generation of insights over predictive performance, we used the entire dataset for both training and validation. The random forest model, which we have proposed for scaling purposes, could be further improved through hyper-parameter tuning, should be evaluated on held-out data, and would need to be retrained periodically as more data becomes available. Further work could also explore other approaches, such as neural networks, to build models that predict violent crime rates more accurately.
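As a pointer for that follow-up work, below is a minimal sketch of the kind of held-out evaluation we have in mind for the random forest; it is an illustration only, not part of the analysis above, and assumes crime_violent is still in memory:

#Sketch: held-out evaluation of a random forest (illustration only)
set.seed(42)
train_idx <- sample(nrow(crime_violent), size = floor(0.8 * nrow(crime_violent)))
train <- crime_violent[train_idx, ]
test  <- crime_violent[-train_idx, ]

rf_holdout <- randomForest(ViolentCrimesPerPop ~ ., data = train, ntree = 500)
pred <- predict(rf_holdout, newdata = test)
mean((pred - test$ViolentCrimesPerPop)^2)   #held-out MSE
#mtry could then be tuned on the training data (e.g. with randomForest::tuneRF)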

In conclusion, this study further validated the dependence of violent crime rates on socio-economic factors, generated concrete insights that governments can act on through public policy and community planning, and produced a random forest model with high predictive power that could be scaled for policing purposes: to track changes in the socio-economic profile of a community over time, to educate the public, and to inform police deployment.

6 References

  • Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  • U.S. Department of Commerce, Bureau of the Census. Census of Population and Housing 1990, United States: Summary Tape File 1a & 3a (Computer Files). U.S. Department of Commerce, Bureau of the Census (Producer), Washington, DC, and Inter-university Consortium for Political and Social Research, Ann Arbor, Michigan. (1992)
  • U.S. Department of Justice, Bureau of Justice Statistics. Law Enforcement Management and Administrative Statistics (Computer File). U.S. Department of Commerce, Bureau of the Census (Producer), Washington, DC, and Inter-university Consortium for Political and Social Research, Ann Arbor, Michigan. (1992)
  • U.S. Department of Justice, Federal Bureau of Investigation. Crime in the United States (Computer File). (1995)
  • Lawrence McClendon and Natarajan Meghanathan. Using Machine Learning Algorithms to Analyze Crime Data. Machine Learning and Applications: An International Journal (MLAIJ), Vol. 2, No. 1, March 2015.
  • Syahid Anuar, Ali Selamat, and Roselina Sallehuddin. Hybrid Artificial Neural Network with Artificial Bee Colony Algorithm for Crime Classification. Computational Intelligence in Information Systems, pp. 31-40.
  • P J Brantingham and P L Brantingham. Environmental Criminology. National Criminal Justice Reference Service