Recommending Where the City of Chicago Police Department Should Deploy Officers

Introduction


Brief Description

Chicago’s economy is booming and unemployment is low, but the wealth is not evenly distributed, and this inequality has contributed to an enormous amount of crime: Chicago’s crime rate is higher than the US average. Another problem is the shortage of police officers; the force is arguably not large enough to cover the city’s entire landscape. So how do we prevent crime? It is a classic analytics/optimization problem: resources are scarce and the area to cover is vast. The City of Chicago Police Department (CCPD) has therefore started using analytics to prevent crime rather than merely react to it in this constrained setting. Here, I present my own version of that analytics, combining exploratory analysis with machine learning models to see how they can help prevent crime. A few statistics convey the enormity of the problem in CCPD’s hands:

  • 6.9 million crimes in approximately 20 years
  • Approximately 12,000 police officers
  • Approximately 269 beats

Problem Statement

The City of Chicago Police Department (CCPD) needs to understand crime statistics in order to deploy officers on the street. The goal is to analyze the crime data and other external data and give recommendations on where CCPD should deploy their officers.


Approach

The “Crimes - 2001 to present” dataset from the Chicago Data Portal and the “Chicago Community Area (CCA) CDS” data were used for the crime analysis. Crime data were considered only for the years 2016 and 2017, since the CCA CDS dataset containing the socio-economic indicators was collected over the period 2013-17. The task is to predict the weekly crime count for each community area within Chicago for the 54 weeks of 2017; the training data come from 2016 and the test data from 2017. All analysis was done in R. The goal is to make the prediction as accurate as possible by minimizing the root mean squared error.

\[y_{c,w} = \sum_{i=1}^{N} \mathbf{1}\{\text{incident } i \text{ occurred in community area } c \text{ during week } w\},\]

i.e., \(y_{c,w}\) is the number of crimes reported in community area \(c\) during week \(w\).

Root mean squared error is defined as:

\[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \widehat{y_i})^2},\]

where \(\widehat{y_i}\) is the predicted crime count per week for a community area and \(y_i\) the actual count. The filtered 2016-17 crime dataset contains 269,096 records.
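As a point of reference, this metric is a one-liner in R; the helper below (the name rmse is ours) is reused later when evaluating models on the test data.

```r
# Root mean squared error between actual and predicted weekly crime counts
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

rmse(c(10, 20, 30), c(12, 18, 33))  # 2.380476
```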

A word of caution: we need to be careful about racial bias, since some variables are based on race. A race variable coming out as significant does not imply that a particular race is involved in any criminal activity.


Data Description & Preparation

Data Description

Chicago crime dataset - Reflects the reported incidents of crime that occurred in the City of Chicago. The 2016-17 subset used here contains 269,096 records.

A few of the important variables and their descriptions:

  • Id - Unique identifier for the record
  • Community Area - Indicates the community where the incident occurred. Chicago has 77 communities
  • Date - Date when the incident occurred
  • Primary Type - Primary type of crime. There are 33 different types of crimes
  • Arrest - Indicates whether an arrest was made or not

Chicago Community Area dataset - Contains the socio-economic indicators for the 77 community areas within Chicago (one record per area)

A few of the important variables and their descriptions:

  • VAC_HU - Vacant Housing Units
  • CT_SP_CHILD - Single Parent with Child
  • TOT_POP - Total Population
  • EMP, UNEMP etc. - Employment Status
  • WALK_BIKE, CARPOOL etc. - Mode of Travel to Work
  • JUST_DATE - Week number
  • FOR_BORN - Foreign born
  • A20_34, A35_49 etc. - Age cohorts
  • OPEN_SPACE_PER_1000 - Accessible Park Acreage per 1,000 Residents
  • INC_LT_25K - Household income less than $25,000
  • THREEOM_VEH - 3 or More Vehicles Available

Data Importing and Cleaning

After importing the data, it is necessary to visualize it to check for missing values and outliers.

Missing values

The columns containing missing values were left as-is, since they were not used in the analysis.
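A minimal sketch of how the import and missing-value check might look; the file name below is hypothetical (the portal export name may differ):

```r
# Hypothetical file name for the portal export
crimes <- read.csv("Crimes_2001_to_present.csv", stringsAsFactors = FALSE)

# Count missing values per column to see what needs attention
colSums(is.na(crimes))

# Parse the timestamp and keep only 2016-2017 incidents
crimes$Date <- as.POSIXct(crimes$Date, format = "%m/%d/%Y %I:%M:%S %p")
crimes <- subset(crimes, format(Date, "%Y") %in% c("2016", "2017"))
```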

Data Transformation

New variables such as month, time of day, and day of week were created from the incident date to aid the analysis.
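A sketch of how these derived variables might be built with lubridate; the column names (month, day_of_week, time_of_day, week) and the time-of-day buckets are our own choices:

```r
library(lubridate)

crimes$month       <- month(crimes$Date, label = TRUE)  # Jan, Feb, ...
crimes$day_of_week <- wday(crimes$Date, label = TRUE)   # Sun, Mon, ...
crimes$hour        <- hour(crimes$Date)

# Coarse time-of-day buckets (0-5, 6-11, 12-17, 18-23)
crimes$time_of_day <- cut(crimes$hour,
                          breaks = c(-1, 5, 11, 17, 23),
                          labels = c("Night", "Morning", "Afternoon", "Evening"))

# Week index for aggregating to weekly counts
crimes$week <- isoweek(crimes$Date)
```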

Checking distribution of the predictors

There are a few skewed variables, and it usually helps to make their distributions closer to normal via a log or Box-Cox transformation. This matters especially for linear regression, where strongly skewed predictors can cause problems. After applying a log transformation, however, the results did not improve; in fact, the QQ plot showed the residuals deviating further from normality. The data were therefore kept as-is, without any transformation.
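For illustration, one way such a comparison might be run, assuming a training data frame train with crime_count and the skewed predictor VAC_HU:

```r
# Compare residual normality with and without a log transform of a skewed predictor
fit_raw <- lm(crime_count ~ VAC_HU, data = train)
fit_log <- lm(crime_count ~ log1p(VAC_HU), data = train)  # log1p handles zeros

par(mfrow = c(1, 2))
qqnorm(resid(fit_raw), main = "Raw VAC_HU");    qqline(resid(fit_raw))
qqnorm(resid(fit_log), main = "log1p(VAC_HU)"); qqline(resid(fit_log))
```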

Checking distribution of the crime count (response variable)

The plot below shows that the crime count variable is slightly skewed. To correct this, a log transformation was applied before performing linear regression, but the QQ plot worsened, so the untransformed counts were retained.

Exploratory Data Analysis

Chicago Crime Count

The plot below depicts the crime count for each community area within Chicago. Community areas filled with a darker shade of blue have higher crime counts. The areas with high crime counts include 23, 24, 25, 28, 29, 8, 32, 66, 67 and 68.


Community Area Clustering

We performed k-means clustering to see which community areas can be grouped together. The variables considered for clustering are aggregated at the community-area level: arrest count, domestic abuse count, counts of the various types of crime, and the socio-economic indicators. The number of clusters was selected using the “WSS” (elbow) and “Silhouette” methods.

The silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a useful clustering. The average silhouette method computes the average silhouette of observations for different values of k.
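A sketch of how this selection might be done with the factoextra package, assuming comm_features holds the aggregated community-level variables:

```r
library(factoextra)

# comm_features: one row per community area with the aggregated variables
cluster_data <- scale(comm_features)   # standardise before k-means

# Elbow (WSS) and average-silhouette diagnostics for choosing k
fviz_nbclust(cluster_data, kmeans, method = "wss")
fviz_nbclust(cluster_data, kmeans, method = "silhouette")

set.seed(42)
km <- kmeans(cluster_data, centers = 3, nstart = 25)
table(km$cluster)   # community areas per cluster
```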

The table below shows the number of community areas within each cluster; the cluster means of the clustering variables are summarized after it.

three_cluster   Community_Area_Count
1               48
2               15
3               14

Summary of the three clusters is as follows:

  • Cluster1 - Low Arrest count, Low Domestic Abuse count, Low Vacant Housing Units, Low Single Parent with Child, Low Income Less than 25K, Low Percent Aged 25 and above without Diploma
  • Cluster2 - Medium Arrest Count, Medium Domestic Abuse count, High Vacant Housing Units, Medium Single Parent with Child, High Income Less than 25K, Medium Percent Aged 25 and above without Diploma
  • Cluster3 - High Arrest Count, High Domestic Abuse count, High Vacant Housing Units, High Single Parent with Child, High Income Less than 25K, High Percent Aged 25 and above without Diploma

Arrest Percentage by Community Area

The plot below shows the arrest percentage for community areas within Chicago. Not all 77 community areas are shown, only the 20 with the highest crime counts. The assumption is that a low arrest percentage indicates that suspects often got away from the crime scene. Community areas whose arrest percentage is low relative to the Chicago-wide rate therefore need extra focus, given that these areas also have high crime activity.

We can see from the plot that the community areas highlighted in red have a low arrest percentage: 24, 22, 28, 6, 8, 66, 32 and 43.
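For reference, a sketch of how the arrest percentage might be computed with dplyr, assuming the Arrest column is read in as the strings "true"/"false" (it may instead be logical, depending on the export):

```r
library(dplyr)

arrest_pct <- crimes %>%
  group_by(Community.Area) %>%
  summarise(crime_count = n(),
            arrest_pct  = mean(Arrest == "true") * 100) %>%
  arrange(desc(crime_count)) %>%
  slice_head(n = 20)   # top 20 community areas by crime count
```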


Crime Statistics by Hour


Crime Statistics by Day

Machine Learning

Checking Correlation

The plot below shows the correlation between the crime count and the independent variables. Most of the variables are correlated with the crime count, which makes linear regression a logical starting point.
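A sketch of how the correlations might be computed and plotted, assuming a training data frame train with the weekly crime_count column:

```r
library(corrplot)

# Correlation of each numeric predictor with weekly crime count
num_vars <- sapply(train, is.numeric)
cor_mat  <- cor(train[, num_vars])
sort(cor_mat[, "crime_count"], decreasing = TRUE)

# Full correlation matrix as a colour map
corrplot(cor_mat, method = "color", tl.cex = 0.5)
```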

Linear Regression

Linear regression is the simplest supervised learning approach, and since the predictors are correlated with the crime count, it is a good place to start. As noted earlier, some variables were skewed, so I fitted a model with those variables log-transformed; the result was worse than fitting the model without any log or Box-Cox transformation.

After running the linear regression, every variable came out significant except NATIVE and FOR_BORN. The adjusted R² was 93.44%, and the root mean squared error on the test data was 27.07%.

## [1] 27.07018
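A minimal sketch of the fit and test-set evaluation, assuming train and test data frames and the rmse() helper defined earlier:

```r
lm_fit <- lm(crime_count ~ ., data = train)
summary(lm_fit)$adj.r.squared          # adjusted R-squared

pred <- predict(lm_fit, newdata = test)
rmse(test$crime_count, pred)           # test-set RMSE
```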

Influential Variables

It is important to check whether any observations are unduly influencing the linear regression. Even if observations are outliers, that does not make them influential; excluding a non-influential outlier makes little difference to the model. Looking at the top 10 Cook's distance values below, we can see that none of them is influencing the regression, so no observations need to be excluded.

.cooksd     crime_count   COMMUNITY_AREA_NAME
0.0477839   356           Loop
0.0273691   144           Austin
0.0133216   217           Lake View
0.0083669   407           Austin
0.0080259   237           Austin
0.0070439   235           North Lawndale
0.0070240   189           Lake View
0.0063131   245           Austin
0.0057752   55            West Town
0.0053697   391           Austin
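These values can be reproduced with base R's Cook's distance; the 4/n threshold in the sketch below is a common rule of thumb, not part of the original analysis:

```r
# Cook's distance flags observations with outsized influence on the fit
cooks <- cooks.distance(lm_fit)
head(sort(cooks, decreasing = TRUE), 10)

# Rule of thumb: values above 4/n deserve a closer look
which(cooks > 4 / nrow(train))
```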

Heteroskedasticity & Pattern Detection

Next, it is important to visualize the model to assess whether we are fitting it correctly. We plot residuals vs. fitted values to check for non-linearity and heteroskedasticity. The plot below shows no problem with our model.

QQ Plot

Next, we check whether the residuals are normally distributed. The QQ plot is not exactly normal, but since a log transformation made the results worse, we keep the response variable as-is.

Multicollinearity

Finally, we need to check for multicollinearity by calculating the variance inflation factor (VIF) for each independent variable. Even if accuracy is good, multicollinearity masks variables that are correlated with the response: they become statistically insignificant because of collinearity. A VIF above 10 indicates multicollinearity.
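A sketch using car::vif, assuming the fitted model object is lm_fit (the exact call behind the table below may differ):

```r
library(car)

# VIF > 10 signals problematic collinearity
sort(vif(lm_fit), decreasing = TRUE)
```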

##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"
Variable VIF
TOT_POP 28073360711
UND19 459530524
A20_34 1267319203
A35_49 356836454
A50_64 194547973
A65_74 34479670
A75_84 9502277
OV85 1600431
MED_AGE 5
WHITE 8584864437
HISP 6173287414
BLACK 6419196905
ASIAN 389174728
OTHER 17848964
POP_HH 6256
UNEMP_PCT 14
TOT_COMM 5123209368
DROVE_AL 968986739
CARPOOL 36657478
TRANSIT 977282020
WALK_BIKE 211490497
COMM_OTHER 7519096
NO_VEH 560216
ONE_VEH 890280
TWO_VEH 147086
THREEOM_VEH 16217
PERCENT_AGED_25_WITHOUT_HIGH_SCHOOL_DIPLOMA 24
INC_LT_25K 182
MEDINC 35
TOT_HH 6352977351
OWN_OCC_HU 1309566770
RENT_OCC_HU 3316000055
VAC_HU 202221990
HU_TOT 6358088664
BR_0_1 458986040
BR_2 219680781
BR_3 64066619
BR_4 7052947
BR_5 1855081
AVG_VMT 10
OPEN_SPACE_PER_1000 5
CT_1PHH 1660962417
CT_2PHH 742525024
CT_3PHH 81703791
CT_4MPHH 218848902
CT_FAM_HH 1340887582
CT_SP_WCHILD 224
CT_NONFAM_HH 3639874835
NATIVE 21959
FOR_BORN 3515
just_date 1

Lasso Regression

As seen above, with many independent variables there is a high possibility of collinearity, so the variables correlated with the response get masked. Our goal is not just to improve accuracy but also to identify the variables directly related to the response; moreover, a model overfits as the number of variables grows. The lasso controls the regression coefficients, and in turn the variance, by penalizing their magnitude. It thus reduces collinearity and shrinks the coefficients of insignificant variables exactly to zero.

The penalty to apply was determined using 10-fold cross-validation, with lambda as the tuning parameter; a sketch follows.
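A sketch of how the cross-validated lasso might be fitted with glmnet (alpha = 1 selects the lasso penalty):

```r
library(glmnet)

x_train <- model.matrix(crime_count ~ ., data = train)[, -1]
y_train <- train$crime_count

set.seed(42)
cv_fit <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)
cv_fit$lambda.min                  # penalty chosen by cross-validation

coef(cv_fit, s = "lambda.min")     # insignificant coefficients shrink to zero
```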

Variable_Name                                 Coef_Value
just_date_00                                      -36.66
OPEN_SPACE_PER_1000                                -1.40
UNEMP_PCT                                          -0.20
PERCENT_AGED_25_WITHOUT_HIGH_SCHOOL_DIPLOMA         0.36

The root mean squared error comes out to be 28.54%.

## [1] 0.2854015

Gradient Boosting

Gradient boosting is a tree-based method whose prediction accuracy usually surpasses most other algorithms, and it requires little preprocessing of the data. Boosting can be described as a collection of decision trees in which each tree is built sequentially, each one correcting the errors of those before it.

Gradient Boosting requires a lot of parameters to be tuned. They are as follows:

  • Total number of trees
  • Depth of trees: the number of splits in each tree
  • Learning rate (shrinkage): how much each successive tree contributes, akin to the size of each step
  • Subsampling: the percentage of the data used for each tree

As there are many parameters to tune, we perform a grid search that iterates over combinations of hyperparameters; a sketch of such a search appears after the results below.

Based on minimum RMSE, the optimal parameter values were found to be: number of trees = 319, shrinkage (learning rate) = 0.1, depth = 3.
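A sketch of how such a grid search might look with the gbm package; the grid values below are illustrative, not the exact grid used:

```r
library(gbm)

grid <- expand.grid(shrinkage = c(0.01, 0.1, 0.3),
                    depth     = c(1, 3, 5))
grid$best_trees <- NA
grid$cv_rmse    <- NA

set.seed(42)
for (i in seq_len(nrow(grid))) {
  fit <- gbm(crime_count ~ ., data = train,
             distribution = "gaussian", n.trees = 1000,
             shrinkage = grid$shrinkage[i],
             interaction.depth = grid$depth[i],
             cv.folds = 5)
  # Optimal number of trees and CV error for this combination
  grid$best_trees[i] <- gbm.perf(fit, method = "cv", plot.it = FALSE)
  grid$cv_rmse[i]    <- sqrt(min(fit$cv.error))
}
grid[which.min(grid$cv_rmse), ]   # best hyperparameter combination
```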

Variable Importance

The plot below depicts the variables that were most influential in reducing the mean squared error, averaged across the trees. The variables with the largest average decrease in MSE are considered most important.

Partial Dependence Plot (PDP)

A PDP helps us understand how the response changes with an independent variable. It plots the average change in predicted crime count with respect to VAC_HU, etc., while holding the other variables constant.
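A sketch using the pdp package, assuming the boosted model object is gbm_fit and the optimal 319 trees found above:

```r
library(pdp)

# Average predicted crime count as VAC_HU varies, other variables held fixed
pd <- partial(gbm_fit, pred.var = "VAC_HU", n.trees = 319, train = train)
plotPartial(pd)
```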

The root mean squared error from fitting the boosting model comes out to be 22.74%.

## [1] 0.227425

Model Summary

##   MachineLearning_method   RMSE
## 1      Linear Regression 27.07%
## 2       Lasso Regression 28.54%
## 3      Gradient Boosting 22.74%

Recommendation

The goal was to recommend officer deployment based on the crime data and socio-economic data. The recommendations are as follows:

  • For certain community areas the crime count was high but the arrest rate was lower than the average Chicago arrest rate; deployment of police officers should be increased there. These community areas are: 43, 32, 66, 8, 6, 28, 22 and 24
  • Community areas 8, 43, 25 and 68 have a high number of vacant housing units, so a greater-than-average deployment of police officers is required in the weeks for which the predicted crime count is highest. Based on our exploratory data analysis, deployment should also be above average between 12 PM and 12 AM
  • Community areas 25, 43, 19 and 30 have the highest numbers of single parents with children; officers need to be deployed in these areas during the weeks when the crime count is predicted to be high
  • Clustering revealed community areas with high domestic abuse counts, high vacant housing units, high numbers of single parents with children, and so on; these areas need more-than-average monitoring during the weeks when the crime count is predicted to be high
  • Special task forces that can handle extremely violent crimes should be deployed on the specific days of the week when violent crimes are most numerous, based on the EDA. As the crime count was not predicted on a daily basis, the weekly predicted numbers can guide the deployment of these task forces
  • Building new parks may help engage people socially and reduce crime, since our analysis found that communities with more parks have less crime