Chicago Crime Prediction

To recommend where the City of Chicago Police Department should deploy officers

Introduction

Brief Description

Chicago’s economy is booming and unemployment is low, but the wealth is not equally distributed. This situation has contributed to a high level of crime: Chicago’s crime rate is higher than the US average crime rate. Another problem is the lack of police officers. The view is that the total number of officers in Chicago is not enough to cover the entire city. So, how do we prevent crime? It is a classic analytics/optimization problem in which resources are limited and the area to cover is vast. Accordingly, CCPD has started using analytics to prevent crime in a constrained setting rather than only reacting to it. Here, I present my own version of such an analysis, combining exploratory analysis with machine learning models to see how they can help prevent crime. Below are a few statistics that help convey the scale of the problem facing CCPD:

6.9 million crimes in approximately 20 years
Approximately 12000 Police Officers
Approximately 269 beats

Problem Statement

The City of Chicago Police Department (CCPD) needs to understand crime statistics in order to deploy officers on the street. The goal is to analyze the crime data and other external data and give recommendations on where CCPD should deploy their officers.

Introduction

The “Crimes - 2001 to present” dataset from the Chicago Data Portal and the “Chicago Community Area (CCA) CDS” data have been used for the crime analysis. The crime data has been restricted to the years 2016 and 2017, because the CCA CDS dataset containing the socio-economic indicators was collected over the period 2013-17. My task is to predict the weekly crime count for each community area within Chicago for 54 weeks for the year 2017. The training data is from the year 2016 and the test data is from the year 2017. The data was analysed in R. The goal is to make the predictions as accurate as possible by minimizing the root mean squared error.

\[y_{\text{Community}} = \sum_{i=1}^{n} \text{crime count per week}\]
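
As a rough sketch of how such a response could be built, the R snippet below counts incidents per community area and week on a tiny made-up data frame. The column names (`community_area`, `date`) and the use of `%U` for the week number are assumptions for illustration, not the original code.

```r
# Toy incident-level data; column names and values are made up for illustration.
crimes <- data.frame(
  community_area = c(25, 25, 25, 8, 8, 32),
  date = as.Date(c("2016-01-04", "2016-01-05", "2016-01-12",
                   "2016-01-04", "2016-01-13", "2016-01-06"))
)

# Week of the year as a two-digit string, with Sunday as the first day of the week.
crimes$week <- format(crimes$date, "%U")

# One row per community area and week, holding the number of incidents in that cell.
weekly <- aggregate(
  list(crime_count = rep(1L, nrow(crimes))),
  by = list(community_area = crimes$community_area, week = crimes$week),
  FUN = sum
)
print(weekly)
```

Weeks in which a community area has no incidents do not appear in this output; they would need to be filled in with zero counts before modelling.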

Root mean squared error is defined as:

\[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \widehat{y_i})^2},\]

where $\widehat{y_i}$ is the predicted crime count per week for a community area, $y_i$ is the actual crime count per week for that community area, and $n$ is the number of observations. The crime data contains 269,096 records.
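
The definition above translates directly into a few lines of R. This is a minimal sketch; the function name `rmse` and the example values are made up for illustration.

```r
# Root mean squared error between actual and predicted values.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Made-up weekly counts for three community-week observations.
actual    <- c(10, 20, 30)
predicted <- c(12, 18, 33)
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.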

A word of caution: we need to be careful about racial bias, as some variables are based on race. A race variable turning out to be statistically significant does not imply that a particular race is involved in criminal activity.

Data Description & Preparation

Data Description

Chicago crime dataset - It reflects the reported incidents of crime that occurred in the City of Chicago. Each record is one reported incident

A few of the important variables and their descriptions:

Id - Unique identifier for the record
Community Area - Indicates the community area where the incident occurred. Chicago has 77 community areas
Date - Date when the incident occurred
Primary Type - Primary type of crime. There are 33 different types of crime
Arrest - Indicates whether an arrest was made or not

Chicago Community Area dataset - It contains the socio-economic indicators for 77 Community areas within Chicago

A few of the important variables and their descriptions:

VAC_HU - Vacant Housing Units
CT_SP_WCHILD - Single Parent with Child
TOT_POP - Total Population
EMP, UNEMP etc. - Employment Status
WALK_BIKE, CARPOOL etc. - Mode of Travel to Work
JUST_DATE - Week number
FOR_BORN - Foreign born
A20_34, A35_49 etc. - Age cohorts
OPEN_SPACE_PER_1000 - Accessible Park Acreage per 1,000 Residents
INC_LT_25K - Household income less than $25,000
THREEOM_VEH - 3 or More Vehicles Available

Data Importing and Cleaning

After importing the data, it is necessary to visualize it to see whether there are any missing values or outliers.

Missing values

The columns containing missing values were left as is, as they were not used in the analysis.

Data Transformation

New variables such as month, time of day and day of week were created to aid the analysis.
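
A minimal sketch of how such variables could be derived from an incident timestamp in base R is shown below. The timestamps, column names and time-of-day cut points are assumptions chosen for illustration.

```r
# Made-up timestamps; the real data has one timestamp per incident.
ts <- as.POSIXct(c("2016-03-14 08:30:00", "2016-07-02 23:15:00"), tz = "UTC")

features <- data.frame(
  month       = as.integer(format(ts, "%m")),  # 1 to 12
  hour        = as.integer(format(ts, "%H")),  # 0 to 23
  day_of_week = as.integer(format(ts, "%u")),  # 1 = Monday, ..., 7 = Sunday
  week        = as.integer(format(ts, "%U"))   # week of the year, Sunday start
)

# A coarse time-of-day bucket; the cut points are an arbitrary choice.
features$time_of_day <- cut(
  features$hour,
  breaks = c(-1, 5, 11, 17, 23),
  labels = c("night", "morning", "afternoon", "evening")
)
print(features)
```

Numeric codes are used for the day of the week so that the result does not depend on the locale of the machine running the code.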

Checking distribution of the predictors

A few of the predictors are skewed, and it usually helps to bring the distribution of such variables closer to normal through a log or Box-Cox transformation. This helps in particular with linear regression, where skewed variables can cause problems. However, after applying a log transformation, the results did not improve; in fact, the QQ plot showed the residuals deviating further from a normal distribution. The data was therefore kept as is, without any transformation.
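
One way to check whether a transformation reduces skew is to compare the sample skewness before and after it. The sketch below does this on simulated data; the `skewness` helper and the exponential toy variable are assumptions for illustration and say nothing about the actual predictors.

```r
# Sample skewness (third standardised moment); positive means a long right tail.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

set.seed(1)
x <- rexp(1000, rate = 1 / 50)   # made-up right-skewed predictor

skew_raw <- skewness(x)
skew_log <- skewness(log1p(x))   # log(1 + x) keeps zeros finite

print(c(raw = skew_raw, log = skew_log))
```

A smaller absolute skewness after the transformation does not guarantee a better model fit, which matches the experience reported above.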

Checking distribution of the crime count (response variable)

The plot below shows that the crime count variable is slightly skewed. To correct this, a log transformation was applied before performing linear regression, but the QQ plot worsened.

Exploratory Data Analysis

Chicago Crime Count

The plot below depicts the crime count for each community area within Chicago. Community areas filled with a darker shade of blue have higher crime counts. The community areas where the crime count is high include 23, 24, 25, 28, 29, 8, 32, 66, 67 and 68.

Community Area Clustering

We performed k-means clustering to see which community areas can be grouped together. The variables considered for clustering are aggregated at the community area level: arrest count, domestic abuse count, counts for various types of crime and different socio-economic indicators. The number of clusters was selected using the within-cluster sum of squares (“WSS”) and “Silhouette” methods.

The silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a useful clustering. The average silhouette method computes the average silhouette of observations for different values of k.
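
The selection of k can be sketched in base R as follows. The data is simulated around three centres, and `avg_silhouette` is a hand-written helper introduced here for illustration; the original analysis may well have used a package function instead.

```r
set.seed(42)
# Made-up features for 60 "areas" drawn around three centres.
centres <- rbind(c(0, 0), c(5, 5), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(20, centres[i, 1], 0.5), rnorm(20, centres[i, 2], 0.5))
}))

# Mean silhouette width: s(i) = (b - a) / max(a, b), where a is the mean distance
# to the point's own cluster and b the smallest mean distance to another cluster.
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  s <- sapply(seq_len(nrow(x)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)              # convention for singleton clusters
    a <- sum(d[i, own]) / (sum(own) - 1)      # excludes the zero self-distance
    b <- min(tapply(d[i, !own], cluster[!own], mean))
    (b - a) / max(a, b)
  })
  mean(s)
}

ks <- 2:6
fits <- lapply(ks, function(k) kmeans(x, centers = k, nstart = 25))
wss <- sapply(fits, function(f) f$tot.withinss)   # total within-cluster sum of squares
sil <- sapply(fits, function(f) avg_silhouette(x, f$cluster))

print(data.frame(k = ks, wss = round(wss, 1), silhouette = round(sil, 3)))
best_k <- ks[which.max(sil)]
```

With WSS one looks for the "elbow" where adding clusters stops paying off, while with the silhouette one simply picks the k with the highest average width.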

The table below shows the number of community areas within each of the three clusters. The mean values of the variables for each cluster are summarized after it.

three_cluster	Community_Area_Count
1	48
2	15
3	14

Summary of the three clusters is as follows:

Cluster1 - Low Arrest count, Low Domestic Abuse count, Low Vacant Housing Units, Low Single Parent with Child, Low Income Less than 25K, Low Percent Aged 25 and above without Diploma
Cluster2 - Medium Arrest Count, Medium Domestic Abuse count, High Vacant Housing Units, Medium Single Parent with Child, High Income Less than 25K, Medium Percent Aged 25 and above without Diploma
Cluster3 - High Arrest Count, High Domestic Abuse count, High Vacant Housing Units, High Single Parent with Child, High Income Less than 25K, High Percent Aged 25 and above without Diploma

Arrest Percentage by Community Area

The plot below shows the arrest percentage for community areas within Chicago. Not all 77 community areas are shown, only the 20 with the highest crime count. A low arrest percentage is assumed to indicate that suspects may have fled the crime scene. Thus, community areas whose arrest percentage is low compared to the Chicago-wide arrest percentage need to be focused on, considering that these areas have high crime activity.

We can see from the plot that the community areas highlighted in red have a low arrest percentage. They are 24, 22, 28, 6, 8, 66, 32 and 43.
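
The comparison behind this plot can be sketched as follows, using a made-up incident table. The column names `community_area` and `arrest` are assumptions, as is the choice of the overall arrest share as the citywide reference.

```r
# Made-up incident records: community area and whether an arrest was made.
crimes <- data.frame(
  community_area = c(24, 24, 24, 24, 25, 25, 25, 25),
  arrest         = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Arrest percentage per community area (the mean of a logical is a proportion).
by_area <- aggregate(arrest ~ community_area, data = crimes,
                     FUN = function(a) 100 * mean(a))
names(by_area)[2] <- "arrest_pct"

# Citywide arrest percentage used as the reference value.
city_pct <- 100 * mean(crimes$arrest)

# Flag areas that fall below the citywide figure.
by_area$below_city <- by_area$arrest_pct < city_pct
print(by_area)
```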

Crime Statistics by Hour

Crime Statistics by Day

Machine Learning

Checking Correlation

The plot below shows the correlation between the crime count and the independent variables. Most of the variables are correlated with the crime count, which makes linear regression a logical starting point.

Linear Regression

Linear regression is the simplest supervised learning approach. As the predictors are correlated with the crime count, it is good to start with the simplest approach. As we saw earlier, some variables were skewed, so I performed a log transformation of those variables and fitted a linear regression model. The output was worse than that of the model fitted without any log or Box-Cox transformation.

After running the linear regression, every variable came out to be significant except NATIVE and FOR_BORN (foreign born). The adjusted $R^2$ was found to be 93.44%. The root mean squared error on the test data was found to be 27.07%.

## [1] 27.07018

Influential Variables

It is important to check whether any observations are unduly influencing the linear regression. An observation being an outlier does not by itself make it influential; if it is not influential, excluding it from the model makes little difference. Looking at the 10 largest Cook’s distances below, all are well below the common rule-of-thumb threshold of 1, so these observations are not influencing the linear regression. Thus, we don’t need to exclude them.

.cooksd	crime_count	COMMUNITY_AREA_NAME
0.0477839	356	Loop
0.0273691	144	Austin
0.0133216	217	Lake View
0.0083669	407	Austin
0.0080259	237	Austin
0.0070439	235	North Lawndale
0.0070240	189	Lake View
0.0063131	245	Austin
0.0057752	55	West Town
0.0053697	391	Austin
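
A table like the one above can be produced with `cooks.distance()` on a fitted `lm` object. The sketch below uses simulated data rather than the crime data, and the two flagging thresholds are common rules of thumb, not values taken from the original analysis.

```r
set.seed(7)
n <- 100
x <- rnorm(n)
y <- 3 + 2 * x + rnorm(n)      # made-up data, not the crime data

fit <- lm(y ~ x)
cooksd <- cooks.distance(fit)  # one value per observation

# Ten largest Cook's distances, analogous to the table above.
top10 <- head(sort(cooksd, decreasing = TRUE), 10)
print(round(top10, 4))

# Two common rules of thumb for flagging influential observations.
flag_one <- which(cooksd > 1)
flag_4_n <- which(cooksd > 4 / n)
```

The `4 / n` rule is much stricter than the threshold of 1, so it usually flags more observations for a closer look.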

Heteroskedasticity & Pattern Detection

Next, it is important to visualize the model to assess if we are fitting the model correctly or not. We will plot residuals vs. fitted value to check if there is any non-linearity or if there is heteroskedasticity. We can see from the plot below, that there is so problem with our model.

QQ Plot

Next, we checked whether the residuals are normally distributed. The plot shows that they are not exactly normal. However, after a log transformation of the response, the results were worse than before, so we will keep the response variable as is.

Multicollinearity

Finally, we need to check whether there is any multicollinearity. We will calculate the variance inflation factor (VIF) for each of the independent variables. Even though the accuracy might be good, multicollinearity can mask the variables that are related to the response variable: they can become statistically insignificant because of collinearity. If the VIF exceeds 10, we say that there is multicollinearity.

Variable	VIF
TOT_POP	28073360711
UND19	459530524
A20_34	1267319203
A35_49	356836454
A50_64	194547973
A65_74	34479670
A75_84	9502277
OV85	1600431
MED_AGE	5
WHITE	8584864437
HISP	6173287414
BLACK	6419196905
ASIAN	389174728
OTHER	17848964
POP_HH	6256
UNEMP_PCT	14
TOT_COMM	5123209368
DROVE_AL	968986739
CARPOOL	36657478
TRANSIT	977282020
WALK_BIKE	211490497
COMM_OTHER	7519096
NO_VEH	560216
ONE_VEH	890280
TWO_VEH	147086
THREEOM_VEH	16217
PERCENT_AGED_25_WITHOUT_HIGH_SCHOOL_DIPLOMA	24
INC_LT_25K	182
MEDINC	35
TOT_HH	6352977351
OWN_OCC_HU	1309566770
RENT_OCC_HU	3316000055
VAC_HU	202221990
HU_TOT	6358088664
BR_0_1	458986040
BR_2	219680781
BR_3	64066619
BR_4	7052947
BR_5	1855081
AVG_VMT	10
OPEN_SPACE_PER_1000	5
CT_1PHH	1660962417
CT_2PHH	742525024
CT_3PHH	81703791
CT_4MPHH	218848902
CT_FAM_HH	1340887582
CT_SP_WCHILD	224
CT_NONFAM_HH	3639874835
NATIVE	21959
FOR_BORN	3515
just_date	1
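
The VIF of a predictor is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on simulated data; `vif_manual` and the toy variables are assumptions for illustration, not the code that produced the table above.

```r
# VIF for each predictor: regress it on the other predictors and use 1 / (1 - R^2).
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
    1 / (1 - r2)
  })
}

set.seed(3)
n <- 200
a <- rnorm(n)
b <- rnorm(n)
# `total` is almost exactly a + b, so all three columns are strongly collinear.
collinear <- data.frame(a = a, b = b, total = a + b + rnorm(n, sd = 0.05))
independent <- data.frame(u = rnorm(n), v = rnorm(n))

print(round(vif_manual(collinear), 1))    # far above the threshold of 10
print(round(vif_manual(independent), 2))  # close to 1
```

Totals that sit alongside their own components, as in the toy `total` column, are a typical source of very large VIFs.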

Lasso Regression

As we saw above, when there are a lot of independent variables there is a high possibility of collinearity, which masks the variables that are related to the response variable. Our goal is not just to improve accuracy but also to identify the variables that are directly related to the response variable. Also, the model tends to overfit the data as the number of variables increases. Lasso shrinks the regression coefficients, and in turn the variance of the model, by penalising their absolute size. Thus, lasso helps to reduce the effect of collinearity and sets the coefficients of less useful variables to exactly zero.
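
For reference, the lasso estimate can be written as the penalised least-squares problem below, using the same $y_i$ as in the RMSE definition. The symbols $x_{ij}$, $\beta_j$, $p$ and $\lambda$ are notation introduced here, and the exact scaling of the squared-error term varies between implementations.

\[\widehat{\beta} = \arg\min_{\beta_0, \beta} \left\{ \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}\]

Setting $\lambda = 0$ recovers ordinary least squares, while larger values of $\lambda$ shrink more coefficients to exactly zero.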

The penalty to apply has been determined using 10 fold cross validation. Here, lambda is the tuning parameter.

Variable_Name	Coef_Value
just_date_00	-36.66
OPEN_SPACE_PER_1000	-1.40
UNEMP_PCT	-0.20
PERCENT_AGED_25_WITHOUT_HIGH_SCHOOL_DIPLOMA	0.36

The root mean squared error comes out to be 28.54%

## [1] 0.2854015

Gradient Boosting

Gradient boosting is a tree-based method whose prediction accuracy usually surpasses that of most other algorithms. It typically requires little preprocessing of the data. Boosting can be described as a collection of decision trees in which each tree is built sequentially, learning from the errors of the previous ones.

Gradient Boosting requires a lot of parameters to be tuned. They are as follows:

Total Number of Trees
Depth of Trees: Number of splits in each tree
Learning rate: Controls how big a step is taken when each new tree is added
Subsampling: Percentage of data to be used

As there are a lot of parameters to be tuned, we will perform a grid search, which iterates over combinations of hyperparameters.

Based on minimum RMSE, the optimal values of the parameters were found to be:

Number of trees: 319
Shrinkage (learning rate): 0.1
Depth: 3
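
To make the sequential fitting and the grid search concrete, here is a minimal from-scratch sketch in base R. It boosts depth-1 trees (stumps) on a single simulated predictor with squared-error loss and tunes only the number of trees and the shrinkage. All function names, data and grid values are assumptions for illustration; the original analysis presumably relied on a boosting package and a richer grid.

```r
# Depth-1 regression tree (stump) on a single predictor, fitted to residuals r.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cutpoints <- (head(xs, -1) + tail(xs, -1)) / 2     # midpoints between values
  sse <- sapply(cutpoints, function(cp) {
    left <- r[x <= cp]
    right <- r[x > cp]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cutpoints[which.min(sse)]
  list(threshold = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}
predict_stump <- function(s, x) ifelse(x <= s$threshold, s$left, s$right)

# Squared-error boosting: each stump is fitted to the current residuals and
# added to the ensemble after scaling by the learning rate (shrinkage).
boost <- function(x, y, n_trees, shrinkage) {
  model <- list(init = mean(y), shrinkage = shrinkage, stumps = list())
  pred <- rep(model$init, length(y))
  for (m in seq_len(n_trees)) {
    s <- fit_stump(x, y - pred)
    pred <- pred + shrinkage * predict_stump(s, x)
    model$stumps[[m]] <- s
  }
  model
}
predict_boost <- function(model, x) {
  pred <- rep(model$init, length(x))
  for (s in model$stumps) pred <- pred + model$shrinkage * predict_stump(s, x)
  pred
}
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(11)
x <- runif(300, 0, 10)
y <- 20 + 5 * sin(x) + rnorm(300)                    # made-up data
train <- 1:200
valid <- 201:300

# Hypothetical grid; a fuller search would also vary depth and subsampling.
grid <- expand.grid(n_trees = c(10, 50, 100), shrinkage = c(0.05, 0.1, 0.5))
grid$rmse <- mapply(function(n_trees, shrinkage) {
  model <- boost(x[train], y[train], n_trees, shrinkage)
  rmse(y[valid], predict_boost(model, x[valid]))
}, grid$n_trees, grid$shrinkage)

best <- grid[which.min(grid$rmse), ]
print(best)
```

The grid is scored on held-out data rather than on the training set, because training error keeps falling as trees are added and would always favour the largest model.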

Variable Importance

The plot below depicts the variables that were most influential in reducing the mean squared error, averaged across the trees. The variables with the largest average decrease in MSE are considered most important.

Partial Dependence Plot (PDP)

A PDP helps us to understand how the response variable changes with an independent variable. It plots the average predicted crime count as a function of VAC_HU, etc., averaging over the observed values of the other variables.
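
The computation behind a partial dependence curve is simple enough to sketch by hand. The example below uses a linear model on simulated data purely because it is self-contained; `partial_dependence`, the toy columns and the grid are assumptions for illustration, and any model with a `predict()` method could be substituted.

```r
# Partial dependence of a fitted model on one variable: for each grid value,
# overwrite that variable in every row, predict, and average the predictions.
partial_dependence <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value
    mean(predict(model, newdata = modified))
  })
}

set.seed(5)
n <- 200
toy <- data.frame(vac_hu = runif(n, 0, 100), tot_pop = runif(n, 1, 50))
toy$crime_count <- 10 + 0.5 * toy$vac_hu + 0.2 * toy$tot_pop + rnorm(n)

fit <- lm(crime_count ~ vac_hu + tot_pop, data = toy)
grid <- seq(0, 100, by = 25)
pd <- partial_dependence(fit, toy, "vac_hu", grid)
print(data.frame(vac_hu = grid, avg_prediction = round(pd, 2)))
```

For a linear model the curve is a straight line whose slope equals the fitted coefficient; for a boosted model the same procedure can reveal thresholds and plateaus.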

The root mean squared error from fitting the boosting model comes out to be 22.74%

## [1] 0.227425

Model Summary

##   MachineLearning_method   RMSE
## 1      Linear Regression 27.07%
## 2       Lasso Regression 28.54%
## 3      Gradient Boosting 22.74%

Recommendation

The goal was to recommend where to deploy officers based on the crime data and socio-economic data. The recommendations are as follows:

For particular community areas, the crime count was found to be high but the arrest rate was lower than the average Chicago arrest rate. For these areas, the deployment of police officers needs to be increased. These community areas are 43, 32, 66, 8, 6, 28, 22 and 24
Community areas 8, 43, 25 and 68 have a high number of vacant housing units. Thus, above-average deployment of police officers is required in the weeks where the predicted crime count is highest. Also, based on our exploratory data analysis, we should have above-average deployment of police officers between 12 PM and 12 AM
Community areas 25, 43, 19 and 30 have the highest number of single parents with children. Thus, police officers need to be deployed in these areas during the weeks when the crime count is predicted to be high within these community areas
Based on clustering, we found that there are community areas with a high number of domestic abuse incidents, a high number of vacant housing units, a high number of single parents with children and so on. These community areas need to be monitored more than average during the weeks when the crime count is predicted to be high
Special task forces that can handle extremely violent crimes should be deployed on the specific days of the week when violent crimes are high in number, based on the EDA performed. As the crime count was not predicted on a daily basis, we can look at the weekly predicted numbers to deploy the special task forces
Building new parks will help to engage people socially and might help reduce crime, as our analysis found that community areas that have more parks have less crime

Ashish Gyanchandani
