Chicago’s economy is booming and unemployment is low, but the wealth is not equally distributed. This situation has led to a serious crime problem: Chicago’s crime rate is higher than the US average. Another problem is a shortage of police officers. The total number of officers in Chicago is viewed as insufficient to cover the entire city. So, how do we prevent crime? It is a classic analytics/optimization problem in which resources are limited and the area to cover is vast. Accordingly, CCPD has started using analytics to prevent crime rather than merely react to it in a constrained setting. Here, I present my own version of that analysis, combining exploratory analysis with machine learning models to see how they can help prevent crime. Below are a few statistics that help convey the enormity of the problem facing CCPD:
The City of Chicago Police Department (CCPD) needs to understand crime statistics in order to deploy officers on the street. The goal is to analyze the crime data and other external data and give recommendations on where CCPD should deploy their officers.
The “Crimes - 2001 to present” dataset from the Chicago Data Portal and the “Chicago Community Area (CCA) CDS” data were used for the crime analysis. Only the crime data for 2016 and 2017 were considered, because the CCA CDS dataset containing the socio-economic indicators was collected over the period 2013-17. My task is to predict the weekly crime count for each community area within Chicago for 54 weeks for the year 2017. The training data is from 2016 and the test data is from 2017. The data was analysed in R. The goal is to make the predictions as accurate as possible by minimizing the root mean squared error.
\[y_{Community} = \sum_{i=1}^n \text{crime count per week}\]
Root mean squared error is defined as:
\[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \widehat{y_i})^2},\]
where \(\widehat{y_i}\) is the predicted crime count per week for a Community and \(y_i\) is the actual crime count per week for a Community.
A word of caution: we need to be careful about racial bias, as some variables are based on race. A race variable coming out as significant does not imply that a particular race is involved in any criminal activity.
Chicago crime dataset - It reflects the reported incidents of crime that occurred in the City of Chicago. It contains 269,096 records.
A few of the important variables and their descriptions:
- Id - Unique identifier for the record
- Community Area - Indicates the community where the incident occurred. Chicago has 77 communities
- Date - Date when the incident occurred
- Primary Type - Primary type of crime. There are 33 different types of crimes
- Arrest - Indicates whether an arrest was made or not
Chicago Community Area dataset - It contains the socio-economic indicators for the 77 community areas within Chicago, in 77 records.
A few of the important variables and their descriptions:
- VAC_HU - Vacant Housing Units
- CT_SP_CHILD - Single Parent with Child
- TOT_POP - Total Population
- EMP, UNEMP etc. - Employment Status
- WALK_BIKE, CARPOOL etc. - Mode of Travel to Work
- JUST_DATE - Week number
- FOR_BORN - Foreign born
- A20_34, A35_49 etc. - Age cohorts
- OPEN_SPACE_PER_1000 - Accessible Park Acreage per 1,000 Residents
- INC_LT_25K - Household income less than $25,000
- THREEOM_VEH - 3 or More Vehicles Available
After importing the data, it is necessary to visualize it to check for missing values and outliers.
The columns containing missing values were left as is because they were not used in the analysis.
New variables such as month, time of day and day of week were created to aid the analysis.
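The derived variables and the weekly aggregation can be sketched in base R as follows. This is a minimal illustration, not the original analysis code: the toy data frame, the ISO timestamp format, the column names and the time-of-day cut points are all assumptions.

```r
# Toy stand-in for the crime data; column names and timestamp format are assumed.
crimes <- data.frame(
  Date = c("2017-01-03 14:15:00", "2017-01-04 23:40:00", "2017-01-12 08:05:00"),
  Community_Area = c(25, 25, 8),
  stringsAsFactors = FALSE
)

ts <- as.POSIXct(crimes$Date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

crimes$month       <- as.integer(format(ts, "%m"))
crimes$day_of_week <- as.integer(format(ts, "%u"))  # 1 = Monday, ..., 7 = Sunday
crimes$hour        <- as.integer(format(ts, "%H"))
# Hypothetical time-of-day buckets: 0-5, 6-11, 12-17 and 18-23 hours.
crimes$time_of_day <- cut(crimes$hour, breaks = c(-1, 5, 11, 17, 23),
                          labels = c("night", "morning", "afternoon", "evening"))
# Week of year with Sunday as the first day; days before the first Sunday are week 0.
crimes$week        <- as.integer(format(ts, "%U"))

# Weekly crime count per community area: one row per (area, week) pair.
crimes$n <- 1L
weekly <- aggregate(n ~ Community_Area + week, data = crimes, FUN = sum)
```

The `weekly` table has the shape the models need: one response value per community area and week.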
A few variables are skewed, and it usually helps to bring their distributions closer to normal with either a log or a Box-Cox transformation. This matters especially for linear regression, where skewed variables can cause problems. However, after the log transformation the results did not improve; in fact, the QQ plot showed the residuals deviating further from a normal distribution. The data was therefore kept as is, without any transformation.
The plot below shows that the crime count variable is slightly skewed. To rectify this, a log transformation was applied before performing linear regression, but the QQ plot worsened.
The plot below depicts the crime count for each community area within Chicago. Community areas filled with a darker shade of blue have a higher crime count. The community areas where the crime count is high include 23, 24, 25, 28, 29, 8, 32, 66, 67 and 68.
We performed k-means clustering to see which community areas can be grouped together. The variables considered for clustering are aggregated at the community area level: arrest count, domestic abuse count, counts for various types of crime and different socio-economic indicators. The ideal number of clusters was selected using the “WSS” (within-cluster sum of squares) and “Silhouette” methods.
The silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a useful clustering. The average silhouette method computes the average silhouette of observations for different values of k.
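Both selection criteria can be sketched in R as follows. The data here is synthetic (three well-separated groups standing in for community areas), and the range of k values and `nstart = 25` are assumptions rather than settings taken from the analysis.

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Synthetic stand-in: 77 "areas" in three groups, described by two features.
x <- rbind(
  matrix(rnorm(48 * 2, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(15 * 2, mean = 4, sd = 0.3), ncol = 2),
  matrix(rnorm(14 * 2, mean = 8, sd = 0.3), ncol = 2)
)
x <- scale(x)  # k-means is distance based, so put features on a common scale
d <- dist(x)

ks <- 2:8
wss <- numeric(length(ks))      # total within-cluster sum of squares per k
avg_sil <- numeric(length(ks))  # average silhouette width per k
for (i in seq_along(ks)) {
  km <- kmeans(x, centers = ks[i], nstart = 25)
  wss[i] <- km$tot.withinss
  avg_sil[i] <- mean(silhouette(km$cluster, d)[, "sil_width"])
}

# Silhouette method: pick the k with the largest average width.
best_k <- ks[which.max(avg_sil)]
# WSS ("elbow") method: plot wss against ks and look for the bend.
# plot(ks, wss, type = "b")
```

The WSS criterion is read off a plot, while the silhouette criterion gives a single number per k, which is why the two are often used together.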
The number of community areas within each cluster, and the mean values of various variables for each of the three clusters, are as follows.
| three_cluster | Community_Area_Count |
|---|---|
| 1 | 48 |
| 2 | 15 |
| 3 | 14 |
Summary of the three clusters is as follows:
- Cluster1 - Low Arrest Count, Low Domestic Abuse Count, Low Vacant Housing Units, Low Single Parent with Child, Low Income Less than 25K, Low Percent Aged 25 and above without Diploma
- Cluster2 - Medium Arrest Count, Medium Domestic Abuse Count, High Vacant Housing Units, Medium Single Parent with Child, High Income Less than 25K, Medium Percent Aged 25 and above without Diploma
- Cluster3 - High Arrest Count, High Domestic Abuse Count, High Vacant Housing Units, High Single Parent with Child, High Income Less than 25K, High Percent Aged 25 and above without Diploma
The plot below shows the arrest percentage for the community areas within Chicago. Not all 77 community areas are shown, only the 20 with the highest crime counts. It is assumed that a low arrest percentage indicates that suspects may have fled the crime scene. Thus, the community areas whose arrest percentage is low compared to the overall Chicago arrest percentage need to be focused on, given that these areas have high crime activity.
We can see from the plot that the community areas highlighted in “Red” have a low arrest percentage. The community areas with a low arrest percentage are 24, 22, 28, 6, 8, 66, 32 and 43.
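The comparison behind this plot can be sketched in base R: compute the arrest percentage per community area, keep the areas with the most crimes, and flag those below the city-wide percentage. The toy records and column names are assumptions for illustration only.

```r
# Toy incident records; Community_Area and Arrest are assumed column names.
crimes <- data.frame(
  Community_Area = c(25, 25, 25, 25, 8, 8, 8, 43, 43, 43),
  Arrest = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# City-wide arrest percentage used as the reference line.
city_pct <- 100 * mean(crimes$Arrest)

crime_count <- tapply(crimes$Arrest, crimes$Community_Area, length)
arrest_pct  <- tapply(crimes$Arrest, crimes$Community_Area,
                      function(a) 100 * mean(a))

by_area <- data.frame(
  Community_Area = as.integer(names(crime_count)),
  crime_count = as.vector(crime_count),
  arrest_pct = as.vector(arrest_pct)
)

# Keep the (up to) 20 areas with the most crimes, then flag low arrest rates.
top <- head(by_area[order(-by_area$crime_count), ], 20)
top$below_city <- top$arrest_pct < city_pct
```

Rows with `below_city` set to `TRUE` correspond to the areas that would be highlighted in red.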
The plot below shows the correlation between the crime count and the independent variables. Most of the variables are correlated with the crime count, which makes linear regression a reasonable starting point.
Linear regression is one of the simplest supervised learning approaches. As the predictors are correlated with the crime count, it is sensible to start with the simplest approach. As we saw earlier, some variables were skewed, so I log-transformed them and fitted a linear regression model. The results were worse than those of the model fitted without any log or Box-Cox transformation.
After running the linear regression, every variable came out significant except NATIVE and FOREIGN BORN. The adjusted \(R^2\) was found to be 93.44%. The root mean squared error on the test data was found to be 27.07%.
## [1] 27.07018
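The fit-then-score workflow can be sketched in base R as follows. The data is synthetic, with two predictors named after variables in the dataset; the helper `rmse` and the data generator `make_data` are hypothetical names, and the coefficients used to simulate the data are arbitrary.

```r
# RMSE as defined earlier: square root of the mean squared prediction error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(1)
make_data <- function(n) {
  d <- data.frame(VAC_HU = runif(n, 0, 100), UNEMP_PCT = runif(n, 2, 30))
  # Arbitrary linear signal plus noise, for illustration only.
  d$crime_count <- 20 + 0.8 * d$VAC_HU + 1.5 * d$UNEMP_PCT + rnorm(n, sd = 5)
  d
}
train <- make_data(200)  # stands in for the 2016 weeks
test  <- make_data(200)  # stands in for the 2017 weeks

fit <- lm(crime_count ~ VAC_HU + UNEMP_PCT, data = train)

adj_r2    <- summary(fit)$adj.r.squared
test_rmse <- rmse(test$crime_count, predict(fit, newdata = test))
```

Scoring on held-out data rather than on the training rows is what makes the RMSE a measure of predictive accuracy.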
Influential Variables
It is important to check whether any observations are unduly influencing the linear regression. An observation being an outlier does not by itself make it influential; if it is not influential, excluding it from the model makes little difference. Looking at the 10 largest Cook’s distance values, we can see that none of them is influencing the linear regression. Thus, we do not need to exclude any observations.
| .cooksd | crime_count | COMMUNITY_AREA_NAME |
|---|---|---|
| 0.0477839 | 356 | Loop |
| 0.0273691 | 144 | Austin |
| 0.0133216 | 217 | Lake View |
| 0.0083669 | 407 | Austin |
| 0.0080259 | 237 | Austin |
| 0.0070439 | 235 | North Lawndale |
| 0.0070240 | 189 | Lake View |
| 0.0063131 | 245 | Austin |
| 0.0057752 | 55 | West Town |
| 0.0053697 | 391 | Austin |
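A table like the one above can be produced with `cooks.distance()` from base R. The sketch below uses synthetic data; the cutoffs shown (a distance above 1, or above 4/n) are common rules of thumb and an assumption here, not thresholds stated in the analysis.

```r
set.seed(7)
d <- data.frame(x = rnorm(100))
d$y <- 3 + 2 * d$x + rnorm(100)

fit <- lm(y ~ x, data = d)

# Cook's distance: how much all fitted values change when one row is dropped.
d$cooksd <- cooks.distance(fit)

# Ten rows with the largest influence, most influential first.
top10 <- head(d[order(-d$cooksd), ], 10)

# Rule-of-thumb cutoffs (assumed, not from the original analysis).
flag_strict <- d$cooksd > 1
flag_loose  <- d$cooksd > 4 / nrow(d)
```

Values far below 1, as in the table, suggest that no single week is driving the fitted coefficients.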
Heteroskedasticity & Pattern Detection
Next, it is important to visualize the model to assess whether it fits the data correctly. We plot the residuals against the fitted values to check for non-linearity or heteroskedasticity. We can see from the plot below that there is no problem with our model.
QQ Plot
Next, we checked whether the residuals are normally distributed. The plot shows that they are not exactly normal. However, after a log transformation the results were worse than before, so we keep the response variable as is.
Multicollinearity
Finally, we need to check whether there is any multicollinearity. We calculate the variance inflation factor (VIF) for each independent variable. Even though the accuracy might be good, multicollinearity masks the variables that are correlated with the response variable: they can become statistically insignificant because of collinearity. If the VIF exceeds 10, we say that there is multicollinearity.
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "contrasts" "xlevels" "call" "terms"
## [13] "model"
| Variable | VIF |
|---|---|
| TOT_POP | 28073360711 |
| UND19 | 459530524 |
| A20_34 | 1267319203 |
| A35_49 | 356836454 |
| A50_64 | 194547973 |
| A65_74 | 34479670 |
| A75_84 | 9502277 |
| OV85 | 1600431 |
| MED_AGE | 5 |
| WHITE | 8584864437 |
| HISP | 6173287414 |
| BLACK | 6419196905 |
| ASIAN | 389174728 |
| OTHER | 17848964 |
| POP_HH | 6256 |
| UNEMP_PCT | 14 |
| TOT_COMM | 5123209368 |
| DROVE_AL | 968986739 |
| CARPOOL | 36657478 |
| TRANSIT | 977282020 |
| WALK_BIKE | 211490497 |
| COMM_OTHER | 7519096 |
| NO_VEH | 560216 |
| ONE_VEH | 890280 |
| TWO_VEH | 147086 |
| THREEOM_VEH | 16217 |
| PERCENT_AGED_25_WITHOUT_HIGH_SCHOOL_DIPLOMA | 24 |
| INC_LT_25K | 182 |
| MEDINC | 35 |
| TOT_HH | 6352977351 |
| OWN_OCC_HU | 1309566770 |
| RENT_OCC_HU | 3316000055 |
| VAC_HU | 202221990 |
| HU_TOT | 6358088664 |
| BR_0_1 | 458986040 |
| BR_2 | 219680781 |
| BR_3 | 64066619 |
| BR_4 | 7052947 |
| BR_5 | 1855081 |
| AVG_VMT | 10 |
| OPEN_SPACE_PER_1000 | 5 |
| CT_1PHH | 1660962417 |
| CT_2PHH | 742525024 |
| CT_3PHH | 81703791 |
| CT_4MPHH | 218848902 |
| CT_FAM_HH | 1340887582 |
| CT_SP_WCHILD | 224 |
| CT_NONFAM_HH | 3639874835 |
| NATIVE | 21959 |
| FOR_BORN | 3515 |
| just_date | 1 |
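The VIF of a predictor can be computed by regressing it on all the other predictors and taking \(1 / (1 - R_j^2)\). The base R sketch below does this on synthetic data. The helper name `vif_manual` is hypothetical, and the three toy columns only borrow names from the dataset.

```r
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on all remaining predictors.
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

set.seed(3)
n <- 300
TOT_HH  <- rnorm(n, mean = 1000, sd = 200)
HU_TOT  <- TOT_HH + rnorm(n, mean = 0, sd = 10)  # almost a copy of TOT_HH
MED_AGE <- rnorm(n, mean = 35, sd = 5)           # unrelated to the other two

vifs <- vif_manual(data.frame(TOT_HH, HU_TOT, MED_AGE))
collinear <- names(vifs)[vifs > 10]  # threshold of 10, as used in the text
```

Count variables that are near-sums or near-copies of each other, like the household and housing-unit totals here, are the typical source of very large VIF values.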
As we saw above, when there are many independent variables there is a high possibility of collinearity, which masks the variables that are correlated with the response variable. Our goal is not just to improve accuracy but also to identify the variables that are directly related to the response variable. In addition, the model tends to overfit the data as the number of variables increases. Lasso shrinks the regression coefficients, and in turn the variance, by penalising their magnitude. Thus, Lasso helps to mitigate collinearity and sets the coefficients of variables that are not important to zero.
The penalty to apply was determined using 10-fold cross-validation. Here, lambda is the tuning parameter.
| Variable_Name | Coef_Value |
|---|---|
| just_date_00 | -36.66 |
| OPEN_SPACE_PER_1000 | -1.40 |
| UNEMP_PCT | -0.20 |
| PERCENT_AGED_25_WITHOUT_HIGH_SCHOOL_DIPLOMA | 0.36 |
The root mean squared error comes out to be 28.54%.
## [1] 0.2854015
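A Lasso fit with lambda chosen by 10-fold cross-validation can be sketched with the `glmnet` package as follows. The data is synthetic, and the choice of `lambda.min` (rather than `lambda.1se`) is an assumption; the analysis does not say which rule was used.

```r
library(glmnet)

set.seed(11)
n <- 300
p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 5 + 3 * x[, 1] - 2 * x[, 2] + rnorm(n)  # only x1 and x2 carry signal
train <- 1:200
test  <- 201:300

# alpha = 1 selects the Lasso penalty; nfolds = 10 gives 10-fold CV.
cv_fit <- cv.glmnet(x[train, ], y[train], alpha = 1, nfolds = 10)
best_lambda <- cv_fit$lambda.min

# Coefficients at the selected lambda; zeros are variables Lasso dropped.
coefs   <- as.matrix(coef(cv_fit, s = "lambda.min"))
nonzero <- rownames(coefs)[coefs[, 1] != 0]

pred <- as.vector(predict(cv_fit, newx = x[test, ], s = "lambda.min"))
lasso_rmse <- sqrt(mean((y[test] - pred)^2))
```

Setting `alpha = 0` instead would give ridge regression, which shrinks coefficients but does not set them exactly to zero.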
Gradient boosting is a tree-based method whose prediction accuracy often surpasses that of most other algorithms. It requires little to no preprocessing of the data. Boosting can be described as a collection of decision trees in which each tree is built sequentially.
Gradient boosting requires several parameters to be tuned, including the number of trees, the shrinkage (learning rate) and the tree depth.
As there are many parameters to tune, we perform a grid search that iterates over combinations of hyperparameters.
Based on the minimum RMSE, the optimal values of the parameters were found to be:
- Number of trees: 319
- Shrinkage (learning rate): 0.1
- Depth: 3
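A grid search of this kind can be sketched with the `gbm` package as follows. The data is synthetic, the grid values and the cap of 500 trees are assumptions, and a simple hold-out set is used for scoring; the original analysis may have used a different grid or cross-validation.

```r
library(gbm)

set.seed(5)
n <- 600
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
d$y <- 10 * sin(pi * d$x1) + 5 * d$x2^2 + rnorm(n)
train <- d[1:400, ]
valid <- d[401:600, ]

max_trees <- 500
grid <- expand.grid(shrinkage = c(0.01, 0.1), depth = c(1, 3),
                    n_trees = NA, rmse = NA)

for (i in seq_len(nrow(grid))) {
  fit <- gbm(y ~ ., data = train, distribution = "gaussian",
             n.trees = max_trees, shrinkage = grid$shrinkage[i],
             interaction.depth = grid$depth[i], n.minobsinnode = 10)
  # One column of predictions per ensemble size from 1 to max_trees.
  pred <- predict(fit, newdata = valid, n.trees = 1:max_trees)
  rmse_by_trees <- sqrt(colMeans((pred - valid$y)^2))
  grid$n_trees[i] <- which.min(rmse_by_trees)
  grid$rmse[i]    <- min(rmse_by_trees)
}

# Hyperparameter combination with the lowest validation RMSE.
best <- grid[which.min(grid$rmse), ]
```

Scoring every ensemble size from one fit avoids refitting the model for each candidate number of trees.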
Variable Importance
The plot below depicts the variables that were most influential in reducing the mean squared error, averaged across the trees. The variables with the largest average decrease in MSE are considered most important.
Partial Dependence Plot (PDP)
PDPs help us understand how the response variable changes with an independent variable. A PDP plots the average predicted crime count as a function of a variable such as VAC_HU, averaging over the values of the other variables.
The root mean squared error from fitting the boosting model comes out to be 22.74%.
## [1] 0.227425
Model Summary
## MachineLearning_method RMSE
## 1 Linear Regression 27.07%
## 2 Lasso Regression 28.54%
## 3 Gradient Boosting 22.74%
The goal was to deploy the officers based on the crime data and socio-economic data. Following are the recommendations: