The Baltimore crime data was retrieved from Data.gov and is updated every Monday. Find the link to the original dataset here. Each row represents one offense, with information such as the location, the type of offense, the neighborhood where the offense took place, and more. The dataset shown below is the final dataset used for this analysis.
Added Variables include:
Weather Data: precipitation, snow, snow depth, average temperature
Socioeconomic Data: unemployment rate for every month from 2015-2019
Note: The tabs to the right include relatively in-depth explanations of the analyses run. For summaries of the models, see the Results page.
Objectives
Our primary goal is to find the best way to convey Baltimore’s crime data so that it is helpful to the Baltimore Police Department (BPD). Our secondary goals are:
Motivations
Predicting and analyzing crime is a field that has been around for a long time, and for good reason. Finding the motivations and situations that lead to a crime occurring could lead to a safer, more secure world. The “routine activity approach” is a theory that emphasizes the circumstances in which a crime occurs, rather than the traits of the perpetrator, to analyze crime trends (Cohen & Felson, 1979).
There is evidence that property crimes are driven by pleasant weather, which is consistent with this approach (Hipp & Curran, 2004). Property crimes are nonviolent crimes which involve the theft or destruction of someone else’s property.
There is also empirical evidence that crime rates increase in hotter years, and that crime is more prevalent in hotter parts of the year. Furthermore, the effect of temperature is stronger on violent crimes than it is on nonviolent crimes (Anderson, 1987). Conversely, there is a strong positive effect of unemployment on property crimes, while the evidence is much weaker for violent crimes (Raphael & Winter-Ebmer, 2001).
For this reason, crimes in the dataset were divided into two broad categories: violent and nonviolent. Predicting which category a single crime falls into provides insight into which factors most influence that type of crime.
Predicting the amount of daily crime can be useful for many reasons, the most important of which is staffing. By taking into account the weather, the current unemployment rate, or last week’s numbers, the Baltimore Police Department might be able to use its officers’ time more effectively by staffing only the officers it needs. The prediction models may also reveal which factors are the most important.
Well-designed data visualizations make it easier to convey complicated trends and patterns in a straightforward fashion. Graphics that are easy to understand are especially important given the primary audience for these findings, the BPD.
Sources:
Building Model 1:
Building Model 2:
See the exact metrics of both confusion matrix outputs below.
Model Comparison
| Metrics | Model 1 | Model 2 |
|---|---|---|
| AUC | 0.58 | 0.59 |
| Accuracy | 70% | 65% |
| Specificity Rate | 99.93% | 79% |
| Type I Error | 0.07% | 20% |
| Sensitivity Rate | 0.1% | 31% |
| Type II Error | 99.9% | 69% |
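Every rate in the table above derives from the four cells of a confusion matrix. The sketch below shows those definitions, assuming "violent" is treated as the positive class. The function name and the counts are hypothetical and are not taken from the Baltimore data.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the rates in the comparison table from raw confusion-matrix counts.

    Assumes the positive class is "violent" (an assumption, not stated in the source).
    """
    sensitivity = tp / (tp + fn)  # share of actual positives predicted positive
    specificity = tn / (tn + fp)  # share of actual negatives predicted negative
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "type_i_error": 1 - specificity,   # false positive rate
        "type_ii_error": 1 - sensitivity,  # false negative rate
    }


# Hypothetical counts for illustration only.
metrics = confusion_metrics(tp=31, fp=21, tn=79, fn=69)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Because the Type I error is one minus the specificity, a specificity of 79% corresponds to a Type I error of 21%; the 20% reported for Model 2 presumably reflects rounding of the underlying rates.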
Initial Model
After Bagging
After Random Forest
After Boosting
Model Comparison
The best model is boosting, since it has the lowest test MSE.
| Model | Test MSE | % Variance Explained |
|---|---|---|
| Initial | 22.93837 | |
| Bagging | 18.83332 | 28.98% |
| Random Forest | 17.33683 | 33.77% |
| Boosting | 16.98436 | |
Model Building
\[\text{Total Crime} = 1506.4027 - 0.7054\,\text{Year} + 1.3848\,\text{Month} + 0.0863\,\text{Day} \\ + 0.5819\,\text{Temperature} - 6.8273\,\text{Precipitation} + 1.0690\,\text{Unemployment}\]
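To make the fitted equation concrete, the sketch below plugs one day's values into it. The coefficients are copied from the equation above; the function name and the input values are hypothetical, and the units of each predictor are assumed to match those of the original dataset, which are not stated here.

```python
# Coefficients copied from the fitted linear regression equation.
COEFFICIENTS = {
    "intercept": 1506.4027,
    "year": -0.7054,
    "month": 1.3848,
    "day": 0.0863,
    "temperature": 0.5819,
    "precipitation": -6.8273,
    "unemployment": 1.0690,
}


def predict_total_crime(year, month, day, temperature, precipitation, unemployment):
    """Evaluate the linear model for a single day."""
    c = COEFFICIENTS
    return (
        c["intercept"]
        + c["year"] * year
        + c["month"] * month
        + c["day"] * day
        + c["temperature"] * temperature
        + c["precipitation"] * precipitation
        + c["unemployment"] * unemployment
    )


# Hypothetical inputs for illustration only.
dry_day = predict_total_crime(2019, 7, 15, temperature=80, precipitation=0.0, unemployment=5.0)
wet_day = predict_total_crime(2019, 7, 15, temperature=80, precipitation=1.0, unemployment=5.0)
print(round(dry_day, 2), round(wet_day, 2))
```

Holding everything else fixed, each additional unit of precipitation lowers the predicted daily total by about 6.83 crimes, and each additional unit of temperature raises it by about 0.58.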
Model Building
Model Comparison
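Both models had prediction intervals that contained all of the real values for January 2020. As reported in the table below, the AIC and AICc select Model 2, while the BIC and the forecast accuracy metrics (MAE, RMSE, ME, MPE, MAPE) are marginally better for Model 1. The differences are small on every metric, so the two models perform almost identically.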
| Metrics | Model 1 | Model 2 |
|---|---|---|
| AIC | 8.449309 | 8.448679 |
| AICc | 8.449316 | 8.448692 |
| BIC | 8.461382 | 8.463771 |
| MAE | 12.21906 | 12.22173 |
| RMSE | 14.5193 | 14.53334 |
| ME | 0.4678299 | -0.7078554 |
| MPE | -2.50532 | -2.73881 |
| MAPE | 12.08573 | 12.11812 |
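The accuracy rows in the table above (ME, MAE, RMSE, MPE, MAPE) are all summaries of the forecast errors. The sketch below shows one common set of definitions, assuming the error is the actual value minus the forecast. The function name and the numbers are hypothetical and are not the January 2020 data.

```python
import math


def forecast_accuracy(actual, forecast):
    """Summarize forecast errors, with error defined as actual minus forecast."""
    errors = [a - f for a, f in zip(actual, forecast)]
    pct_errors = [100 * e / a for e, a in zip(errors, actual)]  # needs nonzero actuals
    n = len(errors)
    return {
        "ME": sum(errors) / n,                             # mean error (signed bias)
        "MAE": sum(abs(e) for e in errors) / n,            # mean absolute error
        "RMSE": math.sqrt(sum(e * e for e in errors) / n), # root mean squared error
        "MPE": sum(pct_errors) / n,                        # mean percentage error (signed)
        "MAPE": sum(abs(p) for p in pct_errors) / n,       # mean absolute percentage error
    }


# Hypothetical daily counts and forecasts for illustration only.
actual = [100, 120, 80, 100]
forecast = [90, 130, 80, 110]
accuracy = forecast_accuracy(actual, forecast)
print(accuracy)
```

MAE, RMSE, and MAPE are better when lower. ME and MPE are signed measures of bias, so they are compared by how close they are to zero.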
Our analysis focused on two types of predictive modeling: predicting the category of the crime, and predicting the total number of crimes. We summarize the results of these methods here.
For more detail on the methodology, see the tabs on the Analysis page.
Predicting Violent vs. Nonviolent Crime
Crimes in the dataset were divided into two broad categories: violent and nonviolent. Predicting which category a single crime falls into provides insight into which factors most influence that type of crime. To do this we used logistic regression.
Logistic Regression
After checking that the assumptions of logistic regression were met, we fit two versions of the model. The second version was created to improve upon the first by balancing the specificity and sensitivity rates and by making the model more interpretable. It raised the sensitivity rate from 0.1% to 31%, at the cost of lower specificity and accuracy, and it identified the most significant predictors of whether a crime is violent: the month, district, precipitation, average temperature, unemployment rate, and part of day. In the future, we want to improve upon these models by finding more datasets with predictors that could have a greater influence on whether a violent crime will occur, so that the models can predict violent crime more accurately and give the Baltimore PD a better understanding of how to allocate its resources.
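The text does not say how the sensitivity and specificity rates were balanced. One common approach, shown here purely as an assumption, is to move the probability cutoff used to label a crime as violent. The sketch below scans candidate cutoffs and keeps the one where the two rates are closest. The function names, predicted probabilities, and labels are all hypothetical.

```python
def rates_at(threshold, probs, labels):
    """Return (sensitivity, specificity) when predicting 1 if prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 0)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_threshold(probs, labels, candidates):
    """Pick the candidate cutoff where sensitivity and specificity are closest."""
    def gap(threshold):
        sensitivity, specificity = rates_at(threshold, probs, labels)
        return abs(sensitivity - specificity)
    return min(candidates, key=gap)


# Hypothetical predicted probabilities of "violent" (1) and true labels.
probs = [0.12, 0.22, 0.27, 0.31, 0.36, 0.41, 0.47, 0.62]
labels = [0, 0, 0, 1, 0, 1, 0, 1]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

best = balanced_threshold(probs, labels, candidates)
print(best, rates_at(best, probs, labels), rates_at(0.5, probs, labels))
```

With imbalanced classes, the default cutoff of 0.5 tends to predict the majority class almost every time, which gives very high specificity and very low sensitivity. Lowering the cutoff trades some specificity for sensitivity.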
Predicting Daily Crime Count
Predicting the amount of daily crime can be useful for many reasons, the most important of which is staffing. By taking into account the weather, the current unemployment rate, or last week’s numbers, the Baltimore Police Department might be able to use its officers’ time more effectively by staffing only the officers it needs. The prediction models may also reveal which factors are the most important. To do this we used linear regression, decision trees, and time series modeling.
Decision Tree
For the decision tree, we first built an initial tree. A single tree is visually easy to read but is very sensitive to changes in the dataset: every time we draw a new training set, the tree looks different. To overcome this, we used two methods built on bootstrapping: bagging and random forest. For each method, 2000 trees were created and their predictions were averaged. The methods differ in how many predictors are considered at each split: bagging considers all of them, while random forest considers only a random subset. In our case, a random forest that considered only two predictors at each split lowered the test MSE from 18.83 (bagging) to 17.34. The last method we used was boosting. Boosting is similar to bagging and random forest, but the trees are built sequentially, each using information from the trees before it. Boosting gave us the best model performance, with a test MSE of 16.98. After the analysis, we could confirm that temperature was the most important variable in determining the total number of crimes for each day. The next most important variables were month and unemployment rate, followed by precipitation and year.
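The difference between bagging and boosting can be sketched with one-split trees (stumps) on a single predictor. This is a toy illustration under stated assumptions, not the models used in the analysis: the function names and the temperature and crime-count values are hypothetical, and only 200 stumps are used. Random forest is omitted because, with one predictor, there is nothing to subsample at each split.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree: (threshold, left_mean, right_mean)."""
    mean = sum(ys) / len(ys)
    best = (float("inf"), float("inf"), mean, mean)  # fallback when no split exists
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, threshold, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagging(xs, ys, n_trees, seed=0):
    """Fit each stump on a bootstrap resample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in rows], [ys[i] for i in rows]))
    return stumps


def predict_bagging(stumps, x):
    """Average the predictions of the independently fitted stumps."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


def fit_boosting(xs, ys, n_trees, learning_rate=0.1):
    """Fit stumps sequentially, each on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    fitted = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - f for y, f in zip(ys, fitted)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        fitted = [f + learning_rate * predict_stump(stump, x) for f, x in zip(fitted, xs)]
    return base, learning_rate, stumps


def predict_boosting(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical average temperatures and daily crime counts.
temps = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
counts = [110, 112, 115, 113, 120, 124, 123, 130, 133, 131, 138, 140]

bagged = fit_bagging(temps, counts, n_trees=200)
boosted = fit_boosting(temps, counts, n_trees=200)
print(mse(lambda x: predict_bagging(bagged, x), temps, counts))
print(mse(lambda x: predict_boosting(boosted, x), temps, counts))
```

The errors printed here are training errors on toy data, so they say nothing about which method generalizes better; the comparison in the analysis uses test MSE on held-out data.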
Time Series
Time series models use prior values of a series to forecast future ones, often taking into account repetitive cycles in the data, referred to as seasonality. Before modeling, the overall trend and the seasonality must be removed. The trend was removed by subtracting the prior day’s value from the current value, which is called taking the first difference. To remove the seasonality, in this case a weekly cycle, the value from one week earlier was subtracted. The differenced data was then used to forecast values for January 2020, and the prediction intervals contained all of the real values. This tells us that the weekly cycle of crime is important: we can use past values to predict future ones with no further information.
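The two differencing steps described above can be sketched in a few lines. The series below is hypothetical, built from a linear trend plus a repeating weekly pattern, so that the effect of each step is easy to see. The helper name `difference` is also an assumption. The sketch covers only the differencing, not the forecasting model itself.

```python
def difference(series, lag=1):
    """Subtract the value `lag` steps earlier from each value in the series."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Hypothetical daily counts: linear trend plus a repeating weekly pattern.
weekly_pattern = [0, 2, 3, 3, 5, 8, 6]
daily = [100 + t + weekly_pattern[t % 7] for t in range(28)]

deseasonalized = difference(daily, lag=7)       # removes the weekly cycle
stationary = difference(deseasonalized, lag=1)  # removes the remaining trend

print(deseasonalized[:5])
print(stationary[:5])
```

On real data the result is noise around zero rather than exact zeros. Forecasts made on the differenced scale have to be added back to the earlier observed values to return to daily counts.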
Linear Regression
See tab to the right.
Key Variables
Interpretation
Future Improvements
Variable Importance Ranked
Interpretation
Future Improvements
If we were to have more time, we would try to search for more datasets that include more variables regarding human behavior that could have a correlation with crime, such as:
Key Variables
Interpretation
Future Improvements
If we were to have more time, we would try to search for more datasets that include more variables regarding human behavior that could have a correlation with crime, such as:
Key Trends
Interpretation
Future Improvements
Description
From the analysis, the Northeast district had the highest number of violent crimes. The visualization to the left shows the violent crime count for each neighborhood in the Northeast district, with larger circles representing neighborhoods with higher violent crime counts.
Take Away
The three neighborhoods with the highest violent crime counts are shown in the table below.
| Neighborhood | Violent Crime Count |
|---|---|
| Frankford | 1639 |
| Belair-Edison | 1600 |
| Coldstream Homestead | 1044 |
How to Use
To view a neighborhood and its violent crime count on the map, click on its circle.
Zoom in and out of the map to click on the smaller circles.
Drag the map at a desired zoom level to view the violent crime counts of the surrounding neighborhoods.
What Can Be Done in the Future?
Further analyses could be performed to look into why these neighborhoods have the highest number of violent crimes for the northeast region of Baltimore.
Description
The visualizations to the left are density maps of violent and nonviolent crime across all of Baltimore for 2019, with labels marking the top 5 most dangerous neighborhoods of Baltimore in 2020. The neighborhood data was extracted from the article linked here.
From the start, one main goal of this analysis was to predict crime with the most relevant information available. Comparing crime levels in 2019 with the top 5 most dangerous neighborhoods in 2020 gives insight into how well the data can predict crime, and into how crime trends change over the span of one year.
Take Away
From the graphs, one can see that the volume of violent crimes exceeded the volume of nonviolent crimes in 2019. Additionally, the most crime, whether violent or nonviolent, occurred near the central district of Baltimore. This conclusion is slightly different from the result of the analysis, which found that the Northeast district had the most violent crimes for the years 2015-2019. One reason for the difference could be the natural shift of crime trends from year to year.
Three of the top five most dangerous neighborhoods of 2020 lie within the lighter colored areas, which represent the higher crime counts. Dundalk and the Fairfield Area, the top two most dangerous neighborhoods, do not overlap with any of the data in this dataframe, which is surprising. Keep in mind that these two neighborhoods could cover a wider area than is represented on the map.
What Can Be Done in the Future?
Further analyses could map each year from 2015 to 2019 to see how crime trends have changed.