We live in a country where many issues pertaining to violence have become political conversations. Police brutality, school shootings, hate crimes, and gang violence are just a few of those major issues. The United States may not fall within the top ten countries with the highest crime rate, but the fact that our crime rate is at 46.73 means there is still much progress to be made. If I steal from a store only 3 times a week when I used to do it 5 times a week, does that really fix the problem?
The data we analyzed covers a city with one of the highest crime rates in the United States: Boston, Massachusetts. Our goal was to apply analytics to this information to predict the likelihood and specifics of future crimes. Luckily, the Boston Police Department implemented a new crime incident report system that records the offense committed, the date and time of the crime, and the location of the crime. The information provided for each crime allowed us to pose questions about which crimes occur the most, where crimes occur the most, and how often certain types of crimes occur. Our data exploration led us to center on two questions that can support further predictive analytics: which type of offense occurs most often, and where crimes are committed the most.
Our data was collected by the Boston Police Department and includes 319,073 incident reports that Boston police officers responded to. The dataset was provided by Analyze Boston, which is the city’s data hub for statistics related to the citizens of Boston. The incident reports date from June 14, 2015 to September 3, 2018. From our dataset, the relevant variables include the offense code group, district, year, month, day of week, hour, UCR part, street, latitude, and longitude. The table below shows the variables used in our dataset:
## # A tibble: 6 x 10
## OFFENSE_CODE_GR… DISTRICT YEAR MONTH DAY_OF_WEEK HOUR UCR_PART STREET Lat
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl>
## 1 Larceny D14 2018 9 Sunday 13 Part One LINCO… 42.4
## 2 Vandalism C11 2018 8 Tuesday 0 Part Two HECLA… 42.3
## 3 Towed D4 2018 9 Monday 19 Part Th… CAZEN… 42.3
## 4 Investigate Pro… D4 2018 9 Monday 21 Part Th… NEWCO… 42.3
## 5 Investigate Pro… B3 2018 9 Monday 21 Part Th… DELHI… 42.3
## 6 Motor Vehicle A… C11 2018 9 Monday 21 Part Th… TALBO… 42.3
## # … with 1 more variable: Long <dbl>
Our data originally included 17 variables, but we deemed 10 of them most relevant to our research (as shown in the above table). The first variable we included was the offense code group, which identifies the type of crime reported. We used the offense code group to determine the frequency of different types of crimes. Additionally, we included the district variable, which identifies the police district within the city where the incident was reported. This information allowed us to determine where crimes were committed the most. The street on which the crime was committed and the latitude and longitude of the incident complement the district as location variables. We also used the year, month, day of the week, and hour of each crime to look for patterns in when crimes were committed and to check whether the type of crime was related to the date and time. Lastly, we included the UCR part, which indicates the seriousness of a crime, to determine where each UCR category appears the most and which UCR category occurs the most.
The variable most applicable to the two primary questions was the offense code group. The offense code group can be used to determine when and where certain types of crimes are committed. This information was relevant to us because, if we could identify a trend, we could use analytics to anticipate future crimes and brainstorm focal questions on how these frequencies can be decreased. We understood that the offense code group could vary by year, so we created the following graph to determine which code groups experienced an increase:
## # A tibble: 12 x 2
## OFFENSE_CODE_GROUP n
## <chr> <int>
## 1 Motor Vehicle Accident Response 37132
## 2 Larceny 25935
## 3 Medical Assistance 23540
## 4 Investigate Person 18750
## 5 Other 18075
## 6 Drug Violation 16548
## 7 Simple Assault 15826
## 8 Vandalism 15415
## 9 Verbal Disputes 13099
## 10 Towed 11287
## 11 Investigate Property 11124
## 12 Larceny From Motor Vehicle 10847
For the first question, “Is there a particular type of offense that happens more often during a certain part of the week or year? How does this affect the overall fluctuation in crime over that particular year?”, we created a visual that plotted the offense code group against the month the crime was committed in and the day of the week the crime was committed on. The variables used are offense code group, month, and day of week. Our dataset contained 67 different offense code groups, so we narrowed our focus to the top 12 by filtering out the code groups that did not exceed 10,000 incidents. The top 12 offense code groups accounted for 217,578 of the 319,073 incident reports, or 68.2%. We considered these 12 offense code groups an adequate representation of the crimes committed because they made up most of the data, while the other 55 offense code groups made up only approximately 31.8% of the data. Below are the table that narrows down the code groups, the table of the counts of each crime for each month, the table of the counts of crimes for each day of the week, the plot of offense code groups versus month, and the plot of offense code groups versus day of week:
## # A tibble: 12 x 2
## OFFENSE_CODE_GROUP n
## <chr> <int>
## 1 Motor Vehicle Accident Response 37132
## 2 Larceny 25935
## 3 Medical Assistance 23540
## 4 Investigate Person 18750
## 5 Other 18075
## 6 Drug Violation 16548
## 7 Simple Assault 15826
## 8 Vandalism 15415
## 9 Verbal Disputes 13099
## 10 Towed 11287
## 11 Investigate Property 11124
## 12 Larceny From Motor Vehicle 10847
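The share attributed to the top 12 code groups can be checked directly from the counts in the table above. The following is a minimal sketch in base R; the variable names (`top12`, `total_reports`, and so on) are illustrative and not taken from the original analysis code.

```r
# Counts copied from the top-12 table above (illustrative variable names)
top12 <- c(
  "Motor Vehicle Accident Response" = 37132,
  "Larceny"                         = 25935,
  "Medical Assistance"              = 23540,
  "Investigate Person"              = 18750,
  "Other"                           = 18075,
  "Drug Violation"                  = 16548,
  "Simple Assault"                  = 15826,
  "Vandalism"                       = 15415,
  "Verbal Disputes"                 = 13099,
  "Towed"                           = 11287,
  "Investigate Property"            = 11124,
  "Larceny From Motor Vehicle"      = 10847
)

# Total number of incident reports stated in the text
total_reports <- 319073

top12_total <- sum(top12)                  # reports covered by the top 12 groups
top12_share <- top12_total / total_reports # fraction of all reports
other_share <- 1 - top12_share             # fraction left to the remaining groups

cat(sprintf("Top 12 groups: %s reports (%.1f%% of all reports)\n",
            format(top12_total, big.mark = ","), 100 * top12_share))
cat(sprintf("Remaining groups: %.1f%% of all reports\n", 100 * other_share))
```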
## # A tibble: 144 x 3
## # Groups: MONTH [12]
## MONTH OFFENSE_CODE_GROUP counts
## <dbl> <chr> <int>
## 1 12 Drug Violation 965
## 2 12 Investigate Person 1575
## 3 12 Investigate Property 786
## 4 12 Larceny 2079
## 5 12 Larceny From Motor Vehicle 818
## 6 12 Medical Assistance 1711
## 7 12 Motor Vehicle Accident Response 2973
## 8 12 Other 1172
## 9 12 Simple Assault 1176
## 10 12 Towed 843
## # … with 134 more rows
## # A tibble: 7 x 2
## DAY_OF_WEEK counts
## <chr> <int>
## 1 Friday 48495
## 2 Wednesday 46729
## 3 Thursday 46656
## 4 Tuesday 46383
## 5 Monday 45679
## 6 Saturday 44818
## 7 Sunday 40313
From the above data, we concluded that nearly every crime follows the same trend of reaching its highest frequency during the summer months. Despite this overall similarity, there is some variation. Crimes such as Vandalism, Towed, and Investigate Person reach their highest frequency during late July and early August, while crimes like Larceny, Investigate Property, and Drug Violation occur the most during late June and early July. Another interesting finding appears in late February and early March, when Medical Assistance shows a clear increase along with Drug Violation. Based on the percent bar graph for “DAY_OF_WEEK”, Drug Violations occur the least on Saturday and Sunday, the two days that differ the most from the rest of the week. The weekdays show a relatively stable trend in crime occurrence. In terms of the fluctuation of crime over an entire year, how much individual crimes increase over the summer months heavily influences the total crime count for that year.
I was able to draw conclusions based on the data provided by the visuals above, but I wanted to perform a hypothesis test to support my idea that there is a significant relationship between offense code and month, and between offense code and day of the week. The significance tests used to check whether month and day of week can predict the offense code of the crime committed are performed below:
##
## Call:
## lm(formula = OFFENSE_CODE ~ MONTH, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2235.6 -1304.2 610.2 885.5 1541.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2351.8212 4.7274 497.492 < 2e-16 ***
## MONTH -5.1854 0.6409 -8.091 5.95e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1185 on 319071 degrees of freedom
## Multiple R-squared: 0.0002051, Adjusted R-squared: 0.000202
## F-statistic: 65.46 on 1 and 319071 DF, p-value: 5.952e-16
##
## Call:
## lm(formula = OFFENSE_CODE ~ DAY_OF_WEEK, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2212.2 -1314.9 593.6 884.8 1521.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2323.1543 5.3824 431.621 <2e-16 ***
## DAY_OF_WEEKMonday -4.8885 7.7283 -0.633 0.5270
## DAY_OF_WEEKSaturday -0.9517 7.7664 -0.123 0.9025
## DAY_OF_WEEKSunday -13.7463 7.9888 -1.721 0.0853 .
## DAY_OF_WEEKThursday -3.8417 7.6865 -0.500 0.6172
## DAY_OF_WEEKTuesday -7.2310 7.6980 -0.939 0.3476
## DAY_OF_WEEKWednesday -9.7247 7.6834 -1.266 0.2056
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1185 on 319066 degrees of freedom
## Multiple R-squared: 1.394e-05, Adjusted R-squared: -4.862e-06
## F-statistic: 0.7415 on 6 and 319066 DF, p-value: 0.6162
I fit a linear model for each predictor and used the p-values from the coefficient t-tests and the overall F-test to assess each model. For the first model, which uses MONTH as the predictor variable and OFFENSE_CODE as the response variable, the p-value is 5.95e-16, far below the 0.05 significance level. This allows me to reject the null hypothesis that there is no relationship between the month the crime was committed in and the offense committed. The p-value alone does not show that the model is an effective predictor, however: the R-squared of 0.0002 means that month explains almost none of the variation in offense code, and with more than 319,000 reports even a very small effect can be statistically significant. The result is still consistent with our original conclusion, because we noticed crimes occurring at different rates during different parts of the year. The second model, which uses DAY_OF_WEEK as the predictor variable and OFFENSE_CODE as the response variable, has an overall p-value of 0.6162. This does not allow me to reject the null hypothesis that there is no relationship between the day of the week the crime was committed and the offense committed. None of the individual day coefficients is significant at the 0.05 level; Sunday comes closest, with a p-value of 0.0853. I interpreted this to mean that Sunday may differ slightly from the other days, but there is not enough support to establish a significant relationship. This fits our conclusion, because we did see that crimes remained relatively constant throughout the week.
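The regressions above ask about offense codes; a simpler related check is whether the total number of incident reports is spread evenly across the week. The sketch below is not part of the original analysis. It swaps in a chi-squared goodness-of-fit test (base R `chisq.test`) on the day-of-week counts from the table earlier in this report, under the assumption that "constant throughout the week" means an equal share of reports on each day. The variable names are illustrative. With counts this large, even modest differences between days can register as statistically significant, so the percentage shares are printed alongside the test for context.

```r
# Day-of-week counts copied from the DAY_OF_WEEK table in this report
day_counts <- c(
  Friday    = 48495,
  Wednesday = 46729,
  Thursday  = 46656,
  Tuesday   = 46383,
  Monday    = 45679,
  Saturday  = 44818,
  Sunday    = 40313
)

# Goodness-of-fit test against equal expected counts for all seven days.
# Given a plain numeric vector, chisq.test() defaults to equal probabilities.
uniform_test <- chisq.test(day_counts)
print(uniform_test)

# Share of all reports falling on each day, in percent
day_share <- round(100 * day_counts / sum(day_counts), 1)
print(day_share)
```

This tests overall report volume per day, which is a different question from whether the type of offense depends on the day.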
For the second question, “What district is associated with the greatest number of crimes? Does this district tend to fall under the same general location?”, we created a visual that plotted the offense code group against the district the crime was committed in. The variables used are offense code group, district, latitude, and longitude. Although there were 12 different districts, some of the incident reports did not record a district and instead report “NA”. We had to design the code to exclude the observations that do not have a district recorded, which could affect our results. However, only 1,765 of the 319,073 incident reports, or 0.55%, did not have a district listed. We considered the remaining data representative because the incidents listed with districts make up approximately 99.45% of the data. Below are the table of the counts of crime for each district and two visuals that show the frequency of crime in each district:
## # A tibble: 13 x 2
## DISTRICT Crimes
## <chr> <int>
## 1 B2 49945
## 2 C11 42530
## 3 D4 41915
## 4 A1 35717
## 5 B3 35442
## 6 C6 23460
## 7 D14 20127
## 8 E13 17536
## 9 E18 17348
## 10 A7 13544
## 11 E5 13239
## 12 A15 6505
## 13 <NA> 1765
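The missing-district share and the district ranking can be verified from the counts in the table above. This is a minimal sketch in base R with illustrative variable names; it only re-uses the numbers printed in the table.

```r
# District counts copied from the table above (NA row kept separately)
district_counts <- c(
  B2 = 49945, C11 = 42530, D4 = 41915, A1 = 35717,
  B3 = 35442, C6 = 23460, D14 = 20127, E13 = 17536,
  E18 = 17348, A7 = 13544, E5 = 13239, A15 = 6505
)
missing_district <- 1765  # reports with no district recorded (<NA> row)

# Total reports and the share of reports without a district
total_reports <- sum(district_counts) + missing_district
missing_share <- missing_district / total_reports

# Rank the districts from most to fewest reports
ranked <- sort(district_counts, decreasing = TRUE)
highest_three <- names(head(ranked, 3))
lowest_three  <- names(tail(ranked, 3))

cat(sprintf("Reports without a district: %.2f%%\n", 100 * missing_share))
cat("Highest three districts:", paste(highest_three, collapse = ", "), "\n")
cat("Lowest three districts:", paste(lowest_three, collapse = ", "), "\n")
```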
From the above data, we concluded that District B2 in central Boston is associated with the highest number of crimes, followed by C11 on the east side and D4 in northern central Boston. District A15 has the lowest number of crimes of the twelve districts (6,505), followed by E5 (13,239) and A7 (13,544). The crimes appear to be fairly evenly spread throughout these districts, with a slightly higher concentration in the center of the districts than in the outer areas.
As with the first question, I was able to draw conclusions based on the data provided by the visuals above, but I wanted to perform a hypothesis test to support my idea that there is a significant relationship between the crimes committed and the districts in Boston. The significance tests used to check whether the city district can predict the offense committed are performed below:
##
## Call:
## lm(formula = OFFENSE_CODE ~ DISTRICT, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2365.9 -1289.6 553.6 933.1 1782.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2136.308 6.229 342.97 <2e-16 ***
## DISTRICTA15 233.832 15.869 14.73 <2e-16 ***
## DISTRICTA7 227.695 11.879 19.17 <2e-16 ***
## DISTRICTB2 233.537 8.158 28.63 <2e-16 ***
## DISTRICTB3 316.136 8.826 35.82 <2e-16 ***
## DISTRICTC11 255.279 8.449 30.21 <2e-16 ***
## DISTRICTC6 167.178 9.893 16.90 <2e-16 ***
## DISTRICTD14 240.051 10.376 23.14 <2e-16 ***
## DISTRICTD4 -87.946 8.477 -10.37 <2e-16 ***
## DISTRICTE13 183.892 10.855 16.94 <2e-16 ***
## DISTRICTE18 333.221 10.894 30.59 <2e-16 ***
## DISTRICTE5 340.582 11.978 28.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1177 on 317296 degrees of freedom
## (1765 observations deleted due to missingness)
## Multiple R-squared: 0.01364, Adjusted R-squared: 0.01361
## F-statistic: 399 on 11 and 317296 DF, p-value: < 2.2e-16
Once again, I fit a linear model and used the p-values to assess it. The overall p-value of the model, which uses DISTRICT as the predictor variable and OFFENSE_CODE as the response variable, is less than 2.2e-16, far below the 0.05 significance level. This allows me to reject the null hypothesis that there is no relationship between the district the crime was committed in and the offense committed. The p-value for each coefficient is also significant, with each value being less than 2e-16, which supports the conclusion that the offenses committed differ across districts. The R-squared of 0.0136 shows, however, that district explains only about 1.4% of the variation in offense code, so district is a statistically significant predictor but a weak one.
When approaching the first question, about whether there is a certain part of the week or the year when a certain crime is committed, we discovered that each of the top 12 crimes experienced an increase in count during the summertime (June through August). We noticed that certain crimes (vandalism, towing, and investigate person) reached a peak at the end of July and beginning of August, and other crimes (larceny, investigate property, and drug violation) reached a peak at the end of June and beginning of July. This information also demonstrated that the frequency of crimes during the summer significantly impacts the total year’s crime rate. Additionally, we found that crimes remained fairly constant throughout the week. There were no pronounced peaks or troughs in the data for days of week versus offense codes.
When approaching the second question, about whether certain districts experience more crimes and whether the districts with more crime fall in the same location, we found that central Boston (B2) is associated with the highest number of crimes. Districts C11 and D4, which fall on the east side and in the northern central part of Boston respectively, have the second and third highest numbers of crimes. District A15 shows the lowest number of crimes of all the districts, followed by E5 and A7. The crimes in Boston seem to be fairly evenly spread out, but it does appear that more of the crimes occur in the central region.
This data is highly relevant to the state of our country today. With different types of crime becoming rather frequent, there is more discussion of policies and laws we can implement to decrease the crime rate. Every day, we hear news about a terrorist attack, a person of color being wrongfully killed, a public setting being attacked with gun violence, or someone being wrongfully targeted because of their personal beliefs. Crime is everywhere. The conversation is growing about what we can do to reduce the frequency of crimes, and this dataset demonstrates that we can draw conclusions about when and where certain types of crime are committed and thus create a predictive model that shows when and where we need to pay more attention. The state government, the city government, financial committees, law enforcement, and any authority that can influence budgets, police resources, and employees can use this information to determine which areas need more attention, safety, and security. Of course, there is no way to eradicate all crime or to know exactly when and where crimes will be committed, but there is an established pattern. If we can offer more security in the areas where crimes are most frequent or in the months when the crime count peaks, we could attempt to get ahead of future incidents and decrease the crime rate. This data also raises further questions about what more can be done to prevent these crimes, like providing shelter, education, and jobs. It would be incredibly interesting to see whether implementing these opportunities in high-crime areas can impact the crime rate.
Although most of the data demonstrated interesting trends to draw conclusions from, there is definitely more that could be done to draw more secure conclusions. If we had information about the demographics of those involved in the crimes reported, the poverty level of the locations where crimes were committed, and the incident reports of other cities, including cities with low crime rates, we could produce statistics that lead to interpretations deeper than the surface-level conclusions we have drawn. I would have liked to use an ANOVA test to determine the differences in statistics between different areas and to research further the underlying causes behind these differences, like poverty level and the effectiveness of the local government. Additionally, I would have liked to combine predictor variables and test their interactions to see whether a more predictive analysis could have been created, for example by testing a multiple regression on whether location and time together impacted the offense committed (lm(OFFENSE_CODE ~ MONTH + DISTRICT)). As always, there is definitely room for improvement in this research, but given the data we did interpret, I do think this research provides relevance and support for those in power to make a change. Collecting statistics is one thing, but actually making the effort to apply these statistics to real-world situations is another.
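The multiple regression and interaction test proposed above could be set up along the following lines. This is only a sketch under stated assumptions: the data frame `toy` is simulated random data, not the Boston dataset, so its output says nothing about real crime patterns. MONTH is treated as a categorical factor here, which is an assumption (the models earlier in this report used it as a number), and OFFENSE_CODE is kept as a numeric response only to mirror the formula quoted in the text. The names `toy`, `additive`, `interact`, and `cmp` are hypothetical.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated stand-in for the crime data (NOT real incident reports)
n <- 600
toy <- data.frame(
  MONTH    = factor(sample(1:12, n, replace = TRUE)),
  DISTRICT = factor(sample(c("A1", "B2", "C11"), n, replace = TRUE))
)
toy$OFFENSE_CODE <- rnorm(n, mean = 2300, sd = 1000)

# Additive model: month and district effects, no interaction
additive <- lm(OFFENSE_CODE ~ MONTH + DISTRICT, data = toy)

# Interaction model: the month effect is allowed to differ by district
interact <- lm(OFFENSE_CODE ~ MONTH * DISTRICT, data = toy)

# F-test comparing the two nested models; a small p-value would suggest
# that the interaction terms improve the fit
cmp <- anova(additive, interact)
print(cmp)
```

On the real data, the same comparison would use the `crime` data frame from the models above in place of `toy`.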