INTRODUCTION

We live in a country where many issues pertaining to violence have become part of the political conversation. Police brutality, school shootings, hate crimes, and gang violence are just a few of those major issues. The United States may not fall within the top ten countries with the highest crime rate, but the fact that our crime rate is at 46.73 means there is still much progress to be made. If I only steal from a store 3 times a week, when I used to do it 5 times a week, does that really make the issue better?

The data we analyzed covers a city with one of the highest crime rates in the United States: Boston, Massachusetts. Our goal was to apply analytics to this information to predict the likelihood and specifics of future crimes. Luckily, the Boston Police Department implemented a new crime incident report system that records the offense committed, the date and time of the crime, and the location of the crime. The information provided for each crime allowed us to ask which crimes occur the most, where crimes occur the most, and how often certain types of crimes occur. Our data exploration led us to center on two questions, regarding which type of offense occurs most often and where crimes are committed the most, that can help further predictive analytics:

  1. Is there a particular type of offense that happens more often during a certain part of the week or year? How does this affect the overall fluctuation in crime over that particular year?
  2. What district is associated with the greatest number of crimes? Does this district tend to fall under the same general location?

DATA

Our data was collected by the Boston Police Department and includes 319,073 reports of incidents that Boston police officers responded to. The dataset was provided by Analyze Boston, which is Boston’s data hub for statistics related to the city and its citizens. The incident reports date from June 14, 2015 to September 3, 2018. From our dataset, the relevant variables include the offense code group, district, year, month, day of week, hour, UCR part, street, latitude, and longitude. The table below shows the first six rows of the variables used from our dataset:

## # A tibble: 6 x 10
##   OFFENSE_CODE_GR… DISTRICT  YEAR MONTH DAY_OF_WEEK  HOUR UCR_PART STREET   Lat
##   <chr>            <chr>    <dbl> <dbl> <chr>       <dbl> <chr>    <chr>  <dbl>
## 1 Larceny          D14       2018     9 Sunday         13 Part One LINCO…  42.4
## 2 Vandalism        C11       2018     8 Tuesday         0 Part Two HECLA…  42.3
## 3 Towed            D4        2018     9 Monday         19 Part Th… CAZEN…  42.3
## 4 Investigate Pro… D4        2018     9 Monday         21 Part Th… NEWCO…  42.3
## 5 Investigate Pro… B3        2018     9 Monday         21 Part Th… DELHI…  42.3
## 6 Motor Vehicle A… C11       2018     9 Monday         21 Part Th… TALBO…  42.3
## # … with 1 more variable: Long <dbl>

Our data originally included 17 variables, but we deemed 10 of them most substantial to our research (as shown in the above table). The first variable we included was the offense code group, which identifies the type of crime reported. We used the offense code group to determine the frequency of different types of crimes. Additionally, we included the district variable, which identifies the police district within the city where the incident was reported. This information allowed us to determine where crimes were committed the most. Variables that complement the district with more detail about the location of each crime include the street on which the crime was committed and the latitude and longitude of the incident reported. We also used the year, month, day of the week, and hour at which the crime was committed to look for patterns in when crimes were committed and to see whether there was a relationship between the type of crime and the date and time. Lastly, we included the UCR part, which indicates the seriousness of a crime, to determine where UCR crimes appear the most and which specific UCR crime occurs the most.

The variable most applicable to the two primary questions was the offense code group. The offense code group can be used to determine when and where certain types of crimes are committed. This information proved relevant to us because if we could determine a trend, we could use analytics to anticipate future crimes and brainstorm focal questions on how these frequencies can be decreased. We understood that the frequency of each offense code group could vary by year, so we first looked at which code groups were most common overall. The output below lists the 12 most frequent code groups:

## # A tibble: 12 x 2
##    OFFENSE_CODE_GROUP                  n
##    <chr>                           <int>
##  1 Motor Vehicle Accident Response 37132
##  2 Larceny                         25935
##  3 Medical Assistance              23540
##  4 Investigate Person              18750
##  5 Other                           18075
##  6 Drug Violation                  16548
##  7 Simple Assault                  15826
##  8 Vandalism                       15415
##  9 Verbal Disputes                 13099
## 10 Towed                           11287
## 11 Investigate Property            11124
## 12 Larceny From Motor Vehicle      10847
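
As a quick check of how much of the dataset these 12 groups cover, the counts in the table above can be summed in base R. This is only a sketch: the vector `top12` is typed in by hand from the table, `total_reports` is the 319,073 figure quoted earlier in this section, and both names are made up for illustration.

```r
# Counts copied by hand from the table above (not read from the raw data).
top12 <- c(
  "Motor Vehicle Accident Response" = 37132,
  "Larceny"                         = 25935,
  "Medical Assistance"              = 23540,
  "Investigate Person"              = 18750,
  "Other"                           = 18075,
  "Drug Violation"                  = 16548,
  "Simple Assault"                  = 15826,
  "Vandalism"                       = 15415,
  "Verbal Disputes"                 = 13099,
  "Towed"                           = 11287,
  "Investigate Property"            = 11124,
  "Larceny From Motor Vehicle"      = 10847
)
total_reports <- 319073  # number of incident reports in the dataset

# Total and percentage share of the 12 most frequent groups.
top12_total <- sum(top12)
top12_share <- round(100 * top12_total / total_reports, 1)
other_share <- round(100 - top12_share, 1)

cat("Top 12 total:", top12_total, "\n")
cat("Top 12 share (%):", top12_share, "\n")
cat("Remaining groups (%):", other_share, "\n")
```

The two shares add up to 100 by construction, which makes it easy to spot a slip when quoting them in the text.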

RESULTS

For the first question, “Is there a particular type of offense that happens more often during a certain part of the week or year? How does this affect the overall fluctuation in crime over that particular year?”, we created visuals that plotted the offense code group against the month and the day of the week the crime was committed on. The variables used are offense code group, month, and day of week. Our dataset contained 67 different code groups, so we narrowed it down to the top 12 by filtering out the code groups that did not exceed 10,000 incidents. The top 12 offense code groups accounted for 217,578 of the 319,073 incident reports, or 68.2%. We thought these 12 offense code groups were an adequate representation of the crimes committed because they made up most of the data, while the other 55 offense code groups only made up approximately 31.8% of the data. Below is the table that narrows down the code groups, the table of the counts of each crime for each month, the table of the counts of crimes for each day of the week, the plot of offense code groups versus month, and the plot of offense code groups versus day of week:

## # A tibble: 12 x 2
##    OFFENSE_CODE_GROUP                  n
##    <chr>                           <int>
##  1 Motor Vehicle Accident Response 37132
##  2 Larceny                         25935
##  3 Medical Assistance              23540
##  4 Investigate Person              18750
##  5 Other                           18075
##  6 Drug Violation                  16548
##  7 Simple Assault                  15826
##  8 Vandalism                       15415
##  9 Verbal Disputes                 13099
## 10 Towed                           11287
## 11 Investigate Property            11124
## 12 Larceny From Motor Vehicle      10847
## # A tibble: 144 x 3
## # Groups:   MONTH [12]
##    MONTH OFFENSE_CODE_GROUP              counts
##    <dbl> <chr>                            <int>
##  1    12 Drug Violation                     965
##  2    12 Investigate Person                1575
##  3    12 Investigate Property               786
##  4    12 Larceny                           2079
##  5    12 Larceny From Motor Vehicle         818
##  6    12 Medical Assistance                1711
##  7    12 Motor Vehicle Accident Response   2973
##  8    12 Other                             1172
##  9    12 Simple Assault                    1176
## 10    12 Towed                              843
## # … with 134 more rows
## # A tibble: 7 x 2
##   DAY_OF_WEEK counts
##   <chr>        <int>
## 1 Friday       48495
## 2 Wednesday    46729
## 3 Thursday     46656
## 4 Tuesday      46383
## 5 Monday       45679
## 6 Saturday     44818
## 7 Sunday       40313

From the above data, we concluded that nearly every crime follows the same trend of reaching its highest frequency during the summer months. Despite the overall similarity among the crimes, there is some variation. Crimes such as Vandalism, Towed, and Investigate Person reach their highest frequency during late July and early August, while crimes like Larceny, Investigate Property, and Drug Violation occur the most during late June and early July. Another interesting finding appeared in late February and early March, when Medical Assistance experiences a clear increase along with Drug Violation. Based on the percent bar graph for “DAY_OF_WEEK”, one clear observation is that Drug Violations occur the least on Saturday and Sunday, the two days that show the most variation compared to every other day of the week. The weekdays experience a relatively stable trend in crime occurrence. In terms of the fluctuation of crime over an entire year, how much individual crimes increase over the summer months can heavily influence the total crime count for the year.
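
One thing to keep in mind when reading monthly counts that pool all years together is that the reports run from June 14, 2015 to September 3, 2018, so the calendar months are not covered equally. The base R sketch below counts how many days of each calendar month fall inside that window. The object names (`days`, `days_per_month`) are made up for illustration, and the per-day rate at the end is just one possible way to put the months on a comparable footing.

```r
# Every calendar day covered by the incident reports (dates from the DATA section).
days <- seq(as.Date("2015-06-14"), as.Date("2018-09-03"), by = "day")

# Number of covered days in each calendar month, pooled across years.
days_per_month <- tabulate(as.integer(format(days, "%m")), nbins = 12)
names(days_per_month) <- month.abb
print(days_per_month)

# Example: the December Larceny count from the monthly table,
# expressed as incidents per covered December day.
december_larceny <- 2079
larceny_per_day_dec <- december_larceny / days_per_month[["Dec"]]
print(round(larceny_per_day_dec, 1))
```

July and August fall inside the window in four different years, while most other months do so in three. Dividing each monthly count by its number of covered days before comparing months would help separate a true summer peak from extra coverage.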

I was able to draw conclusions based on the data provided by the visuals above, but I wanted to perform a hypothesis test to support my idea that there exists a significant relationship between offense code group and month, and between offense code group and day of the week. The significance tests used to check whether month and day of week can predict the offense code of the crime committed are shown below:

## 
## Call:
## lm(formula = OFFENSE_CODE ~ MONTH, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2235.6 -1304.2   610.2   885.5  1541.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2351.8212     4.7274 497.492  < 2e-16 ***
## MONTH         -5.1854     0.6409  -8.091 5.95e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1185 on 319071 degrees of freedom
## Multiple R-squared:  0.0002051,  Adjusted R-squared:  0.000202 
## F-statistic: 65.46 on 1 and 319071 DF,  p-value: 5.952e-16
## 
## Call:
## lm(formula = OFFENSE_CODE ~ DAY_OF_WEEK, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2212.2 -1314.9   593.6   884.8  1521.6 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2323.1543     5.3824 431.621   <2e-16 ***
## DAY_OF_WEEKMonday      -4.8885     7.7283  -0.633   0.5270    
## DAY_OF_WEEKSaturday    -0.9517     7.7664  -0.123   0.9025    
## DAY_OF_WEEKSunday     -13.7463     7.9888  -1.721   0.0853 .  
## DAY_OF_WEEKThursday    -3.8417     7.6865  -0.500   0.6172    
## DAY_OF_WEEKTuesday     -7.2310     7.6980  -0.939   0.3476    
## DAY_OF_WEEKWednesday   -9.7247     7.6834  -1.266   0.2056    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1185 on 319066 degrees of freedom
## Multiple R-squared:  1.394e-05,  Adjusted R-squared:  -4.862e-06 
## F-statistic: 0.7415 on 6 and 319066 DF,  p-value: 0.6162

I used the t-tests on the coefficients, the overall F-tests, and their p-values to assess my models. The p-value of the first model, which uses MONTH as the predictor variable and OFFENSE_CODE as the response variable, is 5.95e-16, far below the 0.05 significance level. This allows me to reject the null hypothesis that there is no relationship between the month the crime was committed in and the offense committed. The model is not an effective predictor, however: the R-squared of 0.0002 means that month explains almost none of the variation in offense code, so the small p-value largely reflects the size of the dataset (319,073 reports). The significant result is still consistent with our original conclusion, because we noticed different crimes peaking during different parts of the year. The second model, which uses DAY_OF_WEEK as the predictor variable and OFFENSE_CODE as the response variable, shows insignificant p-values, with an overall F-test p-value of 0.6162. This does not allow me to reject the null hypothesis that there is no relationship between the day of the week the crime was committed and the offense committed. None of the individual coefficients, each of which compares a day against the baseline of Friday, is significant at the 0.05 level; Sunday comes closest with a p-value of 0.0853. I interpreted this to mean that Sunday may differ slightly from the other days, but there is not enough support to claim a significant relationship. This makes sense with our conclusion because we did see that crimes remained relatively constant throughout the week.
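
Since OFFENSE_CODE is a numeric label for a category rather than a measured quantity, a chi-square test of independence is another way the same question could be asked. The sketch below is not the test we ran: it swaps in `chisq.test()` from base R and runs it on simulated data, and the data frame `sim`, its three offense groups, and its 5,000 rows are invented for illustration.

```r
set.seed(1)  # make the simulated example reproducible

days   <- c("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")
groups <- c("Larceny", "Vandalism", "Drug Violation")

# Simulated incidents (NOT the Boston data): day and offense group
# are drawn independently of each other.
sim <- data.frame(
  DAY_OF_WEEK        = factor(sample(days, 5000, replace = TRUE), levels = days),
  OFFENSE_CODE_GROUP = factor(sample(groups, 5000, replace = TRUE), levels = groups)
)

# Cross-tabulate day of week against offense group.
day_by_offense <- table(sim$DAY_OF_WEEK, sim$OFFENSE_CODE_GROUP)

# Chi-square test of independence.
# Null hypothesis: the offense group does not depend on the day of the week.
result <- chisq.test(day_by_offense)
print(result)
```

On the real data, the same two lines (`table()` followed by `chisq.test()`) could be applied to the DAY_OF_WEEK and OFFENSE_CODE_GROUP columns, or to MONTH in place of DAY_OF_WEEK.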

For the second question, “What district is associated with the greatest number of crimes? Does this district tend to fall under the same general location?”, we created a visual that plotted the offense code group against the district the crime was committed in. The variables used are offense code group, district, latitude, and longitude. Although there were 12 different districts, some of the incident reports did not record a district and instead report “NA”. We had to design the code to exclude the observations that do not have a district recorded, which could affect our results. However, only 1,765 out of the 319,073 incident reports, or 0.55%, did not have a district listed. We thought this data was representative because the incidents listed with districts make up approximately 99.45% of the data. Below is the table of the counts of crimes for each district, along with two visuals that demonstrate the frequency of crime in each district:

## # A tibble: 13 x 2
##    DISTRICT Crimes
##    <chr>     <int>
##  1 B2        49945
##  2 C11       42530
##  3 D4        41915
##  4 A1        35717
##  5 B3        35442
##  6 C6        23460
##  7 D14       20127
##  8 E13       17536
##  9 E18       17348
## 10 A7        13544
## 11 E5        13239
## 12 A15        6505
## 13 <NA>       1765

From the above data, we concluded that District B2 in central Boston is associated with the highest number of crimes, followed by C11 on the east side and D4 in northern central Boston. Districts A15 and E5 have the lowest number of crimes out of the twelve districts. The crimes appear to be fairly evenly spread throughout these districts, with a slightly higher concentration in the center of the districts than in the outer areas.
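
The ranking can be double-checked directly from the district table. In this base R sketch the counts are copied by hand from the output above, and the names `district_counts` and `missing_district` are only illustrative.

```r
# Counts copied by hand from the district table above.
district_counts <- c(B2 = 49945, C11 = 42530, D4 = 41915, A1 = 35717,
                     B3 = 35442, C6 = 23460, D14 = 20127, E13 = 17536,
                     E18 = 17348, A7 = 13544, E5 = 13239, A15 = 6505)
missing_district <- 1765  # reports with no district recorded (the <NA> row)

# Rank the districts from most to fewest reports.
ranked <- sort(district_counts, decreasing = TRUE)
print(head(ranked, 3))  # three districts with the most reports
print(tail(ranked, 2))  # two districts with the fewest reports

# Share of all reports that have no district recorded, in percent.
total_reports <- sum(district_counts) + missing_district
missing_share <- round(100 * missing_district / total_reports, 2)
print(missing_share)
```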

Similarly to the first question, I was able to draw conclusions based on the data provided by the visuals above, but I wanted to perform a hypothesis test to support my idea that there exists a significant relationship between crime and the districts in Boston. The significance test used to check whether the city district can predict the offense committed is shown below:

## 
## Call:
## lm(formula = OFFENSE_CODE ~ DISTRICT, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2365.9 -1289.6   553.6   933.1  1782.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2136.308      6.229  342.97   <2e-16 ***
## DISTRICTA15  233.832     15.869   14.73   <2e-16 ***
## DISTRICTA7   227.695     11.879   19.17   <2e-16 ***
## DISTRICTB2   233.537      8.158   28.63   <2e-16 ***
## DISTRICTB3   316.136      8.826   35.82   <2e-16 ***
## DISTRICTC11  255.279      8.449   30.21   <2e-16 ***
## DISTRICTC6   167.178      9.893   16.90   <2e-16 ***
## DISTRICTD14  240.051     10.376   23.14   <2e-16 ***
## DISTRICTD4   -87.946      8.477  -10.37   <2e-16 ***
## DISTRICTE13  183.892     10.855   16.94   <2e-16 ***
## DISTRICTE18  333.221     10.894   30.59   <2e-16 ***
## DISTRICTE5   340.582     11.978   28.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1177 on 317296 degrees of freedom
##   (1765 observations deleted due to missingness)
## Multiple R-squared:  0.01364,    Adjusted R-squared:  0.01361 
## F-statistic:   399 on 11 and 317296 DF,  p-value: < 2.2e-16

I once again used the t-tests and p-values to assess my model. The overall p-value of the model, which uses DISTRICT as the predictor variable and OFFENSE_CODE as the response variable, is less than 2.2e-16, far below the 0.05 significance level. This allows me to reject the null hypothesis that there is no relationship between the district the crime was committed in and the offense committed. The p-value for each coefficient is also significant, with each value being less than 2e-16, which means the average offense code in every district differs from that of the baseline district, A1. The R-squared of 0.014, however, shows that district explains only about 1.4% of the variation in offense code, so district is a statistically significant but weak predictor of the offense committed.
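
With roughly 319,000 rows, a p-value and an R-squared can tell quite different stories, so it helps to read them side by side. The sketch below uses simulated data only: the slope of -5 and the noise level of 1185 are assumptions chosen to loosely resemble the MONTH model output shown earlier, and the names `sim`, `fit`, `p_value`, and `r_squared` are illustrative.

```r
set.seed(42)  # reproducible simulation

n <- 319073   # same number of rows as the incident reports

# Simulated data (NOT the Boston data): a tiny linear effect buried in large noise.
sim <- data.frame(MONTH = sample(1:12, n, replace = TRUE))
sim$RESPONSE <- 2350 - 5 * sim$MONTH + rnorm(n, mean = 0, sd = 1185)

fit <- lm(RESPONSE ~ MONTH, data = sim)
fit_summary <- summary(fit)

# p-value of the MONTH coefficient and the share of variance explained.
p_value   <- fit_summary$coefficients["MONTH", "Pr(>|t|)"]
r_squared <- fit_summary$r.squared

cat("p-value:  ", format(p_value, digits = 3), "\n")
cat("R-squared:", format(r_squared, digits = 3), "\n")
```

In a run like this the slope is typically highly significant even though the model explains well under 1% of the variance, which is why significance alone does not show that a predictor is useful.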

CONCLUSION

When approaching the first question about whether there is a certain part of the week or the year when a certain crime is committed, we discovered that each of the top 12 crimes experienced an increase in count during the summertime (June through August). We noticed that certain crimes (vandalism, towing, and investigate person) reached a peak at the end of July and beginning of August, while other crimes (larceny, investigate property, and drug violation) reached a peak at the end of June and beginning of July. This information also demonstrated that the frequency of crimes during the summer significantly impacts the total year’s crime rate. Additionally, we found that crimes remained fairly constant throughout the week. There were no clear peaks or troughs in the data for days of week versus offense codes.

When approaching the second question about whether certain districts experience more crimes and whether the districts with more crime fall under the same location, we found that central Boston (B2) is associated with the highest number of crimes. Districts C11 and D4, which fall on the east side and northern central parts of Boston respectively, have the second and third highest numbers of crimes. Districts A15 and E5 show the lowest number of crimes out of all the districts. The crimes in Boston seem to be fairly evenly spread out, but it does appear that more of the crimes occur in the central region.

This data is highly relevant to the state of our country today. With different types of crime becoming rather frequent, there is more discussion on policies and laws we can implement to decrease the crime rate. Every day, we hear news about a terrorist attack, a person of color being wrongfully killed, a public setting being attacked with gun violence, or someone being wrongfully targeted because of their personal beliefs. Crime is everywhere. The conversation is growing about what we can do to reduce the frequency of crimes, and this dataset demonstrates that we can draw conclusions about when and where certain types of crime are committed and thus create a predictive model that shows when and where we need to pay more attention. The state government, the city government, financial committees, law enforcement, and anyone with the power to influence budgets, police resources, and employees can use this information to determine which areas need more attention, safety, and security. Of course there is no way to ensure that all crime is eradicated or to know exactly when and where crimes will be committed, but there is an established pattern. If we can offer more security in the areas where crimes are most frequent, or in the months when the crime count peaks, we could attempt to get ahead of future incidents and decrease the crime rate. This data also raises further questions about what more can be done to prevent these crimes, like providing shelter, education, and jobs. It would be incredibly interesting to see whether implementing these opportunities in areas with high crime can impact the crime rate.

Although most of the data demonstrated interesting trends to draw conclusions from, there is definitely more that could be done to draw more secure conclusions. If we had information regarding the demographics of those involved in the crimes reported, the poverty level of the locations where crimes were committed, and the incident reports of other cities, including cities with low crime rates, we could produce statistics that lead to interpretations that go deeper than the surface-level conclusions we have drawn. I would like to have used an ANOVA test to determine the differences in statistics between different areas and to have further researched the underlying causes behind these differences, like poverty level, the effectiveness of the local government, etc. Additionally, I would have liked to test several predictor variables together, along with their interactions, to see if a more predictive analysis could have been created - for example, testing a multiple regression on whether location and time impacted the offense committed (lm(OFFENSE_CODE ~ MONTH + DISTRICT), or MONTH * DISTRICT to include the interaction). As always, there is definitely room for improvement in this research, but given the data we did interpret, I do think this research provides relevancy and support for higher powers to make a change. Collecting statistics is one thing, but actually taking the effort to apply these statistics to real world situations is another thing.
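
As a starting point for that follow-up, the sketch below fits an additive model and an interaction model on simulated data and compares them with `anova()`. Everything in it is a placeholder: the data frame `sim`, the three district labels, the sample size of 600, and the numeric `RESPONSE` column are made up for illustration and are not taken from the Boston data.

```r
set.seed(7)  # reproducible simulation

# Simulated data (NOT the Boston data) with a hypothetical numeric response.
n <- 600
sim <- data.frame(
  MONTH    = sample(1:12, n, replace = TRUE),
  DISTRICT = factor(sample(c("A1", "B2", "C11"), n, replace = TRUE))
)
sim$RESPONSE <- 100 + 2 * sim$MONTH + rnorm(n, sd = 10)

# Additive model: MONTH and DISTRICT each contribute separately.
fit_add <- lm(RESPONSE ~ MONTH + DISTRICT, data = sim)

# Interaction model: the effect of MONTH is allowed to differ by DISTRICT.
fit_int <- lm(RESPONSE ~ MONTH * DISTRICT, data = sim)

# F-test of whether the interaction terms improve on the additive model.
comparison <- anova(fit_add, fit_int)
print(comparison)
```

In R formula syntax, `MONTH + DISTRICT` includes only the two main effects, while `MONTH * DISTRICT` expands to the main effects plus their interaction, so the comparison isolates what the interaction adds.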