Background

About Analyze Boston

Analyze Boston is the City of Boston’s open data hub to find facts, figures, and maps related to our lives within the city. We are working to make this the default technology platform to support the publication of the City’s public information, in the form of data, and to make this information easy to find, access, and use by a broad audience. This platform is managed by the Citywide Analytics Team.

Dataset

Crime incident reports are provided by Boston Police Department (BPD) to document the initial details surrounding an incident to which BPD officers respond. This is a dataset containing records from the new crime incident report system, which includes a reduced set of fields focused on capturing the type of incident as well as when and where it occurred.

Records begin on June 14, 2015 and continue through September 3, 2018.

The dataset is published on both Analyze Boston and Kaggle.

Case Study: “Crime in Boston 2015-2018”

We are data analysts at Analyze Boston, and our job is to examine data closely and extract information from it. We want to help police officers improve security in specific areas of Boston. Using the Crime in Boston 2015-2018 data, we want to determine how incidents are distributed across the city and which types of crime occur most often.

Read Data

Make sure the data file is stored in the data_input folder inside our R project folder, because the path used below is relative to the project.

# Read Dataset
crime <- read.csv("data_input/crime.csv")
head(crime, 10)
##    INCIDENT_NUMBER OFFENSE_CODE              OFFENSE_CODE_GROUP
## 1       I182070945          619                         Larceny
## 2       I182070943         1402                       Vandalism
## 3       I182070941         3410                           Towed
## 4       I182070940         3114            Investigate Property
## 5       I182070938         3114            Investigate Property
## 6       I182070936         3820 Motor Vehicle Accident Response
## 7       I182070933          724                      Auto Theft
## 8       I182070932         3301                 Verbal Disputes
## 9       I182070931          301                         Robbery
## 10      I182070929         3301                 Verbal Disputes
##                           OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING
## 1                          LARCENY ALL OTHERS      D14            808         
## 2                                   VANDALISM      C11            347         
## 3                         TOWED MOTOR VEHICLE       D4            151         
## 4                        INVESTIGATE PROPERTY       D4            272         
## 5                        INVESTIGATE PROPERTY       B3            421         
## 6  M/V ACCIDENT INVOLVING PEDESTRIAN - INJURY      C11            398         
## 7                                  AUTO THEFT       B2            330         
## 8                              VERBAL DISPUTE       B2            584         
## 9                            ROBBERY - STREET       C6            177         
## 10                             VERBAL DISPUTE      C11            364         
##       OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK HOUR   UCR_PART            STREET
## 1  2018-09-02 13:00:00 2018     9      Sunday   13   Part One        LINCOLN ST
## 2  2018-08-21 00:00:00 2018     8     Tuesday    0   Part Two          HECLA ST
## 3  2018-09-03 19:27:00 2018     9      Monday   19 Part Three       CAZENOVE ST
## 4  2018-09-03 21:16:00 2018     9      Monday   21 Part Three        NEWCOMB ST
## 5  2018-09-03 21:05:00 2018     9      Monday   21 Part Three          DELHI ST
## 6  2018-09-03 21:09:00 2018     9      Monday   21 Part Three        TALBOT AVE
## 7  2018-09-03 21:25:00 2018     9      Monday   21   Part One       NORMANDY ST
## 8  2018-09-03 20:39:37 2018     9      Monday   20 Part Three           LAWN ST
## 9  2018-09-03 20:48:00 2018     9      Monday   20   Part One MASSACHUSETTS AVE
## 10 2018-09-03 20:38:00 2018     9      Monday   20 Part Three         LESLIE ST
##         Lat      Long                    Location
## 1  42.35779 -71.13937 (42.35779134, -71.13937053)
## 2  42.30682 -71.06030 (42.30682138, -71.06030035)
## 3  42.34659 -71.07243 (42.34658879, -71.07242943)
## 4  42.33418 -71.07866 (42.33418175, -71.07866441)
## 5  42.27537 -71.09036 (42.27536542, -71.09036101)
## 6  42.29020 -71.07159 (42.29019621, -71.07159012)
## 7  42.30607 -71.08273 (42.30607218, -71.08273260)
## 8  42.32702 -71.10555 (42.32701648, -71.10555088)
## 9  42.33152 -71.07085 (42.33152148, -71.07085307)
## 10 42.29515 -71.05861 (42.29514664, -71.05860832)

Checking Dataset

# Inspect
str(crime)
## 'data.frame':    319073 obs. of  17 variables:
##  $ INCIDENT_NUMBER    : chr  "I182070945" "I182070943" "I182070941" "I182070940" ...
##  $ OFFENSE_CODE       : int  619 1402 3410 3114 3114 3820 724 3301 301 3301 ...
##  $ OFFENSE_CODE_GROUP : chr  "Larceny" "Vandalism" "Towed" "Investigate Property" ...
##  $ OFFENSE_DESCRIPTION: chr  "LARCENY ALL OTHERS" "VANDALISM" "TOWED MOTOR VEHICLE" "INVESTIGATE PROPERTY" ...
##  $ DISTRICT           : chr  "D14" "C11" "D4" "D4" ...
##  $ REPORTING_AREA     : int  808 347 151 272 421 398 330 584 177 364 ...
##  $ SHOOTING           : chr  "" "" "" "" ...
##  $ OCCURRED_ON_DATE   : chr  "2018-09-02 13:00:00" "2018-08-21 00:00:00" "2018-09-03 19:27:00" "2018-09-03 21:16:00" ...
##  $ YEAR               : int  2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
##  $ MONTH              : int  9 8 9 9 9 9 9 9 9 9 ...
##  $ DAY_OF_WEEK        : chr  "Sunday" "Tuesday" "Monday" "Monday" ...
##  $ HOUR               : int  13 0 19 21 21 21 21 20 20 20 ...
##  $ UCR_PART           : chr  "Part One" "Part Two" "Part Three" "Part Three" ...
##  $ STREET             : chr  "LINCOLN ST" "HECLA ST" "CAZENOVE ST" "NEWCOMB ST" ...
##  $ Lat                : num  42.4 42.3 42.3 42.3 42.3 ...
##  $ Long               : num  -71.1 -71.1 -71.1 -71.1 -71.1 ...
##  $ Location           : chr  "(42.35779134, -71.13937053)" "(42.30682138, -71.06030035)" "(42.34658879, -71.07242943)" "(42.33418175, -71.07866441)" ...

Some columns have an inappropriate data type, and some are not needed for this analysis.

Columns to delete because they are not used:

  1. SHOOTING
  2. REPORTING_AREA
  3. Lat
  4. Long

Data types that we should change:

  1. OFFENSE_CODE_GROUP -> as.factor
  2. OFFENSE_DESCRIPTION -> as.factor
  3. DISTRICT -> as.factor
  4. OCCURRED_ON_DATE -> datetime
  5. MONTH -> name of month, then as.factor
  6. UCR_PART -> as.factor
  7. STREET -> as.factor
  8. DAY_OF_WEEK -> as.factor
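Since the file already ships YEAR, MONTH, and HOUR columns next to OCCURRED_ON_DATE, it can be worth checking that they agree with the timestamp before relying on them. The following is a minimal base R sketch, not part of the original workflow. The `toy` rows copy the first three records of the preview above, and the code assumes every timestamp follows the "YYYY-MM-DD HH:MM:SS" layout seen there.

```r
# Toy rows copied from the first three records of the preview.
toy <- data.frame(
  OCCURRED_ON_DATE = c("2018-09-02 13:00:00", "2018-08-21 00:00:00", "2018-09-03 19:27:00"),
  YEAR  = c(2018L, 2018L, 2018L),
  MONTH = c(9L, 8L, 9L),
  HOUR  = c(13L, 0L, 19L),
  stringsAsFactors = FALSE
)

# Parse the text timestamps; a fixed time zone keeps the extracted parts stable.
ts <- as.POSIXct(toy$OCCURRED_ON_DATE, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Re-derive the parts from the timestamp and compare them with the shipped columns.
derived <- data.frame(
  YEAR  = as.integer(format(ts, "%Y")),
  MONTH = as.integer(format(ts, "%m")),
  HOUR  = as.integer(format(ts, "%H"))
)
mismatch <- derived$YEAR != toy$YEAR |
  derived$MONTH != toy$MONTH |
  derived$HOUR != toy$HOUR

sum(mismatch)  # number of rows where the columns disagree with the timestamp
```

On the full data the same comparison could be run against crime$OCCURRED_ON_DATE; any non-zero count would point to rows that deserve a closer look.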

Data Wrangling

Import Packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Delete Columns & Change Data Types

crime_clean <- crime %>% 
  select(-c("SHOOTING", "REPORTING_AREA", "Lat", "Long")) %>% 
  mutate(OFFENSE_CODE_GROUP = as.factor(OFFENSE_CODE_GROUP),
         OFFENSE_DESCRIPTION = as.factor(OFFENSE_DESCRIPTION),
         DISTRICT = as.factor(DISTRICT),
         OCCURRED_ON_DATE = ymd_hms(OCCURRED_ON_DATE),
         UCR_PART = as.factor(UCR_PART),
         STREET = as.factor(STREET))

# Map month numbers 1-12 to English month names with the built-in month.name vector
crime_clean$MONTH <- month.name[crime_clean$MONTH]


crime_clean <- crime_clean[!(crime_clean$STREET == ""),]
crime_clean <- crime_clean[!(crime_clean$DISTRICT == ""),]
crime_clean$MONTH <- as.factor(crime_clean$MONTH)
crime_clean$DAY_OF_WEEK <- as.factor(crime_clean$DAY_OF_WEEK)
head(crime_clean)
##   INCIDENT_NUMBER OFFENSE_CODE              OFFENSE_CODE_GROUP
## 1      I182070945          619                         Larceny
## 2      I182070943         1402                       Vandalism
## 3      I182070941         3410                           Towed
## 4      I182070940         3114            Investigate Property
## 5      I182070938         3114            Investigate Property
## 6      I182070936         3820 Motor Vehicle Accident Response
##                          OFFENSE_DESCRIPTION DISTRICT    OCCURRED_ON_DATE YEAR
## 1                         LARCENY ALL OTHERS      D14 2018-09-02 13:00:00 2018
## 2                                  VANDALISM      C11 2018-08-21 00:00:00 2018
## 3                        TOWED MOTOR VEHICLE       D4 2018-09-03 19:27:00 2018
## 4                       INVESTIGATE PROPERTY       D4 2018-09-03 21:16:00 2018
## 5                       INVESTIGATE PROPERTY       B3 2018-09-03 21:05:00 2018
## 6 M/V ACCIDENT INVOLVING PEDESTRIAN - INJURY      C11 2018-09-03 21:09:00 2018
##       MONTH DAY_OF_WEEK HOUR   UCR_PART      STREET                    Location
## 1 September      Sunday   13   Part One  LINCOLN ST (42.35779134, -71.13937053)
## 2    August     Tuesday    0   Part Two    HECLA ST (42.30682138, -71.06030035)
## 3 September      Monday   19 Part Three CAZENOVE ST (42.34658879, -71.07242943)
## 4 September      Monday   21 Part Three  NEWCOMB ST (42.33418175, -71.07866441)
## 5 September      Monday   21 Part Three    DELHI ST (42.27536542, -71.09036101)
## 6 September      Monday   21 Part Three  TALBOT AVE (42.29019621, -71.07159012)

Each column has now been converted to the desired data type.
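One side effect of as.factor() is that levels are sorted alphabetically, so April comes before January and Friday before Monday (this is visible later in the summary, where DAY_OF_WEEK starts with Friday). If calendar order is preferred for tables and plots, the levels can be given explicitly. A small sketch with made-up values, using only base R:

```r
# Made-up month numbers and day names, for illustration only.
month_num <- c(9, 8, 1, 12, 8)
day_name  <- c("Sunday", "Tuesday", "Monday", "Friday", "Monday")

# month.name is a built-in vector of English month names in calendar order.
month_fct <- factor(month.name[month_num], levels = month.name)

# Weekday order is written out by hand; starting the week on Monday is an arbitrary choice.
week_days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
day_fct <- factor(day_name, levels = week_days)

table(month_fct)  # counts are listed January to December, including empty months
levels(day_fct)   # Monday comes first instead of Friday
```

Writing the level order out explicitly also keeps the result independent of the system locale.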

Checking Missing Value

anyNA(crime_clean)
## [1] FALSE
colSums(is.na(crime_clean))
##     INCIDENT_NUMBER        OFFENSE_CODE  OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION 
##                   0                   0                   0                   0 
##            DISTRICT    OCCURRED_ON_DATE                YEAR               MONTH 
##                   0                   0                   0                   0 
##         DAY_OF_WEEK                HOUR            UCR_PART              STREET 
##                   0                   0                   0                   0 
##            Location 
##                   0

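Awesome! The data has no NA values. Keep in mind that anyNA() and is.na() do not count empty strings: the summary in the next section still shows 90 records with an empty UCR_PART.

Now the Crime in Boston data is ready to be processed and analyzed.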
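Because read.csv() keeps empty text fields as "" rather than NA, a blank-value count is a useful companion to the NA check. The sketch below uses a made-up data frame; `toy` and the helper name `count_blank` are illustrative and do not appear in the original analysis.

```r
# Made-up rows with a few empty text fields.
toy <- data.frame(
  DISTRICT = c("D14", "", "B2", "C11"),
  UCR_PART = c("Part One", "Part Two", "", ""),
  HOUR     = c(13L, 0L, 19L, 21L),
  stringsAsFactors = FALSE
)

# Count empty strings per column; non-text columns are compared through as.character().
count_blank <- function(df) {
  sapply(df, function(col) sum(as.character(col) == "", na.rm = TRUE))
}

anyNA(toy)        # FALSE: empty strings are not NA
count_blank(toy)  # DISTRICT 1, UCR_PART 2, HOUR 0

# Optionally turn blanks into NA so that anyNA() and is.na() can see them.
toy_na <- toy
toy_na[toy_na == ""] <- NA
anyNA(toy_na)     # TRUE
```

Whether blanks should be dropped, recoded, or kept as their own category depends on the question being asked, so the sketch only makes them visible.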

Data Explanation

We can use the summary() function to get an overview of the data.

summary(crime_clean)
##  INCIDENT_NUMBER     OFFENSE_CODE                        OFFENSE_CODE_GROUP
##  Length:307484      Min.   : 111   Motor Vehicle Accident Response: 33684  
##  Class :character   1st Qu.:1001   Larceny                        : 25578  
##  Mode  :character   Median :2907   Medical Assistance             : 23001  
##                     Mean   :2306   Investigate Person             : 18377  
##                     3rd Qu.:3201   Other                          : 17515  
##                     Max.   :3831   Drug Violation                 : 15420  
##                                    (Other)                        :173909  
##                             OFFENSE_DESCRIPTION    DISTRICT    
##  INVESTIGATE PERSON                   : 18381   B2     :48132  
##  SICK/INJURED/MEDICAL - PERSON        : 18344   C11    :41458  
##  M/V - LEAVING SCENE - PROPERTY DAMAGE: 15291   D4     :40228  
##  VANDALISM                            : 14864   B3     :34687  
##  ASSAULT SIMPLE - BATTERY             : 14352   A1     :34179  
##  VERBAL DISPUTE                       : 13023   C6     :22514  
##  (Other)                              :213229   (Other):86286  
##  OCCURRED_ON_DATE                   YEAR            MONTH       
##  Min.   :2015-06-15 00:00:00   Min.   :2015   August   : 33557  
##  1st Qu.:2016-04-11 07:30:00   1st Qu.:2016   July     : 33441  
##  Median :2017-02-05 01:43:30   Median :2017   June     : 29638  
##  Mean   :2017-01-28 03:11:54   Mean   :2017   September: 25445  
##  3rd Qu.:2017-11-10 11:19:45   3rd Qu.:2017   May      : 25364  
##  Max.   :2018-09-03 21:25:00   Max.   :2018   October  : 24648  
##                                               (Other)  :135391  
##     DAY_OF_WEEK         HOUR             UCR_PART     
##  Friday   :46712   Min.   : 0.00             :    90  
##  Monday   :43966   1st Qu.: 9.00   Other     :  1188  
##  Saturday :43179   Median :14.00   Part One  : 60132  
##  Sunday   :38979   Mean   :13.12   Part Three:152143  
##  Thursday :44940   3rd Qu.:18.00   Part Two  : 93931  
##  Tuesday  :44626   Max.   :23.00                      
##  Wednesday:45082                                      
##                STREET         Location        
##  WASHINGTON ST    : 14192   Length:307484     
##  BLUE HILL AVE    :  7794   Class :character  
##  BOYLSTON ST      :  7219   Mode  :character  
##  DORCHESTER AVE   :  5143                     
##  TREMONT ST       :  4796                     
##  MASSACHUSETTS AVE:  4707                     
##  (Other)          :263633

INSIGHT

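  1. The data ranges from June 2015 to September 2018.
  2. August has the highest incident count. Because June through August fall inside the data range in four calendar years (2015 to 2018) while October through May do so in only three, the raw monthly totals favor the summer months.
  3. WASHINGTON ST has the most incidents of any street.
  4. Friday has the most incidents of any day of the week.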
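Raw monthly totals depend on how many calendar years each month appears in, so a per-year average can be a fairer basis for comparing months. The sketch below illustrates the idea on made-up rows (the `toy` data frame is not taken from the dataset). It treats every year in which a month appears as a full year of coverage, which is a simplification for partially covered months such as June 2015 and September 2018.

```r
# Made-up incidents: one row per incident with its year and month name.
toy <- data.frame(
  YEAR  = c(2015, 2015, 2016, 2016, 2016, 2017, 2017),
  MONTH = c("August", "August", "August", "March", "March", "August", "March"),
  stringsAsFactors = FALSE
)

# Raw totals per month.
total <- tapply(toy$YEAR, toy$MONTH, length)

# Number of distinct calendar years in which each month appears.
years_covered <- tapply(toy$YEAR, toy$MONTH, function(y) length(unique(y)))

# Average incidents per covered year; both results share the same month order.
per_year <- total / years_covered

data.frame(
  Month        = names(total),
  Total        = as.vector(total),
  YearsCovered = as.vector(years_covered),
  PerYear      = as.vector(per_year)
)
```

In this toy example August has the larger total (4 against 3) but March has the higher per-year average (1.5 against about 1.33), which shows how the ranking can change once coverage is taken into account.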

Exploratory Data Analysis

Most Frequent Crime Categories

We need to count the incidents in each offense code group and sort the counts in descending order.

crime_category <- as.data.frame(sort(table(crime_clean$OFFENSE_CODE_GROUP), decreasing = TRUE))
names(crime_category) <- c("Category", "Frequency")

head(crime_category, 10)
##                           Category Frequency
## 1  Motor Vehicle Accident Response     33684
## 2                          Larceny     25578
## 3               Medical Assistance     23001
## 4               Investigate Person     18377
## 5                            Other     17515
## 6                   Drug Violation     15420
## 7                   Simple Assault     15363
## 8                        Vandalism     15118
## 9                  Verbal Disputes     13023
## 10                           Towed     10966
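The same tabulate, sort, and rename pattern is used again below for streets and hours, so it could be wrapped in a small function. The helper name `count_values` and the street names are hypothetical; this is a sketch of one possible refactoring rather than the code used to produce the results in this report.

```r
# Hypothetical helper: tabulate a vector and return a two-column data frame
# sorted from the most to the least frequent value.
count_values <- function(x, label) {
  counts <- as.data.frame(sort(table(x), decreasing = TRUE))
  names(counts) <- c(label, "Frequency")
  counts
}

# Made-up street names for illustration.
streets <- c("A ST", "B ST", "A ST", "C ST", "A ST", "B ST")
street_counts <- count_values(streets, "Street")

head(street_counts, 2)  # the two most frequent streets with their counts
```

With such a helper, a call like count_values(crime_clean$STREET, "Street") would replace the three-line pattern in each section.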

Plotting The Data

ggplot(head(crime_category, 10), aes(x = reorder(Category, Frequency), y = Frequency))+
  geom_col(fill = "purple") +
  coord_flip()+
  labs(x = "",
       y = "Frequency",
       title = "Most Frequent Crime Categories") +
  theme_minimal()

Streets with the Most Incidents

We need to count the incidents on each street and sort the counts in descending order.

crime_street <- as.data.frame(sort(table(crime_clean$STREET), decreasing = TRUE))
names(crime_street) <- c("Street", "Frequency")

head(crime_street, 10)
##               Street Frequency
## 1      WASHINGTON ST     14192
## 2      BLUE HILL AVE      7794
## 3        BOYLSTON ST      7219
## 4     DORCHESTER AVE      5143
## 5         TREMONT ST      4796
## 6  MASSACHUSETTS AVE      4707
## 7       HARRISON AVE      4608
## 8          CENTRE ST      4379
## 9   COMMONWEALTH AVE      4134
## 10     HYDE PARK AVE      3470

Plotting The Data

ggplotly(ggplot(head(crime_street, 10), aes(x = reorder(Street, Frequency), y = Frequency))+
  geom_col(fill = "Orange") +
  coord_flip()+
  labs(x = "",
       y = "Frequency",
       title = "Streets with the Most Incidents") +
  theme_minimal())

Hours with the Most Incidents

We need to count the incidents in each hour of the day.

crime_hour <- as.data.frame(table(crime_clean$HOUR))
names(crime_hour) <- c("Hour", "Frequency")

crime_hour
##    Hour Frequency
## 1     0     14560
## 2     1      8770
## 3     2      7261
## 4     3      4392
## 5     4      3286
## 6     5      3177
## 7     6      4861
## 8     7      8542
## 9     8     12593
## 10    9     14311
## 11   10     15864
## 12   11     15935
## 13   12     18116
## 14   13     16324
## 15   14     16581
## 16   15     15926
## 17   16     19156
## 18   17     19855
## 19   18     19451
## 20   19     16897
## 21   20     15330
## 22   21     13624
## 23   22     12446
## 24   23     10226

Plotting The Data

ggplotly(ggplot(crime_hour, aes(x = reorder(Hour, Frequency), y = Frequency))+
  geom_col(fill = "red") +
  coord_flip()+
  labs(x = "Hour",
       y = "Frequency",
       title = "Hours with the Most Incidents") +
  theme_minimal())
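as.data.frame(table(...)) stores Hour as a factor, and reorder(Hour, Frequency) then arranges the bars by count rather than by time of day. If a chronological axis is wanted instead, the factor has to be converted back to numbers with care, because as.integer() on a factor returns the internal level codes rather than the labels. A short sketch with made-up hours:

```r
# Made-up hour values for illustration.
hours <- c(0, 0, 5, 23, 23, 23)
hour_counts <- as.data.frame(table(hours))
names(hour_counts) <- c("Hour", "Frequency")

as.integer(hour_counts$Hour)                # 1 2 3: level codes, not clock hours
as.integer(as.character(hour_counts$Hour))  # 0 5 23: the actual hours

# Store the numeric hour so a plot can use it directly on a continuous axis.
hour_counts$HourNum <- as.integer(as.character(hour_counts$Hour))
hour_counts[order(hour_counts$HourNum), ]
```

A column like HourNum could then be mapped to the x aesthetic in place of reorder(Hour, Frequency) when the daily rhythm matters more than the ranking.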

Crime Hour Frequency for Each Day

We need to count the incidents for each combination of hour and day of the week.

crime_day <- as.data.frame(table(crime_clean$HOUR,
                                 crime_clean$DAY_OF_WEEK))
names(crime_day) <- c("Hour", "Day", "Frequency")

head(crime_day, 10)
##    Hour    Day Frequency
## 1     0 Friday      2086
## 2     1 Friday      1208
## 3     2 Friday       908
## 4     3 Friday       512
## 5     4 Friday       433
## 6     5 Friday       467
## 7     6 Friday       739
## 8     7 Friday      1346
## 9     8 Friday      1981
## 10    9 Friday      2234

Plotting The Data

ggplotly(ggplot(data = crime_day, mapping = aes(x = Frequency, y = reorder(Hour, Frequency))) +
  geom_col(mapping = aes(fill = Day)) + # default position
  labs(x = "Frequency",
       y = "Hour",
       fill = "",
       title = "Incident Frequency by Hour",
       subtitle = "Colored by Day of the Week") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(legend.position = "top"))
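The stacked chart shows the overall shape, but it is hard to read off the busiest hour of each individual day. Assuming a long-format table like crime_day (columns Hour, Day, Frequency), the peak per day could be extracted as in the following sketch. The numbers in `toy_day` are made up.

```r
# Made-up counts in the same long format as crime_day.
toy_day <- data.frame(
  Hour      = c(0, 17, 18, 0, 17, 18),
  Day       = c("Friday", "Friday", "Friday", "Sunday", "Sunday", "Sunday"),
  Frequency = c(5, 9, 7, 8, 4, 6),
  stringsAsFactors = FALSE
)

# Split by day, keep the row with the largest count in each group
# (which.max() returns the first row on ties), and stack the rows again.
peak <- do.call(rbind, lapply(split(toy_day, toy_day$Day), function(d) {
  d[which.max(d$Frequency), ]
}))
rownames(peak) <- NULL

peak  # one row per day: Friday peaks at hour 17, Sunday at hour 0
```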

Conclusion

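From the analysis and plots above, we can conclude the following:

  1. Incidents in Boston peak in the late afternoon: the hours starting at 4 pm, 5 pm, and 6 pm have the highest counts.
  2. Friday has more incidents than any other day of the week.
  3. Washington St, Blue Hill Ave, and Boylston St have the most incidents.
  4. The most frequent incident categories are ‘Motor Vehicle Accident Response’, ‘Larceny’, and ‘Medical Assistance’. Not every category describes a crime, because the dataset records all incidents to which BPD officers respond.
  5. August has the highest raw incident count, although June through August are covered by one more calendar year than October through May.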

Reference

  1. Analyze Boston