About the Data

This data is provided by Boston Police Department (BPD), which is about the crime incidents that happened within June 14, 2015 until September 3, 2018.

Compared to the Chicago crime data, this data size is much smaller (55.3 MB, as opposed to the almost 1 GB data), while still maintaining a vast amount of informations (see the section of 1. Read the Data below).

1. Read the Data

The Data

crime <- read.csv("crime.csv")
crime

  INCIDENT_NUMBER OFFENSE_CODE   OFFENSE_CODE_GROUP  OFFENSE_DESCRIPTION
1      I182070945          619              Larceny   LARCENY ALL OTHERS
2      I182070943         1402            Vandalism            VANDALISM
3      I182070941         3410                Towed  TOWED MOTOR VEHICLE
4      I182070940         3114 Investigate Property INVESTIGATE PROPERTY
  DISTRICT REPORTING_AREA SHOOTING    OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK
1      D14            808          2018-09-02 13:00:00 2018     9      Sunday
2      C11            347          2018-08-21 00:00:00 2018     8     Tuesday
3       D4            151          2018-09-03 19:27:00 2018     9      Monday
4       D4            272          2018-09-03 21:16:00 2018     9      Monday
  HOUR   UCR_PART      STREET      Lat      Long                    Location
1   13   Part One  LINCOLN ST 42.35779 -71.13937 (42.35779134, -71.13937053)
2    0   Part Two    HECLA ST 42.30682 -71.06030 (42.30682138, -71.06030035)
3   19 Part Three CAZENOVE ST 42.34659 -71.07243 (42.34658879, -71.07242943)
4   21 Part Three  NEWCOMB ST 42.33418 -71.07866 (42.33418175, -71.07866441)
 [ reached 'max' / getOption("max.print") -- omitted 319069 rows ]

The column names are mostly quite explanatory, but here are the explanations either way (quoted directly from BPD website): - INCIDENT_NUMBER : Internal BPD report number - OFFENSE_CODE : Numerical code of offense description
- OFFENSE_CODE_GROUP : Internal categorization of [offense_description]
- OFFENSE_DESCRIPTION : Primary descriptor of incident
- DISTRICT : What district the crime was reported in
- REPORTING_AREA : RA number associated with the where the crime was reported from.
- SHOOTING : Indicated a shooting took place.
- OCCURRED_ON_DATE: Earliest date and time the incident could have taken place
- UCR_PART : Universal Crime Reporting Part number (1,2, 3)
- STREET : Street name the incident took place

offense <- read.csv("offense_codes.csv")
offense

   CODE                                              NAME
1   612                  LARCENY PURSE SNATCH - NO FORCE 
2   613                               LARCENY SHOPLIFTING
3   615           LARCENY THEFT OF MV PARTS & ACCESSORIES
4  1731                                            INCEST
5  3111                         LICENSE PREMISE VIOLATION
6  2646                       LIQUOR - DRINKING IN PUBLIC
7  2204                              LIQUOR LAW VIOLATION
8  3810     M/V ACCIDENT - INVOLVING \xcaBICYCLE - INJURY
9  3801                              M/V ACCIDENT - OTHER
10 3807                 M/V ACCIDENT - OTHER CITY VEHICLE
11 3803                    M/V ACCIDENT - PERSONAL INJURY
12 3805                     M/V ACCIDENT - POLICE VEHICLE
13 3802                M/V ACCIDENT - PROPERTY \xcaDAMAGE
14 3205                                 M/V PLATES - LOST
15  123           MANSLAUGHTER - NON-VEHICLE - NEGLIGENCE
16  121               MANSLAUGHTER - VEHICLE - NEGLIGENCE
17 3501                                    MISSING PERSON
18 3502                          MISSING PERSON - LOCATED
19 3503           MISSING PERSON - NOT REPORTED - LOCATED
20  111               MURDER, NON-NEGLIGIENT MANSLAUGHTER
21 3303                          NOISY PARTY/RADIO-ARREST
22 2623                   OBSCENE MATERIALS - PORNOGRAPHY
23 2628                               OBSCENE PHONE CALLS
24 1711                             OPEN & GROSS LEWDNESS
25 2007            VIOL. OF RESTRAINING ORDER W NO ARREST
26 2102               OPERATING UNDER THE INFLUENCE DRUGS
27 2660                                     OTHER OFFENSE
28 2900               VAL - VIOLATION OF AUTO LAW - OTHER
29 1109                                      FRAUD - WIRE
30 2616                   POSSESSION OF BURGLARIOUS TOOLS
31 2606                        PRISONER ATTEMPT TO RESCUE
32 2636              PRISONER ESCAPE / ESCAPE & RECAPTURE
33 3029              PRISONER - SUICIDE / SUICIDE ATTEMPT
34 3106                      PROPERTY - ACCIDENTAL DAMAGE
35 1300 STOLEN PROPERTY - BUYING / RECEIVING / POSSESSING
36 3207                                  PROPERTY - FOUND
37 3201                                   PROPERTY - LOST
 [ reached 'max' / getOption("max.print") -- omitted 539 rows ]

Library

library(dplyr) # for data wrangling
library(lubridate) # to deal with date
library(padr) # for padding
library(forecast) # time series library
library(tseries) # for adf.test
library(MLmetrics) # calculate error

Data Aggregation

Since the aim for this data is mainly for time series exploration, let’s aggregate the data needed:

# Aggregating a crime: Drug Violation
crime_agg <- crime %>% 
  group_by(OFFENSE_CODE_GROUP, OCCURRED_ON_DATE) %>% 
  summarise(number_of_crime = sum(n())) %>% 
  ungroup() %>% 
  filter(OFFENSE_CODE_GROUP == "Drug Violation") %>% 
  select(-OFFENSE_CODE_GROUP)

crime_agg

# A tibble: 12,353 x 2
   OCCURRED_ON_DATE    number_of_crime
   <fct>                         <int>
 1 2015-06-15 00:01:00               2
 2 2015-06-15 07:30:00               1
 3 2015-06-15 11:09:00               1
 4 2015-06-15 15:00:00               1
 5 2015-06-15 15:18:00               1
 6 2015-06-15 18:00:00               1
 7 2015-06-15 18:46:00               1
 8 2015-06-15 19:00:00               1
 9 2015-06-15 19:48:00               1
10 2015-06-15 20:23:00               1
# … with 12,343 more rows

Data Preprocessing

# Convert `OCCURED_ON_DATE` into datetime using lubridate
crime_agg$OCCURRED_ON_DATE <- as_datetime(crime_agg$OCCURRED_ON_DATE)

crime_agg

# A tibble: 12,353 x 2
   OCCURRED_ON_DATE    number_of_crime
   <dttm>                        <int>
 1 2015-06-15 00:01:00               2
 2 2015-06-15 07:30:00               1
 3 2015-06-15 11:09:00               1
 4 2015-06-15 15:00:00               1
 5 2015-06-15 15:18:00               1
 6 2015-06-15 18:00:00               1
 7 2015-06-15 18:46:00               1
 8 2015-06-15 19:00:00               1
 9 2015-06-15 19:48:00               1
10 2015-06-15 20:23:00               1
# … with 12,343 more rows

# Extracting just the date from previous aggregation
daily_crime <- crime_agg %>% 
  mutate(year = year(OCCURRED_ON_DATE),
         month = month(OCCURRED_ON_DATE),
         day = day(OCCURRED_ON_DATE),
         date = paste0(year,"-",month,"-",day),
         date = ymd(date)) %>% 
  select(-OCCURRED_ON_DATE, -year, -month, -day) %>% 
  group_by(date) %>% 
  summarise(number_of_crime = sum(number_of_crime))

daily_crime

# A tibble: 1,170 x 2
   date       number_of_crime
   <date>               <int>
 1 2015-06-15              11
 2 2015-06-16              24
 3 2015-06-17               8
 4 2015-06-18               8
 5 2015-06-19              17
 6 2015-06-20              18
 7 2015-06-22              20
 8 2015-06-23              20
 9 2015-06-24               9
10 2015-06-25              13
# … with 1,160 more rows

# re-check the date range
range(daily_crime$date)

[1] "2015-06-15" "2018-09-03"

library(ggplot2)
daily_crime %>% 
  ggplot(aes(x=date, y=number_of_crime))+
  geom_line()

This data is also ideal for time series analysis, because 1. No N/A 2. This data is appropriately arranged by date (sequential) 3. No missing time value

2. E D A

Time Series (TS) Object

# time series object
drug_ts <- ts(data=daily_crime$number_of_crime,
   start=c(2015, 6, 15),
   end=c(2018, 9, 3),
   frequency = 30)

drug_ts

Time Series:
Start = c(2015, 6) 
End = c(2018, 9) 
Frequency = 30 
 [1] 11 24  8  8 17 18 20 20  9 13 12 15 10 15 18 17 19 24 15  9 16 12 17 18 17
[26] 18  3 10 26 28 11 26 16  5 28 25 29 21 19 11  6 19 19 26 24 11 11  3 24 17
[51] 23 13 19 14  9 21 18  9 29 25 10  7 15 22 19 28 20 11  6 16 28 19 15  8  6
 [ reached getOption("max.print") -- omitted 19 entries ]

Trend 2015-2018

# Trend of drug-related crime using library(TSstudio)
library(TSstudio)
drug_ts %>% ts_plot()

Note: 1. There is no apparent ascending nor declining trend each year (it stays quite constant)
2. The number of drug-related crime can be significantly varied each month, although the range stays around under 10 to 30 each month

To take a closer look at the graph during a 1-year span:

# Trend during 2017-2018
window(drug_ts, start=2017, end=2018) %>% autoplot()

Decompose TS Object

The general objectives of decomposing TS object will help us understand: > 1. Data is the pattern of the original data 2. Seasonal (S) is the frequency of our data pattern
3. Trend (T) is the generic/global movement of mean
4. Remainder / Error (E) is the value that is not captured by Trend or Seasonal

# decompose time series
drug_dc <- decompose(drug_ts, type = "additive")
autoplot(drug_dc)

Note: - I have found that if I did not specifically set the start and end date in the date_ts object, the Time shown in decomposition graph will not show the proper year format.
- From the graph above, I concluded that this data is “Additive Seasonality with No Trend”

3. Forecasting Model & Evaluation

Exploration of forecast model

Simple Exponential Smoothing

Because our data has no trend nor seasonality (seen from the previous graph). SES calculates the Error, Trend, and Seasonal (ETS) functions and its parameters are y = time series object, model = consists of 3 alphabet below: - A = Additive, M = Multiplicative, N = None, Z = Auto
in our data example, we will use model = “ANN” because the data is additive, with no trend & seasonality. and the last parameter is alpha (optional).

I will try to calculate ETS twice, without & with alpha value. ### SES {.tabset}

library(forecast)

# without alpha
drug_ets <- ets(y= drug_ts, model = "ANN") # also called "ETS, as in error, trend, seasonal "
drug_ets

ETS(A,N,N) 

Call:
 ets(y = drug_ts, model = "ANN") 

  Smoothing parameters:
    alpha = 1e-04 

  Initial states:
    l = 16.7546 

  sigma:  7.286

     AIC     AICc      BIC 
804.4070 804.6736 812.0369

# with alpha
drug_ets2 <- ets(y= drug_ts, model = "ANN", alpha = 0.5)
drug_ets2

ETS(A,N,N) 

Call:
 ets(y = drug_ts, model = "ANN", alpha = 0.5) 

  Smoothing parameters:
    alpha = 0.5 

  Initial states:
    l = 14.0831 

  sigma:  8.3151

     AIC     AICc      BIC 
827.2468 827.3787 832.3334

# ETS visualization
drug_ts %>% autoplot()+
  autolayer(drug_ets$fitted, series = "Alpha = 0.0001")+
  autolayer(drug_ets2$fitted, series = "Alpha = 0.5")

SES Model Evaluation

sqrt(drug_ets$mse)

[1] 7.208049

sqrt(drug_ets2$mse)

[1] 8.226208

The RMSE value is lower on the first SES model (the one without alpha parameter). However, I will explore another model, which is ARIMA.

Auto-ARIMA

Train & Test Data

Before we move further into Auto-ARIMA and ARIMA model, it’s best if we set the train and test data first for later reference. - Train data : around 70% of the data in the drug_ts
- Test data : the rest of the time series data

drug_train <- head(drug_ts, (0.7*length(drug_ts)))
drug_test <- tail(drug_ts, length(drug_ts)-length(drug_train))

Stationary Check

# mencoba dengan seasonal period (frequency)
adf.test(x = drug_train)


    Augmented Dickey-Fuller Test

data:  drug_train
Dickey-Fuller = -6.0836, Lag order = 3, p-value = 0.01
alternative hypothesis: stationary

p-value < alpha (0.05). Therefore, we don’t have to do differencing.

# using Auto-Arima
drug_auto <- auto.arima(y = drug_train, seasonal = FALSE)
summary(drug_auto)

Series: drug_train 
ARIMA(0,0,0) with non-zero mean 

Coefficients:
         mean
      16.4769
s.e.   0.8185

sigma^2 estimated as 44.22:  log likelihood=-214.88
AIC=433.75   AICc=433.95   BIC=438.1

Training set error measures:
                        ME     RMSE      MAE       MPE     MAPE      MASE
Training set -1.858249e-15 6.598619 5.455621 -28.85181 51.66002 0.8125393
                   ACF1
Training set 0.09081785

# Creating forecast model with Auto-ARIMA
f_auto <- forecast(object = drug_auto, h=30)

f_auto %>% 
  autoplot()+
  autolayer(drug_test)

4. Assumption Check

Normal Distribution

hist(f_auto$residuals)

Autocol Check

Box.test(drug_auto$residuals, type = "Ljung-Box")


    Box-Ljung test

data:  drug_auto$residuals
X-squared = 0.56124, df = 1, p-value = 0.4538

Time Series Analysis using Boston’s Drug Abuse Record