Time Series Analysis using Boston’s Drug Abuse Record
About the Data
This data is provided by Boston Police Department (BPD), which is about the crime incidents that happened within June 14, 2015 until September 3, 2018.
Compared to the Chicago crime data, this data size is much smaller (55.3 MB, as opposed to the almost 1 GB data), while still maintaining a vast amount of informations (see the section of 1. Read the Data below).
1. Read the Data
The Data
INCIDENT_NUMBER OFFENSE_CODE OFFENSE_CODE_GROUP OFFENSE_DESCRIPTION
1 I182070945 619 Larceny LARCENY ALL OTHERS
2 I182070943 1402 Vandalism VANDALISM
3 I182070941 3410 Towed TOWED MOTOR VEHICLE
4 I182070940 3114 Investigate Property INVESTIGATE PROPERTY
DISTRICT REPORTING_AREA SHOOTING OCCURRED_ON_DATE YEAR MONTH DAY_OF_WEEK
1 D14 808 2018-09-02 13:00:00 2018 9 Sunday
2 C11 347 2018-08-21 00:00:00 2018 8 Tuesday
3 D4 151 2018-09-03 19:27:00 2018 9 Monday
4 D4 272 2018-09-03 21:16:00 2018 9 Monday
HOUR UCR_PART STREET Lat Long Location
1 13 Part One LINCOLN ST 42.35779 -71.13937 (42.35779134, -71.13937053)
2 0 Part Two HECLA ST 42.30682 -71.06030 (42.30682138, -71.06030035)
3 19 Part Three CAZENOVE ST 42.34659 -71.07243 (42.34658879, -71.07242943)
4 21 Part Three NEWCOMB ST 42.33418 -71.07866 (42.33418175, -71.07866441)
[ reached 'max' / getOption("max.print") -- omitted 319069 rows ]
The column names are mostly quite explanatory, but here are the explanations either way (quoted directly from BPD website): - INCIDENT_NUMBER : Internal BPD report number - OFFENSE_CODE : Numerical code of offense description
- OFFENSE_CODE_GROUP : Internal categorization of [offense_description]
- OFFENSE_DESCRIPTION : Primary descriptor of incident
- DISTRICT : What district the crime was reported in
- REPORTING_AREA : RA number associated with the where the crime was reported from.
- SHOOTING : Indicated a shooting took place.
- OCCURRED_ON_DATE: Earliest date and time the incident could have taken place
- UCR_PART : Universal Crime Reporting Part number (1,2, 3)
- STREET : Street name the incident took place
CODE NAME
1 612 LARCENY PURSE SNATCH - NO FORCE
2 613 LARCENY SHOPLIFTING
3 615 LARCENY THEFT OF MV PARTS & ACCESSORIES
4 1731 INCEST
5 3111 LICENSE PREMISE VIOLATION
6 2646 LIQUOR - DRINKING IN PUBLIC
7 2204 LIQUOR LAW VIOLATION
8 3810 M/V ACCIDENT - INVOLVING \xcaBICYCLE - INJURY
9 3801 M/V ACCIDENT - OTHER
10 3807 M/V ACCIDENT - OTHER CITY VEHICLE
11 3803 M/V ACCIDENT - PERSONAL INJURY
12 3805 M/V ACCIDENT - POLICE VEHICLE
13 3802 M/V ACCIDENT - PROPERTY \xcaDAMAGE
14 3205 M/V PLATES - LOST
15 123 MANSLAUGHTER - NON-VEHICLE - NEGLIGENCE
16 121 MANSLAUGHTER - VEHICLE - NEGLIGENCE
17 3501 MISSING PERSON
18 3502 MISSING PERSON - LOCATED
19 3503 MISSING PERSON - NOT REPORTED - LOCATED
20 111 MURDER, NON-NEGLIGIENT MANSLAUGHTER
21 3303 NOISY PARTY/RADIO-ARREST
22 2623 OBSCENE MATERIALS - PORNOGRAPHY
23 2628 OBSCENE PHONE CALLS
24 1711 OPEN & GROSS LEWDNESS
25 2007 VIOL. OF RESTRAINING ORDER W NO ARREST
26 2102 OPERATING UNDER THE INFLUENCE DRUGS
27 2660 OTHER OFFENSE
28 2900 VAL - VIOLATION OF AUTO LAW - OTHER
29 1109 FRAUD - WIRE
30 2616 POSSESSION OF BURGLARIOUS TOOLS
31 2606 PRISONER ATTEMPT TO RESCUE
32 2636 PRISONER ESCAPE / ESCAPE & RECAPTURE
33 3029 PRISONER - SUICIDE / SUICIDE ATTEMPT
34 3106 PROPERTY - ACCIDENTAL DAMAGE
35 1300 STOLEN PROPERTY - BUYING / RECEIVING / POSSESSING
36 3207 PROPERTY - FOUND
37 3201 PROPERTY - LOST
[ reached 'max' / getOption("max.print") -- omitted 539 rows ]
Library
Data Aggregation
Since the aim for this data is mainly for time series exploration, let’s aggregate the data needed:
# Aggregating a crime: Drug Violation
crime_agg <- crime %>%
group_by(OFFENSE_CODE_GROUP, OCCURRED_ON_DATE) %>%
summarise(number_of_crime = sum(n())) %>%
ungroup() %>%
filter(OFFENSE_CODE_GROUP == "Drug Violation") %>%
select(-OFFENSE_CODE_GROUP)
crime_agg# A tibble: 12,353 x 2
OCCURRED_ON_DATE number_of_crime
<fct> <int>
1 2015-06-15 00:01:00 2
2 2015-06-15 07:30:00 1
3 2015-06-15 11:09:00 1
4 2015-06-15 15:00:00 1
5 2015-06-15 15:18:00 1
6 2015-06-15 18:00:00 1
7 2015-06-15 18:46:00 1
8 2015-06-15 19:00:00 1
9 2015-06-15 19:48:00 1
10 2015-06-15 20:23:00 1
# … with 12,343 more rows
Data Preprocessing
# Convert `OCCURED_ON_DATE` into datetime using lubridate
crime_agg$OCCURRED_ON_DATE <- as_datetime(crime_agg$OCCURRED_ON_DATE)
crime_agg# A tibble: 12,353 x 2
OCCURRED_ON_DATE number_of_crime
<dttm> <int>
1 2015-06-15 00:01:00 2
2 2015-06-15 07:30:00 1
3 2015-06-15 11:09:00 1
4 2015-06-15 15:00:00 1
5 2015-06-15 15:18:00 1
6 2015-06-15 18:00:00 1
7 2015-06-15 18:46:00 1
8 2015-06-15 19:00:00 1
9 2015-06-15 19:48:00 1
10 2015-06-15 20:23:00 1
# … with 12,343 more rows
# Extracting just the date from previous aggregation
daily_crime <- crime_agg %>%
mutate(year = year(OCCURRED_ON_DATE),
month = month(OCCURRED_ON_DATE),
day = day(OCCURRED_ON_DATE),
date = paste0(year,"-",month,"-",day),
date = ymd(date)) %>%
select(-OCCURRED_ON_DATE, -year, -month, -day) %>%
group_by(date) %>%
summarise(number_of_crime = sum(number_of_crime))
daily_crime# A tibble: 1,170 x 2
date number_of_crime
<date> <int>
1 2015-06-15 11
2 2015-06-16 24
3 2015-06-17 8
4 2015-06-18 8
5 2015-06-19 17
6 2015-06-20 18
7 2015-06-22 20
8 2015-06-23 20
9 2015-06-24 9
10 2015-06-25 13
# … with 1,160 more rows
[1] "2015-06-15" "2018-09-03"
This data is also ideal for time series analysis, because 1. No N/A 2. This data is appropriately arranged by date (sequential) 3. No missing time value
2. E D A
Time Series (TS) Object
# time series object
drug_ts <- ts(data=daily_crime$number_of_crime,
start=c(2015, 6, 15),
end=c(2018, 9, 3),
frequency = 30)
drug_tsTime Series:
Start = c(2015, 6)
End = c(2018, 9)
Frequency = 30
[1] 11 24 8 8 17 18 20 20 9 13 12 15 10 15 18 17 19 24 15 9 16 12 17 18 17
[26] 18 3 10 26 28 11 26 16 5 28 25 29 21 19 11 6 19 19 26 24 11 11 3 24 17
[51] 23 13 19 14 9 21 18 9 29 25 10 7 15 22 19 28 20 11 6 16 28 19 15 8 6
[ reached getOption("max.print") -- omitted 19 entries ]
Trend 2015-2018
Note: 1. There is no apparent ascending nor declining trend each year (it stays quite constant)2. The number of drug-related crime can be significantly varied each month, although the range stays around under 10 to 30 each month
To take a closer look at the graph during a 1-year span:
Decompose TS Object
The general objectives of decomposing TS object will help us understand: > 1. Data is the pattern of the original data 2. Seasonal (S) is the frequency of our data pattern
3. Trend (T) is the generic/global movement of mean
4. Remainder / Error (E) is the value that is not captured by Trend or Seasonal
Note: - I have found that if I did not specifically set the start and end date in the date_ts object, the Time shown in decomposition graph will not show the proper year format.
- From the graph above, I concluded that this data is “Additive Seasonality with No Trend”
3. Forecasting Model & Evaluation
Exploration of forecast model
Simple Exponential Smoothing
Because our data has no trend nor seasonality (seen from the previous graph). SES calculates the Error, Trend, and Seasonal (ETS) functions and its parameters are y = time series object, model = consists of 3 alphabet below: - A = Additive, M = Multiplicative, N = None, Z = Auto
in our data example, we will use model = “ANN” because the data is additive, with no trend & seasonality. and the last parameter is alpha (optional).
I will try to calculate ETS twice, without & with alpha value. ### SES {.tabset}
library(forecast)
# without alpha
drug_ets <- ets(y= drug_ts, model = "ANN") # also called "ETS, as in error, trend, seasonal "
drug_etsETS(A,N,N)
Call:
ets(y = drug_ts, model = "ANN")
Smoothing parameters:
alpha = 1e-04
Initial states:
l = 16.7546
sigma: 7.286
AIC AICc BIC
804.4070 804.6736 812.0369
ETS(A,N,N)
Call:
ets(y = drug_ts, model = "ANN", alpha = 0.5)
Smoothing parameters:
alpha = 0.5
Initial states:
l = 14.0831
sigma: 8.3151
AIC AICc BIC
827.2468 827.3787 832.3334
# ETS visualization
drug_ts %>% autoplot()+
autolayer(drug_ets$fitted, series = "Alpha = 0.0001")+
autolayer(drug_ets2$fitted, series = "Alpha = 0.5")SES Model Evaluation
[1] 7.208049
[1] 8.226208
The RMSE value is lower on the first SES model (the one without alpha parameter). However, I will explore another model, which is ARIMA.
Auto-ARIMA
Train & Test Data
Before we move further into Auto-ARIMA and ARIMA model, it’s best if we set the train and test data first for later reference. - Train data : around 70% of the data in the drug_ts
- Test data : the rest of the time series data
Stationary Check
Augmented Dickey-Fuller Test
data: drug_train
Dickey-Fuller = -6.0836, Lag order = 3, p-value = 0.01
alternative hypothesis: stationary
p-value < alpha (0.05). Therefore, we don’t have to do
differencing.
Series: drug_train
ARIMA(0,0,0) with non-zero mean
Coefficients:
mean
16.4769
s.e. 0.8185
sigma^2 estimated as 44.22: log likelihood=-214.88
AIC=433.75 AICc=433.95 BIC=438.1
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -1.858249e-15 6.598619 5.455621 -28.85181 51.66002 0.8125393
ACF1
Training set 0.09081785
4. Assumption Check
Autocol Check
Box-Ljung test
data: drug_auto$residuals
X-squared = 0.56124, df = 1, p-value = 0.4538