We will use this data to analyze the behaviour of criminals, especially those who steal vehicles. Is vehicle theft decreasing over the years? Does it show a repeating pattern every month? Based on the analysis, we can focus the investigation on what actually happens in a certain time frame rather than sifting through all of our data.
Time series analysis is a method of analyzing data whose values are affected by time. After we analyze and understand the data, we can predict future values, which in this context is called 'forecasting'.
options(scipen = 9999)
library(dplyr)
library(forecast)
library(plotly)
library(TTR)
library(MLmetrics)
library(tseries)
library(lubridate)
library(padr)

data = read.csv('Crimes_-_2001_to_Present.csv')
head(data)

Our data is recorded down to the minute, which is why we have over 335,000 observations.
Parsing the date-times and assigning them back to the Date column:
data$Date <- mdy_hms(data$Date)

Extracting the date without the time:
data <- data %>% mutate(Date = date(Date))

Next we aggregate by month. We also remove everything from 2020 onward, because the corona pandemic might affect the pattern. Group it by month:
data$Date <- floor_date(data$Date, unit = 'month')
GTA <- data %>% group_by(Date) %>% summarise(number_of_gta = n()) %>% arrange() %>% filter(Date < '2020-01-01')
head(GTA)

Filling gaps between time frames, if any:
GTA <- GTA %>% pad()

Check for missing values:
colSums(is.na(GTA))

#>          Date number_of_gta 
#>             0             0
Great, we have no NAs, so we can move on to the time series analysis.
Unlike the previous machine learning models, we do not work with a data frame; instead, we work with a ts (time series) object, which we can create with the ts() function.
GTA_ts <- ts(GTA$number_of_gta, frequency = 12)

GTA_ts %>% autoplot()

From the plot above, we can say that automobile theft in Chicago has been trending downward for the last 20 years, although it has been plateauing since 2015.
To see the trend more clearly, and to identify the seasonality of our data, we can use the decompose() function.
GTA_ts %>% decompose() %>% autoplot()

1. The trend is decreasing overall, although it levels off around 2015, increases, and then decreases again.
2. The seasonal component shows that there is a repeating pattern.
Before we build the model, look at the decompose plot again. Does the data have a trend? Yes, a downward trend. Does the data have seasonality? Yes, it does. Are the trend and seasonality increasing or decreasing exponentially? Not really. Based on this, we can say that our data is an additive time series.
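For reference, decompose() fits an additive decomposition by default, which is what we used above; if the seasonal swings grew with the level of the series, we would pass type = 'multiplicative' instead. A minimal sketch of the two calls:

# additive decomposition (the default, used above)
GTA_ts %>% decompose(type = "additive") %>% autoplot()

# hypothetical alternative, only if the seasonality scaled with the trend
GTA_ts %>% decompose(type = "multiplicative") %>% autoplot()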
Time series data MUST stay ordered in time, so unlike other models, we cannot split or subset the data randomly; instead, we cut a certain number of observations off the end. Here we cut 48 observations, equal to 4 years.
GTA_train <- GTA_ts %>% head(-48)
GTA_test <- GTA_ts %>% tail(48)

Because we identified that our time series has a trend and seasonality, we pass 'ZZZ' to the model, which means we let the model decide whether each component is additive or multiplicative. If there is no trend and/or seasonality, put 'N' for that component; use 'M' or 'A' if you are certain the trend/seasonality is multiplicative or additive.
GTA_ets1 <- ets(GTA_train, model = 'ZZZ')
GTA_ets1

#> ETS(M,A,M)
#>
#> Call:
#> ets(y = GTA_train, model = "ZZZ")
#>
#> Smoothing parameters:
#> alpha = 0.8394
#> beta = 0.0001
#> gamma = 0.0001
#>
#> Initial states:
#> l = 2332.415
#> b = -11.1716
#> s = 1.0256 1.0096 1.071 1.0243 1.0654 1.0753
#> 1.028 0.9956 0.9323 0.9711 0.819 0.9828
#>
#> sigma: 0.0742
#>
#> AIC AICc BIC
#> 2647.715 2651.492 2701.995
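Since the automatic search settled on ETS(M,A,M), we could also write that specification out explicitly; a minimal sketch (GTA_ets_explicit is just an illustrative name, and the fit should match the automatic one here):

# hypothetical: force multiplicative error, additive trend, multiplicative season
GTA_ets_explicit <- ets(GTA_train, model = "MAM")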
If our data did not show a trend or seasonality, we would set beta and/or gamma to FALSE.
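For illustration only (we do not do this here, since our data has both components), the calls would look like this:

# hypothetical: no trend and no seasonality -> simple exponential smoothing
HoltWinters(GTA_train, beta = FALSE, gamma = FALSE)

# hypothetical: seasonality but no trend
HoltWinters(GTA_train, beta = FALSE)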
GTA_hw <- HoltWinters(GTA_train)
GTA_hw

#> Holt-Winters exponential smoothing with trend and additive seasonal component.
#>
#> Call:
#> HoltWinters(x = GTA_train)
#>
#> Smoothing parameters:
#> alpha: 0.6014963
#> beta : 0
#> gamma: 0.8045079
#>
#> Coefficients:
#> [,1]
#> a 783.02650
#> b -17.22188
#> s1 128.19683
#> s2 -103.17464
#> s3 15.13408
#> s4 -78.51063
#> s5 38.73962
#> s6 127.25011
#> s7 116.97321
#> s8 118.56920
#> s9 46.03325
#> s10 79.37650
#> s11 88.93086
#> s12 130.33719
Stationarity
GTA_train %>% adf.test()

#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -3.8031, Lag order = 5, p-value = 0.02037
#> alternative hypothesis: stationary
The p-value < alpha, so based on this test we do not need to difference our data: d = 0.
GTA_train %>% tsdisplay()

We have 2 significant spikes in the PACF, so p = 2. Because the ACF plot shows significant autocorrelation at every lag, we cannot be sure how many lags we should use for our q value, so we let the model determine it by itself and then tune it.
GTA_arima <- auto.arima(GTA_train)
GTA_arima

#> Series: GTA_train
#> ARIMA(1,1,0)(1,0,0)[12] with drift
#>
#> Coefficients:
#> ar1 sar1 drift
#> -0.1299 0.5888 -4.4184
#> s.e. 0.0763 0.0656 19.0426
#>
#> sigma^2 estimated as 16957: log likelihood=-1126.63
#> AIC=2261.26 AICc=2261.49 BIC=2274.01
GTA_train %>% adf.test(k = 12)

#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -2.8999, Lag order = 12, p-value = 0.2001
#> alternative hypothesis: stationary
We have significant autocorrelation at lag 12 (the seasonal lag), so we apply differencing.
GTA_train %>% diff(lag=12) %>% diff(lag=1) %>% adf.test()

#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -5.9058, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
After one seasonal difference and one regular difference the series is stationary, so D = 1 and d = 1.
Determining P, p, Q, q
GTA_train %>%
diff(lag=12) %>%
diff() %>%
  tsdisplay(lag.max = 24)

P = 1, p = 1, 2, 3
Q = 1, q = 1
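One way to choose among the candidate non-seasonal AR orders is to fit each candidate and compare AIC. A minimal sketch (the grid below is illustrative: it holds d = 1, q = 0 and the seasonal (1,1,1)[12] part fixed; the q candidates could be swept the same way):

# fit each candidate p and print its AIC; pick the lowest
for (p in c(1, 2, 3)) {
  fit <- Arima(GTA_train, order = c(p, 1, 0), seasonal = c(1, 1, 1))
  cat("ARIMA(", p, ",1,0)(1,1,1)[12]  AIC:", AIC(fit), "\n")
}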
GTAm_arima <- Arima(GTA_train, order = c(2,1,0), seasonal = c(1,1,1))
GTAm_arima

#> Series: GTA_train
#> ARIMA(2,1,0)(1,1,1)[12]
#>
#> Coefficients:
#> ar1 ar2 sar1 sma1
#> -0.0799 -0.2121 0.1562 -0.8220
#> s.e. 0.0756 0.0757 0.1052 0.0855
#>
#> sigma^2 estimated as 14018: log likelihood=-1037.49
#> AIC=2084.98 AICc=2085.36 BIC=2100.57
As you can see, just by tuning the model a little bit, the fit improves considerably: the AIC drops from 2261 to 2084. Let's compare all of our models to find the clear winner.
ets_forecast <- forecast(GTA_ets1, h = 48)
hw_forecast <- forecast(GTA_hw, h = 48)
arima_forecast <- forecast(GTA_arima, h = 48)
arima2_forecast <- forecast(GTAm_arima, h = 48)

accuracy(ets_forecast, GTA_test)

#> ME RMSE MAE MPE MAPE MASE
#> Training set 3.475892 109.7196 83.00188 -0.01219992 5.492201 0.3914523
#> Test set 256.606272 282.6839 259.71412 29.83642031 30.238535 1.2248603
#> ACF1 Theil's U
#> Training set 0.05826485 NA
#> Test set 0.68357877 2.957616
accuracy(hw_forecast, GTA_test)

#> ME RMSE MAE MPE MAPE MASE
#> Training set 13.14234 138.4374 105.8934 0.7736224 7.409186 0.499413
#> Test set 447.08831 481.6953 447.0883 53.4072553 53.407255 2.108552
#> ACF1 Theil's U
#> Training set 0.2536124 NA
#> Test set 0.8092334 5.225746
accuracy(arima_forecast, GTA_test)

#> ME RMSE MAE MPE MAPE MASE
#> Training set -0.427566 128.7636 96.52157 -0.3232384 6.578428 0.4552138
#> Test set -94.903225 151.9128 126.21812 -13.1280100 16.066541 0.5952682
#> ACF1 Theil's U
#> Training set -0.02265706 NA
#> Test set 0.66001549 1.705504
accuracy(arima2_forecast, GTA_test)

#> ME RMSE MAE MPE MAPE MASE
#> Training set -0.8532242 112.6682 84.97258 -0.1920874 5.908203 0.4007465
#> Test set 166.0064844 185.2794 167.22471 19.0620748 19.215736 0.7886629
#> ACF1 Theil's U
#> Training set -0.006885288 NA
#> Test set 0.333022690 1.856731
Looking at the accuracy tables, the ARIMA models are the clear winners: their test-set errors are much smaller than those of the exponential smoothing models, and the gap between their training and test errors is also the smallest, i.e. they overfit the least. Note that on this test window the auto ARIMA actually has a lower test RMSE than the tuned model, even though the tuned model has the better AIC and training error.
Let's compare both of our ARIMA forecasts to our test data.
arima_forecast %>% autoplot() + autolayer(GTA_test)

arima2_forecast %>% autoplot() + autolayer(GTA_test)

The tuned model's forecast follows the seasonal shape of the actual values more closely, and its prediction intervals are narrower, so the forecast is also more confident.
Even though auto.arima alone already gives us a good model compared to the exponential smoothing models, rather than treating it as the final model, use it as a baseline and do further analysis; that way we can actually produce a better model.