We will use this data to analyze the behaviour of criminals, especially those who steal vehicles. Is vehicle theft decreasing over the years? Does it show a repeating pattern every month? Based on the analysis, we can focus the investigation on what actually happens in a certain time frame rather than sifting through all of our data.
Time series analysis is a method of analyzing data whose values are affected by time. After we analyze and understand the data, we can predict future values, which in this context is called 'forecasting'.
options(scipen = 9999)
library(dplyr)
library(forecast)
library(plotly)
library(TTR)
library(MLmetrics)
library(tseries)
library(lubridate)
library(padr)

data = read.csv('Crimes_-_2001_to_Present.csv')
head(data)

Our data is recorded down to the minute, which is why we have over 335,000 observations.
Parsing the date-times and assigning them back to the Date column:
data$Date <- mdy_hms(data$Date)

Extracting the date without the time:
data <- data %>% mutate(Date = date(Date))

Next we aggregate by month. We also remove everything from 2020 onward, because the corona pandemic might affect the pattern. Group it by month:
data$Date <- floor_date(data$Date, unit = 'month')
GTA <- data %>% group_by(Date) %>% summarise(number_of_gta = n()) %>% arrange() %>% filter(Date < '2020-01-01')
head(GTA)

Filling gaps between time frames, if any:
GTA <- GTA %>% pad()

Check for missing values:
colSums(is.na(GTA))

#>          Date number_of_gta 
#>             0             0
Great, we have no NAs, so we can move on to the time series analysis.
Unlike the previous machine learning models, we do not work with a data frame; instead, we work with a ts (time series) object, which we can create with the ts() function.
GTA_ts <- ts(GTA$number_of_gta, frequency = 12)

GTA_ts %>% autoplot()

From the plot above, we can say that automobile theft in Chicago has been trending downward for the last 20 years, although it has been plateauing since 2015.
To see the trend more clearly, and to identify the seasonality of our data, we can use the decompose() function.
GTA_ts %>% decompose() %>% autoplot()

1. The trend is decreasing overall, although it levels off around 2015, increases, and then decreases again.
2. The seasonal component shows that there is a repeating pattern.
Before we build the model, look at the decompose plot again. Does the data have a trend? Yes, a downward trend. Does the data have seasonality? Yes, it does. Are the trend and seasonality increasing or decreasing exponentially? Not really. Based on this, we can say that our data is an additive time series.
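For reference, decompose() fits an additive decomposition by default, which is what we used above; if the seasonal swings grew with the level of the series, we would pass type = 'multiplicative' instead. A minimal sketch of the two calls:

# additive decomposition (the default, used above)
GTA_ts %>% decompose(type = "additive") %>% autoplot()

# hypothetical alternative, only if the seasonality scaled with the trend
GTA_ts %>% decompose(type = "multiplicative") %>% autoplot()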
Time series data MUST stay ordered in time, so unlike other models, we cannot split or subset the data randomly; instead, we cut a certain number of observations off the end. Here we cut 48 observations, equal to 4 years.
GTA_train <- GTA_ts %>% head(-48)
GTA_test <- GTA_ts %>% tail(48)

Because we identified that our time series has a trend and seasonality, we pass 'ZZZ' to the model, which means we let the model decide whether each component is additive or multiplicative. If there is no trend and/or seasonality, put 'N' for that component; use 'M' or 'A' if you are certain the trend/seasonality is multiplicative or additive.
GTA_ets1 <- ets(GTA_train, model = 'ZZZ')
GTA_ets1

#> ETS(M,A,M)
#>
#> Call:
#> ets(y = GTA_train, model = "ZZZ")
#>
#> Smoothing parameters:
#> alpha = 0.8394
#> beta = 0.0001
#> gamma = 0.0001
#>
#> Initial states:
#> l = 2332.415
#> b = -11.1716
#> s = 1.0256 1.0096 1.071 1.0243 1.0654 1.0753
#> 1.028 0.9956 0.9323 0.9711 0.819 0.9828
#>
#> sigma: 0.0742
#>
#> AIC AICc BIC
#> 2647.715 2651.492 2701.995
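Since the automatic search settled on ETS(M,A,M), we could also write that specification out explicitly; a minimal sketch (GTA_ets_explicit is just an illustrative name, and the fit should match the automatic one here):

# hypothetical: force multiplicative error, additive trend, multiplicative season
GTA_ets_explicit <- ets(GTA_train, model = "MAM")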
If our data did not show a trend or seasonality, we would set beta and/or gamma to FALSE.
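For illustration only (we do not do this here, since our data has both components), the calls would look like this:

# hypothetical: no trend and no seasonality -> simple exponential smoothing
HoltWinters(GTA_train, beta = FALSE, gamma = FALSE)

# hypothetical: seasonality but no trend
HoltWinters(GTA_train, beta = FALSE)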
GTA_hw <- HoltWinters(GTA_train)
GTA_hw

#> Holt-Winters exponential smoothing with trend and additive seasonal component.
#>
#> Call:
#> HoltWinters(x = GTA_train)
#>
#> Smoothing parameters:
#> alpha: 0.6014963
#> beta : 0
#> gamma: 0.8045079
#>
#> Coefficients:
#> [,1]
#> a 783.02650
#> b -17.22188
#> s1 128.19683
#> s2 -103.17464
#> s3 15.13408
#> s4 -78.51063
#> s5 38.73962
#> s6 127.25011
#> s7 116.97321
#> s8 118.56920
#> s9 46.03325
#> s10 79.37650
#> s11 88.93086
#> s12 130.33719
Stationarity
GTA_train %>% adf.test()

#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -3.8031, Lag order = 5, p-value = 0.02037
#> alternative hypothesis: stationary
The p-value < alpha, so based on this test we do not need to difference our data: d = 0.
GTA_train %>% tsdisplay()

We have 2 significant spikes in the PACF, so p = 2. Because the ACF plot shows significant autocorrelation at every lag, we cannot be sure how many lags we should use for our q value, so we let the model determine it by itself and then tune it.
GTA_arima <- auto.arima(GTA_train)
GTA_arima

#> Series: GTA_train
#> ARIMA(1,1,0)(1,0,0)[12] with drift
#>
#> Coefficients:
#> ar1 sar1 drift
#> -0.1299 0.5888 -4.4184
#> s.e. 0.0763 0.0656 19.0426
#>
#> sigma^2 estimated as 16957: log likelihood=-1126.63
#> AIC=2261.26 AICc=2261.49 BIC=2274.01
GTA_train %>% adf.test(k = 12)

#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -2.8999, Lag order = 12, p-value = 0.2001
#> alternative hypothesis: stationary
We have significant autocorrelation at lag 12 (the seasonal lag), so we apply differencing.
GTA_train %>% diff(lag=12) %>% diff(lag=1) %>% adf.test()

#>
#> Augmented Dickey-Fuller Test
#>
#> data: .
#> Dickey-Fuller = -5.9058, Lag order = 5, p-value = 0.01
#> alternative hypothesis: stationary
After one seasonal difference and one regular difference the series is stationary, so D = 1 and d = 1.
Determining P, p, Q, q
GTA_train %>%
diff(lag=12) %>%
diff() %>%
  tsdisplay(lag.max = 24)

P = 1, p = 1, 2, 3
Q = 1, q = 1
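One way to choose among the candidate non-seasonal AR orders is to fit each candidate and compare AIC. A minimal sketch (the grid below is illustrative: it holds d = 1, q = 0 and the seasonal (1,1,1)[12] part fixed; the q candidates could be swept the same way):

# fit each candidate p and print its AIC; pick the lowest
for (p in c(1, 2, 3)) {
  fit <- Arima(GTA_train, order = c(p, 1, 0), seasonal = c(1, 1, 1))
  cat("ARIMA(", p, ",1,0)(1,1,1)[12]  AIC:", AIC(fit), "\n")
}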
GTAm_arima <- Arima(GTA_train, order = c(2,1,0), seasonal = c(1,1,1))
GTAm_arima

#> Series: GTA_train
#> ARIMA(2,1,0)(1,1,1)[12]
#>
#> Coefficients:
#> ar1 ar2 sar1 sma1
#> -0.0799 -0.2121 0.1562 -0.8220
#> s.e. 0.0756 0.0757 0.1052 0.0855
#>
#> sigma^2 estimated as 14018: log likelihood=-1037.49
#> AIC=2084.98 AICc=2085.36 BIC=2100.57
As you can see, just by tuning the model a little bit, the fit improves considerably: the AIC drops from 2261 to 2084. Let's compare all of our models to find the clear winner.
ets_forecast <- forecast(GTA_ets1, h = 48)
hw_forecast <- forecast(GTA_hw, h = 48)
arima_forecast <- forecast(GTA_arima, h = 48)
arima2_forecast <- forecast(GTAm_arima, h = 48)

accuracy(ets_forecast, GTA_test)

#> ME RMSE MAE MPE MAPE MASE
#> Training set 3.475892 109.7196 83.00188 -0.01219992 5.492201 0.3914523
#> Test set 256.606272 282.6839 259.71412 29.83642031 30.238535 1.2248603
#> ACF1 Theil's U
#> Training set 0.05826485 NA
#> Test set 0.68357877 2.957616
accuracy(hw_forecast, GTA_test)

#> ME RMSE MAE MPE MAPE MASE
#> Training set 13.14234 138.4374 105.8934 0.7736224 7.409186 0.499413
#> Test set 447.08831 481.6953 447.0883 53.4072553 53.407255 2.108552
#> ACF1 Theil's U
#> Training set 0.2536124 NA
#> Test set 0.8092334 5.225746
accuracy(arima_forecast, GTA_test)

#> ME RMSE MAE MPE MAPE MASE
#> Training set -0.427566 128.7636 96.52157 -0.3232384 6.578428 0.4552138
#> Test set -94.903225 151.9128 126.21812 -13.1280100 16.066541 0.5952682
#> ACF1 Theil's U
#> Training set -0.02265706 NA
#> Test set 0.66001549 1.705504
accuracy(arima2_forecast, GTA_test)

#> ME RMSE MAE MPE MAPE MASE
#> Training set -0.8532242 112.6682 84.97258 -0.1920874 5.908203 0.4007465
#> Test set 166.0064844 185.2794 167.22471 19.0620748 19.215736 0.7886629
#> ACF1 Theil's U
#> Training set -0.006885288 NA
#> Test set 0.333022690 1.856731
Looking at the accuracy tables, the ARIMA models are the clear winners: their test-set errors are much smaller than those of the exponential smoothing models, and the gap between their training and test errors is also the smallest, i.e. they overfit the least. Note that on this test window the auto ARIMA actually has a lower test RMSE than the tuned model, even though the tuned model has the better AIC and training error.
Let's compare both of our ARIMA forecasts to our test data.
arima_forecast %>% autoplot() + autolayer(GTA_test)

arima2_forecast %>% autoplot() + autolayer(GTA_test)

The tuned model's forecast follows the seasonal shape of the actual values more closely, and its prediction intervals are narrower, so the forecast is also more confident.
Even though auto.arima alone already gives us a good model compared to the exponential smoothing models, rather than treating it as the final model, use it as a baseline and do further analysis; that way we can actually produce a better model.