Time Series Analysis covers research problems where observations are collected at regular time intervals and successive observations can be assumed to be correlated. The main idea is to predict the future based on past data; this process is called forecasting. A time series records a single variable observed over time.

Before we work with the time series, we load the required libraries first:

library(data.table) #fread
library(dplyr)      #glimpse, filter, mutate, group_by
library(ggplot2) 
library(plotly)
library(lubridate)
library(tidyquant)  #theme_tq
library(forecast)   #autoplot
library(MLmetrics)  #model evaluation
library(TTR)        #SMA
library(tseries)    #adf.test

1 Chicago Crime Rate

This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days.

Data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. Should you have questions about this dataset, you may contact the Research & Development Division of the Chicago Police Department at PSITAdministration@ChicagoPolice.org.

Resource link: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2

1.1 Data Preparation

Use fread from the data.table package, because the dataset is large.

crime <- fread("data_input/crime.csv")
rmarkdown::paged_table(crime)
glimpse(crime)
## Observations: 6,889,715
## Variables: 30
## $ ID                           <int> 11712689, 11712663, 11712683, 117...
## $ `Case Number`                <chr> "JC294129", "JC294099", "JC294090...
## $ Date                         <chr> "06/05/2019 11:58:00 PM", "06/05/...
## $ Block                        <chr> "039XX W OHIO ST", "001XX W ONTAR...
## $ IUCR                         <chr> "502R", "1220", "0460", "051A", "...
## $ `Primary Type`               <chr> "OTHER OFFENSE", "DECEPTIVE PRACT...
## $ Description                  <chr> "VEHICLE TITLE/REG OFFENSE", "THE...
## $ `Location Description`       <chr> "STREET", "GAS STATION", "CTA PLA...
## $ Arrest                       <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, ...
## $ Domestic                     <lgl> FALSE, FALSE, FALSE, FALSE, TRUE,...
## $ Beat                         <int> 1122, 1832, 623, 835, 334, 1922, ...
## $ District                     <int> 11, 18, 6, 8, 3, 19, 3, 22, 1, 6,...
## $ Ward                         <int> 37, 42, 6, 18, 7, 44, 20, 34, 42,...
## $ `Community Area`             <int> 23, 8, 44, 66, 43, 6, 68, 73, 28,...
## $ `FBI Code`                   <chr> "26", "11", "08B", "04A", "08B", ...
## $ `X Coordinate`               <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Y Coordinate`               <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Year                         <int> 2019, 2019, 2019, 2019, 2019, 201...
## $ `Updated On`                 <chr> "06/12/2019 04:13:52 PM", "06/12/...
## $ Latitude                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Longitude                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Location                     <chr> "", "", "", "", "", "", "", "", "...
## $ `Historical Wards 2003-2015` <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Zip Codes`                  <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Community Areas`            <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Census Tracts`              <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Wards                        <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Boundaries - ZIP Codes`     <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Police Districts`           <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Police Beats`               <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
unique(crime$`Primary Type`)
##  [1] "OTHER OFFENSE"                    
##  [2] "DECEPTIVE PRACTICE"               
##  [3] "BATTERY"                          
##  [4] "ASSAULT"                          
##  [5] "CRIMINAL DAMAGE"                  
##  [6] "OFFENSE INVOLVING CHILDREN"       
##  [7] "THEFT"                            
##  [8] "PUBLIC PEACE VIOLATION"           
##  [9] "INTERFERENCE WITH PUBLIC OFFICER" 
## [10] "ROBBERY"                          
## [11] "NARCOTICS"                        
## [12] "BURGLARY"                         
## [13] "MOTOR VEHICLE THEFT"              
## [14] "CRIMINAL TRESPASS"                
## [15] "CRIM SEXUAL ASSAULT"              
## [16] "SEX OFFENSE"                      
## [17] "GAMBLING"                         
## [18] "HOMICIDE"                         
## [19] "WEAPONS VIOLATION"                
## [20] "CONCEALED CARRY LICENSE VIOLATION"
## [21] "PROSTITUTION"                     
## [22] "LIQUOR LAW VIOLATION"             
## [23] "ARSON"                            
## [24] "OBSCENITY"                        
## [25] "STALKING"                         
## [26] "KIDNAPPING"                       
## [27] "OTHER NARCOTIC VIOLATION"         
## [28] "INTIMIDATION"                     
## [29] "PUBLIC INDECENCY"                 
## [30] "HUMAN TRAFFICKING"                
## [31] "NON-CRIMINAL"                     
## [32] "NON-CRIMINAL (SUBJECT SPECIFIED)" 
## [33] "NON - CRIMINAL"                   
## [34] "RITUALISM"                        
## [35] "DOMESTIC VIOLENCE"

1.2 Process

We’re going to forecast the PUBLIC PEACE VIOLATION category, so we drop the rows and columns we are not using to reduce the size of the data, and then reshape what remains.

Filter the crime type that we want to focus on.

crime_peace <- crime %>% 
  filter(`Primary Type` == "PUBLIC PEACE VIOLATION") %>% 
  mutate(Date = mdy_hms(Date), # convert date time
         Date = date(Date), # subset only date component.
         Month = month(Date)) %>% # create new column Month
  select(`Primary Type`, Month, Year) %>% 
  group_by(Year, Month) %>%
  summarise(total = n()) %>% 
  ungroup() %>% 
  mutate(Date = parse_date_time(paste(Year,Month), orders = "Ym")) %>% 
  select(Date, total)

rmarkdown::paged_table(crime_peace)

Assume that data collection for June 2019 is not yet complete, so we keep only the periods before June 2019.

crime_peace <- crime_peace %>% 
  filter(Date < "2019-06-01") %>% 
  select(Date, total)

2 Time Series

Before we create the time series object, we should know the date range:

range(crime_peace$Date)
## [1] "2001-01-01 UTC" "2019-05-01 UTC"

2.1 Create

  • Create ts object with an appropriate starts and frequency.
crime_ts <- ts(data=crime_peace$total, 
             start=2001, 
             frequency=12) #we have Monthly data and would like to get Year seasonality
crime_ts
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2001 161 173 267 229 239 228 213 215 374 313 174 164
## 2002 186 177 208 183 227 266 246 210 210 243 158 143
## 2003 165 186 228 185 245 226 184 209 232 219 186 165
## 2004 191 199 228 179 240 236 225 209 201 253 181 153
## 2005 173 168 234 246 263 265 287 253 241 240 202 158
## 2006 197 179 272 227 315 291 276 237 280 302 264 228
## 2007 215 208 303 296 353 310 294 244 333 303 240 216
## 2008 198 199 261 280 296 246 286 299 271 284 212 181
## 2009 197 211 288 246 313 275 277 266 315 327 245 187
## 2010 289 230 358 319 342 333 282 309 328 322 251 175
## 2011 197 194 272 270 326 318 296 260 295 258 208 201
## 2012 189 207 300 226 317 258 268 261 285 277 216 203
## 2013 209 210 265 295 366 326 291 249 281 287 186 170
## 2014 169 164 230 238 302 283 300 279 288 298 198 154
## 2015 180 137 249 228 222 241 219 204 217 211 169 145
## 2016 106 111 141 142 164 165 154 124 127 161 116  96
## 2017 123  99 126 129 153 126 148 145 131 129 105  84
## 2018  89  78 119 115 124 137 116 121 112 112 104 144
## 2019  88 109 139 154 154

2.2 Plot

  • Plot ts object
p <- autoplot(crime_ts) + theme_tq()
ggplotly(p)

Based on the plot, we can guess that the series is additive. Next, we would like to read the seasonality of the data, so we mark specific months (January and May) in the plot.

crime_peace2 <- crime_peace %>% 
  mutate(Date = as.Date(Date),
         Month = months(Date))

p <- crime_peace2 %>% 
  ggplot(aes(x = Date, total)) +
  geom_line() +
  geom_point(data = crime_peace2 %>% filter(Month == "January" | Month == "May"),
             aes(x = Date, y = total, color = Month)) +
  scale_color_manual(values = c("red","dodgerblue4")) +
  theme_tq()

plotly::ggplotly(p)

From the visualization above, we can see that there is a trend and seasonality (based on the markers we created), but we will try to prove it using the decompose function.

2.3 Decompose

  • Decompose ts object
crime_dc <- decompose(crime_ts)

str(crime_dc)
## List of 6
##  $ x       : Time-Series [1:221] from 2001 to 2019: 161 173 267 229 239 228 213 215 374 313 ...
##  $ seasonal: Time-Series [1:221] from 2001 to 2019: -44.2 -50.5 16.5 1.3 46.9 ...
##  $ trend   : Time-Series [1:221] from 2001 to 2019: NA NA NA NA NA ...
##  $ random  : Time-Series [1:221] from 2001 to 2019: NA NA NA NA NA ...
##  $ figure  : num [1:12] -44.2 -50.5 16.5 1.3 46.9 ...
##  $ type    : chr "additive"
##  - attr(*, "class")= chr "decomposed.ts"
autoplot(crime_dc) + theme_tq()

This matches our earlier guess: the decompose result also reports type : chr “additive”.
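Since the decomposition is additive, the three components should sum back to the observed series (x = trend + seasonal + random). A quick sanity check, as a sketch using the crime_dc object above (the name recomposed is ours):

# additive decomposition: observed = trend + seasonal + random
recomposed <- crime_dc$trend + crime_dc$seasonal + crime_dc$random
head(na.omit(cbind(actual = crime_dc$x, recomposed = recomposed))) #drop the NA rows at the ends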

2.4 Analysis

  • Analysis of the decomposition plot
  1. Is crime generally rising in Chicago in the past decade (last 10 years)?

No. Looking at the trend component above, this crime type has been decreasing over the past decade.

crime_decade <- crime_ts %>% 
  tail(12*10)
autoplot(crime_decade)

Alternatively, we can show this using the window function:

p <- autoplot(window(crime_ts, start=c(2009,6), end=c(2019,5)),
              xlab = "Year",
              ylab = "Crime Rate",
              main = "Public Peace Violation") 
      
ggplotly(p)
  2. Is there a seasonal component to the crime rate?

Yes, there is a seasonal component for Public Peace Violation.
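A seasonal plot gives another view of the same pattern. As a sketch, using ggseasonplot from the forecast package (not used elsewhere in this write-up):

ggseasonplot(crime_ts) + 
  theme_tq() +
  labs(title = "Seasonal plot: Public Peace Violation") #one line per year, months on the x-axis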

[additional: manual decompose]

#1. Trend
trend <- ma(crime_ts, order = 12, centre = T)

#2. Seasonal: here the series is detrended by division (a multiplicative-style detrend);
#   since decompose() identified the series as additive, subtracting the trend
#   (crime_ts - trend) would be the matching additive alternative
detrend <- crime_ts / trend

m <- t(matrix(data = detrend, nrow = 12))
## Warning in matrix(data = detrend, nrow = 12): data length [221] is not a
## sub-multiple or multiple of the number of rows [12]
seasonal <- colMeans(m, na.rm = T)
seasonal <- seasonal/mean(seasonal)
seasonal <- ts(rep(seasonal,12), 
                   start = start(crime_ts),
                   frequency = 12)
p <- autoplot(seasonal) + 
  theme_tq() +
  labs(y = "",
       title = "Seasonality Component")
ggplotly(p)

Extra content you can use to explore the time series components: https://www.clarusft.com/exploring-seasonality-in-a-time-series-with-rs-ggplot2/

3 Forecasting Model

3.1 Forecast SMA

crime_tsm <- SMA(crime_ts, n=5)
crime_tsm
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
## 2001    NA    NA    NA    NA 213.8 227.2 235.2 224.8 253.8 268.6 257.8
## 2002 242.2 202.8 181.8 183.6 196.2 212.2 226.0 226.4 231.8 235.0 213.4
## 2003 183.8 179.0 176.0 181.4 201.8 214.0 213.6 209.8 219.2 214.0 206.0
## 2004 198.6 192.0 193.8 192.4 207.4 216.4 221.6 217.8 222.2 224.8 213.8
## 2005 192.2 185.6 181.8 194.8 216.8 235.2 259.0 262.8 261.8 257.2 244.6
## 2006 207.6 195.2 201.6 206.6 238.0 256.8 276.2 269.2 279.8 277.2 271.8
## 2007 257.8 243.4 243.6 250.0 275.0 294.0 311.2 299.4 306.8 296.8 282.8
## 2008 258.0 231.2 222.8 230.8 246.8 256.4 273.8 281.4 279.6 277.2 270.4
## 2009 229.0 217.0 217.8 224.6 251.0 266.6 279.8 275.4 289.2 292.0 286.0
## 2010 272.6 255.6 261.8 276.6 307.6 316.4 326.8 317.0 318.8 314.8 298.4
## 2011 254.6 227.8 217.8 221.6 251.8 276.0 296.4 294.0 299.0 285.4 263.4
## 2012 230.2 212.6 221.0 224.6 247.8 261.6 273.8 266.0 277.8 269.8 261.4
## 2013 238.0 223.0 220.6 236.4 269.0 292.4 308.6 305.4 302.6 286.8 258.8
## 2014 218.6 195.2 183.8 194.2 220.6 243.4 270.6 280.4 290.4 289.6 272.6
## 2015 223.6 193.4 183.6 189.6 203.2 215.4 231.8 222.8 220.6 218.4 204.0
## 2016 169.6 148.4 134.4 129.0 132.8 144.6 153.2 149.8 146.8 146.2 136.4
## 2017 124.6 119.0 112.0 114.6 126.0 126.6 136.4 140.2 140.6 135.8 131.6
## 2018 107.6  97.0  95.0  97.0 105.0 114.6 122.2 122.6 122.0 119.6 113.0
## 2019 112.0 111.4 116.8 126.8 128.8                                    
##        Dec
## 2001 248.0
## 2002 192.8
## 2003 202.2
## 2004 199.4
## 2005 218.8
## 2006 262.2
## 2007 267.2
## 2008 249.4
## 2009 268.0
## 2010 277.0
## 2011 244.4
## 2012 248.4
## 2013 234.6
## 2014 243.4
## 2015 189.2
## 2016 124.8
## 2017 118.8
## 2018 118.6
## 2019
autoplot(crime_tsm)

Period: 2001 - 2019.

Try to reproduce the SMA value at May 2001: because the order is 5, it is the mean of the first 5 periods (January-May 2001).

mean(crime_ts[1:5])
## [1] 213.8

We will forecast 3 periods into the future (June, July, and August 2019). Note that mean() takes a single vector, so the values are combined with c():

future1 <- mean(tail(crime_ts, 5)) #mean of the last 5 observations (through May 2019): forecast for June 2019
future2 <- mean(c(tail(crime_ts, 4), future1)) #last 4 observations plus the June forecast: forecast for July 2019
future3 <- mean(c(tail(crime_ts, 3), future1, future2)) #forecast for August 2019
future1; future2; future3
## [1] 128.8
## [1] 136.96
## [1] 142.552

Forecast for September 2019:

mean(c(tail(crime_ts, 2), future1, future2, future3))
## [1] 143.2624
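The manual steps above can be wrapped in a small helper. This is just a sketch (the function sma_forecast is our own, not part of TTR), reproducing the recursion where every new forecast is appended to the series before the next average is taken:

sma_forecast <- function(x, n = 5, h = 3) {
  vals <- as.numeric(x)          #observed values
  fc <- numeric(h)               #container for the forecasts
  for (i in seq_len(h)) {
    fc[i] <- mean(tail(vals, n)) #average of the last n values
    vals <- c(vals, fc[i])       #append the forecast so it feeds the next step
  }
  fc
}

sma_forecast(crime_ts, n = 5, h = 4) #June - September 2019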

Above we forecast the future manually; now let's plot the actual data together with the SMA fit and the manual forecasts:

plot(crime_peace$total,type="l") #plot the actual data
lines(c(crime_tsm, future1, future2, future3), col="dodgerblue4", lty=2)

Based on the plot above, the prediction still has a big gap from the actual data, so we will try another forecasting method.

3.2 Exponential Smoothing

The ets function fits an exponential smoothing (ETS) model. We would like to let R choose all the smoothing components automatically, so we use model = "ZZZ":

crime_ets <- ets(crime_ts, model= "ZZZ")
crime_ets
## ETS(M,Ad,M) 
## 
## Call:
##  ets(y = crime_ts, model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.3903 
##     beta  = 0.0126 
##     gamma = 1e-04 
##     phi   = 0.9651 
## 
##   Initial states:
##     l = 233.2339 
##     b = -0.7026 
##     s = 0.7714 0.8528 1.1212 1.1331 1.0303 1.0949
##            1.1418 1.1889 1.0103 1.0767 0.7763 0.8022
## 
##   sigma:  0.1224
## 
##      AIC     AICc      BIC 
## 2646.639 2650.025 2707.806
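The selected model is ETS(M,Ad,M): multiplicative errors, additive damped trend, and multiplicative seasonality. A forecast could be drawn directly from this model; a sketch (the object name crime_etsf is ours):

crime_etsf <- forecast(crime_ets, h = 12) #12 months ahead
autoplot(crime_etsf) + theme_tq()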

From the visualization, we know that the data has trend and seasonality, but let us start with the SES method first:

3.2.1 Simple Exponential Smoothing

crime_ses <- HoltWinters(crime_ts, beta = F, gamma=F) # Simple Exponential Smoothing
crime_ses
## Holt-Winters exponential smoothing without trend and without seasonal component.
## 
## Call:
## HoltWinters(x = crime_ts, beta = F, gamma = F)
## 
## Smoothing parameters:
##  alpha: 0.75233
##  beta : FALSE
##  gamma: FALSE
## 
## Coefficients:
##       [,1]
## a 152.5886
MLmetrics::MAE(crime_ses$fitted[,1], crime_ts)
## [1] 31.19513
crime_ses$alpha
## [1] 0.75233

To forecast in R, we can use either of two functions:
1. predict, with arguments: model, n.ahead (how many periods ahead)
2. forecast, with arguments: model, h (horizon, how many periods ahead)

Below we forecast using the crime_ses model (Simple Exponential Smoothing). Notice that all the forecast values are identical; SES is often called a flat forecasting method.

crime_sesfut <- predict(crime_ses, n.ahead = 4) #forecast 4 periods ahead
crime_sesfut
##           Jun      Jul      Aug      Sep
## 2019 152.5886 152.5886 152.5886 152.5886
p <- autoplot(crime_ses$x, series = "Actual") + 
  autolayer(crime_ses$fitted[,1], series = "Fitted") + 
  autolayer(crime_sesfut, series = "Forecast") +
  theme_tq() +
  labs(title = "Simple Exponential Smoothing Forecasting")
ggplotly(p)
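The forecast function should give the same flat point forecasts, plus prediction intervals; a sketch (output not shown):

forecast(crime_ses, h = 4) #equivalent to predict(crime_ses, n.ahead = 4), with 80%/95% intervals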

3.2.2 Holt Exponential Smoothing

Holt = trend, no seasonality

crime_holt <- HoltWinters(crime_ts, gamma=F) # Double Exponential Smoothing / Holt Exponential Smoothing
crime_holt
## Holt-Winters exponential smoothing with trend and without seasonal component.
## 
## Call:
## HoltWinters(x = crime_ts, gamma = F)
## 
## Smoothing parameters:
##  alpha: 0.7799254
##  beta : 0.01828357
##  gamma: FALSE
## 
## Coefficients:
##          [,1]
## a 152.8024890
## b  -0.2744654
MLmetrics::MAE(crime_holt$fitted[,1], crime_ts)
## [1] 32.17858
p <- autoplot(crime_holt$x, series = "Actual") + autolayer(crime_holt$fitted[,1], series = "Fitted") + theme_tq()
ggplotly(p)

By MAE the Holt model is actually slightly worse than SES (32.18 vs 31.20), but unlike SES its forecast follows the downward trend rather than staying flat.

crime_holtfut <- predict(crime_holt, n.ahead = 8)
crime_holtfut
##           Jan Feb Mar Apr May      Jun      Jul      Aug      Sep      Oct
## 2019                          152.5280 152.2536 151.9791 151.7046 151.4302
## 2020 150.6068                                                             
##           Nov      Dec
## 2019 151.1557 150.8812
## 2020
p <- autoplot(crime_holt$x, series = "Actual") + 
  autolayer(crime_holt$fitted[,1], series = "Fitted") + 
  autolayer(crime_holtfut, series = "Forecast") +
  theme_tq() +
  labs(title = "Holt Exponential Smoothing Forecasting")
ggplotly(p)

3.2.3 Holt-Winters Exponential Smoothing

Holt-Winters exponential smoothing = trend + seasonality

crime_hw <- HoltWinters(crime_ts)
crime_hw
## Holt-Winters exponential smoothing with trend and additive seasonal component.
## 
## Call:
## HoltWinters(x = crime_ts)
## 
## Smoothing parameters:
##  alpha: 0.2486149
##  beta : 0.009154466
##  gamma: 0.5870536
## 
## Coefficients:
##            [,1]
## a   121.9660154
## b    -0.9487718
## s1   35.8516568
## s2   32.9112357
## s3   31.9270089
## s4   28.9830934
## s5   37.2742543
## s6   14.4913596
## s7   16.1094443
## s8  -15.6919185
## s9  -15.8591546
## s10  16.8330936
## s11  21.9711542
## s12  31.4387903
crime_hwf <- forecast(crime_hw, h=24) #24 periods ahead = 2 years
autoplot(crime_hwf) + theme_tq()

3.3 Model Evaluation

After creating a few time series models, we would like to compare them and pick the best one. The evaluation is done in two ways: error metrics and assumption checking.

3.3.1 Error

#Mean Absolute Percentage Error
MAPE(crime_ses$fitted[,1], crime_ses$x)*100 
## [1] 14.71812
MAPE(crime_holt$fitted[,1], crime_holt$x)*100
## [1] 15.22716
MAPE(crime_hw$fitted[,1], crime_hw$x)*100   
## [1] 12.15574

Our preliminary visual analysis is confirmed: the data contains both trend and seasonality, and the MAPE results show that Holt-Winters has the smallest error: 12.16%.
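The same comparison can be collected into a small table; a sketch:

# side-by-side MAPE comparison of the three smoothing models
data.frame(
  model = c("SES", "Holt", "Holt-Winters"),
  MAPE  = c(MAPE(crime_ses$fitted[,1],  crime_ses$x),
            MAPE(crime_holt$fitted[,1], crime_holt$x),
            MAPE(crime_hw$fitted[,1],   crime_hw$x)) * 100
)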

3.3.2 Assumption Checking

There are two assumption checks for a time series model:
- Normality of residuals
Hypothesis check:

H0: The residuals are normally distributed (the outcome we want to achieve)
H1: The residuals are not normally distributed

Conclusion: reject H0 if p-value < 0.05

  • No autocorrelation in the residuals
  1. ACF/PACF plots & Ljung-Box test (autocorrelation testing)

We want the errors to have no correlation with previous errors.

The idea is the ordinary correlation concept applied to the residuals of a time series at different lags: the correlation between the residual at time t and at time t-1. It is called autocorrelation because we inspect the correlation of the errors with themselves.

Hypothesis check:
H0: The residuals have no autocorrelation (the outcome we want to achieve)
H1: The residuals have autocorrelation

Conclusion: reject H0 if p-value < 0.05

3.3.2.1 Normality Distribution

There are two ways to inspect normality: graphically and with a statistical test.

resids <- crime_hwf$residuals[!is.na(crime_hwf$residuals)]

plot(resids, pch=19, cex=0.5, ylim=c(-1, 1))
abline(h=0, col="dodgerblue4")

hist(resids, breaks=15, xlim=c(-1, 1))
abline(v=0, lwd=2, col="green")

Check using the Shapiro-Wilk test:

shapiro.test(resids)
## 
##  Shapiro-Wilk normality test
## 
## data:  resids
## W = 0.97364, p-value = 0.0005848

Because the p-value (0.0005848) < 0.05, we reject H0. Hence, the residuals are not normally distributed.

3.3.2.2 No Autocorrelation Residual

Now we check the residuals of the Holt-Winters model fitted to the crime series:

acf(resids, lag.max=20)

Use Box.test(residuals, lag = how many lags to consider):

Box.test(resids, lag=20, type="Ljung-Box") #lag=20 :compare to 20 previous data
## 
##  Box-Ljung test
## 
## data:  resids
## X-squared = 65.413, df = 20, p-value = 1.003e-06

Because the p-value (1.003e-06) < 0.05, we reject H0: there is autocorrelation in the residuals.
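The lag-1 autocorrelation that these tests summarise can also be computed directly from the residuals; a sketch:

cor(head(resids, -1), tail(resids, -1)) #correlation between e_t and e_(t-1)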

3.3.3 Temporary Conclusion

The model does not pass the assumption checks, so we continue with the next methods, ARIMA and SARIMA. Hopefully we can find a model that passes the assumptions and also has minimal error.

4 ARIMA

ARIMA combines two models: autoregression (AR) and moving average (MA). The prerequisite for using this model is stationary data; if the data is not stationary, we need to difference it (diff) or apply a log transform.
To check whether the data is stationary, we can use the adf.test function, and we can build an ARIMA model automatically with auto.arima(). Additionally, if we want to set the orders p and q manually, we can inspect the ACF and PACF plots.

As a reminder of the structure and types of our data:

glimpse(crime_peace)
## Observations: 221
## Variables: 2
## $ Date  <dttm> 2001-01-01, 2001-02-01, 2001-03-01, 2001-04-01, 2001-05...
## $ total <int> 161, 173, 267, 229, 239, 228, 213, 215, 374, 313, 174, 1...

Show the plot again, comparing the original data and the data after differencing (assuming our data is not yet stationary):

autoplot(crime_ts) + theme_tq() #original data

autoplot(diff(crime_ts)) + theme_tq() #difference the non-stationary data to make it stationary

  1. Check whether the series is stationary:

adf.test(time series data, alternative = "stationary")

Hypothesis check:

H0: The data is not stationary
H1: The data is stationary

Conclusion: reject H0 if p-value < 0.05

adf.test(crime_ts, alternative="s")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  crime_ts
## Dickey-Fuller = -3.0012, Lag order = 6, p-value = 0.1557
## alternative hypothesis: stationary

The p-value (0.1557) > 0.05, so we fail to reject H0: the original series is not stationary. We therefore difference the series and test again; the ADF test below rejects H0 (p-value = 0.01), so the differenced series is stationary.

adf.test(diff(crime_ts), alternative="s")
## Warning in adf.test(diff(crime_ts), alternative = "s"): p-value smaller
## than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff(crime_ts)
## Dickey-Fuller = -8.6432, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
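The forecast package also offers a shortcut for this check; as a sketch (ndiffs and nsdiffs estimate how many regular and seasonal differences are needed, and are not used elsewhere in this write-up):

ndiffs(crime_ts)  #suggested number of first differences
nsdiffs(crime_ts) #suggested number of seasonal differences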

Now we move on to cross validation and build a model from the training dataset.

  • Cross validation: split your data into train & test sets [optional]
#filter data train period before 2015
train <- crime_peace %>% filter(Date < "2015-01-01") %>%
  select(total) %>% 
  ts(frequency = 12, start = 2001)

#filter data test period after 2015
test <- crime_peace %>% filter(Date >= "2015-01-01") %>%
  select(total) %>% 
  ts(frequency = 12, start = 2015)

autoplot(train) + geom_line(data=test, col = "tomato3")

[Additional] split train & test using window function.

train <- window(crime_ts, start = c(2001,1), end = c(2014,12))
test <- window(crime_ts, start = c(2015,1))

p <- autoplot(train) + geom_line(data=test, col = "tomato3") + theme_tq()
ggplotly(p)
auto <- auto.arima(train, seasonal = F)
auto
## Series: train 
## ARIMA(3,1,2) 
## 
## Coefficients:
##          ar1      ar2      ar3      ma1     ma2
##       1.3577  -0.4156  -0.2169  -1.8442  0.9170
## s.e.  0.1024   0.1379   0.0858   0.0852  0.0825
## 
## sigma^2 estimated as 1575:  log likelihood=-850.69
## AIC=1713.38   AICc=1713.91   BIC=1732.09

The result of auto ARIMA is ARIMA(3,1,2)

  • Model Selection

Exponential smoothing vs ARIMA/SARIMA? Choose the best model based on the smallest error.

tsdisplay(diff(train), lag.max = 20)

Analysis:
1. The ACF cuts off and the PACF cuts off -> a mixed ARIMA model.
2. To choose p, focus on the PACF plot: significant at lags 1 and 3.
3. To choose q, focus on the ACF plot: significant at lags 1 and 3.

We would like to create a few benchmark models:

benchmark1 <- Arima(train, order = c(1,1,1))
benchmark2 <- Arima(train, order = c(3,1,1))
benchmark3 <- Arima(train, order = c(1,1,3))
benchmark4 <- Arima(train, order = c(3,1,3))

Compare the models to see which is best (looking for the smallest AIC value):

auto$aic
## [1] 1713.384
benchmark1$aic
## [1] 1726.886
benchmark2$aic
## [1] 1723.104
benchmark3$aic
## [1] 1725.018
benchmark4$aic
## [1] 1707.687

benchmark4, ARIMA(3,1,3), has the smallest AIC: 1707.687.

Check whether the residuals are autocorrelated or not:

Box.test(auto$residuals)
## 
##  Box-Pierce test
## 
## data:  auto$residuals
## X-squared = 0.00043345, df = 1, p-value = 0.9834

H0: No Autocorrelation
H1: Autocorrelation

The conclusion is no autocorrelation, because the p-value (0.9834) > 0.05.

Box.test(benchmark1$residuals)
## 
##  Box-Pierce test
## 
## data:  benchmark1$residuals
## X-squared = 0.039214, df = 1, p-value = 0.843
Box.test(benchmark2$residuals)
## 
##  Box-Pierce test
## 
## data:  benchmark2$residuals
## X-squared = 0.030694, df = 1, p-value = 0.8609
Box.test(benchmark3$residuals)
## 
##  Box-Pierce test
## 
## data:  benchmark3$residuals
## X-squared = 0.005273, df = 1, p-value = 0.9421
Box.test(benchmark4$residuals)
## 
##  Box-Pierce test
## 
## data:  benchmark4$residuals
## X-squared = 0.90538, df = 1, p-value = 0.3413

None of the benchmark models show residual autocorrelation.

  1. Which time series method seems to capture the variation in your time series better? Explain your choice of algorithm and its key assumptions?

benchmark4, ARIMA(3,1,3): it has the smallest AIC and its residuals show no autocorrelation.

Forecast the next period

  1. Calculating error on test set
f <- forecast(auto, h=5) #h=5 because we forecast 5 periods into the future (within the test set)
f
##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2015       155.6567 104.7999 206.5136  77.87784 233.4356
## Feb 2015       167.1424 109.9706 224.3143  79.70571 254.5792
## Mar 2015       191.5905 131.6429 251.5382  99.90850 283.2726
## Apr 2015       219.6517 159.3737 279.9297 127.46449 311.8390
## May 2015       245.0995 184.8148 305.3841 152.90207 337.2969
#compare the Forecast with benchmark4 model
fbenchmark4<- forecast(benchmark4, h=5) 
fbenchmark4
##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2015       170.6725 121.1466 220.1985  94.92913 246.4159
## Feb 2015       172.3697 112.8297 231.9097  81.31108 263.4283
## Mar 2015       209.2824 147.1993 271.3656 114.33440 304.2305
## Apr 2015       218.3751 156.0978 280.6525 123.13015 313.6201
## May 2015       251.3525 189.0495 313.6554 156.06835 346.6366
accuracy(f,test)
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.9116585 38.96875 30.06960 -1.827776 12.74564 1.0072703
## Test set     7.3718151 32.86423 28.66859  1.566922 14.52971 0.9603392
##                      ACF1 Theil's U
## Training set  0.001606262        NA
## Test set     -0.486430138 0.5415479
accuracy(fbenchmark4,test)
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set  0.8688769 37.77806 29.14111 -1.635460 12.27746 0.9761678
## Test set     -1.2104535 27.81980 24.67841 -2.736988 12.87866 0.8266765
##                     ACF1 Theil's U
## Training set -0.07341099        NA
## Test set     -0.41930019 0.4381011

The test-set MAPE of the benchmark4 model is the smallest.

Plot the forecast result

autoplot(f$x) + autolayer(f$fitted) + autolayer(f$mean) + 
  autolayer(fbenchmark4$x) + autolayer(fbenchmark4$fitted) + autolayer(fbenchmark4$mean) +
  geom_line(data=test, col="tomato3", aes(x=x, y=y))

autoplot(fbenchmark4$x) + autolayer(fbenchmark4$fitted) + autolayer(fbenchmark4$mean) + 
  geom_line(data=test, col="tomato3", aes(x=x, y=y))

4.1 Seasonal ARIMA

As with seasonality in exponential smoothing, a seasonal ARIMA applies autoregressive and moving average terms to the previous season as well. It is denoted ARIMA(p,d,q)(P,D,Q)[f], where f is the seasonal frequency.
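For example, fitting a manually specified ARIMA(1,1,1)(1,1,1)[12] on the monthly series would look like this; just a sketch of the notation (the object name fit_manual is ours), not a model we evaluate here:

fit_manual <- Arima(crime_ts, order = c(1,1,1), seasonal = c(1,1,1)) #non-seasonal and seasonal (p,d,q); [12] comes from the ts frequency
summary(fit_manual)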

Since the data is already a ts object, we'll go ahead with the modelling steps.

head(crime_ts)
##      Jan Feb Mar Apr May Jun
## 2001 161 173 267 229 239 228
tail(crime_ts)
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2018                                             144
## 2019  88 109 139 154 154

It might be better to model this using seasonal ARIMA.

  1. Check whether the time series is stationary
adf.test(crime_ts, alternative = "s")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  crime_ts
## Dickey-Fuller = -3.0012, Lag order = 6, p-value = 0.1557
## alternative hypothesis: stationary

The series looks stationary in variance (no transform needed), but the ADF test again fails to reject non-stationarity in the mean (p-value = 0.1557 > 0.05); the seasonal differencing will be handled within the ARIMA model. Let's check the ACF plot:

  2. Split the train and test datasets. I will use the last periods (2015 through May 2019) as the test data.
train
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2001 161 173 267 229 239 228 213 215 374 313 174 164
## 2002 186 177 208 183 227 266 246 210 210 243 158 143
## 2003 165 186 228 185 245 226 184 209 232 219 186 165
## 2004 191 199 228 179 240 236 225 209 201 253 181 153
## 2005 173 168 234 246 263 265 287 253 241 240 202 158
## 2006 197 179 272 227 315 291 276 237 280 302 264 228
## 2007 215 208 303 296 353 310 294 244 333 303 240 216
## 2008 198 199 261 280 296 246 286 299 271 284 212 181
## 2009 197 211 288 246 313 275 277 266 315 327 245 187
## 2010 289 230 358 319 342 333 282 309 328 322 251 175
## 2011 197 194 272 270 326 318 296 260 295 258 208 201
## 2012 189 207 300 226 317 258 268 261 285 277 216 203
## 2013 209 210 265 295 366 326 291 249 281 287 186 170
## 2014 169 164 230 238 302 283 300 279 288 298 198 154
test
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2015 180 137 249 228 222 241 219 204 217 211 169 145
## 2016 106 111 141 142 164 165 154 124 127 161 116  96
## 2017 123  99 126 129 153 126 148 145 131 129 105  84
## 2018  89  78 119 115 124 137 116 121 112 112 104 144
## 2019  88 109 139 154 154
p <- autoplot(train) + geom_line(data=test, col = "tomato3") + theme_tq()
ggplotly(p)
auto.s <- auto.arima(train, seasonal=T)
auto.s
## Series: train 
## ARIMA(1,0,1)(1,1,2)[12] 
## 
## Coefficients:
##          ar1      ma1     sar1     sma1     sma2
##       0.9027  -0.5106  -0.3624  -0.4799  -0.3172
## s.e.  0.0686   0.1577   2.7890   2.7868   2.3587
## 
## sigma^2 estimated as 752.2:  log likelihood=-743.12
## AIC=1498.25   AICc=1498.81   BIC=1516.55
tsdisplay(train)

  • look at the seasonal lags (frequency = 12): in the ACF & PACF plots both cut off, which points to an ARIMA-type seasonal part.
  • the seasonal ACF is significant at lags 1-2 and the seasonal PACF at lag 1
  • when specifying a SARIMA manually, set the non-seasonal orders to 0 first to isolate the seasonal component

ARIMA(1,0,1)(1,1,2)[12]

auto.s$aic
## [1] 1498.246
  3. Estimate P and Q for the seasonal part

To estimate P and Q, use the acf() and pacf() plots of the observations:

acf(train)

pacf(train)

See how lag 1 is high. This means lag 1 can be used in the seasonal part of the ARIMA model:

  • Try P = 1
  • Try Q = 1
benchmark1 <- Arima(train, order = c(0,0,0), seasonal = c(1,0,1))
tsdisplay(benchmark1$residuals)

ACF = tails off, PACF = cuts off -> AR, so we need to find the order P (focus on the PACF, significant at lag 1).

4.2 Model Evaluation

benchmark1 <- Arima(train, order = c(1,0,0), seasonal = c(1,0,1))
benchmark1$aic
## [1] 1622.232
  • Try P = 2
  • Try Q = 1
benchmark2 <- Arima(train, order = c(0,0,0), seasonal = c(2,0,1))
tsdisplay(benchmark2$residuals)

ACF = tails off, PACF = cuts off -> AR, so we look for P (focus on the PACF, significant at lag 1).

benchmark2 <- Arima(train, order = c(1,0,0), seasonal = c(2,0,1))
auto.s$aic
## [1] 1498.246
benchmark1$aic
## [1] 1622.232
benchmark2$aic
## [1] 1623.592

Of the 3 models:
1. auto.s – ARIMA(1,0,1)(1,1,2)[12]
2. benchmark1 – manual model
3. benchmark2 – manual model
In this case, the smallest AIC comes from the auto.arima model (auto.s).

  4. Calculate error on the test set
f <- forecast(auto.s, h = 12)
accuracy(f,test)
##                      ME     RMSE      MAE         MPE      MAPE      MASE
## Training set   2.404621 26.00101 19.08067   0.2415126  7.976162 0.6391635
## Test set     -41.379088 47.01459 41.37909 -21.0236631 21.023663 1.3861150
##                      ACF1 Theil's U
## Training set  0.082853216        NA
## Test set     -0.001066382  0.846168
autoplot(f) + geom_line(data=test, col="tomato3", aes(x=x, y=y))

Box.test(auto.s$residuals)
## 
##  Box-Pierce test
## 
## data:  auto.s$residuals
## X-squared = 1.1533, df = 1, p-value = 0.2829

The p-value (0.2829) > 0.05, so we fail to reject H0: there is no residual autocorrelation.
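For completeness, the normality assumption could be checked on these residuals in the same way as before; a sketch (output not shown):

shapiro.test(as.numeric(auto.s$residuals)) #Shapiro-Wilk test on the seasonal ARIMA residuals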

4.3 Conclusion

  1. The crime rate trend for “Public Peace Violation” is decreasing.
  2. However, the data has seasonality, so the crime count rises in specific months of the year.
  3. Based on the model evaluation, the seasonal ARIMA model is the best choice for forecasting this time series: it captures the seasonality and its residuals show no autocorrelation.