Time series analysis covers a set of research problems where observations are collected at regular time intervals and where we can assume correlation among successive observations. The main goal is to predict future values from past data; that process is called forecasting. A time series records a single variable observed over time.
Before we work with the time series, we load the required libraries:
library(data.table) #fread
library(ggplot2)
library(plotly)
library(lubridate)
library(tidyquant) #theme_tq
library(forecast) #autoplot
library(MLmetrics) #model evaluation
library(TTR) #SMA
library(tseries) #adf.test
library(dplyr) #glimpse, pipes and data wrangling
This dataset reflects reported incidents of crime (with the exception of murders, where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days.
Data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. Should you have questions about this dataset, you may contact the Research & Development Division of the Chicago Police Department at PSITAdministration@ChicagoPolice.org.
Resource link: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
Use fread from the data.table package, because the dataset is large.
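If memory is a concern, fread() can also read only a subset of columns via its select argument. A minimal sketch, assuming the same file path and only the columns used later in this analysis:
# Sketch: read only the columns needed for this analysis (assumed file path)
crime_small <- fread("data_input/crime.csv",
                     select = c("Date", "Primary Type", "Year"))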
crime <- fread("data_input/crime.csv")
rmarkdown::paged_table(crime)
glimpse(crime)
## Observations: 6,889,715
## Variables: 30
## $ ID <int> 11712689, 11712663, 11712683, 117...
## $ `Case Number` <chr> "JC294129", "JC294099", "JC294090...
## $ Date <chr> "06/05/2019 11:58:00 PM", "06/05/...
## $ Block <chr> "039XX W OHIO ST", "001XX W ONTAR...
## $ IUCR <chr> "502R", "1220", "0460", "051A", "...
## $ `Primary Type` <chr> "OTHER OFFENSE", "DECEPTIVE PRACT...
## $ Description <chr> "VEHICLE TITLE/REG OFFENSE", "THE...
## $ `Location Description` <chr> "STREET", "GAS STATION", "CTA PLA...
## $ Arrest <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, ...
## $ Domestic <lgl> FALSE, FALSE, FALSE, FALSE, TRUE,...
## $ Beat <int> 1122, 1832, 623, 835, 334, 1922, ...
## $ District <int> 11, 18, 6, 8, 3, 19, 3, 22, 1, 6,...
## $ Ward <int> 37, 42, 6, 18, 7, 44, 20, 34, 42,...
## $ `Community Area` <int> 23, 8, 44, 66, 43, 6, 68, 73, 28,...
## $ `FBI Code` <chr> "26", "11", "08B", "04A", "08B", ...
## $ `X Coordinate` <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Y Coordinate` <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Year <int> 2019, 2019, 2019, 2019, 2019, 201...
## $ `Updated On` <chr> "06/12/2019 04:13:52 PM", "06/12/...
## $ Latitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Longitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Location <chr> "", "", "", "", "", "", "", "", "...
## $ `Historical Wards 2003-2015` <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Zip Codes` <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Community Areas` <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Census Tracts` <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Wards <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Boundaries - ZIP Codes` <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Police Districts` <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ `Police Beats` <int> NA, NA, NA, NA, NA, NA, NA, NA, N...
unique(crime$`Primary Type`)
## [1] "OTHER OFFENSE"
## [2] "DECEPTIVE PRACTICE"
## [3] "BATTERY"
## [4] "ASSAULT"
## [5] "CRIMINAL DAMAGE"
## [6] "OFFENSE INVOLVING CHILDREN"
## [7] "THEFT"
## [8] "PUBLIC PEACE VIOLATION"
## [9] "INTERFERENCE WITH PUBLIC OFFICER"
## [10] "ROBBERY"
## [11] "NARCOTICS"
## [12] "BURGLARY"
## [13] "MOTOR VEHICLE THEFT"
## [14] "CRIMINAL TRESPASS"
## [15] "CRIM SEXUAL ASSAULT"
## [16] "SEX OFFENSE"
## [17] "GAMBLING"
## [18] "HOMICIDE"
## [19] "WEAPONS VIOLATION"
## [20] "CONCEALED CARRY LICENSE VIOLATION"
## [21] "PROSTITUTION"
## [22] "LIQUOR LAW VIOLATION"
## [23] "ARSON"
## [24] "OBSCENITY"
## [25] "STALKING"
## [26] "KIDNAPPING"
## [27] "OTHER NARCOTIC VIOLATION"
## [28] "INTIMIDATION"
## [29] "PUBLIC INDECENCY"
## [30] "HUMAN TRAFFICKING"
## [31] "NON-CRIMINAL"
## [32] "NON-CRIMINAL (SUBJECT SPECIFIED)"
## [33] "NON - CRIMINAL"
## [34] "RITUALISM"
## [35] "DOMESTIC VIOLENCE"
We’re going to forecast the PUBLIC PEACE VIOLATION category, so we drop the data we are not using to reduce the size of the dataset, and then reshape what remains.
First, filter the crime type we want to focus on.
crime_peace <- crime %>%
filter(`Primary Type` == "PUBLIC PEACE VIOLATION") %>%
mutate(Date = mdy_hms(Date), # convert date time
Date = date(Date), # subset only date component.
Month = month(Date)) %>% # create new column Month
select(`Primary Type`, Month, Year) %>%
group_by(Year, Month) %>%
summarise(total = n()) %>%
ungroup() %>%
mutate(Date = parse_date_time(paste(Year,Month), orders = "Ym")) %>%
select(Date, total)
rmarkdown::paged_table(crime_peace)
Assume that data collection for June 2019 has not finished yet, so we keep only the periods before June 2019.
crime_peace <- crime_peace %>%
filter(Date < "2019-06-01") %>%
  select(Date, total)
Before we create the time series object, we should know the date range:
range(crime_peace$Date)
## [1] "2001-01-01 UTC" "2019-05-01 UTC"
crime_ts <- ts(data=crime_peace$total,
start=2001,
frequency=12) #we have Monthly data and would like to get Year seasonality
crime_ts
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2001 161 173 267 229 239 228 213 215 374 313 174 164
## 2002 186 177 208 183 227 266 246 210 210 243 158 143
## 2003 165 186 228 185 245 226 184 209 232 219 186 165
## 2004 191 199 228 179 240 236 225 209 201 253 181 153
## 2005 173 168 234 246 263 265 287 253 241 240 202 158
## 2006 197 179 272 227 315 291 276 237 280 302 264 228
## 2007 215 208 303 296 353 310 294 244 333 303 240 216
## 2008 198 199 261 280 296 246 286 299 271 284 212 181
## 2009 197 211 288 246 313 275 277 266 315 327 245 187
## 2010 289 230 358 319 342 333 282 309 328 322 251 175
## 2011 197 194 272 270 326 318 296 260 295 258 208 201
## 2012 189 207 300 226 317 258 268 261 285 277 216 203
## 2013 209 210 265 295 366 326 291 249 281 287 186 170
## 2014 169 164 230 238 302 283 300 279 288 298 198 154
## 2015 180 137 249 228 222 241 219 204 217 211 169 145
## 2016 106 111 141 142 164 165 154 124 127 161 116 96
## 2017 123 99 126 129 153 126 148 145 131 129 105 84
## 2018 89 78 119 115 124 137 116 121 112 112 104 144
## 2019 88 109 139 154 154
p <- autoplot(crime_ts) + theme_tq()
ggplotly(p)
Based on the plot, we can guess that the decomposition is additive; an informal check is sketched below.
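A minimal sketch of that check: if the seasonal swings grew with the level (a multiplicative pattern), the log of the series would flatten them; here the swings already look roughly constant on the original scale, which supports the additive assumption.
# Sketch: compare the original and log scales; roughly constant seasonal swings
# on the original scale suggest an additive decomposition
autoplot(log(crime_ts)) + theme_tq()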
Next, we would like to read the seasonality of the data, so we mark selected months on the plot.
crime_peace2 <- crime_peace %>%
mutate(Date = as.Date(Date),
Month = months(Date))
p <- crime_peace2 %>%
ggplot(aes(x = Date, total)) +
geom_line() +
geom_point(data = crime_peace2 %>% filter(Month == "January" | Month == "May"),
aes(x = Date, y = total, color = Month)) +
scale_color_manual(values = c("red","dodgerblue4")) +
theme_tq()
plotly::ggplotly(p)
As the visualization above shows, there appear to be both a trend and a seasonal pattern (based on the markers we created), but we will confirm this using the decompose function.
crime_dc <- decompose(crime_ts)
str(crime_dc)
## List of 6
## $ x : Time-Series [1:221] from 2001 to 2019: 161 173 267 229 239 228 213 215 374 313 ...
## $ seasonal: Time-Series [1:221] from 2001 to 2019: -44.2 -50.5 16.5 1.3 46.9 ...
## $ trend : Time-Series [1:221] from 2001 to 2019: NA NA NA NA NA ...
## $ random : Time-Series [1:221] from 2001 to 2019: NA NA NA NA NA ...
## $ figure : num [1:12] -44.2 -50.5 16.5 1.3 46.9 ...
## $ type : chr "additive"
## - attr(*, "class")= chr "decomposed.ts"
autoplot(crime_dc) + theme_tq()
Consistent with our earlier guess, the decompose result also reports type : chr "additive".
- Is crime generally rising in Chicago in the past decade (last 10 years)?
No. Looking at the trend component above, this crime category has been decreasing over the past decade.
crime_decade <- crime_ts %>%
tail(12*10)
autoplot(crime_decade)
Alternatively, we can show the same period using the window function:
p <- autoplot(window(crime_ts, start=c(2009,6), end=c(2019,5)),
xlab = "Year",
ylab = "Crime Rate",
main = "Public Peace Violation")
ggplotly(p)
- Is there a seasonal component to the crime rate?
Yes, there is a seasonal component in the Public Peace Violation series.
[additional: manual decompose]
#1. Trend
trend <- ma(crime_ts, order = 12, centre = T)
#2. Seasonal
detrend <- crime_ts / trend
m <- t(matrix(data = detrend, nrow = 12))
## Warning in matrix(data = detrend, nrow = 12): data length [221] is not a
## sub-multiple or multiple of the number of rows [12]
seasonal <- colMeans(m, na.rm = T)
seasonal <- seasonal/mean(seasonal)
seasonal <- ts(rep(seasonal,12),
start = start(crime_ts),
frequency = 12)
p <- autoplot(seasonal) +
theme_tq() +
labs(y = "",
title = "Seasonality Component")
ggplotly(p)
Extra content you can use to explore the components of a time series: https://www.clarusft.com/exploring-seasonality-in-a-time-series-with-rs-ggplot2/
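Note that the manual steps above use the ratio (multiplicative-style) form, detrending by division. Since decompose() reported an additive type, an additive variant of the same idea would subtract the trend instead. A sketch (the _add object names are only for illustration):
# Sketch: additive variant of the manual decomposition
trend_add    <- ma(crime_ts, order = 12, centre = TRUE)
detrend_add  <- crime_ts - trend_add                      # subtract, rather than divide by, the trend
m_add        <- t(matrix(data = detrend_add, nrow = 12))  # same warning as above: 221 is not a multiple of 12
seasonal_add <- colMeans(m_add, na.rm = TRUE)
seasonal_add <- seasonal_add - mean(seasonal_add)         # center so the seasonal effects sum to ~0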
Next, we smooth the series with a simple moving average (SMA) of order 5:
crime_tsm <- SMA(crime_ts, n=5)
crime_tsm
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov
## 2001 NA NA NA NA 213.8 227.2 235.2 224.8 253.8 268.6 257.8
## 2002 242.2 202.8 181.8 183.6 196.2 212.2 226.0 226.4 231.8 235.0 213.4
## 2003 183.8 179.0 176.0 181.4 201.8 214.0 213.6 209.8 219.2 214.0 206.0
## 2004 198.6 192.0 193.8 192.4 207.4 216.4 221.6 217.8 222.2 224.8 213.8
## 2005 192.2 185.6 181.8 194.8 216.8 235.2 259.0 262.8 261.8 257.2 244.6
## 2006 207.6 195.2 201.6 206.6 238.0 256.8 276.2 269.2 279.8 277.2 271.8
## 2007 257.8 243.4 243.6 250.0 275.0 294.0 311.2 299.4 306.8 296.8 282.8
## 2008 258.0 231.2 222.8 230.8 246.8 256.4 273.8 281.4 279.6 277.2 270.4
## 2009 229.0 217.0 217.8 224.6 251.0 266.6 279.8 275.4 289.2 292.0 286.0
## 2010 272.6 255.6 261.8 276.6 307.6 316.4 326.8 317.0 318.8 314.8 298.4
## 2011 254.6 227.8 217.8 221.6 251.8 276.0 296.4 294.0 299.0 285.4 263.4
## 2012 230.2 212.6 221.0 224.6 247.8 261.6 273.8 266.0 277.8 269.8 261.4
## 2013 238.0 223.0 220.6 236.4 269.0 292.4 308.6 305.4 302.6 286.8 258.8
## 2014 218.6 195.2 183.8 194.2 220.6 243.4 270.6 280.4 290.4 289.6 272.6
## 2015 223.6 193.4 183.6 189.6 203.2 215.4 231.8 222.8 220.6 218.4 204.0
## 2016 169.6 148.4 134.4 129.0 132.8 144.6 153.2 149.8 146.8 146.2 136.4
## 2017 124.6 119.0 112.0 114.6 126.0 126.6 136.4 140.2 140.6 135.8 131.6
## 2018 107.6 97.0 95.0 97.0 105.0 114.6 122.2 122.6 122.0 119.6 113.0
## 2019 112.0 111.4 116.8 126.8 128.8
## Dec
## 2001 248.0
## 2002 192.8
## 2003 202.2
## 2004 199.4
## 2005 218.8
## 2006 262.2
## 2007 267.2
## 2008 249.4
## 2009 268.0
## 2010 277.0
## 2011 244.4
## 2012 248.4
## 2013 234.6
## 2014 243.4
## 2015 189.2
## 2016 124.8
## 2017 118.8
## 2018 118.6
## 2019
autoplot(crime_tsm)
Period: 2001 - 2019.
Try to compute the value for May 2001: because the order is 5, we take the mean of the previous 5 periods (January-May 2001).
mean(crime_ts[1:5])
## [1] 213.8
We will forecast 2 periods into the future (June & July 2019).
future1 <- mean(tail(crime_ts, 5)) #mean of the last 5 observations (January-May 2019)
future2 <- mean(tail(crime_ts, 4), future1) #last 4 observations plus the previous result: forecast for June 2019
future3 <- mean(tail(crime_ts, 3), future1, future2) #forecast for July 2019
future1; future2; future3
## [1] 128.8
## [1] 146.5
## [1] 154
forecast for August 2019:
mean(tail(crime_ts,2), future1, future2, future3)
## [1] 154
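One caution about the calls above: mean() only averages its first argument (the second positional argument is matched to trim), so future1, future2, and future3 are not actually included in the averages as intended. A sketch of the intended calculation wraps everything in c() first; the resulting values will differ slightly from those printed above, and the _c object names are only for illustration:
# Sketch: combine the last observations and earlier forecasts before averaging
future2_c <- mean(c(tail(crime_ts, 4), future1))                       # June 2019
future3_c <- mean(c(tail(crime_ts, 3), future1, future2_c))            # July 2019
future4_c <- mean(c(tail(crime_ts, 2), future1, future2_c, future3_c)) # August 2019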
So far we forecast the future manually; now we will use R's built-in functions.
plot(crime_peace$total,type="l") #plot the actual data
lines(c(crime_tsm, future1, future2, future3), col="dodgerblue4", lty=2)
Based on the plot above, the prediction shows a big gap from the actual data, so we will try other forecasting methods.
The ets function fits an exponential smoothing model to the time series; we give R the authority to set all the smoothing parameters by using model = "ZZZ".
crime_ets <- ets(crime_ts, model= "ZZZ")
crime_ets
## ETS(M,Ad,M)
##
## Call:
## ets(y = crime_ts, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 0.3903
## beta = 0.0126
## gamma = 1e-04
## phi = 0.9651
##
## Initial states:
## l = 233.2339
## b = -0.7026
## s = 0.7714 0.8528 1.1212 1.1331 1.0303 1.0949
## 1.1418 1.1889 1.0103 1.0767 0.7763 0.8022
##
## sigma: 0.1224
##
## AIC AICc BIC
## 2646.639 2650.025 2707.806
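The fitted model is ETS(M,Ad,M): multiplicative errors, an additive damped trend, and multiplicative seasonality. As a quick sketch of how it would be used, we can forecast ahead with forecast(); the 12-month horizon and object name below are assumptions for illustration only:
# Sketch: forecast 12 months ahead with the fitted ETS model (horizon assumed)
crime_etsfut <- forecast(crime_ets, h = 12)
autoplot(crime_etsfut) + theme_tq()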
From the visualization, we know that the data has trend & seasonality, but let us start with the SES (Simple Exponential Smoothing) method first:
crime_ses <- HoltWinters(crime_ts, beta = F, gamma=F) # Simple Exponential Smoothing
crime_ses
## Holt-Winters exponential smoothing without trend and without seasonal component.
##
## Call:
## HoltWinters(x = crime_ts, beta = F, gamma = F)
##
## Smoothing parameters:
## alpha: 0.75233
## beta : FALSE
## gamma: FALSE
##
## Coefficients:
## [,1]
## a 152.5886
MLmetrics::MAE(crime_ses$fitted[,1], crime_ts)
## [1] 31.19513
crime_ses$alpha
## [1] 0.75233
To do forecasting in R, we can use either of two functions:
1. predict, with arguments: model, n.ahead (how many periods ahead)
2. forecast, with arguments: model, h (horizon, how many periods ahead)
Below we forecast using the crime_ses model (Simple Exponential Smoothing). Notice that all the forecast values are identical; this is why SES is often called a flat method.
crime_sesfut <- predict(crime_ses, n.ahead = 4) #forecast 4 periods ahead
crime_sesfut
##           Jun      Jul      Aug      Sep
## 2019 152.5886 152.5886 152.5886 152.5886
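For comparison, the forecast() option mentioned above produces the same flat forecasts. A minimal sketch, where h plays the role of n.ahead and the object name is only illustrative:
# Sketch: the forecast() equivalent of the predict() call above
crime_sesfc <- forecast(crime_ses, h = 4)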
p <- autoplot(crime_ses$x, series = "Actual") +
autolayer(crime_ses$fitted[,1], series = "Fitted") +
autolayer(crime_sesfut, series = "Forecast") +
theme_tq() +
labs(title = "Simple Exponential Smoothing Forecasting")
ggplotly(p)
Holt exponential smoothing models the trend but no seasonality:
crime_holt <- HoltWinters(crime_ts, gamma=F) # Double Exponential Smoothing / Holt Exponential Smoothing
crime_holt
## Holt-Winters exponential smoothing with trend and without seasonal component.
##
## Call:
## HoltWinters(x = crime_ts, gamma = F)
##
## Smoothing parameters:
## alpha: 0.7799254
## beta : 0.01828357
## gamma: FALSE
##
## Coefficients:
## [,1]
## a 152.8024890
## b -0.2744654
MLmetrics::MAE(crime_holt$fitted[,1], crime_ts)
## [1] 32.17858
p <- autoplot(crime_holt$x, series = "Actual") + autolayer(crime_holt$fitted[,1], series = "Fitted") + theme_tq()
ggplotly(p)
The MAE (32.18) is slightly higher than the SES model (31.20), so adding a trend alone does not improve the fit.
crime_holtfut <- predict(crime_holt, n.ahead = 8)
crime_holtfut
##           Jun      Jul      Aug      Sep      Oct      Nov      Dec
## 2019 152.5280 152.2536 151.9791 151.7046 151.4302 151.1557 150.8812
##           Jan
## 2020 150.6068
p <- autoplot(crime_holt$x, series = "Actual") +
autolayer(crime_holt$fitted[,1], series = "Fitted") +
autolayer(crime_holtfut, series = "Forecast") +
theme_tq() +
labs(title = "Holt Exponential Smoothing Forecasting")
ggplotly(p)
Holt-Winters exponential smoothing models both trend and seasonality:
crime_hw <- HoltWinters(crime_ts)
crime_hw
## Holt-Winters exponential smoothing with trend and additive seasonal component.
##
## Call:
## HoltWinters(x = crime_ts)
##
## Smoothing parameters:
## alpha: 0.2486149
## beta : 0.009154466
## gamma: 0.5870536
##
## Coefficients:
## [,1]
## a 121.9660154
## b -0.9487718
## s1 35.8516568
## s2 32.9112357
## s3 31.9270089
## s4 28.9830934
## s5 37.2742543
## s6 14.4913596
## s7 16.1094443
## s8 -15.6919185
## s9 -15.8591546
## s10 16.8330936
## s11 21.9711542
## s12 31.4387903
crime_hwf <- forecast(crime_hw, h=24) #24 periods ahead = 2 years
autoplot(crime_hwf) + theme_tq()
Now that we have created a few time series models, we want to compare them and pick the correct one. The evaluation is done in two ways: error metrics and assumption checking.
#Mean Absolute Percentage Error
MAPE(crime_ses$fitted[,1], crime_ses$x)*100
## [1] 14.71812
MAPE(crime_holt$fitted[,1], crime_holt$x)*100
## [1] 15.22716
MAPE(crime_hw$fitted[,1], crime_hw$x)*100
## [1] 12.15574
Our preliminary analysis based on the visualization is confirmed: the data contains both trend and seasonality, since the Holt-Winters model has the smallest MAPE (12.16%).
There are two assumption checks for a time series model:
- Normality of residuals
Hypothesis check:
H0: residuals are normally distributed (the condition we want)
H1: residuals are not normally distributed
Conclusion: reject H0 if p-value < 0.05
- No autocorrelation of residuals
We want the errors to have no correlation with previous errors. The underlying idea is the ordinary correlation concept, but for time series data we look at the correlation between the residual at time t and at time t-1. This is called autocorrelation, because we inspect the correlation of the errors with themselves.
Hypothesis check:
H0: residuals have no autocorrelation (the condition we want)
H1: residuals have autocorrelation
Conclusion: reject H0 if p-value < 0.05
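To make the idea concrete, a minimal sketch of the lag-1 relationship, using the Holt-Winters forecast residuals from crime_hwf above (the object name r is only for illustration):
# Sketch: correlation between each residual and the previous one (lag-1 autocorrelation)
r <- as.numeric(na.omit(crime_hwf$residuals))
cor(head(r, -1), tail(r, -1)) # correlation between e[t-1] and e[t]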
There are two ways to inspect the normality of the residuals: graphically and with a formal test.
resids <- crime_hwf$residuals[!is.na(crime_hwf$residuals)]
plot(resids, pch=19, cex=0.5) #scatter of the residuals around zero
abline(h=0, col="dodgerblue4")
hist(resids, breaks=15)
abline(v=0, lwd=2, col="green")
Check the normality using the Shapiro-Wilk test.
shapiro.test(resids)
##
## Shapiro-Wilk normality test
##
## data: resids
## W = 0.97364, p-value = 0.0005848
Because the p-value (0.0005848) < 0.05, we reject H0. Hence, the residuals are not normally distributed.
Now, we check the autocorrelation of the residuals of the Holt-Winters model on the crime series.
acf(resids, lag.max=20)
We test it with Box.test(residuals, lag = how many previous lags to consider):
Box.test(resids, lag=20, type="Ljung-Box") #lag=20: test against the 20 previous lags
##
## Box-Ljung test
##
## data: resids
## X-squared = 65.413, df = 20, p-value = 1.003e-06
Because the p-value (1.003e-06) < 0.05, we reject H0: the residuals are autocorrelated.
The model does not pass the assumption checks, so we continue with the next methods, ARIMA & SARIMA. Hopefully we can find a model that passes the assumptions and also has minimal error.
ARIMA is a combination of two models, autoregressive (AR) and moving average (MA). The prerequisite for this model is stationary data; if the data is not stationary, we need to apply differencing (diff) or a log transform.
To check whether the data is stationary, we can use the adf.test function, and we can fit an ARIMA model automatically with the auto.arima() function. Additionally, if we want to set the orders p and q manually, we can inspect the ACF and PACF plots, as sketched below.
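A minimal sketch of that manual inspection, applied to the differenced series (since the original is not yet stationary):
# Sketch: ACF suggests the MA order q, PACF suggests the AR order p
acf(diff(crime_ts), lag.max = 20)
pacf(diff(crime_ts), lag.max = 20)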
To remind the structure and type of our data :
glimpse(crime_peace)
## Observations: 221
## Variables: 2
## $ Date <dttm> 2001-01-01, 2001-02-01, 2001-03-01, 2001-04-01, 2001-05...
## $ total <int> 161, 173, 267, 229, 239, 228, 213, 215, 374, 313, 174, 1...
Show the plot again, comparing the original data and the data after differencing (assume our data is not yet stationary):
autoplot(crime_ts) + theme_tq() #original data
autoplot(diff(crime_ts)) + theme_tq() #differenced data, to make a non-stationary series stationary
We test stationarity with adf.test(time series data, alternative = "stationary").
Hypothesis check:
H0: data is not stationary
H1: data is stationary
Conclusion: reject H0 if p-value < 0.05
adf.test(crime_ts, alternative="s")
##
## Augmented Dickey-Fuller Test
##
## data: crime_ts
## Dickey-Fuller = -3.0012, Lag order = 6, p-value = 0.1557
## alternative hypothesis: stationary
The p-value (0.1557) > 0.05, so we fail to reject H0: the original series is not yet stationary. After differencing, the ADF test below gives a p-value of 0.01 < 0.05, so the differenced series is stationary.
adf.test(diff(crime_ts), alternative="s")
## Warning in adf.test(diff(crime_ts), alternative = "s"): p-value smaller
## than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(crime_ts)
## Dickey-Fuller = -8.6432, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
For cross-validation, we build the model from a training dataset and evaluate it on a test dataset.
#filter data train period before 2015
train <- crime_peace %>% filter(Date < "2015-01-01") %>%
select(total) %>%
ts(frequency = 12, start = 2001)
#filter data test period after 2015
test <- crime_peace %>% filter(Date >= "2015-01-01") %>%
select(total) %>%
ts(frequency = 12, start = 2015)
autoplot(train) + geom_line(data=test, col = "tomato3")
[Additional] Split train & test using the window function.
train <- window(crime_ts, start = c(2001,1), end = c(2014,12))
test <- window(crime_ts, start = c(2015,1))
p <- autoplot(train) + geom_line(data=test, col = "tomato3") + theme_tq()
ggplotly(p)
auto <- auto.arima(train, seasonal = F)
auto
## Series: train
## ARIMA(3,1,2)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2
## 1.3577 -0.4156 -0.2169 -1.8442 0.9170
## s.e. 0.1024 0.1379 0.0858 0.0852 0.0825
##
## sigma^2 estimated as 1575: log likelihood=-850.69
## AIC=1713.38 AICc=1713.91 BIC=1732.09
The result of auto.arima is ARIMA(3,1,2).
Exponential smoothing vs ARIMA/SARIMA? Choose your best model based on the smallest error.
tsdisplay(diff(train), lag.max = 20)
Analysis:
1. The ACF cuts off and the PACF cuts off -> an ARIMA model.
2. Looking for p: focus on the PACF plot, significant at lags 1 and 3.
3. Looking for q: focus on the ACF plot, significant at lags 1 and 3.
We would like to create few benchmark models:
benchmark1 <- Arima(train, order = c(1,1,1))
benchmark2 <- Arima(train, order = c(3,1,1))
benchmark3 <- Arima(train, order = c(1,1,3))
benchmark4 <- Arima(train, order = c(3,1,3))
Compare the models by looking for the smallest AIC value:
auto$aic
## [1] 1713.384
benchmark1$aic
## [1] 1726.886
benchmark2$aic
## [1] 1723.104
benchmark3$aic
## [1] 1725.018
benchmark4$aic
## [1] 1707.687
benchmark4, ARIMA(3,1,3), has the smallest AIC: 1707.687.
Check whether the residuals are autocorrelated or not.
Box.test(auto$residuals)
##
## Box-Pierce test
##
## data: auto$residuals
## X-squared = 0.00043345, df = 1, p-value = 0.9834
H0: No Autocorrelation
H1: Autocorrelation
Conclusion: no autocorrelation, because the p-value (0.9834) > 0.05.
Box.test(benchmark1$residuals)
##
## Box-Pierce test
##
## data: benchmark1$residuals
## X-squared = 0.039214, df = 1, p-value = 0.843
Box.test(benchmark2$residuals)
##
## Box-Pierce test
##
## data: benchmark2$residuals
## X-squared = 0.030694, df = 1, p-value = 0.8609
Box.test(benchmark3$residuals)
##
## Box-Pierce test
##
## data: benchmark3$residuals
## X-squared = 0.005273, df = 1, p-value = 0.9421
Box.test(benchmark4$residuals)
##
## Box-Pierce test
##
## data: benchmark4$residuals
## X-squared = 0.90538, df = 1, p-value = 0.3413
There is no autocorrelation in the residuals of any of the benchmark models.
- Which time series method seems to capture the variation in the series better? Explain your choice of algorithm and its key assumptions.
benchmark4, the ARIMA(3,1,3) model: it has the smallest AIC and its residuals show no autocorrelation.
Forecast the next period
f<- forecast(auto, h=5) #h=5 because we would like to Forecast 5 period in the future (in data test)
f
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2015 155.6567 104.7999 206.5136 77.87784 233.4356
## Feb 2015 167.1424 109.9706 224.3143 79.70571 254.5792
## Mar 2015 191.5905 131.6429 251.5382 99.90850 283.2726
## Apr 2015 219.6517 159.3737 279.9297 127.46449 311.8390
## May 2015 245.0995 184.8148 305.3841 152.90207 337.2969
#compare the Forecast with benchmark4 model
fbenchmark4<- forecast(benchmark4, h=5)
fbenchmark4
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2015 170.6725 121.1466 220.1985 94.92913 246.4159
## Feb 2015 172.3697 112.8297 231.9097 81.31108 263.4283
## Mar 2015 209.2824 147.1993 271.3656 114.33440 304.2305
## Apr 2015 218.3751 156.0978 280.6525 123.13015 313.6201
## May 2015 251.3525 189.0495 313.6554 156.06835 346.6366
accuracy(f,test)
##                 ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 0.9116585 38.96875 30.06960 -1.827776 12.74564 1.0072703
## Test set 7.3718151 32.86423 28.66859 1.566922 14.52971 0.9603392
## ACF1 Theil's U
## Training set 0.001606262 NA
## Test set -0.486430138 0.5415479
accuracy(fbenchmark4,test)
##                 ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 0.8688769 37.77806 29.14111 -1.635460 12.27746 0.9761678
## Test set -1.2104535 27.81980 24.67841 -2.736988 12.87866 0.8266765
## ACF1 Theil's U
## Training set -0.07341099 NA
## Test set -0.41930019 0.4381011
The test-set MAPE of the benchmark4 model is the smallest (12.88% vs 14.53%).
Plot the forecast result
autoplot(f$x) + autolayer(f$fitted) + autolayer(f$mean) +
autolayer(fbenchmark4$x) + autolayer(fbenchmark4$fitted) + autolayer(fbenchmark4$mean) +
  geom_line(data=test, col="tomato3", aes(x=x, y=y))
autoplot(fbenchmark4$x) + autolayer(fbenchmark4$fitted) + autolayer(fbenchmark4$mean) +
  geom_line(data=test, col="tomato3", aes(x=x, y=y))
Seasonal ARIMA handles seasonality much as exponential smoothing does: it also applies autoregressive and moving-average terms at the seasonal lag. It is denoted ARIMA(p,d,q)(P,D,Q)[f], where f is the seasonal frequency.
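For illustration, a seasonal model can also be written out explicitly with Arima(); the orders and object name below are placeholders, not the ones selected later:
# Sketch: an explicit ARIMA(1,0,1)(1,1,1)[12] specification (orders are illustrative only)
sarima_example <- Arima(train, order = c(1,0,1),
                        seasonal = list(order = c(1,1,1), period = 12))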
Since the data is already a ts object, we'll go ahead with the modelling steps.
head(crime_ts)
##      Jan Feb Mar Apr May Jun
## 2001 161 173 267 229 239 228
tail(crime_ts)
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2018                                             144
## 2019  88 109 139 154 154
It might be better to model this using seasonal ARIMA.
adf.test(crime_ts, alternative = "s")
##
## Augmented Dickey-Fuller Test
##
## data: crime_ts
## Dickey-Fuller = -3.0012, Lag order = 6, p-value = 0.1557
## alternative hypothesis: stationary
The series already looks stationary in variance (no transformation needed), although the ADF p-value (0.1557) shows it is not yet stationary in the mean; the seasonal differencing inside the SARIMA model will handle that. Let's check the train and test sets and then the ACF plot:
train
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2001 161 173 267 229 239 228 213 215 374 313 174 164
## 2002 186 177 208 183 227 266 246 210 210 243 158 143
## 2003 165 186 228 185 245 226 184 209 232 219 186 165
## 2004 191 199 228 179 240 236 225 209 201 253 181 153
## 2005 173 168 234 246 263 265 287 253 241 240 202 158
## 2006 197 179 272 227 315 291 276 237 280 302 264 228
## 2007 215 208 303 296 353 310 294 244 333 303 240 216
## 2008 198 199 261 280 296 246 286 299 271 284 212 181
## 2009 197 211 288 246 313 275 277 266 315 327 245 187
## 2010 289 230 358 319 342 333 282 309 328 322 251 175
## 2011 197 194 272 270 326 318 296 260 295 258 208 201
## 2012 189 207 300 226 317 258 268 261 285 277 216 203
## 2013 209 210 265 295 366 326 291 249 281 287 186 170
## 2014 169 164 230 238 302 283 300 279 288 298 198 154
test
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2015 180 137 249 228 222 241 219 204 217 211 169 145
## 2016 106 111 141 142 164 165 154 124 127 161 116 96
## 2017 123 99 126 129 153 126 148 145 131 129 105 84
## 2018 89 78 119 115 124 137 116 121 112 112 104 144
## 2019 88 109 139 154 154
p <- autoplot(train) + geom_line(data=test, col = "tomato3") + theme_tq()
ggplotly(p)
auto.s <- auto.arima(train, seasonal=T)
auto.s
## Series: train
## ARIMA(1,0,1)(1,1,2)[12]
##
## Coefficients:
## ar1 ma1 sar1 sma1 sma2
## 0.9027 -0.5106 -0.3624 -0.4799 -0.3172
## s.e. 0.0686 0.1577 2.7890 2.7868 2.3587
##
## sigma^2 estimated as 752.2: log likelihood=-743.12
## AIC=1498.25 AICc=1498.81 BIC=1516.55
tsdisplay(train)
The seasonal difference (D = 1) is applied first to capture the seasonality; the selected model is ARIMA(1,0,1)(1,1,2)[12].
auto.s$aic
## [1] 1498.246
To estimate P and Q, use the pacf() and acf() plots of the observations.
acf(train)
pacf(train)
See how lag 1 is high; this means lag 1 can be included in the seasonal ARIMA:
benchmark1 <- Arima(train, order = c(0,0,0), seasonal = c(1,0,1))
tsdisplay(benchmark1$residuals)
The ACF tails off and the PACF cuts off -> an AR pattern, so we look for the AR order (focus on the PACF, significant at lag 1).
benchmark1 <- Arima(train, order = c(1,0,0), seasonal = c(1,0,1))
benchmark1$aic
## [1] 1622.232
benchmark2 <- Arima(train, order = c(0,0,0), seasonal = c(2,0,1))
tsdisplay(benchmark2$residuals)
The ACF tails off and the PACF cuts off -> an AR pattern, so we look for the AR order (focus on the PACF, significant at lag 1).
benchmark2 <- Arima(train, order = c(1,0,0), seasonal = c(2,0,1))
auto.s$aic
## [1] 1498.246
benchmark1$aic
## [1] 1622.232
benchmark2$aic
## [1] 1623.592
From the 3 models:
1. auto.s - ARIMA(1,0,1)(1,1,2)[12]
2. benchmark1 - manual model
3. benchmark2 - manual model
In this case, the smallest AIC comes from the auto.arima model (auto.s).
f <- forecast(auto.s, h = 12)
accuracy(f,test)
##                 ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 2.404621 26.00101 19.08067 0.2415126 7.976162 0.6391635
## Test set -41.379088 47.01459 41.37909 -21.0236631 21.023663 1.3861150
## ACF1 Theil's U
## Training set 0.082853216 NA
## Test set -0.001066382 0.846168
autoplot(f) + geom_line(data=test, col="tomato3", aes(x=x, y=y))
Box.test(auto.s$residuals)
##
## Box-Pierce test
##
## data: auto.s$residuals
## X-squared = 1.1533, df = 1, p-value = 0.2829
The p-value (0.2829) > 0.05, so there is no autocorrelation in the residuals.
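As a convenience, the forecast package's checkresiduals() bundles the residual time plot, histogram, ACF, and a Ljung-Box test into a single call; a minimal sketch:
# Sketch: one-call residual diagnostics for the seasonal ARIMA model
checkresiduals(auto.s)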