DATA 624 PROJECT1
Introduction
This report is intended for colleagues from a variety of backgrounds and contains both technical and non-technical explanations of the work conducted. The objective of this project was to perform the appropriate analysis in order to forecast two of the five provided variables in each of six time series sets. We were provided a spreadsheet containing 1622 periods of every variable in every set and were expected to forecast the next 140 periods. The sets are labeled S01 through S06, and each contains variables labeled Var01, Var02, Var03, Var05, and Var07. The variables to forecast depend on the set, as specified below:
- S01 – Forecast Var01, Var02
- S02 – Forecast Var02, Var03
- S03 – Forecast Var05, Var07
- S04 – Forecast Var01, Var02
- S05 – Forecast Var02, Var03
- S06 – Forecast Var05, Var07
For each category, we perform the following:
- Exploratory Data Analysis
  - Visualize the data.
  - Identify trend, seasonality, stationarity, outliers, missing data, variance, etc. Interpret the ACF and PACF to inform the modeling approach.
- Data Preparation
  - Prepare the data for modeling: address outliers, missing values, stationarity, etc.
  - Split the data into training and testing sets.
- Data Modeling
  - Prepare models. For this project, we assess model accuracy using MAPE (Mean Absolute Percentage Error), selecting the model with the lowest score; a minimal sketch of the metric follows this list.
- Data Export
  - Using the best model, we predict the next 140 periods and write the forecasts to disk.
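For reference, MAPE can be written as a small helper function; this is only a sketch of the metric (the analysis below relies on forecast::accuracy() to compute it):
# Minimal MAPE sketch: mean absolute percentage error, in percent
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual), na.rm = TRUE) * 100
}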
Data Import
We read in the data and create a separate data frame for each set, keeping only the first 1622 observations of each.
SeriesInd | category | Var01 | Var02 | Var03 | Var05 | Var07 |
---|---|---|---|---|---|---|
40669 | S03 | 30.64286 | 123432400 | 30.34000 | 30.49000 | 30.57286 |
40669 | S02 | 10.28000 | 60855800 | 10.05000 | 10.17000 | 10.28000 |
40669 | S01 | 26.61000 | 10369300 | 25.89000 | 26.20000 | 26.01000 |
40669 | S06 | 27.48000 | 39335700 | 26.82000 | 27.02000 | 27.32000 |
40669 | S05 | 69.26000 | 27809100 | 68.19000 | 68.72000 | 69.15000 |
40669 | S04 | 17.20000 | 16587400 | 16.88000 | 16.94000 | 17.10000 |
40670 | S03 | 30.79857 | 150476200 | 30.46428 | 30.65714 | 30.62571 |
40670 | S02 | 11.24000 | 215620200 | 10.40000 | 10.45000 | 10.96000 |
40670 | S01 | 26.30000 | 10943800 | 25.70000 | 25.95000 | 25.86000 |
40670 | S06 | 28.24000 | 55416000 | 27.24000 | 27.27000 | 28.07000 |
SeriesInd | category | Var01 | Var02 | Var03 | Var05 | Var07 |
---|---|---|---|---|---|---|
40669 | S01 | 26.61 | 10369300 | 25.89 | 26.20 | 26.01 |
40670 | S01 | 26.30 | 10943800 | 25.70 | 25.95 | 25.86 |
40671 | S01 | 26.03 | 8933800 | 25.56 | 25.90 | 25.67 |
40672 | S01 | 25.84 | 10775400 | 25.36 | 25.60 | 25.75 |
40673 | S01 | 26.34 | 12875600 | 25.52 | 25.60 | 26.34 |
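The import code is not shown above; a minimal sketch of how it might look is below. The file name, sheet, and the use of readxl/dplyr are assumptions, not the original code:
library(readxl)
library(dplyr)

raw <- read_excel("project1_data.xlsx")   # hypothetical file name

# keep the first 1622 observations of each set; later rows are the empty forecast periods
s1 <- raw |> filter(category == "S01") |> arrange(SeriesInd) |> slice_head(n = 1622)
s2 <- raw |> filter(category == "S02") |> arrange(SeriesInd) |> slice_head(n = 1622)
# s3 through s6 follow the same pattern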
S01
par(mfrow = c(1,2))
s1 |>
  select(c(Var01, Var02)) |>
  ts() |>
  autoplot(facets = TRUE, main = "S01") +
  ylab('')
s1 |>
  ggplot(aes(x = Var01, y = Var02)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Var01
Exploratory Data Analysis
v1 <- ts(s1$Var01)
ggtsdisplay(v1)
## Warning: Removed 2 rows containing missing values (`geom_point()`).
- The data is non-stationary
- There are two missing values
- The ACF is highly significant across many lags
- The PACF cuts off at lag 1
- There are some outliers though they are difficult to observe
- There is an upward trend
- Variance appears to be stable
- There is some cyclicity in the data. There does not appear to be any seasonality
- The data is non-linear
We can see that the lags are highly correlated using a lag plot,
gglagplot(v1)
We can visualize the missing values to see where they appear within the data using the imputeTS package, which we can later use to impute the missing values with a moving average.
library(imputeTS)
v1 |>
  ggplot_na_distribution()
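The same package can also summarize the gaps numerically; a quick sketch (its output is not reproduced here):
statsNA(s1$Var01)   # number of NAs, longest gap, and where the gaps fall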
Data Preparation
To deal with the outliers, we’ll use the tsclean function from the forecast package, which identifies outliers from the residuals of a smoothed fit and replaces them with values imputed from neighboring observations. For the missing values, we’ll use the na_ma() function in the imputeTS package, which replaces missing values with a weighted moving average of neighboring observations within a specified window.
library(forecast)
v1 <- v1 |>
  tsclean(lambda = "auto") |>
  na_ma() # default window = 4
We take a first difference of the data to see if we can achieve stationarity,
v1.diff <- diff(v1)
ggtsdisplay(v1.diff)
There are still some outliers and a few significant lags, but the ACF and PACF plots are informative: the ACF cuts off at lag 2, indicating an MA(2) component, and the PACF also cuts off at lag 2, indicating an AR(2) component.
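As a quick cross-check on the orders suggested by the ACF and PACF, we could compare a few candidate ARIMA specifications by AICc. This is only a sketch; the formal model selection happens in the modeling section below:
# Compare candidate orders on the cleaned series (lower AICc is better)
candidates <- list("ARIMA(2,1,0)" = c(2, 1, 0),
                   "ARIMA(0,1,2)" = c(0, 1, 2),
                   "ARIMA(2,1,2)" = c(2, 1, 2))
sapply(candidates, function(ord) Arima(v1, order = ord)$aicc)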
Now we split the data into testing and training sets to prepare for modeling, setting aside 20% of the data for testing.
library(zeallot) # provides tuple assignment

train_test_split <- function(x, split) {
  split.index <- round(length(x) * split)     # split point index
  train <- window(x, end = split.index)       # first `split` proportion of the series
  test <- window(x, start = split.index + 1)  # remaining observations
  horizon <- length(test)                     # forecast horizon matching the test set

  return(list(train, test, horizon))
}
c(train, test, horizon) %<-% train_test_split(v1, 0.8)
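The %<-% operator from zeallot simply destructures the returned list into three objects; the equivalent base R would be (a sketch of the same assignment):
splits  <- train_test_split(v1, 0.8)
train   <- splits[[1]]   # first 80% of the series
test    <- splits[[2]]   # remaining 20%
horizon <- splits[[3]]   # number of periods to forecast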
Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s1.v1 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped = TRUE, h = horizon)
holt.acc <- c(s1.v1 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
## ses holt
## s1.v1 10.01436 10.49858
Box.test(residuals(holt.fit))$p.value
## [1] 0.02235277
Box.test(residuals(ses.fit))$p.value
## [1] 0.005648436
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data")
ETS Modeling
set.seed(123)
ets.fit <- ets(train, model = "MMN")
ets.fit
## ETS(M,M,N)
##
## Call:
## ets(y = train, model = "MMN")
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 1e-04
##
## Initial states:
## l = 27.0027
## b = 1.0007
##
## sigma: 0.0127
##
## AIC AICc BIC
## 7172.067 7172.114 7197.910
A multiplicative error, multiplicative trend, no seasonality specification gives us the best ETS model, but problems remain: the ACF of the residuals still shows significant autocorrelation, meaning the model is not incorporating all the information in the data, which we can see in the residuals,
checkresiduals(ets.fit)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,M,N)
## Q* = 17.51, df = 10, p-value = 0.06382
##
## Model df: 0. Total lags used: 10
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s1.v1 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
## ses holt ets
## s1.v1 10.01436 10.49858 23.20971
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data")
ARIMA Modeling
An ARIMA(2,1,2) model with a log transformation (lambda = 0) achieves the best MAPE.
set.seed(123)
arima.fc <- train |>
  Arima(order = c(2,1,2), lambda = 0) |>
  forecast(h = horizon)
checkresiduals(arima.fc)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,2)
## Q* = 7.7323, df = 6, p-value = 0.2584
##
## Model df: 4. Total lags used: 10
set.seed(123)
arima.acc <- c(s1.v1 = accuracy(arima.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
## ses holt ets arima
## s1.v1 10.01436 10.49858 23.20971 9.994805
Results
s1v1.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
Var02
Exploratory Data Analysis
v2 <- ts(s1$Var02)
ggtsdisplay(v2)
- This data has a lot of outliers
- There is a faint downward trend over time
- The variance in the data appears to decrease over time
- There is a strong autocorrelation among many lags
- There is no missing data
- The data is non-stationary
- There doesn’t appear to be any seasonality
gglagplot(v2)
Although the ACF plot shows highly significant autocorrelation, the lag plot shows an inverse correlation between the time series and its lagged values.
Data Preparation
Dealing with the outliers and non-stationarity, we apply a log transformation to address the inverse autocorrelation,
v2 <- v2 |>
  tsclean(lambda = "auto") |>
  log()
ggtsdisplay(diff(diff(v2)))
A second difference of the log-transformed data brings the series to stationarity. The ACF cuts off at lag 1, suggesting an MA(1) component, while the PACF tails off, further supporting the MA(1) component.
gglagplot(diff(diff(log(v2))))
We can see the inverse correlations in the lags are no longer present.
Now we set aside our training and testing sets.
c(train, test, horizon) %<-% train_test_split(v2, 0.8)
Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s1.v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped = TRUE, h = horizon)
holt.acc <- c(s1.v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
## ses holt
## s1.v2 2.168109 2.168723
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data")
ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,N,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.4238
##
## Initial states:
## l = 16.1856
##
## sigma: 0.311
##
## AIC AICc BIC
## 6276.973 6276.991 6292.478
checkresiduals(ets.fit)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 69.91, df = 10, p-value = 4.614e-11
##
## Model df: 0. Total lags used: 10
The p-value is far too small for the residuals to be considered white noise, but we'll make the predictions and store the results for comparison,
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s1.v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
## ses holt ets
## s1.v2 2.168109 2.168723 2.168141
ARIMA modeling
set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,4)
## Q* = 13.84, df = 5, p-value = 0.01666
##
## Model df: 5. Total lags used: 10
The p-value of the residuals is still too low for this to be considered a good model, but it's the best we've found so far,
set.seed(123)
arima.fc <- train |>
  Arima(order = c(1,1,4), lambda = 0) |>
  forecast(h = horizon)
checkresiduals(arima.fc)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,4)
## Q* = 13.84, df = 5, p-value = 0.01666
##
## Model df: 5. Total lags used: 10
set.seed(123)
arima.acc <- c(s1.v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
## s1.v2
## 2.118945
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data")
Results
s1v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S01\n\n")
## MAPE results by model - S01
s1.results <- rbind(s1v1.results, s1v2.results)
s1.results
## ses holt ets arima
## s1.v1 10.014358 10.498584 23.209708 9.994805
## s1.v2 2.168109 2.168723 2.168141 2.118945
S02
par(mfrow = c(1,2))
s2 |>
  select(c(Var02, Var03)) |>
  ts() |>
  autoplot(facets = TRUE, main = "S02") +
  ylab('')
s2 |>
  ggplot(aes(x = Var02, y = Var03)) +
  geom_point()
## Warning: Removed 4 rows containing missing values (`geom_point()`).
Var02
Exploratory Data Analysis
v2 <- ts(s2$Var02)
ggtsdisplay(v2)
- The data is non-stationary
- There is a very faint downward trend over time.
- There is high variance in the data
- There does not appear to be any seasonality in the data
- The ACF plot indicates the lags are highly significant and trail off over time, indicating an AR component. The PACF cuts off after lag 8, which also points to an AR component of fairly high order.
- There are many outliers in the data
- There are no missing values in the data
gglagplot(v2)
The lags show an inverse correlation.
Data Preparation
Dealing with the outliers, and applying a log transformation before differencing to address the inverse autocorrelation,
v2 <- v2 |>
  tsclean(lambda = "auto")
ggtsdisplay(diff(log(v2)))
The data now appears stationary. The ACF cuts off at lag 2, indicating an MA(2) component, and the PACF cuts off at lag 9, indicating an AR(9) component.
We can see the inverse correlations in the lags are no longer present.
Now we set aside our training and testing sets.
c(train, test, horizon) %<-% train_test_split(v2, 0.8)
Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s2.v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped = TRUE, h = horizon)
holt.acc <- c(s2.v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
## ses holt
## s2.v2 27.55139 27.56873
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)
## [,1]
## holt_pval 5.881639e-08
## ses_pval 7.954414e-08
The p-values are very low. These models do not fit the data well.
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")
ETS Modeling
set.seed(123)
train.diff <- diff(log(train))
ets.fit <- ets(train.diff)
ets.fit
## ETS(A,N,N)
##
## Call:
## ets(y = train.diff)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = -5e-04
##
## sigma: 0.3676
##
## AIC AICc BIC
## 6704.535 6704.553 6720.038
Because we differenced and logged the training data, we need to back-transform the forecasts to match the scale of the test data.
library(stats)
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.fc <- exp(cumsum(ets.fc$mean))
ets.acc <- c(s2.v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
## ses holt ets
## s2.v2 27.55139 27.56873 100
checkresiduals(ets.fc)
##
## Ljung-Box test
##
## data: Residuals
## Q* = 2988, df = 10, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 10
We can see that this model also does not fit the data very well,
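Part of the poor showing likely comes from the back-transform itself: when inverting diff(log(x)), the cumulative sums of the forecast means generally need to be anchored at the last observed log level before exponentiating. A sketch of that adjustment (this is not the procedure or the result reported above):
fc.diff  <- forecast(ets.fit, h = horizon)        # forecasts of diff(log(train))
last.log <- as.numeric(log(tail(train, 1)))       # last training observation on the log scale
fc.level <- exp(last.log + cumsum(fc.diff$mean))  # undo the differencing, then the log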
ARIMA modeling
set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,3)
## Q* = 6.2503, df = 5, p-value = 0.2826
##
## Model df: 5. Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon)
arima.acc <- c(s2.v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
## s2.v2
## 28.31755
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")
Judging by the residual diagnostics, this model fits the data reasonably well, though its test MAPE is slightly higher than the smoothing methods.
Results
s2v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
s2v2.results
## ses holt ets arima
## s2.v2 27.55139 27.56873 100 28.31755
Var03
Exploratory Data Analysis
v3 <- ts(s2$Var03)
ggtsdisplay(v3)
## Warning: Removed 4 rows containing missing values (`geom_point()`).
- The data is non-stationary
- There is no noticeable trend over time
- There are four missing values
- There is one extreme outlier
- There does not appear to be any seasonality in the data
- The ACF plot indicates the lags are highly significant and trail off over time, indicating an AR component. The PACF cuts off at lag 1, indicating an AR(1) component.
gglagplot(v3)
The lags show a positive linear correlation over shorter periods of time and inverse correlation over longer periods.
Data Preparation
Dealing with the outliers and imputing the missing values,
v3 <- v3 |>
  tsclean(lambda = "auto") |>
  na_ma()
ggtsdisplay(v3)
Let’s take a look at a difference of the data to see if it achieves stationarity,
ggtsdisplay(diff(v3))
Now we set aside our training and testing sets.
c(train, test, horizon) %<-% train_test_split(v3, 0.8)
Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s2.v3 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped = TRUE, h = horizon)
holt.acc <- c(s2.v3 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
## ses holt
## s2.v3 17.17073 17.28684
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)
## [,1]
## holt_pval 1.352437e-03
## ses_pval 5.046795e-05
The p-values are very low. These models do not fit the data well.
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")
ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,Ad,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 0.0266
## phi = 0.8214
##
## Initial states:
## l = 9.4723
## b = 0.4578
##
## sigma: 0.2565
##
## AIC AICc BIC
## 5779.563 5779.628 5810.574
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s2.v3 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
## ses holt ets
## s2.v3 17.17073 17.28684 17.28755
checkresiduals(ets.fc)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,Ad,N)
## Q* = 29.746, df = 10, p-value = 0.0009426
##
## Model df: 0. Total lags used: 10
We can see from the low p-value that this model also does not fit the data very well,
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data") +
ylab("")
ARIMA modeling
set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,0)
## Q* = 11.722, df = 8, p-value = 0.1641
##
## Model df: 2. Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon)
set.seed(123)
arima.acc <- c(s2.v3 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
## s2.v3
## 17.29381
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")
This model also does not fit the data well.
Results
s2v3.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S02\n\n")
## MAPE results by model - S02
s2.results <- rbind(s2v2.results, s2v3.results)
s2.results
## ses holt ets arima
## s2.v2 27.55139 27.56873 100.00000 28.31755
## s2.v3 17.17073 17.28684 17.28755 17.29381
S03
par(mfrow = c(1,2))
s3 |>
  select(c(Var05, Var07)) |>
  ts() |>
  autoplot(facets = TRUE, main = "S03") +
  ylab('')
s3 |>
  ggplot(aes(x = Var05, y = Var07)) +
  geom_point()
## Warning: Removed 4 rows containing missing values (`geom_point()`).
We can see that the two series are very similar and have a strong positive correlation.
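That impression can be quantified directly; a quick sketch (the value itself is not reported in the original output):
cor(s3$Var05, s3$Var07, use = "complete.obs")   # ignore rows with missing values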
Var05
Exploratory Data Analysis
v5 <- ts(s3$Var05)
ggtsdisplay(v5)
## Warning: Removed 4 rows containing missing values (`geom_point()`).
- The data is non-stationary.
- There is an upward trend over time.
- There is some cyclicity, but no discernible seasonality.
- The ACF plot tails off over time, indicating significance in the autocorrelation and presence of an AR component. The PACF cuts off at lag 2, indicating an AR(2) component.
- There are some missing values.
- There are some outliers present.
gglagplot(v5)
The lagplot indicates a positive linear relationship across lags.
Data Preparation
Dealing with outliers and missing values,
v5 <- v5 |>
  tsclean(lambda = "auto") |>
  na_ma()
ggtsdisplay(v5)
Let’s difference the data and see if it achieves stationarity,
checkresiduals(diff(v5))
##
## Ljung-Box test
##
## data: Residuals
## Q* = 61.078, df = 10, p-value = 2.265e-09
##
## Model df: 0. Total lags used: 10
Now we set aside our training and testing sets.
c(train, test, horizon) %<-% train_test_split(v5, 0.8)
Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s3.v5 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped = TRUE, h = horizon)
holt.acc <- c(s3.v5 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
## ses holt
## s3.v5 15.45108 19.26787
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)
## [,1]
## holt_pval 0.8696317
## ses_pval 0.8115583
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")
ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,N,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.9277
##
## Initial states:
## l = 30.4884
##
## sigma: 0.0179
##
## AIC AICc BIC
## 9655.076 9655.095 9670.582
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s3.v5 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
## ses holt ets
## s3.v5 15.45108 19.26787 15.45439
checkresiduals(ets.fc)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,N)
## Q* = 17.805, df = 10, p-value = 0.05834
##
## Model df: 0. Total lags used: 10
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data") +
ylab("")
ARIMA modeling
set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2) with drift
## Q* = 10.565, df = 7, p-value = 0.1587
##
## Model df: 3. Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon)
set.seed(123)
arima.acc <- c(s3.v5 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
## s3.v5
## 40.19997
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")
Results
s3v5.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
s3v5.results
## ses holt ets arima
## s3.v5 15.45108 19.26787 15.45439 40.19997
Var07
Exploratory Data Analysis
v7 <- ts(s3$Var07)
ggtsdisplay(v7)
## Warning: Removed 4 rows containing missing values (`geom_point()`).
- The data is non-stationary.
- There is an upward trend over time.
- There is some cyclicity, but no discernible seasonality.
- The ACF plot tails off over time, indicating significance in the autocorrelation and presence of an AR component. The PACF cuts off at lag 2, indicating an AR(2) component.
- There are some missing values.
- There are some outliers present.
gglagplot(v7)
The lagplot indicates a positive linear relationship across lags.
Data Preparation
Dealing with outliers and missing values,
v7 <- v7 |>
  tsclean(lambda = "auto") |>
  na_ma()
ggtsdisplay(v7)
We’ve been checking differences manually, but we can also call the ndiffs function to determine the appropriate number of first differences,
ndiffs(v7)
## [1] 1
One difference is required to make the v7 data stationary.
Now we set aside our training and testing sets.
c(train, test, horizon) %<-% train_test_split(v7, 0.8)
Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s3.v7 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped = TRUE, h = horizon)
holt.acc <- c(s3.v7 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
## ses holt
## s3.v7 15.32257 18.70147
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)
## [,1]
## holt_pval 0.5457403
## ses_pval 0.3956926
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")
ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,A,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 1e-04
##
## Initial states:
## l = 30.7662
## b = 0.0751
##
## sigma: 0.0168
##
## AIC AICc BIC
## 9495.793 9495.839 9521.635
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s3.v7 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
## ses holt ets
## s3.v7 15.32257 18.70147 26.68161
checkresiduals(ets.fc)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,A,N)
## Q* = 18.471, df = 10, p-value = 0.04752
##
## Model df: 0. Total lags used: 10
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data") +
ylab("")
ARIMA modeling
set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,0) with drift
## Q* = 18.498, df = 10, p-value = 0.04712
##
## Model df: 0. Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon)
set.seed(123)
arima.acc <- c(s3.v7 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
## s3.v7
## 40.03802
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")
Results
s3v7.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S03\n\n")
## MAPE results by model - S03
s3.results <- rbind(s3v5.results, s3v7.results)
s3.results
## ses holt ets arima
## s3.v5 15.45108 19.26787 15.45439 40.19997
## s3.v7 15.32257 18.70147 26.68161 40.03802
S04
par(mfrow = c(1,2))
s4 |>
  select(c(Var01, Var02)) |>
  ts() |>
  autoplot(facets = TRUE, main = "S04") +
  ylab('')
s4 |>
  ggplot(aes(x = Var01, y = Var02)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Var01
Exploratory Data Analysis
v1 <- ts(s4$Var01)
ggtsdisplay(v1)
## Warning: Removed 2 rows containing missing values (`geom_point()`).
- The data is non-stationary.
- There is an upward trend over time.
- There is some cyclicity, but no discernible seasonality.
- The ACF plot tails off over time, indicating significance in the autocorrelation and presence of an AR component. The PACF cuts off at lag 1, indicating an AR(1) component.
- There are two missing values.
- It’s not clear from the line plot whether any outliers are present, but we can run tsclean() to check for any.
gglagplot(v1)
- There is a strong positive correlation from lags 1-9 that decreases slightly as the lags get larger and larger.
Data Preparation
Dealing with any outliers and missing values,
v1 <- v1 |>
  tsclean()
ggtsdisplay(v1)
Let’s difference the data and see if it achieves stationarity,
checkresiduals(diff(v1))
##
## Ljung-Box test
##
## data: Residuals
## Q* = 8.5322, df = 10, p-value = 0.577
##
## Model df: 0. Total lags used: 10
We can see the variance increases over time, and the ACF and PACF plots indicate the data is not quite stationary yet. Let's run ndiffs() with the KPSS test to check,
ndiffs(v1, test = "kpss")
## [1] 1
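For reference, the KPSS test that ndiffs() relies on can also be run directly; a sketch assuming the tseries package is installed (this is not part of the workflow above):
library(tseries)
kpss.test(v1)        # null hypothesis: level stationarity; a small p-value suggests differencing
kpss.test(diff(v1))  # re-test after one difference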
It appears one first differencing is necessary to bring the data to stationarity. We can set aside our training and testing data,
c(train, test, horizon) %<-% train_test_split(v1, 0.8)
Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s4.v1 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped = TRUE, h = horizon)
holt.acc <- c(s4.v1 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
## ses holt
## s4.v1 23.01952 23.02523
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)
## [,1]
## holt_pval 0.14618564
## ses_pval 0.07028179
Holt's method performs somewhat better on the residual test, although neither model captures the nuances of the data,
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")
ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,N,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.9999
##
## Initial states:
## l = 17.1971
##
## sigma: 0.0175
##
## AIC AICc BIC
## 6808.987 6809.006 6824.493
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s4.v1 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
## ses holt ets
## s4.v1 23.01952 23.02523 23.01952
checkresiduals(ets.fc)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,N)
## Q* = 28.511, df = 10, p-value = 0.001495
##
## Model df: 0. Total lags used: 10
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data") +
ylab("")
ARIMA modeling
set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,2)
## Q* = 11.977, df = 6, p-value = 0.06248
##
## Model df: 4. Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon)
set.seed(123)
arima.acc <- c(s4.v1 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
## s4.v1
## 23.23907
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")
Results
s4v1.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S04\n\n")
## MAPE results by model - S04
s4v1.results
## ses holt ets arima
## s4.v1 23.01952 23.02523 23.01952 23.23907
Var02
Exploratory Data Analysis
v2 <- ts(s4$Var02)
ggtsdisplay(v2)
- The data is non-stationary.
- There does appear to be some seasonality.
- There are many outliers.
- The ACF plot exhibits some seasonal behavior. The PACF cuts off at lag 4.
- There are no missing values.
gglagplot(v2)
- The data is inversely correlated at each lag.
Data Preparation
Dealing with any outliers and missing values,
v2 <- v2 |>
  tsclean()
ggtsdisplay(v2)
In this case, setting lambda = NULL does a better job of removing the outliers.
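To make that comparison concrete, one could count how many observations each cleaning variant changes; a quick illustrative sketch (not output from the original analysis):
v2.raw <- ts(s4$Var02)
sum(tsclean(v2.raw) != v2.raw, na.rm = TRUE)                   # lambda = NULL (the default)
sum(tsclean(v2.raw, lambda = "auto") != v2.raw, na.rm = TRUE)  # Box-Cox lambda chosen automatically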
Let’s difference the data and see if it achieves stationarity,
checkresiduals(diff(v1))
##
## Ljung-Box test
##
## data: Residuals
## Q* = 8.5322, df = 10, p-value = 0.577
##
## Model df: 0. Total lags used: 10
We can see the variance increases over time, and the ACF and PACF plots indicate the data is stationary. Let's run ndiffs() with the KPSS test to confirm,
ndiffs(v2, test = "kpss")
## [1] 1
It appears one first differencing is necessary to bring the data to stationarity. We can set aside our training and testing data,
c(train, test, horizon) %<-% train_test_split(v2, 0.8)
Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s4.v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped = TRUE, h = horizon)
holt.acc <- c(s4.v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
## ses holt
## s4.v2 28.7717 28.78593
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)
## [,1]
## holt_pval 9.362804e-06
## ses_pval 9.514048e-06
The two methods perform almost identically, and neither captures the nuances of the data,
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")
ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,A,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.4765
## beta = 1e-04
##
## Initial states:
## l = 17382793.0973
## b = 362620.2033
##
## sigma: 0.3722
##
## AIC AICc BIC
## 50401.69 50401.73 50427.53
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s4.v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
## ses holt ets
## s4.v2 28.7717 28.78593 329.4155
We can see that the ETS forecast is far off the mark,
checkresiduals(ets.fc)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,A,N)
## Q* = 59.165, df = 10, p-value = 5.212e-09
##
## Model df: 0. Total lags used: 10
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data") +
ylab("")
ARIMA modeling
set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,1)
## Q* = 7.614, df = 7, p-value = 0.3679
##
## Model df: 3. Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima() |>
  forecast(h = horizon)
set.seed(123)
arima.acc <- c(s4.v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
## s4.v2
## 37.04331
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")
Results
s4v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S04\n\n")
## MAPE results by model - S04
s4.results <- rbind(s4v1.results, s4v2.results)
s4.results
## ses holt ets arima
## s4.v1 23.01952 23.02523 23.01952 23.23907
## s4.v2 28.77170 28.78593 329.41550 37.04331
S05
Var02
Exploratory Data Analysis
v2 <- ts(s5$Var02)
ggtsdisplay(v2)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
- The data is non-stationary, although it appears to exhibit periods of stationarity
- There is no noticeable trend over time.
- There appears to be some seasonality.
- The ACF plot indicates some seasonality. The PACF plot cuts off at lag 3, indicating an AR(3) component.
- There is one missing value.
- Many outliers are present.
gglagplot(v2)
Again we see an inverse correlation within each lag.
Data Preparation
Dealing with any outliers and missing values,
v2 <- v2 |>
  tsclean()
ggtsdisplay(v2)
Now we see the ACF trails off over time, supporting the AR component indicated prior to cleaning the outliers.
We split the data and prepare for modeling,
c(train, test, horizon) %<-% train_test_split(v2, 0.8)
Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s5.v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped = TRUE, h = horizon)
holt.acc <- c(s5.v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
## ses holt
## s5.v2 27.10623 27.10393
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)
## [,1]
## holt_pval 0.001833601
## ses_pval 0.001647381
Holt's method is marginally more effective, although neither captures the nuances of the data,
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")
ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,N,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.3462
##
## Initial states:
## l = 28022117.2545
##
## sigma: 0.244
##
## AIC AICc BIC
## 48704.38 48704.40 48719.89
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s5.v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
## ses holt ets
## s5.v2 27.10623 27.10393 27.10565
There is no real difference between these three methods in terms of performance, and their residual p-values are all too low to indicate the models are capturing all the information in the data.
checkresiduals(ets.fc)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,N)
## Q* = 52.59, df = 10, p-value = 8.872e-08
##
## Model df: 0. Total lags used: 10
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data") +
ylab("")
ARIMA modeling
set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2)
## Q* = 10.592, df = 7, p-value = 0.1574
##
## Model df: 3. Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon)
set.seed(123)
arima.acc <- c(s5.v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
## s5.v2
## 30.18151
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")
Results
s5v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S05\n\n")
## MAPE results by model - S05
s5v2.results
## ses holt ets arima
## s5.v2 27.10623 27.10393 27.10565 30.18151
Var03
Exploratory Data Analysis
v3 <- ts(s5$Var03)
ggtsdisplay(v3)
## Warning: Removed 5 rows containing missing values (`geom_point()`).
- The data is non-stationary
- There is no noticeable trend over time.
- There appears to be some seasonality.
- The ACF tails off over time, indicating an AR component. The PACF plot cuts off at lag 1, indicating an AR(1) component.
- There are some missing values and maybe one outlier.
gglagplot(v3)
We see a positive correlation among different lags, strongest at lag 1 and decreasing over time.
Data Preparation
Dealing with any outliers and missing values,
v3 <- v3 |>
  tsclean(lambda = "auto")
ggtsdisplay(v3)
Missing values have now been imputed and we can observe that the one noticeable outlier has been reduced in magnitude.
We split the data and prepare for modeling,
c(train, test, horizon) %<-% train_test_split(v3, 0.8)
Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s5.v3 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped = TRUE, h = horizon)
holt.acc <- c(s5.v3 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
## ses holt
## s5.v3 8.07383 7.938347
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)
## [,1]
## holt_pval 0.0015865109
## ses_pval 0.0002142398
Holt's method is more effective, although neither captures the nuances of the data,
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")
ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,N,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.9999
##
## Initial states:
## l = 68.1953
##
## sigma: 0.878
##
## AIC AICc BIC
## 8970.909 8970.928 8986.415
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s5.v3 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
## ses holt ets
## s5.v3 8.07383 7.938347 8.07383
There is no real difference between these three methods in terms of performance, and their residual p-values are all too low to indicate the models are capturing all the information in the data.
checkresiduals(ets.fc)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 24.738, df = 10, p-value = 0.005865
##
## Model df: 0. Total lags used: 10
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data") +
ylab("")
ARIMA modeling
set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 10.244, df = 9, p-value = 0.3311
##
## Model df: 1. Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima() |>
  forecast(h = horizon)
set.seed(123)
arima.acc <- c(s5.v3 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
## s5.v3
## 7.968795
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")
Results
s5v3.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S05\n\n")
## MAPE results by model - S05
s5.results <- rbind(s5v2.results, s5v3.results)
s5.results
## ses holt ets arima
## s5.v2 27.10623 27.103930 27.10565 30.181510
## s5.v3 8.07383 7.938347 8.07383 7.968795
S06
Var05
Exploratory Data Analysis
v5 <- ts(s6$Var05)
ggtsdisplay(v5)
## Warning: Removed 5 rows containing missing values (`geom_point()`).
- The data appears non-stationary, but it’s dramatically affected by outliers
- There are some missing values.
- There is an upward trend over time.
The outliers are making it difficult to visually interpret the data. Let’s deal with the outlier(s) and reassess
v5 <- v5 |>
  tsclean()
ggtsdisplay(v5)
Now we’re able to get a much better look at the data. There appears to be some seasonality and a clear upward trend over time. The variance appears constant. We see that the lags are significant over time in the ACF and cut off after lag 2 in the PACF.
gglagplot(v5)
We see a strong positive correlation at each lag.
Data Preparation
We split the data and prepare for modeling,
c(train, test, horizon) %<-% train_test_split(v5, 0.8)
Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s6.v5 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped = TRUE, h = horizon)
holt.acc <- c(s6.v5 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
## ses holt
## s6.v5 5.871887 5.872024
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)
## [,1]
## holt_pval 0.9511334
## ses_pval 0.9044653
We see that both the Holt method and Simple Exponential Smoothing seem to be capturing the data fairly well.
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")
ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,N,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.907
##
## Initial states:
## l = 27.0512
##
## sigma: 0.5092
##
## AIC AICc BIC
## 7556.894 7556.913 7572.400
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s6.v5 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
## ses holt ets
## s6.v5 5.871887 5.872024 5.871887
There is no real difference between these three methods in terms of performance. The Error, Trend, Seasonal (ETS) model performs the same as the SES model.
checkresiduals(ets.fc)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 23.182, df = 10, p-value = 0.01009
##
## Model df: 0. Total lags used: 10
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data") +
ylab("")
ARIMA modeling
set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,2) with drift
## Q* = 11.107, df = 6, p-value = 0.08513
##
## Model df: 4. Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima() |>
  forecast(h = horizon)
set.seed(123)
arima.acc <- c(s6.v5 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
## s6.v5
## 11.01148
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")
Results
s6v5.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S06\n\n")
## MAPE results by model - S06
s6v5.results
## ses holt ets arima
## s6.v5 5.871887 5.872024 5.871887 11.01148
Var07
Exploratory Data Analysis
v7 <- ts(s6$Var07)
ggtsdisplay(v7)
## Warning: Removed 5 rows containing missing values (`geom_point()`).
Much like the previous series, one extreme outlier is obscuring the rest of the data. Let's deal with that outlier first and then interpret the data,
v7 <- v7 |>
  tsclean()
ggtsdisplay(v7)
- The data is non-stationary
- There is an upward trend over time.
- There appears to be some seasonality.
- The ACF stays large over time. The PACF plot cuts off at lag 1, indicating an AR(1) component.
- Variance appears to be constant, so there's no need to Box-Cox or log transform the data; see the quick check sketched below.
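As a quick check on that judgment, we could inspect the Box-Cox parameter that forecast would select automatically; a value near 1 implies little benefit from transforming (a sketch, not part of the original analysis):
BoxCox.lambda(v7)   # Guerrero method by default; close to 1 means leave the data untransformed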
gglagplot(v7)
We see a positive correlation among different lags, strongest at lag 1 and decreasing over time.
Data Preparation
Since we’ve already dealt with the outliers, we can move straight to splitting our training and testing sets,
c(train, test, horizon) %<-% train_test_split(v7, 0.8)
Data Modeling
Simple Exponential Smoothing and Holt’s Method
set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s6.v7 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped = TRUE, h = horizon)
holt.acc <- c(s6.v7 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
## ses holt
## s6.v7 6.317979 6.330951
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
ses_pval = Box.test(residuals(ses.fit))$p.value)
## [,1]
## holt_pval 0.6373357
## ses_pval 0.9123292
Simple exponential smoothing seems to fit the data well.
autoplot(train) +
autolayer(ses.fit, series = "ses") +
autolayer(holt.fit, alpha = 0.4, series = "holt") +
autolayer(test, series = "test data") +
ylab("")
ETS Modeling
set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,N,N)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.8951
##
## Initial states:
## l = 27.4005
##
## sigma: 0.5195
##
## AIC AICc BIC
## 7608.745 7608.764 7624.251
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s6.v5 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
## ses holt ets
## s6.v7 6.317979 6.330951 6.317979
Again, ETS and SES appear to fit the data equally well.
checkresiduals(ets.fc)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 14.131, df = 10, p-value = 0.1671
##
## Model df: 0. Total lags used: 10
autoplot(train) +
autolayer(ets.fc, series = "ets") +
autolayer(test, series = "test data") +
ylab("")
ARIMA modeling
set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,2) with drift
## Q* = 6.9393, df = 6, p-value = 0.3265
##
## Model df: 4. Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima() |>
  forecast(h = horizon)
set.seed(123)
arima.acc <- c(s6.v5 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
## s6.v5
## 11.94021
autoplot(train) +
autolayer(arima.fc, series = "arima") +
autolayer(test, series = "test data") +
ylab("")
The ARIMA model's residuals pass the Ljung-Box test, but its test-set MAPE is nearly double that of the smoothing methods.
Results
s6v7.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S06\n\n")
## MAPE results by model - S06
s6.results <- rbind(s6v5.results, s6v7.results)
s6.results
## ses holt ets arima
## s6.v5 5.871887 5.872024 5.871887 11.01148
## s6.v7 6.317979 6.330951 6.317979 11.94021
Final Results
final.results <- rbind(s1.results, s2.results,
                       s3.results, s4.results,
                       s5.results, s6.results)
final.results
## ses holt ets arima
## s1.v1 10.014358 10.498584 23.209708 9.994805
## s1.v2 2.168109 2.168723 2.168141 2.118945
## s2.v2 27.551391 27.568733 99.999997 28.317545
## s2.v3 17.170725 17.286836 17.287545 17.293806
## s3.v5 15.451085 19.267867 15.454392 40.199973
## s3.v7 15.322569 18.701468 26.681607 40.038021
## s4.v1 23.019520 23.025230 23.019520 23.239073
## s4.v2 28.771703 28.785932 329.415504 37.043313
## s5.v2 27.106234 27.103930 27.105652 30.181510
## s5.v3 8.073830 7.938347 8.073830 7.968795
## s6.v5 5.871887 5.872024 5.871887 11.011477
## s6.v7 6.317979 6.330951 6.317979 11.940211
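As stated in the introduction, the model with the lowest MAPE is selected for each series. A sketch of extracting that choice from the results matrix:
best <- colnames(final.results)[apply(final.results, 1, which.min)]
data.frame(series = rownames(final.results), best.model = best)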