DATA 624 Project 1

Introduction

This report is intended for colleagues from a variety of backgrounds and contains both technical and non-technical explanations of the work conducted. The objective of this project was to forecast two of the five provided variables from each of six time series sets. We were provided a spreadsheet containing 1622 periods of every variable in every set and were expected to forecast 140 periods ahead. The sets are labeled S01 through S06, and each contains variables labeled Var01, Var02, Var03, Var05, and Var07. The variables to forecast depend on the set, as specified below:

  • S01 – Forecast Var01, Var02
  • S02 – Forecast Var02, Var03
  • S03 – Forecast Var05, Var07
  • S04 – Forecast Var01, Var02
  • S05 – Forecast Var02, Var03
  • S06 – Forecast Var05, Var07

For each forecast variable in each set, we perform the following:

  1. Exploratory Data Analysis
  • Visualize the data.
  • Identify trend, seasonality, stationarity, outliers, missing data, variance, etc. Interpret the ACF and PACF to inform the modeling approach.
  2. Data Preparation
  • Prepare the data for modeling: address outliers, missing values, stationarity, etc.
  • Split the data into training and testing sets.
  3. Data Modeling
  • Fit candidate models. For this project, we assess model accuracy using MAPE (Mean Absolute Percentage Error, defined below), selecting the model with the lowest score.
  4. Data Export
  • Using the best model, predict the next 140 periods and write the results to disk.
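
For reference, MAPE expresses the average forecast error as a percentage of the observed values,

$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|$$

where $y_t$ is the observed value and $\hat{y}_t$ the forecast; lower scores indicate more accurate forecasts.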

Data Import

We read in the data and create separate data frames containing only the first 1622 observations.
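
The import code is not shown in the knitted output; below is a minimal sketch of this step (the file name and the make_set helper are our own assumptions, not part of the original workbook):

library(readxl)
library(dplyr)

raw <- read_excel("project1_data.xls") # file name is an assumption

# hypothetical helper: first 1622 rows of one set, ordered by the period index
make_set <- function(set) {
  raw |>
    filter(category == set) |>
    arrange(SeriesInd) |>
    head(1622)
}

s1 <- make_set("S01"); s2 <- make_set("S02"); s3 <- make_set("S03")
s4 <- make_set("S04"); s5 <- make_set("S05"); s6 <- make_set("S06")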

Series Provided
SeriesInd category Var01 Var02 Var03 Var05 Var07
40669 S03 30.64286 123432400 30.34000 30.49000 30.57286
40669 S02 10.28000 60855800 10.05000 10.17000 10.28000
40669 S01 26.61000 10369300 25.89000 26.20000 26.01000
40669 S06 27.48000 39335700 26.82000 27.02000 27.32000
40669 S05 69.26000 27809100 68.19000 68.72000 69.15000
40669 S04 17.20000 16587400 16.88000 16.94000 17.10000
40670 S03 30.79857 150476200 30.46428 30.65714 30.62571
40670 S02 11.24000 215620200 10.40000 10.45000 10.96000
40670 S01 26.30000 10943800 25.70000 25.95000 25.86000
40670 S06 28.24000 55416000 27.24000 27.27000 28.07000
s1
SeriesInd category Var01 Var02 Var03 Var05 Var07
40669 S01 26.61 10369300 25.89 26.20 26.01
40670 S01 26.30 10943800 25.70 25.95 25.86
40671 S01 26.03 8933800 25.56 25.90 25.67
40672 S01 25.84 10775400 25.36 25.60 25.75
40673 S01 26.34 12875600 25.52 25.60 26.34

S01

par(mfrow = c(1,2))

s1 |>
  select(c(Var01,Var02)) |>
  ts() |>
  autoplot(facets = TRUE, main = "S01") +
  ylab('')

s1 |>
  ggplot(aes(x = Var01, y = Var02)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Var01

Exploratory Data Analysis

v1 <- ts(s1$Var01)
ggtsdisplay(v1)
## Warning: Removed 2 rows containing missing values (`geom_point()`).

  • The data is non-stationary
  • There are two missing values
  • The ACF is highly significant across many lags
  • The PACF cuts off at lag 1
  • There are some outliers though they are difficult to observe
  • There is an upward trend
  • Variance appears to be stable
  • There is some cyclicity in the data. There does not appear to be any seasonality
  • The data is non-linear

We can see the lags are highly correlated using a lag plot,

gglagplot(v1)

We can visualize the missing values to see where they appear within the data using the imputeTS package, which we can later use to impute the missing values using a moving average.

library(imputeTS)
v1 |>
  ggplot_na_distribution()

Data Preparation

To deal with the outliers, we’ll use the tsclean function from the forecast package, which flags observations with unusually large residuals from a smooth fit of the series and replaces them with values interpolated from neighboring observations.

For the missing values, we’ll use the na_ma() function from the imputeTS package, which replaces each missing value with a weighted moving average of the neighboring observations within a sliding window.

library(forecast)
v1 <- v1 |>
  tsclean(lambda = "auto") |>
  na_ma() # default window = 4

We take a first difference of the data to see if we can achieve stationarity,

v1.diff <- diff(v1)
ggtsdisplay(v1.diff)

There are still some outliers and significant spikes, but the ACF and PACF plots are informative: the ACF cuts off at lag 2, indicating an MA(2) component. The PACF also cuts off at lag 2, indicating an AR(2) component.

Now we split the data into testing and training sets to prepare for modeling, setting aside 20% of the data for testing.

library(zeallot) # provides tuple assignment via %<-%

train_test_split <- function(x, split) {
  split.index <- round(length(x) * split) # observation index of the split point
  train <- window(x, end = split.index)
  test <- window(x, start = split.index + 1)
  horizon <- length(test) # forecast horizon = size of the test set
  
  return(list(train, test, horizon))
}

c(train, test, horizon) %<-% train_test_split(v1, 0.8)
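
Since the same accuracy extraction is repeated for every model below, a small helper could reduce the duplication. A sketch (the name mape is ours, wrapping forecast::accuracy()):

mape <- function(fc, test) {
  accuracy(fc, test)['Test set', 'MAPE'] # keep only the test-set MAPE
}

With it, a line like ses.acc <- c(s1.v1 = accuracy(ses.fit, test)['Test set', 'MAPE']) becomes ses.acc <- c(s1.v1 = mape(ses.fit, test)).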

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s1.v1 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s1.v1 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s1.v1 10.01436 10.49858
Box.test(residuals(holt.fit))$p.value
## [1] 0.02235277
Box.test(residuals(ses.fit))$p.value
## [1] 0.005648436
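
For reference, Box.test() defaults to the Box–Pierce statistic, while checkresiduals() below reports the Ljung–Box variant,

$$Q = n\sum_{k=1}^{h}\hat{\rho}_k^2, \qquad Q^* = n(n+2)\sum_{k=1}^{h}\frac{\hat{\rho}_k^2}{n-k}$$

where $\hat{\rho}_k$ is the lag-$k$ autocorrelation of the residuals and $h$ the number of lags tested. Under the null hypothesis of white-noise residuals, both statistics are approximately chi-squared distributed, so small p-values, as here, indicate the model has left autocorrelation in the residuals.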
autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data")

ETS Modeling

set.seed(123)
ets.fit <- ets(train, model = "MMN")
ets.fit
## ETS(M,M,N) 
## 
## Call:
##  ets(y = train, model = "MMN") 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 27.0027 
##     b = 1.0007 
## 
##   sigma:  0.0127
## 
##      AIC     AICc      BIC 
## 7172.067 7172.114 7197.910

A model with multiplicative error, multiplicative trend, and no seasonality gives us the best ETS fit, but problems remain: there is still significance in the ACF, so the model is not incorporating all the information in the data, which we can see in the residuals,

checkresiduals(ets.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,M,N)
## Q* = 17.51, df = 10, p-value = 0.06382
## 
## Model df: 0.   Total lags used: 10
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s1.v1 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s1.v1 10.01436 10.49858 23.20971
autoplot(train) +
  autolayer(ets.fc, series = "ets") +
  autolayer(test, series = "test data")

ARIMA Modeling

An ARIMA(2,1,2) model with log transformation (lambda=0) achieves the best MAPE.

set.seed(123)
arima.fc <- train |>
  Arima(order=c(2,1,2), lambda=0) |>
  forecast(h=horizon)

checkresiduals(arima.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)
## Q* = 7.7323, df = 6, p-value = 0.2584
## 
## Model df: 4.   Total lags used: 10
set.seed(123)
arima.acc <- c(s1.v1 = accuracy(arima.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
##            ses     holt      ets    arima
## s1.v1 10.01436 10.49858 23.20971 9.994805

Results

s1v1.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)

Var02

Exploratory Data Analysis

v2 <- ts(s1$Var02)
ggtsdisplay(v2)

  • This data has a lot of outliers
  • There is a faint downward trend over time
  • The variance in the data appears to decrease over time
  • There is a strong autocorrelation among many lags
  • There is no missing data
  • The data is non-stationary
  • There doesn’t appear to be any seasonality
gglagplot(v2)

Although the ACF plot shows highly significant autocorrelation, the lag plot shows an inverse relationship between the series and its lagged values.

Data Preparation

Dealing with the outliers and non-stationarity, we apply a log transformation to address the inverse autocorrelation,

v2 <- v2 |>
  tsclean(lambda = "auto") |>
  log()

ggtsdisplay(diff(diff(v2)))

A log transformation followed by second-order differencing brings the data to stationarity.

We see the ACF cuts off at lag 1, suggesting an MA(1) component, while the PACF trails off, further supporting the MA(1) component.

gglagplot(diff(diff(v2))) # v2 is already log-transformed above

We can see the inverse correlations in the lags are no longer present.

Now we set aside our training and testing sets.

c(train, test, horizon) %<-% train_test_split(v2, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s1.v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s1.v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s1.v2 2.168109 2.168723
autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.4238 
## 
##   Initial states:
##     l = 16.1856 
## 
##   sigma:  0.311
## 
##      AIC     AICc      BIC 
## 6276.973 6276.991 6292.478
checkresiduals(ets.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 69.91, df = 10, p-value = 4.614e-11
## 
## Model df: 0.   Total lags used: 10

The p-value is far too small for the residuals to be considered white noise, but we’ll make the predictions and store the results for continuity,

set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s1.v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s1.v2 2.168109 2.168723 2.168141

ARIMA Modeling

set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,4)
## Q* = 13.84, df = 5, p-value = 0.01666
## 
## Model df: 5.   Total lags used: 10

The p-value of the residuals is still too low for this to be considered a good model, but it’s the best we’ve found so far,

set.seed(123)
arima.fc <- train |>
  Arima(order=c(1,1,4), lambda=0) |>
  forecast(h=horizon)

checkresiduals(arima.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,4)
## Q* = 13.84, df = 5, p-value = 0.01666
## 
## Model df: 5.   Total lags used: 10
set.seed(123)
arima.acc <- c(s1.v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s1.v2 
## 2.118945
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data")

Results

s1v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)

cat("\tMAPE results by model - S01\n\n")
##  MAPE results by model - S01
s1.results <- rbind(s1v1.results,s1v2.results)
s1.results
##             ses      holt       ets    arima
## s1.v1 10.014358 10.498584 23.209708 9.994805
## s1.v2  2.168109  2.168723  2.168141 2.118945

S02

par(mfrow = c(1,2))

s2 |>
  select(c(Var02,Var03)) |>
  ts() |>
  autoplot(facets = TRUE, main = "S02") +
  ylab('')

s2 |>
  ggplot(aes(x = Var02, y = Var03)) +
  geom_point()
## Warning: Removed 4 rows containing missing values (`geom_point()`).

Var02

Exploratory Data Analysis

v2 <- ts(s2$Var02)
ggtsdisplay(v2)

  • The data is non-stationary
  • There is a very faint downward trend over time.
  • There is high variance in the data
  • There does not appear to be any seasonality in the data
  • The ACF plot indicates the lags are highly significant and trail off over time, indicating an AR component. The PACF plot cuts off after lag 8, suggesting a high-order AR component.
  • There are many outliers in the data
  • There are no missing values in the data
gglagplot(v2)

The lags show an inverse correlation.

Data Preparation

Dealing with the outliers, we then check whether differencing the log-transformed data addresses the non-stationarity and inverse autocorrelation,

v2 <- v2 |>
  tsclean(lambda = "auto")

ggtsdisplay(diff(log(v2)))

The data now appears stationary. The ACF cuts off at lag 2, indicating an MA(2) component, while the PACF cuts off at lag 9, indicating an AR(9) component.

We can see the inverse correlations in the lags are no longer present.

Now we set aside our training and testing sets.

c(train, test, horizon) %<-% train_test_split(v2, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s2.v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s2.v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s2.v2 27.55139 27.56873
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                   [,1]
## holt_pval 5.881639e-08
## ses_pval  7.954414e-08

The p-values are very low. These models do not fit the data well.

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") + 
  ylab("")

ETS Modeling

set.seed(123)
train.diff <- diff(log(train))
ets.fit <- ets(train.diff)
ets.fit
## ETS(A,N,N) 
## 
## Call:
##  ets(y = train.diff) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = -5e-04 
## 
##   sigma:  0.3676
## 
##      AIC     AICc      BIC 
## 6704.535 6704.553 6720.038

Because we differenced and logged the series before fitting, we need to back-transform the forecasts to match the scale of the test data.

set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.fc <- exp(cumsum(ets.fc$mean)) # undo the differencing, then the log
ets.acc <- c(s2.v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt ets
## s2.v2 27.55139 27.56873 100
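
The ETS MAPE of 100 is an artifact of the back-transform: exp(cumsum(ets.fc$mean)) accumulates the forecast differences but never re-anchors them at the last observed training value, so every forecast sits near exp(0) = 1 while the test data is on the order of 10^8. A sketch of the full inversion (this would replace the exp(cumsum()) line above, while ets.fc still holds the forecast object):

anchor <- as.numeric(log(tail(train, 1))) # last observed log-level of the training data
ets.levels <- exp(anchor + cumsum(ets.fc$mean)) # accumulate differences, then exponentiate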
checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 2988, df = 10, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 10

We can see that this model also does not fit the data very well,

ARIMA Modeling

set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,3)
## Q* = 6.2503, df = 5, p-value = 0.2826
## 
## Model df: 5.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda=0) |>
  forecast(h=horizon)

arima.acc <- c(s2.v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s2.v2 
## 28.31755
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") + 
  ylab("")

Judging by the residual diagnostics, this model fits the data reasonably well, though its test-set MAPE is slightly worse than the smoothing methods’.

Results

s2v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
s2v2.results
##            ses     holt ets    arima
## s2.v2 27.55139 27.56873 100 28.31755

Var03

Exploratory Data Analysis

v3 <- ts(s2$Var03)
ggtsdisplay(v3)
## Warning: Removed 4 rows containing missing values (`geom_point()`).

  • The data is non-stationary
  • There is no noticeable trend over time
  • There are four missing values
  • There is one extreme outlier
  • There does not appear to be any seasonality in the data
  • The ACF plot indicates the lags are highly significant and trail off over time, indicating an AR component. The PACF plot cuts off at lag 1, indicating an AR(1) component.
gglagplot(v3)

The lags show a positive linear correlation over shorter periods of time and an inverse correlation over longer periods.

Data Preparation

Dealing with the outliers and missing values,

v3 <- v3 |>
  tsclean(lambda = "auto") |>
  na_ma()

ggtsdisplay(v3)

Let’s take a look at a difference of the data to see if it achieves stationarity,

ggtsdisplay(diff(v3))

Now we set aside our training and testing sets.

c(train, test, horizon) %<-% train_test_split(v3, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s2.v3 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s2.v3 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s2.v3 17.17073 17.28684
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                   [,1]
## holt_pval 1.352437e-03
## ses_pval  5.046795e-05

The p-values are very low. These models do not fit the data well.

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") + 
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,Ad,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 0.0266 
##     phi   = 0.8214 
## 
##   Initial states:
##     l = 9.4723 
##     b = 0.4578 
## 
##   sigma:  0.2565
## 
##      AIC     AICc      BIC 
## 5779.563 5779.628 5810.574
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s2.v3 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s2.v3 17.17073 17.28684 17.28755
checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,Ad,N)
## Q* = 29.746, df = 10, p-value = 0.0009426
## 
## Model df: 0.   Total lags used: 10

We can see from the low p-value that this model also does not fit the data very well,

autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,0)
## Q* = 11.722, df = 8, p-value = 0.1641
## 
## Model df: 2.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda=0) |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s2.v3 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s2.v3 
## 17.29381
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

This model also does not fit the data well.

Results

s2v3.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S02\n\n")
##  MAPE results by model - S02
s2.results <- rbind(s2v2.results,s2v3.results)
s2.results
##            ses     holt       ets    arima
## s2.v2 27.55139 27.56873 100.00000 28.31755
## s2.v3 17.17073 17.28684  17.28755 17.29381

S03

par(mfrow = c(1,2))

s3 |>
  select(c(Var05,Var07)) |>
  ts() |>
  autoplot(facets = TRUE, main = "S03") +
  ylab('')

s3 |>
  ggplot(aes(x = Var05, y = Var07)) +
  geom_point()
## Warning: Removed 4 rows containing missing values (`geom_point()`).

We can see that the data are very similar and have a strong positive correlation.
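
One way to quantify this relationship is a direct correlation, e.g.,

cor(s3$Var05, s3$Var07, use = "complete.obs") # skip rows with missing values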

Var05

Exploratory Data Analysis

v5 <- ts(s3$Var05)
ggtsdisplay(v5)
## Warning: Removed 4 rows containing missing values (`geom_point()`).

  • The data is non-stationary.
  • There is an upward trend over time.
  • There is some cyclicity, but no discernible seasonality.
  • The ACF plot tails off over time, indicating significance in the autocorrelation and presence of an AR component. The PACF cuts off at lag 2, indicating an AR(2) component.
  • There are some missing values.
  • There are some outliers present.
gglagplot(v5)

The lagplot indicates a positive linear relationship across lags.

Data Preparation

Dealing with outliers and missing values,

v5 <- v5 |>
  tsclean(lambda = "auto") |>
  na_ma()

ggtsdisplay(v5)

Let’s difference the data and see if it achieves stationarity,

checkresiduals(diff(v5))

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 61.078, df = 10, p-value = 2.265e-09
## 
## Model df: 0.   Total lags used: 10

The low Ljung-Box p-value shows autocorrelation remains even after differencing, which the models below will need to capture. Now we set aside our training and testing sets.

c(train, test, horizon) %<-% train_test_split(v5, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s3.v5 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s3.v5 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s3.v5 15.45108 19.26787
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                [,1]
## holt_pval 0.8696317
## ses_pval  0.8115583
autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.9277 
## 
##   Initial states:
##     l = 30.4884 
## 
##   sigma:  0.0179
## 
##      AIC     AICc      BIC 
## 9655.076 9655.095 9670.582
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s3.v5 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s3.v5 15.45108 19.26787 15.45439
checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 17.805, df = 10, p-value = 0.05834
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,2) with drift
## Q* = 10.565, df = 7, p-value = 0.1587
## 
## Model df: 3.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda=0) |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s3.v5 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s3.v5 
## 40.19997
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s3v5.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
s3v5.results
##            ses     holt      ets    arima
## s3.v5 15.45108 19.26787 15.45439 40.19997

Var07

Exploratory Data Analysis

v7 <- ts(s3$Var07)
ggtsdisplay(v7)
## Warning: Removed 4 rows containing missing values (`geom_point()`).

  • The data is non-stationary.
  • There is an upward trend over time.
  • There is some cyclicity, but no discernible seasonality.
  • The ACF plot tails off over time, indicating significance in the autocorrelation and presence of an AR component. The PACF cuts off at lag 2, indicating an AR(2) component.
  • There are some missing values.
  • There are some outliers present.
gglagplot(v7)

The lagplot indicates a positive linear relationship across lags.

Data Preparation

Dealing with outliers and missing values,

v7 <- v7 |>
  tsclean(lambda = "auto") |>
  na_ma()

ggtsdisplay(v7)

We’ve been checking differences manually, but we can also call the ndiffs function to determine the appropriate number of first differences,

ndiffs(v7)
## [1] 1

One difference is required to make the v7 data stationary.

Now we set aside our training and testing sets.

c(train, test, horizon) %<-% train_test_split(v7, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s3.v7 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s3.v7 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s3.v7 15.32257 18.70147
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                [,1]
## holt_pval 0.5457403
## ses_pval  0.3956926
autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,A,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 30.7662 
##     b = 0.0751 
## 
##   sigma:  0.0168
## 
##      AIC     AICc      BIC 
## 9495.793 9495.839 9521.635
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s3.v7 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s3.v7 15.32257 18.70147 26.68161
checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,A,N)
## Q* = 18.471, df = 10, p-value = 0.04752
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0) with drift
## Q* = 18.498, df = 10, p-value = 0.04712
## 
## Model df: 0.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda=0) |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s3.v7 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s3.v7 
## 40.03802
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s3v7.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S03\n\n")
##  MAPE results by model - S03
s3.results <- rbind(s3v5.results,s3v7.results)
s3.results
##            ses     holt      ets    arima
## s3.v5 15.45108 19.26787 15.45439 40.19997
## s3.v7 15.32257 18.70147 26.68161 40.03802

S04

par(mfrow = c(1,2))

s4 |>
  select(c(Var01,Var02)) |>
  ts() |>
  autoplot(facets = TRUE, main = "S04") +
  ylab('')

s4 |>
  ggplot(aes(x = Var01, y = Var02)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Var01

Exploratory Data Analysis

v1 <- ts(s4$Var01)
ggtsdisplay(v1)
## Warning: Removed 2 rows containing missing values (`geom_point()`).

  • The data is non-stationary.
  • There is an upward trend over time.
  • There is some cyclicity, but no discernible seasonality.
  • The ACF plot tails off over time, indicating significance in the autocorrelation and presence of an AR component. The PACF cuts off at lag 1, indicating an AR(1) component.
  • There are two missing values.
  • It’s not clear from the line plot whether any outliers are present, but we can run tsclean() to check.
gglagplot(v1)

  • There is a strong positive correlation from lags 1-9 that decreases slightly as the lags get larger.

Data Preparation

Dealing with any outliers and missing values,

v1 <- v1 |>
  tsclean()

ggtsdisplay(v1)

Let’s difference the data and see if it achieves stationarity,

checkresiduals(diff(v1))

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 8.5322, df = 10, p-value = 0.577
## 
## Model df: 0.   Total lags used: 10

We can see the variance increases over time, and the ACF and PACF plots indicate the data is not quite stationary yet. Let’s run ndiffs() to check with the KPSS test,

ndiffs(v1, test = "kpss")
## [1] 1

It appears one first difference is necessary to bring the data to stationarity. We can set aside our training and testing data,

c(train, test, horizon) %<-% train_test_split(v1, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s4.v1 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s4.v1 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s4.v1 23.01952 23.02523
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                 [,1]
## holt_pval 0.14618564
## ses_pval  0.07028179

The Holt model fares slightly better, although neither captures the nuances of the data,

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
## 
##   Initial states:
##     l = 17.1971 
## 
##   sigma:  0.0175
## 
##      AIC     AICc      BIC 
## 6808.987 6809.006 6824.493
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s4.v1 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s4.v1 23.01952 23.02523 23.01952
checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 28.511, df = 10, p-value = 0.001495
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)
## Q* = 11.977, df = 6, p-value = 0.06248
## 
## Model df: 4.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda = 0) |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s4.v1 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s4.v1 
## 23.23907
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s4v1.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S04\n\n")
##  MAPE results by model - S04
s4v1.results
##            ses     holt      ets    arima
## s4.v1 23.01952 23.02523 23.01952 23.23907

Var02

Exploratory Data Analysis

v2 <- ts(s4$Var02)
ggtsdisplay(v2)

  • The data is non-stationary.
  • There does appear to be some seasonality.
  • There are many outliers.
  • The ACF plot exhibits some seasonal behavior. The PACF cuts off at lag 4.
  • There are no missing values.
gglagplot(v2)

  • The data is inversely correlated at each lag.

Data Preparation

Dealing with any outliers and missing values,

v2 <- v2 |>
  tsclean() 

ggtsdisplay(v2)

In this case, leaving lambda at its default (NULL) does a better job of removing the outliers.

Let’s difference the data and see if it achieves stationarity,

checkresiduals(diff(v2))

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 8.5322, df = 10, p-value = 0.577
## 
## Model df: 0.   Total lags used: 10

The ACF and PACF plots suggest the differenced data is close to stationary. Let’s run ndiffs() to confirm with the KPSS test,

ndiffs(v2, test = "kpss")
## [1] 1

It appears one first difference is necessary to bring the data to stationarity. We can set aside our training and testing data,

c(train, test, horizon) %<-% train_test_split(v2, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s4.v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s4.v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##           ses     holt
## s4.v2 28.7717 28.78593
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                   [,1]
## holt_pval 9.362804e-06
## ses_pval  9.514048e-06

Neither model captures the nuances of the data, as the very low p-values show,

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,A,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.4765 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 17382793.0973 
##     b = 362620.2033 
## 
##   sigma:  0.3722
## 
##      AIC     AICc      BIC 
## 50401.69 50401.73 50427.53
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s4.v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##           ses     holt      ets
## s4.v2 28.7717 28.78593 329.4155

We can see that the ETS forecast is far off,

checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,A,N)
## Q* = 59.165, df = 10, p-value = 5.212e-09
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,1)
## Q* = 7.614, df = 7, p-value = 0.3679
## 
## Model df: 3.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima() |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s4.v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s4.v2 
## 37.04331
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s4v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S04\n\n")
##  MAPE results by model - S04
s4.results <- rbind(s4v1.results,s4v2.results)
s4.results
##            ses     holt       ets    arima
## s4.v1 23.01952 23.02523  23.01952 23.23907
## s4.v2 28.77170 28.78593 329.41550 37.04331

S05

Var02

Exploratory Data Analysis

v2 <- ts(s5$Var02)
ggtsdisplay(v2)
## Warning: Removed 1 rows containing missing values (`geom_point()`).

  • The data is non-stationary, although it appears to exhibit periods of stationarity
  • There is no noticeable trend over time.
  • There appears to be some seasonality.
  • The ACF plot indicates some seasonality. The PACF plot cuts off at lag 3, indicating an AR(3) component.
  • There is one missing value.
  • Many outliers are present.
gglagplot(v2)

Again we see an inverse correlation within each lag.

Data Preparation

Dealing with any outliers and missing values,

v2 <- v2 |>
  tsclean() 

ggtsdisplay(v2)

Now we see the ACF trails off over time, supporting the AR component indicated before cleaning the outliers.

We split the data and prepare for modeling,

c(train, test, horizon) %<-% train_test_split(v2, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s5.v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s5.v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s5.v2 27.10623 27.10393
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                  [,1]
## holt_pval 0.001833601
## ses_pval  0.001647381

The Holt model fares marginally better, although neither captures the nuances of the data,

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.3462 
## 
##   Initial states:
##     l = 28022117.2545 
## 
##   sigma:  0.244
## 
##      AIC     AICc      BIC 
## 48704.38 48704.40 48719.89
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s5.v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s5.v2 27.10623 27.10393 27.10565

There is no real difference between these three methods in terms of performance, and all have residual p-values too low to suggest the models are capturing all the information in the data.

checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 52.59, df = 10, p-value = 8.872e-08
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,2)
## Q* = 10.592, df = 7, p-value = 0.1574
## 
## Model df: 3.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda = 0) |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s5.v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s5.v2 
## 30.18151
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s5v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S05\n\n")
##  MAPE results by model - S05
s5v2.results
##            ses     holt      ets    arima
## s5.v2 27.10623 27.10393 27.10565 30.18151

Var03

Exploratory Data Analysis

v3 <- ts(s5$Var03)
ggtsdisplay(v3)
## Warning: Removed 5 rows containing missing values (`geom_point()`).

  • The data is non-stationary
  • There is no noticeable trend over time.
  • There appears to be some seasonality.
  • The ACF tails off over time, indicating an AR component. The PACF plot cuts off at lag 1, indicating an AR(1) component.
  • There are some missing values and maybe one outlier.
gglagplot(v3)

We see a positive correlation among different lags, strongest at lag 1 and decreasing over time.

Data Preparation

Dealing with any outliers and missing values,

v3 <- v3 |>
  tsclean(lambda = "auto") 

ggtsdisplay(v3)

Missing values have now been imputed and we can observe that the one noticeable outlier has been reduced in magnitude.

We split the data and prepare for modeling,

c(train, test, horizon) %<-% train_test_split(v3, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s5.v3 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s5.v3 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##           ses     holt
## s5.v3 8.07383 7.938347
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                   [,1]
## holt_pval 0.0015865109
## ses_pval  0.0002142398

The Holt model fares better, although neither captures the nuances of the data,

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
## 
##   Initial states:
##     l = 68.1953 
## 
##   sigma:  0.878
## 
##      AIC     AICc      BIC 
## 8970.909 8970.928 8986.415
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s5.v3 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##           ses     holt     ets
## s5.v3 8.07383 7.938347 8.07383

There is no real difference between these three methods in terms of performance, and all have residual p-values too low to suggest the models are capturing all the information in the data.

checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 24.738, df = 10, p-value = 0.005865
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 10.244, df = 9, p-value = 0.3311
## 
## Model df: 1.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima() |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s5.v3 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s5.v3 
## 7.968795
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s5v3.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S05\n\n")
##  MAPE results by model - S05
s5.results <- rbind(s5v2.results,s5v3.results)
s5.results
##            ses      holt      ets     arima
## s5.v2 27.10623 27.103930 27.10565 30.181510
## s5.v3  8.07383  7.938347  8.07383  7.968795

S06

Var05

Exploratory Data Analysis

v5 <- ts(s6$Var05)
ggtsdisplay(v5)
## Warning: Removed 5 rows containing missing values (`geom_point()`).

  • The data appears non-stationary, but it’s dramatically affected by outliers
  • There are some missing values.
  • There is an upward trend over time.

The outliers are making it difficult to visually interpret the data. Let’s deal with the outlier(s) and reassess

v5 <- v5 |>
  tsclean() 

ggtsdisplay(v5)

Now we’re able to get a much better look at the data. There appears to be some seasonality and a clear upward trend over time. The variance appears constant. The lags are significant over time in the ACF and cut off after lag 2 in the PACF.

gglagplot(v5)

We see a strong positive correlation at each lag.

Data Preparation

We split the data and prepare for modeling,

c(train, test, horizon) %<-% train_test_split(v5, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s6.v5 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s6.v5 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s6.v5 5.871887 5.872024
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                [,1]
## holt_pval 0.9511334
## ses_pval  0.9044653

We see that both the Holt method and Simple Exponential Smoothing seem to be capturing the data fairly well.

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.907 
## 
##   Initial states:
##     l = 27.0512 
## 
##   sigma:  0.5092
## 
##      AIC     AICc      BIC 
## 7556.894 7556.913 7572.400
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s6.v5 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s6.v5 5.871887 5.872024 5.871887

There is no real difference between these three methods in terms of performance. The ETS model performs the same as the SES model, as expected: the selected ETS(A,N,N) is simple exponential smoothing with additive errors.

checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 23.182, df = 10, p-value = 0.01009
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2) with drift
## Q* = 11.107, df = 6, p-value = 0.08513
## 
## Model df: 4.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima() |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s6.v5 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s6.v5 
## 11.01148
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s6v5.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S06\n\n")
##  MAPE results by model - S06
s6v5.results
##            ses     holt      ets    arima
## s6.v5 5.871887 5.872024 5.871887 11.01148

Var07

Exploratory Data Analysis

v7 <- ts(s6$Var07)
ggtsdisplay(v7)
## Warning: Removed 5 rows containing missing values (`geom_point()`).

Much like the previous series, one extreme outlier makes the raw plot nearly impossible to read. Let’s deal with that outlier first and then interpret the data,

v7 <- v7 |>
  tsclean()

ggtsdisplay(v7)

  • The data is non-stationary
  • There is an upward trend over time.
  • There appears to be some seasonality.
  • The ACF stays large over time. The PACF plot cuts off at lag 1, indicating an AR(1) component.
  • Variance appears to be constant, so there’s no need to Box Cox or log transform the data.
gglagplot(v7)

We see a positive correlation among different lags, strongest at lag 1 and decreasing over time.

Data Preparation

Since we’ve already dealt with the outliers, we can move straight to splitting our training and testing sets,

c(train, test, horizon) %<-% train_test_split(v7, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s6.v7 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s6.v7 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s6.v7 6.317979 6.330951
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                [,1]
## holt_pval 0.6373357
## ses_pval  0.9123292

Simple exponential smoothing seems to fit the data well.

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.8951 
## 
##   Initial states:
##     l = 27.4005 
## 
##   sigma:  0.5195
## 
##      AIC     AICc      BIC 
## 7608.745 7608.764 7624.251
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s6.v7 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s6.v7 6.317979 6.330951 6.317979

Again, ETS and SES appear to fit the data equally well.

checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 14.131, df = 10, p-value = 0.1671
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2) with drift
## Q* = 6.9393, df = 6, p-value = 0.3265
## 
## Model df: 4.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima() |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s6.v7 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s6.v7 
## 11.94021
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

The ARIMA residuals pass the Ljung-Box test, but the model does not forecast the test data nearly as well as the smoothing methods.

Results

s6v7.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S05\n\n")
##  MAPE results by model - S05
s6.results <- rbind(s6v5.results,s6v7.results)
s6.results
##            ses     holt      ets    arima
## s6.v5 5.871887 5.872024 5.871887 11.01148
## s6.v7 6.317979 6.330951 6.317979 11.94021

Final Results

final.results <- rbind(s1.results,s2.results,
                       s3.results,s4.results,
                       s5.results,s6.results)

final.results
##             ses      holt        ets     arima
## s1.v1 10.014358 10.498584  23.209708  9.994805
## s1.v2  2.168109  2.168723   2.168141  2.118945
## s2.v2 27.551391 27.568733  99.999997 28.317545
## s2.v3 17.170725 17.286836  17.287545 17.293806
## s3.v5 15.451085 19.267867  15.454392 40.199973
## s3.v7 15.322569 18.701468  26.681607 40.038021
## s4.v1 23.019520 23.025230  23.019520 23.239073
## s4.v2 28.771703 28.785932 329.415504 37.043313
## s5.v2 27.106234 27.103930  27.105652 30.181510
## s5.v3  8.073830  7.938347   8.073830  7.968795
## s6.v5  5.871887  5.872024   5.871887 11.011477
## s6.v7  6.317979  6.330951   6.317979 11.940211
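
Data Export

The final step called for in the introduction is to forecast 140 periods ahead with each series’ best model and write the predictions to disk. A minimal sketch of that step for S01 Var01, whose lowest MAPE came from the ARIMA(2,1,2) model (this assumes v1 still holds the cleaned S01 Var01 series; the output file name is our own):

final.fit <- Arima(v1, order = c(2, 1, 2), lambda = 0) # refit on the full cleaned series
final.fc <- forecast(final.fit, h = 140) # the required 140 periods

# write the point forecasts to disk
write.csv(data.frame(s01_var01 = as.numeric(final.fc$mean)),
          "s01_var01_forecast.csv", row.names = FALSE)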