DATA 624 Project 1

Introduction

This report is intended for colleagues from a variety of backgrounds and contains both technical and non-technical explanations of the work conducted. The objective of this project was to forecast two of the five provided variables from each of six time series sets. We were provided a spreadsheet containing 1622 periods of every variable in every set and were expected to forecast 140 periods ahead. The sets are labeled S01 through S06, and each contains variables labeled Var01, Var02, Var03, Var05, and Var07. The variables to forecast depend on the set, as specified below:

  • S01 – Forecast Var01, Var02
  • S02 – Forecast Var02, Var03
  • S03 – Forecast Var05, Var07
  • S04 – Forecast Var01, Var02
  • S05 – Forecast Var02, Var03
  • S06 – Forecast Var05, Var07

For each forecast variable in each set, we perform the following:

  1. Exploratory Data Analysis
  • Visualize the data.
  • Identify trend, seasonality, stationarity, outliers, missing data, variance, etc. Interpret the ACF and PACF to inform the modeling approach.
  2. Data Preparation
  • Prepare the data for modeling: address outliers, missing values, stationarity, etc.
  • Split the data into training and testing sets.
  3. Data Modeling
  • Fit candidate models. For this project, we assess model accuracy using MAPE (Mean Absolute Percentage Error, defined below), selecting the model with the lowest score.
  4. Data Export
  • Using the best model, predict the next 140 periods and write the results to disk.
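
For reference, MAPE expresses the average forecast error as a percentage of the observed values,

$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|$$

where $y_t$ is the observed value and $\hat{y}_t$ the forecast; lower scores indicate more accurate forecasts.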

Data Import

We read in the data and create separate data frames containing only the first 1622 observations.
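
The import code is not shown in the knitted output; below is a minimal sketch of this step (the file name and the make_set helper are our own assumptions, not part of the original workbook):

library(readxl)
library(dplyr)

raw <- read_excel("project1_data.xls") # file name is an assumption

# hypothetical helper: first 1622 rows of one set, ordered by the period index
make_set <- function(set) {
  raw |>
    filter(category == set) |>
    arrange(SeriesInd) |>
    head(1622)
}

s1 <- make_set("S01"); s2 <- make_set("S02"); s3 <- make_set("S03")
s4 <- make_set("S04"); s5 <- make_set("S05"); s6 <- make_set("S06")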

Series Provided
SeriesInd category Var01 Var02 Var03 Var05 Var07
40669 S03 30.64286 123432400 30.34000 30.49000 30.57286
40669 S02 10.28000 60855800 10.05000 10.17000 10.28000
40669 S01 26.61000 10369300 25.89000 26.20000 26.01000
40669 S06 27.48000 39335700 26.82000 27.02000 27.32000
40669 S05 69.26000 27809100 68.19000 68.72000 69.15000
40669 S04 17.20000 16587400 16.88000 16.94000 17.10000
40670 S03 30.79857 150476200 30.46428 30.65714 30.62571
40670 S02 11.24000 215620200 10.40000 10.45000 10.96000
40670 S01 26.30000 10943800 25.70000 25.95000 25.86000
40670 S06 28.24000 55416000 27.24000 27.27000 28.07000
s1
SeriesInd category Var01 Var02 Var03 Var05 Var07
40669 S01 26.61 10369300 25.89 26.20 26.01
40670 S01 26.30 10943800 25.70 25.95 25.86
40671 S01 26.03 8933800 25.56 25.90 25.67
40672 S01 25.84 10775400 25.36 25.60 25.75
40673 S01 26.34 12875600 25.52 25.60 26.34

S01

par(mfrow = c(1,2))

s1 |>
  select(c(Var01,Var02)) |>
  ts() |>
  autoplot(facets = TRUE, main = "S01") +
  ylab('')

s1 |>
  ggplot(aes(x = Var01, y = Var02)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Var01

Exploratory Data Analysis

v1 <- ts(s1$Var01)
ggtsdisplay(v1)
## Warning: Removed 2 rows containing missing values (`geom_point()`).

  • The data is non-stationary
  • There are two missing values
  • The ACF is highly significant across many lags
  • The PACF cuts off at lag 1
  • There are some outliers though they are difficult to observe
  • There is an upward trend
  • Variance appears to be stable
  • There is some cyclicity in the data. There does not appear to be any seasonality
  • The data is non-linear

We can see the lags are highly correlated using a lag plot,

gglagplot(v1)

We can visualize the missing values to see where they appear within the data using the imputeTS package, which we can later use to impute the missing values using a moving average.

library(imputeTS)
v1 |>
  ggplot_na_distribution()

Data Preparation

To deal with the outliers, we’ll use the tsclean function from the forecast package, which flags observations with unusually large residuals from a smooth fit of the series and replaces them with values interpolated from neighboring observations.

For the missing values, we’ll use the na_ma() function from the imputeTS package, which replaces each missing value with a weighted moving average of the neighboring observations within a sliding window.

library(forecast)
v1 <- v1 |>
  tsclean(lambda = "auto") |>
  na_ma() # default window = 4

We take a first difference of the data to see if we can achieve stationarity,

v1.diff <- diff(v1)
ggtsdisplay(v1.diff)

There are still some outliers and significant spikes, but the ACF and PACF plots are informative: the ACF cuts off at lag 2, indicating an MA(2) component. The PACF also cuts off at lag 2, indicating an AR(2) component.

Now we split the data into testing and training sets to prepare for modeling, setting aside 20% of the data for testing.

library(zeallot) # provides tuple assignment via %<-%

train_test_split <- function(x, split) {
  split.index <- round(length(x) * split) # observation index of the split point
  train <- window(x, end = split.index)
  test <- window(x, start = split.index + 1)
  horizon <- length(test) # forecast horizon = size of the test set
  
  return(list(train, test, horizon))
}

c(train, test, horizon) %<-% train_test_split(v1, 0.8)
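
Since the same accuracy extraction is repeated for every model below, a small helper could reduce the duplication. A sketch (the name mape is ours, wrapping forecast::accuracy()):

mape <- function(fc, test) {
  accuracy(fc, test)['Test set', 'MAPE'] # keep only the test-set MAPE
}

With it, a line like ses.acc <- c(s1.v1 = accuracy(ses.fit, test)['Test set', 'MAPE']) becomes ses.acc <- c(s1.v1 = mape(ses.fit, test)).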

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s1.v1 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s1.v1 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s1.v1 10.01436 10.49858
Box.test(residuals(holt.fit))$p.value
## [1] 0.02235277
Box.test(residuals(ses.fit))$p.value
## [1] 0.005648436
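
For reference, Box.test() defaults to the Box–Pierce statistic, while checkresiduals() below reports the Ljung–Box variant,

$$Q = n\sum_{k=1}^{h}\hat{\rho}_k^2, \qquad Q^* = n(n+2)\sum_{k=1}^{h}\frac{\hat{\rho}_k^2}{n-k}$$

where $\hat{\rho}_k$ is the lag-$k$ autocorrelation of the residuals and $h$ the number of lags tested. Under the null hypothesis of white-noise residuals, both statistics are approximately chi-squared distributed, so small p-values, as here, indicate the model has left autocorrelation in the residuals.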
autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data")

ETS Modeling

set.seed(123)
ets.fit <- ets(train, model = "MMN")
ets.fit
## ETS(M,M,N) 
## 
## Call:
##  ets(y = train, model = "MMN") 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 27.0027 
##     b = 1.0007 
## 
##   sigma:  0.0127
## 
##      AIC     AICc      BIC 
## 7172.067 7172.114 7197.910

A model with multiplicative error, multiplicative trend, and no seasonality gives us the best ETS fit, but problems remain: there is still significance in the ACF, so the model is not incorporating all the information in the data, which we can see in the residuals,

checkresiduals(ets.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,M,N)
## Q* = 17.51, df = 10, p-value = 0.06382
## 
## Model df: 0.   Total lags used: 10
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s1.v1 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s1.v1 10.01436 10.49858 23.20971
autoplot(train) +
  autolayer(ets.fc, series = "ets") +
  autolayer(test, series = "test data")

ARIMA Modeling

An ARIMA(2,1,2) model with log transformation (lambda=0) achieves the best MAPE.

set.seed(123)
arima.fc <- train |>
  Arima(order=c(2,1,2), lambda=0) |>
  forecast(h=horizon)

checkresiduals(arima.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)
## Q* = 7.7323, df = 6, p-value = 0.2584
## 
## Model df: 4.   Total lags used: 10
set.seed(123)
arima.acc <- c(s1.v1 = accuracy(arima.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
##            ses     holt      ets    arima
## s1.v1 10.01436 10.49858 23.20971 9.994805

Results

s1v1.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)

Var02

Exploratory Data Analysis

v2 <- ts(s1$Var02)
ggtsdisplay(v2)

  • This data has a lot of outliers
  • There is a faint downward trend over time
  • The variance in the data appears to decrease over time
  • There is a strong autocorrelation among many lags
  • There is no missing data
  • The data is non-stationary
  • There doesn’t appear to be any seasonality
gglagplot(v2)

Although the ACF plot shows highly significant autocorrelation, the lag plot shows an inverse relationship between the series and its lagged values.

Data Preparation

Dealing with the outliers and non-stationarity, we apply a log transformation to address the inverse autocorrelation,

v2 <- v2 |>
  tsclean(lambda = "auto") |>
  log()

ggtsdisplay(diff(diff(v2)))

A log transformation followed by second-order differencing brings the data to stationarity.

We see the ACF cuts off at lag 1, suggesting an MA(1) component, while the PACF trails off, further supporting the MA(1) component.

gglagplot(diff(diff(v2))) # v2 is already log-transformed above

We can see the inverse correlations in the lags are no longer present.

Now we set aside our training and testing sets.

c(train, test, horizon) %<-% train_test_split(v2, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s1.v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s1.v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s1.v2 2.168109 2.168723
autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.4238 
## 
##   Initial states:
##     l = 16.1856 
## 
##   sigma:  0.311
## 
##      AIC     AICc      BIC 
## 6276.973 6276.991 6292.478
checkresiduals(ets.fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 69.91, df = 10, p-value = 4.614e-11
## 
## Model df: 0.   Total lags used: 10

The p-value is far too small for the residuals to be considered white noise, but we’ll make the predictions and store the results for continuity,

set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s1.v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s1.v2 2.168109 2.168723 2.168141

ARIMA Modeling

set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,4)
## Q* = 13.84, df = 5, p-value = 0.01666
## 
## Model df: 5.   Total lags used: 10

The p-value of the residuals is still too low for this to be considered a good model, but it’s the best we’ve found so far,

set.seed(123)
arima.fc <- train |>
  Arima(order=c(1,1,4), lambda=0) |>
  forecast(h=horizon)

checkresiduals(arima.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,4)
## Q* = 13.84, df = 5, p-value = 0.01666
## 
## Model df: 5.   Total lags used: 10
set.seed(123)
arima.acc <- c(s1.v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s1.v2 
## 2.118945
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data")

Results

s1v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)

cat("\tMAPE results by model - S01\n\n")
##  MAPE results by model - S01
s1.results <- rbind(s1v1.results,s1v2.results)
s1.results
##             ses      holt       ets    arima
## s1.v1 10.014358 10.498584 23.209708 9.994805
## s1.v2  2.168109  2.168723  2.168141 2.118945

S02

par(mfrow = c(1,2))

s2 |>
  select(c(Var02,Var03)) |>
  ts() |>
  autoplot(facets = TRUE, main = "S02") +
  ylab('')

s2 |>
  ggplot(aes(x = Var02, y = Var03)) +
  geom_point()
## Warning: Removed 4 rows containing missing values (`geom_point()`).

Var02

Exploratory Data Analysis

v2 <- ts(s2$Var02)
ggtsdisplay(v2)

  • The data is non-stationary
  • There is a very faint downward trend over time.
  • There is high variance in the data
  • There does not appear to be any seasonality in the data
  • The ACF plot indicates the lags are highly significant and trail off over time, indicating an AR component. The PACF plot cuts off after lag 8, suggesting a high-order AR component.
  • There are many outliers in the data
  • There are no missing values in the data
gglagplot(v2)

The lags show an inverse correlation.

Data Preparation

Dealing with the outliers, we then check whether differencing the log-transformed data addresses the non-stationarity and inverse autocorrelation,

v2 <- v2 |>
  tsclean(lambda = "auto")

ggtsdisplay(diff(log(v2)))

The data now appears stationary. The ACF cuts off at lag 2, indicating an MA(2) component, while the PACF cuts off at lag 9, indicating an AR(9) component.

We can see the inverse correlations in the lags are no longer present.

Now we set aside our training and testing sets.

c(train, test, horizon) %<-% train_test_split(v2, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s2.v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s2.v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s2.v2 27.55139 27.56873
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                   [,1]
## holt_pval 5.881639e-08
## ses_pval  7.954414e-08

The p-values are very low. These models do not fit the data well.

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") + 
  ylab("")

ETS Modeling

set.seed(123)
train.diff <- diff(log(train))
ets.fit <- ets(train.diff)
ets.fit
## ETS(A,N,N) 
## 
## Call:
##  ets(y = train.diff) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = -5e-04 
## 
##   sigma:  0.3676
## 
##      AIC     AICc      BIC 
## 6704.535 6704.553 6720.038

Because we differenced and logged the series before fitting, we need to back-transform the forecasts to match the scale of the test data.

set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.fc <- exp(cumsum(ets.fc$mean)) # undo the differencing, then the log
ets.acc <- c(s2.v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt ets
## s2.v2 27.55139 27.56873 100
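
The ETS MAPE of 100 is an artifact of the back-transform: exp(cumsum(ets.fc$mean)) accumulates the forecast differences but never re-anchors them at the last observed training value, so every forecast sits near exp(0) = 1 while the test data is on the order of 10^8. A sketch of the full inversion (this would replace the exp(cumsum()) line above, while ets.fc still holds the forecast object):

anchor <- as.numeric(log(tail(train, 1))) # last observed log-level of the training data
ets.levels <- exp(anchor + cumsum(ets.fc$mean)) # accumulate differences, then exponentiate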
checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 2988, df = 10, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 10

We can see that this model also does not fit the data very well,

ARIMA Modeling

set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,3)
## Q* = 6.2503, df = 5, p-value = 0.2826
## 
## Model df: 5.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda=0) |>
  forecast(h=horizon)

arima.acc <- c(s2.v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s2.v2 
## 28.31755
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") + 
  ylab("")

Judging by the residual diagnostics, this model fits the data reasonably well, though its test-set MAPE is slightly worse than the smoothing methods’.

Results

s2v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
s2v2.results
##            ses     holt ets    arima
## s2.v2 27.55139 27.56873 100 28.31755

Var03

Exploratory Data Analysis

v3 <- ts(s2$Var03)
ggtsdisplay(v3)
## Warning: Removed 4 rows containing missing values (`geom_point()`).

  • The data is non-stationary
  • There is no noticeable trend over time
  • There are four missing values
  • There is one extreme outlier
  • There does not appear to be any seasonality in the data
  • The ACF plot indicates the lags are highly significant and trail off over time, indicating an AR component. The PACF plot cuts off at lag 1, indicating an AR(1) component.
gglagplot(v3)

The lags show a positive linear correlation over shorter periods of time and an inverse correlation over longer periods.

Data Preparation

Dealing with the outliers and missing values,

v3 <- v3 |>
  tsclean(lambda = "auto") |>
  na_ma()

ggtsdisplay(v3)

Let’s take a look at a difference of the data to see if it achieves stationarity,

ggtsdisplay(diff(v3))

Now we set aside our training and testing sets.

c(train, test, horizon) %<-% train_test_split(v3, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s2.v3 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s2.v3 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s2.v3 17.17073 17.28684
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                   [,1]
## holt_pval 1.352437e-03
## ses_pval  5.046795e-05

The p-values are very low. These models do not fit the data well.

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") + 
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,Ad,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 0.0266 
##     phi   = 0.8214 
## 
##   Initial states:
##     l = 9.4723 
##     b = 0.4578 
## 
##   sigma:  0.2565
## 
##      AIC     AICc      BIC 
## 5779.563 5779.628 5810.574
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s2.v3 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s2.v3 17.17073 17.28684 17.28755
checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,Ad,N)
## Q* = 29.746, df = 10, p-value = 0.0009426
## 
## Model df: 0.   Total lags used: 10

We can see from the low p-value that this model also does not fit the data very well,

autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,0)
## Q* = 11.722, df = 8, p-value = 0.1641
## 
## Model df: 2.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda=0) |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s2.v3 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s2.v3 
## 17.29381
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

This model also does not fit the data well.

Results

s2v3.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S02\n\n")
##  MAPE results by model - S02
s2.results <- rbind(s2v2.results,s2v3.results)
s2.results
##            ses     holt       ets    arima
## s2.v2 27.55139 27.56873 100.00000 28.31755
## s2.v3 17.17073 17.28684  17.28755 17.29381

S03

par(mfrow = c(1,2))

s3 |>
  select(c(Var05,Var07)) |>
  ts() |>
  autoplot(facets = TRUE, main = "S03") +
  ylab('')

s3 |>
  ggplot(aes(x = Var05, y = Var07)) +
  geom_point()
## Warning: Removed 4 rows containing missing values (`geom_point()`).

We can see that the data are very similar and have a strong positive correlation.
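
One way to quantify this relationship is a direct correlation, e.g.,

cor(s3$Var05, s3$Var07, use = "complete.obs") # skip rows with missing values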

Var05

Exploratory Data Analysis

v5 <- ts(s3$Var05)
ggtsdisplay(v5)
## Warning: Removed 4 rows containing missing values (`geom_point()`).

  • The data is non-stationary.
  • There is an upward trend over time.
  • There is some cyclicity, but no discernible seasonality.
  • The ACF plot tails off over time, indicating significance in the autocorrelation and presence of an AR component. The PACF cuts off at lag 2, indicating an AR(2) component.
  • There are some missing values.
  • There are some outliers present.
gglagplot(v5)

The lagplot indicates a positive linear relationship across lags.

Data Preparation

Dealing with outliers and missing values,

v5 <- v5 |>
  tsclean(lambda = "auto") |>
  na_ma()

ggtsdisplay(v5)

Let’s difference the data and see if it achieves stationarity,

checkresiduals(diff(v5))

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 61.078, df = 10, p-value = 2.265e-09
## 
## Model df: 0.   Total lags used: 10

The low Ljung-Box p-value shows autocorrelation remains even after differencing, which the models below will need to capture. Now we set aside our training and testing sets.

c(train, test, horizon) %<-% train_test_split(v5, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s3.v5 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s3.v5 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s3.v5 15.45108 19.26787
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                [,1]
## holt_pval 0.8696317
## ses_pval  0.8115583
autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.9277 
## 
##   Initial states:
##     l = 30.4884 
## 
##   sigma:  0.0179
## 
##      AIC     AICc      BIC 
## 9655.076 9655.095 9670.582
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s3.v5 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s3.v5 15.45108 19.26787 15.45439
checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 17.805, df = 10, p-value = 0.05834
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,2) with drift
## Q* = 10.565, df = 7, p-value = 0.1587
## 
## Model df: 3.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda=0) |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s3.v5 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s3.v5 
## 40.19997
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s3v5.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
s3v5.results
##            ses     holt      ets    arima
## s3.v5 15.45108 19.26787 15.45439 40.19997

Var07

Exploratory Data Analysis

v7 <- ts(s3$Var07)
ggtsdisplay(v7)
## Warning: Removed 4 rows containing missing values (`geom_point()`).

  • The data is non-stationary.
  • There is an upward trend over time.
  • There is some cyclicity, but no discernible seasonality.
  • The ACF plot tails off over time, indicating significance in the autocorrelation and presence of an AR component. The PACF cuts off at lag 2, indicating an AR(2) component.
  • There are some missing values.
  • There are some outliers present.
gglagplot(v7)

The lagplot indicates a positive linear relationship across lags.

Data Preparation

Dealing with outliers and missing values,

v7 <- v7 |>
  tsclean(lambda = "auto") |>
  na_ma()

ggtsdisplay(v7)

We’ve been checking differences manually, but we can also call the ndiffs function to determine the appropriate number of first differences,

ndiffs(v7)
## [1] 1

One difference is required to make the v7 data stationary.

Now we set aside our training and testing sets.

c(train, test, horizon) %<-% train_test_split(v7, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s3.v7 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s3.v7 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s3.v7 15.32257 18.70147
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                [,1]
## holt_pval 0.5457403
## ses_pval  0.3956926
autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,A,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 30.7662 
##     b = 0.0751 
## 
##   sigma:  0.0168
## 
##      AIC     AICc      BIC 
## 9495.793 9495.839 9521.635
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s3.v7 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s3.v7 15.32257 18.70147 26.68161
checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,A,N)
## Q* = 18.471, df = 10, p-value = 0.04752
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0) with drift
## Q* = 18.498, df = 10, p-value = 0.04712
## 
## Model df: 0.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda=0) |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s3.v7 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s3.v7 
## 40.03802
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s3v7.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S03\n\n")
##  MAPE results by model - S03
s3.results <- rbind(s3v5.results,s3v7.results)
s3.results
##            ses     holt      ets    arima
## s3.v5 15.45108 19.26787 15.45439 40.19997
## s3.v7 15.32257 18.70147 26.68161 40.03802

S04

par(mfrow = c(1,2))

s4 |>
  select(c(Var01,Var02)) |>
  ts() |>
  autoplot(facets = TRUE, main = "S04") +
  ylab('')

s4 |>
  ggplot(aes(x = Var01, y = Var02)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Var01

Exploratory Data Analysis

v1 <- ts(s4$Var01)
ggtsdisplay(v1)
## Warning: Removed 2 rows containing missing values (`geom_point()`).

  • The data is non-stationary.
  • There is an upward trend over time.
  • There is some cyclicity, but no discernible seasonality.
  • The ACF plot tails off over time, indicating significance in the autocorrelation and presence of an AR component. The PACF cuts off at lag 1, indicating an AR(1) component.
  • There are two missing values.
  • It’s not clear from the line plot whether any outliers are present, but we can run tsclean() to check.
gglagplot(v1)

  • There is a strong positive correlation from lags 1-9 that decreases slightly as the lags get larger.

Data Preparation

Dealing with any outliers and missing values,

v1 <- v1 |>
  tsclean()

ggtsdisplay(v1)

Let’s difference the data and see if it achieves stationarity,

checkresiduals(diff(v1))

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 8.5322, df = 10, p-value = 0.577
## 
## Model df: 0.   Total lags used: 10

We can see the variance increases over time, and the ACF and PACF plots indicate the data is not quite stationary yet. Let’s run ndiffs() to check with the KPSS test,

ndiffs(v1, test = "kpss")
## [1] 1

It appears one first difference is necessary to bring the data to stationarity. We can set aside our training and testing data,

c(train, test, horizon) %<-% train_test_split(v1, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s4.v1 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s4.v1 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s4.v1 23.01952 23.02523
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                 [,1]
## holt_pval 0.14618564
## ses_pval  0.07028179

The Holt model fares slightly better, although neither captures the nuances of the data,

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
## 
##   Initial states:
##     l = 17.1971 
## 
##   sigma:  0.0175
## 
##      AIC     AICc      BIC 
## 6808.987 6809.006 6824.493
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s4.v1 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s4.v1 23.01952 23.02523 23.01952
checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 28.511, df = 10, p-value = 0.001495
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2)
## Q* = 11.977, df = 6, p-value = 0.06248
## 
## Model df: 4.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda = 0) |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s4.v1 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s4.v1 
## 23.23907
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s4v1.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S04\n\n")
##  MAPE results by model - S04
s4v1.results
##            ses     holt      ets    arima
## s4.v1 23.01952 23.02523 23.01952 23.23907

Var02

Exploratory Data Analysis

v2 <- ts(s4$Var02)
ggtsdisplay(v2)

  • The data is non-stationary.
  • There does appear to be some seasonality.
  • There are many outliers.
  • The ACF plot exhibits some seasonal behavior. The PACF cuts off at lag 4.
  • There are no missing values.
gglagplot(v2)

  • The data is inversely correlated at each lag.

Data Preparation

Dealing with any outliers and missing values,

v2 <- v2 |>
  tsclean() 

ggtsdisplay(v2)

In this case, leaving lambda at its default (NULL) does a better job of removing the outliers.

Let’s difference the data and see if it achieves stationarity,

checkresiduals(diff(v2))

## 
##  Ljung-Box test
## 
## data:  Residuals
## Q* = 8.5322, df = 10, p-value = 0.577
## 
## Model df: 0.   Total lags used: 10

The ACF and PACF plots suggest the differenced data is close to stationary. Let’s run ndiffs() to confirm with the KPSS test,

ndiffs(v2, test = "kpss")
## [1] 1

It appears one first difference is necessary to bring the data to stationarity. We can set aside our training and testing data,

c(train, test, horizon) %<-% train_test_split(v2, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s4.v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s4.v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##           ses     holt
## s4.v2 28.7717 28.78593
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                   [,1]
## holt_pval 9.362804e-06
## ses_pval  9.514048e-06

Neither model captures the nuances of the data, as the very low p-values show,

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,A,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.4765 
##     beta  = 1e-04 
## 
##   Initial states:
##     l = 17382793.0973 
##     b = 362620.2033 
## 
##   sigma:  0.3722
## 
##      AIC     AICc      BIC 
## 50401.69 50401.73 50427.53
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s4.v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##           ses     holt      ets
## s4.v2 28.7717 28.78593 329.4155

We can see that the ETS forecast is far off,

checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,A,N)
## Q* = 59.165, df = 10, p-value = 5.212e-09
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,1)
## Q* = 7.614, df = 7, p-value = 0.3679
## 
## Model df: 3.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima() |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s4.v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s4.v2 
## 37.04331
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s4v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S04\n\n")
##  MAPE results by model - S04
s4.results <- rbind(s4v1.results,s4v2.results)
s4.results
##            ses     holt       ets    arima
## s4.v1 23.01952 23.02523  23.01952 23.23907
## s4.v2 28.77170 28.78593 329.41550 37.04331

S05

Var02

Exploratory Data Analysis

v2 <- ts(s5$Var02)
ggtsdisplay(v2)
## Warning: Removed 1 rows containing missing values (`geom_point()`).

  • The data is non-stationary, although it appears to exhibit periods of stationarity
  • There is no noticeable trend over time.
  • There appears to be some seasonality.
  • The ACF plot indicates some seasonality. The PACF plot cuts off at lag 3, indicating an AR(3) component.
  • There is one missing value.
  • Many outliers are present.
gglagplot(v2)

Again we see an inverse correlation within each lag.

Data Preparation

Dealing with any outliers and missing values,

v2 <- v2 |>
  tsclean() 

ggtsdisplay(v2)

Now we see the ACF trails off over time, supporting the AR component indicated before cleaning the outliers.

We split the data and prepare for modeling,

c(train, test, horizon) %<-% train_test_split(v2, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s5.v2 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s5.v2 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s5.v2 27.10623 27.10393
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                  [,1]
## holt_pval 0.001833601
## ses_pval  0.001647381

The Holt model fares marginally better, although neither captures the nuances of the data,

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(M,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.3462 
## 
##   Initial states:
##     l = 28022117.2545 
## 
##   sigma:  0.244
## 
##      AIC     AICc      BIC 
## 48704.38 48704.40 48719.89
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s5.v2 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s5.v2 27.10623 27.10393 27.10565

There is no real difference between these three methods in terms of performance, and all have residual p-values too low to suggest the models are capturing all the information in the data.

checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 52.59, df = 10, p-value = 8.872e-08
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima(lambda = 0) |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,2)
## Q* = 10.592, df = 7, p-value = 0.1574
## 
## Model df: 3.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima(lambda = 0) |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s5.v2 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s5.v2 
## 30.18151
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s5v2.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S05\n\n")
##  MAPE results by model - S05
s5v2.results
##            ses     holt      ets    arima
## s5.v2 27.10623 27.10393 27.10565 30.18151

Var03

Exploratory Data Analysis

v3 <- ts(s5$Var03)
ggtsdisplay(v3)
## Warning: Removed 5 rows containing missing values (`geom_point()`).

  • The data is non-stationary
  • There is no noticeable trend over time.
  • There appears to be some seasonality.
  • The ACF tails off over time, indicating an AR component. The PACF plot cuts off at lag 1, indicating an AR(1) component.
  • There are some missing values and maybe one outlier.
gglagplot(v3)

We see a positive correlation among different lags, strongest at lag 1 and decreasing over time.

Data Preparation

Dealing with any outliers and missing values,

v3 <- v3 |>
  tsclean(lambda = "auto") 

ggtsdisplay(v3)

Missing values have now been imputed and we can observe that the one noticeable outlier has been reduced in magnitude.

We split the data and prepare for modeling,

c(train, test, horizon) %<-% train_test_split(v3, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s5.v3 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s5.v3 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##           ses     holt
## s5.v3 8.07383 7.938347
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                   [,1]
## holt_pval 0.0015865109
## ses_pval  0.0002142398

The Holt model fares better, although neither captures the nuances of the data,

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
## 
##   Initial states:
##     l = 68.1953 
## 
##   sigma:  0.878
## 
##      AIC     AICc      BIC 
## 8970.909 8970.928 8986.415
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s5.v3 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##           ses     holt     ets
## s5.v3 8.07383 7.938347 8.07383

There is no real difference between these three methods in terms of performance, and all have residual p-values too low to suggest the models are capturing all the information in the data.

checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 24.738, df = 10, p-value = 0.005865
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 10.244, df = 9, p-value = 0.3311
## 
## Model df: 1.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima() |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s5.v3 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s5.v3 
## 7.968795
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s5v3.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S05\n\n")
##  MAPE results by model - S05
s5.results <- rbind(s5v2.results,s5v3.results)
s5.results
##            ses      holt      ets     arima
## s5.v2 27.10623 27.103930 27.10565 30.181510
## s5.v3  8.07383  7.938347  8.07383  7.968795

S06

Var05

Exploratory Data Analysis

v5 <- ts(s6$Var05)
ggtsdisplay(v5)
## Warning: Removed 5 rows containing missing values (`geom_point()`).

  • The data appears non-stationary, but it’s dramatically affected by outliers
  • There are some missing values.
  • There is an upward trend over time.

The outliers are making it difficult to visually interpret the data. Let’s deal with the outlier(s) and reassess

v5 <- v5 |>
  tsclean() 

ggtsdisplay(v5)

Now we’re able to get a much better look at the data. There appears to be some seasonality and a clear upward trend over time. The variance appears constant. The lags are significant over time in the ACF and cut off after lag 2 in the PACF.

gglagplot(v5)

We see a strong positive correlation at each lag.

Data Preparation

We split the data and prepare for modeling,

c(train, test, horizon) %<-% train_test_split(v5, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s6.v5 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s6.v5 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s6.v5 5.871887 5.872024
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                [,1]
## holt_pval 0.9511334
## ses_pval  0.9044653

We see that both the Holt method and Simple Exponential Smoothing seem to be capturing the data fairly well.

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.907 
## 
##   Initial states:
##     l = 27.0512 
## 
##   sigma:  0.5092
## 
##      AIC     AICc      BIC 
## 7556.894 7556.913 7572.400
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s6.v5 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s6.v5 5.871887 5.872024 5.871887

There is no real difference between these three methods in terms of performance. The ETS model performs the same as the SES model, as expected: the selected ETS(A,N,N) is simple exponential smoothing with additive errors.

checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 23.182, df = 10, p-value = 0.01009
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2) with drift
## Q* = 11.107, df = 6, p-value = 0.08513
## 
## Model df: 4.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima() |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s6.v5 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s6.v5 
## 11.01148
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

Results

s6v5.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S06\n\n")
##  MAPE results by model - S06
s6v5.results
##            ses     holt      ets    arima
## s6.v5 5.871887 5.872024 5.871887 11.01148

Var07

Exploratory Data Analysis

v7 <- ts(s6$Var07)
ggtsdisplay(v7)
## Warning: Removed 5 rows containing missing values (`geom_point()`).

Much like the previous series, one extreme outlier makes the raw plot nearly impossible to read. Let’s deal with that outlier first and then interpret the data,

v7 <- v7 |>
  tsclean()

ggtsdisplay(v7)

  • The data is non-stationary
  • There is an upward trend over time.
  • There appears to be some seasonality.
  • The ACF stays large over time. The PACF plot cuts off at lag 1, indicating an AR(1) component.
  • Variance appears to be constant, so there’s no need to Box Cox or log transform the data.
gglagplot(v7)

We see a positive correlation among different lags, strongest at lag 1 and decreasing over time.

Data Preparation

Since we’ve already dealt with the outliers, we can move straight to splitting our training and testing sets,

c(train, test, horizon) %<-% train_test_split(v7, 0.8)

Data Modeling

Simple Exponential Smoothing and Holt’s Method

set.seed(123)
ses.fit <- ses(train, h = horizon)
ses.acc <- c(s6.v7 = accuracy(ses.fit, test)['Test set', 'MAPE'])
holt.fit <- holt(train, damped=TRUE, h = horizon)
holt.acc <- c(s6.v7 = accuracy(holt.fit, test)['Test set', 'MAPE'])
cbind(ses = ses.acc, holt = holt.acc)
##            ses     holt
## s6.v7 6.317979 6.330951
rbind(holt_pval = Box.test(residuals(holt.fit))$p.value,
      ses_pval = Box.test(residuals(ses.fit))$p.value)
##                [,1]
## holt_pval 0.6373357
## ses_pval  0.9123292

Simple exponential smoothing seems to fit the data well.

autoplot(train) +
  autolayer(ses.fit, series = "ses") +
  autolayer(holt.fit, alpha = 0.4, series = "holt") +
  autolayer(test, series = "test data") +
  ylab("")

ETS Modeling

set.seed(123)
ets.fit <- ets(train)
ets.fit
## ETS(A,N,N) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.8951 
## 
##   Initial states:
##     l = 27.4005 
## 
##   sigma:  0.5195
## 
##      AIC     AICc      BIC 
## 7608.745 7608.764 7624.251
set.seed(123)
ets.fc <- forecast(ets.fit, h = horizon)
ets.acc <- c(s6.v7 = accuracy(ets.fc, test)['Test set', 'MAPE'])

cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc)
##            ses     holt      ets
## s6.v7 6.317979 6.330951 6.317979

Again, ETS and SES appear to fit the data equally well.

checkresiduals(ets.fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 14.131, df = 10, p-value = 0.1671
## 
## Model df: 0.   Total lags used: 10
autoplot(train) +
  autolayer(ets.fc, series = "ets") + 
  autolayer(test, series = "test data") +
  ylab("")

ARIMA Modeling

set.seed(123)
train |>
  auto.arima() |>
  forecast(h = horizon) |>
  checkresiduals()

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,2) with drift
## Q* = 6.9393, df = 6, p-value = 0.3265
## 
## Model df: 4.   Total lags used: 10
set.seed(123)
arima.fc <- train |>
  auto.arima() |>
  forecast(h=horizon)
set.seed(123)
arima.acc <- c(s6.v7 = accuracy(arima.fc, test)['Test set', 'MAPE'])
arima.acc
##    s6.v7 
## 11.94021
autoplot(train) +
  autolayer(arima.fc, series = "arima") +
  autolayer(test, series = "test data") +
  ylab("")

The ARIMA residuals pass the Ljung-Box test, but the model does not forecast the test data nearly as well as the smoothing methods.

Results

s6v7.results <- cbind(ses = ses.acc, holt = holt.acc, ets = ets.acc, arima = arima.acc)
cat("\tMAPE results by model - S05\n\n")
##  MAPE results by model - S05
s6.results <- rbind(s6v5.results,s6v7.results)
s6.results
##            ses     holt      ets    arima
## s6.v5 5.871887 5.872024 5.871887 11.01148
## s6.v7 6.317979 6.330951 6.317979 11.94021

Final Results

final.results <- rbind(s1.results,s2.results,
                       s3.results,s4.results,
                       s5.results,s6.results)

final.results
##             ses      holt        ets     arima
## s1.v1 10.014358 10.498584  23.209708  9.994805
## s1.v2  2.168109  2.168723   2.168141  2.118945
## s2.v2 27.551391 27.568733  99.999997 28.317545
## s2.v3 17.170725 17.286836  17.287545 17.293806
## s3.v5 15.451085 19.267867  15.454392 40.199973
## s3.v7 15.322569 18.701468  26.681607 40.038021
## s4.v1 23.019520 23.025230  23.019520 23.239073
## s4.v2 28.771703 28.785932 329.415504 37.043313
## s5.v2 27.106234 27.103930  27.105652 30.181510
## s5.v3  8.073830  7.938347   8.073830  7.968795
## s6.v5  5.871887  5.872024   5.871887 11.011477
## s6.v7  6.317979  6.330951   6.317979 11.940211
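
Data Export

The final step called for in the introduction is to forecast 140 periods ahead with each series’ best model and write the predictions to disk. A minimal sketch of that step for S01 Var01, whose lowest MAPE came from the ARIMA(2,1,2) model (this assumes v1 still holds the cleaned S01 Var01 series; the output file name is our own):

final.fit <- Arima(v1, order = c(2, 1, 2), lambda = 0) # refit on the full cleaned series
final.fc <- forecast(final.fit, h = 140) # the required 140 periods

# write the point forecasts to disk
write.csv(data.frame(s01_var01 = as.numeric(final.fc$mean)),
          "s01_var01_forecast.csv", row.names = FALSE)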