Load & Preview Data

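For completeness, the packages this section relies on are listed below (a minimal setup sketch; the original document's setup chunk is not shown here).

library(tidyverse)  # read_csv(), glimpse(), dplyr verbs, ggplot2
library(mice)       # CART-based imputation
library(forecast)   # ses(), auto.arima(), ggAcf(), ggPacf(), gglagplot(), ma()
library(TSstudio)   # ts_split()
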
df_series_raw <- read_csv("project1-data.csv")
## Rows: 9732 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): SeriesInd, category
## dbl (5): Var01, Var02, Var03, Var05, Var07
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(df_series_raw)
## Rows: 9,732
## Columns: 7
## $ SeriesInd <chr> "5/6/2011", "5/6/2011", "5/6/2011", "5/6/2011", "5/6/2011", …
## $ category  <chr> "S03", "S02", "S01", "S06", "S05", "S04", "S03", "S02", "S01…
## $ Var01     <dbl> 30.64286, 10.28000, 26.61000, 27.48000, 69.26000, 17.20000, …
## $ Var02     <dbl> 123432400, 60855800, 10369300, 39335700, 27809100, 16587400,…
## $ Var03     <dbl> 30.34000, 10.05000, 25.89000, 26.82000, 68.19000, 16.88000, …
## $ Var05     <dbl> 30.49000, 10.17000, 26.20000, 27.02000, 68.72000, 16.94000, …
## $ Var07     <dbl> 30.57286, 10.28000, 26.01000, 27.32000, 69.15000, 17.10000, …

Series 4

df_s04_raw <- df_series_raw %>%
  filter(category == "S04") %>%
  select(SeriesInd, Var01, Var02) %>%
  rename(Date = SeriesInd) %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"))

glimpse(df_s04_raw)
## Rows: 1,622
## Columns: 3
## $ Date  <date> 2011-05-06, 2011-05-07, 2011-05-08, 2011-05-09, 2011-05-10, 201…
## $ Var01 <dbl> 17.20, 17.23, 17.30, 16.90, 16.76, 16.83, 16.86, 16.98, 17.23, 1…
## $ Var02 <dbl> 16587400, 11718100, 16422000, 31816300, 15470000, 16181900, 1567…

Data Cleaning

Check for missing values

colSums(is.na(df_s04_raw))
##  Date Var01 Var02 
##     0     2     0
nrow(df_s04_raw)
## [1] 1622
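
Only 2 of the 1,622 rows are missing Var01 (roughly 0.1%), so imputation should have little effect on the series. A quick proportion check (added here as a sketch):

# Fraction of missing Var01 values: 2 / 1622 ~= 0.12%
mean(is.na(df_s04_raw$Var01)) * 100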

Impute missing values

First, we impute the two missing Var01 values using classification and regression trees (CART) via the mice package.

# Impute Var01 by CART; mice defaults to m = 5 imputations over maxit = 5
# iterations, which produces the 5 x 5 log below
imp_data <- mice(df_s04_raw, method = "cart")
## 
##  iter imp variable
##   1   1  Var01
##   1   2  Var01
##   1   3  Var01
##   1   4  Var01
##   1   5  Var01
##   2   1  Var01
##   2   2  Var01
##   2   3  Var01
##   2   4  Var01
##   2   5  Var01
##   3   1  Var01
##   3   2  Var01
##   3   3  Var01
##   3   4  Var01
##   3   5  Var01
##   4   1  Var01
##   4   2  Var01
##   4   3  Var01
##   4   4  Var01
##   4   5  Var01
##   5   1  Var01
##   5   2  Var01
##   5   3  Var01
##   5   4  Var01
##   5   5  Var01
df_s04 <- complete(imp_data)  # keep the first of the five completed datasets
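
As a sanity check on the imputation (an added sketch, not part of the original analysis), the imputed draws can be inspected directly; the mids object stores them per variable, with one column per imputation. Note that mice() is stochastic, so without set.seed() the draws vary between runs.

# The two imputed Var01 rows, one column per imputation (m = 1..5)
imp_data$imp$Var01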

Data Exploration

Next, we generate and review plots for both variables of interest. The Var01 series does not appear to exhibit seasonality; this is confirmed when subsets of the series are viewed via the window() function. We do see some short-term trends in Var01, which look difficult to predict. Var02 appears to be white noise punctuated by sporadic sharp upward spikes.

# Build time series objects (frequency 1, so the index is the observation number)
s04_var01_ts <- ts(df_s04$Var01)
s04_var02_ts <- ts(df_s04$Var02)

plot(s04_var01_ts)

plot(s04_var02_ts)

plot(window(s04_var02_ts, start=1186))

plot(window(s04_var01_ts, start=1186))
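
As a numeric cross-check on the visual impression of no seasonality (an added sketch, not part of the original analysis), forecast::findfrequency() estimates the dominant period from the spectral density and returns 1 when no seasonal cycle is detected.

findfrequency(s04_var01_ts)  # 1 would confirm the absence of seasonality
findfrequency(s04_var02_ts)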

Check if there is a relationship between the two variables we need to forecast

We also check whether there is a relationship between the two variables (linear, quadratic, etc.). From the scatter plots of the raw data as well as of the log-transformed values, there does not appear to be any meaningful relationship. Further, we compute the correlation of the two series; the result shows a weak negative correlation.

cor(df_s04$Var01, df_s04$Var02)
## [1] -0.1054731
ggplot(df_s04, aes(Var01, Var02)) + geom_point()

ggplot(df_s04, aes(log(Var01), log(Var02))) + geom_point()
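
To gauge whether the weak negative correlation is distinguishable from zero, a cor.test() could be added (a sketch, not in the original analysis). The p-value should be read loosely here, since autocorrelation within each series violates the test's independence assumption.

cor.test(df_s04$Var01, df_s04$Var02)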

Lag Plots

# Zoom in on the final stretch of each series so individual lags are readable
gglagplot(window(s04_var01_ts, start=1610))

gglagplot(window(s04_var02_ts, start=1610))

ACF Plots

ggAcf(s04_var01_ts)

ggAcf(s04_var02_ts)

PACF Plots

The ACF plots show statistically significant autocorrelation at the 95% confidence level (and likely higher) for all displayed Var01 lags and almost all displayed Var02 lags. Since we will later explore models that require stationarity, we also review PACF plots to help determine the order of differencing required. The Var01 PACF shows that the first lag accounts for almost all of the autocorrelation, so first-order differencing will likely make the data stationary. Var02 shows partial autocorrelation at a few additional lags, but lag 1 again accounts for the majority of the correlation, so first-order differencing may suffice there as well.

ggPacf(s04_var01_ts)

ggPacf(s04_var02_ts)
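
The differencing order suggested by the PACF can be cross-checked with forecast::ndiffs(), which applies a unit-root test (KPSS by default) and returns the number of differences needed for stationarity; a sketch:

ndiffs(s04_var01_ts)  # 1 would match the PACF-based reading
ndiffs(s04_var02_ts)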

Moving Average

While a moving average smoother is unlikely to be what we use for our forecasts, we nevertheless review a moving average of order 50 to reduce the noise in both series and perhaps uncover meaningful patterns.

autoplot(s04_var01_ts, series="Data") +
  autolayer(ma(s04_var01_ts, 50), series="50-MA")
## Warning: Removed 50 row(s) containing missing values (geom_path).

autoplot(s04_var02_ts, series="Data") +
  autolayer(ma(s04_var02_ts, 50), series="50-MA")
## Warning: Removed 50 row(s) containing missing values (geom_path).

Train Test Split

# Hold out the final 140 observations of each series as a test set
split_s04_var01_ts <- ts_split(ts.obj = s04_var01_ts, sample.out = 140)

s04_var01_train <- split_s04_var01_ts$train
s04_var01_test <- split_s04_var01_ts$test

split_s04_var02_ts <- ts_split(ts.obj = s04_var02_ts, sample.out = 140)

s04_var02_train <- split_s04_var02_ts$train
s04_var02_test <- split_s04_var02_ts$test
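
A quick length check (added here as a sketch) confirms that the split holds out exactly the last 140 observations:

length(s04_var01_train)  # 1482 = 1622 - 140
length(s04_var01_test)   # 140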

Exponential Smoothing

Exponential Smoothing Var01

Our first model under consideration is simple exponential smoothing (SES). We let ses() estimate the smoothing parameter, subject to the usual 0-to-1 constraint. For Var01 the fitted alpha is essentially 1, so the model reduces to a near-naive forecast that carries the last observation forward; for Var02 (below), the fitted alpha of roughly 0.42 implies moderate smoothing.

es_predict <- ses(s04_var01_train, h=140)

es_predict$model
## Simple exponential smoothing 
## 
## Call:
##  ses(y = s04_var01_train, h = 140) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
## 
##   Initial states:
##     l = 17.1987 
## 
##   sigma:  0.4952
## 
##      AIC     AICc      BIC 
## 8741.013 8741.029 8756.916
plot(es_predict)

ts.plot(es_predict, s04_var01_test, col = c("darkblue", "blue", "blue", "lightblue", "lightblue", "darkgreen"))

round(accuracy(es_predict, s04_var01_test), 2) 
##                ME RMSE  MAE  MPE MAPE MASE ACF1 Theil's U
## Training set 0.01 0.49 0.31 0.03 1.21 1.00 0.04        NA
## Test set     0.81 3.04 2.62 1.60 7.68 8.42 0.97      4.34

Exponential Smoothing Var02

es_predict <- ses(s04_var02_train, h=140)

es_predict$model
## Simple exponential smoothing 
## 
## Call:
##  ses(y = s04_var02_train, h = 140) 
## 
##   Smoothing parameters:
##     alpha = 0.4213 
## 
##   Initial states:
##     l = 17973481.0372 
## 
##   sigma:  12262949
## 
##      AIC     AICc      BIC 
## 59202.98 59203.00 59218.89
plot(es_predict)

ts.plot(es_predict, s04_var02_test, col = c("darkblue", "blue", "blue", "lightblue", "lightblue", "darkgreen"))

round(accuracy(es_predict, s04_var02_test), 2) 
##                      ME     RMSE     MAE    MPE  MAPE MASE ACF1 Theil's U
## Training set   -8849.49 12254672 7225203 -12.89 33.30 0.95 0.14        NA
## Test set     3567142.71  8870647 5433606   5.90 31.33 0.72 0.23      0.88

Var01 ARIMA

We also test forecasting with models selected by auto.arima(). For Var01, the chosen model is the rather simple ARIMA(0,1,0): first-order differencing with no AR or MA terms, i.e., a random walk, which yields a constant point forecast. For Var02, ARIMA(2,1,1) with drift is selected, meaning an autoregressive component of order two, first-order differencing, and a moving-average component of order one; the drift term produces a subtle downward trend in the forecast.

fit_var01 <- auto.arima(s04_var01_train)
arima_predict <- forecast(fit_var01, h=140)
plot(arima_predict)

round(accuracy(arima_predict, s04_var01_test), 2) 
##                ME RMSE  MAE  MPE MAPE MASE ACF1 Theil's U
## Training set 0.01 0.49 0.31 0.03 1.21 1.00 0.04        NA
## Test set     0.81 3.04 2.62 1.60 7.68 8.42 0.97      4.34

Var02 ARIMA

fit_var02 <- auto.arima(s04_var02_train)
arima_predict <- forecast(fit_var02, h=140)
plot(arima_predict)

round(accuracy(arima_predict, s04_var02_test), 2) 
##                    ME     RMSE     MAE    MPE  MAPE MASE ACF1 Theil's U
## Training set  24830.7 11659566 6760054 -15.14 32.53 0.89 0.00        NA
## Test set     616417.8  8111676 5240917 -16.31 36.87 0.69 0.23      0.85
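
Before settling on either ARIMA fit, residual diagnostics are a standard follow-up (added here as a sketch): forecast::checkresiduals() plots the residual series, its ACF, and a histogram, and runs a Ljung-Box test for remaining autocorrelation.

checkresiduals(fit_var01)
checkresiduals(fit_var02)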

Final Forecasts

For the forecasts we submit, we chose simple exponential smoothing for both variables. For Var01, the SES and auto ARIMA models produced the same constant point forecast across all 140 horizons, resulting in identical MAPE. For Var02, SES showed comparable MAPE on the training set and better MAPE on the test set than the auto ARIMA model. We therefore chose SES for Var02 on performance and SES for Var01 on performance and consistency. Both final models are refit on the full series before producing the 140-step forecasts.

Final Forecasts Var01

final_predict_var01 <- ses(s04_var01_ts, h=140)

plot(final_predict_var01)

Final Forecasts Var02

final_predict_var02 <- ses(s04_var02_ts, h=140)

plot(final_predict_var02)
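
Assuming the submission requires a flat file (the required format is not specified in this section), the 140 point forecasts can be pulled from each forecast object's mean component; the file name below is a placeholder.

# Hypothetical export of the final point forecasts; file name is a placeholder
submission_s04 <- tibble(
  step  = 1:140,
  Var01 = as.numeric(final_predict_var01$mean),
  Var02 = as.numeric(final_predict_var02$mean)
)
write_csv(submission_s04, "s04_final_forecasts.csv")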