# Packages used throughout this file: dplyr and ggplot2 for wrangling
# and plotting, and forecast for rwf(), meanf(), ses(), holt(),
# auto.arima(), and tsCV().
library(dplyr)
library(ggplot2)
library(forecast)

xlsx_path <- "raw_data.xlsx"
raw <- readxl::read_xlsx(xlsx_path)
This file examines variables S05Var03, S06Var05, and S06Var07.
This section contains initial visualizations of S05Var03, S06Var05, and S06Var07. These visualizations provide the basis for initial commentary and suggest a roadmap for the analysis that comprises the remainder of this file.
raw %>%
  ggplot(aes(x = SeriesInd, y = S05Var03)) +
  geom_line() +
  ggtitle("S05Var03")
raw %>%
  ggplot(aes(x = SeriesInd, y = S06Var05)) +
  geom_line() +
  ggtitle("S06Var05")
raw %>%
  ggplot(aes(x = SeriesInd, y = S06Var07)) +
  geom_line() +
  ggtitle("S06Var07")
All three variables closely resemble a random walk, and the two group S06 variables closely track each other. None of the variables exhibits obvious seasonality or cyclicity, and none is stationary. The group S06 variables appear to trend upward after SeriesInd = 41250, but because a small number of extreme outliers compress the rest of the data toward the bottom of the plot, it’s hard to be sure at this early stage. For all three variables, variability does not appear to depend on the level of the data.
Examining the numeric data reveals gaps in the SeriesInd column. These gaps mostly occur at regular intervals, and could represent weekends and holidays in a calendar.
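A quick tabulation of the step sizes in SeriesInd makes this structure visible (a minimal check, not part of the original analysis):

# Steps larger than 1 are gaps; if the gaps were weekends,
# steps of size 3 should dominate among the gap-sized steps.
table(diff(raw$SeriesInd))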
Do the gaps represent a pause in the process that generated this data? Or do the gaps conceal unknown values in the time series? If the average change in value across these gaps is larger than the typical difference between a value and its lag, then there’s reason to think these gaps represent missing data. If the average change across these gaps is approximately equal to the typical change from one value to the next, then the gaps probably represent a pause in the data-generating process.
Computing the average squared difference across gaps for S06Var05:
# Flag observations that immediately follow a gap in SeriesInd.
gaps <- c(FALSE, diff(raw$SeriesInd) > 1)
gaps.df <- data.frame(SeriesInd = raw$SeriesInd, AfterGap = gaps) %>%
  mutate(
    S06Var05 = raw$S06Var05,
    S06Var05.diff = S06Var05 - lag(S06Var05)
  )
# Mean squared across-gap difference, excluding differences driven by
# the extreme outliers.
sqdiff_across_gaps_S06Var05 <- gaps.df %>%
  filter(AfterGap, S06Var05.diff > -50) %>%
  pull(S06Var05.diff)
sqdiff_across_gaps_S06Var05 <- mean(sqdiff_across_gaps_S06Var05^2)
Computing the average squared difference between successive entries for S06Var05:
sqdiff_across_all <- gaps.df %>%
  filter(abs(S06Var05.diff) < 50) %>%  # exclude outlier-driven differences
  pull(S06Var05.diff)
sqdiff_across_all <- mean(sqdiff_across_all^2)
The mean square difference between values across gaps is 0.376, and the mean square difference between all successive values is 0.326. These values are close enough to suggest that the gaps in the SeriesInd column represent a pause in the data-generating process rather than missing data. As a result, we can treat the data as if all measurements are consecutive, with no missing values.
While it’s not known what process generated these data, the levels and behavior of the data are similar to those of stock prices. For the purpose of this report, I’ll regard each variable as the closing stock price of a different company, and I’ll regard the SeriesInd values as counting days. Restarting each time series at Day = 0 and dropping outliers from the group S06 data gives us the following:
# Rename the series to the fictional tickers RST, UVW, and XYZ,
# and drop the outlier values (100 or more) from the two S06 series.
RST.ts <- ts(raw$S05Var03)
UVW.ts <- raw %>%
  filter(S06Var05 < 100) %>%
  select(S06Var05) %>%
  ts()
XYZ.ts <- raw %>%
  filter(S06Var07 < 100) %>%
  select(S06Var07) %>%
  ts()
autoplot(RST.ts) +
  xlab("Day") +
  ylab("Closing Price (USD)") +
  ggtitle("Daily Closing Price of RST")
autoplot(UVW.ts) +
  xlab("Day") +
  ylab("Closing Price (USD)") +
  ggtitle("Daily Closing Price of UVW")
autoplot(XYZ.ts) +
  xlab("Day") +
  ylab("Closing Price (USD)") +
  ggtitle("Daily Closing Price of XYZ")
How similar are the time series representing UVW and XYZ?
autoplot(UVW.ts - XYZ.ts) +
  xlab("Day") +
  ylab("Difference in price (USD)") +
  ggtitle("Daily difference in prices, UVW and XYZ")
The differences in price look like white noise centered at zero, and their range is small compared to the level of each series. For this reason, I’ll restrict the detailed analysis to UVW, and then apply the best model for UVW to XYZ as well.
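As a rough check on this claim, we can test the difference series directly (a sketch, not part of the original analysis; it assumes UVW.ts and XYZ.ts align after the outlier filtering):

price_diff <- UVW.ts - XYZ.ts
mean(price_diff, na.rm = TRUE)  # near zero if the two series track each other
Box.test(price_diff, type = 'Ljung-Box')  # a large p-value is consistent with white noise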
Because “a naive forecast is optimal when data follow a random walk” (HA 3.1), I compute some simple forecasts for each variable before performing more complex analysis. The performance of these simple models will provide a benchmark for more sophisticated models. In the event of a tie, I’ll favor the simpler models.
RST_rwf <- rwf(RST.ts, h = 140)
UVW_rwf <- rwf(UVW.ts, h = 140)
XYZ_rwf <- rwf(XYZ.ts, h = 140)
RST_drwf <- rwf(RST.ts, h = 140, drift = TRUE)
UVW_drwf <- rwf(UVW.ts, h = 140, drift = TRUE)
XYZ_drwf <- rwf(XYZ.ts, h = 140, drift = TRUE)
RST_mean <- meanf(RST.ts, h = 140)
UVW_mean <- meanf(UVW.ts, h = 140)
XYZ_mean <- meanf(XYZ.ts, h = 140)
autoplot(RST.ts) +
  autolayer(RST_rwf, series = "Naive", PI = FALSE) +
  autolayer(RST_drwf, series = "Drift", PI = FALSE) +
  autolayer(RST_mean, series = "Mean", PI = FALSE)
autoplot(UVW.ts) +
  autolayer(UVW_rwf, series = "Naive", PI = FALSE) +
  autolayer(UVW_drwf, series = "Drift", PI = FALSE) +
  autolayer(UVW_mean, series = "Mean", PI = FALSE)
Evaluating performance of simple models using cross-validation and RMSE:
RST_rmse_rwf_nodrift <- tsCV(RST.ts, rwf, drift = FALSE, h = 1)
RST_rmse_rwf_nodrift <- sqrt(mean(RST_rmse_rwf_nodrift^2, na.rm = TRUE))
RST_rmse_rwf_drift <- tsCV(RST.ts, rwf, drift = TRUE, h = 1)
RST_rmse_rwf_drift <- sqrt(mean(RST_rmse_rwf_drift^2, na.rm = TRUE))
RST_rmse_meanf <- tsCV(RST.ts, meanf, h = 1)
RST_rmse_meanf <- sqrt(mean(RST_rmse_meanf^2, na.rm = TRUE))
UVW_rmse_rwf_nodrift <- tsCV(UVW.ts, rwf, drift = FALSE, h = 1)
UVW_rmse_rwf_nodrift <- sqrt(mean(UVW_rmse_rwf_nodrift^2, na.rm = TRUE))
UVW_rmse_rwf_drift <- tsCV(UVW.ts, rwf, drift = TRUE, h = 1)
UVW_rmse_rwf_drift <- sqrt(mean(UVW_rmse_rwf_drift^2, na.rm = TRUE))
UVW_rmse_meanf <- tsCV(UVW.ts, meanf, h = 1)
UVW_rmse_meanf <- sqrt(mean(UVW_rmse_meanf^2, na.rm = TRUE))
For both RST and UVW, the best-performing model is the random walk forecast with no drift. For RST, RMSE = 0.9037. For UVW, RMSE = 0.5712.
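The repeated tsCV calls above can be condensed with a small helper. This is a sketch of one way to tabulate the results; the function cv_rmse and the table layout are illustrative, not part of the original analysis:

# One-step-ahead cross-validated RMSE for a given forecast function.
cv_rmse <- function(y, fc, ...) {
  e <- tsCV(y, fc, h = 1, ...)
  sqrt(mean(e^2, na.rm = TRUE))
}
data.frame(
  Model = c("Naive", "Drift", "Mean"),
  RST = c(cv_rmse(RST.ts, rwf),
          cv_rmse(RST.ts, rwf, drift = TRUE),
          cv_rmse(RST.ts, meanf)),
  UVW = c(cv_rmse(UVW.ts, rwf),
          cv_rmse(UVW.ts, rwf, drift = TRUE),
          cv_rmse(UVW.ts, meanf))
)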
Below, I fit a simple exponential smoothing method.
RST_ses <- ses(RST.ts, h = 140)
## Warning in ets(x, "ANN", alpha = alpha, opt.crit = "mse", lambda = lambda, :
## Missing values encountered. Using longest contiguous portion of time series
summary(RST_ses)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = RST.ts, h = 140)
##
## Smoothing parameters:
## alpha = 0.9999
##
## Initial states:
## l = 68.1906
##
## sigma: 0.9023
##
## AIC AICc BIC
## 5142.281 5142.311 5156.312
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.02624863 0.901128 0.6446989 0.02598541 0.8551915 0.9987617
## ACF1
## Training set 0.08262822
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 795 89.02993 87.87363 90.18623 87.26152 90.79833
## 796 89.02993 87.39475 90.66510 86.52915 91.53071
## 797 89.02993 87.02729 91.03256 85.96716 92.09269
## [ reached 'max' / getOption("max.print") -- omitted 137 rows ]
The optimized simple exponential smoothing method chose \(\alpha = 0.9999\), making it almost indistinguishable from the random-walk forecast. It offers a tiny improvement in training RMSE, but not enough to justify the more complex model.
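To see why \(\alpha \approx 1\) collapses to the naive forecast, recall the SES forecast recursion:

\[
\hat{y}_{t+1 \mid t} = \alpha y_t + (1 - \alpha)\,\hat{y}_{t \mid t-1}.
\]

As \(\alpha \to 1\), this reduces to \(\hat{y}_{t+1 \mid t} = y_t\), which is exactly the naive forecast.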
UVW_ses <- ses(UVW.ts, h = 140)
summary(UVW_ses)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = UVW.ts, h = 140)
##
## Smoothing parameters:
## alpha = 0.8676
##
## Initial states:
## l = 27.0651
##
## sigma: 0.5665
##
## AIC AICc BIC
## 10105.98 10106.00 10122.15
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0150733 0.5661707 0.4145355 0.02695357 1.133511 0.9978013
## ACF1
## Training set 0.0002050798
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1617 48.19949 47.47347 48.92552 47.08913 49.30985
## 1618 48.19949 47.23828 49.16070 46.72945 49.66954
## 1619 48.19949 47.05026 49.34872 46.44189 49.95709
## [ reached 'max' / getOption("max.print") -- omitted 137 rows ]
For UVW, \(\alpha = 0.8676\), indicating that optimal simple exponential smoothing does take some account of values earlier than lag-1. Similar to SES’s performance with RST, SES offers only a very small performance gain over the random-walk forecast. This small gain is not sufficient to justify a more complex model.
Does allowing for trend and damping improve the performance of the exponential smoothing models?
RST_holt <- holt(RST.ts, h = 140, damped = TRUE)
## Warning in ets(x, "AAN", alpha = alpha, beta = beta, phi = phi, damped =
## damped, : Missing values encountered. Using longest contiguous portion of time
## series
UVW_holt <- holt(UVW.ts, h = 140, damped = TRUE)
summary(RST_holt)
##
## Forecast method: Damped Holt's method
##
## Model Information:
## Damped Holt's method
##
## Call:
## holt(y = RST.ts, h = 140, damped = TRUE)
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 4e-04
## phi = 0.8
##
## Initial states:
## l = 68.9316
## b = 0.159
##
## sigma: 0.9044
##
## AIC AICc BIC
## 5149.071 5149.178 5177.134
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0244788 0.9015767 0.6455891 0.02341696 0.8565096 1.000141
## ACF1
## Training set 0.08159178
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 795 89.03017 87.87110 90.18925 87.25753 90.80282
## 796 89.03037 87.39104 90.66971 86.52322 91.53753
## 797 89.03053 87.02254 91.03853 85.95957 92.10150
## [ reached 'max' / getOption("max.print") -- omitted 137 rows ]
summary(UVW_holt)
##
## Forecast method: Damped Holt's method
##
## Model Information:
## Damped Holt's method
##
## Call:
## holt(y = UVW.ts, h = 140, damped = TRUE)
##
## Smoothing parameters:
## alpha = 0.8679
## beta = 1e-04
## phi = 0.9733
##
## Initial states:
## l = 27.4044
## b = 0.0605
##
## sigma: 0.5671
##
## AIC AICc BIC
## 10112.03 10112.09 10144.36
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01319427 0.5661796 0.4146892 0.02042205 1.134186 0.9981713
## ACF1
## Training set -0.0004246243
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1617 48.19914 47.47242 48.92585 47.08772 49.31055
## 1618 48.19898 47.23669 49.16127 46.72728 49.67067
## 1619 48.19883 47.04818 49.34947 46.43907 49.95858
## [ reached 'max' / getOption("max.print") -- omitted 137 rows ]
For the RST series, an exponential smoothing model with a damped trend performs marginally better than the random walk forecast with no drift. The optimal \(\alpha\) is again almost 1, so forecasts depend almost entirely on the most recent observation, and the damping parameter \(\phi = 0.8\) is low, damping the trend rapidly. Together, the high \(\alpha\) and the low \(\phi\) suggest that damped-trend exponential smoothing doesn’t offer much insight beyond the simple random walk forecast.
The parameters of the optimal model for the UVW data are somewhat different, with \(\alpha = 0.8679\) and \(\phi = 0.9733\): values earlier than lag 1 carry some weight in the forecasts, and the trend is damped more gradually. Still, RMSE improves only marginally with this more complex model, and simple random walk forecasts with no drift remain the top contenders for all three variables under examination.
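For reference, the damped Holt point forecast is

\[
\hat{y}_{t+h \mid t} = \ell_t + (\phi + \phi^2 + \cdots + \phi^h)\,b_t,
\]

so with \(\phi = 0.8\) the total trend contribution is capped at \(\frac{\phi}{1-\phi}\,b_t = 4\,b_t\) even as \(h \to \infty\), while \(\phi = 0.9733\) lets the trend persist much longer before flattening out.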
Strictly speaking, it is ARMA modelling that requires stationary data; the “I” in ARIMA handles trends by differencing. The most common technique for producing stationary data from a trended series such as this one is first-differencing. We then apply a Box-Ljung test to the differenced data to check whether significant autocorrelation remains.
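Concretely, the first difference is

\[
y'_t = y_t - y_{t-1},
\]

and the second difference, which appears in the portmanteau tests below, is \(y''_t = y'_t - y'_{t-1} = y_t - 2y_{t-1} + y_{t-2}\).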
Differenced data:
autoplot(diff(RST.ts)) +
  xlab("Day") +
  ylab("Day-over-day difference in closing price (USD)") +
  ggtitle("First-differenced closing prices for RST")
autoplot(diff(UVW.ts)) +
  xlab("Day") +
  ylab("Day-over-day difference in closing price (USD)") +
  ggtitle("First-differenced closing prices for UVW")
For both variables, the differenced data do appear stationary: the mean is near zero, and the variability does not appear to change over time. The values look random. What do the ACF plots show?
# ggAcf() returns the ggplot directly, without Acf()'s base-graphics side plot.
ggAcf(diff(RST.ts)) +
  ggtitle("ACF plot for RST")
ggAcf(diff(UVW.ts)) +
  ggtitle("ACF plot for UVW")
The ACF plots show many significant lags, which suggests that the differenced data retain substantial autocorrelation and are not white noise.
ggPacf(diff(RST.ts)) +
  ggtitle("PACF plot for RST")
ggPacf(diff(UVW.ts)) +
  ggtitle("PACF plot for UVW")
The PACF plots also show several significant values, reinforcing the impression that autocorrelation remains. A portmanteau test confirms our suspicion:
Box.test(diff(RST.ts, differences = 2), type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: diff(RST.ts, differences = 2)
## X-squared = 280.31, df = 1, p-value < 2.2e-16
Box.test(diff(UVW.ts, differences = 2), type = 'Ljung-Box')
##
## Box-Ljung test
##
## data: diff(UVW.ts, differences = 2)
## X-squared = 508.83, df = 1, p-value < 2.2e-16
The Box-Ljung test confirms that the differenced data are not white noise. Even after two rounds of differencing, the null hypothesis that the observations are independently distributed is rejected, with a p-value near 0.
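The chunk above tests the twice-differenced series; the analogous check on the first differences, referenced earlier, is a one-liner (output not shown here):

Box.test(diff(RST.ts), type = 'Ljung-Box')
Box.test(diff(UVW.ts), type = 'Ljung-Box')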
Let’s fit an ARIMA model anyway. The differenced series look so much like noise that the data may be close enough to satisfying the assumptions for the model to still prove useful.
RST_arima <- auto.arima(RST.ts)
UVW_arima <- auto.arima(UVW.ts)
summary(RST_arima)
## Series: RST.ts
## ARIMA(0,1,2)
##
## Coefficients:
## ma1 ma2
## 0.1148 -0.0259
## s.e. 0.0251 0.0257
##
## sigma^2 estimated as 0.8048: log likelihood=-2118.21
## AIC=4242.41 AICc=4242.43 BIC=4258.59
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01181137 0.8962794 0.6460439 0.009077146 0.7999247 0.9910684
## ACF1
## Training set 0.002297292
summary(UVW_arima)
## Series: UVW.ts
## ARIMA(2,1,2)
##
## Coefficients:
## ar1 ar2 ma1 ma2
## -0.2834 0.5443 0.1567 -0.5871
## s.e. 0.1655 0.1214 0.1603 0.1067
##
## sigma^2 estimated as 0.3189: log likelihood=-1366.62
## AIC=2743.24 AICc=2743.27 BIC=2770.17
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01721161 0.5637965 0.4142626 0.03161162 1.133366 0.9971444
## ACF1
## Training set -0.002473942
For RST, RMSE for the ARIMA(0,1,2) model is 0.8963. For UVW, RMSE for the ARIMA(2,1,2) model is 0.5637. As with the damped and trended exponential smoothing methods, these models give marginal improvements in RMSE, but not enough to justify their added complexity over the simple forecasting methods.
Had ARIMA models provided a greater reduction in RMSE, then we would proceed from here to an analysis of residuals.
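That follow-up would start with forecast::checkresiduals(), which combines a residual time plot, an ACF plot, and a Ljung-Box test in one call; this is a sketch of the next step, not part of the model selection above:

checkresiduals(RST_arima)
checkresiduals(UVW_arima)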
For the variables RST.ts (S05Var03 in the original data), UVW.ts (S06Var05), and XYZ.ts (S06Var07), the best forecasting model is a random walk forecast without drift. The forecast value for all future times is the most recent value of the time series, and the prediction intervals widen as the forecast horizon extends into the future:
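For the naive forecast, both the point forecast and the growth of the prediction interval have simple closed forms:

\[
\hat{y}_{T+h \mid T} = y_T, \qquad \hat{\sigma}_h = \hat{\sigma}\sqrt{h},
\]

so the interval width grows with \(\sqrt{h}\), matching the widening fans in the plots below.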
autoplot(RST.ts) +
  autolayer(RST_rwf, series = "Naive", PI = TRUE) +
  xlab("Day") +
  ylab("Closing price (USD)") +
  ggtitle("Forecasts and prediction intervals for RST")
autoplot(UVW.ts) +
  autolayer(UVW_rwf, series = "Naive", PI = TRUE) +
  xlab("Day") +
  ylab("Closing price (USD)") +
  ggtitle("Forecasts and prediction intervals for UVW")
autoplot(XYZ.ts) +
  autolayer(XYZ_rwf, series = "Naive", PI = TRUE) +
  xlab("Day") +
  ylab("Closing price (USD)") +
  ggtitle("Forecasts and prediction intervals for XYZ")
This is consistent with reasonable expectations for stock market prices, which are notoriously resistant to time series forecasting methods. If we had more information about this data, such as knowledge of the process that generated it or historical data extending further into the past, it might have been possible to take better advantage of methods incorporating trend. Over the relatively short horizon of this data set, though, the long-run trends typical of stock prices were not clearly evident.
For all future times, the forecast values for each variable are given below.
# Forecast values for RST.ts (S05Var03). rwf() already returns a
# forecast object with h = 140, so we print it directly.
print(RST_rwf)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1623 89.7 88.54181 90.85818 87.92870 91.47129
## 1624 89.7 88.06207 91.33792 87.19501 92.20498
## 1625 89.7 87.69396 91.70603 86.63203 92.76797
## [ reached 'max' / getOption("max.print") -- omitted 137 rows ]
# Forecast values for UVW.ts (S06Var05):
print(UVW_rwf)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1617 48.13 47.39796 48.86204 47.01045 49.24956
## 1618 48.13 47.09474 49.16526 46.54671 49.71329
## 1619 48.13 46.86207 49.39793 46.19087 50.06913
## [ reached 'max' / getOption("max.print") -- omitted 137 rows ]
# Forecast values for XYZ.ts (S06Var07):
print(XYZ_rwf)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1617 47.97 47.24943 48.69057 46.86799 49.07202
## 1618 47.97 46.95096 48.98904 46.41152 49.52849
## 1619 47.97 46.72194 49.21806 46.06125 49.87875
## [ reached 'max' / getOption("max.print") -- omitted 137 rows ]