First, we load a list of packages (such as tidyverse, fpp2) that are necessary for this session. We read the raw data from excel and transform the SeriesInd into date format. Our goal is to forecast 140 periods for each group and for different subset of variables. Second, we split the data into six groups by the group variable. We check the number of missing date by variable (Var01, Var02, Var03, Var05 & Var07). Each group varies by number of missing date in the variables. For example, group S06 has to forecast Var05 and Var07, each has 5 missing dates (e.g. 145 missing values minus 140 periods that we need to forecast) in the data set. We need to impute the missing values.
# load packages
packages <- c("tidyverse", "fpp2", "kableExtra", "broom", "MLmetrics")
pacman::p_load(char = packages)
# read data, format date
df <- readxl::read_excel("Data Set for Class.xls") %>%
dplyr::mutate(SeriesInd = as.Date(SeriesInd, origin = "1900-01-01"))
# check missing
dfNest <- df %>%
arrange(SeriesInd, group) %>%
tidyr::nest(-group)
missing <- lapply(1:nrow(dfNest), function(x) colSums(is.na(dfNest %>%
dplyr::filter(dfNest$group == dfNest$group[[x]]) %>%
tidyr::unnest(data))))
names(missing) <- unique(dfNest$group)
missing
## $S01
## group SeriesInd Var01 Var02 Var03 Var05 Var07
## 0 0 142 140 144 144 144
##
## $S02
## group SeriesInd Var01 Var02 Var03 Var05 Var07
## 0 0 142 140 144 144 144
##
## $S03
## group SeriesInd Var01 Var02 Var03 Var05 Var07
## 0 0 142 140 144 144 144
##
## $S04
## group SeriesInd Var01 Var02 Var03 Var05 Var07
## 0 0 142 140 144 144 144
##
## $S05
## group SeriesInd Var01 Var02 Var03 Var05 Var07
## 0 0 143 141 145 145 145
##
## $S06
## group SeriesInd Var01 Var02 Var03 Var05 Var07
## 0 0 143 141 145 145 145
For group S06, we need to build forecasts for Var05 and Var07 respectively. We look at the data and find that it has a frequency of 5 (cycling every 5 days). We count the number of missing dates, and we also plot the raw data to visuzlie any extreme value. We realize that there’s one extreme and 5 missing values in each variable.
# extract group S06, Var05, Var07
dfS06 <- df %>%
dplyr::filter(group == "S06") %>%
dplyr::select(SeriesInd, Var05, Var07) %>%
dplyr::mutate(weekday = lubridate::wday(SeriesInd, label = TRUE, abbr = TRUE),
row = nrow(.):1) %>%
arrange(SeriesInd)
head(dfS06, 10) %>%
kable() %>%
kable_styling()
| SeriesInd | Var05 | Var07 | weekday | row |
|---|---|---|---|---|
| 2011-05-08 | 27.02 | 27.32 | Sun | 1762 |
| 2011-05-09 | 27.27 | 28.07 | Mon | 1761 |
| 2011-05-10 | 28.03 | 28.11 | Tue | 1760 |
| 2011-05-11 | 28.12 | 29.13 | Wed | 1759 |
| 2011-05-12 | 28.90 | 28.86 | Thu | 1758 |
| 2011-05-15 | 29.09 | 28.80 | Sun | 1757 |
| 2011-05-16 | 28.47 | 28.08 | Mon | 1756 |
| 2011-05-17 | 27.99 | 28.58 | Tue | 1755 |
| 2011-05-18 | 28.50 | 28.99 | Wed | 1754 |
| 2011-05-19 | 28.82 | 28.08 | Thu | 1753 |
# check missing
colSums(is.na(dfS06))
## SeriesInd Var05 Var07 weekday row
## 0 145 145 0 0
# plot time series for Var05, Var07, set frequency = 5
ts6_Var05 <- dfS06 %>%
dplyr::filter(row >140) %>%
dplyr::select(Var05) %>%
ts(., frequency = 5)
ts6_Var07 <- dfS06 %>%
dplyr::filter(row >140) %>%
dplyr::select(Var07) %>%
ts(., frequency = 5)
autoplot(ts6_Var05, main = "Var05 raw data")
autoplot(ts6_Var07, main = "Var07 raw data")
Next step, we need to remove outliers and impute missing values before building our forecast model. We can locate the extreme outlier(s) and check for replacement(s) using the forecast::tsoutliers function. Furthermore, we can apply the forecast::tsclean to remove the outlier(s) and replace any missing value. Subsequently, we examine the time series and plot the clean data. Now, it’s ready for building forecast.
# check outlier(s)
ts6_Var05_out <- forecast::tsoutliers(ts6_Var05)
ts6_Var07_out <- forecast::tsoutliers(ts6_Var07)
ts6_Var05_out
## $index
## [1] 320
##
## $replacements
## [1] 31.98139
ts6_Var07_out
## $index
## [1] 320
##
## $replacements
## [1] 31.86057
# we have 1 extreme and 5 missing values in each variable, let's use tsclean to replace outlier and missing values with estimated values
ts6_Var05_clean <- forecast::tsclean(ts6_Var05)
ts6_Var07_clean <- forecast::tsclean(ts6_Var07)
plot(ts6_Var05_clean, col = "red", lty = "dashed", main = "Var05 clean data (outlier removed)")
plot(ts6_Var07_clean, col = "red", lty = "dashed", main = "Var07 clean data (outlier removed)")
We decide to automate the process by relying on ETS model. We use the forecast::ets to look for the most appropriate forecasting model for each variable. In both cases, the ets recommends [A,N,N], i.e. simple exponential smoothing with additive errors. We fit the data and visualize the forecast with prediction interval (80, 95). In addition, we evaluate our models using forecast::accuracy. The MAPE is 1.13% and 1.14% for Var05 and Var07 respectively.
# Var05
Var05_ets <- forecast::ets(ts6_Var05_clean, lambda = "auto")
summary(Var05_ets)
## ETS(A,N,N)
##
## Call:
## forecast::ets(y = ts6_Var05_clean, lambda = "auto")
##
## Box-Cox transformation: lambda= 0.6336
##
## Smoothing parameters:
## alpha = 0.8724
##
## Initial states:
## l = 11.1802
##
## sigma: 0.1493
##
## AIC AICc BIC
## 5824.118 5824.133 5840.293
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01513232 0.5650197 0.4135217 0.02727642 1.130514 0.4887134
## ACF1
## Training set -0.004471557
# the ETS returns a set of parameters and recommend ETS(A,N,N), i.e. simple exponential smoothing with additive errors
Var05_ets %>%
forecast::forecast(h = 140) %>%
autoplot() +
xlab("") +
ylab("Var05")
# Var07
Var07_ets <- forecast::ets(ts6_Var07_clean, lambda = "auto")
summary(Var07_ets)
## ETS(A,N,N)
##
## Call:
## forecast::ets(y = ts6_Var07_clean, lambda = "auto")
##
## Box-Cox transformation: lambda= 0.7069
##
## Smoothing parameters:
## alpha = 0.9079
##
## Initial states:
## l = 13.2717
##
## sigma: 0.1938
##
## AIC AICc BIC
## 6669.095 6669.110 6685.269
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01411673 0.558895 0.4163823 0.02446939 1.138325 0.4816848
## ACF1
## Training set 0.01285416
# the ETS returns a set of parameters and recommend ETS(A,N,N), i.e. simple exponential smoothing with additive errors
Var07_ets %>%
forecast::forecast(h = 140) %>%
autoplot() +
xlab("") +
ylab("Var07")
# evaluation
ts6_evaluation <- lapply(list(Var05_ets, Var07_ets), function(x) forecast::accuracy(x) %>% tidy)
names(ts6_evaluation) <- c("Var05", "Var07")
ts6_evaluation
## $Var05
## # A tibble: 1 x 8
## .rownames ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Training set 0.0151 0.565 0.414 0.0273 1.13 0.489 -0.00447
##
## $Var07
## # A tibble: 1 x 8
## .rownames ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Training set 0.0141 0.559 0.416 0.0245 1.14 0.482 0.0129
# MAPE
print(paste0("MAPE for Var05 is ", MLmetrics::MAPE(Var05_ets$fitted, ts6_Var05_clean)))
## [1] "MAPE for Var05 is 0.0113051402177851"
print(paste0("MAPE for Var07 is ", MLmetrics::MAPE(Var07_ets$fitted, ts6_Var07_clean)))
## [1] "MAPE for Var07 is 0.0113832455506511"
We output our result and paste them back into the excel sheet.
dfS06_pred <- dfS06 %>%
dplyr::filter(row <= 140) %>%
dplyr::mutate(Var05 = Var05_ets %>%
forecast::forecast(h = 140) %>%
as.data.frame() %>%
dplyr::select(Var05 = `Point Forecast`) %>%
.$Var05,
Var07 = Var07_ets %>%
forecast::forecast(h = 140) %>%
as.data.frame() %>%
dplyr::select(Var07 = `Point Forecast`) %>%
.$Var07)
dfS06_output <- dfS06 %>%
dplyr::filter(row > 140) %>%
dplyr::bind_rows(., dfS06_pred) %>%
dplyr::select(-weekday, -row) %>%
arrange(SeriesInd)
dfS06_output %>%
tail %>%
kable() %>%
kable_styling()
| SeriesInd | Var05 | Var07 |
|---|---|---|
| 2018-04-25 | 48.19599 | 48.01149 |
| 2018-04-26 | 48.19599 | 48.01149 |
| 2018-04-30 | 48.19599 | 48.01149 |
| 2018-05-01 | 48.19599 | 48.01149 |
| 2018-05-02 | 48.19599 | 48.01149 |
| 2018-05-03 | 48.19599 | 48.01149 |