data_624_project

Set up

First, we load a list of packages (such as tidyverse, fpp2) that are necessary for this session. We read the raw data from excel and transform the SeriesInd into date format. Our goal is to forecast 140 periods for each group and for different subset of variables. Second, we split the data into six groups by the group variable. We check the number of missing date by variable (Var01, Var02, Var03, Var05 & Var07). Each group varies by number of missing date in the variables. For example, group S06 has to forecast Var05 and Var07, each has 5 missing dates (e.g. 145 missing values minus 140 periods that we need to forecast) in the data set. We need to impute the missing values.

# load packages
packages <- c("tidyverse", "fpp2", "kableExtra", "broom", "MLmetrics")
pacman::p_load(char = packages)

# read data, format date
df <- readxl::read_excel("Data Set for Class.xls") %>%
        dplyr::mutate(SeriesInd = as.Date(SeriesInd, origin = "1900-01-01"))

# check missing
dfNest <- df %>% 
        arrange(SeriesInd, group) %>%
        tidyr::nest(-group)

missing <- lapply(1:nrow(dfNest), function(x) colSums(is.na(dfNest %>% 
                                                         dplyr::filter(dfNest$group == dfNest$group[[x]]) %>% 
                                                         tidyr::unnest(data))))

names(missing) <- unique(dfNest$group)

missing
## $S01
##     group SeriesInd     Var01     Var02     Var03     Var05     Var07 
##         0         0       142       140       144       144       144 
## 
## $S02
##     group SeriesInd     Var01     Var02     Var03     Var05     Var07 
##         0         0       142       140       144       144       144 
## 
## $S03
##     group SeriesInd     Var01     Var02     Var03     Var05     Var07 
##         0         0       142       140       144       144       144 
## 
## $S04
##     group SeriesInd     Var01     Var02     Var03     Var05     Var07 
##         0         0       142       140       144       144       144 
## 
## $S05
##     group SeriesInd     Var01     Var02     Var03     Var05     Var07 
##         0         0       143       141       145       145       145 
## 
## $S06
##     group SeriesInd     Var01     Var02     Var03     Var05     Var07 
##         0         0       143       141       145       145       145

Data Exploration (group S06)

For group S06, we need to build forecasts for Var05 and Var07 respectively. We look at the data and find that it has a frequency of 5 (cycling every 5 days). We count the number of missing dates, and we also plot the raw data to visuzlie any extreme value. We realize that there’s one extreme and 5 missing values in each variable.

# extract group S06, Var05, Var07
dfS06 <- df %>%
        dplyr::filter(group == "S06") %>%
        dplyr::select(SeriesInd, Var05, Var07) %>%
        dplyr::mutate(weekday = lubridate::wday(SeriesInd, label = TRUE, abbr = TRUE),
                      row = nrow(.):1) %>%
        arrange(SeriesInd)

head(dfS06, 10) %>% 
        kable() %>%
        kable_styling()

SeriesInd	Var05	Var07	weekday	row
2011-05-08	27.02	27.32	Sun	1762
2011-05-09	27.27	28.07	Mon	1761
2011-05-10	28.03	28.11	Tue	1760
2011-05-11	28.12	29.13	Wed	1759
2011-05-12	28.90	28.86	Thu	1758
2011-05-15	29.09	28.80	Sun	1757
2011-05-16	28.47	28.08	Mon	1756
2011-05-17	27.99	28.58	Tue	1755
2011-05-18	28.50	28.99	Wed	1754
2011-05-19	28.82	28.08	Thu	1753


# check missing
colSums(is.na(dfS06))
## SeriesInd     Var05     Var07   weekday       row 
##         0       145       145         0         0

# plot time series for Var05, Var07, set frequency = 5
ts6_Var05 <- dfS06 %>%
        dplyr::filter(row >140) %>%
        dplyr::select(Var05) %>%
        ts(., frequency = 5)

ts6_Var07 <- dfS06 %>%
        dplyr::filter(row >140) %>%
        dplyr::select(Var07) %>%
        ts(., frequency = 5)

autoplot(ts6_Var05, main = "Var05 raw data")

autoplot(ts6_Var07, main = "Var07 raw data")

Remove outlier(s), impute missing value

Next step, we need to remove outliers and impute missing values before building our forecast model. We can locate the extreme outlier(s) and check for replacement(s) using the forecast::tsoutliers function. Furthermore, we can apply the forecast::tsclean to remove the outlier(s) and replace any missing value. Subsequently, we examine the time series and plot the clean data. Now, it’s ready for building forecast.

# check outlier(s)
ts6_Var05_out <- forecast::tsoutliers(ts6_Var05)
ts6_Var07_out <- forecast::tsoutliers(ts6_Var07)

ts6_Var05_out
## $index
## [1] 320
## 
## $replacements
## [1] 31.98139
ts6_Var07_out
## $index
## [1] 320
## 
## $replacements
## [1] 31.86057

# we have 1 extreme and 5 missing values in each variable, let's use tsclean to replace outlier and missing values with estimated values
ts6_Var05_clean <- forecast::tsclean(ts6_Var05)
ts6_Var07_clean <- forecast::tsclean(ts6_Var07)

plot(ts6_Var05_clean, col = "red", lty = "dashed", main = "Var05 clean data (outlier removed)")

plot(ts6_Var07_clean, col = "red", lty = "dashed", main = "Var07 clean data (outlier removed)")

Model - ETS

We decide to automate the process by relying on ETS model. We use the forecast::ets to look for the most appropriate forecasting model for each variable. In both cases, the ets recommends [A,N,N], i.e. simple exponential smoothing with additive errors. We fit the data and visualize the forecast with prediction interval (80, 95). In addition, we evaluate our models using forecast::accuracy. The MAPE is 1.13% and 1.14% for Var05 and Var07 respectively.

# Var05
Var05_ets <- forecast::ets(ts6_Var05_clean, lambda = "auto")
summary(Var05_ets)
## ETS(A,N,N) 
## 
## Call:
##  forecast::ets(y = ts6_Var05_clean, lambda = "auto") 
## 
##   Box-Cox transformation: lambda= 0.6336 
## 
##   Smoothing parameters:
##     alpha = 0.8724 
## 
##   Initial states:
##     l = 11.1802 
## 
##   sigma:  0.1493
## 
##      AIC     AICc      BIC 
## 5824.118 5824.133 5840.293 
## 
## Training set error measures:
##                      ME      RMSE       MAE        MPE     MAPE      MASE
## Training set 0.01513232 0.5650197 0.4135217 0.02727642 1.130514 0.4887134
##                      ACF1
## Training set -0.004471557

# the ETS returns a set of parameters and recommend ETS(A,N,N), i.e. simple exponential smoothing with additive errors
Var05_ets %>%
        forecast::forecast(h = 140) %>%
        autoplot() +
        xlab("") +
        ylab("Var05")


# Var07
Var07_ets <- forecast::ets(ts6_Var07_clean, lambda = "auto")
summary(Var07_ets)
## ETS(A,N,N) 
## 
## Call:
##  forecast::ets(y = ts6_Var07_clean, lambda = "auto") 
## 
##   Box-Cox transformation: lambda= 0.7069 
## 
##   Smoothing parameters:
##     alpha = 0.9079 
## 
##   Initial states:
##     l = 13.2717 
## 
##   sigma:  0.1938
## 
##      AIC     AICc      BIC 
## 6669.095 6669.110 6685.269 
## 
## Training set error measures:
##                      ME     RMSE       MAE        MPE     MAPE      MASE
## Training set 0.01411673 0.558895 0.4163823 0.02446939 1.138325 0.4816848
##                    ACF1
## Training set 0.01285416

# the ETS returns a set of parameters and recommend ETS(A,N,N), i.e. simple exponential smoothing with additive errors
Var07_ets %>%
        forecast::forecast(h = 140) %>%
        autoplot() +
        xlab("") +
        ylab("Var07")


# evaluation
ts6_evaluation <- lapply(list(Var05_ets, Var07_ets), function(x) forecast::accuracy(x) %>% tidy) 
names(ts6_evaluation) <- c("Var05", "Var07")
ts6_evaluation
## $Var05
## # A tibble: 1 x 8
##   .rownames        ME  RMSE   MAE    MPE  MAPE  MASE     ACF1
##   <chr>         <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>
## 1 Training set 0.0151 0.565 0.414 0.0273  1.13 0.489 -0.00447
## 
## $Var07
## # A tibble: 1 x 8
##   .rownames        ME  RMSE   MAE    MPE  MAPE  MASE   ACF1
##   <chr>         <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
## 1 Training set 0.0141 0.559 0.416 0.0245  1.14 0.482 0.0129

# MAPE
print(paste0("MAPE for Var05 is ", MLmetrics::MAPE(Var05_ets$fitted, ts6_Var05_clean)))
## [1] "MAPE for Var05 is 0.0113051402177851"
print(paste0("MAPE for Var07 is ", MLmetrics::MAPE(Var07_ets$fitted, ts6_Var07_clean)))
## [1] "MAPE for Var07 is 0.0113832455506511"

Output

We output our result and paste them back into the excel sheet.

dfS06_pred <- dfS06 %>%
        dplyr::filter(row <= 140) %>%
        dplyr::mutate(Var05 = Var05_ets %>%
                              forecast::forecast(h = 140) %>%
                              as.data.frame() %>%
                              dplyr::select(Var05 = `Point Forecast`) %>%
                              .$Var05,
                      Var07 = Var07_ets %>%
                              forecast::forecast(h = 140) %>%
                              as.data.frame() %>%
                              dplyr::select(Var07 = `Point Forecast`) %>%
                              .$Var07)
        
dfS06_output <- dfS06 %>%
        dplyr::filter(row > 140) %>%
        dplyr::bind_rows(., dfS06_pred) %>%
        dplyr::select(-weekday, -row) %>%
        arrange(SeriesInd)

dfS06_output %>% 
        tail %>% 
        kable() %>%
        kable_styling()

SeriesInd	Var05	Var07
2018-04-25	48.19599	48.01149
2018-04-26	48.19599	48.01149
2018-04-30	48.19599	48.01149
2018-05-01	48.19599	48.01149
2018-05-02	48.19599	48.01149
2018-05-03	48.19599	48.01149

data_624_project_1

Jimmy Ng

6/23/2020

Set up

Data Exploration (group S06)

Remove outlier(s), impute missing value

Model - ETS

Output