library(fpp3)
## Warning: package 'tsibble' was built under R version 4.3.2
library(forecast)
library(imputeTS)

Ask

This project consists of 3 parts - two required and one bonus and is worth 15% of your grade. The project is due at 11:59 PM on Sunday Apr 11. I will accept late submissions with a penalty until the meetup after that when we review some projects.

Part A – ATM Forecast

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file.

The variable Cash is provided in hundreds of dollars, other than that it is straight forward.

I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast.

I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file

Also please submit the forecast which you will put in an Excel readable file.

ATM

ATM <- readxl::read_excel("ATM624Data.xlsx",col_types = c("date","text", "numeric"))
head(ATM)
## # A tibble: 6 × 3
##   DATE                ATM    Cash
##   <dttm>              <chr> <dbl>
## 1 2009-05-01 00:00:00 ATM1     96
## 2 2009-05-01 00:00:00 ATM2    107
## 3 2009-05-02 00:00:00 ATM1     82
## 4 2009-05-02 00:00:00 ATM2     89
## 5 2009-05-03 00:00:00 ATM1     85
## 6 2009-05-03 00:00:00 ATM2     90
summary(ATM)
##       DATE                            ATM                 Cash        
##  Min.   :2009-05-01 00:00:00.00   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01 00:00:00.00   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01 00:00:00.00   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31 19:11:48.27                      Mean   :  155.6  
##  3rd Qu.:2010-02-01 00:00:00.00                      3rd Qu.:  114.0  
##  Max.   :2010-05-14 00:00:00.00                      Max.   :10919.8  
##                                                      NA's   :19
ATM$DATE = as.Date(ATM$DATE)

There are 19 NA cash values in this dataset. I subset the dataset further to inspect the NA rows and determine if it needs to be imputed or removed from the data set.

ATM[is.na(ATM$Cash), ]
## # A tibble: 19 × 3
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2009-06-13 ATM1     NA
##  2 2009-06-16 ATM1     NA
##  3 2009-06-18 ATM2     NA
##  4 2009-06-22 ATM1     NA
##  5 2009-06-24 ATM2     NA
##  6 2010-05-01 <NA>     NA
##  7 2010-05-02 <NA>     NA
##  8 2010-05-03 <NA>     NA
##  9 2010-05-04 <NA>     NA
## 10 2010-05-05 <NA>     NA
## 11 2010-05-06 <NA>     NA
## 12 2010-05-07 <NA>     NA
## 13 2010-05-08 <NA>     NA
## 14 2010-05-09 <NA>     NA
## 15 2010-05-10 <NA>     NA
## 16 2010-05-11 <NA>     NA
## 17 2010-05-12 <NA>     NA
## 18 2010-05-13 <NA>     NA
## 19 2010-05-14 <NA>     NA

For the ATMs that are missing Cash values on the given days, it is possible that there was no cash recieved. There are so few missing values that it could be removed from the dataset.

ATM[is.na(ATM$ATM), ]
## # A tibble: 14 × 3
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2010-05-01 <NA>     NA
##  2 2010-05-02 <NA>     NA
##  3 2010-05-03 <NA>     NA
##  4 2010-05-04 <NA>     NA
##  5 2010-05-05 <NA>     NA
##  6 2010-05-06 <NA>     NA
##  7 2010-05-07 <NA>     NA
##  8 2010-05-08 <NA>     NA
##  9 2010-05-09 <NA>     NA
## 10 2010-05-10 <NA>     NA
## 11 2010-05-11 <NA>     NA
## 12 2010-05-12 <NA>     NA
## 13 2010-05-13 <NA>     NA
## 14 2010-05-14 <NA>     NA

The ATMs that are NA could be due to data error. It would be best to remove them as they do not give any important information necessary for imputing.

The data is a tibble object and needs to be converted to a tsibble object in order to perform the time series analysis.

ATM_ts <- ATM |>
  mutate(DATE = ymd(DATE)) |>
  as_tsibble(key = c(ATM),
             index = DATE)

The time series plot shows the four different ATMs available in the dataset. The fourth ATM has a large increase in cash between January and April of 2010. The other ATMs all have relativly low Cash values.

ATM_ts |>
  autoplot(Cash) 
## Warning: Removed 14 rows containing missing values (`geom_line()`).

A facet wrap allows us to see the timeseries clearer. There also appears to be NA vale

ATM_ts |>
  autoplot(Cash) +
  facet_wrap(~ATM, ncol=1)
## Warning: Removed 14 rows containing missing values (`geom_line()`).

ATM 1

Taking a closer look at ATM1 and its time series plot, there is not obvious increase or decrease in the trend of the time series. There does appear to be a seasonlaity and the variation looks relatively stable.

ATM_ts |>
  filter(ATM == "ATM1") |>
  autoplot(Cash)

Filtering ATM1

Next, in order to apply models to this dataset, I filtered for ATM1 and made it its own time series object called ATM1.

ATM1<- ATM_ts |>
  filter(ATM == "ATM1") |>
   update_tsibble(key = c()) |>
  select(-(ATM))
ATM1 |>
  head()
## # A tsibble: 6 x 2 [1D]
##   DATE        Cash
##   <date>     <dbl>
## 1 2009-05-01    96
## 2 2009-05-02    82
## 3 2009-05-03    85
## 4 2009-05-04    90
## 5 2009-05-05    99
## 6 2009-05-06    88

Train ATM1 on data from before May 2010

By using filter index I am able to subset the time series to data before May 2010.

ATM1_train <- ATM_ts |>
  filter(ATM == "ATM1") |>
  filter_index("2009-05-01" ~ "2010-04-30") |>
  select(Cash)

NA values can impact forecasting, especially with forecast models. There are three NA values in the Cash column of the ATM1 time series.

sum(is.na(ATM1_train$Cash ) )
## [1] 3

I will use the median of the Cash variable to replace the missing values.

As a basic forecast, I applied SNAIVE, Mean, Naive and Drift models on the data.

ATM1_train$Cash[is.na(ATM1_train$Cash)] <- median(ATM1_train$Cash, na.rm = T)   

ATM1_fit <- ATM1_train |>
  model(
    Snaive = SNAIVE(Cash),
    Mean = MEAN(Cash),
    `Naïve` = NAIVE(Cash),
    Drift = NAIVE(Cash ~ drift()))

ATM1_fc <- ATM1_fit |>
  forecast(h=31)

ATM1_fc |>
  autoplot(ATM1_train) +
  autolayer(
    filter_index(ATM1, "2010-05-01" ~ .),
    colour = "black"
  )
## Plot variable not specified, automatically selected `.vars = Cash`

From all four basic models, the best model by RMSE is SNAIVE model.

ATM1_fit |>
  accuracy()
## # A tibble: 4 × 10
##   .model .type           ME  RMSE   MAE    MPE  MAPE  MASE RMSSE    ACF1
##   <chr>  <chr>        <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 Snaive Training -3.63e- 2  27.8  17.7  -96.5  116.  1     1     0.159 
## 2 Mean   Training -1.56e-15  36.5  27.1 -175.   198.  1.53  1.31  0.0595
## 3 Naïve  Training -3.02e- 2  50.1  37.5 -132.   167.  2.12  1.80 -0.365 
## 4 Drift  Training -3.84e-17  50.1  37.5 -132.   167.  2.12  1.80 -0.365

ETS

Next, I compared the models above to an ETS model. Using the ets function, it outputs the optimal ETS parameters and selects the best AICc.

ets(ATM1_train$Cash)
## ETS(M,N,N) 
## 
## Call:
##  ets(y = ATM1_train$Cash) 
## 
##   Smoothing parameters:
##     alpha = 0.0322 
## 
##   Initial states:
##     l = 79.126 
## 
##   sigma:  0.4351
## 
##      AIC     AICc      BIC 
## 4783.131 4783.198 4794.831
ets(ATM1_train$Cash) |>
  accuracy()
##                       ME     RMSE     MAE       MPE     MAPE      MASE
## Training set -0.07129336 36.63921 27.3546 -178.5067 202.2401 0.7286553
##                    ACF1
## Training set 0.03799582

There is no clear linear or cyclic trend in the data overall.

Box Cox Transformation

lambda <-ATM1_train |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)

fit <- ATM1_train |>
  model(
    arima = ARIMA(box_cox(Cash, lambda))) 

report(fit) |>
  accuracy()
## Series: Cash 
## Model: ARIMA(0,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda) 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.0994  -0.1141  -0.6456
## s.e.  0.0524   0.0527   0.0426
## 
## sigma^2 estimated as 1.576:  log likelihood=-589.77
## AIC=1187.54   AICc=1187.66   BIC=1203.07
## # A tibble: 1 × 10
##   .model .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 arima  Training  2.34  25.0  16.0 -87.9  107. 0.906 0.900 0.0116
fit |> select(arima) %>% gg_tsresiduals()

The residuals all fall within the boundary in the ACF plot and the residuals show a better distribution (closer to normal) than the ets function.

An Arima with a boxcox transformation had the lowest error compared to the other function. I will be using this to forecast results for the submission of this project.

ATM1_fc <- fit |>
  forecast(h=31)
data.frame(Date = ATM1_fc$DATE, Cash = ATM1_fc$.mean) |>
  write.csv("Project1_PartB_ATM1.csv")

ATM2

For ATM2 the periods are much shorter in this time series compared to the ATM1.

ATM2<- ATM_ts |>
  filter(ATM == "ATM2") |>
  update_tsibble(key = c()) |>
  select(-(ATM))

ATM2_train <- ATM_ts |>
  filter(ATM == "ATM2") |>
  filter_index("2009-05-01" ~ "2010-04-30") |>
  select(Cash)

ATM2_fit <- ATM2_train |>
  model(MEAN(Cash))

ATM2_fc <- ATM2_fit |>
  forecast(h=31)

ATM2_fc |>
  autoplot(ATM2_train) +
  autolayer(
    filter_index(ATM2, "2010-05-01" ~ .),
    colour = "black"
  )
## Plot variable not specified, automatically selected `.vars = Cash`

sum(is.na(ATM2$Cash ) )
## [1] 2

There are two missing values in the Cash variable in ATM2.

ATM2[!complete.cases(ATM2), ]
## # A tsibble: 2 x 2 [1D]
##   DATE        Cash
##   <date>     <dbl>
## 1 2009-06-18    NA
## 2 2009-06-24    NA
ATM2_train <- ATM_ts |>
  filter(ATM == "ATM2") |>
  filter_index("2009-05-01" ~ "2010-04-30") |>
  select(Cash)

ATM2_train$Cash[is.na(ATM2_train$Cash)] <- median(ATM2_train$Cash, na.rm = T)   

ATM2_fit <- ATM2_train |>
  model(
    Snaive = SNAIVE(Cash),
    Mean = MEAN(Cash),
    `Naïve` = NAIVE(Cash),
    Drift = NAIVE(Cash ~ drift()))

ATM2_fc <- ATM2_fit |>
  forecast(h=31)

ATM2_fc |>
  autoplot(ATM2_train) +
  autolayer(
    filter_index(ATM2, "2010-05-01" ~ .),
    colour = "black"
  )
## Plot variable not specified, automatically selected `.vars = Cash`

ATM2_fit |>
  accuracy()
## # A tibble: 4 × 10
##   .model .type           ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 Snaive Training  2.23e- 2  30.1  20.8  -Inf   Inf  1     1     0.0447
## 2 Mean   Training  2.92e-15  38.7  33.1  -Inf   Inf  1.59  1.29  0.0215
## 3 Naïve  Training -4.67e- 2  54.2  42.5  -Inf   Inf  2.04  1.80 -0.309 
## 4 Drift  Training  2.15e-15  54.2  42.5  -Inf   Inf  2.04  1.80 -0.309
ATM2_train |>
  features(Cash, unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1
ATM2_train |>
  gg_tsdisplay(difference(Cash, 30),
               plot_type='partial', lag=36) +
  labs(title="Seasonally differenced", y="")
## Warning: Removed 30 rows containing missing values (`geom_line()`).
## Warning: Removed 30 rows containing missing values (`geom_point()`).

The significant lag at 1 and a differencing of 1.

fit <- ATM2_train |>
  model(
    arima012011 = ARIMA(Cash ~ pdq(0,1,2) + PDQ(0,1,1)),
    arima210011 = ARIMA(Cash ~ pdq(2,1,0) + PDQ(0,1,1)),
    auto = ARIMA(Cash, stepwise = FALSE, approx = FALSE)
  )
fit |> pivot_longer(everything(), names_to = "Model name",
                     values_to = "Orders")
## # A mable: 3 x 2
## # Key:     Model name [3]
##   `Model name`                   Orders
##   <chr>                         <model>
## 1 arima012011  <ARIMA(0,1,2)(0,1,1)[7]>
## 2 arima210011  <ARIMA(2,1,0)(0,1,1)[7]>
## 3 auto         <ARIMA(2,0,2)(0,1,1)[7]>
glance(fit) |> arrange(AICc) |> select(.model:BIC)
## # A tibble: 3 × 6
##   .model      sigma2 log_lik   AIC  AICc   BIC
##   <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 auto          602.  -1653. 3319. 3319. 3342.
## 2 arima012011   639.  -1664. 3336. 3336. 3351.
## 3 arima210011   869.  -1715. 3438. 3439. 3454.
fit |> select(auto) |> gg_tsresiduals(lag=36)

forecast(fit, h=31) |>
  filter(.model=='auto') |>
  autoplot(ATM2_train) +
  labs(title = "ATM4 Withdrawals",
       y="Cash (hundreds)")

fit |>
  accuracy()
## # A tibble: 3 × 10
##   .model      .type         ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr>       <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 arima012011 Training -0.434   24.9  17.3  -Inf   Inf 0.829 0.826 -0.00407
## 2 arima210011 Training  0.0155  29.0  20.2   NaN   Inf 0.970 0.963 -0.129  
## 3 auto        Training -0.885   24.1  17.0  -Inf   Inf 0.819 0.801 -0.00512
lambda <-ATM2_train |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)

fit <- ATM2_train |>
  model(
    arima = ARIMA(box_cox(Cash, lambda))) 

report(fit) |>
  accuracy()
## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda) 
## 
## Coefficients:
##           ar1      ar2    ma1     ma2     sma1
##       -0.4247  -0.9074  0.468  0.8004  -0.7262
## s.e.   0.0587   0.0438  0.089  0.0584   0.0409
## 
## sigma^2 estimated as 69.93:  log likelihood=-1267.92
## AIC=2547.85   AICc=2548.09   BIC=2571.13
## # A tibble: 1 × 10
##   .model .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 arima  Training 0.243  24.4  17.1  -Inf   Inf 0.823 0.811 -0.0123
ATM2_fc <- fit |>
  forecast(h=31)

ATM2_fc |>
  autoplot(ATM2_train) +
  autolayer(
    filter_index(ATM2, "2010-05-01" ~ .),
    colour = "black"
  )
## Plot variable not specified, automatically selected `.vars = Cash`

ATM2_train |> gg_tsdisplay(difference(box_cox(Cash,lambda), 30),
                     plot_type='partial', lag_max = 36)
## Warning: Removed 30 rows containing missing values (`geom_line()`).
## Warning: Removed 30 rows containing missing values (`geom_point()`).

fit <- ATM2_train |>
  model(
    arima_box_cox = ARIMA(box_cox(Cash,lambda) ~ pdq(2,0,2) + PDQ(0,1,1)),
  )
fit |> pivot_longer(everything(), names_to = "Model name",
                     values_to = "Orders")
## # A mable: 1 x 2
## # Key:     Model name [1]
##   `Model name`                    Orders
##   <chr>                          <model>
## 1 arima_box_cox <ARIMA(2,0,2)(0,1,1)[7]>
glance(fit) |> arrange(AICc) |> select(.model:BIC)
## # A tibble: 1 × 6
##   .model        sigma2 log_lik   AIC  AICc   BIC
##   <chr>          <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 arima_box_cox   69.9  -1268. 2548. 2548. 2571.
fit |> select(arima_box_cox) |> gg_tsresiduals(lag=36)

report(fit) |>
  accuracy()
## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## Transformation: box_cox(Cash, lambda) 
## 
## Coefficients:
##           ar1      ar2    ma1     ma2     sma1
##       -0.4247  -0.9074  0.468  0.8004  -0.7262
## s.e.   0.0587   0.0438  0.089  0.0584   0.0409
## 
## sigma^2 estimated as 69.93:  log likelihood=-1267.92
## AIC=2547.85   AICc=2548.09   BIC=2571.13
## # A tibble: 1 × 10
##   .model        .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>         <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 arima_box_cox Training 0.243  24.4  17.1  -Inf   Inf 0.823 0.811 -0.0123
fit <- ATM2_train |>
  model(ARIMA(Cash, stepwise = FALSE, approx = FALSE)
  )
ATM2_csv <- forecast(fit, h=31)
data.frame(`Date` = ATM2_csv$DATE, Cash = ATM2_csv$.mean) |>
  write.csv("Project1_PartA_ATM2.csv")

ATM3

ATM_ts |>
  filter(ATM == "ATM3") |>
  head()
## # A tsibble: 6 x 3 [1D]
## # Key:       ATM [1]
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-05-01 ATM3      0
## 2 2009-05-02 ATM3      0
## 3 2009-05-03 ATM3      0
## 4 2009-05-04 ATM3      0
## 5 2009-05-05 ATM3      0
## 6 2009-05-06 ATM3      0

ATM3 has 365 values and of all of these values only 3 have Cash values. Every other day had 0 Cash withdrawn. Because of the lack of variation in the cash withdrawals, this time series would not provide accurate forecasts.

sum(is.na(ATM2$Cash ) )
## [1] 2
ATM3 <- ATM_ts |>
  filter(ATM == "ATM3") |>
  update_tsibble(key = c()) |>
  select(-(ATM))

ATM3_train <- ATM_ts |>
  filter(ATM == "ATM3") |>
  filter_index("2009-05-01" ~ "2010-04-30") |>
  select(Cash)


ATM3_fit <- ATM3_train |>
  model(MEAN(Cash))

ATM3_fc <- ATM3_fit |>
  forecast(h=31)

ATM3_fc |>
  autoplot(ATM3_train) +
  autolayer(
    filter_index(ATM3, "2010-05-01" ~ .),
    colour = "black"
  )
## Plot variable not specified, automatically selected `.vars = Cash`

ATM 4

ATM4 <- ATM_ts |>
  filter(ATM == "ATM4") |>
  update_tsibble(key = c()) |>
  select(-(ATM))
  
ATM4 |>
  head()
## # A tsibble: 6 x 2 [1D]
##   DATE        Cash
##   <date>     <dbl>
## 1 2009-05-01 777. 
## 2 2009-05-02 524. 
## 3 2009-05-03 793. 
## 4 2009-05-04 908. 
## 5 2009-05-05  52.8
## 6 2009-05-06  52.2
sum(is.na(ATM4$Cash ) )
## [1] 0

The fourth ATM has no NA values.

ATM4_train <- ATM_ts |>
  filter(ATM == "ATM4") |>
  filter_index("2009-05-01" ~ "2010-04-30") |>
  select(Cash)

ATM4_fit <- ATM4_train |>
  model(MEAN(Cash))

ATM4_fc <- ATM4_fit |>
  forecast(h=31)

ATM4_fc |>
  autoplot(ATM4_train) +
  autolayer(
    filter_index(ATM4, "2010-05-01" ~ .),
    colour = "black"
  )
## Plot variable not specified, automatically selected `.vars = Cash`

The Time series does have a significant outlier in the amount of cash withdrawn between January and April 2010. This outlier will affect the forecasts so to address this, I will impute the outliers.

By setting the robust parameter to true, STL decomposition can be more robust to outliers.

STL Decomposition

dcmp <- ATM4_train |>
  model(
    STL(Cash ~ trend(window = 7) +
                   season(window = "periodic"),
    robust = TRUE)) |>
  components() |>
  select(-.model)

dcmp |>
  model(NAIVE(season_adjust)) |>
  forecast() |>
  autoplot(dcmp) +
  labs(y = "Number of people",
       title = "US retail employment")

There is no obvious trend, but there is an obvious seasonality.

fit_dcmp <- ATM4_train |>
  model(stlf = decomposition_model(
    STL(Cash ~ trend(window = 7), robust = TRUE),
    NAIVE(season_adjust)
  ))

fit_dcmp |>
  forecast() |>
  autoplot(ATM4_train)+
  labs(y = "Cash (hundreds)",
       title = "ATM4 Withdrawal")

fit_dcmp |> gg_tsresiduals()
## Warning: Removed 7 rows containing missing values (`geom_line()`).
## Warning: Removed 7 rows containing missing values (`geom_point()`).
## Warning: Removed 7 rows containing non-finite values (`stat_bin()`).

fit_dcmp |>
  accuracy()
## # A tibble: 1 × 10
##   .model .type        ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 stlf   Training -0.490  907.  375. -320.  453. 0.934  1.01 -0.497

impute using the tsimpute function

ATM4_train2 <- imputeTS::na_interpolation(ATM4_train)
ATM4_fit2 <- ATM4_train2 |>
  model(
    Snaive = SNAIVE(Cash),
    Mean = MEAN(Cash),
    `Naïve` = NAIVE(Cash),
    Drift = NAIVE(Cash ~ drift()),
    auto = ARIMA(Cash))

ATM4_fc2 <- ATM4_fit2 |>
  forecast(h=31)

ATM4_fc2 |>
  autoplot(ATM4_train2) +
  autolayer(
    filter_index(ATM4, "2010-05-01" ~ .),
    colour = "black"
  )
## Plot variable not specified, automatically selected `.vars = Cash`

ATM4_fit2 |>
  accuracy()
## # A tibble: 5 × 10
##   .model .type           ME  RMSE   MAE   MPE  MAPE  MASE RMSSE     ACF1
##   <chr>  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 Snaive Training -3.70e+ 0  896.  402. -395.  450. 1     1     -0.0151 
## 2 Mean   Training  1.99e-14  650.  324. -617.  647. 0.805 0.725 -0.00936
## 3 Naïve  Training -8.10e- 1  925.  444. -555.  615. 1.10  1.03  -0.488  
## 4 Drift  Training -1.20e-15  925.  444. -554.  614. 1.10  1.03  -0.488  
## 5 auto   Training -1.51e-10  650.  324. -617.  647. 0.805 0.725 -0.00936

The mean and the auto arima function give the best forecast in terms of RMSE for the imputation. I will use the auto arima function for the forecast.

data.frame(`Date` = ATM4_fc2$DATE, Cash = ATM4_fc2$.mean) |>
  write.csv("Project1_PartA_ATM4.csv")

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward.

Add this to your existing files above.

residential_power <- readxl::read_excel("ResidentialCustomerForecastLoad-624.xlsx")
summary(residential_power)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
head(residential_power)
## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147
tail(residential_power)
## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          919 2013-Jul   8415321
## 2          920 2013-Aug   9080226
## 3          921 2013-Sep   7968220
## 4          922 2013-Oct   5759367
## 5          923 2013-Nov   5769083
## 6          924 2013-Dec   9606304

Making a tsibble obejct

rp_ts <- residential_power |>
  select(`YYYY-MMM`, KWH) |>
  mutate(YearMonth = yearmonth(`YYYY-MMM`)) |>
  as_tsibble(index = YearMonth) |>
  select(-`YYYY-MMM`)
head(rp_ts)
## # A tsibble: 6 x 2 [1M]
##       KWH YearMonth
##     <dbl>     <mth>
## 1 6862583  1998 Jan
## 2 5838198  1998 Feb
## 3 5420658  1998 Mar
## 4 5010364  1998 Apr
## 5 4665377  1998 May
## 6 6467147  1998 Jun
rp_train <- rp_ts |>
  filter_index("1998 Jan" ~ "2013 Oct")
rp_train |>
  head()
## # A tsibble: 6 x 2 [1M]
##       KWH YearMonth
##     <dbl>     <mth>
## 1 6862583  1998 Jan
## 2 5838198  1998 Feb
## 3 5420658  1998 Mar
## 4 5010364  1998 Apr
## 5 4665377  1998 May
## 6 6467147  1998 Jun

According to the summary, there is only one NA value in the dataset. According to the table below, the missing value is for KWH.

rp_train[is.na(rp_train$KWH), ]
## # A tsibble: 1 x 2 [1M]
##     KWH YearMonth
##   <dbl>     <mth>
## 1    NA  2008 Sep
rp_train |>
  ggplot(aes(KWH)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

Based on this graph, and the outlier at the beginning, I will use the median to impute the data.

rp_train <- na_interpolation(rp_train)
rp_train |>
  head()
## # A tsibble: 6 x 2 [1M]
##       KWH YearMonth
##     <dbl>     <mth>
## 1 6862583  1998 Jan
## 2 5838198  1998 Feb
## 3 5420658  1998 Mar
## 4 5010364  1998 Apr
## 5 4665377  1998 May
## 6 6467147  1998 Jun
rp_train |>
  autoplot()
## Plot variable not specified, automatically selected `.vars = KWH`

There is an upward trend in the data and a strong seasonality. There also is an outlier after January 2010 that may impact the data.

dcmp <- rp_train |>
  model(stl = STL(KWH ))

components(dcmp) |>
  as_tsibble() |>
  autoplot(KWH, colour="gray") +
  geom_line(aes(y=trend), colour = "#D55E00") +
  labs(
    y = "KWH",
    title = ""
  )

components(dcmp) |> autoplot()

rp_fit <- rp_train |>
  model(
    Snaive = SNAIVE(KWH),
    Mean = MEAN(KWH),
    `Naïve` = NAIVE(KWH),
    Drift = NAIVE(KWH ~ drift()))

rp_fc <- rp_fit |>
  forecast(h=12)

rp_fc |>
  autoplot(rp_train) +
  autolayer(
    filter_index(rp_ts, "2013 Dec" ~ .),
    colour = "black"
  )
## Plot variable not specified, automatically selected `.vars = KWH`
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

rp_fit |>
  accuracy()
## # A tibble: 4 × 10
##   .model .type           ME     RMSE      MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>  <chr>        <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 Snaive Training  7.83e+ 4 1163215.  687493. -4.16  14.6  1     1    0.240 
## 2 Mean   Training -4.12e-10 1428917. 1175505. -7.79  22.0  1.71  1.23 0.435 
## 3 Naïve  Training -5.84e+ 3 1522098. 1188389. -5.94  21.9  1.73  1.31 0.0279
## 4 Drift  Training -4.24e-10 1522087. 1187988. -5.84  21.9  1.73  1.31 0.0279

Of the four models, the best model based on the RMSE is the SNAIVE model.

rp_train |>
  gg_tsdisplay(KWH,
               plot_type='partial', lag=36) +
  labs(title="Seasonally differenced", y="")

rp_train |>
  features(KWH, unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1
rp_train |>
  gg_tsdisplay(difference(KWH,12),
               plot_type='partial', lag=36) +
  labs(title="Seasonally differenced", y="")
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).

fit <- rp_train |>
  model(
    arima012011 = ARIMA(KWH ~ pdq(0,1,1) + PDQ(0,1,1)),
    arima210011 = ARIMA(KWH ~ pdq(1,1,0) + PDQ(0,1,1)),
    auto = ARIMA(KWH, stepwise = FALSE, approx = FALSE)
  )
fit |> pivot_longer(everything(), names_to = "Model name",
                     values_to = "Orders")
## # A mable: 3 x 2
## # Key:     Model name [3]
##   `Model name`                             Orders
##   <chr>                                   <model>
## 1 arima012011           <ARIMA(0,1,1)(0,1,1)[12]>
## 2 arima210011           <ARIMA(1,1,0)(0,1,1)[12]>
## 3 auto         <ARIMA(1,0,0)(0,1,1)[12] w/ drift>
glance(fit) |> arrange(AICc) |> select(.model:BIC)
## # A tibble: 3 × 6
##   .model             sigma2 log_lik   AIC  AICc   BIC
##   <chr>               <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 arima012011 765139451910.  -2680. 5365. 5366. 5375.
## 2 auto        707997072666.  -2687. 5381. 5382. 5394.
## 3 arima210011 930636243374.  -2698. 5401. 5401. 5411.
lambda <- rp_train |>
  features(KWH, features = guerrero) |>
  pull(lambda_guerrero)

fit <- rp_train |>
  model(
    arima = ARIMA(box_cox(KWH, lambda))) 

report(fit) |>
  accuracy()
## Series: KWH 
## Model: ARIMA(0,0,1)(0,0,2)[12] w/ mean 
## Transformation: box_cox(KWH, lambda) 
## 
## Coefficients:
##          ma1    sma1    sma2  constant
##       0.1378  0.1367  0.1227    2.4179
## s.e.  0.0726  0.0729  0.0662    0.0001
## 
## sigma^2 estimated as 2.321e-07:  log likelihood=1183.38
## AIC=-2356.77   AICc=-2356.44   BIC=-2340.53
## # A tibble: 1 × 10
##   .model .type         ME     RMSE     MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>      <dbl>    <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima  Training 259048. 1294645. 965398. -2.93  17.9  1.40  1.11 0.297

From the above analysis, I will be using the SNAIVE model to predict.

rp_fit <- rp_train |>
  model(
    Snaive = SNAIVE(KWH))

rp_fc <- rp_fit |>
  forecast(h=14)

rp_fc |>
  autoplot(rp_train) +
  autolayer(
    filter_index(rp_ts, "2013 Dec" ~ .),
    colour = "black"
  )
## Plot variable not specified, automatically selected `.vars = KWH`
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

data.frame(`YYYY-MMM` = rp_fc$YearMonth, KWH = rp_fc$.mean) |>
  write.csv("Project1_PartB.csv")

Part C

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.