Midterm project for the DATA624 course in CUNY’s MSDS program

Part A - ATM Forecast

#1 Data preparation

First I load in the provided data and take a look at the first few rows. I see that we have a date column, which I recreate in a standardized format that will be recognized as a date by a tsibble later.

atm_raw <- read_csv("atm_raw.csv")
head(atm_raw)
## # A tibble: 6 x 3
##   DATE                 ATM    Cash
##   <chr>                <chr> <dbl>
## 1 5/1/2009 12:00:00 AM ATM1     96
## 2 5/1/2009 12:00:00 AM ATM2    107
## 3 5/2/2009 12:00:00 AM ATM1     82
## 4 5/2/2009 12:00:00 AM ATM2     89
## 5 5/3/2009 12:00:00 AM ATM1     85
## 6 5/3/2009 12:00:00 AM ATM2     90
#start cleaned dataset
atm <- atm_raw

#set to date
atm$DATE <- as.Date(atm$DATE, format = '%m/%d/%Y 12:00:00 AM')

At first glance I see there is some work to do. There are 14 cases with no value for ATM; these can be removed, as they are associated with May dates (which we’ll be predicting later) and have no Cash value. I also see that ATM1 and ATM2 have a couple of missing values, and ATM3 appears to have a strange distribution, with nearly all of its percentiles equal to zero.

atm %>%
  group_by(ATM) %>%
  skim()
Data summary
Name                    Piped data
Number of rows          1474
Number of columns       3
Column type frequency:  Date (1), numeric (1)
Group variables         ATM

Variable type: Date

skim_variable  ATM   n_missing  complete_rate  min         max         median      n_unique
DATE           ATM1  0          1              2009-05-01  2010-04-30  2009-10-30  365
DATE           ATM2  0          1              2009-05-01  2010-04-30  2009-10-30  365
DATE           ATM3  0          1              2009-05-01  2010-04-30  2009-10-30  365
DATE           ATM4  0          1              2009-05-01  2010-04-30  2009-10-30  365
DATE           NA    0          1              2010-05-01  2010-05-14  2010-05-07  14

Variable type: numeric

skim_variable  ATM   n_missing  complete_rate  mean    sd      p0  p25    p50  p75  p100   hist
Cash           ATM1  3          0.99           83.89   36.66   1   73.0   91   108  180    ▂▁▇▃▁
Cash           ATM2  2          0.99           62.58   38.90   0   25.5   67   93   147    ▇▅▇▇▂
Cash           ATM3  0          1.00           0.72    7.94    0   0.0    0    0    96     ▇▁▁▁▁
Cash           ATM4  0          1.00           474.01  650.95  2   124.0  404  705  10920  ▇▁▁▁▁
Cash           NA    14         0.00           NaN     NA      NA  NA     NA   NA   NA     NA

I drop the cases where the ATM variable is NA, as these also had empty Cash values. This will not create a continuity issue in our time series, as these rows are the May dates at the end of the series that will be predicted later.

#remove cases with ATM=NA
atm <- atm %>%
  drop_na(ATM)

Let’s take a closer look at ATM3’s Cash values. There are 362 days with zero Cash withdrawn, and one day each with values of 82, 85, and 96 (that is, $8,200, $8,500, and $9,600, with Cash recorded in hundreds of dollars).

atm %>%
  filter(ATM == "ATM3") %>%
  group_by(Cash) %>%
  summarize(n = n())
## # A tibble: 4 x 2
##    Cash     n
##   <dbl> <int>
## 1     0   362
## 2    82     1
## 3    85     1
## 4    96     1

Looking at the only cases where ATM3 has Cash greater than zero, I see that these are the last 3 days of the time series. This suggests the machine was previously broken or was only installed around 4/28/2010, which would explain the lack of data for the majority of the series. As a forecast based on only 3 data points would not be wise, I’ll disregard ATM3 moving forward and relay to my boss that we’ll need more data before we can forecast for ATM3.

atm %>%
  filter(ATM == "ATM3",
         Cash > 0)
## # A tibble: 3 x 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2010-04-28 ATM3     96
## 2 2010-04-29 ATM3     82
## 3 2010-04-30 ATM3     85

Finally, I’ll split each ATM into its own dataframe.

#ensure Cash is a plain numeric vector
atm$Cash <- as.numeric(atm$Cash)


atm1 <- atm %>%
  filter(ATM == "ATM1")

atm2 <- atm %>%
  filter(ATM == "ATM2")

atm4 <- atm %>%
  filter(ATM == "ATM4")

Imputation

Next I’ll see how the imputeTS package would impute the handful of NA values using its interpolation method. This is a fairly basic approach: it fills each missing value along the straight line between the previous and next observed values. If I had more missing values, or missing values bunched together, I might consider more robust methods, but this will suffice for this situation.
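To make this concrete, here is a minimal sketch on a toy vector (assuming imputeTS’s default linear option): a single gap is filled with the midpoint of its neighbors, and a longer gap is filled along the straight line between them.

library(imputeTS)

#one NA between 10 and 20, then two NAs between 20 and 50
x <- c(10, NA, 20, NA, NA, 50)
na_interpolation(x)
## [1] 10 15 20 30 40 50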

ATM1

The red values it suggests appear reasonable to my eye, and considering they are a very small proportion of the full dataset (0.8%), this is acceptable in order to achieve a complete time series.

#store imputation
imputation <- na_interpolation(atm1$Cash)

#plot imputation
ggplot_na_imputations(atm1$Cash, imputation) +
  my_plot_theme

#incorporate imputation
atm1$Cash <- imputation

ATM2

And again for ATM2, the interpolation method gives reasonable values.

#store imputation
imputation <- na_interpolation(atm2$Cash)

#plot imputation
ggplot_na_imputations(atm2$Cash, imputation) +
  my_plot_theme

#incorporate imputation
atm2$Cash <- imputation

#2 Visualize the data

I plot each ATM as a boxplot below to get a better look at the distributions. ATM1 and ATM2 seem the most normally distributed, though there are quite a few outliers on the low end in ATM1. Further, ATM1 has a higher median than ATM2, which might suggest it’s in a wealthier area. Finally, ATM4 appears to have a very small range of values; however, taking note of the scale, this is an optics issue caused by that extreme outlier of 10,920 (over $1,000,000, with Cash in hundreds). I’ve stumbled on another data cleaning issue.

atm1_box <- ggplot(atm1, aes(x = Cash)) +
                     geom_boxplot()

atm2_box <- ggplot(atm2, aes(x = Cash)) +
                     geom_boxplot()

atm4_box <- ggplot(atm4, aes(x = Cash)) +
                     geom_boxplot()
  
  
ggarrange(atm1_box, atm2_box, atm4_box,
  labels = c("ATM1", "ATM2", "ATM4"),
  ncol = 1) +
  my_plot_theme

Check for outliers

A nice function from the forecast package lets us identify which cases are outliers. The function tsoutliers() has thorough documentation; essentially, it is better suited to detecting outliers in a time series than standard methods, as it decomposes the series and accounts for seasonality and trend before flagging outliers. I’ll now convert each ATM dataframe into a tsibble object so I can use this functionality moving forward.
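As a quick illustration of the behavior, here is a minimal sketch on a synthetic weekly-seasonal series with one injected spike (hypothetical data; the exact replacement value depends on the fit of the underlying smoother):

library(forecast)

#20 weeks of a weekly pattern plus noise, with one artificial spike
set.seed(624)
x <- rep(c(80, 90, 100, 110, 100, 60, 40), 20) + rnorm(140, sd = 5)
x[70] <- 500

#tsoutliers() should flag index 70 and suggest a fitted replacement
tsoutliers(ts(x, frequency = 7))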

ATM1

ATM1 has over 30 outliers, which is consistent with the boxplot above. We also see the recommended replacement values. While these are statistically identified as outliers, there are so many that it would be unwise to remove and/or replace them. It’s good to know these exist, but I won’t fiddle with them.

#make tsibble version that works with tsoutliers
atm1 <- as.data.frame(atm1)
atm1_ts <- atm1 %>%
  select(-DATE) %>%
  mutate(Date = as.Date("2009-05-01") + 0:364) %>%
  as_tsibble(index = Date, key = Cash)

#check for outliers
tsoutliers(atm1_ts$Cash)
## $index
##  [1]  45  46  47  48  49  50  51  52  53  54  55  56  62  63  64  65  66  67  68
## [20]  69  70  71  72  73  74 352 353 354 355 356 357 358 359 360 361 362 363 364
## [39] 365
## 
## $replacements
##  [1]  20.46154  21.92308  23.38462  24.84615  26.30769  27.76923  29.23077
##  [8]  30.69231  32.15385  33.61538  35.07692  36.53846  51.28571  52.57143
## [15]  53.85714  55.14286  56.42857  57.71429  59.00000  60.28571  61.57143
## [22]  62.85714  64.14286  65.42857  66.71429 138.00000 138.00000 138.00000
## [29] 138.00000 138.00000 138.00000 138.00000 138.00000 138.00000 138.00000
## [36] 138.00000 138.00000 138.00000 138.00000
#make tsibble version that will work with autoplot later
atm1_ts <- as.data.frame(atm1_ts)

atm1_ts <- tsibble(
  Date = as.Date("2009-05-01") + 0:364,
  Cash = atm1$Cash)

ATM2

ATM2 has far fewer outliers, only 5. These didn’t show up on the boxplot above, likely because tsoutliers() flags outliers with respect to the seasonality and/or trend, whereas the boxplot only looks at the raw distribution. While these values are certainly high, I don’t believe they are in error, so I won’t remove or replace them either.

#make tsibble version that works with tsoutliers
atm2 <- as.data.frame(atm2)
atm2_ts <- atm2 %>%
  select(-DATE) %>%
  mutate(Date = as.Date("2009-05-01") + 0:364) %>%
  as_tsibble(index = Date, key = Cash)

#check for outliers
tsoutliers(atm2_ts$Cash)
## $index
## [1] 361 362 363 364 365
## 
## $replacements
## [1] 136 136 136 136 136
#make tsibble version that will work with autoplot later
atm2_ts <- as.data.frame(atm2_ts)

atm2_ts <- tsibble(
  Date = as.Date("2009-05-01") + 0:364,
  Cash = atm2$Cash)

ATM4

ATM4 has the highest number of outliers we’ve seen, and it’s unclear from this readout which is the extreme value seen in the boxplot.

#make tsibble version that works with tsoutliers
atm4 <- as.data.frame(atm4)
atm4_ts <- atm4 %>%
  select(-DATE) %>%
  mutate(Date = as.Date("2009-05-01") + 0:364) %>%
  as_tsibble(index = Date, key = Cash)

#check for outliers
tsoutliers(atm4_ts$Cash)
## $index
##  [1] 300 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330
## [20] 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349
## [39] 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365
## 
## $replacements
##  [1] 804 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847
## [20] 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847
## [39] 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847 847

I filter for values where Cash is greater than 10,000 and find the extreme value on 2010-02-09. As this value is so extreme, roughly 27 times the median, I’ll replace it with the median. I’m assuming this must be an error, or maybe a day the ATM was robbed! If I were in a setting with coworkers I’d certainly ask my team if anyone knew what may have happened that day; I’d imagine a single ATM doesn’t even contain that much cash at any given time, and if I were in the industry I’d hopefully know this maximum amount.

#make tsibble version that will work with autoplot later
atm4_ts <- as.data.frame(atm4_ts)

atm4_ts <- tsibble(
  Date = as.Date("2009-05-01") + 0:364,
  Cash = atm4$Cash)


#finder outlier
atm4_ts %>%
  filter(Cash > 10000)
## # A tsibble: 1 x 2 [1D]
##   Date        Cash
##   <date>     <dbl>
## 1 2010-02-09 10920
#set outlier to NA for now
atm4_ts$Cash[atm4_ts$Cash == 10920] <- NA

#calculate median w/o extreme outlier
atm4_median <- median(atm4_ts$Cash, na.rm = TRUE)

#put median in for outlier/NA value (== NA never matches, so use is.na())
atm4_ts$Cash[is.na(atm4_ts$Cash)] <- atm4_median

I revisit the boxplot and see a few outliers on the high end, but nothing as extreme as I saw earlier.

ggplot(atm4_ts, aes(x = Cash)) +
  geom_boxplot() +
  my_plot_theme

Consider Transformations

Now I can proceed looking for trend, seasonality, and/or the need for Box-Cox transformation on each of the ATMs’ time series data.

ATM1

A plot of the cleaned ATM1 data shows some changes in variance and certainly seasonality.

autoplot(atm1_ts, Cash)

In order to better stabilize the variance seen in the plot above, I’ll perform a Box-Cox transformation. In this case lambda = 0.35. The plot still has plenty of seasonality, but the variance is more stable.
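As a reminder of what the transformation computes, here is a minimal sketch (assuming fabletools’ box_cox(), which for positive values reduces to the textbook formula (y^lambda - 1) / lambda, with log(y) at lambda = 0):

library(fabletools)

y <- c(50, 100, 150)
lam <- 0.35

#manual Box-Cox for positive y
(y^lam - 1) / lam

#should match the package implementation
box_cox(y, lam)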

#calculate lambda
lambda1 <- atm1_ts %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)


#plug into autoplot
ggplotly(atm1_ts %>%
    autoplot(box_cox(Cash, lambda1)) +
     labs(y = "",
       title = "Transformed ATM1 Cash with lambda = 0.35"))
#apply transformation
atm1_tf <-
    atm1_ts %>% 
    mutate(Cash = box_cox(Cash, lambda1))

Now I’ll examine the transformed series and its ACF. I clearly see spikes at lags 7, 14, and 21 on the ACF plot, which makes sense, as withdrawals are likely seasonal by day of the week. I’ll definitely need a model that incorporates this seasonality.

#display transformed series with ACF/PACF
ggtsdisplay(atm1_tf$Cash, main="ATM1 after Box-Cox")

ATM2

A plot of the cleaned ATM2 data shows some changes in variance and certainly seasonality - similar to ATM1.

autoplot(atm2_ts, Cash)

In order to better stabilize the variance seen in the plot above, I’ll perform a Box-Cox transformation. In this case lambda = 0.68. The plot still has plenty of seasonality, but the variance is more stable.

#calculate lambda
lambda2 <- atm2_ts %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)


#plug into autoplot
ggplotly(atm2_ts %>%
    autoplot(box_cox(Cash, lambda2)) +
     labs(y = "",
       title = "Transformed atm4 Cash with lambda = 0.68"))
#apply transformation
atm2_tf <-
    atm2_ts %>% 
    mutate(Cash = box_cox(Cash, lambda2))

Now I’ll examine the transformed series and its ACF. Again I see spikes at multiples of 7, but also more prominent ones at lags 2, 5, and 9 that will have to be checked on.

#display transformed series with ACF/PACF
ggtsdisplay(atm2_tf$Cash, main="ATM2 after Box-Cox")

ATM4

A plot of the cleaned ATM4 data shows seasonality with some of the most variance we’ve seen so far.

autoplot(atm4_ts, Cash)

In order to better stabilize the variance seen in the plot above, I’ll perform a Box-Cox transformation. In this case lambda = 0.40. The plot still has plenty of seasonality, but the variance is more stable.

#calculate lambda
lambda4 <- atm4_ts %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)


#plug into autoplot
ggplotly(atm4_ts %>%
    autoplot(box_cox(Cash, lambda4)) +
     labs(y = "",
       title = "Transformed ATM4 Cash with lambda = 0.40"))
#apply transformation
atm4_tf <-
    atm4_ts %>% 
    mutate(Cash = box_cox(Cash, lambda4))

Now I’ll examine the transformed series and its ACF. I see spikes primarily at multiples of 7, not at the other lags seen in ATM2.

#display transformed series with ACF/PACF
ggtsdisplay(atm4_tf$Cash, main="ATM4 after Box-Cox")

#3 Choose your model

It’s clear all of our ATMs will need models that incorporate their day-of-week seasonality. I’ll try the simplest option, the Seasonal Naive Method, first on each; this can also serve as a baseline model to compare the others to. Next, I’ll try an ETS model, as ETS can be used on non-stationary data and operates on weighted averages, favoring the more recent data points. Holt-Winters seasonal smoothing will be most appropriate, as it considers both trend and seasonality. Finally, I’ll see what the auto-selected ARIMA model can come up with.

After evaluating the performance of each of these, for each ATM, I’ll choose my model before producing forecasts.
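As a side note, fable can fit all of these candidates in a single model() call, which makes the later comparison straightforward; a sketch, assuming the atm1_train tsibble defined in the next section:

#fit all three candidate models at once
candidate_fits <- atm1_train %>%
  model(
    snaive = SNAIVE(Cash ~ lag(7)),
    ets    = ETS(Cash ~ error("A") + trend("A") + season("M")),
    arima  = ARIMA(Cash)
  )

#forecast the holdout month and score every model in one table
candidate_fits %>%
  forecast(h = 30) %>%
  accuracy(atm1_ts)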

#4 Train the model

ATM1

#shave off April 2010 to make our training dataset
atm1_train <- atm1_ts %>%
  filter(Date < as.Date('2010-04-01'))

Seasonal Naive Model

To start, this Seasonal Naive Model (in blue) appears not to capture the full range of April 2010, and is sometimes high when the seasonality is low in the real data. It doesn’t look like a great fit from the plot alone, but it isn’t terrible.

#save the model
atm1_fit_snaive <- atm1_train %>%
  model(SNAIVE(Cash ~ lag(7)))

#create forecast
atm1_forecast_snaive <- atm1_fit_snaive %>%
  forecast(new_data = anti_join(atm1_ts, atm1_train))

#plot forecast against real April data
atm1_forecast_snaive %>% autoplot(atm1_ts) +
  labs(title = "ATM1 - Seasonal Naive Model",
       y = "USD in Hundreds")

After saving the model fit and checking the residuals, I see the above model is not capturing the 7-day (day-of-week) seasonality.

#check residuals
atm1_fit_snaive %>% 
  gg_tsresiduals()

Finally, looking at the accuracy, we see an RMSE of 16.04 and an MAE of 13.53. Our book’s authors recommend using RMSE and/or MAE to compare forecast methods, so I’ll compare future ATM1 scores against this baseline Seasonal Naive Model, remembering that the lower these values, the better the fit.

#calculate accuracy
atm1_forecast_snaive %>% 
  accuracy(atm1_ts)
## # A tibble: 1 x 10
##   .model                .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>                 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 SNAIVE(Cash ~ lag(7)) Test  -7.07  16.0  13.5 -66.2  73.4 0.732 0.558 0.0437
#store in df
models_atm1 <- data.frame(model = "seasonal naive",
                                RMSE = "16.03746",
                                MAE = "13.53333")

ETS Model

Next I try the Holt-Winters seasonal exponential smoothing model. I chose the multiplicative method, as the seasonal variations change in proportion to the level of the series; the seasonality isn’t completely steady. As this data doesn’t show any major trend, I don’t consider the damped method.

Visually, this appears to fit a lot better, particularly the pattern of when to go up or down, though the magnitude at times is a little off.

#save the model
atm1_fit_etsAAM <- atm1_train %>%
  model(ETS(Cash ~ error("A") +
                       trend("A") + season("M")))

#create forecast
atm1_forecast_etsAAM <- atm1_fit_etsAAM %>%
  forecast(new_data = anti_join(atm1_ts, atm1_train))

#plot forecast against real April data
atm1_forecast_etsAAM %>% autoplot(atm1_ts) +
  labs(title = "ATM1 - ETS AAM",
       y = "USD in Hundreds")

The residual plots look much better than the Seasonal Naive Model’s; though we still have two values outside the ideal range on the ACF plot, it isn’t by much.

#check residuals
atm1_fit_etsAAM %>% 
  gg_tsresiduals()

I find the RMSE and MAE of this model and add them to my dataframe comparing all of the models I’m trying for ATM1.

#calculate accuracy
atm1_forecast_etsAAM %>% 
  accuracy(atm1_ts)
## # A tibble: 1 x 10
##   .model                  .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>                   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 "ETS(Cash ~ error(\"A\~ Test  -5.16  11.4  8.92 -50.6  54.8 0.482 0.397 -0.224
#add to model list
models_atm1[nrow(models_atm1) +1,] <- c("holt-winters seasonal, mult", "11.16975", "8.571197")

ARIMA Model

Last, I’ll build an ARIMA model. This model type requires the data to be stationary. I look at the plotted data and do not see signs of non-stationarity; there isn’t too much drift in the series across time.

atm1_train %>%
  autoplot()

Looking at the ACF plot I don’t see the characteristic high R1 value associated with non-stationary data, only the significant lags every 7 days.

#acf plot only
atm1_train %>%
  ACF(Cash) %>%
  autoplot()

Further, the ndiffs() function suggests no differencing for this time series. Since the p-value on the KPSS test is greater than 0.05, we fail to reject the null hypothesis of stationarity, so the series is ready for ARIMA modeling.

ndiffs(atm1_train$Cash)
## [1] 0
unitroot_kpss(atm1_train$Cash)
##   kpss_stat kpss_pvalue 
##   0.3737313   0.0884779
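For reference, the same checks are also available as feasts features computed on the tsibble itself; a sketch of the equivalent call:

#unitroot_ndiffs and unitroot_kpss as tsibble features
atm1_train %>%
  features(Cash, list(unitroot_ndiffs, unitroot_kpss))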

Allowing the algorithm to select the best ARIMA model, it chooses an ARIMA(0,0,1)(0,1,2)[7]. While the AIC is displayed, we don’t have AIC values for the other models, so I’ll still need to find the RMSE and MAE.

#use autoselect ARIMA model
atm1_fit_ARIMA <- atm1_train %>%
  model(ARIMA(Cash))

report(atm1_fit_ARIMA)
## Series: Cash 
## Model: ARIMA(0,0,1)(0,1,2)[7] 
## 
## Coefficients:
##          ma1     sma1     sma2
##       0.2003  -0.5834  -0.1090
## s.e.  0.0578   0.0532   0.0531
## 
## sigma^2 estimated as 597.7:  log likelihood=-1514.45
## AIC=3036.9   AICc=3037.02   BIC=3052.07

Below I see the April 2010 forecast from the selected ARIMA model against the actual April 2010 data. It looks to fit about as well as the ETS model, but it’s hard to tell which is better by eye.

#create forecast
atm1_forecast_ARIMA <- atm1_fit_ARIMA %>%
  forecast(new_data = anti_join(atm1_ts, atm1_train))

#plot forecast against real April data
atm1_forecast_ARIMA %>% autoplot(atm1_ts) +
  labs(title = "ATM1 - ARIMA",
       y = "USD in Hundreds")

Checking the residuals plot it appears there is no pattern and it’s simply white noise, which is good.

atm1_fit_ARIMA %>%
  gg_tsresiduals()

Finding the accuracy of the forecast, I see an RMSE of 12.38991 and an MAE of 9.706059.

#calculate accuracy
atm1_forecast_ARIMA %>% 
  accuracy(atm1_ts)
## # A tibble: 1 x 10
##   .model      .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>       <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ARIMA(Cash) Test  -6.90  12.4  9.71 -83.3  86.4 0.525 0.431 -0.287
#add to model list
models_atm1[nrow(models_atm1) +1,] <- c("arima", "12.38991", "9.706059")

ATM2

#shave off April 2010 to make our training dataset
atm2_train <- atm2_ts %>%
  filter(Date < as.Date('2010-04-01'))

Seasonal Naive Model

Again I’ll start with a Seasonal Naive Model. Visually, it appears to have predicted a few more oscillations than we see in the actual April 2010 data, so again this may not be the best fit.

#save the model
atm2_fit_snaive <- atm2_train %>%
  model(SNAIVE(Cash ~ lag(7)))

#create forecast
atm2_forecast_snaive <- atm2_fit_snaive %>%
  forecast(new_data = anti_join(atm2_ts, atm2_train))

#plot forecast against real April data
atm2_forecast_snaive %>% autoplot(atm2_ts) +
  labs(title = "ATM2 - Seasonal Naive Model",
       y = "USD in Hundreds")

The residuals show a huge spike at lag 7 on the ACF plot, a further indication that we are missing critical seasonality for ATM2.

#check residuals
atm2_fit_snaive %>% 
  gg_tsresiduals()

I’ll calculate and store the accuracy of this model, but I find it likely the ETS and ARIMA will outperform it easily.

#calculate accuracy
atm2_forecast_snaive %>% 
  accuracy(atm2_ts)
## # A tibble: 1 x 10
##   .model                .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>                 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 SNAIVE(Cash ~ lag(7)) Test   12.9  26.2  17.8  35.1  45.1 0.835 0.849 -0.319
#store in df
models_atm2 <- data.frame(model = "seasonal naive",
                                RMSE = "26.16231",
                                MAE = "17.8")

ETS Model

Next I’ll try the Holt-Winters seasonal exponential smoothing model. I’ll choose the additive method for ATM2, as the seasonal variations do not seem to change much across the time series. From the plot, the pattern appears correct, but it seems to dip a bit earlier each week than the actual April 2010 data. This might not be the best fit for ATM2 either.

#save the model
atm2_fit_etsAAA <- atm2_train %>%
  model(ETS(Cash ~ error("A") +
                       trend("A") + season("A")))

#create forecast
atm2_forecast_etsAAA <- atm2_fit_etsAAA %>%
  forecast(new_data = anti_join(atm2_ts, atm2_train))

#plot forecast against real April data
atm2_forecast_etsAAA %>% autoplot(atm2_ts) +
  labs(title = "ATM2 - ETS AAA",
       y = "USD in Hundreds")

This residual plot certainly looks better than the Seasonal Naive’s, but we still have a few lags slightly further out of bounds than is ideal.

#check residuals
atm2_fit_etsAAA %>% 
  gg_tsresiduals()

The RMSE and MAE values are lower than the Seasonal Naive, but I’m hopeful the ARIMA will do even better in this case.

#calculate accuracy
atm2_forecast_etsAAA %>% 
  accuracy(atm2_ts)
## # A tibble: 1 x 10
##   .model                 .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>                  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 "ETS(Cash ~ error(\"A~ Test   11.7  20.6  15.2 -2.03  41.5 0.716 0.669 -0.0785
#add to model list
models_atm2[nrow(models_atm2) +1,] <- c("holt-winters seasonal, add", "20.62269", "15.24716")

ARIMA Model

Last we’ll try an ARIMA model, but first we check whether our data is stationary. It looks even more stationary than ATM1 did, and that one didn’t require any differencing.

atm2_train %>%
  autoplot()

My eyes must be off; it appears 1 difference is appropriate.

ndiffs(atm2_train$Cash)
## [1] 1

I make a new tsibble of the training data and apply the differencing. I check whether any further differencing is needed, and receive a 0.

atm2_train_diff <- atm2_train %>%
  mutate(Cash = difference(Cash))

#check again
ndiffs(atm2_train_diff$Cash)
## [1] 0

And with a p-value of 0.1 on the KPSS test, the differenced data is considered stationary, as we fail to reject the null hypothesis of stationarity.

unitroot_kpss(atm2_train_diff$Cash)
##   kpss_stat kpss_pvalue 
##  0.01813872  0.10000000

Handing ARIMA() the undifferenced ATM2 training data and letting it choose its own differencing, it selects an ARIMA(2,0,2)(0,1,1)[7].

#use autoselect ARIMA model
atm2_fit_ARIMA <- atm2_train %>%
   model(ARIMA(Cash))

report(atm2_fit_ARIMA)
## Series: Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2    sma1
##       -0.4322  -0.9259  0.4799  0.8085  -0.779
## s.e.   0.0484   0.0399  0.0805  0.0537   0.042
## 
## sigma^2 estimated as 626.2:  log likelihood=-1521.62
## AIC=3055.25   AICc=3055.51   BIC=3078.01

A visual inspection of the forecast looks pretty good, though it seems not to have as severe drops each week as the actual data.

#create forecast
atm2_forecast_ARIMA <- atm2_fit_ARIMA %>%
  forecast(h = 30)

#plot forecast against real April data
atm2_forecast_ARIMA %>% autoplot(atm2_ts) +
  labs(title = "ATM2 - ARIMA",
       y = "USD in Hundreds")

Our residuals plot shows no lags out of bounds on the ACF plot and a normally distributed histogram of residuals, suggesting the remaining variation is just white noise.

atm2_fit_ARIMA %>%
  gg_tsresiduals()

Again, we need measures of fit comparable to the Seasonal Naive and ETS models, so I calculate and store them below.

#calculate accuracy
atm2_forecast_ARIMA %>% 
  accuracy(atm2_ts)
## # A tibble: 1 x 10
##   .model      .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>       <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 ARIMA(Cash) Test   8.62  21.0  16.3 -50.5  79.5 0.767 0.681 0.0999
#add to model list
models_atm2[nrow(models_atm2) +1,] <- c("arima", "20.97845", "16.34507")

ATM4

#shave off April 2010 to make our training dataset
atm4_train <- atm4_ts %>%
  filter(Date < as.Date('2010-04-01'))

Seasonal Naive Model

I’ll start with a simple Seasonal Naive Model. Visually, these predictions look decent, but about two-thirds of the way into April the forecast runs quite a bit high compared to the actual data for the month.

#save the model
atm4_fit_snaive <- atm4_train %>%
  model(SNAIVE(Cash ~ lag(7)))

#create forecast
atm4_forecast_snaive <- atm4_fit_snaive %>%
  forecast(new_data = anti_join(atm4_ts, atm4_train))

#plot forecast against real April data
atm4_forecast_snaive %>% autoplot(atm4_ts) +
  labs(title = "ATM4 - Seasonal Naive Model",
       y = "USD in Hundreds")

The residuals once again show that missed spike at lag 7.

#check residuals
atm4_fit_snaive %>%
  gg_tsresiduals()

I’ll calculate and store the accuracy measures.

#calculate accuracy
atm4_forecast_snaive %>% 
  accuracy(atm4_ts)
## # A tibble: 1 x 10
##   .model                .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>                 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 SNAIVE(Cash ~ lag(7)) Test  -130.  301.  227. -374.  389. 0.645 0.653 0.0305
#store in df
models_atm4 <- data.frame(model = "seasonal naive",
                                RMSE = "301.022",
                                MAE = "227.3667")

ETS Model

Again I’ll try an ETS Holt-Winters seasonal model, with the multiplicative method given the changing variation across the time series seen in the autoplot in the Visualize the data section. Unfortunately, after 2 hours of troubleshooting I cannot determine why, regardless of how I reconfigure the tsibble, my ETS models only return null models. This happens both when I specify the terms and when I don’t. A fraction of the code I attempted is below. Possibly this means the model isn’t appropriate for this data, but I wasn’t able to find confirmation of this in any resources online or in our textbook.

# # attempt 1 at tsibble
# atm4_train_ts <- as.data.frame(atm4_train)
# atm4_train_ts <- tsibble(
#   Date = as.Date("2009-05-01") + 0:334,
#   Cash = atm4_train_ts$Cash)
# 
# #attempt 2 at tsibble
# atm4_train_df <- as.data.frame(atm4_train)
# 
# atm4_train_ts_2 <- atm4_train_df %>%
#   select(-Date) %>%
#   mutate(Date = as.Date("2009-05-01") + 0:334) %>%
#   as_tsibble(index = Date, key = Cash)
# 
# #save the model ##just a null models
# atm4_fit_etsANA <- atm4_train %>%
#   model(ETS(Cash ~ error('A') + trend('N') + season('A')))
# 
# #error can't find variable
# # atm4_fit_etsAAM <- atm4_train_ts_2 %>%
# #   model(ETS(Cash))
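One diagnostic worth running here, with hindsight (a sketch, and an assumption on my part): fable’s ETS() cannot handle missing values and returns a null model when any are present, so a stray NA, such as one left behind by the Cash == NA replacement bug noted earlier, would quietly produce exactly these null fits.

#any count above zero would explain the null ETS models
sum(is.na(atm4_train$Cash))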

ARIMA Model

Moving on to an ARIMA model: replotting the data, there may be a need for differencing to make this data stationary.

atm4_train %>%
  autoplot()

However, no differencing is needed.

ndiffs(atm4_train$Cash)
## [1] 0

An ARIMA(0,0,0)(2,0,0)[7] w/ mean is chosen by the auto-selector.

#use autoselect ARIMA model
atm4_fit_ARIMA <- atm4_train %>%
  model(ARIMA(Cash))

report(atm4_fit_ARIMA)
## Series: Cash 
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean 
## 
## Coefficients:
##         sar1    sar2  constant
##       0.1481  0.0914  342.1023
## s.e.  0.0546  0.0550   18.9594
## 
## sigma^2 estimated as 123136:  log likelihood=-2430.51
## AIC=4869.01   AICc=4869.13   BIC=4884.27

The forecast below looks like a poor fit, not capturing the seasonality.

#create forecast
atm4_forecast_ARIMA <- atm4_fit_ARIMA %>%
  forecast(h = 30)

#plot forecast against real April data
atm4_forecast_ARIMA %>% autoplot(atm4_ts) +
  labs(title = "ATM4 - ARIMA",
       y = "USD in Hundreds")

The residual ACF plot is one of the best I’ve seen in this assignment so far, but the histogram has a bit of right-skew to the distribution.

atm4_fit_ARIMA %>%
  gg_tsresiduals()

#calculate accuracy
atm4_forecast_ARIMA %>% 
  accuracy(atm4_ts)
## # A tibble: 1 x 10
##   .model      .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>       <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ARIMA(Cash) Test  -54.9  274.  235. -933.  957. 0.666 0.593 -0.0111
#add to model list
models_atm4[nrow(models_atm4) +1,] <- c("arima", "273.5695", "234.8024")

#5 Evaluate model performance

ATM1

I can reference the dataframe I made storing the RMSE and MAE for each model I tried above. It appears the ETS model, a Holt-Winters seasonal model of the multiplicative type, has both the lowest RMSE and the lowest MAE, so I will proceed to forecast using this model. The ARIMA model wasn’t too far behind either.

models_atm1
##                         model     RMSE      MAE
## 1              seasonal naive 16.03746 13.53333
## 2 holt-winters seasonal, mult 11.16975 8.571197
## 3                       arima 12.38991 9.706059

ATM2

Looking at the stored RMSE and MAE measures for the three models, it appears that again the Holt-Winters seasonal model, an exponential smoothing model with the additive method, performs best. As was the case for ATM1, the ARIMA model isn’t far behind. I’ll again move forward with the ETS model for ATM2, as I did for ATM1.

models_atm2
##                        model     RMSE      MAE
## 1             seasonal naive 26.16231     17.8
## 2 holt-winters seasonal, add 20.62269 15.24716
## 3                      arima 20.97845 16.34507

ATM4

For the two models I was able to get functioning, the RMSE is better for the ARIMA and the MAE is better for the Seasonal Naive. Looking at the forecast plots for each above, and remembering that when performance is nearly equal, simpler is better, I’ll choose the Seasonal Naive to forecast for ATM4.

models_atm4
##            model     RMSE      MAE
## 1 seasonal naive  301.022 227.3667
## 2          arima 273.5695 234.8024

#6 Produce forecasts

ATM1

Using the entire dataset, I refit the model and forecast values for May 2010. In the plot below I see the forecast, with the prediction intervals getting wider (less certain) later in May.

#save the model
atm1_fit_etsAAM_final <- atm1_ts %>%
  model(ETS(Cash ~ error("A") +
                       trend("A") + season("M")))

#create forecast
atm1_forecast_etsAAM_final <- atm1_fit_etsAAM_final %>%
  forecast(h = 30)

#plot May 2010 forecast
atm1_forecast_etsAAM_final %>% autoplot(atm1_ts) +
  labs(title = "ATM1 - ETS AAM May 2010 Predictions",
       y = "USD in Hundreds")

Last, I take the forecasted values and save to an Excel CSV file.

#take date & forecast values
forecast_ATM1 <- as.data.frame(atm1_forecast_etsAAM_final) %>%
  select(c(Date, `.mean`))

#rename columns for clarity
colnames(forecast_ATM1) <- c('Date', 'Predicted Cash')

#save to excel csv format
write_excel_csv(forecast_ATM1, "forecast_ATM1.csv")

ATM2

Using the entire dataset, I refit the model and forecast values for May 2010. Again the prediction intervals get wider later in May.

#save the model
atm2_fit_etsAAA_final <- atm2_ts %>%
  model(ETS(Cash ~ error("A") +
                       trend("A") + season("A")))

#create forecast
atm2_forecast_etsAAA_final <- atm2_fit_etsAAA_final %>%
  forecast(h=30)

#plot May 2010 forecast
atm2_forecast_etsAAA_final %>% autoplot(atm2_ts) +
  labs(title = "ATM2 - ETS AAA",
       y = "USD in Hundreds")

Last, I take the forecasted values and save to an Excel CSV file.

#take date & forecast values
forecast_atm2 <- as.data.frame(atm2_forecast_etsAAA_final) %>%
  select(c(Date, `.mean`))

#rename columns for clarity
colnames(forecast_atm2) <- c('Date', 'Predicted Cash')

#save to excel csv format
write_excel_csv(forecast_atm2, "forecast_atm2.csv")

ATM4

I refit the model using the full dataset and produce the May 2010 forecasts.

#save the model
atm4_fit_snaive_final <- atm4_ts %>%
  model(SNAIVE(Cash ~ lag(7)))

#create forecast
atm4_forecast_snaive_final <- atm4_fit_snaive_final %>%
  forecast(h = 30)

#plot May 2010 forecast
atm4_forecast_snaive_final %>% autoplot(atm4_ts) +
  labs(title = "ATM4 - Seasonal Naive Model",
       y = "USD in Hundreds")

Last, I save the forecasted values to an Excel CSV file.

#take date & forecast values
forecast_atm4 <- as.data.frame(atm4_forecast_snaive_final) %>%
  select(c(Date, `.mean`))

#rename columns for clarity
colnames(forecast_atm4) <- c('Date', 'Predicted Cash')

#save to excel csv format
write_excel_csv(forecast_atm4, "forecast_ATM4.csv")

Part B - Power Usage Forecast

Part B consists of a simple dataset of residential power usage from January 1998 through December 2013. The assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file; the variable ‘KWH’ is power consumption in kilowatt hours, and the rest is straightforward.

#1 Data preparation

Looking at a summary of the data, we see there are 3 variables: CaseSequence, which gives the order of the observations; the YYYY-MMM date variable; and the KWH variable showing the power usage. KWH has one missing value we will want to impute before we proceed.

power_raw <- read_csv("power_raw.csv")

skim(power_raw)
Data summary
Name                    power_raw
Number of rows          192
Number of columns       3
Column type frequency:  character (1), numeric (2)
Group variables         None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
YYYY-MMM       0          1              8    8    0      192       0

Variable type: numeric

skim_variable  n_missing  complete_rate  mean       sd          p0      p25         p50        p75         p100      hist
CaseSequence   0          1.00           828.5      55.57       733     780.75      828.5      876.25      924       ▇▇▇▇▇
KWH            1          0.99           6502474.6  1447570.89  770523  5429912.00  6283324.0  7620523.50  10655730  ▁▁▇▅▁

Imputation

Below I use the imputeTS package to generate a value for the identified NA, using the linear (default) method. The imputed point seems reasonable and appears to respect the seasonality we see in the data, so I incorporate it.

#store imputation
imputation <- na_interpolation(power_raw$KWH)
#plot imputation
ggplot_na_imputations(power_raw$KWH, imputation) +
  my_plot_theme

#incorporate imputation
power <- power_raw
power$KWH <- imputation

#2 Visualize the data

I start by visualizing the data with a boxplot and see there is one low outlier we may want to check on.

power %>%
  subset(select = KWH) %>%
  ggplot(aes(KWH)) +
  geom_boxplot() +
  labs(title = "Checking KWH Distribution") +
  my_plot_theme

Check for Outliers

The outlier shown above on the boxplot appears to be from July 2010. In a real-world setting I would be curious to know from coworkers if there was a large outage for part of this month that decreased consumption, or possibly a measurement error.

power %>%
  filter(KWH < 1000000)
## # A tibble: 1 x 3
##   CaseSequence `YYYY-MMM`    KWH
##          <dbl> <chr>       <dbl>
## 1          883 2010-Jul   770523

Since this outlier is only 12% of the median for this time series, and without real-life coworkers to check with, I’ll presume the value is in error and impute it with the median.

#set outlier to NA for now
power$KWH[power$KWH == 770523] <- NA

#calculate median w/o extreme outlier
power_median <- median(power$KWH, na.rm = TRUE)

#put median in for outlier/NA value (== NA never matches, so use is.na())
power$KWH[is.na(power$KWH)] <- power_median

Checking the boxplot again, the distribution appears much more normal, though still with a fair amount of variation.

ggplot(power, aes(x = KWH)) +
  geom_boxplot() +
  my_plot_theme

Consider Transformations

After transferring our data to a tsibble and dividing KWH by 1,000 for easier chart reading, a quick check of the histogram shows somewhat normally distributed data. In an effort to keep our predictions simple and easy to explain, I’m hopeful that modeling the untransformed data will work okay, but I’ll check both.

#KWH in thousands
power$KWH <- power$KWH / 1000


#transfer to tsibble
power_ts <- tsibble(
  Month = yearmonth(power$`YYYY-MMM`),
  KWH = power$KWH)


ggplot(power_ts, aes(x = KWH)) +
  geom_histogram(bins = 20)

The Box-Cox lambda comes out to -0.23. I’ll store this transformed version of the dataset to check against the non-transformed data later.

#calculate lambda
lambda <- power_ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)


#plug into autoplot
ggplotly(power_ts %>%
    autoplot(box_cox(KWH, lambda)) +
     labs(y = "",
       title = "Transformed KWH with lambda = -0.23"))
#apply transformation
power_tf <-
    power_ts %>% 
    mutate(KWH = box_cox(KWH, lambda))

#3 Choose your model

This data appears similar to most of the ATM data I worked with, except it has a bit more trend, going up near the end of the time series. I’ll try some of the same models again.

#4 Train the model

#shave off 2013 to make our training dataset
power_train <-  head(power_ts, -12)

power_train_tf <-  head(power_tf, -12)
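An equivalent, more explicit split is available through tsibble’s filter_index(); a sketch of the same cut:

#keep everything through December 2012
power_train_alt <- power_ts %>%
  filter_index(~ "2012 Dec")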

Seasonal Naive Model

The Seasonal Naive Model looks pretty good, though it seems to have a bit more range in the seasonality than the actual data for 2013.

#save the model
power_fit_snaive <- power_train %>%
  model(SNAIVE(KWH ~ lag("year")))

#create forecast
power_forecast_snaive <- power_fit_snaive %>%
  forecast(new_data = anti_join(power_ts, power_train))

#plot forecast against real 2013 data
power_forecast_snaive %>% autoplot(power_ts) +
  labs(title = "KWH Predictions - Seasonal Naive Model",
       y = "KWH")

However, I see the lag-12 spike is quite significant.

#check residuals
power_fit_snaive %>% 
  gg_tsresiduals()

As I mentioned earlier, I’ll try running the model on the transformed data in case there are significant gains that make this more complex transformation worthwhile.

#save the model
power_fit_snaive_tf <- power_train_tf %>%
  model(SNAIVE(KWH ~ lag("year")))

#create forecast
power_forecast_snaive_tf <- power_fit_snaive_tf %>%
  forecast(new_data = anti_join(power_tf, power_train_tf))

#calculate accuracy (of the transformed-data forecast, on the transformed scale)
power_forecast_snaive_tf %>% 
  accuracy(power_tf)
## # A tibble: 1 x 10
##   .model             .type     ME  RMSE   MAE     MPE   MAPE   MASE  RMSSE  ACF1
##   <chr>              <chr>  <dbl> <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl> <dbl>
## 1 "SNAIVE(KWH ~ lag~ Test  -7209. 7345. 7209. -1.90e5 1.90e5 5.76e5 4.69e5 0.517
#store in df
models_power <- data.frame(model = "seasonal naive - boxcox",
                                RMSE = "7344.53",
                                MAE = "7209.359")

And the modeling on the original data values.

#save the model
power_fit_snaive <- power_train %>%
  model(SNAIVE(KWH ~ lag("year")))

#create forecast
power_forecast_snaive <- power_fit_snaive %>%
  forecast(new_data = anti_join(power_ts, power_train))

#calculate accuracy
power_forecast_snaive %>% 
  accuracy(power_ts)
## # A tibble: 1 x 10
##   .model                 .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>                  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 "SNAIVE(KWH ~ lag(\"y~ Test   405. 1036.  619.  4.55  7.06  1.01  1.32 -0.0313
#add to model list
models_power[nrow(models_power) +1,] <- c("seasonal naive", "1035.538", "618.6056")

ARIMA

Looking at the training data I suspect we might need differencing.

power_train %>%
  autoplot()

One difference is needed, which I apply below.

ndiffs(power_train$KWH)
## [1] 1

After differencing, no more is needed, and the KPSS test confirms stationarity.

power_train_diff <- power_train %>%
  mutate(KWH = difference(KWH))

#check again
ndiffs(power_train_diff$KWH)
## [1] 0
unitroot_kpss(power_train_diff$KWH)
##   kpss_stat kpss_pvalue 
##  0.02071724  0.10000000

The automatic model selection from the ARIMA() function returns an ARIMA(0,0,4)(2,1,0)[12] w/ drift.

#use autoselect ARIMA model
power_fit_ARIMA <- power_train %>%
  model(ARIMA(KWH))

report(power_fit_ARIMA)
## Series: KWH 
## Model: ARIMA(0,0,4)(2,1,0)[12] w/ drift 
## 
## Coefficients:
##          ma1     ma2    ma3     ma4     sar1     sar2  constant
##       0.3870  0.0625  0.236  0.0013  -0.7301  -0.4443  209.8055
## s.e.  0.0779  0.0861  0.073  0.0884   0.0762   0.0814   78.5199
## 
## sigma^2 estimated as 339705:  log likelihood=-1302.14
## AIC=2620.28   AICc=2621.19   BIC=2645.27

Our forecast of the 2013 test period looks pretty dang good compared to all the other models in this project.

#create forecast
power_forecast_ARIMA <- power_fit_ARIMA %>%
  forecast(h = 12)


#plot forecast against real 2013 data
power_forecast_ARIMA %>% autoplot(power_ts) +
  labs(title = "Power KWH Predictions - ARIMA",
       y = "KWH")

The residuals look quite good, other than one value out of bounds at lag 5 on the ACF plot. The histogram of residuals looks mostly normally distributed.

power_fit_ARIMA %>%
  gg_tsresiduals()

I get low RMSE and MAE values; this is another well-fitting model, though I should be wary of any potential overfitting.

#calculate accuracy
power_forecast_ARIMA %>% 
  accuracy(power_ts)
## # A tibble: 1 x 10
##   .model     .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>      <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ARIMA(KWH) Test   220.  945.  641.  2.01  7.56  1.05  1.21 -0.0398
#add to model list
models_power[nrow(models_power) +1,] <- c("arima", "944.7432", "640.5602")

#5 Evaluate model performance

Comparing first the Seasonal Naive on the transformed (Box-Cox) data vs the original scale: the error measures are computed on different scales, so they can’t be compared directly. Generally, you don’t want to compare RMSE or MAE across data of different scales. Since the transformation doesn’t show a clear, comparable gain, I’ll stick with the untransformed Seasonal Naive Model for simplicity.

Comparing the Seasonal Naive Model to the ARIMA, we see ARIMA had a better RMSE and Seasonal Naive had a better MAE. As I aim to choose the simpler method when not too much accuracy is lost, I’ll move forward with the non-transformed Seasonal Naive Model.

models_power
##                     model     RMSE      MAE
## 1 seasonal naive - boxcox  7344.53 7209.359
## 2          seasonal naive 1035.538 618.6056
## 3                   arima 944.7432 640.5602
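If I wanted a true apples-to-apples comparison for the transformed model, fable allows the transformation to live inside the model formula, in which case forecasts are automatically back-transformed to the original scale before scoring; a sketch, assuming the lambda computed earlier:

#transformation inside the formula: forecasts come back on the KWH scale
power_fit_snaive_bc <- power_train %>%
  model(SNAIVE(box_cox(KWH, lambda) ~ lag("year")))

power_fit_snaive_bc %>%
  forecast(h = 12) %>%
  accuracy(power_ts)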

#6 Produce forecasts

Below I retrain the model on the full dataset before predicting 12 months of data, all of 2014. In the plot I see both the seasonality and a bit of the recent level maintained in the 1-year forecast.

#save the model
power_fit_snaive_final <- power_ts %>%
  model(SNAIVE(KWH ~ lag("year")))

#create forecast
power_forecast_snaive_final <- power_fit_snaive_final %>%
  forecast(h=12)

#plot 2014 forecast
power_forecast_snaive_final %>% autoplot(power_ts) +
  labs(title = "KWH Predictions - Seasonal Naive Model",
       y = "KWH")

Finally, these forecasted values are saved.

#take date & forecast values
forecast_power <- as.data.frame(power_forecast_snaive_final) %>%
  select(c(Month, `.mean`))

#rename columns for clarity
colnames(forecast_power) <- c('Month', 'Predicted KWH')


#save to excel csv format
write_excel_csv(forecast_power, "forecast_power.csv")