Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this more of a business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report (findings, visuals, discussion) and your R code via an RPubs link, along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.

We are going to start by reading in the data and doing some basic exploration of it.
library(readxl)   # read_xlsx() for the Excel source files
library(fpp3)     # attaches tsibble, fable, feasts, dplyr, ggplot2 and lubridate
library(caret)    # preProcess() for the Box-Cox transformations used later

temp <- readxl::read_xlsx("/Users/williamaiken/DATA624/HW/ATM624Data.xlsx")

First it is helpful to inspect the head of the data set. I like the dplyr ‘glimpse’ function for this. The DATE column doesn’t make a lot of sense at first glance. I thought it might be using the Unix epoch (days since 1970-01-01), but the numbers are too big for that; they appear to be Excel serial dates, counted from 1899-12-30. In the end it doesn’t really matter for modeling as long as the observations are ordered. I did a count on the dates and there are no more than 4 rows for a given date.

dplyr::glimpse(temp)
## Rows: 1,474
## Columns: 3
## $ DATE <dbl> 39934, 39934, 39935, 39935, 39936, 39936, 39937, 39937, 39938, 39…
## $ ATM  <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
## $ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …
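
As a quick sanity check on both points (a minimal sketch, assuming the serial numbers are Excel dates counted from 1899-12-30):

# Excel serial dates count days from 1899-12-30, so 39934 is 2009-05-01
head(as.Date(temp$DATE, origin = "1899-12-30"), 2)

# no date has more than 4 rows (at most one per ATM)
temp |> count(DATE) |> summarise(max_rows_per_date = max(n))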

To get a sense of the missingness, I like to start with a count of the missing values. We are missing 19 ‘Cash’ values and 14 ‘ATM’ values. The 14 rows missing an ATM label are the 14 days at the end of the data set that we are asked to forecast; they also account for 14 of the 19 missing ‘Cash’ values, so only 5 are genuine gaps (three for ATM1, two for ATM2).

sum(is.na(temp$Cash))
## [1] 19
sum(is.na(temp$ATM))
## [1] 14
temp |> filter(is.na(Cash))
## # A tibble: 19 Ă— 3
##     DATE ATM    Cash
##    <dbl> <chr> <dbl>
##  1 39977 ATM1     NA
##  2 39980 ATM1     NA
##  3 39982 ATM2     NA
##  4 39986 ATM1     NA
##  5 39988 ATM2     NA
##  6 40299 <NA>     NA
##  7 40300 <NA>     NA
##  8 40301 <NA>     NA
##  9 40302 <NA>     NA
## 10 40303 <NA>     NA
## 11 40304 <NA>     NA
## 12 40305 <NA>     NA
## 13 40306 <NA>     NA
## 14 40307 <NA>     NA
## 15 40308 <NA>     NA
## 16 40309 <NA>     NA
## 17 40310 <NA>     NA
## 18 40311 <NA>     NA
## 19 40312 <NA>     NA
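
Converting the serial dates of the unlabeled rows confirms they span exactly the horizon we are asked to forecast (again assuming Excel serial dates):

# 40299 through 40312 correspond to 2010-05-01 through 2010-05-14
as.Date(c(40299, 40312), origin = "1899-12-30")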

When we look at ATM4, we see that the ‘Cash’ values are larger than those of the other ATMs and are given with an unusual amount of precision. It’s possible that this ATM does not dispense US dollars, or that it is a foreign ATM whose withdrawals are being converted into US dollars.

dplyr::glimpse(temp |> filter(ATM == 'ATM4'))
## Rows: 365
## Columns: 3
## $ DATE <dbl> 39934, 39935, 39936, 39937, 39938, 39939, 39940, 39941, 39942, 39…
## $ ATM  <chr> "ATM4", "ATM4", "ATM4", "ATM4", "ATM4", "ATM4", "ATM4", "ATM4", "…
## $ Cash <dbl> 776.99342, 524.41796, 792.81136, 908.23846, 52.83210, 52.20845, 5…

We have limited our histogram to ATM withdrawals of less than $1,500 for ease of visualization. The fact that this limit is needed means we may have some bad values in the data set that need to be addressed. Most banks cap how much money can be withdrawn from an ATM to prevent fraud (typically $300 to $3,000). We can see that the distributions of the ATMs are all very different, particularly ATM4’s. It would not be appropriate to average across the ATMs and fit one model; each ATM should be modeled separately.

library(ggplot2)
library(dplyr)
ggplot(temp %>% filter(!is.na(ATM) & Cash < 1500), aes(Cash, fill = ATM)) +
  geom_histogram(alpha = 0.5, aes(y = after_stat(density)), position = 'identity')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
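
To quantify what the $1,500 cutoff hides, we can count the excluded observations directly (a quick check, not part of the original plot):

# how many rows does the display limit exclude, and from which ATM?
temp %>% filter(!is.na(Cash), Cash >= 1500) %>% count(ATM)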

Before we do any fixing of the data, I want to visualize the data as time series.

temp2 <- as_tsibble(temp, index = 'DATE', key = 'ATM')
autoplot(temp2)
## Plot variable not specified, automatically selected `.vars = Cash`
## Warning: Removed 14 rows containing missing values or values outside the scale range
## (`geom_line()`).

We need to visualize these plots separately to really see what is going on. In the ATM1 plot we see no trend but possibly some weekly seasonality. This would make sense for an ATM that is commonly used during the week but not on weekends.

ATM1

temp2 |> filter(ATM == 'ATM1') |> autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`

Let’s start by replacing the missing values with the mean, then using classical decomposition to inspect the trend and seasonality.

I recognize that it would be better to impute using the mean for the day of the week of each missing value, but we are missing so few values that using the overall mean should not meaningfully affect the forecasts.

atm1_df <- temp2 |> filter(ATM == 'ATM1')

atm1_df <- atm1_df %>% mutate(Cash = if_else(is.na(Cash), mean(Cash, na.rm = TRUE), Cash))
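
For reference, the day-of-week version mentioned above would look something like this sketch (using lubridate’s wday on the converted dates); we stick with the overall mean here:

# alternative: impute each NA with the mean for that day of the week
atm1_dow <- temp2 |>
  filter(ATM == 'ATM1') |>
  mutate(dow = lubridate::wday(as.Date(DATE, origin = "1899-12-30"))) |>
  group_by(dow) |>
  mutate(Cash = if_else(is.na(Cash), mean(Cash, na.rm = TRUE), Cash)) |>
  ungroup()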

We can see that there is no real trend but there is a seasonal component which we are not fully capturing because we see that the ‘random’ component is not fully random.

atm1_df |>
  model(classical_decomposition(Cash ~ season(7), type = "multiplicative")) |>
  components() |>
  autoplot() +
  labs(title = "Classical multiplicative decomposition of ATM1")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

ATM1 Model Selection

We are going to use the ‘model’ function to find the ETS model that best fits the data. The best automatically selected model is ETS(M,N,N): multiplicative error, no trend, no seasonality.

fit <- atm1_df %>% model(ETS(Cash))

report(fit)
## Series: Cash 
## Model: ETS(M,N,N) 
##   Smoothing parameters:
##     alpha = 0.03200099 
## 
##   Initial states:
##      l[0]
##  79.04326
## 
##   sigma^2:  0.1896
## 
##      AIC     AICc      BIC 
## 4783.080 4783.147 4794.780

The automatic model selection is not picking up on the seasonal component. If we explicitly set the period to seven days, we can add a seasonal component, and the AIC drops.

fit <- atm1_df %>% model(ETS(Cash ~ error("M") + trend("N") + season("M", period = 7)))

report(fit)
## Series: Cash 
## Model: ETS(M,N,M) 
##   Smoothing parameters:
##     alpha = 0.1601814 
##     gamma = 0.3633592 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  74.26055 0.2931719 0.9825334 1.236199 1.045999 1.070437 1.021235 1.350425
## 
##   sigma^2:  0.1344
## 
##      AIC     AICc      BIC 
## 4566.773 4567.395 4605.772
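
We can also fit both candidates side by side and compare the information criteria directly (a sketch using fabletools’ glance):

# compare the automatic and manually specified models in one mable
atm1_df %>%
  model(
    auto   = ETS(Cash),
    weekly = ETS(Cash ~ error("M") + trend("N") + season("M", period = 7))
  ) %>%
  glance() %>%
  select(.model, AIC, AICc, BIC)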

Next we are going to inspect the model residuals. The innovation residuals look like white noise except near the end. The distribution of the residuals looks normal.

fit |> gg_tsresiduals()
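
A Ljung-Box test complements the visual check (a sketch; lag 14 covers two weekly cycles, and a large p-value is consistent with white noise):

# portmanteau test on the innovation residuals
augment(fit) %>% features(.innov, ljung_box, lag = 14)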

ATM1 Forecasting
fc <- atm1_df %>% model(ETS(Cash ~ error("M") + trend("N") + season("M", period = 7))) %>% forecast(h=14)
  
fc %>% autoplot(atm1_df)

ATM1 <- fc$.mean

ATM2

For ATM2, I see no apparent trend but possibly some weekly seasonality.

temp2 |> filter(ATM == 'ATM2') |> autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`

atm2_df <- temp2 |> filter(ATM == 'ATM2')

atm2_df <- atm2_df %>% mutate(Cash = if_else(is.na(Cash), mean(Cash, na.rm = TRUE), Cash))
atm2_df |>
  model(classical_decomposition(Cash ~ season(7), type = "multiplicative")) |>
  components() |>
  autoplot() +
  labs(title = "Classical multiplicative decomposition of ATM2")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

ATM2 Model Selection

Here we fit the model selected automatically, with an added seasonal component.

fit <- atm2_df %>% model(ETS(Cash ~ error("A") + trend("A") + season("A", period = 7)))

report(fit)
## Series: Cash 
## Model: ETS(A,A,A) 
##   Smoothing parameters:
##     alpha = 0.0001000869 
##     beta  = 0.0001000007 
##     gamma = 0.3605111 
## 
##   Initial states:
##      l[0]       b[0]      s[0]    s[-1]    s[-2]     s[-3]    s[-4]    s[-5]
##  73.11908 -0.0402754 -36.92106 -19.0999 13.93127 -2.467569 5.052248 13.77562
##     s[-6]
##  25.72939
## 
##   sigma^2:  647.9224
## 
##      AIC     AICc      BIC 
## 4529.220 4530.106 4576.019

Next we are going to inspect the model residuals. The innovation residuals look like white noise and the residual distribution looks normal.

fit |> gg_tsresiduals()

ATM2 Forecasting

Now we forecast for the next two weeks using our selected AAA model. Our confidence intervals include nonsense values (amounts less than zero).

fc <- atm2_df %>%model(ETS(Cash ~ error("A") + trend("A") + season("A", period = 7))) %>% forecast(h=14)
  
fc %>% autoplot(atm2_df)
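
To see the offending intervals explicitly, hilo() extracts them from the forecast distribution (a quick check):

# 95% intervals; several lower bounds dip below zero
fc %>% hilo(95) %>% select(DATE, .mean, `95%`)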

ATM2 <- fc$.mean

ATM3

Oh boy, ATM3 is going to be tricky. There are only three non-zero cash values for ATM3, and they occur at the very end of the series. It’s possible that ATM3 was only put into service at the very end of the series. We could assume that it shares the weekly seasonality of ATM1 and ATM2. I think the forecasts for ATM1 can be used for ATM3, since the last three values for ATM1 and ATM3 are the same.

temp2 |> filter(ATM == 'ATM3') |> autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`
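
We can check the claim about the matching values directly (a sketch; pivot_wider lines the two series up side by side):

# last three observed days for ATM1 and ATM3
temp |>
  filter(ATM %in% c('ATM1', 'ATM3')) |>
  filter(DATE > max(DATE) - 3) |>
  tidyr::pivot_wider(names_from = ATM, values_from = Cash)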

atm3_df <- temp2 |> filter(ATM == 'ATM3')

atm3_df <- atm3_df %>% mutate(Cash = if_else(is.na(Cash), mean(Cash, na.rm = TRUE), Cash))
atm3_df |>
  model(classical_decomposition(Cash ~ season(7), type = "multiplicative")) |>
  components() |>
  autoplot() +
  labs(title = "Classical multiplicative decomposition of ATM3")
## Warning: There were 4 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `ymin = min_fn(.val, na.rm = TRUE)`.
## ℹ In group 3: `.var = seasonal`.
## Caused by warning in `min_fn()`:
## ! no non-missing arguments to min; returning Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.
## Warning: Removed 733 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_rect()`).

ATM4

For ATM4, there is one outlier that is almost an order of magnitude greater than any other value in the data set (10919.76). We are going to replace that one value with the mean.

temp2 |> filter(ATM == 'ATM4') |> autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`
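
A quick look at the largest values confirms there is a single extreme point (a sketch):

# the top value dwarfs everything else in the ATM4 series
temp |> filter(ATM == 'ATM4') |> arrange(desc(Cash)) |> head(3)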

atm4_df <- temp2 |> filter(ATM == 'ATM4')

atm4_df <- atm4_df %>% mutate(Cash = if_else(Cash == max(Cash, na.rm = TRUE), mean(Cash, na.rm = TRUE), Cash))
atm4_df |>
  model(classical_decomposition(Cash ~ season(7), type = "multiplicative")) |>
  components() |>
  autoplot() +
  labs(title = "Classical multiplicative decomposition of ATM4")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

ATM4 Model Selection
fit <- atm4_df %>% model(ETS(Cash ~ error("A") + trend("N") + season("A", period = 7)))

report(fit)
## Series: Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0001000108 
##     gamma = 0.0001001184 
## 
##   Initial states:
##     l[0]      s[0]     s[-1]    s[-2]   s[-3]    s[-4]    s[-5]    s[-6]
##  453.529 -273.2254 -32.19763 28.30188 40.2273 89.02959 48.41241 99.45182
## 
##   sigma^2:  110777.9
## 
##      AIC     AICc      BIC 
## 6403.928 6404.549 6442.927

Next we are going to inspect the model residuals. The residuals don’t look centered on zero, so we are going to apply a Box-Cox transformation.

fit |> gg_tsresiduals()

ATM4 Transformation
atm_trans <- as.data.frame(atm4_df)
bc_trans <- preProcess(atm_trans["Cash"], method = c("BoxCox"))
transformed <- predict(bc_trans, atm_trans["Cash"])

atm_trans$Cash_t <- transformed$Cash

atm_trans <- as_tsibble(atm_trans, index = 'DATE')

atm_trans |> autoplot(Cash_t)

Our lambda is 0.4, which we will need in order to back-transform the forecasts at the end.

bc_trans$bc
## $Cash
## Box-Cox Transformation
## 
## 365 data points used to estimate Lambda
## 
## Input data summary:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.563  124.334  403.839  445.425  704.192 1712.075 
## 
## Largest/Smallest: 1100 
## Sample Skewness: 0.659 
## 
## Estimated Lambda: 0.4
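
For what it’s worth, a comparable lambda can be estimated without leaving the fable ecosystem, using the Guerrero method from feasts (a sketch; the estimate may differ slightly from caret’s):

# Guerrero estimate of the Box-Cox lambda
atm4_df %>% features(Cash, features = guerrero)

# fable can also apply box_cox() inside the model formula, in which case
# forecast() back-transforms automatically:
# atm4_df %>% model(ETS(box_cox(Cash, 0.4) ~ error("A") + trend("N") + season("A", period = 7)))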

Now we are going to fit our ETS model. Across the ATM series, the automatic model selection was not able to detect the weekly seasonal component, which highlights the importance of visualizing the data before model selection.

fit <- atm_trans %>% model(ETS(Cash_t ~ error("A") + trend("N") + season("A", period = 7)))

report(fit)
## Series: Cash_t 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0005262839 
##     gamma = 0.1053827 
## 
##   Initial states:
##      l[0]      s[0]      s[-1]      s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  23.02844 -12.69921 -0.7352099 -0.5172638 3.833066 1.641212 2.405787 6.071617
## 
##   sigma^2:  96.3451
## 
##      AIC     AICc      BIC 
## 3831.646 3832.268 3870.645

The residuals are more normally distributed.

fit |> gg_tsresiduals()

ATM4 Forecasting

Now we are going to do our forecasting using the model we manually selected (the automatic choice with a seasonal component added). Since we will back-transform with lambda = 0.4 below, the forecast must come from the model fit on the transformed series.

fc <- atm_trans %>% model(ETS(Cash_t ~ error("A") + trend("N") + season("A", period = 7))) %>% forecast(h=14)

fc %>% autoplot(atm_trans)

Now we need to back-transform our forecasts. This function comes from a solution that Rob Hyndman posted on StackExchange.

# inverse Box-Cox: caret applies (y^lambda - 1) / lambda, so the inverse is
# (lambda * y + 1)^(1 / lambda), with exp(y) for the lambda = 0 (log) case
boxinvTransform <- function(y, lambda) {
  if (lambda == 0L) { exp(y) }
  else { (y * lambda + 1)^(1/lambda) }
}

fc_t <- fc$.mean

ATM4 <- boxinvTransform(fc_t, 0.4)

Now we are going to join all of our forecasts and save them out. I’m choosing to round the predictions for ATM1, ATM2, and ATM3 because the values in the original data were given as whole dollar amounts.

Upon inspection our forecasts look reasonable given the original data.

ATM3 <- round(ATM1)
ATM1 <- round(ATM1)
ATM2 <- round(ATM2)

# naming the vectors up front avoids bind_cols' automatic name repair
forecast_part1 <- bind_cols(ATM1 = ATM1, ATM2 = ATM2, ATM3 = ATM3, ATM4 = ATM4)
#write_excel_csv(forecast_part1, "forecast_part1.csv")

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple data set of residential power usage from January 1998 through December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt-hours; the rest is straightforward. Add this to your existing files above.

pwr_df <- readxl::read_xlsx('/Users/williamaiken/DATA624/HW/ResidentialCustomerForecastLoad-624.xlsx')
glimpse(pwr_df)
## Rows: 192
## Columns: 3
## $ CaseSequence <dbl> 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 74…
## $ `YYYY-MMM`   <chr> "1998-Jan", "1998-Feb", "1998-Mar", "1998-Apr", "1998-May…
## $ KWH          <dbl> 6862583, 5838198, 5420658, 5010364, 4665377, 6467147, 891…

We have one missing value that will have to be dealt with.

sum(is.na(pwr_df$KWH))
## [1] 1

Let’s examine a histogram of KWH. We find a distribution that is approximately normal, centered around 6 million KWH.

ggplot(pwr_df, aes(KWH)) +
  geom_histogram(alpha = 0.5, aes(y = after_stat(density)), position = 'identity')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

Next let’s examine the time series of KWH. This data has both a positive trend and a seasonal component. We can also see an outlier value around 2011.

If you don’t use the ‘yearmonth’ function, you run into issues when you try to do the classical decomposition a few steps later.

pwr_df$date <- yearmonth(as.character(pwr_df$'YYYY-MMM'))
pwr_df <- as_tsibble(pwr_df |> select(date, KWH), index = 'date')
pwr_df |> autoplot()
## Plot variable not specified, automatically selected `.vars = KWH`

We need to fix the missing value; we are going to replace it with the mean.

pwr_df <- pwr_df |> mutate(KWH = ifelse(is.na(KWH), mean(KWH, na.rm = TRUE), KWH), year = lubridate::year(date))
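
Before decomposing, a season plot from feasts makes the repeating monthly pattern easier to see (a sketch; one line per year):

# seasonal plot: each line is one year of monthly KWH
pwr_df |> gg_season(KWH) + labs(title = "Seasonal plot: monthly KWH")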

Let’s use classical decomposition to understand the data a little better.

pwr_df |>
  model(classical_decomposition(KWH ~ season(12), type = "multiplicative")) |>
  components() |>
  autoplot() +
  labs(title = "Classical multiplicative decomposition of KWH")
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_line()`).

# replace the outlier noted above (the series minimum) with the mean
pwr_df <- pwr_df %>% mutate(KWH = ifelse(KWH == min(KWH, na.rm = TRUE), mean(KWH, na.rm = TRUE), KWH))

We are going to use exponential smoothing for our forecasting, with the ‘model’ function selecting the optimal ETS model.

The model selected is ETS(M,N,M): multiplicative error, no trend, multiplicative seasonality.

fit <- pwr_df %>% model(ETS(KWH))

report(fit)
## Series: KWH 
## Model: ETS(M,N,M) 
##   Smoothing parameters:
##     alpha = 0.1630125 
##     gamma = 0.2215093 
## 
##   Initial states:
##     l[0]      s[0]     s[-1]    s[-2]    s[-3]    s[-4]   s[-5]    s[-6]
##  6185479 0.8927677 0.7647435 0.933915 1.246473 1.288732 1.24108 1.009862
##      s[-7]     s[-8]     s[-9]    s[-10]   s[-11]
##  0.7562118 0.8047877 0.8694098 0.9998625 1.192155
## 
##   sigma^2:  0.0097
## 
##      AIC     AICc      BIC 
## 6150.263 6152.990 6199.125

Next we are going to inspect the model residuals.

fit |> gg_tsresiduals()

The residuals show us that the data could benefit from some transformation. Our lambda is -0.3, which we will need later for the back-transformation.

pwr_trans <- as.data.frame(pwr_df)
bc_trans <- preProcess(pwr_trans["KWH"], method = c("BoxCox"))
transformed <- predict(bc_trans, pwr_trans["KWH"])

pwr_trans$KWH_t <- transformed$KWH

pwr_trans <- as_tsibble(pwr_trans, index = 'date')

pwr_trans |> autoplot(KWH_t)
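
The lambda quoted above can also be read directly off the preProcess object (assuming caret’s BoxCoxTrans structure):

# estimated lambda for the KWH transformation
bc_trans$bc$KWH$lambda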

We select our model after the transformation, and thankfully it picks up the seasonal component (M,A,M). When we look at the AIC, the temptation is to freak out and think we have done something wrong, but when the data have been transformed to a much smaller scale (as with a log or Box-Cox transformation), a negative AIC is possible.

fit2 <- pwr_trans %>% model(ETS(KWH_t))

report(fit2)
## Series: KWH_t 
## Model: ETS(M,A,M) 
##   Smoothing parameters:
##     alpha = 0.08071573 
##     beta  = 0.000100015 
##     gamma = 0.0001001188 
## 
##   Initial states:
##      l[0]         b[0]      s[0]     s[-1]     s[-2]  s[-3]    s[-4]    s[-5]
##  3.302321 1.089662e-05 0.9998876 0.9992466 0.9996419 1.0005 1.000694 1.000496
##     s[-6]     s[-7]     s[-8]     s[-9]  s[-10]   s[-11]
##  1.000048 0.9993343 0.9995079 0.9998246 1.00023 1.000589
## 
##   sigma^2:  0
## 
##       AIC      AICc       BIC 
## -1684.751 -1681.234 -1629.373

The innovation residuals look like white noise and the distribution of the residuals looks normal.

fit2 |> gg_tsresiduals()

Lastly, we are going to forecast with our selected model and then transform our forecasts back to the original scale.

fc <- pwr_trans %>% model(ETS(KWH_t)) %>% forecast(h=24)  # 24 months for context; only the 12 months of 2014 are required
  
fc %>% autoplot(pwr_trans)

These forecasts look reasonable when compared to the original data.

fc_t <- fc$.mean

forecast_part2 <- boxinvTransform(fc_t, -0.3)

forecast_part2 <- round(forecast_part2)

forecast_part2 <- as.data.frame(forecast_part2)
#write_excel_csv(forecast_part2, "forecast_part2.csv")

Takeaways from this project: model selection looks easy, but it requires you to understand the original data well enough to verify that the selected model makes sense given the properties of the data. Most of the time is spent understanding, cleaning, and transforming the data. Decisions will have to be made, but it’s important that your logic and choices are clear to your audience.