library(fabletools)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
library(fable)
library(ggplot2)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tsibble)
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(readxl)
library(dplyr)
library(zoo)
##
## Attaching package: 'zoo'
##
## The following object is masked from 'package:tsibble':
##
## index
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(feasts)
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel readable file.
# Loading in provided excel file of ATM data
atm <- readxl::read_excel("C:\\Users\\chung\\Downloads\\ATM624Data.xlsx")
# Taking a look at the data loaded into R
head(atm)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
I note here that we will have to reformat the DATE column, as the format after importing does not match the Excel format of “m/d/yyyy h:mm:ss AM/PM”. The DATE in our imported data shows the number of days since 1899-12-30 (Excel’s serial date origin).
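For instance, the minimum serial value in the data corresponds to the start of the series:
as.Date(39934, origin = "1899-12-30")
## [1] "2009-05-01"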
summary(atm)
## DATE ATM Cash
## Min. :39934 Length:1474 Min. : 0.0
## 1st Qu.:40026 Class :character 1st Qu.: 0.5
## Median :40118 Mode :character Median : 73.0
## Mean :40118 Mean : 155.6
## 3rd Qu.:40210 3rd Qu.: 114.0
## Max. :40312 Max. :10919.8
## NA's :19
I note here that there are 19 NA values in the ‘Cash’ variable; this is another thing that I will have to address.
# Ensuring all four ATMs are accounted for in the data
unique(atm$ATM)
## [1] "ATM1" "ATM2" NA "ATM3" "ATM4"
Next I look into the counts for each ATM, since the NA values could represent missing data for one of the four recorded machines.
atm %>%
group_by(ATM) %>%
summarise(Count = n())
## # A tibble: 5 × 2
## ATM Count
## <chr> <int>
## 1 ATM1 365
## 2 ATM2 365
## 3 ATM3 365
## 4 ATM4 365
## 5 <NA> 14
atm_na <- atm %>%
filter(is.na(ATM))
atm_na
## # A tibble: 14 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 40299 <NA> NA
## 2 40300 <NA> NA
## 3 40301 <NA> NA
## 4 40302 <NA> NA
## 5 40303 <NA> NA
## 6 40304 <NA> NA
## 7 40305 <NA> NA
## 8 40306 <NA> NA
## 9 40307 <NA> NA
## 10 40308 <NA> NA
## 11 40309 <NA> NA
## 12 40310 <NA> NA
## 13 40311 <NA> NA
## 14 40312 <NA> NA
Exploring the counts, each ATM has 365 days’ worth of values, so the NA values in the ATM column will also be addressed in data preprocessing. I will drop the observations with NA ATM values, then check whether the date ranges for each ATM are aligned.
# Plotting each ATM to see the general structure of the data
ggplot(atm, aes(DATE, Cash, col = ATM)) +
geom_line() +
facet_wrap(~ ATM, scales = "free_y")
## Warning: Removed 14 rows containing missing values or values outside the scale range
## (`geom_line()`).
# Viewing ATM4's outlier
atm %>%
filter(ATM == 'ATM4') %>%
arrange(desc(Cash))
## # A tibble: 365 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 40218 ATM4 10920.
## 2 40078 ATM4 1712.
## 3 40207 ATM4 1575.
## 4 40060 ATM4 1495.
## 5 39973 ATM4 1484.
## 6 40061 ATM4 1301.
## 7 40237 ATM4 1276.
## 8 40048 ATM4 1246.
## 9 39978 ATM4 1221.
## 10 40085 ATM4 1195.
## # ℹ 355 more rows
There is a clear outlier in ATM4’s data: one value is nearly 11,000, and since ‘Cash’ is recorded in hundreds of dollars that would be over a million dollars withdrawn from a single ATM in one day, which is very unlikely. In preprocessing I will impute a value to address the outlier.
# Exploring ATM3's cash values
atm %>%
filter(ATM == "ATM3") %>%
distinct(Cash)
## # A tibble: 4 × 1
## Cash
## <dbl>
## 1 0
## 2 96
## 3 82
## 4 85
ATM3 has only 3 nonzero Cash values; every other day is 0.
In our data exploration I discovered that I need to reformat the date column, address the NA values in the Cash and ATM columns, and impute a value for the outlier in ATM4.
# Reformatting the date
atm <- atm %>%
mutate(DATE = as.Date(DATE, origin = '1899-12-30'))
# Checking date ranges
atm %>%
group_by(ATM) %>%
summarise(
earliest_date = min(DATE),
latest_date = max(DATE))
## # A tibble: 5 × 3
## ATM earliest_date latest_date
## <chr> <date> <date>
## 1 ATM1 2009-05-01 2010-04-30
## 2 ATM2 2009-05-01 2010-04-30
## 3 ATM3 2009-05-01 2010-04-30
## 4 ATM4 2009-05-01 2010-04-30
## 5 <NA> 2010-05-01 2010-05-14
The date ranges after reformatting are appropriate: each ATM’s range matches the others and represents a year’s worth of daily data ending 4/30/2010.
# Omitting the NA ATM value observations from our data
atm <- atm %>%
filter(!is.na(ATM))
# Rechecking ATM column NA counts
atm %>%
group_by(ATM) %>%
summarise(Count = n())
## # A tibble: 4 × 2
## ATM Count
## <chr> <int>
## 1 ATM1 365
## 2 ATM2 365
## 3 ATM3 365
## 4 ATM4 365
# Rechecking Cash NA counts
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0.0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 0.5
## Median :2009-10-30 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
## NA's :5
There are still 5 NAs in the Cash variable; I will take a closer look.
atm_na <- atm %>%
filter(is.na(Cash))
atm_na
## # A tibble: 5 × 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-18 ATM2 NA
## 4 2009-06-22 ATM1 NA
## 5 2009-06-24 ATM2 NA
Given that the missing values do not appear related to other variables and the data is not structurally missing, I will impute them. I choose imputation over removing the observations to preserve the daily structure of the data for conversion to a time series, and I will impute median values to avoid the influence of extreme values.
# Calculating median for ATM1's Cash
median_cash_atm1 <- median(atm$Cash[atm$ATM == "ATM1"], na.rm = TRUE)
# Impute NA values in Cash for ATM1
atm$Cash[is.na(atm$Cash) & atm$ATM == "ATM1"] <- median_cash_atm1
median_cash_atm1
## [1] 91
# Calculating median for ATM2's Cash
median_cash_atm2 <- median(atm$Cash[atm$ATM == "ATM2"], na.rm = TRUE)
# Impute NA values in Cash for ATM2
atm$Cash[is.na(atm$Cash) & atm$ATM == "ATM2"] <- median_cash_atm2
median_cash_atm2
## [1] 67
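For reference, a single grouped pass gives the same result as the two per-ATM steps above; this is just an equivalent sketch, not a change to the workflow.
# Impute each ATM's median Cash for any remaining NA values in one step
atm <- atm %>%
group_by(ATM) %>%
mutate(Cash = ifelse(is.na(Cash), median(Cash, na.rm = TRUE), Cash)) %>%
ungroup()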
# Checking that all NA values in Cash variable have been imputed
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1460 Min. : 0.0
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 1.0
## Median :2009-10-30 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.3
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
Imputing the outlier in ATM4’s Cash variable with ATM4’s median Cash value
# Finding the atm data index where the ATM4 outlier is located
global_index <- which(atm$ATM == "ATM4")[which.max(atm$Cash[atm$ATM == "ATM4"])]
global_index
## [1] 1380
# Calculating the median Cash for ATM4, excluding the outlier at global_index
median_cash_atm4 <- median(atm$Cash[atm$ATM == "ATM4" & seq_along(atm$ATM) != global_index], na.rm = TRUE)
median_cash_atm4
## [1] 403.3049
# Imputing the Cash variable at the outlier's index
atm$Cash[global_index] <- median_cash_atm4
# Viewing data after imputation
ggplot(atm, aes(DATE, Cash, col = ATM)) +
geom_line() +
facet_wrap(~ ATM, scales = "free_y")
Breaking the data into 4 separate time series
# Converting once to a tsibble keyed by ATM, then filtering per machine
atm_ts <- atm %>%
as_tsibble(index = DATE, key = ATM)
atm1_ts <- atm_ts %>% filter(ATM == "ATM1")
atm2_ts <- atm_ts %>% filter(ATM == "ATM2")
atm3_ts <- atm_ts %>% filter(ATM == "ATM3")
atm4_ts <- atm_ts %>% filter(ATM == "ATM4")
I will begin by analyzing the ACF/PACF plots and the STL decomposition of the data.
atm1_ts %>%
gg_tsdisplay(Cash, plot_type = 'partial')
The autocorrelation spikes at lags 7, 14, and 21 indicate weekly seasonality in the data for ATM1. There is no apparent trend and variability is stable throughout.
The STL decomposition with a weekly period below reflects these findings, showing a stable trend and weekly seasonality that can change over time.
atm1_ts %>%
model(stl = STL(Cash)) %>%
components() %>%
autoplot()
ndiffs(atm1_ts$Cash)
## [1] 0
With a reasonably stable variance and ndiffs() indicating no differencing needed, I will not apply a transformation or differencing.
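Equivalently, the unit-root features in feasts can check both regular and seasonal differencing requirements directly on the tsibble (a sketch):
# unitroot_ndiffs/unitroot_nsdiffs report the differences suggested by unit-root tests
atm1_ts %>%
features(Cash, list(unitroot_ndiffs, unitroot_nsdiffs))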
I will opt to train and test a seasonal naive method and a Holt-Winters additive (ETS) model, and compare them with an auto-optimized ARIMA. My inclination is that the Holt-Winters additive should be best due to the seasonality and lack of trend.
fit <- atm1_ts %>%
model(ARIMA(Cash))
report(fit)
## Series: Cash
## Model: ARIMA(0,0,1)(0,1,2)[7]
##
## Coefficients:
## ma1 sma1 sma2
## 0.1695 -0.5867 -0.0989
## s.e. 0.0553 0.0508 0.0497
##
## sigma^2 estimated as 559.8: log likelihood=-1641.12
## AIC=3290.23 AICc=3290.34 BIC=3305.75
The auto ARIMA indicates that the optimal model is a seasonal ARIMA with one non-seasonal moving average term, one seasonal difference, and two seasonal moving average terms at the weekly lag.
fit <- atm1_ts %>%
model(ETS(Cash))
report(fit)
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.01651456
## gamma = 0.3270894
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 78.33245 -60.17048 -9.153121 17.72025 6.50369 15.97554 16.44455 12.67958
##
## sigma^2: 579.039
##
## AIC AICc BIC
## 4486.250 4486.871 4525.249
The auto ETS indicates that the optimal ETS model is additive in error and seasonality, with no trend component.
# Train through March, test on April (matching the splits used for the other ATMs)
atm1_train <- atm1_ts %>%
filter(DATE < "2010-04-01")
atm1_test <- atm1_ts %>%
filter(DATE >= "2010-04-01")
atm1_fits <- atm1_train %>%
model(
SNAIVE = SNAIVE(Cash),
auto_ETS = ETS(Cash),
'Auto ARIMA' = ARIMA(Cash)
)
atm1_forecasts <- atm1_fits %>%
forecast(h = 30)
# Plotting forecasts
atm1_forecasts %>%
autoplot(atm1_test) +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "ATM1 Forecasts: April") +
xlab("Date") +
ylab("Cash in Hundreds")
atm1_forecasts %>%
accuracy(atm1_test)
## # A tibble: 3 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Auto ARIMA ATM1 Test -7.24 12.6 10.0 -85.8 88.9 NaN NaN -0.340
## 2 SNAIVE ATM1 Test -8.17 16.8 14.5 -69.5 76.6 NaN NaN -0.0151
## 3 auto_ETS ATM1 Test -6.52 12.1 9.55 -64.5 67.8 NaN NaN -0.325
From our accuracy values we can tell that the best model here is the auto ETS, which scored the lowest RMSE and MAE.
# Fitting ETS for final forecast output
atm1_fit_ets <- atm1_ts %>%
model(ETS = ETS(Cash))
atm1_forecast_ets <- atm1_fit_ets %>%
forecast(h =30)
atm1_forecast_ets %>%
autoplot(atm1_ts) +
labs(title = "ATM1 ETS: May",
y = "Cash in Hundreds")
atm1_forecast_results <- as.data.frame(atm1_forecast_ets) %>%
select(DATE, .mean) %>% # Using mean of forecasts to represent the point estimate projections
rename(Date = DATE, Cash = .mean) %>%
mutate(Cash=round(Cash,2))
atm1_forecast_results
## Date Cash
## 1 2010-05-01 87.05
## 2 2010-05-02 100.76
## 3 2010-05-03 73.11
## 4 2010-05-04 5.74
## 5 2010-05-05 100.13
## 6 2010-05-06 79.43
## 7 2010-05-07 85.60
## 8 2010-05-08 87.05
## 9 2010-05-09 100.76
## 10 2010-05-10 73.11
## 11 2010-05-11 5.74
## 12 2010-05-12 100.13
## 13 2010-05-13 79.43
## 14 2010-05-14 85.60
## 15 2010-05-15 87.05
## 16 2010-05-16 100.76
## 17 2010-05-17 73.11
## 18 2010-05-18 5.74
## 19 2010-05-19 100.13
## 20 2010-05-20 79.43
## 21 2010-05-21 85.60
## 22 2010-05-22 87.05
## 23 2010-05-23 100.76
## 24 2010-05-24 73.11
## 25 2010-05-25 5.74
## 26 2010-05-26 100.13
## 27 2010-05-27 79.43
## 28 2010-05-28 85.60
## 29 2010-05-29 87.05
## 30 2010-05-30 100.76
write.csv(atm1_forecast_results, file = "atm1_forecast_results.csv")
Starting with an STL decomposition to observe components, and ACF/PACF plots
atm2_ts %>%
model(stl = STL(Cash)) %>%
components() %>%
autoplot()
gg_tsdisplay(atm2_ts, Cash)
For ATM2 we see similar components to ATM1: the time plot shows stable variance throughout, a relatively flat trend that perhaps shows cyclic behavior, and weekly seasonality. The ACF plot again spikes at lags 7, 14, and 21.
ndiffs(atm2_ts$Cash)
## [1] 1
The ndiffs() function suggests one round of differencing is required to make the data stationary; given the strong weekly pattern I apply a seasonal (lag-7) difference rather than a first difference.
# Differencing the data for weekly seasonality
atm2_ts <- atm2_ts %>%
mutate(diff_ATM2 = difference(Cash, lag = 7))
# Checking ndiffs after differencing indicates no additional differencing required
ndiffs(atm2_ts$diff_ATM2)
## [1] 0
fit <- atm2_ts %>%
model(ARIMA(diff_ATM2))
report(fit)
## Series: diff_ATM2
## Model: ARIMA(0,0,0)(2,0,0)[7]
##
## Coefficients:
## sar1 sar2
## -0.5795 -0.1814
## s.e. 0.0520 0.0516
##
## sigma^2 estimated as 654.4: log likelihood=-1672.25
## AIC=3350.49 AICc=3350.56 BIC=3362.19
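As an aside, a hedged alternative is to skip the manual differencing and let ARIMA() select the differencing itself by modeling Cash directly; the resulting forecasts then stay on the original scale.
# Sketch: ARIMA on the undifferenced series; fable chooses d and D internally
atm2_ts %>%
model(ARIMA(Cash)) %>%
report()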
fit <- atm2_ts %>%
model(ETS(Cash))
report(fit)
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.000100055
## gamma = 0.3627371
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 72.97637 -46.64692 -22.2591 4.814576 0.2230553 20.58732 14.88622 28.39485
##
## sigma^2: 644.2689
##
## AIC AICc BIC
## 4525.212 4525.834 4564.211
Our auto ARIMA suggests that a seasonal ARIMA(0,0,0)(2,0,0)[7] is optimal, with two seasonal autoregressive terms.
To compare models here we have to use both the differenced and nondifferenced data: the ARIMA model requires the differenced series, while the ETS and SNAIVE models do not. Similarly to ATM1, I suspect an ETS model will do well here because of the weekly seasonality and lack of trend.
atm2_train <- atm2_ts %>%
filter(DATE < "2010-04-01")
atm2_test <- atm2_ts %>%
filter(DATE >= "2010-04-01")
# Nondifferenced models
atm2_fit_nondiff <- atm2_train %>%
model(
SNAIVE = SNAIVE(Cash),
ETS = ETS(Cash)
)
# Differenced models (dropping the leading NAs created by the lag-7 difference)
atm2_fit_diff <- atm2_train %>%
filter(!is.na(diff_ATM2)) %>%
model(
`Auto ARIMA` = ARIMA(diff_ATM2, stepwise = FALSE, approx = FALSE)
)
atm2_forecast_nondiff <- atm2_fit_nondiff %>%
forecast(h = 30)
atm2_forecast_diff <- atm2_fit_diff %>%
forecast(h = 30)
# Plotting nondifferenced models
atm2_forecast_nondiff %>%
autoplot(atm2_ts, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "ATM2 Forecasts: April") +
xlab("Date") +
ylab("Cash in Hundreds")
# Plotting differenced models
atm2_forecast_diff %>%
autoplot(atm2_ts, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "ATM2 Forecasts: April") +
xlab("Date") +
ylab("Cash in Hundreds")
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).
atm2_forecast_nondiff %>%
accuracy(atm2_test)
## # A tibble: 2 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS ATM2 Test 9.51 19.4 14.3 -26.3 58.2 NaN NaN -0.0932
## 2 SNAIVE ATM2 Test 12.9 26.2 17.8 35.1 45.1 NaN NaN -0.319
atm2_forecast_diff %>%
accuracy(atm2_test)
## # A tibble: 1 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Auto ARIMA ATM2 Test 3.71 20.1 14.2 Inf Inf NaN NaN -0.312
Our auto ETS model had the lowest RMSE, suggesting the best fit; the ARIMA(0,0,0)(2,0,0)[7] came very close on RMSE and slightly lower on MAE (though its errors are measured on the differenced scale). I will forecast using the auto ETS on the nondifferenced data.
atm2_fit_ets <- atm2_ts %>%
model(ETS = ETS(Cash))
atm2_forecast_ets <- atm2_fit_ets %>%
forecast(h = 30)
atm2_forecast_ets %>%
autoplot(atm2_ts) +
labs(title = "ATM2 ETS Nondiff: May",
y = "Cash in Hundreds")
atm2_forecast_results <- as.data.frame(atm2_forecast_ets) %>%
select(DATE, .mean) %>% # Using mean of forecasts to represent the point estimate projections
rename(Date = DATE, Cash = .mean) %>%
mutate(Cash=round(Cash,2))
atm2_forecast_results
## Date Cash
## 1 2010-05-01 68.35
## 2 2010-05-02 74.19
## 3 2010-05-03 11.09
## 4 2010-05-04 2.14
## 5 2010-05-05 101.60
## 6 2010-05-06 92.38
## 7 2010-05-07 68.98
## 8 2010-05-08 68.35
## 9 2010-05-09 74.19
## 10 2010-05-10 11.09
## 11 2010-05-11 2.14
## 12 2010-05-12 101.60
## 13 2010-05-13 92.38
## 14 2010-05-14 68.98
## 15 2010-05-15 68.35
## 16 2010-05-16 74.19
## 17 2010-05-17 11.09
## 18 2010-05-18 2.14
## 19 2010-05-19 101.60
## 20 2010-05-20 92.38
## 21 2010-05-21 68.98
## 22 2010-05-22 68.35
## 23 2010-05-23 74.19
## 24 2010-05-24 11.09
## 25 2010-05-25 2.14
## 26 2010-05-26 101.60
## 27 2010-05-27 92.38
## 28 2010-05-28 68.98
## 29 2010-05-29 68.35
## 30 2010-05-30 74.19
write.csv(atm2_forecast_results, file = "atm2_forecast_results.csv")
Our ATM3 data contains only 3 observations with Cash withdrawn above $0.
atm3_ts %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`
Because of the lack of data, our best forecast here is likely the naive model, where each future value equals the most recent observation. Another option would be a random walk with drift, but with only 3 nonzero data points any trend estimate is unreliable, and training and testing models is ultimately not feasible.
atm3_fit <- atm3_ts %>%
model(NAIVE(Cash))
atm3_forecast <- atm3_fit %>%
forecast(h = 30)
atm3_forecast %>%
autoplot(atm3_ts)
My recommendation is to wait until more data is available for analysis, or to attempt to obtain more data in case this ATM was out of service and has history prior to the date range at our disposal.
atm3_forecast_results <- as.data.frame(atm3_forecast) %>%
select(DATE, .mean) %>% # Using mean of forecasts to represent the point estimate projections
rename(Date = DATE, Cash = .mean) %>%
mutate(Cash=round(Cash,2))
atm3_forecast_results
## Date Cash
## 1 2010-05-01 85
## 2 2010-05-02 85
## 3 2010-05-03 85
## 4 2010-05-04 85
## 5 2010-05-05 85
## 6 2010-05-06 85
## 7 2010-05-07 85
## 8 2010-05-08 85
## 9 2010-05-09 85
## 10 2010-05-10 85
## 11 2010-05-11 85
## 12 2010-05-12 85
## 13 2010-05-13 85
## 14 2010-05-14 85
## 15 2010-05-15 85
## 16 2010-05-16 85
## 17 2010-05-17 85
## 18 2010-05-18 85
## 19 2010-05-19 85
## 20 2010-05-20 85
## 21 2010-05-21 85
## 22 2010-05-22 85
## 23 2010-05-23 85
## 24 2010-05-24 85
## 25 2010-05-25 85
## 26 2010-05-26 85
## 27 2010-05-27 85
## 28 2010-05-28 85
## 29 2010-05-29 85
## 30 2010-05-30 85
write.csv(atm3_forecast_results, file = "atm3_forecast_results.csv")
atm4_ts %>%
model(stl = STL(Cash)) %>%
components() %>%
autoplot()
gg_tsdisplay(atm4_ts, Cash)
Similarly to ATM1 and ATM2, the ACF spikes at lags 7, 14, and 21, indicating weekly seasonality, with not much trend-cycle. However, the time plot above shows changing variability, signaling that a Box-Cox transformation can help stabilize the variance.
atm4_lambda <- atm4_ts %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
atm4_lambda
## [1] 0.4525599
With a suggested lambda of 0.4525599 we can transform the data. One note of caution: fable only back-transforms forecasts automatically when the transformation is specified inside the model formula (e.g. ETS(box_cox(Cash, lambda))); because I create Cash_boxcox as a separate column below, the models and their forecasts remain on the transformed scale.
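For reference, a minimal sketch of that pattern; with the transformation inside the formula, forecast() returns point forecasts already back-transformed to hundreds of dollars.
# Sketch: transformation specified in the model formula, so fable back-transforms
atm4_fit_bt <- atm4_ts %>%
model(ETS = ETS(box_cox(Cash, atm4_lambda)))
atm4_fit_bt %>%
forecast(h = 30)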
atm4_ts <- atm4_ts %>%
mutate(Cash_boxcox = box_cox(Cash, atm4_lambda))
atm4_ts %>%
model(stl = STL(Cash_boxcox)) %>%
components() %>%
autoplot()
gg_tsdisplay(atm4_ts, Cash_boxcox)
As we can see, the transformation has smoothed the variance; however, the ACF spikes remain.
ndiffs(atm4_ts$Cash_boxcox)
## [1] 0
No differencing is suggested by ndiffs() post transformation.
For ATM4 we will train SNAIVE, auto ARIMA, and auto ETS models. Due to the oscillations in the remainder component, I believe the SNAIVE might perform best; the ETS model may overfit in this case.
atm4_train <- atm4_ts %>%
filter(DATE < "2010-04-01")
atm4_test <- atm4_ts %>%
filter(DATE >= "2010-04-01")
atm4_fits <- atm4_train %>%
model(
SNAIVE = SNAIVE(Cash_boxcox),
auto_ETS = ETS(Cash_boxcox),
arima = ARIMA(Cash_boxcox)
)
atm4_forecasts <- atm4_fits %>%
forecast(h = 30)
atm4_forecasts %>%
autoplot(atm4_test)+
labs(
y = "Cash in Hundreds",
title = "ATM4 Forecasts: April"
)
atm4_forecasts %>%
accuracy(atm4_test)
## # A tibble: 3 × 11
## .model ATM .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE ATM4 Test -5.89 12.2 8.87 -83.3 91.5 NaN NaN 0.0751
## 2 arima ATM4 Test -1.88 12.2 9.78 -128. 147. NaN NaN -0.0836
## 3 auto_ETS ATM4 Test -2.18 10.7 8.41 -108. 123. NaN NaN 0.0608
For ATM4 our auto_ETS model performed best, with the lowest RMSE and MAE values (note these metrics are on the Box-Cox scale, since all three models were fit to Cash_boxcox).
atm4_fit_ets <- atm4_ts %>%
model(ETS = ETS(Cash_boxcox))
atm4_forecasts_ets <- atm4_fit_ets %>%
forecast(h=30)
atm4_forecasts_ets %>%
autoplot(atm4_ts) +
labs(title = "ATM4 ETS: May",
y = "Cash (Box-Cox scale)")
atm4_forecast_results <- as.data.frame(atm4_forecasts_ets) %>%
select(DATE, .mean) %>% # Using mean of forecasts to represent the point estimate projections
rename(Date = DATE, Cash = .mean) %>%
mutate(Cash=round(Cash,2))
atm4_forecast_results
## Date Cash
## 1 2010-05-01 25.51
## 2 2010-05-02 29.59
## 3 2010-05-03 29.98
## 4 2010-05-04 9.22
## 5 2010-05-05 33.29
## 6 2010-05-06 27.14
## 7 2010-05-07 34.14
## 8 2010-05-08 25.51
## 9 2010-05-09 29.59
## 10 2010-05-10 29.98
## 11 2010-05-11 9.22
## 12 2010-05-12 33.29
## 13 2010-05-13 27.14
## 14 2010-05-14 34.14
## 15 2010-05-15 25.51
## 16 2010-05-16 29.59
## 17 2010-05-17 29.98
## 18 2010-05-18 9.22
## 19 2010-05-19 33.29
## 20 2010-05-20 27.14
## 21 2010-05-21 34.14
## 22 2010-05-22 25.51
## 23 2010-05-23 29.59
## 24 2010-05-24 29.98
## 25 2010-05-25 9.22
## 26 2010-05-26 33.29
## 27 2010-05-27 27.14
## 28 2010-05-28 34.14
## 29 2010-05-29 25.51
## 30 2010-05-30 29.59
write.csv(atm4_forecast_results, file = "atm4_forecast_results.csv")
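A caveat on the table above: because the ETS was fit to Cash_boxcox, these Cash values are on the Box-Cox scale rather than in hundreds of dollars. A minimal sketch of back-transforming the point forecasts with fabletools::inv_box_cox() before reporting:
# Invert the Box-Cox transformation on the point forecasts
atm4_forecast_results <- atm4_forecast_results %>%
mutate(Cash = round(inv_box_cox(Cash, atm4_lambda), 2))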
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
# Loading in provided excel file of residential power usage
power_data <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
head(power_data)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
For the purposes of our time series we will drop the CaseSequence variable, rename `YYYY-MMM` to Date and set it as the index, and convert the new date variable from character to a date type.
summary(power_data)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
We will address the one missing value with imputation.
# Identifying index of NA value
which(is.na(power_data$KWH))
## [1] 129
# Using linear interpolation to estimate missing value based on neighboring data points.
power_data$KWH <- na.approx(power_data$KWH)
# Checking that no NA values are left
summary(power_data)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5434539
## Median :828.5 Mode :character Median : 6314472
## Mean :828.5 Mean : 6502824
## 3rd Qu.:876.2 3rd Qu.: 7608792
## Max. :924.0 Max. :10655730
# Creating time series with date and KWH, dropping unneeded CaseSequence
power_data <- power_data %>%
rename(Date = `YYYY-MMM`) %>%
mutate(Date = yearmonth(Date))
# Create tsibble
power_ts <- power_data %>%
as_tsibble(index = Date)
# Drop CaseSequence
power_ts <- power_ts %>%
select(-CaseSequence)
First visualizing the data and inspecting components
gg_tsdisplay(power_ts)
## Plot variable not specified, automatically selected `y = KWH`
power_ts %>%
model(stl = STL(KWH)) %>%
components() %>%
autoplot()
We can also see that there is an outlier in the data, which I will address via linear interpolation. In the gg_tsdisplay plot above, the outlier sits far below even the troughs of the lowest-usage months; the recording is likely an error.
# Addressing outlier via linear interpolation.
min_kwh <- min(power_ts$KWH, na.rm = TRUE)
print(min_kwh)
## [1] 770523
# Replacing outlier with NA
power_ts <- power_ts %>%
mutate(KWH = ifelse(KWH == min_kwh, NA, KWH))
# Using interpolation on created NA value
power_ts <- power_ts %>%
mutate(KWH = na.approx(KWH))
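As an alternative sketch, fabletools can interpolate missing values with a fitted model, which respects seasonality better than a straight line; this assumes the outlier is still NA at this point rather than already filled.
# Model-based interpolation of NA values (a seasonal ARIMA fills the gap)
power_ts %>%
model(ARIMA(KWH)) %>%
interpolate(power_ts)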
Showing plots after addressing outlier
gg_tsdisplay(power_ts, KWH)
power_ts %>%
model(stl = STL(KWH)) %>%
components() %>%
autoplot()
The two plots above show a recently accelerating positive trend in the data and strong monthly seasonality. The ACF plot shows a 6-month pattern, with autocorrelation spikes at lags 6, 12, and 18. The monthly seasonality is likely tied to the changing of the seasons, and the seasonal range does not appear to be increasing: the variability is stable.
ndiffs(power_ts$KWH)
## [1] 1
Our ndiffs() function, the trend seen above, and the ACF plot indicate nonstationary data, so I will apply a seasonal (lag-12) difference before the ARIMA modeling. For the model comparison I will use an auto ARIMA on the differenced series, an auto ETS, and a seasonal naive benchmark.
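Since ndiffs() only tests non-seasonal differencing, the seasonal analogue is worth a quick check (a sketch; nsdiffs() needs a ts object with the frequency set):
# Seasonal differencing check on the monthly series
nsdiffs(ts(power_ts$KWH, frequency = 12))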
# Differencing the data for monthly seasonality
power_ts <- power_ts %>%
mutate(diff_KWH = difference(KWH, lag = 12))
# Checking ndiffs after differencing indicates no additional differencing required
ndiffs(power_ts$diff_KWH)
## [1] 0
fit <- power_ts %>%
model(ARIMA(diff_KWH))
report(fit)
## Series: diff_KWH
## Model: ARIMA(1,0,0)(2,0,0)[12] w/ mean
##
## Coefficients:
## ar1 sar1 sar2 constant
## 0.3073 -0.7261 -0.4155 158323.36
## s.e. 0.0752 0.0767 0.0782 49164.21
##
## sigma^2 estimated as 3.774e+11: log likelihood=-2662.58
## AIC=5335.16 AICc=5335.48 BIC=5351.45
fit <- power_ts %>%
model(ETS = ETS(KWH))
report(fit)
## Series: KWH
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.1152884
## gamma = 0.1983171
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 6188218 0.9083241 0.7515655 0.9314405 1.225896 1.276768 1.233079 1.016933
## s[-7] s[-8] s[-9] s[-10] s[-11]
## 0.7623912 0.804406 0.8855059 1.011111 1.19258
##
## sigma^2: 0.0094
##
## AIC AICc BIC
## 6145.320 6148.047 6194.182
For our models I will compare an SNAIVE, the ARIMA(1,0,0)(2,0,0)[12] w/ mean (seasonal autoregressive), and the ETS(M,N,M). The auto ETS suggests the seasonality is multiplicative, something I did not catch when observing our decomposition.
power_ts_train <- power_ts %>%
filter(Date < yearmonth("2013 Jan"))
power_ts_test <- power_ts %>%
filter(Date >= yearmonth("2013 Jan"))
# Nondifferenced models
power_fit_nondiff <- power_ts_train %>%
model(
SNAIVE = SNAIVE(KWH),
ETS = ETS(KWH)
)
# Differenced models
power_fit_diff <- power_ts_train %>%
model(
`Auto ARIMA` = ARIMA(diff_KWH, stepwise = FALSE, approx = FALSE)
)
power_forecast_nondiff <- power_fit_nondiff %>%
forecast(h = 12)
power_forecast_diff <- power_fit_diff %>%
forecast(h = 12)
# Plotting nondifferenced models
power_forecast_nondiff %>%
autoplot(power_ts, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "Non Diff Model Forecasts: 2013") +
xlab("Date") +
ylab("KWH")
# Plotting differenced models
power_forecast_diff %>%
autoplot(power_ts, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "Diff Model Forecasts: 2013") +
xlab("Date") +
ylab("KWH")
power_forecast_nondiff %>%
accuracy(power_ts_test)
## # A tibble: 2 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS Test 381167. 1063141. 631440. 3.50 7.29 NaN NaN 0.0249
## 2 SNAIVE Test 405195. 1035538. 618606. 4.55 7.06 NaN NaN -0.0313
power_forecast_diff %>%
accuracy(power_ts_test)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Auto ARIMA Test 234994. 936323. 625297. -329. 539. NaN NaN -0.0436
Of our three models, the lowest RMSE belongs to the ARIMA(1,0,0)(2,0,0)[12] w/ mean, which fits the differenced data with one non-seasonal autoregressive term and two seasonal autoregressive terms. Among the models run on the nondifferenced data, the SNAIVE was more accurate than the ETS; note, though, that the ARIMA’s errors are measured on the differenced scale, so the cross-scale comparison is only indicative. The results are plausible, as ETS can struggle when a trend is present while the ARIMA adapts to it here.
I will move forward with the ARIMA model.
power_model_ARIMA <- power_ts %>%
model(ARIMA(diff_KWH))
power_forecast_ARIMA <- power_model_ARIMA %>%
forecast(h = 12)
power_forecast_ARIMA %>%
autoplot(power_ts) +
labs(title = "ARIMA(1,0,0)(2,0,0)[12] monthly forecast: 2014",
y = "KWH")
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
power_forecast_results <- as.data.frame(power_forecast_ARIMA) %>%
select(Date, .mean)
power_forecast_results
## Date .mean
## 1 2014 Jan -477705.329
## 2 2014 Feb 1048413.918
## 3 2014 Mar 182796.294
## 4 2014 Apr -90589.288
## 5 2014 May 6132.105
## 6 2014 Jun 281254.815
## 7 2014 Jul 1072841.129
## 8 2014 Aug 904218.322
## 9 2014 Sep 506776.073
## 10 2014 Oct 107295.038
## 11 2014 Nov 376991.984
## 12 2014 Dec -1256249.701
write.csv(power_forecast_results, file = "power_forecast_results.csv")
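One caveat worth flagging: because the final model was fit to diff_KWH, the .mean values above (including the negatives) are forecasts of the lag-12 seasonal differences, not of KWH itself. A minimal sketch of two ways to report on the original KWH scale, assuming the objects defined above:
# Option 1: model KWH directly; ARIMA() applies any needed differencing
# internally and forecasts in kilowatt hours
power_model_kwh <- power_ts %>%
model(ARIMA(KWH))
power_forecast_kwh <- power_model_kwh %>%
forecast(h = 12)
# Option 2: reconstruct levels from the differences, since
# diff_KWH[t] = KWH[t] - KWH[t-12] implies KWH[t] = diff[t] + KWH[t-12]
power_forecast_levels <- power_forecast_results %>%
mutate(KWH = .mean + tail(power_ts$KWH, 12))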