Import Libraries
library(fpp3)
library(dplyr)
library(tidyr)
library(ggplot2)
library(readxl)
Part 1 - ATM Data
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
Load and Inspect Data
atm <- read_excel("/Users/alecmccabe/Downloads/ATM624DATA.xlsx", col_types = c('date', 'text', 'numeric')) %>%
# converting into date format
mutate(DATE = as_date(DATE)) %>%
# convert to tsibble with DATE as index and ATM as key
as_tsibble(index = DATE, key = ATM)
print(dim(atm))
## [1] 1474    3
summary(atm)
##       DATE                 ATM                 Cash        
##  Min.   :2009-05-01   Length:1474         Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character    1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character    Median :   73.0  
##  Mean   :2009-10-31                       Mean   :  155.6  
##  3rd Qu.:2010-02-01                       3rd Qu.:  114.0  
##  Max.   :2010-05-14                       Max.   :10919.8  
##                                           NA's   :19
The ATM dataset contains 1474 entries. Each entry corresponds to the cash withdrawn from a specific ATM on a specific date. The dates range from May 2009 to May 2010.
colSums(is.na(atm))
## DATE  ATM Cash 
##    0   14   19
atm[is.na(atm$ATM),]
There are 14 missing values in the ATM column. Upon inspection, they all fall in May 2010, the month we are attempting to forecast, so these placeholder rows can be removed without losing information.
atm <- atm %>%
  drop_na(ATM)
atm[is.na(atm$Cash),]
After removing the entries with missing ATM values, 5 entries with missing Cash values remain, all from June 2009. In later steps we will consider different ways to handle these.
We can overcome the missing values in a few ways:
- First, we can replace the missing values with the mean or median value for that ATM and month.
- Second, we can aggregate the data to a weekly or monthly level. This may not be the best approach, however, since all of the missing Cash values come from the same two-week period in June; a monthly aggregation could potentially be appropriate.
- Lastly, we can skip imputation entirely and rely on an ARIMA model, which can inherently handle missing values.
For this project, I will explore option 1.
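For reference, option 3 would look something like the sketch below, using fable’s interpolate() (the same approach is applied to the power data in Part 2; atm_interp is a hypothetical name, and the chunk is shown for illustration only):
# sketch: fit an ARIMA model per ATM, then interpolate the missing Cash values
atm_interp <- atm %>%
  model(ARIMA(Cash)) %>%
  interpolate(atm)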
# calculate June median values for each ATM
atm1_median <- median(atm[(atm$ATM == 'ATM1') & (month(atm$DATE)==6) & (!is.na(atm$Cash)),]$Cash)
atm2_median <- median(atm[(atm$ATM == 'ATM2') & (month(atm$DATE)==6) & (!is.na(atm$Cash)),]$Cash)
# replace missing values with median values
atm[(atm$ATM == 'ATM1') & (is.na(atm$Cash)),]$Cash = atm1_median
atm[(atm$ATM == 'ATM2') & (is.na(atm$Cash)),]$Cash = atm2_median
Exploring the Data
atm %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Time Series by ATM")From the above, we can see that ATMs 1 and 2 have similar series values, with continued activity (withdrawls) across the period. It is difficult to ascertain the behavior of ATM4 because of an extreme outlier which we will need to address.
ATM 3 differs, mainly in that it exhibits significantly smaller levels of cash withdrawals. In fact, ATM3 has zero withdrawals until late April 2010, after which there is a large spike. Is this spike an outlier, or was this third ATM newly introduced to a busy corner? Regardless, we do not have enough data for this ATM to include it in our analysis, so we will remove it in further steps.
atm <- atm %>%
  filter(ATM != 'ATM3')
For ATM 4, we will need to address the single outlier occurring around March 2010. We can do this by replacing it with the median of the series.
# replace ATM4's maximum value with the series median
# (max_val / median_val avoid masking base R's max() and median())
max_val <- max(atm[atm$ATM=='ATM4',]$Cash)
median_val <- median(atm[atm$ATM=='ATM4',]$Cash)
atm[(atm$ATM=='ATM4') & (atm$Cash == max_val),]$Cash = median_val
atm %>%
autoplot(Cash) +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Time Series by ATM")atm %>%
ACF(Cash) %>%
autoplot() +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Time Series by ATM")From the ACT plots above, we can see that ATMs 1 & 2 both exhibit weekly seasonality. We cannot see the same type of seasonality from ATM 4.
Exploring ATM 1
atm1 <- atm %>%
filter(ATM == 'ATM1')
# ATM1 STL Decomposition
atm1 %>%
model(STL(Cash ~ season(window="periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "ATM1 with Seasonal Decomposition")Based on the seasonal decomposition above, we can see that there is no significant trend in the series. The data is follows a weekly seasonality. We also see that the remainder is mostly random, but exhibits seemingly non-random behavior around March of 2010.
atm1 |>
gg_lag(Cash, geom = "point") +
labs(x = "lag(Cash, k)")Now looking at the lag plot for ATM 1, we can once again see that this ATM’s time series follows a weekly (7-day) seasonality cycle.
# create training set and compare against last months
atm1_train <- atm1 %>%
filter(DATE < '2010-04-01')
We can test a range of models and transformations using the code below:
# lambda for box cox transformation
atm1_lambda <- atm1_train %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
atm1_fit <- atm1_train %>%
model(
# additive ETS model
add_ets = ETS(Cash ~ error("A") + trend("N") + season("A")),
# multiplicative ETS model
mult_ets = ETS(Cash ~ error("M") + trend("N") + season("M")),
# SNAIVE model
snaive = SNAIVE(Cash),
# transformed additive ETS model
add_ets_transform = ETS(box_cox(Cash,atm1_lambda) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
mult_ets_transform = ETS(box_cox(Cash,atm1_lambda) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_transform = SNAIVE(box_cox(Cash,atm1_lambda)),
# arima model
ARIMA = ARIMA(box_cox(Cash,atm1_lambda))
)
accuracy(atm1_fit)
For ATM 1, the transformed multiplicative ETS model produced the lowest RMSE.
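To rank the candidates directly, the accuracy table can be sorted by RMSE (a one-line sketch):
# sketch: order the training accuracy table by RMSE, best model first
accuracy(atm1_fit) %>%
  arrange(RMSE)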
atm1_forecast <- atm1_fit %>%
forecast(h = 31) %>%
filter(.model=='mult_ets_transform')
atm1_forecast %>%
autoplot(atm1, level=NULL) +
labs(title="Transformed Multiplicative ETS Forecast")The Transformed Multiplicative model seems to over-predict certain forecasted values when compared against actual April 2010 values. However the fit is still good.
# report of the transformed multiplicative ETS model
atm1_fit %>% select(.model = "mult_ets_transform") %>% report()
## Series: Cash
## Model: ETS(M,N,M)
## Transformation: box_cox(Cash, atm1_lambda)
## Smoothing parameters:
## alpha = 0.03183972
## gamma = 0.2749021
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 11.94849 0.4187292 1.012255 1.149805 1.054564 1.076637 1.048191 1.239819
##
## sigma^2: 0.0533
##
## AIC AICc BIC
## 2657.902 2658.581 2696.043
# residuals
atm1_fit %>%
select("mult_ets_transform") %>%
gg_tsresiduals()
# Box-Pierce test, ℓ = 2m for seasonal data, m = 7
atm1_fit %>%
select(.model = "mult_ets_transform") %>%
augment() %>%
features(.innov, box_pierce, lag = 14, dof = 0)
Based on the plots above, the residuals appear roughly normally distributed and largely uncorrelated, so they can be considered white noise.
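The Ljung-Box statistic is a common alternative to Box-Pierce with better small-sample behavior; a sketch with the same lag choice:
# sketch: Ljung-Box test on the innovation residuals, lag = 2m with m = 7
atm1_fit %>%
  select(.model = "mult_ets_transform") %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14)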
Exploring ATM 2
atm2 <- atm %>%
filter(ATM == 'ATM2')
# ATM2 STL Decomposition
atm2 %>%
model(STL(Cash ~ season(window="periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "ATM2 with Seasonal Decomposition")Based on the seasonal decomposition above, we can see that there is no significant trend in the series. However, the trend in the final months becomes erratic.
The data is follows a weekly seasonality similar to ATM 1. We also see that the remainder is mostly random, but exhibits seemingly non-random behavior around March of 2010 (also similar to ATM 1).
atm2 |>
gg_lag(Cash, geom = "point") +
labs(x = "lag(Cash, k)")Looking at the lag plots for ATM 2, it becomes harder to determine the appropriate autocorrelation value. That being said, lag 7 seems to be the most correlated, which would agree with the above STL analysis suggesting a weekly seasonality.
# create training set and compare against last months
atm2_train <- atm2 %>%
filter(DATE < '2010-04-01')
# lambda for box cox transformation
atm2_lambda <- atm2_train %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
atm2_fit <- atm2_train %>%
model(
# additive ETS model
add_ets = ETS(Cash ~ error("A") + trend("N") + season("A")),
# multiplicative ETS model
mult_ets = ETS(Cash ~ error("M") + trend("N") + season("M")),
# SNAIVE model
snaive = SNAIVE(Cash),
# transformed additive ETS model
add_ets_transform = ETS(box_cox(Cash,atm2_lambda) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
mult_ets_transform = ETS(box_cox(Cash,atm2_lambda) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_transform = SNAIVE(box_cox(Cash,atm2_lambda)),
# arima model
ARIMA = ARIMA(box_cox(Cash,atm2_lambda))
)
accuracy(atm2_fit)
For ATM 2, ARIMA and the transformed additive ETS model produced the lowest RMSE values. We can inspect each model’s report and compare their forecasts against the held-out April data to decide which to move forward with.
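Since both models were trained on data through March 2010, a direct comparison is their accuracy on the held-out April observations (a sketch; h = 30 covers April):
# sketch: score the candidate models on the held-out April 2010 data
atm2_fit %>%
  forecast(h = 30) %>%
  accuracy(atm2) %>%
  arrange(RMSE)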
atm2_fit %>%
forecast(h = 31) %>%
filter(.model %in% c('add_ets_transform', 'ARIMA')) %>%
autoplot(atm2, level=NULL) +
labs(title="Transformed Additive ETS & ARIMA Forecast")BOth models do a decent job at forecasting the actual values in April. However it appears that Arima has a most constant mean. Both seem to miss the large negative swings.
# report of the transformed additive ETS model
atm2_fit %>% select(.model = "add_ets_transform") %>% report()
## Series: Cash
## Model: ETS(A,N,A)
## Transformation: box_cox(Cash, atm2_lambda)
## Smoothing parameters:
## alpha = 0.0001000105
## gamma = 0.3788876
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 28.54438 -19.49563 -14.30877 4.824496 1.906367 7.832009 9.026157 10.21537
##
## sigma^2: 83.0954
##
## AIC AICc BIC
## 3439.307 3439.986 3477.448
# report of the ARIMA model
atm2_fit %>% select(.model = "ARIMA") %>% report()
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, atm2_lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4242 -0.9271 0.462 0.8169 -0.7544
## s.e. 0.0503 0.0425 0.081 0.0574 0.0460
##
## sigma^2 estimated as 78.22: log likelihood=-1180.26
## AIC=2372.52 AICc=2372.78 BIC=2395.27
The ARIMA model reports far lower AIC and BIC values, but these are not directly comparable with the ETS values, since the two model classes compute their likelihoods differently. Given that ARIMA also has the lowest RMSE, it remains a reasonable choice to move forward with for ATM 2.
atm2_forecast <- atm2_fit %>%
forecast(h = 31) %>%
filter(.model %in% c('ARIMA'))
# residuals
atm2_fit %>%
select("ARIMA") %>%
gg_tsresiduals()
# Box-Pierce test, ℓ = 2m for seasonal data, m = 7
atm2_fit %>%
select(.model = "ARIMA") %>%
augment() %>%
features(.innov, box_pierce, lag = 14, dof = 0)
From the residual plots above, we can see that the residuals are near-normally distributed with a slight right skew. That said, the skew is marginal and the residuals show little remaining autocorrelation, so we can consider them white noise.
Exploring ATM 4
atm4 <- atm %>%
filter(ATM == 'ATM4')
# ATM4 STL Decomposition
atm4 %>%
model(STL(Cash ~ season(window="periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "ATM4 with Seasonal Decomposition")Based on the seasonal decomposition above, we can see that there is no significant trend in the series. Unlike ATM2, this series’ trend exhibits not erratic behavior in the final months.
The data is follows a weekly seasonality. We also see that the remainder is mostly random, but exhibits seemingly non-random behavior around March of 2010.
atm4 |>
gg_lag(Cash, geom = "point") +
labs(x = "lag(Cash, k)")Again, when looking at the lag plots, it becomes difficult to match the STL decomposition seasonality assesment. According to the STL decomposition, this ATM4 series follows a weekly seasonality. That is very difficult to notice using these lag plots.
# create training set and compare against last months
atm4_train <- atm4 %>%
filter(DATE < '2010-04-01')
# lambda for box cox transformation
atm4_lambda <- atm4_train %>%
features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
atm4_fit <- atm4_train %>%
model(
# additive ETS model
add_ets = ETS(Cash ~ error("A") + trend("N") + season("A")),
# multiplicative ETS model
mult_ets = ETS(Cash ~ error("M") + trend("N") + season("M")),
# SNAIVE model
snaive = SNAIVE(Cash),
# transformed additive ETS model
add_ets_transform = ETS(box_cox(Cash,atm4_lambda) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
mult_ets_transform = ETS(box_cox(Cash,atm4_lambda) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_transform = SNAIVE(box_cox(Cash,atm4_lambda)),
# arima model
ARIMA = ARIMA(box_cox(Cash,atm4_lambda))
)
accuracy(atm4_fit)
For ATM 4, the transformed additive ETS model produced the lowest RMSE.
atm4_forecast <- atm4_fit %>%
forecast(h = 31) %>%
filter(.model=='add_ets_transform')
atm4_forecast %>%
autoplot(atm4, level=NULL) +
labs(title="Transformed Additive ETS Forecast")While the mean of the forecasted values match the mean of the last months in the test set, we can see that the chosen model does not do a gret job at capturing the true variance of cash withdrawls from ATM 4.
# report of the transformed additive ETS model
atm4_fit %>% select(.model = "add_ets_transform") %>% report()
## Series: Cash
## Model: ETS(A,N,A)
## Transformation: box_cox(Cash, atm4_lambda)
## Smoothing parameters:
## alpha = 0.000100027
## gamma = 0.0001000061
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 23.12951 -10.53245 -0.008230648 0.3562908 2.212736 2.473327 2.258718 3.23961
##
## sigma^2: 92.8782
##
## AIC AICc BIC
## 3476.592 3477.271 3514.734
# residuals
atm4_fit %>%
select("add_ets_transform") %>%
gg_tsresiduals()
# Box-Pierce test, ℓ = 2m for seasonal data, m = 7
atm4_fit %>%
select(.model = "add_ets_transform") %>%
augment() %>%
features(.innov, box_pierce, lag = 14, dof = 0)
Producing Forecasts for May
atm1_may_forecast <- atm1_fit %>%
forecast(h = 61) %>%
filter(.model=='mult_ets_transform')
atm2_may_forecast <- atm2_fit %>%
forecast(h = 61) %>%
filter(.model=='ARIMA')
atm4_may_forecast <- atm4_fit %>%
forecast(h = 61) %>%
filter(.model=='add_ets_transform')
forecasts <- bind_rows(atm1_may_forecast, atm2_may_forecast, atm4_may_forecast) %>%
as.data.frame() %>%
select(DATE, ATM, .mean) %>%
rename(Cash = .mean)
forecasts
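Note that the 61-day horizon spans both April and May 2010. If only the May forecasts are needed for the deliverable, they can be subset as in this sketch (may_forecasts is a hypothetical name):
# sketch: keep only the May 2010 rows of the combined forecast table
may_forecasts <- forecasts %>%
  filter(month(DATE) == 5)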
# altogether plot
forecasts %>%
as_tsibble(index = DATE, key = ATM) %>%
autoplot(Cash) +
autolayer(atm, Cash, colour = "black") +
facet_wrap(~ATM, scales = "free", nrow = 4) +
labs(title = "ATM Withdrawls")forecasts %>% write.csv("atm124_forecasts.csv")Part 2 - Residential Power Usage
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Load and Inspect Data
power <- read_excel("/Users/alecmccabe/Downloads/ResidentialCustomerForecastLoad-624.xlsx")
head(power)
power <- power %>%
# renaming column name
rename(Month = 'YYYY-MMM') %>%
# converting into date format
mutate(Month = yearmonth(Month)) %>%
# converting to tsibble
as_tsibble(index = Month)
head(power)
colSums(is.na(power))
## CaseSequence        Month          KWH 
##            0            0            1
There is one missing value for KWH in the dataframe.
power[is.na(power$KWH),]
Handling Missing Values and Outliers
power %>%
autoplot(KWH)
We can see from the table and the plot above that the missing value occurs in September 2008. One option is to replace the missing KWH value with the mean of the series up to that point.
We also see a large outlier occurring after January 2010, which we will want to address.
When addressing the missing value, we can use a simple approach like we did earlier, or we can use ARIMA to impute the missing value.
power <- power %>%
# Fit ARIMA model to the data containing missing values
model(ARIMA(KWH)) %>%
# Estimate Cash for the missing values.
interpolate(power)
We can also use ARIMA to handle the outlier, by replacing the outlier value with NA and then running the same ARIMA interpolation.
outlier = min(power$KWH)
power <- power %>%
#replace outliers with NA
mutate(KWH = replace(KWH, KWH == outlier, NA))
power <- power %>%
# Fit ARIMA model to the data with missing values
model(ARIMA(KWH)) %>%
# Estimate Cash for the missing values / outliers
interpolate(power)
power %>%
  autoplot(KWH)
The resulting dataset looks much better. We no longer have a gap in our data, the outlier has been removed, and the replacement value fits nicely with the rest of the series.
Seasonal Decomposition
# Power Usage STL Decomposition
power %>%
model(STL(KWH ~ season(window="periodic"), robust = TRUE)) %>%
components() %>%
autoplot() +
labs(title = "Power Usage Seasonal Decomposition")The power series exhibits moderate upward trend which is exacerbated in 2010 and onwards. From the above plot (as well as the one below) we can see that the series follows an annual seasonality. The remainder looks mostly random.
power %>%
ACF(KWH) %>%
autoplot() +
labs(title = "ACF Plot for Power Usage")The ACF plot above shows us that indeed the series follows an annual seasonality. Also interesting to point out is that the series has negative correlation between it’s 6-month lags. This makes sense, as people will typically use power relative to the time of year / heat of year. When it is warmer out (Summer), power usage will increase. When it is colder out (Winter), power usage will decrease.
power %>%
gg_subseries(KWH) +
labs(title = "Residential Power Usage")It appears that this series follows an annual seasonality. Additionally every 6 months is negatively correlated with the previous 6 months. This has to do with the nature of power usage across seasons. Typically an individual will use more or less power depending on if it is hot or cold outside.
Across the years we can see that the variance for Janurary and December are increasing, whereas for other months the change is not as drastic.
Testing Models
# lambda for box cox transformation
power_lambda <- power %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
power_fit <- power %>%
model(
# additive ETS model
additive = ETS(KWH ~ error("A") + trend("N") + season("A")),
# multiplicative ETS model
multiplicative = ETS(KWH ~ error("M") + trend("N") + season("M")),
# SNAIVE model
snaive = SNAIVE(KWH),
# transformed additive ETS model
additive_ts = ETS(box_cox(KWH,power_lambda) ~ error("A") + trend("N") + season("A")),
# transformed multiplicative ETS model
multiplicative_ts = ETS(box_cox(KWH,power_lambda) ~ error("M") + trend("N") + season("M")),
# transformed SNAIVE model
snaive_ts = SNAIVE(box_cox(KWH,power_lambda)),
# arima model
ARIMA = ARIMA(box_cox(KWH,power_lambda))
)
accuracy(power_fit)
We can see that the additive ETS model produces the lowest RMSE, which is promising. As in the ATM exercise above, we can also examine the model reports, keeping in mind that AIC and BIC are only comparable within a model class.
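Because these models were fit on the full series, there is no held-out test set here; time-series cross-validation would give a fairer head-to-head comparison of the two leading candidates (a sketch using tsibble’s stretch_tsibble; .init = 120 months of initial training data, and it can be slow to run):
# sketch: time-series cross-validation for the two leading candidates
power %>%
  stretch_tsibble(.init = 120, .step = 12) %>%
  model(
    additive = ETS(KWH ~ error("A") + trend("N") + season("A")),
    ARIMA = ARIMA(box_cox(KWH, power_lambda))
  ) %>%
  forecast(h = 12) %>%
  accuracy(power)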
# report of the additive ETS model
power_fit %>% select(.model = "additive") %>% report()
## Series: KWH
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.1184148
## gamma = 0.0001000057
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6] s[-7]
## 6212283 -361852 -1628501 -871943 1161428 1738041 1314356 -21752.88 -1481715
## s[-8] s[-9] s[-10] s[-11]
## -1225127 -534893.8 481413.7 1430547
##
## sigma^2: 419673242075
##
## AIC AICc BIC
## 6163.349 6166.076 6212.211
# report of the arima model
power_fit %>% select(.model = "ARIMA") %>% report()
## Series: KWH
## Model: ARIMA(0,0,1)(2,1,0)[12] w/ drift
## Transformation: box_cox(KWH, power_lambda)
##
## Coefficients:
## ma1 sar1 sar2 constant
## 0.2462 -0.7202 -0.3546 0.0013
## s.e. 0.0833 0.0751 0.0769 0.0004
##
## sigma^2 estimated as 1.505e-05: log likelihood=751.74
## AIC=-1493.47 AICc=-1493.13 BIC=-1477.51
While ARIMA and the additive ETS model have similar RMSEs, their AIC and BIC values are not directly comparable: the two model classes compute their likelihoods differently, and the ARIMA model was fit on the Box-Cox transformed scale, which is why its values are negative. Given its comparable RMSE and, as shown below, well-behaved residuals, we will move forward with ARIMA.
# residuals
power_fit %>%
select("ARIMA") %>%
gg_tsresiduals()
# Box-Pierce test, ℓ = 2m for seasonal data; here m = 12 (monthly)
power_fit %>%
select(.model = "ARIMA") %>%
augment() %>%
features(.innov, box_pierce, lag = 24, dof = 0)
We can see that the residuals are roughly normally distributed and show no significant autocorrelation, so they can be considered white noise.
Forecasting
power_forecast <- power_fit %>%
forecast(h = 12) %>%
filter(.model=='ARIMA')
# forcasted plot
power_forecast %>%
autoplot(power)
The forecast seems to follow the expected fluctuations in the series. Below are the final predictions.
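Point forecasts alone understate uncertainty; prediction intervals can be pulled from the fable with hilo() (a sketch):
# sketch: 80% and 95% prediction intervals alongside the point forecasts
power_forecast %>%
  hilo(level = c(80, 95))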
forecast_results <- power_forecast %>%
as.data.frame() %>%
select(Month, .mean) %>%
rename(prediction = .mean)
forecast_results
forecast_results %>% write.csv("power_forecasts.csv")