library('ggplot2')
library('forecast')
library('tseries')
library("readxl")
library('dplyr')
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library('tidyr')
library('readr')
library('purrr')
You can also embed plots, for example:
atm_data <- read_excel("ATM624Data.xlsx")
#atm_data$yearandmonth = format(as.Date(atm_data$DATE), "%Y-%m")
atm_data
## # A tibble: 1,474 x 3
## DATE ATM Cash
## <dttm> <chr> <dbl>
## 1 2009-05-01 00:00:00 ATM1 96
## 2 2009-05-01 00:00:00 ATM2 107
## 3 2009-05-01 00:00:00 ATM3 0
## 4 2009-05-01 00:00:00 ATM4 777.
## 5 2009-05-02 00:00:00 ATM1 82
## 6 2009-05-02 00:00:00 ATM2 89
## 7 2009-05-02 00:00:00 ATM3 0
## 8 2009-05-02 00:00:00 ATM4 524.
## 9 2009-05-03 00:00:00 ATM1 85
## 10 2009-05-03 00:00:00 ATM2 90
## # ... with 1,464 more rows
If this were a business problem, I would like to understand which ATM’s have huge draws and an estimate of how often it has to be refilled. So i group and spread the data by ATM. This is a reasonable approach when the groupings are limited.
df<-atm_data %>%
drop_na() %>%
spread(ATM, Cash) %>%
mutate(DATE = as.Date(DATE, origin = "2009-03-30"))
df
## # A tibble: 365 x 5
## DATE ATM1 ATM2 ATM3 ATM4
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2009-05-01 96 107 0 777.
## 2 2009-05-02 82 89 0 524.
## 3 2009-05-03 85 90 0 793.
## 4 2009-05-04 90 55 0 908.
## 5 2009-05-05 99 79 0 52.8
## 6 2009-05-06 88 19 0 52.2
## 7 2009-05-07 8 2 0 55.5
## 8 2009-05-08 104 103 0 559.
## 9 2009-05-09 87 107 0 904.
## 10 2009-05-10 93 118 0 879.
## # ... with 355 more rows
Now, I would like to convert the data to a timeseries and plot them.
atm_ts<-ts(df %>% select(-DATE))
autoplot(atm_ts) +
labs(title = "Withdrawn Cash From All 4 ATMS", subtitle = "05/09 - 04/10", x = "Days") +
scale_y_continuous("ATM withdrawal ($)") +
scale_color_discrete(NULL) +
theme(legend.position = c(0.1, 0.8))
The timeseries above is unclear due to spikes and also the min/max range on the ATM’s seems very different. I will attempt to use a faceted plot to take an individual look at these time series.
df %>% gather(atm, Cash, -DATE) %>%
ggplot(aes(x = DATE, y = Cash, col = atm)) +
geom_line(show.legend = FALSE) +
facet_wrap(~ atm, ncol = 1, scales = "free_y") +
labs(title = "Cash Withdrawal from 4 ATM's", subtitle = "05/09 - 04/10", x = "Date") +
scale_y_continuous("$Cash in hundreds ")
#summarise_at(group_by(atm_data,yearandmonth),vars(Cash),funs(sum(.,na.rm=TRUE)))
#atm_ts<-ts(atm_data %>% select(-DATE))
Inference from the above chart ATM1 & ATM2 have similar charts while ATM3 has only a couple of data points. Also ATM4 has some spike and will be ignored in the forecasting for better prediction.
atm1<-atm_ts[, "ATM1"]
atm2<-atm_ts[, "ATM2"]
atm3<-atm_ts[(nrow(atm_ts) - 2):nrow(atm_ts), "ATM3"]
atm3<-ts(atm3, start = 363)
atm3<-atm_ts[, "ATM3"]
atm3[which(atm3 == 0)]<-NA
atm4<-atm_ts[, "ATM4"]
atm4[which.max(atm4)]<-median(atm4, na.rm = TRUE)
Now, lets analyze these dataseries individually to see if there are any seasonality present. we take about 2 months of data to look at the charts in detail.
autoplot(ts(atm_ts[1:61, ])) +
labs(title = "Withdrawn Cash From All ATMs", subtitle = "05/09 - 06/10", x = "Day") +
scale_y_continuous("Withdrawn Cash in hundreds" ) +
scale_color_discrete(NULL)
from the chart, ce ATM1, 2 & 4 displays a weekly seasonality . we will now extract the time series with a frequency of 7.we ignore atm3 as it has only a couple of data points.
atm1<-ts(atm1, frequency = 7)
atm2<-ts(atm2, frequency = 7)
atm2[which(is.na(atm2))]<-median(atm2, na.rm = TRUE)
atm4<-ts(atm4, frequency = 7)
ggtsdisplay(atm1, points = FALSE, main = "Withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")
ggtsdisplay(diff(atm1, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")
The timeseries appears stationary, so no non-seasonal differencing is suggested by the data.
The significant spikes in the ACF and PACF at k=1 suggest non-seasonal AR(1) and/or MA(1) components of the model.
The spikes in the ACF and PACF at k=7 followed by decreasing spikes at k=14 and k=21 suggest seasonal AR(1) and/or seasonal MA(1) components.
atm1_lambda<-BoxCox.lambda(atm1)
we will now define function to create models & return AIC values for timeseries
create model with Box-Cox and specified ARIMA parameters; extract AIC
atm_aic<-function(p, d, q, P, D, Q) {
AIC(Arima(atm1, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm1_lambda))
}
We will create possible combinations of p, q, P, Q except all zero
expand.grid(p = 0:1, q = 0:1, P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0) %>%
mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm_aic)) %>%
slice(which.min(aic))
## p q P Q aic
## 1 1 1 0 1 1096.583
atm1_fit<-Arima(atm1, order = c(1, 0, 1), seasonal = c(0, 1, 1), lambda = atm1_lambda)
#The residuals are investigated using a Ljung-Box test and diagnostic plotting:
Box.test(resid(atm1_fit), type = "L", fitdf = 3, lag = 7)
##
## Box-Ljung test
##
## data: resid(atm1_fit)
## X-squared = 8.0185, df = 4, p-value = 0.0909
ggtsdisplay(resid(atm1_fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,1)(0,1,1) fit of ATM1 withdrawals", xlab = "Week", ylab = "Residual")
The Ljung-Box test returns a p-value > 0.05, suggesting that the residuals may be white noise.
The residuals appear to be approximately normally distributed with a mean around zero.
They do not appear to be autocorrelated, but there is an almost-significant spike at k=6.
This model is acceptable and will be used for forecasting.
Lets repeat these process for ATM2 & ATM4
ggtsdisplay(atm2, points = FALSE, main = "Withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")
ggtsdisplay(diff(atm2, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")
As above, the large spike at k=7 suggests D=1, while the stationary nature of the timeseries suggests d=0.
The spikes in ACF & PACF in the non-differenced series at k=2 & k=5 suggest p,q∈[0,2,5].
The spikes in the ACF & PACF of the differenced series do not strongly suggest any need for seasonal AR or MA elements, but since the values at k=1 are followed by decreasing values, P,Q∈[0,1] are also investigated.
atm2_lambda<-BoxCox.lambda(atm2)
we will now define function to create models & return AIC values for timeseries
create model with Box-Cox and specified ARIMA parameters; extract AIC
atm2_aic<-function(p, d, q, P, D, Q) {
AIC(Arima(atm2, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm2_lambda))
}
We will create possible combinations of p, q, P, Q except all zero
expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0) %>%
mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm2_aic)) %>%
slice(which.min(aic))
## p q P Q aic
## 1 5 5 0 1 2544.791
atm2_fit<-Arima(atm2, order = c(5, 0, 5), seasonal = c(0, 1, 1), lambda = atm2_lambda)
#The residuals are investigated using a Ljung-Box test and diagnostic plotting:
Box.test(resid(atm2_fit), type = "L", fitdf = 11, lag = 14)
##
## Box-Ljung test
##
## data: resid(atm2_fit)
## X-squared = 4.6984, df = 3, p-value = 0.1953
ggtsdisplay(resid(atm2_fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(5,0,5)(0,1,1) of ATM2 withdrawals", xlab = "Week", ylab = "Residual")
The Ljung-Box test, when using lag = 14 due to the high number of parameters in the fit, returns a p-value >> 0.05. This suggests that the residuals may be white noise.
The residuals appear to be approximately normally distributed with a mean around zero.
This model is acceptable and will be used for forecasting.
As mentioned above, there are only 3 observations at ATM3 and only these observations are used for the forecast.
A simple mean forecast will be used for this ATM.
Lets repeat the same model as atm1 & 2
ggtsdisplay(atm4, points = FALSE, main = "Withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")
ggtsdisplay(diff(atm4, 7), points = FALSE, main = "Differenced (lag-7) withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")
The stationary timeseries with a large spike at k=7 suggests D=1 and d=0.
Similar spikes in the ACF & PACF of both the original and differenced timeseries as ATM2 suggest p,q∈[0,2,5] and P,Q∈[0,1] (though the evidence for seasonal AR and/or MA components are stronger in this case).
The code from above is reused to investigate the same possible models for ATM4
atm4_lambda<-BoxCox.lambda(atm4)
we will now define function to create models & return AIC values for timeseries
create model with Box-Cox and specified ARIMA parameters; extract AIC
atm4_aic<-function(p, d, q, P, D, Q) {
AIC(Arima(atm4, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm4_lambda))
}
We will create possible combinations of p, q, P, Q except all zero
expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0) %>%
mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm4_aic)) %>%
slice(which.min(aic))
## p q P Q aic
## 1 5 2 0 1 2863.061
atm4_fit<-Arima(atm4, order = c(0, 0, 2), seasonal = c(0, 1, 1), lambda = atm4_lambda)
#The residuals are investigated using a Ljung-Box test and diagnostic plotting:
Box.test(resid(atm4_fit), type = "L", fitdf = 3, lag = 7)
##
## Box-Ljung test
##
## data: resid(atm4_fit)
## X-squared = 8.6819, df = 4, p-value = 0.06956
ggtsdisplay(resid(atm4_fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(0,0,2)(0,1,1) of ATM4 withdrawals", xlab = "Week", ylab = "Residual")
The Ljung-Box test again returns a p-value >> 0.05, with residuals approximately normally distributed with a mean around zero.
This model is acceptable and will be used for forecasting.
atm1_forecast<-forecast(atm1_fit, 31, level = 95)
atm2_forecast<-forecast(atm2_fit, 31, level = 95)
atm3_forecast<-meanf(atm3, 31, level = 95)
atm4_forecast<-forecast(atm4_fit, 31, level = 95)
gridExtra::grid.arrange(
autoplot(atm1_forecast) +
labs(title = "ATM1: ARIMA(1,0,1)(0,1,1)", x = "Week", y = NULL) +
theme(legend.position = "none"),
autoplot(atm2_forecast) +
labs(title = "ATM2: ARIMA(5,0,5)(0,1,1)", x = "Week", y = NULL) +
theme(legend.position = "none"),
autoplot(atm3_forecast) +
labs(title = "ATM3: mean", x = "Day", y = NULL) +
theme(legend.position = "none"),
autoplot(atm4_forecast) +
labs(title = "ATM4: ARIMA(0,0,2)(0,1,1)", x = "Week", y = NULL) +
theme(legend.position = "none"),
top = grid::textGrob("Forecasted ATM withdrawals (in hundreds of dollars) for 05/10\n")
)
As expected, these values show seasonality for ATMs 1, 2, and 4, with a single value forecast for ATM3.
The forecast values are gathered and output to a .csv, which is manually tranferred to Excel for submission
data_frame(DATE = rep(max(df$DATE) + 1:31, 4), ATM = rep(names(df)[-1], each = 31), Cash = c(atm1_forecast$mean, atm2_forecast$mean, atm3_forecast$mean, atm4_forecast$mean)) %>%
write_csv("data624_ATMData_Projection.csv")
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
kwh<-read_excel("ResidentialCustomerForecastLoad-624.xlsx")
kwh<-ts(kwh[, "KWH"], start = c(1998, 1), frequency = 12)
autoplot(kwh) +
labs(title = "Monthly Residential Power Usage", subtitle = "01/98 - 12/13")
seasonality is very clear in this data and it apperars to be semi-annual with a peak every six months, but it may be annual, as the peaks seem to alternate in their height. A spike down in July 2010
It seems that the variance of the series may increase with its level. Therefore a Box-cox transformation is also investigated.
kwh_lambda<-BoxCox.lambda(kwh)
kwh_trans<-BoxCox(kwh, kwh_lambda)
The series appears stationary, so no non-seasonal differencing appears necessary.
The decaying seasonal spikes in the PACF suggests a seasonal AR(1) component, while the very quickly-decaying seasonal spikes in the ACF suggest the possibility of a seasonal MA(1) component.
Spikes in the PACF and ACF at k=1 and k=4 suggest non-seasonal AR(1) or AR(4) components, and non-seasonal MA(1) or MA(4) components.
The function used to select the model with lowest AIC in Part A is redefined for use on the kWh timeseries with D=1 and d=0.
kwh_aic<-function(p, q, P, Q) {
AIC(Arima(kwh, order = c(p, 0, q), seasonal = c(P, 1, Q), lambda = kwh_lambda))
}
We will create a model with Box-Cox and specified ARIMA parameters
expand.grid(p = c(0, 1, 4), q = c(0, 1, 4), P = 0:1, Q = 0:1) %>%
filter(p > 0 | q > 0 | P > 0 | Q > 0, p < 4 | q < 4 | P < 1 | Q < 1) %>%
mutate(aic = pmap_dbl(list(p, q, P, Q), kwh_aic)) %>%
slice(which.min(aic))
## p q P Q aic
## 1 1 0 0 1 575.2659
We will create possible combinations except all zeros and p = q = 4; P = Q = 1 (returns error)
The minimum AIC value returned is for the ARIMA(1,0,0)(0,1,1) model
kwh_fit<-Arima(kwh, order = c(1, 0, 0), seasonal = c(0, 1, 1), lambda = kwh_lambda)
The residuals of this fit are investigated with a Ljung-Box test and diagnostic plotting
Box.test(resid(kwh_fit), type = "L", fitdf = 3, lag = 12)
##
## Box-Ljung test
##
## data: resid(kwh_fit)
## X-squared = 6.4543, df = 9, p-value = 0.6937
ggtsdisplay(resid(kwh_fit), points = FALSE, main = "Residential Power Usage Residuals for ARIMA(1,0,0)(0,1,1)")
The Ljung-Box test returns a p-value >> 0.05, but the spikes in ACF & PACF at k=3 and k=4 suggest the possibility of AR(3) or MA(3) components (since the spike at k=4 was addressed above).
Investigation of these does not yield any AIC values lower than that of the above-identified model
expand.grid(p = c(1, 3), q = c(1, 3)) %>%
mutate(aic = pmap_dbl(list(p, q, 0, 1), kwh_aic))
## p q aic
## 1 1 1 575.7090
## 2 3 1 579.7385
## 3 1 3 579.9391
## 4 3 3 581.4511
Viewing the residuals of the fit model again with a histogram, the model is acceptable.
The residuals appear to be roughly normally distributed around zero (with the exception of the significant dip in July 2010) without any significant autocorrelation
ggtsdisplay(resid(kwh_fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,0)(0,1,1) of residential power usage")
## Warning: Removed 1 rows containing non-finite values (stat_bin).
Using the ARIMA(1,0,0)(0,1,1) model, the next 12 months is forecasted and this forecast is plotted
kwh_forecast<-forecast(kwh_fit, 12, level = 95)
autoplot(kwh_forecast) +
labs(title = "Residential Energy Forecast for 2014", subtitle = "Using ARIMA(1,0,0)(0,1,1)", x = "Month", y = "kWh") +
theme(legend.position = "none")
As expected, the forecast shows annual seasonality while showing some drift due to the non-seasonal autocorrelation. The forecast values are output to a .csv for inclusion in the required Excel submission
data_frame(`YYYY-MMM` = paste0(2014, "-", month.abb), KWH = kwh_forecast$mean) %>%
write_csv("data624_residential_kwh_forecast.csv")