Project 1

library('ggplot2')
library('forecast')
library('tseries')
library("readxl")
library('dplyr')
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library('tidyr')
library('readr')
library('purrr')

Load Data

You can also embed plots, for example:

atm_data <- read_excel("ATM624Data.xlsx")
#atm_data$yearandmonth = format(as.Date(atm_data$DATE), "%Y-%m")
atm_data
## # A tibble: 1,474 x 3
##    DATE                ATM    Cash
##    <dttm>              <chr> <dbl>
##  1 2009-05-01 00:00:00 ATM1    96 
##  2 2009-05-01 00:00:00 ATM2   107 
##  3 2009-05-01 00:00:00 ATM3     0 
##  4 2009-05-01 00:00:00 ATM4   777.
##  5 2009-05-02 00:00:00 ATM1    82 
##  6 2009-05-02 00:00:00 ATM2    89 
##  7 2009-05-02 00:00:00 ATM3     0 
##  8 2009-05-02 00:00:00 ATM4   524.
##  9 2009-05-03 00:00:00 ATM1    85 
## 10 2009-05-03 00:00:00 ATM2    90 
## # ... with 1,464 more rows

If this were a business problem, I would like to understand which ATM’s have huge draws and an estimate of how often it has to be refilled. So i group and spread the data by ATM. This is a reasonable approach when the groupings are limited.

df<-atm_data %>%
  drop_na() %>%
  spread(ATM, Cash) %>% 
  mutate(DATE = as.Date(DATE, origin = "2009-03-30"))
df
## # A tibble: 365 x 5
##    DATE        ATM1  ATM2  ATM3  ATM4
##    <date>     <dbl> <dbl> <dbl> <dbl>
##  1 2009-05-01    96   107     0 777. 
##  2 2009-05-02    82    89     0 524. 
##  3 2009-05-03    85    90     0 793. 
##  4 2009-05-04    90    55     0 908. 
##  5 2009-05-05    99    79     0  52.8
##  6 2009-05-06    88    19     0  52.2
##  7 2009-05-07     8     2     0  55.5
##  8 2009-05-08   104   103     0 559. 
##  9 2009-05-09    87   107     0 904. 
## 10 2009-05-10    93   118     0 879. 
## # ... with 355 more rows

Now, I would like to convert the data to a timeseries and plot them.

atm_ts<-ts(df %>% select(-DATE))
autoplot(atm_ts) +
  labs(title = "Withdrawn Cash From All 4 ATMS", subtitle = "05/09 - 04/10", x = "Days") +
  scale_y_continuous("ATM withdrawal ($)") +
  scale_color_discrete(NULL) +
  theme(legend.position = c(0.1, 0.8))

The timeseries above is unclear due to spikes and also the min/max range on the ATM’s seems very different. I will attempt to use a faceted plot to take an individual look at these time series.

df %>% gather(atm, Cash, -DATE) %>% 
  ggplot(aes(x = DATE, y = Cash, col = atm)) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~ atm, ncol = 1, scales = "free_y") +
  labs(title = "Cash Withdrawal from 4 ATM's", subtitle = "05/09 - 04/10", x = "Date") +
  scale_y_continuous("$Cash in hundreds ")

#summarise_at(group_by(atm_data,yearandmonth),vars(Cash),funs(sum(.,na.rm=TRUE)))
#atm_ts<-ts(atm_data %>% select(-DATE))

Inference from the above chart ATM1 & ATM2 have similar charts while ATM3 has only a couple of data points. Also ATM4 has some spike and will be ignored in the forecasting for better prediction.

atm1<-atm_ts[, "ATM1"]
atm2<-atm_ts[, "ATM2"]
atm3<-atm_ts[(nrow(atm_ts) - 2):nrow(atm_ts), "ATM3"]
atm3<-ts(atm3, start = 363)
atm3<-atm_ts[, "ATM3"]
atm3[which(atm3 == 0)]<-NA
atm4<-atm_ts[, "ATM4"]
atm4[which.max(atm4)]<-median(atm4, na.rm = TRUE)

Now, lets analyze these dataseries individually to see if there are any seasonality present. we take about 2 months of data to look at the charts in detail.

autoplot(ts(atm_ts[1:61, ])) +
  labs(title = "Withdrawn Cash From All ATMs", subtitle = "05/09 - 06/10", x = "Day") +
  scale_y_continuous("Withdrawn Cash in hundreds" ) +
  scale_color_discrete(NULL)

from the chart, ce ATM1, 2 & 4 displays a weekly seasonality . we will now extract the time series with a frequency of 7.we ignore atm3 as it has only a couple of data points.

atm1<-ts(atm1, frequency = 7)
atm2<-ts(atm2, frequency = 7)
atm2[which(is.na(atm2))]<-median(atm2, na.rm = TRUE)
atm4<-ts(atm4, frequency = 7)

ATM1

ggtsdisplay(atm1, points = FALSE, main = "Withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")

ggtsdisplay(diff(atm1, 7), points = FALSE,  main = "Differenced (lag-7) withdrawals from ATM1", xlab = "Week", ylab = "Cash in hundreds")

The timeseries appears stationary, so no non-seasonal differencing is suggested by the data.
The significant spikes in the ACF and PACF at k=1 suggest non-seasonal AR(1) and/or MA(1) components of the model.
The spikes in the ACF and PACF at k=7 followed by decreasing spikes at k=14 and k=21 suggest seasonal AR(1) and/or seasonal MA(1) components.

atm1_lambda<-BoxCox.lambda(atm1)

we will now define function to create models & return AIC values for timeseries
create model with Box-Cox and specified ARIMA parameters; extract AIC

atm_aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm1, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm1_lambda))
}

We will create possible combinations of p, q, P, Q except all zero

expand.grid(p = 0:1, q = 0:1, P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm_aic)) %>% 
  slice(which.min(aic))
##   p q P Q      aic
## 1 1 1 0 1 1096.583
atm1_fit<-Arima(atm1, order = c(1, 0, 1), seasonal = c(0, 1, 1), lambda = atm1_lambda)
#The residuals are investigated using a Ljung-Box test and diagnostic plotting:
Box.test(resid(atm1_fit), type = "L", fitdf = 3, lag = 7)
## 
##  Box-Ljung test
## 
## data:  resid(atm1_fit)
## X-squared = 8.0185, df = 4, p-value = 0.0909
ggtsdisplay(resid(atm1_fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,1)(0,1,1) fit of ATM1 withdrawals", xlab = "Week", ylab = "Residual")

The Ljung-Box test returns a p-value > 0.05, suggesting that the residuals may be white noise.
The residuals appear to be approximately normally distributed with a mean around zero.
They do not appear to be autocorrelated, but there is an almost-significant spike at k=6.
This model is acceptable and will be used for forecasting.

Lets repeat these process for ATM2 & ATM4

ATM2

ggtsdisplay(atm2, points = FALSE, main = "Withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

ggtsdisplay(diff(atm2, 7), points = FALSE,  main = "Differenced (lag-7) withdrawals from ATM2", xlab = "Week", ylab = "Cash in hundreds")

As above, the large spike at k=7 suggests D=1, while the stationary nature of the timeseries suggests d=0.
The spikes in ACF & PACF in the non-differenced series at k=2 & k=5 suggest p,q∈[0,2,5].
The spikes in the ACF & PACF of the differenced series do not strongly suggest any need for seasonal AR or MA elements, but since the values at k=1 are followed by decreasing values, P,Q∈[0,1] are also investigated.

atm2_lambda<-BoxCox.lambda(atm2)

we will now define function to create models & return AIC values for timeseries
create model with Box-Cox and specified ARIMA parameters; extract AIC

atm2_aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm2, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm2_lambda))
}

We will create possible combinations of p, q, P, Q except all zero

expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm2_aic)) %>% 
  slice(which.min(aic))
##   p q P Q      aic
## 1 5 5 0 1 2544.791
atm2_fit<-Arima(atm2, order = c(5, 0, 5), seasonal = c(0, 1, 1), lambda = atm2_lambda)
#The residuals are investigated using a Ljung-Box test and diagnostic plotting:
Box.test(resid(atm2_fit), type = "L", fitdf = 11, lag = 14)
## 
##  Box-Ljung test
## 
## data:  resid(atm2_fit)
## X-squared = 4.6984, df = 3, p-value = 0.1953
ggtsdisplay(resid(atm2_fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(5,0,5)(0,1,1) of ATM2 withdrawals", xlab = "Week", ylab = "Residual")

The Ljung-Box test, when using lag = 14 due to the high number of parameters in the fit, returns a p-value >> 0.05. This suggests that the residuals may be white noise.
The residuals appear to be approximately normally distributed with a mean around zero.
This model is acceptable and will be used for forecasting.

ATM 3

As mentioned above, there are only 3 observations at ATM3 and only these observations are used for the forecast.
A simple mean forecast will be used for this ATM.

ATM 4

Lets repeat the same model as atm1 & 2

ggtsdisplay(atm4, points = FALSE, main = "Withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

ggtsdisplay(diff(atm4, 7), points = FALSE,  main = "Differenced (lag-7) withdrawals from ATM4", xlab = "Week", ylab = "Cash in hundreds")

The stationary timeseries with a large spike at k=7 suggests D=1 and d=0.
Similar spikes in the ACF & PACF of both the original and differenced timeseries as ATM2 suggest p,q∈[0,2,5] and P,Q∈[0,1] (though the evidence for seasonal AR and/or MA components are stronger in this case).
The code from above is reused to investigate the same possible models for ATM4

atm4_lambda<-BoxCox.lambda(atm4)

we will now define function to create models & return AIC values for timeseries
create model with Box-Cox and specified ARIMA parameters; extract AIC

atm4_aic<-function(p, d, q, P, D, Q) {
  AIC(Arima(atm4, order = c(p, d, q), seasonal = c(P, D, Q), lambda = atm4_lambda))
}

We will create possible combinations of p, q, P, Q except all zero

expand.grid(p = c(0, 2, 5), q = c(0, 2, 5), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0) %>% 
  mutate(aic = pmap_dbl(list(p, 0, q, P, 1, Q), atm4_aic)) %>% 
  slice(which.min(aic))
##   p q P Q      aic
## 1 5 2 0 1 2863.061
atm4_fit<-Arima(atm4, order = c(0, 0, 2), seasonal = c(0, 1, 1), lambda = atm4_lambda)
#The residuals are investigated using a Ljung-Box test and diagnostic plotting:
Box.test(resid(atm4_fit), type = "L", fitdf = 3, lag = 7)
## 
##  Box-Ljung test
## 
## data:  resid(atm4_fit)
## X-squared = 8.6819, df = 4, p-value = 0.06956
ggtsdisplay(resid(atm4_fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(0,0,2)(0,1,1) of ATM4 withdrawals", xlab = "Week", ylab = "Residual")

The Ljung-Box test again returns a p-value >> 0.05, with residuals approximately normally distributed with a mean around zero.
This model is acceptable and will be used for forecasting.

atm1_forecast<-forecast(atm1_fit, 31, level = 95)
atm2_forecast<-forecast(atm2_fit, 31, level = 95)
atm3_forecast<-meanf(atm3, 31, level = 95)
atm4_forecast<-forecast(atm4_fit, 31, level = 95)
gridExtra::grid.arrange(
  autoplot(atm1_forecast) + 
    labs(title = "ATM1: ARIMA(1,0,1)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm2_forecast) + 
    labs(title = "ATM2: ARIMA(5,0,5)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm3_forecast) + 
    labs(title = "ATM3: mean", x = "Day", y = NULL) +
    theme(legend.position = "none"),
  autoplot(atm4_forecast) + 
    labs(title = "ATM4: ARIMA(0,0,2)(0,1,1)", x = "Week", y = NULL) +
    theme(legend.position = "none"),
  top = grid::textGrob("Forecasted ATM withdrawals (in hundreds of dollars) for 05/10\n")
)

As expected, these values show seasonality for ATMs 1, 2, and 4, with a single value forecast for ATM3.
The forecast values are gathered and output to a .csv, which is manually tranferred to Excel for submission

data_frame(DATE = rep(max(df$DATE) + 1:31, 4), ATM = rep(names(df)[-1], each = 31), Cash = c(atm1_forecast$mean, atm2_forecast$mean, atm3_forecast$mean, atm4_forecast$mean)) %>% 
  write_csv("data624_ATMData_Projection.csv")

Project 1 : Part B - REsidential Power Usage

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

kwh<-read_excel("ResidentialCustomerForecastLoad-624.xlsx")
kwh<-ts(kwh[, "KWH"], start = c(1998, 1), frequency = 12)
autoplot(kwh) +
  labs(title = "Monthly Residential Power Usage", subtitle = "01/98 - 12/13")

seasonality is very clear in this data and it apperars to be semi-annual with a peak every six months, but it may be annual, as the peaks seem to alternate in their height. A spike down in July 2010
It seems that the variance of the series may increase with its level. Therefore a Box-cox transformation is also investigated.

kwh_lambda<-BoxCox.lambda(kwh)
kwh_trans<-BoxCox(kwh, kwh_lambda)

The series appears stationary, so no non-seasonal differencing appears necessary.
The decaying seasonal spikes in the PACF suggests a seasonal AR(1) component, while the very quickly-decaying seasonal spikes in the ACF suggest the possibility of a seasonal MA(1) component.
Spikes in the PACF and ACF at k=1 and k=4 suggest non-seasonal AR(1) or AR(4) components, and non-seasonal MA(1) or MA(4) components.
The function used to select the model with lowest AIC in Part A is redefined for use on the kWh timeseries with D=1 and d=0.

kwh_aic<-function(p, q, P, Q) {
  AIC(Arima(kwh, order = c(p, 0, q), seasonal = c(P, 1, Q), lambda = kwh_lambda))
}

We will create a model with Box-Cox and specified ARIMA parameters

expand.grid(p = c(0, 1, 4), q = c(0, 1, 4), P = 0:1, Q = 0:1) %>%
  filter(p > 0 | q > 0 | P > 0 | Q > 0, p < 4 | q < 4 | P < 1 | Q < 1) %>%
  mutate(aic = pmap_dbl(list(p, q, P, Q), kwh_aic)) %>% 
  slice(which.min(aic))
##   p q P Q      aic
## 1 1 0 0 1 575.2659

We will create possible combinations except all zeros and p = q = 4; P = Q = 1 (returns error)
The minimum AIC value returned is for the ARIMA(1,0,0)(0,1,1) model

kwh_fit<-Arima(kwh, order = c(1, 0, 0), seasonal = c(0, 1, 1), lambda = kwh_lambda)

The residuals of this fit are investigated with a Ljung-Box test and diagnostic plotting

Box.test(resid(kwh_fit), type = "L", fitdf = 3, lag = 12)  
## 
##  Box-Ljung test
## 
## data:  resid(kwh_fit)
## X-squared = 6.4543, df = 9, p-value = 0.6937
ggtsdisplay(resid(kwh_fit), points = FALSE, main = "Residential Power Usage Residuals for ARIMA(1,0,0)(0,1,1)")

The Ljung-Box test returns a p-value >> 0.05, but the spikes in ACF & PACF at k=3 and k=4 suggest the possibility of AR(3) or MA(3) components (since the spike at k=4 was addressed above).
Investigation of these does not yield any AIC values lower than that of the above-identified model

expand.grid(p = c(1, 3), q = c(1, 3)) %>%
  mutate(aic = pmap_dbl(list(p, q, 0, 1), kwh_aic))
##   p q      aic
## 1 1 1 575.7090
## 2 3 1 579.7385
## 3 1 3 579.9391
## 4 3 3 581.4511

Viewing the residuals of the fit model again with a histogram, the model is acceptable.
The residuals appear to be roughly normally distributed around zero (with the exception of the significant dip in July 2010) without any significant autocorrelation

ggtsdisplay(resid(kwh_fit), points = FALSE, plot.type = "histogram", main = "Residuals for ARIMA(1,0,0)(0,1,1) of residential power usage")
## Warning: Removed 1 rows containing non-finite values (stat_bin).

Using the ARIMA(1,0,0)(0,1,1) model, the next 12 months is forecasted and this forecast is plotted

kwh_forecast<-forecast(kwh_fit, 12, level = 95)
autoplot(kwh_forecast) + 
    labs(title = "Residential Energy Forecast for 2014", subtitle = "Using ARIMA(1,0,0)(0,1,1)", x = "Month", y = "kWh") +
    theme(legend.position = "none")

As expected, the forecast shows annual seasonality while showing some drift due to the non-seasonal autocorrelation. The forecast values are output to a .csv for inclusion in the required Excel submission

data_frame(`YYYY-MMM` = paste0(2014, "-", month.abb), KWH = kwh_forecast$mean) %>% 
  write_csv("data624_residential_kwh_forecast.csv")