Forecasting cash taken out of ATMs, May 2010
First, let's load the data using the read_xlsx function from the readxl package and take a look at the summary.
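A minimal sketch of the import step (the file path and the exact DATE coercion are assumptions, since the original code chunk is not shown):
library(readxl)
library(tidyverse)
library(fpp3)

dat <- read_xlsx('data/ATM624Data.xlsx') %>%   # assumed file path
  mutate(DATE = as_date(DATE),                 # coerce the Excel datetime to a Date (assumes it reads in as POSIXct)
         ATM  = as_factor(ATM))
summary(dat)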
## DATE ATM Cash
## Min. :2009-05-01 ATM1:365 Min. : 0.0
## 1st Qu.:2009-08-01 ATM2:365 1st Qu.: 0.5
## Median :2009-11-01 ATM3:365 Median : 73.0
## Mean :2009-10-31 ATM4:365 Mean : 155.6
## 3rd Qu.:2010-02-01 NA's: 14 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
We have three variables in our data: DATE, ATM, and Cash.
Let's follow a few more tidying steps and see what this data looks like.
There are three kinds of missingness to keep in mind:
- structurally missing: predictors that have no value for a given sample
- informative missingness: there is a pattern to which values are missing
- censored data: the exact value is missing, but something is known about it
Only 1% of our data is missing, but we should still examine it for structural or informative missingness, or to see whether it is censored. The graph below plots missing data in a dark color and complete data in a light color.
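A sketch of the call that produces this kind of plot, using the same vis_miss() helper that appears after the imputation below:
library(naniar)   # vis_miss() is re-exported here from visdat
vis_miss(dat)     # dark cells = missing, light cells = present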
We have two groups of missing data. Any row that is missing its ATM factor is also missing its Cash value; the second group is missing only its Cash value.
Let's first look at the data that is missing just its Cash value.
dat%>%
filter(!is.na(ATM) & is.na(Cash))
## # A tibble: 5 x 3
## DATE ATM Cash
## <date> <fct> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-18 ATM2 NA
## 4 2009-06-22 ATM1 NA
## 5 2009-06-24 ATM2 NA
The values missing only their Cash occur in June 2009 at ATM1 and ATM2. Let's take a look at the values for ATM1 and ATM2 during June 2009.
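One way to pull up that window of the data; this is a sketch, since the original plot for this step is not shown:
# June 2009 values for ATM1 and ATM2 only
dat %>%
  filter(ATM %in% c('ATM1', 'ATM2'),
         year(DATE) == 2009, month(DATE) == 6) %>%
  ggplot(aes(DATE, Cash, colour = ATM)) +
  geom_line() +
  geom_point()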
This data does not appear to be structurally missing; it looks like a few fairly random values were dropped. Note how the dates of the missing values do not coincide across ATM1 and ATM2. Most likely this was an error in recording, not an outlier in the ATMs' actual usage.
MICE might be a fine option here, but I believe a more conservative mean or median imputation would do just fine. It is unlikely to change our seasonal trends too much if one month is slightly out of phase.
I would like to look at the overall trend of ATM1 and ATM2 because I need to decide if I want to impute a local mean (say June’s mean) or the mean of the data as a whole.
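A sketch of that comparison (the original plot is not shown):
# Full series for ATM1 and ATM2, faceted so each machine gets its own scale
dat %>%
  filter(ATM %in% c('ATM1', 'ATM2')) %>%
  ggplot(aes(DATE, Cash)) +
  geom_line() +
  facet_wrap(~ATM, ncol = 1, scales = 'free_y')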
It looks like there is a clear trend in the data (at least for ATM2), so I will impute a local mean rather than the global mean for these ATMs. This feels like a nice middle ground between a MICE imputation and a plain global-mean imputation.
# Local mean: the average of the non-missing June values for the given ATM
get_mean<-function(atm = 'ATM1'){
  dat%>%filter(month(DATE)==06 & ATM == atm)%>%na.omit()%>%summarize(m=mean(Cash))%>%as.numeric()
}
m_1<-get_mean()
m_2<-get_mean(atm='ATM2')
dat_impute<-dat%>%
mutate(Cash = replace(Cash, is.na(Cash) & ATM =='ATM1', m_1))
dat_impute<-dat_impute%>%
mutate(Cash = replace(Cash, is.na(Cash) & ATM =='ATM2', m_2))
vis_miss(dat_impute)
We have successfully replaced the missing values with their local mean.
Now let's look at the data that is missing both the ATM value and the Cash value.
dat_impute%>%filter(is.na(ATM))
## # A tibble: 14 x 3
## DATE ATM Cash
## <date> <fct> <dbl>
## 1 2010-05-01 <NA> NA
## 2 2010-05-02 <NA> NA
## 3 2010-05-03 <NA> NA
## 4 2010-05-04 <NA> NA
## 5 2010-05-05 <NA> NA
## 6 2010-05-06 <NA> NA
## 7 2010-05-07 <NA> NA
## 8 2010-05-08 <NA> NA
## 9 2010-05-09 <NA> NA
## 10 2010-05-10 <NA> NA
## 11 2010-05-11 <NA> NA
## 12 2010-05-12 <NA> NA
## 13 2010-05-13 <NA> NA
## 14 2010-05-14 <NA> NA
dat_impute$DATE%>%max()
## [1] "2010-05-14"
This data is missing from the end of our series, so now we need to figure out which of the ATMs it is missing from.
dat_impute%>%
filter(!is.na(Cash))%>%
group_by(ATM)%>%
summarize(last_date = max(DATE))
## # A tibble: 4 x 2
## ATM last_date
## <fct> <date>
## 1 ATM1 2010-04-30
## 2 ATM2 2010-04-30
## 3 ATM3 2010-04-30
## 4 ATM4 2010-04-30
Interesting… It looks like two weeks of empty rows were tacked on to the end of our data. We will clearly drop these rows, since they contain no information about our ATMs or the cash withdrawn.
## [1] "2010-04-03"
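The drop itself is a single filter; a sketch:
# Remove the trailing rows that have neither an ATM label nor a Cash value
dat_impute <- dat_impute %>%
  filter(!is.na(ATM))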
Let's take one final look at our data to confirm its tidiness.
While there might be some macro trends in ATM usage, I believe it is wiser to model each ATM individually, since we have no information about the location of, or connection between, these ATMs beyond the fact that the sampling took place over the same timeframe (one might be in a daycare, while another might be in a casino across the country).
Let's first look at our raw data plotted as a time series.
There is no clear trend in the data, but there does appear to be a weekly seasonality.
Let's look at a decomposed version of the data.
The increasing variance in the error component is concerning. Let's try this again with a multiplicative decomposition.
There still seems to be a growth effect in the residuals towards the end, so let's try a transformation to smooth out the variance.
# Estimate a Box-Cox lambda with the Guerrero method, transform Cash, and re-decompose
lambda<-ATM1%>%
  features(Cash, features=guerrero)%>%
  pull(lambda_guerrero)
ATM1_mutate<-ATM1%>%mutate(Cash = box_cox(Cash,lambda))
ATM1_mutate%>%model(classical_decomposition(type='add'))%>%
  components()%>%
  autoplot()
This transformation noticeably reduces the difference in magnitude between the high-variance and low-variance portions of the data.
ATM1%>%
gg_tsdisplay(plot_type='partial')
## Plot variable not specified, automatically selected `y = Cash`
Differencing means computing the change between consecutive observations; it can turn a non-stationary series into a stationary one (a requirement for ARIMA modeling).
The large spikes in our ACF and PACF plots suggest that differencing may be needed. Let's run a unit root test to confirm whether it is required.
Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test: \(H_0\): the data are stationary; \(H_A\): the data are not stationary.
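A sketch of the KPSS call behind the table below, using unitroot_kpss from feasts:
# KPSS test for level stationarity of the raw ATM1 series
ATM1 %>%
  features(Cash, unitroot_kpss)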
| kpss_stat | kpss_pvalue |
|---|---|
| 0.309504 | 0.1 |
The test statistic (0.31) is below even the 10% critical value, and the reported p-value sits at its upper cap of 0.1, so we fail to reject \(H_0\): the KPSS test indicates the ATM1 series is already stationary at the non-seasonal level, and ordinary differencing is not strictly required.
For completeness, let's difference the series once anyway and confirm that the differenced series is also stationary (i.e., that nothing points towards second-order differencing).
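A sketch of that check:
# Difference once, then re-run the KPSS test on the differenced series
ATM1 %>%
  mutate(diff_Cash = difference(Cash)) %>%
  features(diff_Cash, unitroot_kpss)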
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0102 0.1
The test statistic is small and the p-value is again at its 0.1 cap, so the differenced series is stationary as well; second-order differencing is certainly not required.
We could also have used unitroot_ndiffs to tell us directly how many differences are required.
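For example (a sketch):
# Let feasts report the required number of differences directly
ATM1 %>% features(Cash, unitroot_ndiffs)    # ordinary differences needed
ATM1 %>% features(Cash, unitroot_nsdiffs)   # seasonal differences needed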
Let's look at our ACF and PACF plots with the differencing in place.
ATM1%>%
mutate(Cash = difference(Cash))%>%
gg_tsdisplay(plot_type='partial')
## Plot variable not specified, automatically selected `y = Cash`
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
The significant spikes at weekly intervals indicate that seasonal differencing is probably more appropriate.
Let's look again at the ACF and PACF plots of the ATM1 data with the transformation in place.
The ACF plot has peaks at lags 2, 5, and 7. The weekly (lag-7) peak in particular suggests that a seasonally differenced series should be examined first.
Let's look at the seasonally differenced ACF and PACF plots for a 7-day season.
ATM1%>%
gg_tsdisplay(difference(Cash,7),
plot_type = 'partial', lag =21)+
labs(title = 'Seasonally Differenced Cash at ATM1')
## Warning: Removed 7 row(s) containing missing values (geom_path).
## Warning: Removed 7 rows containing missing values (geom_point).
This data is still not perfectly stationary, but double differencing seems to muddy the picture further, so let's stick with this for now.
The lag-5 spike in the ACF points to a non-seasonal MA(1) component, and the lag-7 spike suggests a seasonal MA(1) component. This implies an \(ARIMA(0,0,1)(0,1,1)_7\) model, which we will include in our modeling procedure.
The spike at lag 6 is concerning to me, but I want to avoid using such a large lag for the non-seasonal part, because it doesn't line up with my reading of the residuals, which look a lot like white noise.
fit<-ATM1%>%
model(
arima001011 = ARIMA(Cash~pdq(0,0,1)+PDQ(0,1,1)),
arima000011 = ARIMA(Cash~pdq(0,0,0)+PDQ(0,1,1)),
auto = ARIMA(Cash, stepwise = FALSE, approx = FALSE)
)
piv<-fit%>%pivot_longer(everything(), names_to = 'model name',
values_to = 'Orders')
piv$Orders
## <lst_mdl[3]>
## [1] <ARIMA(0,0,1)(0,1,1)[7]> <ARIMA(0,0,0)(0,1,1)[7]> <ARIMA(0,0,2)(0,1,1)[7]>
glance(fit)%>%arrange(AICc)%>%select(.model:BIC)
## # A tibble: 3 x 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 auto 4.37 -714. 1436. 1437. 1452.
## 2 arima001011 4.42 -717. 1439. 1439. 1451.
## 3 arima000011 4.47 -719. 1441. 1441. 1449.
Although the auto-selected ARIMA(0,0,2)(0,1,1) model has a better \(AIC_c\) value, I do not think it is significantly better than a simpler, more interpretable model. This is why I am selecting the ARIMA(0,0,1)(0,1,1) model for my predictions.
In essence, we are choosing whether the non-seasonal moving average order should be 1 or 2, and the difference in \(AIC_c\) is not large enough to justify the second term.
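Before forecasting, a quick residual check on the chosen model is a reasonable sanity check; a sketch (the lag and degrees-of-freedom values here are my own choices, not taken from the original output):
# Residual diagnostics for the ARIMA(0,0,1)(0,1,1)[7] model
fit %>%
  select(arima001011) %>%
  gg_tsresiduals()

# Ljung-Box test on the innovation residuals (2 estimated MA parameters)
fit %>%
  augment() %>%
  filter(.model == 'arima001011') %>%
  features(.innov, ljung_box, lag = 14, dof = 2)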
fit%>%
forecast(h=30)%>%
filter(.model == 'arima001011')%>%
autoplot(ATM1)
Let's first look at our raw data for ATM2 plotted as a time series.
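A sketch of that plot, assuming an ATM2 tsibble built the same way as ATM1:
ATM2 %>% autoplot(Cash)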
ATM2 seems to have a clear trend. We also observed this trend during our imputation.
Let's look at a decomposed version of the data.
The increasing variance in the error component is similar to ATM1, where a transformation was warranted. Let's use the same approach again.
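A sketch mirroring the ATM1 code, re-estimating the Guerrero lambda for ATM2 (lambda2 is a name introduced here):
lambda2 <- ATM2 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

ATM2 %>%
  mutate(Cash = box_cox(Cash, lambda2)) %>%
  model(classical_decomposition(Cash, type = 'add')) %>%
  components() %>%
  autoplot()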
This transformation noticeably reduces the difference in magnitude between the high-variance and low-variance portions of the data.
Now let's look at our initial ACF and PACF plots.
The large spikes in our ACF and PACF plots imply the need for differencing. Let's run a unit root test to confirm that differencing is required.
| kpss_stat | kpss_pvalue |
|---|---|
| 2.263145 | 0.01 |
Since the test statistic is bigger than the 1% critical value, the p-value is less than 0.01. This means \(H_0\) is rejected and differencing is necessary (the data are not stationary).
Let's perform differencing and test whether we need second-order differencing as well.
# KPSS test on the once-differenced ATM2 series
ATM2%>%
  mutate(diff_Cash = difference(Cash))%>%
  features(diff_Cash, unitroot_kpss)
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0105 0.1
The test statistic is small and the p-value is high, so second-order differencing is not required.
Let's look at our ACF and PACF plots with the differencing in place.
The significant spikes at weekly intervals indicate that seasonal differencing is probably more appropriate.
The PACF plot has peaks at lags 2, 5, and 7. The weekly (lag-7) peak in particular suggests that a seasonally differenced series should be examined first.
Let's look at the seasonally differenced ACF and PACF plots for a 7-day season.
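A sketch mirroring the ATM1 code:
ATM2 %>%
  gg_tsdisplay(difference(Cash, 7),
               plot_type = 'partial', lag = 21) +
  labs(title = 'Seasonally Differenced Cash at ATM2')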
The only significant spikes are at lags 5 and 7, pointing to a seasonal MA(1) term with, at most, a non-seasonal MA component out to lag 5. That suggests candidates like \(ARIMA(0,0,5)(0,1,1)_7\) and the purely seasonal \(ARIMA(0,0,0)(0,1,1)_7\), both of which we include below.
fit<-ATM2%>%
model(
arima005011 = ARIMA(Cash~pdq(0,0,5)+PDQ(0,1,1)),
arima000011 = ARIMA(Cash~pdq(0,0,0)+PDQ(0,1,1)),
auto = ARIMA(Cash, stepwise = FALSE, approx = FALSE)
)
piv<-fit%>%pivot_longer(everything(), names_to = 'model name',
values_to = 'Orders')
piv$Orders
## <lst_mdl[3]>
## [1] <ARIMA(0,0,5)(0,1,1)[7]> <ARIMA(0,0,0)(0,1,1)[7]>
## [3] <ARIMA(2,0,2)(0,1,1)[7] w/ drift>
glance(fit)%>%arrange(AICc)%>%select(.model:BIC)
## # A tibble: 3 x 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 auto 628. -1536. 3086. 3086. 3112.
## 2 arima005011 646. -1540. 3094. 3094. 3121.
## 3 arima000011 673. -1549. 3101. 3101. 3109.
The \(AIC_c\) values are so close that I hesitate to choose the auto-selected ARIMA(2,0,2)(0,1,1) model, but for the sake of comparison I will lean that way, since I chose the simpler model for ATM1.
fit%>%
forecast(h=30)%>%
filter(.model == 'auto')%>%
autoplot(ATM2)
Let's take a look at the data from ATM3.
Uh oh, it looks like this ATM just got turned on! Let's see how many values are non-zero before we move forward.
ATM3%>%filter(Cash>10)%>%nrow()
## [1] 3
We only have three non-zero data points for this entire series. You could use a naive prediction method if you absolutely had to produce some sort of forecast, but the lack of data, combined with the lack of any other information about this ATM (is it new? is there missing data?), keeps me from doing even that. I believe the stronger statement is to make no prediction for ATM3.
Let's first look at our raw data for ATM4 plotted as a time series.
This looks like an outlier or a black swan event. Either someone leaned on the zero key during data entry, or some high roller was hitting the tables. I am leaning towards a data entry error, because I doubt an ATM would even have that much cash on hand at a single time.
Let's double-check and see how many of these outlier events occur in this dataset.
ATM4%>%filter(Cash>3000)%>%nrow()
## [1] 1
As suspected, it is just a single event. Because of this, I am going to impute the mean value for ATM4 at this data point.
# Mean of ATM4's Cash excluding the outlier itself, so it does not inflate the imputed value
m_4<-mean(ATM4$Cash[ATM4$Cash<=3000])
ATM4<-ATM4%>%
  mutate(Cash = replace(Cash, Cash>3000, m_4))
Now let's look at our imputed data for ATM4.
ATM4%>%autoplot()
ATM4 seems to have no trend.
Let's look at a decomposed version of the data.
This data looks very clean; the error component looks random at first glance. No transformation is needed here.
The only significant spike seems to be at lag 7, which implies a seasonal model might be appropriate.
The seasonal difference plot seems to muddy the water; this makes me want to consider a seasonal naive model alongside the ARIMA, as sketched below.
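If we wanted that benchmark in the comparison, it could be added like this (a sketch; the fitted results below use only the automatic ARIMA):
# Hypothetical side-by-side of a seasonal naive benchmark and the auto ARIMA
ATM4 %>%
  model(
    snaive = SNAIVE(Cash),
    auto   = ARIMA(Cash, stepwise = FALSE, approx = FALSE)
  ) %>%
  accuracy() %>%
  arrange(RMSE)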
fit<-ATM4%>%
model(
auto = ARIMA(Cash, stepwise = FALSE, approx = FALSE)
)
piv<-fit%>%pivot_longer(everything(), names_to = 'model name',
values_to = 'Orders')
piv$Orders
## <lst_mdl[1]>
## [1] <ARIMA(0,0,0)(2,0,0)[7] w/ mean>
glance(fit)%>%arrange(AICc)
## # A tibble: 1 x 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 auto 122616. -2458. 4925. 4925. 4940. <cpl [14]> <cpl [0]>
Only the automatically selected model was fit for ATM4; it settled on an \(ARIMA(0,0,0)(2,0,0)_7\) with a mean term, which we will use for our prediction.
fit%>%
forecast(h=30)%>%
filter(.model == 'auto')%>%
autoplot(ATM4)
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. The assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file; the variable 'KWH' is power consumption in kilowatt hours, and the rest is straightforward.
First, let's import our dataset from the xlsx file.
dat<-read_xlsx('data/ResidentialCustomerForecastLoad-624.xlsx')
colnames(dat)<-c('case','date','KWH')
dat$date<-yearmonth(dat$date)
dat<-dat%>%as_tsibble(index = date)%>%select(-case)
Let's first look at our data.
There seems to be some sort of black swan event during July 2010 that quickly corrects itself. Let's look at the value and month of that event.
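A sketch of that lookup:
# The month and value of the lowest KWH reading
dat %>%
  filter(KWH == min(KWH, na.rm = TRUE))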
There also seems to be a single NA in our dataset.
min_val<-na.omit(dat$KWH)%>%min()
mu<-mean(na.omit(dat$KWH))
min_val/mu
## [1] 0.1184969
The power consumption in our outlier is roughly 12% of the average power consumption during this period. It also immediately returns to an average level after this singular event.
Scanning through the historical events during July 2010, nothing would have affected the energy market to any great degree, so this either must be a local failure at the plant or a data entry error.
Predicting black swan events in power consumption is a far different problem than predicting future power consumption generally. This is why domain knowledge is so important in predictive analytics.
My strategy will be to impute a mean value for this event: if we end up using some sort of trend or drift, I can't think of a reason to depress our projection because of a single high-leverage point.
I would advise whoever requested this projection to open a second inquiry into why this event happened and whether there is any way to warn of such an event in the future.
m<-mu
dat_impute<-dat%>%
mutate(KWH = replace(KWH, is.na(KWH) |KWH<3000000, m))
dat_impute%>%autoplot()
## Plot variable not specified, automatically selected `.vars = KWH`
Now our data looks far more regular.
Let's take a look at the decomposition of KWH.
dat_impute%>%model(classical_decomposition(type='add'))%>%
components()%>%
autoplot()
## Model not specified, defaulting to automatic modelling of the `KWH` variable. Override this using the model formula.
## Warning: Removed 6 row(s) containing missing values (geom_path).
This data looks very clean: the error component looks random and the seasonal component looks strong.
Let's look at the ACF and PACF charts.
dat<-dat_impute
dat%>%
gg_tsdisplay(plot_type='partial')
## Plot variable not specified, automatically selected `y = KWH`
The oscillatory nature of the ACF indicates seasonality, so let's try seasonal differencing.
dat%>%
gg_tsdisplay(difference(KWH,12),
plot_type = 'partial', lag =36)+
labs(title = 'Seasonally Differenced KWH')
## Warning: Removed 12 row(s) containing missing values (geom_path).
## Warning: Removed 12 rows containing missing values (geom_point).
These peaks in the ACF plot suggest an \(ARIMA(0,0,1)(0,1,1)_{12}\) model. Let's include that alongside the automatic ARIMA search. I'll also include a model with no non-seasonal terms as a baseline.
fit<-dat%>%
model(
arima001011 = ARIMA(KWH~pdq(0,0,1)+PDQ(0,1,1)),
arima000011 = ARIMA(KWH~pdq(0,0,0)+PDQ(0,1,1)),
auto = ARIMA(KWH, stepwise = FALSE, approx = FALSE)
)
piv<-fit%>%pivot_longer(everything(), names_to = 'model name',
values_to = 'Orders')
piv$Orders
## <lst_mdl[3]>
## [1] <ARIMA(0,0,1)(0,1,1)[12] w/ drift> <ARIMA(0,0,0)(0,1,1)[12] w/ drift>
## [3] <ARIMA(0,0,3)(2,1,0)[12] w/ drift>
glance(fit)%>%arrange(AICc)%>%select(.model:BIC)
## # A tibble: 3 x 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 auto 398539133989. -2661. 5336. 5337. 5358.
## 2 arima001011 419521571478. -2667. 5342. 5342. 5355.
## 3 arima000011 471076865814. -2677. 5360. 5360. 5370.
The models are all in the same ballpark, but let's take the best \(AIC_c\) score and use the auto-selected \(ARIMA(0,0,3)(2,1,0)_{12}\) with drift for our prediction.
fit%>%
forecast(h=12)%>%
filter(.model == 'auto')%>%
autoplot(dat)
This model seems to capture the yearly seasonality well.
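As a final sanity check (a sketch, not part of the original output), the residuals of the selected model can be inspected:
fit %>%
  select(auto) %>%
  gg_tsresiduals()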