Predictive Analytics Project 1

Jack Wright

Part A

Forecasting cash withdrawn from ATMs, May 2010

First, let's load the data using the read_xlsx function from the readxl package and take a look at the summary.
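
A minimal sketch of that import step; the workbook name ATM624Data.xlsx and the data/ folder are assumptions here (mirroring the Part B file layout):

library(readxl)      # read_xlsx()
library(tidyverse)
library(lubridate)

# If DATE comes back as an Excel serial number rather than a date-time,
# use as.Date(DATE, origin = "1899-12-30") instead of as_date().
dat <- read_xlsx('data/ATM624Data.xlsx') %>%
  mutate(DATE = as_date(DATE),
         ATM  = as_factor(ATM))

summary(dat)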

##       DATE              ATM           Cash        
##  Min.   :2009-05-01   ATM1:365   Min.   :    0.0  
##  1st Qu.:2009-08-01   ATM2:365   1st Qu.:    0.5  
##  Median :2009-11-01   ATM3:365   Median :   73.0  
##  Mean   :2009-10-31   ATM4:365   Mean   :  155.6  
##  3rd Qu.:2010-02-01   NA's: 14   3rd Qu.:  114.0  
##  Max.   :2010-05-14              Max.   :10919.8  
##                                  NA's   :19

We have three variables in our data: DATE, ATM, and Cash.

Let's follow a few more tidying steps and see what this data looks like.

There are a few kinds of missingness to keep in mind:
Structurally missing: predictors that have no value for a given sample.
Informative missingness: there is a pattern to which values are missing.
Censored data: the exact value is missing, but something is known about it.

NA Values

Only about 1% of our data is missing, but we should still examine it for structural or informative missingness, or for censoring. This graph plots missing data in a dark color and complete data in a light color.
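
The plot comes from vis_miss(), which is also used again further down; a sketch of the call:

library(visdat)   # vis_miss() (also re-exported by the naniar package)

vis_miss(dat)     # dark cells = missing, light cells = observed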

We have two groups of missing data. Any row that is missing its ATM factor is also missing its Cash value; the second group is missing only its Cash value.

Let's first look at the data that is missing just its Cash value.

dat%>%
  filter(!is.na(ATM) & is.na(Cash))
## # A tibble: 5 x 3
##   DATE       ATM    Cash
##   <date>     <fct> <dbl>
## 1 2009-06-13 ATM1     NA
## 2 2009-06-16 ATM1     NA
## 3 2009-06-18 ATM2     NA
## 4 2009-06-22 ATM1     NA
## 5 2009-06-24 ATM2     NA

The rows missing only their Cash value all fall in June 2009 at ATM1 and ATM2. Let's take a look at the values for ATM1 and ATM2 during June 2009.
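
A sketch of that comparison plot (the original chunk was not echoed):

# June 2009 withdrawals for ATM1 and ATM2 (rows with NA Cash are dropped from the plot)
dat %>%
  filter(ATM %in% c('ATM1', 'ATM2'),
         DATE >= as.Date('2009-06-01'),
         DATE <= as.Date('2009-06-30')) %>%
  ggplot(aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  geom_point()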

This data does not appear to be structurally missing; it seems like a couple of fairly random values were dropped. Note how the dates of the missing values do not coincide across ATM1 and ATM2. It was likely an error in recording, not an outlier in the ATMs' actual usage.

MICE might be a fine option here, but I believe a more conservative mean or median imputation would do just fine. It is unlikely to change our seasonal trends too much if one month is slightly out of phase.

I would like to look at the overall trend of ATM1 and ATM2 because I need to decide if I want to impute a local mean (say June’s mean) or the mean of the data as a whole.

It looks like there is a clear trend in the data (at least for ATM2), so I will impute a local mean instead of the global mean for these ATMs. This feels like a nice middle ground between a MICE imputation and a global mean imputation.

# local mean of June 2009 withdrawals for a given ATM, ignoring missing values
get_mean <- function(atm = 'ATM1'){
  dat %>%
    filter(month(DATE) == 6 & ATM == atm) %>%
    na.omit() %>%
    summarize(m = mean(Cash)) %>%
    as.numeric()
}

m_1<-get_mean()
m_2<-get_mean(atm='ATM2')

dat_impute<-dat%>%
  mutate(Cash = replace(Cash, is.na(Cash) & ATM =='ATM1', m_1))
dat_impute<-dat_impute%>%
  mutate(Cash = replace(Cash, is.na(Cash) & ATM =='ATM2', m_2))

vis_miss(dat_impute)

We have successfully replaced the missing values with their local mean.

Now let's look at the data that is missing both the ATM value and the Cash value.

dat_impute%>%filter(is.na(ATM))
## # A tibble: 14 x 3
##    DATE       ATM    Cash
##    <date>     <fct> <dbl>
##  1 2010-05-01 <NA>     NA
##  2 2010-05-02 <NA>     NA
##  3 2010-05-03 <NA>     NA
##  4 2010-05-04 <NA>     NA
##  5 2010-05-05 <NA>     NA
##  6 2010-05-06 <NA>     NA
##  7 2010-05-07 <NA>     NA
##  8 2010-05-08 <NA>     NA
##  9 2010-05-09 <NA>     NA
## 10 2010-05-10 <NA>     NA
## 11 2010-05-11 <NA>     NA
## 12 2010-05-12 <NA>     NA
## 13 2010-05-13 <NA>     NA
## 14 2010-05-14 <NA>     NA
dat_impute$DATE%>%max()
## [1] "2010-05-14"

This data is missing from the end of our series, so now we need to figure out which of the ATMs it is missing from.

dat_impute%>%
  filter(!is.na(Cash))%>%
  group_by(ATM)%>%
  summarize(last_date = max(DATE))
## # A tibble: 4 x 2
##   ATM   last_date 
##   <fct> <date>    
## 1 ATM1  2010-04-30
## 2 ATM2  2010-04-30
## 3 ATM3  2010-04-30
## 4 ATM4  2010-04-30

Interesting… It looks like two weeks of empty cells were tacked onto the end of our data. We will simply drop these rows, since they contain no information about our ATMs or the cash withdrawn.
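
A sketch of that cleanup step (the original chunk was not echoed):

# drop the trailing rows that have neither an ATM label nor a Cash value
dat_impute <- dat_impute %>%
  filter(!is.na(ATM))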

## [1] "2010-04-03"

Let's take one final look at our data to confirm its tidiness.

EDA

While there might be some macro trends in ATM usage, I believe it is wiser to look at each ATM individually, since we have no information about the location of, or connection between, these ATMs beyond the fact that the sampling took place over the same timeframe (one might be in a daycare, while another might be in a casino across the country).
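
The sections below model each ATM from its own tsibble (ATM1 through ATM4), whose construction is not shown. A sketch of how they could be built from the tidied data; make_atm is a hypothetical helper:

library(fpp3)   # tsibble, feasts, fable

# one daily tsibble per ATM
make_atm <- function(name){
  dat_impute %>%
    filter(ATM == name) %>%
    select(-ATM) %>%
    as_tsibble(index = DATE)
}

ATM1 <- make_atm('ATM1')
ATM2 <- make_atm('ATM2')
ATM3 <- make_atm('ATM3')
ATM4 <- make_atm('ATM4')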

ATM1

Let's first look at our raw data plotted as a time series.

There is no clear trend in the data, but it looks like there might be a weekly seasonality.

Let's look at a decomposed version of the data.

The increasing variance in the error is concerning. Let's try this again with a multiplicative decomposition.

There still seems to be a growth effect in the residuals towards the end. Let's try a transformation to smooth out the variance.

# guerrero and box_cox() come from feasts/fabletools (loaded with fpp3)
lambda <- ATM1 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)
ATM1_mutate <- ATM1 %>% mutate(Cash = box_cox(Cash, lambda))

ATM1_mutate%>%model(classical_decomposition(type='add'))%>%
  components()%>%
  autoplot()

This transformation significantly decreases the difference in magnitude between the high-variance and low-variance portions of the data.

ATM1%>%
  gg_tsdisplay(plot_type='partial')
## Plot variable not specified, automatically selected `y = Cash`

Differencing means computing the change between consecutive observations. It can turn a non-stationary series into a stationary one (a requirement for ARIMA modeling).

The large spikes beyond the significance bounds in our ACF and PACF plots suggest the need for differencing. Let's run a unit root test to confirm whether differencing is required.

Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test: \(H_0\): the data are stationary; \(H_A\): the data are not stationary.
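
A sketch of the test call that would produce the table below (unitroot_kpss comes from the feasts feature set):

ATM1 %>%
  features(Cash, unitroot_kpss)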

kpss_stat  kpss_pvalue
 0.309504          0.1

The test statistic (0.31) is below even the 10% critical value (0.347), so the reported p-value of 0.1 (the largest value the function reports) means we fail to reject \(H_0\): the KPSS test does not, on its own, call for ordinary differencing. The strong weekly pattern in the ACF is a seasonal issue that this test does not address.

Let's difference the series once anyway and re-run the test to confirm that no further (second-order) differencing would be needed.
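
A sketch of that step (the same pattern is echoed for ATM2 later on):

ATM1 %>%
  mutate(diff_Cash = difference(Cash)) %>%
  features(diff_Cash, unitroot_kpss)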

## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0102         0.1

The test statistic is small and the p-value is high, so second-order differencing is not required.

We could also have used unitroot_ndiffs to tell us how many differences are required.
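
For reference, a sketch of that shortcut; unitroot_nsdiffs is its seasonal counterpart:

ATM1 %>% features(Cash, unitroot_ndiffs)    # number of ordinary differences needed
ATM1 %>% features(Cash, unitroot_nsdiffs)   # number of seasonal differences needed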

Let's look at our ACF and PACF plots with the differencing in place.

ATM1%>%
  mutate(Cash = difference(Cash))%>%
  gg_tsdisplay(plot_type='partial')
## Plot variable not specified, automatically selected `y = Cash`
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).

The significant spikes at weekly lags indicate that seasonal differencing is probably more appropriate.

Let's look again at the ACF and PACF plots of our data from ATM1 with the transformation in place.

The ACF plot has peaks at lags 2, 5, and 7. The strong weekly spike implies that a seasonally differenced plot should be examined first.

Let's look at the ACF and PACF plots of the seasonally differenced series, using a 7-day season.

ATM1%>%
  gg_tsdisplay(difference(Cash,7),
               plot_type = 'partial', lag =21)+
  labs(title = 'Seasonally Differenced Cash at ATM1')
## Warning: Removed 7 row(s) containing missing values (geom_path).
## Warning: Removed 7 rows containing missing values (geom_point).

This data is still not stationary, but double differencing seems to muddy the picture further, so let's stick with this for now.

The spike at lag 5 in the ACF suggests a non-seasonal MA(1) component, and the spike at lag 7 suggests a seasonal MA(1) component. This implies an ARIMA(0,0,1)(0,1,1)[7] seasonal model, which we will include in our modeling procedure.

The spike at lag 6 is concerning to me, but I want to avoid using such a large non-seasonal lag, because it does not line up with my read of the residuals (which look a lot like white noise).

fit<-ATM1%>%
  model(
    arima001011 = ARIMA(Cash~pdq(0,0,1)+PDQ(0,1,1)),
    arima000011 = ARIMA(Cash~pdq(0,0,0)+PDQ(0,1,1)),
    auto        = ARIMA(Cash, stepwise = FALSE, approx = FALSE)
  )

piv<-fit%>%pivot_longer(everything(), names_to = 'model name',
                    values_to = 'Orders')

piv$Orders
## <lst_mdl[3]>
## [1] <ARIMA(0,0,1)(0,1,1)[7]> <ARIMA(0,0,0)(0,1,1)[7]> <ARIMA(0,0,2)(0,1,1)[7]>
glance(fit)%>%arrange(AICc)%>%select(.model:BIC)
## # A tibble: 3 x 6
##   .model      sigma2 log_lik   AIC  AICc   BIC
##   <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 auto          4.37   -714. 1436. 1437. 1452.
## 2 arima001011   4.42   -717. 1439. 1439. 1451.
## 3 arima000011   4.47   -719. 1441. 1441. 1449.

Although the auto-selected ARIMA(0,0,2)(0,1,1)[7] model has a better \(AIC_c\) value, I do not think it is significantly better than a simpler, more interpretable model. This is why I am selecting the ARIMA(0,0,1)(0,1,1)[7] model for my predictions.

In essence we are choosing between a non-seasonal moving average order of 2 or 1, and the difference in \(AIC_c\) between the two is not large enough to justify the second term.
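
Before committing to that choice, a quick residual check is cheap insurance. A sketch using the usual fpp3 helpers; the lag and dof values are conventional choices, not taken from the original write-up:

# residual diagnostics for the selected ARIMA(0,0,1)(0,1,1)[7] model
fit %>%
  select(arima001011) %>%
  gg_tsresiduals()

# Ljung-Box test on the innovation residuals
# (dof = 2 for the two estimated MA coefficients)
augment(fit) %>%
  filter(.model == 'arima001011') %>%
  features(.innov, ljung_box, lag = 14, dof = 2)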

fit%>%
  forecast(h=30)%>%
  filter(.model == 'arima001011')%>%
  autoplot(ATM1)

ATM2

Let's first look at our raw data plotted as a time series for ATM2.

ATM2 seems to have a clear trend. We also observed this trend during our imputation.

Let's look at a decomposed version of the data.

The increasing variance in the error is similar to ATM1, where a transformation was warranted. Let's use that same transformation again.

This transformation significantly decreases the difference in magnitude between the high-variance and low-variance portions of the data.

Now let's look at our initial ACF and PACF plots.

The large spikes beyond the significance bounds in our ACF and PACF plots suggest the need for differencing. Let's run a unit root test to confirm whether differencing is required.

kpss_stat  kpss_pvalue
 2.263145         0.01

Since the test statistic exceeds the 1% critical value (0.739), the reported p-value of 0.01 means \(p \le 0.01\). We reject \(H_0\): the data are not stationary and differencing is necessary.

Let's perform differencing and test whether we need second-order differencing as well.

ATM2 %>%
  mutate(diff_Cash = difference(Cash)) %>%
  features(diff_Cash, unitroot_kpss)
## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1    0.0105         0.1

The test statistic is small and the p-value is high, so second-order differencing is not required.

Let's look at our ACF and PACF plots with the differencing in place.

The significant spikes at weekly lags indicate that seasonal differencing is probably more appropriate.

The PACF plot has peaks at lags 2, 5, and 7. The weekly spike implies that a seasonally differenced plot should be examined first.

Let's look at the ACF and PACF plots of the seasonally differenced series, using a 7-day season.

The only significant spikes are at lags 5 and 7, which suggests an ARIMA(0,0,0)(0,1,1)[7] as a baseline; we will also try an ARIMA(0,0,5)(0,1,1)[7] to account for the lag-5 spike.

fit<-ATM2%>%
  model(
    arima005011 = ARIMA(Cash~pdq(0,0,5)+PDQ(0,1,1)),
    arima000011 = ARIMA(Cash~pdq(0,0,0)+PDQ(0,1,1)),
    auto        = ARIMA(Cash, stepwise = FALSE, approx = FALSE)
  )

piv<-fit%>%pivot_longer(everything(), names_to = 'model name',
                    values_to = 'Orders')

piv$Orders
## <lst_mdl[3]>
## [1] <ARIMA(0,0,5)(0,1,1)[7]>          <ARIMA(0,0,0)(0,1,1)[7]>         
## [3] <ARIMA(2,0,2)(0,1,1)[7] w/ drift>
glance(fit)%>%arrange(AICc)%>%select(.model:BIC)
## # A tibble: 3 x 6
##   .model      sigma2 log_lik   AIC  AICc   BIC
##   <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 auto          628.  -1536. 3086. 3086. 3112.
## 2 arima005011   646.  -1540. 3094. 3094. 3121.
## 3 arima000011   673.  -1549. 3101. 3101. 3109.

The \(AIC_c\) values are so close that I hesitate to choose the auto-selected ARIMA(2,0,2)(0,1,1)[7] with drift, but for the sake of comparison I will lean that way, since I chose the simpler model for ATM1.

fit%>%
  forecast(h=30)%>%
  filter(.model == 'auto')%>%
  autoplot(ATM2)

ATM3

Let's take a look at the data from ATM3.

Uh oh, it looks like this ATM just got turned on! Let's see how many values are meaningfully non-zero before we move forward.

ATM3%>%filter(Cash>10)%>%nrow()
## [1] 3

We only have 3 meaningful data points for this entire series. You could potentially use a naïve prediction method if you absolutely had to produce some sort of forecast, but our lack of data, combined with our lack of any other information about this ATM (is it new? is there missing data?), keeps me from doing even that. I believe the stronger statement is to not make a prediction for ATM3.

ATM4

Let's first look at our raw data plotted as a time series for ATM4.

This looks like an outlier or a black swan event. Either someone leaned on the zero key during data entry, or some high roller was hitting the tables. I am leaning towards a data entry error, because I doubt an ATM would even have that much cash on hand at a single time.

Let's double-check how many of these outlier events occur in this dataset.

ATM4%>%filter(Cash>3000)%>%nrow()
## [1] 1

As suspected, it is just a single event. Because of this, I am going to impute the mean value for ATM4 at this data point.

# overall mean of ATM4 withdrawals (note: the outlier is still included in this mean)
m_4 <- mean(ATM4$Cash)

ATM4 <- ATM4 %>%
  mutate(Cash = replace(Cash, Cash > 3000, m_4))

Now let's look at our imputed data for ATM4.

ATM4%>%autoplot()

ATM4 seems to have no trend.

Let's look at a decomposed version of the data.

This data looks very clean; the error component looks random at first glance. No transformation is needed at this time.

The only significant spike seems to be at lag 7, which implies a seasonal model might be appropriate.

The seasonal difference plot seems to muddy the water; this makes me want to consider a seasonal naïve benchmark in our model building in addition to our ARIMA, as sketched below.
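
A sketch of how such a benchmark could be added to the comparison (SNAIVE comes from the fable package; training accuracy is used because information criteria are not comparable across model classes):

# compare the automatic ARIMA against a weekly seasonal naive benchmark
fit_compare <- ATM4 %>%
  model(
    snaive = SNAIVE(Cash ~ lag('week')),
    arima  = ARIMA(Cash, stepwise = FALSE, approx = FALSE)
  )

accuracy(fit_compare) %>% arrange(RMSE)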

fit<-ATM4%>%
  model(
    auto        = ARIMA(Cash, stepwise = FALSE, approx = FALSE)
  )

piv<-fit%>%pivot_longer(everything(), names_to = 'model name',
                    values_to = 'Orders')

piv$Orders
## <lst_mdl[1]>
## [1] <ARIMA(0,0,0)(2,0,0)[7] w/ mean>
glance(fit)%>%arrange(AICc)
## # A tibble: 1 x 8
##   .model  sigma2 log_lik   AIC  AICc   BIC ar_roots   ma_roots 
##   <chr>    <dbl>   <dbl> <dbl> <dbl> <dbl> <list>     <list>   
## 1 auto   122616.  -2458. 4925. 4925. 4940. <cpl [14]> <cpl [0]>

Here only the automatic ARIMA search was run; it selected an ARIMA(0,0,0)(2,0,0)[7] with mean, which we will use for the ATM4 forecast.

fit%>%
  forecast(h=30)%>%
  filter(.model == 'auto')%>%
  autoplot(ATM4)

Part B

Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. The assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file; the variable KWH is power consumption in kilowatt hours, and the rest is straightforward.

First, let's import our dataset from the xlsx file.

# read the Excel file, rename the columns, and convert to a monthly tsibble
dat <- read_xlsx('data/ResidentialCustomerForecastLoad-624.xlsx')

colnames(dat) <- c('case', 'date', 'KWH')
dat$date <- yearmonth(dat$date)
dat <- dat %>% as_tsibble(index = date) %>% select(-case)

Let's first look at our data.

There seems to be some sort of black swan event during July 2010 that quickly corrects. Let's look at the value and month of that event.

There also seems to be a single NA in our dataset.
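
A sketch of locating both rows (the original chunk was not echoed):

# the month with the suspiciously low reading
dat %>% filter(KWH == min(KWH, na.rm = TRUE))

# the month with the missing reading
dat %>% filter(is.na(KWH))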

min_val <- na.omit(dat$KWH) %>% min()   # the outlier month's consumption
mu      <- mean(na.omit(dat$KWH))       # mean consumption over the whole series

min_val / mu
## [1] 0.1184969

The power consumption in our outlier month is only about 12% of the average power consumption during this period, and the series immediately returns to an average level after this singular event.

Scanning through the historical events of July 2010, nothing stands out that would have affected the energy market to any great degree, so this must be either a local failure at the plant or a data entry error.

Predicting black swan events in power consumption is a far different problem than predicting future power consumption generally. This is why domain knowledge is so important in predictive analytics.

My strategy will be to impute a mean value for this event, because if we end up using some sort of trend or drift, I can’t think of why we would want to depress our projection due to a single high leverage point.

I would advise whoever requested this projection to start a second inquiry into why this event happened and whether there was any way to get an early warning of such an event.

# replace both the NA and the outlier month with the overall mean
m <- mu

dat_impute <- dat %>%
  mutate(KWH = replace(KWH, is.na(KWH) | KWH < 3000000, m))


dat_impute%>%autoplot()
## Plot variable not specified, automatically selected `.vars = KWH`

Now our data looks far more regular.

Let's take a look at the decomposition of KWH.

dat_impute%>%model(classical_decomposition(type='add'))%>%
  components()%>%
  autoplot()
## Model not specified, defaulting to automatic modelling of the `KWH` variable. Override this using the model formula.
## Warning: Removed 6 row(s) containing missing values (geom_path).

This data looks very clean: the error looks random and the seasonal component looks strong.

Let's look at the ACF and PACF charts.

dat<-dat_impute
dat%>%
  gg_tsdisplay(plot_type='partial')
## Plot variable not specified, automatically selected `y = KWH`

The oscillatory nature of the ACF indicates seasonality, so let's try a seasonal difference.

dat%>%
  gg_tsdisplay(difference(KWH,12),
               plot_type = 'partial', lag =36)+
  labs(title = 'Seasonally Differenced KWH')
## Warning: Removed 12 row(s) containing missing values (geom_path).
## Warning: Removed 12 rows containing missing values (geom_point).

These peaks in the ACF plot suggest an ARIMA(0,0,1)(0,1,1)[12] model. Let's include that alongside the automatic ARIMA search. I'll also include a model with no non-seasonal terms as a baseline.

fit<-dat%>%
  model(
    arima001011 = ARIMA(KWH~pdq(0,0,1)+PDQ(0,1,1)),
    arima000011 = ARIMA(KWH~pdq(0,0,0)+PDQ(0,1,1)),
    auto        = ARIMA(KWH, stepwise = FALSE, approx = FALSE)
  )

piv<-fit%>%pivot_longer(everything(), names_to = 'model name',
                    values_to = 'Orders')

piv$Orders
## <lst_mdl[3]>
## [1] <ARIMA(0,0,1)(0,1,1)[12] w/ drift> <ARIMA(0,0,0)(0,1,1)[12] w/ drift>
## [3] <ARIMA(0,0,3)(2,1,0)[12] w/ drift>
glance(fit)%>%arrange(AICc)%>%select(.model:BIC)
## # A tibble: 3 x 6
##   .model             sigma2 log_lik   AIC  AICc   BIC
##   <chr>               <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 auto        398539133989.  -2661. 5336. 5337. 5358.
## 2 arima001011 419521571478.  -2667. 5342. 5342. 5355.
## 3 arima000011 471076865814.  -2677. 5360. 5360. 5370.

The models are all in the same ballpark, but let's take the lowest \(AIC_c\) and use the suggested ARIMA(0,0,3)(2,1,0)[12] with drift model for our prediction.

fit%>%
  forecast(h=12)%>%
  filter(.model == 'auto')%>%
  autoplot(dat)

This model seems to capture the yearly seasonal pattern well.
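
If the deliverable needs the 2014 numbers themselves, a sketch of pulling the monthly point forecasts and 95% intervals out of the fable (column names follow the standard fable conventions):

fit %>%
  select(auto) %>%
  forecast(h = 12) %>%
  hilo(level = 95) %>%
  as_tibble() %>%
  select(date, .mean, `95%`)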