Part A

Forecast how much cash is taken out of 4 different ATM machines for May 2010.

The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward.

Explain and demonstrate your process, techniques used and not used, and your actual forecast.

Please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

atm <- read_excel("data/ATM624Data.xlsx",col_types = c('date', 'text', 'numeric')) %>% 
  mutate(DATE = as.Date(DATE)) %>%
  as_tsibble(index = DATE, key = ATM) %>%
  filter(DATE < "2010-05-01")

head(atm,5)
## # A tsibble: 5 x 3 [1D]
## # Key:       ATM [1]
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-05-01 ATM1     96
## 2 2009-05-02 ATM1     82
## 3 2009-05-03 ATM1     85
## 4 2009-05-04 ATM1     90
## 5 2009-05-05 ATM1     99

Thought Process

Problem

Forecast how much cash will be withdrawn from ATM1, ATM2, ATM3 and ATM4 during May 2010.

Data

The data is broken down into total daily withdrawals from select ATM’s. It is a very short time frame from 2009 to 2010. There are 1474 withdrawals in the dataset. The dataset has 5 na values that will have to be inspected and dealt with.

None of the individual distributions are normally distributed. This implies that transformations might be necessary. This also implies that a linear model will not be sufficient in predicting the future values.

summary(atm[2:3])
##      ATM                 Cash        
##  Length:1460        Min.   :    0.0  
##  Class :character   1st Qu.:    0.5  
##  Mode  :character   Median :   73.0  
##                     Mean   :  155.6  
##                     3rd Qu.:  114.0  
##                     Max.   :10919.8  
##                     NA's   :5
atm1 <- atm %>% filter(ATM == "ATM1")
summary(atm1)
##       DATE                ATM                 Cash       
##  Min.   :2009-05-01   Length:365         Min.   :  1.00  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 73.00  
##  Median :2009-10-30   Mode  :character   Median : 91.00  
##  Mean   :2009-10-30                      Mean   : 83.89  
##  3rd Qu.:2010-01-29                      3rd Qu.:108.00  
##  Max.   :2010-04-30                      Max.   :180.00  
##                                          NA's   :3
ggplot(atm1, aes(x=Cash)) + 
  geom_density()

atm2 <- atm %>% filter(ATM == "ATM2")
summary(atm2)
##       DATE                ATM                 Cash       
##  Min.   :2009-05-01   Length:365         Min.   :  0.00  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 25.50  
##  Median :2009-10-30   Mode  :character   Median : 67.00  
##  Mean   :2009-10-30                      Mean   : 62.58  
##  3rd Qu.:2010-01-29                      3rd Qu.: 93.00  
##  Max.   :2010-04-30                      Max.   :147.00  
##                                          NA's   :2
ggplot(atm2, aes(x=Cash)) + 
  geom_density()

atm3 <- atm %>% filter(ATM == "ATM3")
summary(atm3)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:365         Min.   : 0.0000  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 0.0000  
##  Median :2009-10-30   Mode  :character   Median : 0.0000  
##  Mean   :2009-10-30                      Mean   : 0.7206  
##  3rd Qu.:2010-01-29                      3rd Qu.: 0.0000  
##  Max.   :2010-04-30                      Max.   :96.0000
ggplot(atm3, aes(x=Cash)) + 
  geom_density()

atm4 <- atm %>% filter(ATM == "ATM4")
summary(atm4)
##       DATE                ATM                 Cash          
##  Min.   :2009-05-01   Length:365         Min.   :    1.563  
##  1st Qu.:2009-07-31   Class :character   1st Qu.:  124.334  
##  Median :2009-10-30   Mode  :character   Median :  403.839  
##  Mean   :2009-10-30                      Mean   :  474.043  
##  3rd Qu.:2010-01-29                      3rd Qu.:  704.507  
##  Max.   :2010-04-30                      Max.   :10919.762
ggplot(atm4, aes(x=Cash)) + 
  geom_density()

NA’s

The data set is daily data so any Na’s is an entire days observation and it could possibly throw a forecast off.

Looking at the summary stats there are 5 na’s, three in atm 1 and two in atm 2. Atm’s 1 and 2 both have different means and medians. The difference is roughly 20 dollars and since we are dealing with smaller amounts of money a 20 dollar difference is a significant percentage of the individual average of each atm’s withdrawal. This makes replacing those 5 na’s with the mean of the overall data unwise as the overall mean is 2 to 3 times bigger than the means of atm 1 and 2. Which would distort those months data. It would be best to use the average of the month for each particular atm.

june_data1 <- atm1 %>% filter(month(DATE) == 6)
atm1$Cash[is.na(atm1$Cash)]<-mean(june_data1$Cash,na.rm=TRUE)
summary(atm1)
##       DATE                ATM                 Cash       
##  Min.   :2009-05-01   Length:365         Min.   :  1.00  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 73.00  
##  Median :2009-10-30   Mode  :character   Median : 90.00  
##  Mean   :2009-10-30                      Mean   : 83.93  
##  3rd Qu.:2010-01-29                      3rd Qu.:108.00  
##  Max.   :2010-04-30                      Max.   :180.00
june_data2 <- atm2 %>% filter(month(DATE) == 6)
atm2$Cash[is.na(atm2$Cash)]<-mean(june_data2$Cash,na.rm=TRUE)
summary(atm2)
##       DATE                ATM                 Cash       
##  Min.   :2009-05-01   Length:365         Min.   :  0.00  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 26.00  
##  Median :2009-10-30   Mode  :character   Median : 67.00  
##  Mean   :2009-10-30                      Mean   : 62.65  
##  3rd Qu.:2010-01-29                      3rd Qu.: 93.00  
##  Max.   :2010-04-30                      Max.   :147.00

Charts

Looking at the charts, atm’s 1 and 2 look to have seasonality at lag 7. Atm 3 only has a day or two worth of transactions. Atm 4 doesn’t look to have seasonality but it does have an outlier that needs to be inspected.

fixed_atm <- rbind(as.data.frame(atm1),as.data.frame(atm2),as.data.frame(atm3),as.data.frame(atm4))

fixed_atm <- fixed_atm %>% as_tsibble(index=DATE,key=ATM)
atm1 %>%
  gg_tsdisplay(Cash,plot_type = 'partial') + 
  labs(title = " ATM 1")

atm2 %>% 
  gg_tsdisplay(Cash,plot_type = 'partial') + 
  labs(title = " ATM 2")

atm3 %>% 
  gg_tsdisplay(Cash,plot_type = 'partial') +  
  labs(title = " ATM 3")

atm4 %>% 
  gg_tsdisplay(Cash,plot_type = 'partial') +  
  labs(title = " ATM 4")

Data Transformation

ATM1

ATM 1 has seasonality at lags with a multiple of 7. Differencing this data will be needed to remove the seasonality.

atm1 %>% gg_tsdisplay(difference(Cash,7) %>% difference(),plot_type = 'partial')

ATM2

ATM 2 has seasonality at lags with a multiple of 7. Differencing this data will be needed to remove the seasonality.

atm2 %>% gg_tsdisplay(difference(Cash,7)%>% difference(),plot_type = 'partial')

ATM4

The outlier in Atm4 occurs on February 9, index 285, which is a few days before Valentines day. Without more years of data to compare and contrast this too, I would make the assumption that this outlier is a rather normal event around that time of year. I also am not sure how the data was inputted, whether it is an automatic electronic transfer or a human data entry error. It is so far out of the norm, that there are two choices for fixing it. It can be removed or it can be altered someway, like removing a digit. I will partially go with my assumption of that particular day being an outlier, but I will also assume the value is input error. To correct the error I am going to remove the zero in the value 10919 and make it 1919. This will still keep this value as the maximum value but its effects are quite strong as the mean dropped by $25.

By altering the outlier the data set now has seasonality and is more like atm’s 1 and 2. The data will now have to be transformed through differencing.

Also since ATM4 seems to have a lot of relatively high values I will do a box-cox transformation on the data set.

atm4[285,'Cash'] <- 1919
summary(atm4)
##       DATE                ATM                 Cash         
##  Min.   :2009-05-01   Length:365         Min.   :   1.563  
##  1st Qu.:2009-07-31   Class :character   1st Qu.: 124.334  
##  Median :2009-10-30   Mode  :character   Median : 403.839  
##  Mean   :2009-10-30                      Mean   : 449.384  
##  3rd Qu.:2010-01-29                      3rd Qu.: 704.507  
##  Max.   :2010-04-30                      Max.   :1919.000
atm4 %>% gg_tsdisplay(Cash,plot_type = 'partial')

lambda4 <- atm4 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm4 %>% gg_tsdisplay(difference(box_cox(Cash,lambda4),7),plot_type = 'partial')

## Models

ATM1 ARIMA (p,d,q) + (P,D,Q) Model

Since AMT1 has seasonality I believe the best model would be an ARIMA model with a seasonal component of ARIMA (p,d,q) + (P,D,Q)

fit_atm1 <- atm1 %>%
  model(
    ATM012011 = ARIMA(Cash ~ pdq(0,1,2) + PDQ(0,1,1)),
    ATM210011 = ARIMA(Cash ~ pdq(2,1,0) + PDQ(0,1,1)),
    auto = ARIMA(Cash, stepwise = FALSE, approx = FALSE))
glance(fit_atm1) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 3 × 6
##   .model    sigma2 log_lik   AIC  AICc   BIC
##   <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 auto        560.  -1641. 3290. 3290. 3306.
## 2 ATM012011   569.  -1643. 3294. 3294. 3309.
## 3 ATM210011   721.  -1682. 3372. 3372. 3387.
fit_atm1 %>% select(auto) %>% gg_tsresiduals()

atm1_pred <- fit_atm1 %>% forecast(h=14)

atm1_pred %>%   
  filter(.model=='auto') %>%
  autoplot(atm1)

for_xl1 <- atm1_pred %>% filter(.model=='auto') %>% as.data.frame() %>% select(c(ATM,DATE,.mean))

ATM2 ARIMA (p,d,q) + (P,D,Q) Model

ATM2 also has seasonality, once again I believe the best model would be an ARIMA model with a seasonal component of ARIMA (p,d,q) + (P,D,Q)

fit_atm2 <- atm2 %>%
  model(
    ATM011011 = ARIMA(Cash ~ pdq(0,1,1) + PDQ(0,1,1)),
    ATM210011 = ARIMA(Cash ~ pdq(2,1,0) + PDQ(0,1,1)),
    auto = ARIMA(Cash, stepwise = FALSE, approx = FALSE))
glance(fit_atm2) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 3 × 6
##   .model    sigma2 log_lik   AIC  AICc   BIC
##   <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl>
## 1 auto        608.  -1655. 3323. 3323. 3346.
## 2 ATM011011   644.  -1666. 3338. 3338. 3349.
## 3 ATM210011   879.  -1717. 3443. 3443. 3458.
fit_atm2 %>% select(auto) %>% gg_tsresiduals()

atm2_pred <- fit_atm2 %>% forecast(h=14)

atm2_pred %>%  
  filter(.model=='auto') %>%
  autoplot(atm2)

for_xl2 <-atm2_pred %>% filter(.model=='auto') %>% as.data.frame() %>% select(c(ATM,DATE,.mean))

ATM3 NAIVE Model

ATM3 only has a few data points, I believe it would be rather foolish to try and predict anything from the data set, so I will use a NAIVE model.

fit_atm3 <- atm3 %>% model(NAIVE(Cash))

atm3_pred <- fit_atm3 %>% forecast(h=14) 

atm3_pred %>% autoplot(atm3)

for_xl3 <-atm3_pred %>% as.data.frame() %>% select(c(ATM,DATE,.mean))

ATM4 ARIMA ARIMA (p,d,q) + (P,D,Q) Model

Like ATM models 1 and 2, atm model 4 will use an ARIMA (p,d,q) + (P,D,Q)

fit_atm4 <- atm4 %>% model(arima = ARIMA(box_cox(Cash,lambda4) ~ 1 + pdq(1,0,1) + PDQ(0,1,1)),
                           ssnl = SNAIVE(box_cox(Cash,lambda4)),
                           auto = ARIMA(box_cox(Cash,lambda4) ~ pdq() + PDQ())) 
glance(fit_atm4)
## # A tibble: 3 × 9
##   ATM   .model sigma2 log_lik   AIC  AICc   BIC ar_roots   ma_roots 
##   <chr> <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <list>     <list>   
## 1 ATM4  arima    141.  -1396. 2803. 2803. 2822. <cpl [1]>  <cpl [8]>
## 2 ATM4  ssnl     241.     NA    NA    NA    NA  <NULL>     <NULL>   
## 3 ATM4  auto     148.  -1429. 2867. 2867. 2882. <cpl [14]> <cpl [0]>
fit_atm4 %>% select(.model = "arima") %>% report()
## Series: Cash 
## Model: ARIMA(1,0,1)(0,1,1)[7] w/ drift 
## Transformation: box_cox(Cash, lambda4) 
## 
## Coefficients:
##          ar1      ma1     sma1  constant
##       0.7498  -0.6861  -0.8685   -0.0030
## s.e.  0.2529   0.2767   0.0330    0.0298
## 
## sigma^2 estimated as 140.8:  log likelihood=-1396.39
## AIC=2802.77   AICc=2802.94   BIC=2822.17
fit_atm4 %>%  
  select(ssnl) %>%
  gg_tsresiduals()

atm4_pred <- fit_atm4 %>% forecast(h=14)
atm4_pred %>%
  filter(.model=='arima') %>%
  autoplot(atm4)

for_xl4 <- atm4_pred %>% filter(.model=='arima') %>% as.data.frame() %>% select(c(ATM,DATE,.mean)) 

All Final Predictions

predictions_merge <- rbind(for_xl1,for_xl2,for_xl3,for_xl4)
write_xlsx(predictions_merge,"ATM_PREDICTIONS.xlsx")





Part B

Thought Process

Problem

Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward.

Data

The data set is a time series of a monthly aggregate of kilowatts per hours from the year 1998 until 2013. The goal is to predict the monthly usage for the year 2014.

The description of the data has a skewness of .17 and a kurtosis of .45, which indicates the data might be normally distributed. However, upon inspecting the density plot, it becomes apparent that the data is not normally distributed.

The data has one missing value, seasonality and has one outlier.

kwh <- read_excel("data/ResidentialCustomerForecastLoad-624.xlsx",col_types = c('numeric', 'text', 'numeric'))
colnames(kwh)[2] <-'yearmonth'

kwh$yearmonth <- seq(as.Date("1998-01-01"), as.Date("2013-12-31"), by = "1 month")
kwh$yearmonth <- yearmonth(kwh$yearmonth)
kwh <- kwh %>% as_tsibble(index=yearmonth)
head(kwh,5)
## # A tsibble: 5 x 3 [1M]
##   CaseSequence yearmonth     KWH
##          <dbl>     <mth>   <dbl>
## 1          733  1998 Jan 6862583
## 2          734  1998 Feb 5838198
## 3          735  1998 Mar 5420658
## 4          736  1998 Apr 5010364
## 5          737  1998 May 4665377
describe(kwh[c(1,3)])
##              vars   n      mean         sd    median   trimmed        mad
## CaseSequence    1 192     828.5      55.57     828.5     828.5      71.16
## KWH             2 191 6502474.6 1447570.89 6283324.0 6439474.9 1543073.77
##                 min      max   range skew kurtosis        se
## CaseSequence    733      924     191 0.00    -1.22      4.01
## KWH          770523 10655730 9885207 0.17     0.45 104742.55
ggplot(kwh, aes(x=KWH)) + 
  geom_density()

kwh %>% gg_tsdisplay(KWH,plot_type='partial')

colSums(is.na(kwh))
## CaseSequence    yearmonth          KWH 
##            0            0            1
kwh %>% autoplot(KWH)

Fixing Na’s

Since the data has seasonality it is best if the imputed value is derived from like values. The missing value appears in September of 2008. The monthly mean of September values is 7702333. This value will be used to replace the na. By doing this the mean value for the months of September remained the same. While the 1st and 3rd quartiles changed slightly, as did the median value.

sept <- kwh %>% filter(month(yearmonth)==9)
sept
## # A tsibble: 16 x 3 [1M]
##    CaseSequence yearmonth     KWH
##           <dbl>     <mth>   <dbl>
##  1          741  1998 Sep 6989888
##  2          753  1999 Sep 7899368
##  3          765  2000 Sep 8912169
##  4          777  2001 Sep 7112069
##  5          789  2002 Sep 8245227
##  6          801  2003 Sep 7791791
##  7          813  2004 Sep 6690366
##  8          825  2005 Sep 7057213
##  9          837  2006 Sep 7296355
## 10          849  2007 Sep 7666970
## 11          861  2008 Sep      NA
## 12          873  2009 Sep 7583146
## 13          885  2010 Sep 7819472
## 14          897  2011 Sep 8943599
## 15          909  2012 Sep 7559148
## 16          921  2013 Sep 7968220
summary(sept$KWH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 6690366 7204212 7666970 7702333 7933794 8943599       1
kwh[129,'KWH'] = 7702333
sept <- kwh %>% filter(month(yearmonth)==9)
sept
## # A tsibble: 16 x 3 [1M]
##    CaseSequence yearmonth     KWH
##           <dbl>     <mth>   <dbl>
##  1          741  1998 Sep 6989888
##  2          753  1999 Sep 7899368
##  3          765  2000 Sep 8912169
##  4          777  2001 Sep 7112069
##  5          789  2002 Sep 8245227
##  6          801  2003 Sep 7791791
##  7          813  2004 Sep 6690366
##  8          825  2005 Sep 7057213
##  9          837  2006 Sep 7296355
## 10          849  2007 Sep 7666970
## 11          861  2008 Sep 7702333
## 12          873  2009 Sep 7583146
## 13          885  2010 Sep 7819472
## 14          897  2011 Sep 8943599
## 15          909  2012 Sep 7559148
## 16          921  2013 Sep 7968220
summary(sept$KWH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 6690366 7250284 7684652 7702333 7916581 8943599
kwh %>% gg_tsdisplay(KWH,plot_type='partial')

Outliers

I don’t know what caused electricity usage to drop by 88% from June 2010 to July 2010. The drop is clearly significant when looking at the plot of the data so it has to be dealt with or it will skew the results of the forecasts.

This data set is over a long period of time, so the outlier looks more like a one time anomaly. In order to fix this outlier, I won’t remove it, instead I will do the same for the outlier that I did for the na. I will take the monthly average and replace the value.

july <- kwh %>% filter(month(yearmonth)==8)
summary(july$KWH)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  7309774  7731092  8149929  8298211  8509231 10308076
kwh[151,'KWH'] = 8298211
july <- kwh %>% filter(month(yearmonth)==8)
summary(july$KWH)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  7309774  7731092  8149929  8298211  8509231 10308076

Transformations

There isn’t a lot of difference in variance throughout the data. The variance isn’t growing as time goes on. However I would like to see if doing a box-cox transformation on the data set has any meaningful bearing on the forecasts.

lambda <- kwh %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)
kwh %>% gg_tsdisplay(difference(box_cox(KWH,lambda),12), plot_type='partial', lag =36) +
  labs(title = "Seasonal Differenced(12)  KWH")

Models

All the models will be arima with seasonality in them. I will use some modesl with custom pdq’s and PDQ’s versus models that are auto chosen. Also there will be a box-cox transformed model with the same custom pdq values and an auto model.

kwh_models <- kwh %>% model(arima_bc = ARIMA(box_cox(KWH,lambda) ~ 0 + pdq(0,1,2) + PDQ(0,1,1)),
                            arima_no_bc = ARIMA(KWH ~ 0 + pdq(0,1,2) + PDQ(0,1,1)),
                            arima_bc1 = ARIMA(box_cox(KWH,lambda) ~ 0 + pdq(1,1,2) + PDQ(1,1,2)),
                            arima_no_bc1 = ARIMA(KWH ~ 0 + pdq(1,1,2) + PDQ(1,1,2)),
                            auto_bc = ARIMA(box_cox(KWH,lambda) ~ 0+  pdq() + PDQ()),
                            auto_no_bc = ARIMA(KWH ~ 0+  pdq() + PDQ()),
                            )
glance(kwh_models) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 6 × 6
##   .model         sigma2 log_lik    AIC   AICc    BIC
##   <chr>           <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
## 1 auto_bc      1.10e- 5    782. -1553. -1552. -1534.
## 2 arima_bc     1.10e- 5    776. -1545. -1545. -1532.
## 3 arima_bc1    1.09e- 5    779. -1544. -1544. -1522.
## 4 arima_no_bc1 3.86e+11  -2645.  5304.  5304.  5326.
## 5 arima_no_bc  3.95e+11  -2648.  5305.  5305.  5317.
## 6 auto_no_bc   3.87e+11  -2658.  5330.  5331.  5352.

Final Model Selection

In order to select the final model for the predictions I am going to compare the auto_bc model and the arima_no_bc1 models to the years of 2012 and 2013.

auto_bc <- kwh_models %>%
  forecast(h=12) %>%
  filter(.model=='auto_bc') 

arima_no_bc1 <- kwh_models %>%
  forecast(h=12) %>%
  filter(.model=='arima_no_bc1')
au_bc <- auto_bc$.mean 
ar_nbc1 <- arima_no_bc1$.mean

kwh_2013 <- kwh %>% filter_index('2013'~'2014')
kwh_2012 <- kwh %>% filter_index('2012'~'2013')
all_preds <- as.data.frame(cbind(seq(1:12), au_bc, ar_nbc1, kwh_2013$KWH,kwh_2012$KWH))
names(all_preds)[1] = 'index'
names(all_preds)[2] = 'auto_bc'
names(all_preds)[3] = 'arima_no_bc1'
names(all_preds)[4] = 'KWH_2013'
names(all_preds)[5] = 'KWH_2012'
data_long <- melt(all_preds, id = "index")
ggplot(data_long, aes(x =index, y = value, color = variable)) +  
   geom_line()

The choice between models comes down to a choice between where do you expect the most growth. The auto_bc model(red) predicts the most growth from months 6 to 8. While the arima_no_bc model(green) expects more growth in months 3 through 5 and months 10 and 11. Honestly, looking at this data it looks like an average between the two forecasts would probably be best. However, that is not an option here. Since the data is seasonal for every 12th month and every multiple of 3, the auto_bc is slightly higher than 2012 and 2013 at month 3, yet quite significantly higher at month 6. The arima_no_bc is the opposite of the auto_bc at those points. It is significantly higher at point three and basically the same as 2012 and 2013 at point 6. Points 9 and 12 are similar for both the auto_bc and arima_no_bc models. In the end the final model that I will choose is the auto_bc model because of the growth during the summer months, rather than the growth during the spring and fall.

write_xlsx(as.data.frame(au_bc),"ELECTRICITY_PREDICTIONS.xlsx")
kwh_models %>%
  forecast(h=12) %>%
  filter(.model=='auto_bc') %>%
  autoplot(kwh)