Part A | Forecasting ATM Withdrawals

In Part A, I want you to forecast how much cash is taken out of four different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a little more of a business feel. Explain and demonstrate your process, techniques used and not used, and your actual forecast. The data is provided via an Excel file; please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link, along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.


Load data set

#loads the workbook as a data frame (tibble)
atm <- read_excel('ATM624Data.xlsx', col_names=TRUE, col_types=c('date', 'text', 'numeric'))
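For reference, the chunks in this report assume the packages below are attached. This is a sketch of the setup chunk, with package names inferred from the functions used throughout the report.

library(readxl)    # read_excel
library(knitr)     # kable
library(forecast)  # auto.arima, Arima, BoxCox, tsoutliers, naive, forecast, autoplot
library(tseries)   # kpss.test
library(urca)      # ur.kpss
library(ggplot2)   # autoplot, geom_hline
library(gridExtra) # grid.arrange
library(magrittr)  # the %>% pipe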

Preview data

As you can see, the ‘ATM’ column identifies the ATM machine: ATM1, ATM2, ATM3, ATM4. ‘Cash’ is expressed in hundreds of dollars.

unique(atm$ATM)
## [1] "ATM1" "ATM2" NA     "ATM3" "ATM4"
kable(head(atm))
DATE ATM Cash
2009-05-01 ATM1 96
2009-05-01 ATM2 107
2009-05-02 ATM1 82
2009-05-02 ATM2 89
2009-05-03 ATM1 85
2009-05-03 ATM2 90

Determine if there are any missing values in the data set

The ‘DATE’ column does not have any missing values.

any(is.na(atm$DATE))
## [1] FALSE

The ‘ATM’ column has some missing values. It is difficult to determine which ATM machine a row with a missing ‘ATM’ value belongs to, so the approach I am going to take is to drop those rows.

any(is.na(atm$ATM))
## [1] TRUE

Upon further investigation, the rows with missing ‘ATM’ values are not a concern: they are the blank placeholder rows for May 2010, the month we want to forecast. We can drop them.

kable(atm[is.na(atm$ATM),])
DATE ATM Cash
2010-05-01 NA NA
2010-05-02 NA NA
2010-05-03 NA NA
2010-05-04 NA NA
2010-05-05 NA NA
2010-05-06 NA NA
2010-05-07 NA NA
2010-05-08 NA NA
2010-05-09 NA NA
2010-05-10 NA NA
2010-05-11 NA NA
2010-05-12 NA NA
2010-05-13 NA NA
2010-05-14 NA NA

Drop rows with missing ATM values

The code below drops the rows shown in the table above, i.e. those with missing ‘ATM’ values. They are not relevant, as they are simply the blank placeholder rows for May 2010.

atm <- atm[!is.na(atm$ATM),]

Are there any rows with missing ‘Cash’ values? The answer is YES.

any(is.na(atm$Cash))
## [1] TRUE

We have 5 rows with missing ‘Cash’ values, all of which fall in June 2009. I fill each one with the average of the previous day’s and the next day’s withdrawals for the same ATM. For example, the missing 6/13/2009 value for ATM1 is replaced with the average of the 6/12/2009 and 6/14/2009 values.

kable(atm[is.na(atm$Cash),])
DATE ATM Cash
2009-06-13 ATM1 NA
2009-06-16 ATM1 NA
2009-06-18 ATM2 NA
2009-06-22 ATM1 NA
2009-06-24 ATM2 NA

Before assigning values to missing Cash, we’re going to sort the data by ATM first and then by DATE.

atm <- atm[with(atm, order(ATM, DATE)),]

Assign missing cash values by taking average of previous and next day.

#Convert atm$DATE to 'Date' class. Originally, it is  "POSIXct" "POSIXt" class. 
atm$DATE = as.Date(atm$DATE)

#assign missing values for ATM1 with following dates by taking average of previous and next day
atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-13"), ]$Cash <- 
    (atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-12"), ]$Cash + atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-14"), ]$Cash)/2

atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-16"), ]$Cash <- 
    (atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-15"), ]$Cash + atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-17"), ]$Cash)/2

atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-22"), ]$Cash <- 
    (atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-21"), ]$Cash + atm[atm$ATM == 'ATM1' & atm$DATE == as.Date("2009-06-23"), ]$Cash)/2

#assign missing values for ATM2 with following dates by taking average of previous and next day
atm[atm$ATM == 'ATM2' & atm$DATE == as.Date("2009-06-18"), ]$Cash <- 
    (atm[atm$ATM == 'ATM2' & atm$DATE == as.Date("2009-06-17"), ]$Cash + atm[atm$ATM == 'ATM2' & atm$DATE == as.Date("2009-06-19"), ]$Cash)/2

atm[atm$ATM == 'ATM2' & atm$DATE == as.Date("2009-06-24"), ]$Cash <- 
    (atm[atm$ATM == 'ATM2' & atm$DATE == as.Date("2009-06-23"), ]$Cash + atm[atm$ATM == 'ATM2' & atm$DATE == as.Date("2009-06-25"), ]$Cash)/2

Below are the values assigned to the previously missing ‘Cash’ entries.

DATE ATM Cash
2009-06-13 ATM1 131.0
2009-06-16 ATM1 107.0
2009-06-22 ATM1 111.5
2009-06-18 ATM2 79.0
2009-06-24 ATM2 51.0

Create a separate data subset for each ATM machine.

atm1 <- atm[atm$ATM == 'ATM1',]
atm2 <- atm[atm$ATM == 'ATM2',]
atm3 <- atm[atm$ATM == 'ATM3',]
atm4 <- atm[atm$ATM == 'ATM4',]

Time Series Analysis

Create time series objects for each ATM, using a weekly frequency (7) so the models can capture day-of-week seasonality. Note that ts() only uses the first two elements of start, so the third element passed below is ignored.

atm1_ts <- ts(atm1["Cash"], start=c(2009, 05, 01), frequency=7)
atm2_ts <- ts(atm2["Cash"], start=c(2009, 05, 01), frequency=7)
atm3_ts <- ts(atm3["Cash"], start=c(2009, 05, 01), frequency=7)
atm4_ts <- ts(atm4["Cash"], start=c(2009, 05, 01), frequency=7)


ATM 1

I don’t see any obvious trend in the data. The ACF and PACF show significant correlation at lag 7, which points to weekly seasonality.

tsdisplay(atm1_ts)

Stationarity: KPSS test

  • null hypothesis: time series is stationary
  • alternative: time series is not stationary

The KPSS test has a p-value of 0.07863. This is above 0.05, so strictly we fail to reject the null hypothesis at the 5% level, but the result is marginal (we would reject at the 10% level). I will treat the ATM 1 series as non-stationary and difference it.

kpss.test(atm1_ts, null="Trend")
## 
##  KPSS Test for Trend Stationarity
## 
## data:  atm1_ts
## KPSS Trend = 0.13054, Truncation lag parameter = 5, p-value =
## 0.07863

To stabilize the variance, a Box-Cox transformation is going to be applied. Save lambda and use this later.

(atm1_lambda = BoxCox.lambda(atm1_ts))
## [1] 0.2615708

Below is the plot after a Box-Cox transformation and first differencing. There are correlation spikes at lags 7, 14, and 21 in the ACF.

atm1_ts %>% BoxCox(atm1_lambda) %>% diff(1) %>% tsdisplay()


ATM1 Model 1 (auto.arima): ARIMA(1,1,0)(0,1,1)[7], AICc=1384.94

atm1_model1 <- auto.arima(atm1_ts, d=1, max.p=10, max.q=10, lambda = atm1_lambda)
summary(atm1_model1)
## Series: atm1_ts 
## ARIMA(1,1,0)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.2615708 
## 
## Coefficients:
##           ar1     sma1
##       -0.3908  -0.5994
## s.e.   0.0486   0.0452
## 
## sigma^2 estimated as 2.776:  log likelihood=-689.43
## AIC=1384.87   AICc=1384.94   BIC=1396.5
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE     MAPE     MASE
## Training set -0.3365687 34.89839 21.47221 -106.0333 128.8101 1.213234
##                    ACF1
## Training set -0.1152877

ATM1 Model 2: ARIMA(7,1,8)(0,1,1)[7], AICc=1243.89

After some trial and error, ARIMA(7,1,8)(0,1,1)[7] turns out to have a noticeably lower AICc than the auto.arima model.
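To make ‘trial and error’ concrete, the search can be sketched as a small loop over candidate non-seasonal orders, keeping the fit with the lowest AICc. This is only an illustration of the idea, not the exact sequence of models I tried, and the candidate grid is my own choice.

#compare AICc over a small grid of (p, q), holding d=1 and the seasonal
#part (0,1,1)[7] fixed
candidates <- expand.grid(p = c(1, 3, 5, 7), q = c(1, 4, 8))
candidates$aicc <- apply(candidates, 1, function(o) {
  fit <- try(Arima(atm1_ts, order = c(o[["p"]], 1, o[["q"]]),
                   seasonal = c(0, 1, 1), lambda = atm1_lambda), silent = TRUE)
  if (inherits(fit, "try-error")) NA_real_ else fit$aicc
})
candidates[which.min(candidates$aicc), ]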

atm1_model2 <- Arima(atm1_ts, order=c(7,1,8), seasonal=c(0,1,1), lambda = atm1_lambda)
summary(atm1_model2)
## Series: atm1_ts 
## ARIMA(7,1,8)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.2615708 
## 
## Coefficients:
##           ar1     ar2      ar3     ar4      ar5     ar6     ar7      ma1
##       -0.2389  0.3859  -0.2668  0.1050  -0.1147  0.1169  0.7039  -0.6283
## s.e.   0.1642  0.1825   0.1572  0.1673   0.1686  0.1440  0.1519   0.1735
##           ma2     ma3      ma4      ma5      ma6      ma7     ma8     sma1
##       -0.8359  0.7028  -0.2262  -0.0460  -0.1756  -0.3951  0.6043  -0.7315
## s.e.   0.1607  0.2708   0.2370   0.2722   0.2455   0.1629  0.1625   0.0624
## 
## sigma^2 estimated as 1.728:  log likelihood=-604.04
## AIC=1242.08   AICc=1243.89   BIC=1308
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.4596644 24.11322 15.66045 -78.80768 95.82487 0.8848551
##                    ACF1
## Training set 0.04340309

Residuals of selected model

The residuals appear approximately normal, and the residual ACF values are within the significance bounds. The Ljung-Box test has a p-value of 0.5562, so we fail to reject the null hypothesis that the residuals are white noise.

checkresiduals(atm1_model2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(7,1,8)(0,1,1)[7]
## Q* = 2.0787, df = 3, p-value = 0.5562
## 
## Model df: 16.   Total lags used: 19

Forecast for next 31 days of selected model for ATM 1

autoplot(forecast(atm1_model2, h=31))


ATM 2

I don’t see any obvious trend in the data. The ACF and PACF show multiple correlation spikes.

tsdisplay(atm2_ts)

Stationarity: KPSS test

  • null hypothesis: time series is stationary
  • alternative: time series is not stationary

The KPSS test has a p-value of 0.06702. As with ATM 1, this is above 0.05 but marginal, so strictly we fail to reject the null hypothesis at the 5% level. I will still treat the ATM 2 series as non-stationary and difference it.

kpss.test(atm2_ts, null="Trend")
## 
##  KPSS Test for Trend Stationarity
## 
## data:  atm2_ts
## KPSS Trend = 0.13681, Truncation lag parameter = 5, p-value =
## 0.06702

To stabilize the variance, a Box-Cox transformation is going to be applied. Save lambda and use this later.

(atm2_lambda = BoxCox.lambda(atm2_ts))
## [1] 0.7242585

Below is plot with Box-Cox transformation and differencing of 1.

atm2_ts %>% BoxCox(atm2_lambda) %>% diff(1) %>% tsdisplay()


ATM2 Model 1 (auto.arima): ARIMA(1,1,0)(0,1,1)[7], AICc=2707.4

atm2_model1 <- auto.arima(atm2_ts, d=1, max.p=10, max.q=10, lambda = atm2_lambda)
summary(atm2_model1)
## Series: atm2_ts 
## ARIMA(1,1,0)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.7242585 
## 
## Coefficients:
##           ar1     sma1
##       -0.4391  -0.5305
## s.e.   0.0479   0.0469
## 
## sigma^2 estimated as 113:  log likelihood=-1350.66
## AIC=2707.33   AICc=2707.4   BIC=2718.96
## 
## Training set error measures:
##                     ME     RMSE      MAE MPE MAPE     MASE       ACF1
## Training set 0.2513631 32.42686 21.68245 NaN  Inf 1.043042 -0.1725429

ATM2 Model 2: ARIMA(7,1,5)(0,1,1)[7] , AICc=2541.61

After some trial and error, ARIMA(7,1,5)(0,1,1)[7] turns out to have a noticeably lower AICc than the auto.arima model.

atm2_model2 <- Arima(atm2_ts, order=c(7,1,5), seasonal=c(0,1,1), lambda = atm2_lambda)
summary(atm2_model2)
## Series: atm2_ts 
## ARIMA(7,1,5)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.7242585 
## 
## Coefficients:
##           ar1      ar2      ar3      ar4      ar5      ar6      ar7
##       -1.0555  -1.6784  -0.6883  -0.5785  -0.2066  -0.1997  -0.2077
## s.e.   0.1516   0.2750   0.2892   0.2432   0.1282   0.0966   0.0696
##          ma1     ma2      ma3      ma4      ma5     sma1
##       0.1325  0.5394  -1.0202  -0.1707  -0.4809  -0.7574
## s.e.  0.1523  0.1917   0.0309   0.1448   0.1967   0.0457
## 
## sigma^2 estimated as 65.79:  log likelihood=-1256.19
## AIC=2540.39   AICc=2541.61   BIC=2594.68
## 
## Training set error measures:
##                     ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 0.6039325 23.84162 16.68036 -Inf  Inf 0.8024147 -0.01275306

Residuals of selected model

The residuals appear approximately normal, and the residual ACF values are within the significance bounds. The Ljung-Box test has a p-value of 0.3098, so we fail to reject the null hypothesis that the residuals are white noise.

checkresiduals(atm2_model2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(7,1,5)(0,1,1)[7]
## Q* = 3.5855, df = 3, p-value = 0.3098
## 
## Model df: 13.   Total lags used: 16

Forecast for next 31 days of selected model for ATM 2

autoplot(forecast(atm2_model2, h=31))


ATM 3

It appears that ATM 3 only recently became operational, so we do not have much data to use for forecasting.

tsdisplay(atm3_ts)
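To make this concrete, a quick check (sketch) lists the days on which ATM 3 actually recorded withdrawals.

#days with non-zero withdrawals for ATM 3
atm3[atm3$Cash > 0, ]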


Forecast for the next 31 days for ATM 3

For this series, a simple naive forecast is used, where every forecasted value is the value of the last observation.

plot(naive(atm3_ts, h=31))
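As a sanity check on what naive() produces, every point forecast is simply the last observed value repeated. A small sketch (fc_atm3 is a hypothetical name):

fc_atm3 <- naive(atm3_ts, h=31)
#all 31 point forecasts equal the final observation of the series
all(as.numeric(fc_atm3$mean) == as.numeric(tail(atm3_ts, 1)))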


ATM 4

The ATM 4 series appears to have one extreme outlier. I’m going to use the tsoutliers function to get more insight into it.

tsdisplay(atm4_ts)

Below is the replacement recommendation from the tsoutliers function.

tsoutliers(atm4_ts)
## $index
## [1] 285
## 
## $replacements
## [1] 190.9351

Replace outlier with recommended value.

atm4_ts[285] <- 190.9351
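Equivalently, the replacement can be applied without hard-coding the index or the value, which is safer if the data are refreshed. A small sketch (atm4_out is a hypothetical name):

#apply whatever tsoutliers() recommends; same result as the manual line above
atm4_out <- tsoutliers(atm4_ts)
atm4_ts[atm4_out$index] <- atm4_out$replacements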

Updated Plot of ATM 4

I don’t see any obvious trend. There are multiple spikes in the ACF and PACF, and the two have very similar patterns.

tsdisplay(atm4_ts)

Stationarity: KPSS test

  • null hypothesis: time series is stationary
  • alternative: time series is not stationary

The test statistic of 0.1223 is less than all of the critical values (even the 10% value of 0.347), so we fail to reject the null hypothesis that the series is stationary. There is no need for first differencing.

#https://stats.stackexchange.com/questions/160810/interpretation-of-critical-values-of-kpss-test
ur.kpss(atm4_ts) %>% summary
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.1223 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

To stabilize the variance, a Box-Cox transformation is going to be applied. Save lambda and use this later.

(atm4_lambda = BoxCox.lambda(atm4_ts))
## [1] 0.4492823

Below is the plot with the Box-Cox transformation applied. The ACF and PACF still look very similar.

atm4_ts %>% BoxCox(atm4_lambda) %>% tsdisplay()


ATM4 Model 1 (auto.arima): ARIMA(0,0,1)(2,0,0)[7], AICc=2929.39

atm4_model1 <- auto.arima(atm4_ts, max.p=10, max.q=10, lambda = atm4_lambda)
summary(atm4_model1)
## Series: atm4_ts 
## ARIMA(0,0,1)(2,0,0)[7] with non-zero mean 
## Box Cox transformation: lambda= 0.4492823 
## 
## Coefficients:
##          ma1    sar1    sar2     mean
##       0.0795  0.2074  0.2030  28.5700
## s.e.  0.0527  0.0516  0.0524   1.2384
## 
## sigma^2 estimated as 175.6:  log likelihood=-1459.61
## AIC=2929.23   AICc=2929.39   BIC=2948.73
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 84.92431 352.1777 274.4794 -367.8509 412.3995 0.7923398
##                    ACF1
## Training set 0.01963427

ATM4 Model 2: ARIMA(0,0,0)(0,1,1)[7], AICc=2858.68

After trying a few alternatives, this model has a noticeably lower AICc than the auto.arima model.

atm4_model2 <- Arima(atm4_ts, order=c(0,0,0), seasonal = c(0,1,1), lambda = atm4_lambda)
summary(atm4_model2)
## Series: atm4_ts 
## ARIMA(0,0,0)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.4492823 
## 
## Coefficients:
##          sma1
##       -0.8698
## s.e.   0.0317
## 
## sigma^2 estimated as 165.9:  log likelihood=-1427.32
## AIC=2858.65   AICc=2858.68   BIC=2866.41
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 63.29268 339.3257 256.9349 -371.0955 414.7838 0.7416942
##                    ACF1
## Training set 0.09999483

Residuals of selected model

The residuals appear somewhat skewed but roughly normal in shape, and the residual ACF values are within the significance bounds. The Ljung-Box test has a p-value of 0.06942; this is above 0.05, so we do not reject the null hypothesis that the residuals are white noise, but the result is marginal and some autocorrelation may remain.

checkresiduals(atm4_model2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(0,1,1)[7]
## Q* = 21.181, df = 13, p-value = 0.06942
## 
## Model df: 1.   Total lags used: 14

Forecast for next 31 days of selected model for ATM 4

autoplot(forecast(atm4_model2, h=31))



Forecasting of four ATM machines

The forecasts below for the four ATM machines for the next 31 days (May 2010) seem reasonable.

grid.arrange(ncol=2, nrow=2,
             forecast(atm1_model2, h=31) %>% autoplot,
             forecast(atm2_model2, h=31) %>% autoplot,
             naive(atm3_ts, h=31) %>% autoplot,
             forecast(atm4_model2, h=31) %>% autoplot)

Write the forecasts out to CSV files.

write.csv(forecast(atm1_model2, h=31), "ATM1_forecast.csv")
write.csv(forecast(atm2_model2, h=31), "ATM2_forecast.csv")
write.csv(naive(atm3_ts, h=31), "ATM3_forecast.csv")
write.csv(forecast(atm4_model2, h=31), "ATM4_forecast.csv")
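Since the assignment asks for an Excel-readable deliverable, the four forecasts can also be combined into a single file keyed by date and ATM. This is a sketch; the file name and column layout are my own choices.

#combine the four May 2010 point forecasts into one Excel-readable CSV
may_dates <- seq(as.Date("2010-05-01"), as.Date("2010-05-31"), by = "day")
combined_fc <- data.frame(
  DATE = rep(may_dates, times = 4),
  ATM  = rep(c("ATM1", "ATM2", "ATM3", "ATM4"), each = 31),
  Cash = c(as.numeric(forecast(atm1_model2, h=31)$mean),
           as.numeric(forecast(atm2_model2, h=31)$mean),
           as.numeric(naive(atm3_ts, h=31)$mean),
           as.numeric(forecast(atm4_model2, h=31)$mean))
)
write.csv(combined_fc, "ATM_forecasts_May2010.csv", row.names = FALSE)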


Part B | Forecasting Residential Power Load

Part B consists of a simple dataset of residential power usage from January 1998 through December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward.


Read data set file.

power_data <- read_excel('ResidentialCustomerForecastLoad-624.xlsx', col_names=TRUE)

Preview file.

kable(head(power_data))
CaseSequence YYYY-MMM KWH
733 1998-Jan 6862583
734 1998-Feb 5838198
735 1998-Mar 5420658
736 1998-Apr 5010364
737 1998-May 4665377
738 1998-Jun 6467147

Summary of power data

summary(power_data)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
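Note that the summary shows one missing ‘KWH’ value. A quick check (sketch) locates the affected observation; I leave it as NA, since stats::arima, which both Arima and auto.arima call, allows missing values in the series.

#which observation has the missing KWH value?
power_data[is.na(power_data$KWH), ]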

Create time series object

power_ts <- ts(power_data[, "KWH"], start = c(1998, 1), frequency = 12)

Outliers

The tsoutliers function shows 2 outlier points in the series with recommended replacement values.

tsoutliers(power_ts)
## $index
## [1] 151 192
## 
## $replacements
## [1] 7696306 7028762

Below are the current values at the indices flagged above (no replacement has been made yet).

(power_ts[151])
## [1] 770523
(power_ts[192]) 
## [1] 9606304

The red and blue horizontal lines below mark the two flagged values. The value at the red line is high, but not unreasonably high relative to the surrounding observations, so I will keep it. The value at the blue line is far too low, so I will replace it.

autoplot(power_ts) + geom_hline(yintercept=770523, color='blue') + geom_hline(yintercept=9606304, color='red')

Replacing the outlier value indicated by the blue line.

power_ts[151] <- 7696306

Adjusted plot after replacement

autoplot(power_ts)

Time Series Analysis

There is a slight upward trend towards the right of the plot; otherwise the series looks fairly stable.

tsdisplay(power_ts)

Stationarity: KPSS test

  • null hypothesis: time series is stationary
  • alternative: time series is not stationary

The KPSS test has a p-value of 0.02475, which is below 0.05, so we reject the null hypothesis: the series is not stationary.

kpss.test(power_ts, null="Trend")
## 
##  KPSS Test for Trend Stationarity
## 
## data:  power_ts
## KPSS Trend = 0.17667, Truncation lag parameter = 4, p-value =
## 0.02475

To stabilize the variance, a Box-Cox transformation is going to be applied. Save lambda and use this later.

(power_lambda = BoxCox.lambda(power_ts))
## [1] -0.1977787

The ndiffs function recommends differencing of order 1.

ndiffs(power_ts)
## [1] 1

Below is the plot with the Box-Cox transformation and first differencing applied. The slight upward trend is no longer present.

power_ts %>% BoxCox(power_lambda) %>% diff(1) %>% tsdisplay()


Model 1 (auto.arima): ARIMA(0,1,1)(2,1,0)[12] , AICc=-1429.06

power_model1 <- auto.arima(power_ts, d=1, max.p=10, max.q=10, lambda = power_lambda)
summary(power_model1)
## Series: power_ts 
## ARIMA(0,1,1)(2,1,0)[12] 
## Box Cox transformation: lambda= -0.1977787 
## 
## Coefficients:
##           ma1     sar1     sar2
##       -0.9129  -0.7900  -0.3998
## s.e.   0.0502   0.0732   0.0759
## 
## sigma^2 estimated as 1.911e-05:  log likelihood=718.64
## AIC=-1429.29   AICc=-1429.06   BIC=-1416.54
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE     MASE
## Training set 22291.25 665989.7 475942.3 -0.1436716 7.108555 0.774068
##                   ACF1
## Training set 0.1625114

Model 2: ARIMA(3,1,2)(0,1,2)[12] , AICc=-1434.21

After trying out different combinations, this model has a lower AICc than the auto.arima model.

power_model2 <- Arima(power_ts, order=c(3,1,2), seasonal = c(0,1,2), lambda = power_lambda)
summary(power_model2)
## Series: power_ts 
## ARIMA(3,1,2)(0,1,2)[12] 
## Box Cox transformation: lambda= -0.1977787 
## 
## Coefficients:
##           ar1     ar2     ar3      ma1      ma2     sma1    sma2
##       -0.2699  0.0670  0.1789  -0.4507  -0.4874  -0.8051  0.1559
## s.e.   0.2141  0.0953  0.0830   0.2117   0.1980   0.0847  0.0922
## 
## sigma^2 estimated as 1.818e-05:  log likelihood=725.53
## AIC=-1435.05   AICc=-1434.21   BIC=-1409.55
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 36494.63 645086.4 453332.2 0.05866348 6.787337 0.7372951
##                    ACF1
## Training set 0.06634145

Residuals of selected model

The residuals appear approximately normal, and the residual ACF values are within the significance bounds. The Ljung-Box test has a p-value of 0.8325, so we fail to reject the null hypothesis that the residuals are white noise.

checkresiduals(power_model2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,2)(0,1,2)[12]
## Q* = 11.445, df = 17, p-value = 0.8325
## 
## Model df: 7.   Total lags used: 24

Forecast of power consumption for the next 12 months

The forecast for 2014 seems reasonable.

autoplot(forecast(power_model2, h=12))
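To mirror the Part A deliverables, the 2014 monthly point forecasts can also be written out to an Excel-readable file. This is a sketch; the file name and column labels are my own choices.

#write the 12 monthly point forecasts for 2014 to a CSV
power_fc <- forecast(power_model2, h=12)
power_out <- data.frame(
  `YYYY-MMM` = format(seq(as.Date("2014-01-01"), by = "month", length.out = 12), "%Y-%b"),
  KWH = as.numeric(power_fc$mean),
  check.names = FALSE
)
write.csv(power_out, "PowerForecast_2014.csv", row.names = FALSE)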