Forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast.
## DATE ATM Cash
## Min. :39934 Length:1474 Min. : 0.0
## 1st Qu.:40026 Class :character 1st Qu.: 0.5
## Median :40118 Mode :character Median : 73.0
## Mean :40118 Mean : 155.6
## 3rd Qu.:40210 3rd Qu.: 114.0
## Max. :40312 Max. :10919.8
## NA's :19
We need to get the DATE column to its standard format and convert the data into a tsibble. The missing values will be substituted with the median values for the atms.
atm$DATE = as.Date(atm$DATE, origin = '1900-01-01')
impute.median = function(x) replace(x, is.na(x), median(x, na.rm=TRUE))
atm = atm %>%
filter(DATE < "2010-05-01") %>%
group_by(ATM) %>%
mutate(Cash = impute.median(Cash)
)
summary(atm)
## DATE ATM Cash
## Min. :2009-05-03 Length:1452 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.0
## Median :2009-10-31 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.5
## 3rd Qu.:2010-01-30 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
Now we can plot the monthly time series. As we can see, ATM 1, 2, 3 and 4 have no trend or seasonality associated with the data. ATM 3 has data where no transactions were made before April 2010. ATM 4 shows a spike in February 2010.
## Plot variable not specified, automatically selected `.vars = Cash`
## `mutate_if()` ignored the following grouping variables:
There is no noticeable trend in ATM 1 but there is seasonality. The models SNAIVE, ETS, and ARIMA will be evaluated.
The ARIMA model without transformation performed best according to the RMSE. Lets evaluate the model’s components.
## # A tibble: 8 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 multi_tran Train… 2.78 22.9 15.0 -32.6 50.8 0.843 0.821 0.112
## 2 ATM1 arima Train… -0.0792 23.4 14.7 -103. 118. 0.825 0.840 -0.00814
## 3 ATM1 add Train… -0.174 23.8 15.0 -107. 122. 0.847 0.855 0.128
## 4 ATM1 arima_tran Train… 1.74 24.5 15.7 -90.9 109. 0.884 0.881 0.00488
## 5 ATM1 add_tran Train… 1.80 24.6 15.8 -94.9 113. 0.891 0.885 0.129
## 6 ATM1 multi Train… -1.16 26.3 16.3 -128. 144. 0.916 0.945 0.0788
## 7 ATM1 snaive Train… -0.0730 27.8 17.8 -97.1 117. 1 1 0.159
## 8 ATM1 snaive_tran Train… -0.0730 27.8 17.8 -97.1 117. 1 1 0.159
The model that will be used to forecast ATM1 in May 2010 is ARIMA(0,0,2)(0,1,2)[7]
## Series: Cash
## Model: ARIMA(0,0,1)(0,1,2)[7]
##
## Coefficients:
## ma1 sma1 sma2
## 0.1698 -0.5863 -0.0994
## s.e. 0.0554 0.0510 0.0500
##
## sigma^2 estimated as 562.8: log likelihood=-1632.92
## AIC=3273.84 AICc=3273.95 BIC=3289.34
## `mutate_if()` ignored the following grouping variables:
## • Column `ATM`
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima 16.4 0.175
There is no noticeable trend in ATM 2 but there is seasonality. The models SNAIVE, ETS, and ARIMA will be evaluated.
The ARIMA model without transformation performed best according to the RMSE. Lets evaluate the model’s components.
## # A tibble: 8 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM2 arima Traini… -1.00 24.1 17.0 -Inf Inf 0.820 0.801 -0.00230
## 2 ATM2 arima_tran Traini… 0.183 24.4 17.1 -Inf Inf 0.825 0.812 -0.0106
## 3 ATM2 add Traini… -0.798 25.0 17.7 -Inf Inf 0.854 0.831 0.0161
## 4 ATM2 add_tran Traini… 0.532 25.4 17.8 -Inf Inf 0.858 0.843 0.0170
## 5 ATM2 snaive Traini… -0.0674 30.1 20.7 -Inf Inf 1 1 0.0476
## 6 ATM2 snaive_tran Traini… -0.0674 30.1 20.7 -Inf Inf 1 1 0.0476
## 7 ATM2 multi_tran Traini… 5.38 33.5 26.7 -Inf Inf 1.29 1.11 0.00508
## 8 ATM2 multi Traini… 0.972 34.8 26.9 -278. 307. 1.30 1.16 0.0205
The model that will be used to forecast ATM2 in May 2010 is ARIMA(2,0,2)(0,1,1)[7]
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4292 -0.9169 0.4699 0.8071 -0.7566
## s.e. 0.0515 0.0404 0.0810 0.0548 0.0384
##
## sigma^2 estimated as 600.4: log likelihood=-1643.82
## AIC=3299.64 AICc=3299.88 BIC=3322.89
## `mutate_if()` ignored the following grouping variables:
## • Column `ATM`
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima 9.21 0.685
ATM 3 has many 0 values, therefore noticing a trend or seasonality is non-existent. This ATM can be considered out of service until April 10. A model cannot predict on the limited data.
There is no noticeable trend in ATM 4 but there is seasonality. The models SNAIVE, ETS, and ARIMA will be evaluated. There is a noticeable spike in the data in February 2010.
The additive model without transformation performed best according to the RMSE. Lets evaluate the model’s components.
## # A tibble: 8 × 11
## ATM .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 add Trai… -5.18e- 1 636. 301. -579. 608. 0.749 0.708 -0.00412
## 2 ATM4 arima Trai… -1.41e-13 651. 324. -619. 649. 0.806 0.725 -0.00970
## 3 ATM4 multi_tran Trai… 1.88e+ 2 670. 312. -266. 330. 0.775 0.746 -0.00315
## 4 ATM4 add_tran Trai… 1.93e+ 2 671. 313. -261. 327. 0.778 0.747 -0.00332
## 5 ATM4 arima_tran Trai… 2.24e+ 2 685. 321. -222. 289. 0.799 0.762 -0.00636
## 6 ATM4 multi Trai… -8.12e+ 1 850. 377. -748. 781. 0.938 0.946 -0.00884
## 7 ATM4 snaive Trai… -3.65e+ 0 898. 402. -395. 450. 1 1 -0.0149
## 8 ATM4 snaive_tran Trai… -3.65e+ 0 898. 402. -395. 450. 1 1 -0.0149
The model that will be used to forecast ATM1 in May 2010 is ETS(A,N,A).
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000812
## gamma = 0.0001000586
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 475.0999 -298.6338 -59.56414 179.8411 5.029765 60.52828 19.86394 92.9349
##
## sigma^2: 415042.5
##
## AIC AICc BIC
## 6846.372 6846.997 6885.316
## `mutate_if()` ignored the following grouping variables:
## • Column `ATM`
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima 4.48 0.973
The ATM forecast will be exported to an Excel file. Code was in reference to https://stackoverflow.com/questions/34659586/combining-forecasts-into-a-data-frame-in-r-and-then-exporting-into-excel
A simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward.
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
Since there is a missing value, we will replace it with the median number. Then we need to convert the data to a time series with a monthly time interval.
With the data in a time series, we can plot the data.
In the plot, we can see seasonality within the data but no trend in the data. We can visualize this using STL decomposition, season plot and sub series plot.
Lets determine how much differencing is needed to obtain stationary time series. We require a seasonal difference to obtain a stationary time series.
## [1] 1
Now the model is stationary and can begin modeling.
It seems the best model would be ARIMA(1,0,0)(0,1,1)[12] with drift
## Series: rcf
## ARIMA(1,0,0)(0,1,1)[12] with drift
## Box Cox transformation: lambda= -0.1484337
##
## Coefficients:
## ar1 sma1 drift
## 0.3001 -0.7364 1e-04
## s.e. 0.0722 0.0693 1e-04
##
## sigma^2 = 7.855e-05: log likelihood = 594.93
## AIC=-1181.87 AICc=-1181.64 BIC=-1169.09
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 22217.68 588817.2 451817.4 -0.220072 6.902593 0.7424511 0.05056403
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(0,1,1)[12] with drift
## Q* = 26.668, df = 22, p-value = 0.2241
##
## Model df: 2. Total lags used: 24
Lets see if changing components will improve the model. I will begin with ARIMA(1,0,1)(0,1,1)[12]
## Series: rcf
## ARIMA(1,0,1)(0,1,1)[12]
## Box Cox transformation: lambda= -0.1484337
##
## Coefficients:
## ar1 ma1 sma1
## 0.8444 -0.5949 -0.6886
## s.e. 0.1288 0.2018 0.0644
##
## sigma^2 = 8.251e-05: log likelihood = 591.2
## AIC=-1174.39 AICc=-1174.16 BIC=-1161.62
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 114904.1 604994.1 468753.6 1.207013 7.111408 0.7702816 0.08236063
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,1)[12]
## Q* = 26.429, df = 21, p-value = 0.1906
##
## Model df: 3. Total lags used: 24
Though the auto arima model suggested ARIMA(1,0,0)(0,1,1)[12] with drift, I will use ARIMA(1,0,1)(0,1,1)[12] due to the AICc being minimal.
fc_rcf = Arima(rcf, order=c(1,0,1), seasonal=c(0,1,1), lambda=BoxCox.lambda(rcf))
pred_rcf = forecast(fc_rcf,h=12,level=95)
autoplot(pred_rcf)
Lastly, the forecast will be exported to Excel.