In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast.
Upload the ATM624Data file into github account or any cloud app.
Using R access the file and define the column types.
Convert the data type from “Date” to Date and round “Cash” to 3
Spread ATM column into separate columns for each ATM
Convert to proper Time series format with Date as the index
The Summary statistics on the prepared data indicate that ATM1 and ATM2 have a mean withdrawal of 83.89 and 62.58 respectively.ATM3 has a Mean: 0.7206 and ATM4 has a Mean:474.043. Hence ATM4 shows the higher withdrawals than all, whereas ATM3 is mostly 0.
ATM4 has a extreme outlier of 10919.762 .
The box plot chart shows ATM1 having many outliers.ATM2 is fairly distributed without any outliers. ATM3 is zero except for 3 outliers which have high value. ATM4 is also well distributed except for a outlier which is very high compared to the others. Hence we need to check the outliers before generating the time series.
ATM1 has 3 missing values and ATM2 has 2 missing values.Hence these need to be interpolated.
## Date ATM1 ATM2 ATM3
## Min. :2009-05-01 Min. : 1.00 Min. : 0.00 Min. : 0.0000
## 1st Qu.:2009-07-31 1st Qu.: 73.00 1st Qu.: 25.50 1st Qu.: 0.0000
## Median :2009-10-30 Median : 91.00 Median : 67.00 Median : 0.0000
## Mean :2009-10-30 Mean : 83.89 Mean : 62.58 Mean : 0.7206
## 3rd Qu.:2010-01-29 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000
## Max. :2010-04-30 Max. :180.00 Max. :147.00 Max. :96.0000
## NA's :3 NA's :2
## ATM4
## Min. : 1.563
## 1st Qu.: 124.334
## Median : 403.839
## Mean : 474.043
## 3rd Qu.: 704.507
## Max. :10919.762
##
## Date ATM1 ATM2 ATM3 ATM4
## 0 3 2 0 0
Replaced missing/NA numerical values with the median value.
Replaced the extreme outlier in ATM4 cash with the nearest value using tsoutliers().
## $index
## [1] 285
##
## $replacements
## [1] 230.175
The Time series plot does not show any trend across the years.
ATM1 is clearly with seasonality, which is weekly seasonality. The ACF plots show strong correlation have significant lag7, lag14, and lag21. This is a non-stationary time series.This should be a data for ETS or ARIMA model. As there is high value in the residuals it needs transformation.
ATM2 is clearly with seasonality, which is weekly seasonality. The ACF plots show strong correlation have significant lag7, lag14, and lag21. This is a non-stationary time series.This should be a data for ETS or ARIMA model.
ATM3 have 0 for all values except 3. This data should only provide a flat forecast of 0.
ATM4: It looks like there is seasonality similar to ATM1 and ATM2. However the residuals show high variation. Hence we need to perform transformation. Lamba value of 0.449 indicates a square root transform. After apllying transformation, the plot is not being stationary. Hence differencing is needed.One seasonal differencing was applied so D=1, while the non-seasonal part suggests d=0. The seasonal lags at ACF sudden drops while the ones in PACF gradually decrease, i think i can try with AR(1) and MA(1). There is one non-seasonal spike at lag10 in ACF and PACF plots suggest p=q=1. I will try ARIMA(1,0,1)(0,1,1)[7].
For ATM1 forecasts, ARIMA model with box-cox transformation is displaying the lowest AICc.
Plotting the residual diagnostic graphs of this ARIMA transform model shows that the mean of the residuals is close to zero.The ACF plot shows that there is no significant correlation in the residuals series. 95% of the spikes in the ACF lies within ± 1.96 of the blue line. Hence the residuals are all white noise.The histogram of the residuals presents normal distribution.
The order auto fitted is ARIMA(0,0,2)(0,1,1)[7]. This represents non seasonal MA(2), seasonal MA(1) with single differencing and Seasonality of 7 days( weekly).
## # A tibble: 4 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima_transform 1.58 -590. 1188. 1188. 1203.
## 2 ets_transform 1.61 -1159. 2339. 2339. 2378.
## 3 arima 560. -1641. 3290. 3290. 3306.
## 4 ets 579. -2233. 4486. 4487. 4525.
## <lst_mdl[1]>
## [1] <ARIMA(0,0,2)(0,1,1)[7]>
## Date Cash
## 1 2010-05-01 93
## 2 2010-05-02 107
## 3 2010-05-03 79
## 4 2010-05-04 6
## 5 2010-05-05 107
## 6 2010-05-06 85
## 7 2010-05-07 92
## 8 2010-05-08 94
## 9 2010-05-09 108
## 10 2010-05-10 80
## 11 2010-05-11 6
## 12 2010-05-12 107
## 13 2010-05-13 86
## 14 2010-05-14 92
## 15 2010-05-15 95
## 16 2010-05-16 108
## 17 2010-05-17 81
## 18 2010-05-18 6
## 19 2010-05-19 108
## 20 2010-05-20 86
## 21 2010-05-21 93
## 22 2010-05-22 95
## 23 2010-05-23 109
## 24 2010-05-24 81
## 25 2010-05-25 6
## 26 2010-05-26 109
## 27 2010-05-27 87
## 28 2010-05-28 94
## 29 2010-05-29 96
## 30 2010-05-30 110
## 31 2010-05-31 82
For ATM2 forecasts, ARIMA model with box-cox transformation is displaying the lowest AICc.
Plotting the residual diagnostic graphs of this ARIMA transform model shows that the mean of the residuals is close to zero.The ACF plot shows that there is no significant correlation in the residuals series. 95% of the spikes in the ACF lies within ± 1.96 of the blue line. Hence the residuals are all white noise.The histogram of the residuals presents normal distribution.
The order auto fitted is ARIMA(2,0,2)(0,1,1)[7]. This represents non seasonal AR(2) & MA(2), seasonal MA(1) with single differencing and Seasonality of 7 days( weekly)
## # A tibble: 4 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima_transform 69.9 -1268. 2548. 2548. 2571.
## 2 arima 602. -1653. 3319. 3319. 3342.
## 3 ets_transform 74.6 -1859. 3738. 3739. 3777.
## 4 ets 644. -2253. 4525. 4526. 4564.
## <lst_mdl[1]>
## [1] <ARIMA(2,0,2)(0,1,1)[7]>
## Date Cash
## 1 2010-05-01 67
## 2 2010-05-02 73
## 3 2010-05-03 15
## 4 2010-05-04 9
## 5 2010-05-05 99
## 6 2010-05-06 90
## 7 2010-05-07 67
## 8 2010-05-08 67
## 9 2010-05-09 73
## 10 2010-05-10 16
## 11 2010-05-11 10
## 12 2010-05-12 99
## 13 2010-05-13 90
## 14 2010-05-14 67
## 15 2010-05-15 68
## 16 2010-05-16 74
## 17 2010-05-17 16
## 18 2010-05-18 11
## 19 2010-05-19 100
## 20 2010-05-20 90
## 21 2010-05-21 67
## 22 2010-05-22 68
## 23 2010-05-23 74
## 24 2010-05-24 17
## 25 2010-05-25 12
## 26 2010-05-26 100
## 27 2010-05-27 91
## 28 2010-05-28 67
## 29 2010-05-29 68
## 30 2010-05-30 74
## 31 2010-05-31 17
The ATM3 has only few observations.
When it comes to the ATM3, there are only 4 observations, which simply is not enough data to make a proper forecast and all of them are 0 which would give us a flat forecast.
ARIMA model provided the lowest AICc.
## [1] 0 96 82 85
## # A tibble: 4 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima 25.4 -1109. 2223. 2223. 2235.
## 2 ets 25.4 -1666. 3338. 3338. 3350.
## 3 mean 63.1 NA NA NA NA
## 4 naive 25.9 NA NA NA NA
## <lst_mdl[1]>
## [1] <ARIMA(0,0,2)>
## Date Cash
## 1 2010-05-01 3
## 2 2010-05-02 1
## 3 2010-05-03 0
## 4 2010-05-04 0
## 5 2010-05-05 0
## 6 2010-05-06 0
## 7 2010-05-07 0
## 8 2010-05-08 0
## 9 2010-05-09 0
## 10 2010-05-10 0
## 11 2010-05-11 0
## 12 2010-05-12 0
## 13 2010-05-13 0
## 14 2010-05-14 0
## 15 2010-05-15 0
## 16 2010-05-16 0
## 17 2010-05-17 0
## 18 2010-05-18 0
## 19 2010-05-19 0
## 20 2010-05-20 0
## 21 2010-05-21 0
## 22 2010-05-22 0
## 23 2010-05-23 0
## 24 2010-05-24 0
## 25 2010-05-25 0
## 26 2010-05-26 0
## 27 2010-05-27 0
## 28 2010-05-28 0
## 29 2010-05-29 0
## 30 2010-05-30 0
## 31 2010-05-31 0
For ATM4 forecasts, ARIMA model with box-cox transformation and ARIMA model enhanced with different combination of seasonal and non seasonal components were tested.
Plotting the residual diagnostic graphs of this ARIMA transform model shows that the mean of the residuals is close to zero.The ACF plot shows that there is no significant correlation in the residuals series. 95% of the spikes in the ACF lies within ± 1.96 of the blue line. Hence the residuals are all white noise.The histogram of the residuals presents normal distribution.
The SARIMA model with order (1,0,1)(0,1,1)[7] showed the lowest AICc. This represents non seasonal AR(1) & MA(1), seasonal MA(1) with single differencing and Seasonality of 7 days( weekly).
## # A tibble: 4 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima_3 166. -1426. 2860. 2860. 2876.
## 2 arima_2 166. -1426. 2862. 2862. 2882.
## 3 arima_1 176. -1461. 2931. 2931. 2951.
## 4 ets_atm4 0.222 -2001. 4021. 4022. 4060.
## <lst_mdl[1]>
## [1] <ARIMA(1,0,1)(0,1,1)[7]>
## Date Cash
## 1 2010-05-01 361
## 2 2010-05-02 446
## 3 2010-05-03 464
## 4 2010-05-04 133
## 5 2010-05-05 535
## 6 2010-05-06 386
## 7 2010-05-07 594
## 8 2010-05-08 390
## 9 2010-05-09 469
## 10 2010-05-10 480
## 11 2010-05-11 138
## 12 2010-05-12 544
## 13 2010-05-13 391
## 14 2010-05-14 599
## 15 2010-05-15 393
## 16 2010-05-16 472
## 17 2010-05-17 483
## 18 2010-05-18 139
## 19 2010-05-19 546
## 20 2010-05-20 393
## 21 2010-05-21 601
## 22 2010-05-22 395
## 23 2010-05-23 473
## 24 2010-05-24 484
## 25 2010-05-25 141
## 26 2010-05-26 547
## 27 2010-05-27 394
## 28 2010-05-28 602
## 29 2010-05-29 396
## 30 2010-05-30 475
## 31 2010-05-31 486
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours
Upload the ResidentialCustomerForecastLoad-624 file into github account or any cloud app.
Using R access the file and define the column types.
Convert to proper Time series format with YearMonth as the index
The summary stats indicate that KWH has a mean of 6502475 and median of 6283324
It has 1 missing value.Hence this needs to be interpolated.
The boxplot shows that the KWH data is fairly distributed well except for a outlier which has very low value.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
## NULL
Replaced missing/NA numerical values with the median value.
The outlier check tsoutliers() is not revealing any observation to be imputed.
## $index
## integer(0)
##
## $replacements
## numeric(0)
The time series plot shows a strong seasonal pattern with peaks and troughs that repeat approximately every 12 months. This is evident by the spikes at lag 12 in both the ACF and PACF plots.
The series appears to be mostly stationary in terms of mean (it hovers around a consistent level with seasonal fluctuations), though there’s that single unusual dip around 2010.
The ACF plot shows significant positive autocorrelation at lags 12 and 24, confirming the annual seasonality.
The PACF shows a spike at lag 12, then cuts off, suggesting a seasonal AR component.
SARIMA would be appropriate given the clear seasonal pattern or ETS model
As the residuals are high, we need to perform a transformation.
The lambda value of 0.204 suggests a log transformation.
The KPSS tests to determine the appropriate number of first differences shows 1. So we need single differencing.
After transformation and differencing, Plotting the residual diagnostic graphs shows that the mean of the residuals is close to zero.
Non seasonal ACF lag shows spike at 12, hence p=1, PACF none so q=0. The seasonal ACF does not show any information, wheras the PACF shows spike at lag4 and decreasing thereafter. As its annual seasonality period will be 12.
I will start with ARIMA(1,1,0)(0,0,1)[12] and then try other combinations of p,q,P,Q.
The ARIMA model (arima_4) with order (1,0,0)(0,1,1)[12] w/ drift showed the lowest AICc.This represents non seasonal AR(1) , seasonal MA(1) with single differencing and Seasonality of 12 months(1 year).
## [1] "Lambda value for transformation check is :0.104166550686943"
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
## # A tibble: 7 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima_4 0.948 -258. 523. 523. 536.
## 2 arima_3 0.951 -257. 525. 525. 541.
## 3 arima_5 0.951 -257. 525. 525. 541.
## 4 arima_2 0.955 -257. 527. 527. 546.
## 5 arima 1.29 -296. 601. 601. 617.
## 6 arima_1 1.99 -336. 678. 679. 688.
## 7 ets_transform 0.000572 -486. 1002. 1005. 1051.
## <lst_mdl[1]>
## [1] <ARIMA(1,0,0)(0,1,1)[12] w/ drift>
The forecasts for May 2010 ATM Cashwithdrawals for the 4 ATMS is placed in Part A sheet.
The forecasts for Residential power usage for 2014 is placed in Part B sheet.
A file Project1_forecasts.xlsx is generated to output the same.