| title: “DS624_Project1_JagdishChhabria” author: “Jagdish Chhabria” date: “10/29/2021” output: html_document: default |
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
The data set contains cash withdrawals from four different ATMs during the period from May 1, 2009 to April 30, 2010. We start by separating the single column of data into 4 separate time series so they can be analyzed separately.
Let’s inspect the data.
| DATE | ATM1 | ATM2 | ATM3 | ATM4 |
|---|---|---|---|---|
| 2009-05-01 | 96 | 107 | 0 | 777 |
| 2009-05-02 | 82 | 89 | 0 | 524 |
| 2009-05-03 | 85 | 90 | 0 | 793 |
| 2009-05-04 | 90 | 55 | 0 | 908 |
| 2009-05-05 | 99 | 79 | 0 | 53 |
| 2009-05-06 | 88 | 19 | 0 | 52 |
| 2009-05-07 | 8 | 2 | 0 | 55 |
| 2009-05-08 | 104 | 103 | 0 | 559 |
| 2009-05-09 | 87 | 107 | 0 | 904 |
| 2009-05-10 | 93 | 118 | 0 | 879 |
| 2009-05-11 | 86 | 75 | 0 | 274 |
| 2009-05-12 | 111 | 111 | 0 | 396 |
| 2009-05-13 | 75 | 25 | 0 | 275 |
| 2009-05-14 | 6 | 16 | 0 | 16 |
| 2009-05-15 | 102 | 137 | 0 | 852 |
| 2009-05-16 | 73 | 95 | 0 | 380 |
| 2009-05-17 | 92 | 103 | 0 | 31 |
| 2009-05-18 | 82 | 80 | 0 | 492 |
| 2009-05-19 | 86 | 118 | 0 | 84 |
| 2009-05-20 | 73 | 30 | 0 | 129 |
| 2009-05-21 | 20 | 7 | 0 | 14 |
| 2009-05-22 | 100 | 118 | 0 | 815 |
| 2009-05-23 | 93 | 104 | 0 | 758 |
| 2009-05-24 | 90 | 59 | 0 | 601 |
| 2009-05-25 | 94 | 40 | 0 | 907 |
| 2009-05-26 | 98 | 106 | 0 | 503 |
| 2009-05-27 | 73 | 18 | 0 | 88 |
| 2009-05-28 | 10 | 9 | 0 | 35 |
| 2009-05-29 | 97 | 136 | 0 | 338 |
| 2009-05-30 | 102 | 118 | 0 | 5 |
Let’s explore the data further.
## ATM1 ATM2 ATM3 ATM4
## Min. : 1.00 Min. : 0.00 Min. : 0.0000 Min. : 2
## 1st Qu.: 73.00 1st Qu.: 25.50 1st Qu.: 0.0000 1st Qu.: 124
## Median : 91.00 Median : 67.00 Median : 0.0000 Median : 404
## Mean : 83.89 Mean : 62.58 Mean : 0.7206 Mean : 474
## 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000 3rd Qu.: 705
## Max. :180.00 Max. :147.00 Max. :96.0000 Max. :10920
## NA's :3 NA's :2
## vars n mean sd median trimmed mad min max range skew
## DATE 1 365 NaN NA NA NaN NA Inf -Inf -Inf NA
## ATM1 2 362 83.89 36.66 91 86.65 25.20 1 180 179 -0.71
## ATM2 3 363 62.58 38.90 67 62.24 48.93 0 147 147 -0.03
## ATM3 4 365 0.72 7.94 0 0.00 0.00 0 96 96 10.93
## ATM4 5 365 474.01 650.95 404 416.65 425.51 2 10920 10918 11.39
## kurtosis se
## DATE NA NA
## ATM1 0.19 1.93
## ATM2 -1.09 2.04
## ATM3 118.38 0.42
## ATM4 178.92 34.07
From the above, we notice the following unusual things: 1) ATM1 has 3 missing values and ATM2 has 2 missing values. 2) ATM2 has 2 days with withdrawals = 0, which indicates that this data represents missing values separately from no withdrawals (i.e. withdrawal amount = 0) 3) ATM3 seems to have been operational for only the last 3 days of the entire 1 year period. This could potentially indicate that it was newly installed in the last week of April 2010 or was inoperational uptil that date for some other reason. 4) ATM has an outlier i.e. a daily withdrawal of $1,092,000 which is far above the mean withdrawal amount. The date is February 9 which is not a public holiday or some other major event date. 5) The mean for ATM4 is much higher than the mean for ATM1 and ATM2 indicating that it is potentially in a very busy, commercial neighborhood or an affluent neighborhood.
Let’s check the 4 ATM time series for outliers.
From the above, we can see that ATM has almost 15% of the data points as outliers, ATM2 does not have any outliers, ATM3 data should not be considered representative, and ATM4 has 2 outliers of which one is quite significant.
From the above plots, we can see that ATM1 and ATM2 data shows some seasonality.ATM3 plot doesn’t provide much information as expected. ATM4 plot primarily shows the big spike for the outlier.
## ATM1 ATM2 ATM3 ATM4
## Min. : 1.00 Min. : 0.00 Min. : 0.0000 Min. : 2.0
## 1st Qu.: 73.00 1st Qu.: 26.00 1st Qu.: 0.0000 1st Qu.: 124.0
## Median : 91.00 Median : 67.00 Median : 0.0000 Median : 403.0
## Mean : 84.15 Mean : 62.59 Mean : 0.7206 Mean : 444.7
## 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000 3rd Qu.: 704.0
## Max. :180.00 Max. :147.00 Max. :96.0000 Max. :1712.0
From the above, we can see that the missing data points have now been imputed.
The above shows that there is weekly seasonality in the ATM1 data, and the PACF plot also confirms that the autocorrelation is restricted to weekly lags.
Based on the results of the ndiffs and nsdiffs function, it seems that no differencing is required, which seems contrary to what the plot was showing.
We now try to decompose the ATM1 time series.
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.465 0.0495
Based on the above, we can say that the time series is not stationary. Let’s apply the Box Cox transformation.
## [1] 0.346298
## [1] 0.6833615
## [1] 0.4025564
The optimal lambda values for the Box Cox tranformations of ATM1, ATM2 and ATM4 are 0.3549, 0.6796 and 0.3995 respectively.
Let’s transform the data using these lambda values.
The classical decomposition shows a strong seasonal component and a spike in variance around March 2010.
We re-run the KPSS test for stationarity and can see that the data is now stationary.
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.00871 0.1
Based on this test, we can conclude that the differenced and Box Cox transformed time series is stationary. Let’s try fitting different models (seasonally naive, ETS and ARIMA) on the ATM1 data, and evaluate against a training period.
Let us compare the accuracy across the models.
## # A tibble: 3 x 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA Test -6.81 12.3 9.68 -82.9 86.0 0.523 0.429 -0.282
## 2 ETS Test -5.80 11.6 9.13 -61.5 65.1 0.493 0.404 -0.255
## 3 SNAIVE Test -7.07 16.0 13.5 -66.2 73.4 0.731 0.557 0.0437
The ETS model seems to the best choice here.
Let’s check the residuals.
## # A tibble: 1 x 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA 7.91 0.245
## # A tibble: 1 x 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 ETS 19.1 0.00399
## # A tibble: 3 x 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 SNAIVE 831. NA NA NA NA NA NA NA <NULL> <NULL>
## 2 ETS 622. -2047. 4114. 4114. 4152. 605. 608. 15.6 <NULL> <NULL>
## 3 ARIMA 602. -1516. 3039. 3039. 3054. NA NA NA <cpl [0]> <cpl [15]>
The residuals are not correlated and look like white noise.
To come up with the forecast for May 2010 from this model, we need to reverse the Box Cox transformation.
The next month’s forecast for ATM1 cash withdrawals along with the prediction bands is shown in the plot above.
###ATM2
The plot above shows autocorrelation at lags = 7,14,21 indicating weekly seasonality.
ATM2 shows significant seasonality. Let’s transform the data.
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.00852 0.1
## # A tibble: 3 x 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA Test -22.1 32.2 25.4 -312. 315. 1.19 1.05 0.186
## 2 ETS Test -21.1 31.5 24.3 -256. 260. 1.14 1.03 0.159
## 3 SNAIVE Test -22.3 32.0 26.9 -265. 270. 1.26 1.04 0.160
Let’s check the residuals.
## # A tibble: 1 x 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA 7.91 0.245
## # A tibble: 1 x 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 ETS 19.1 0.00399
## # A tibble: 3 x 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 SNAIVE 831. NA NA NA NA NA NA NA <NULL> <NULL>
## 2 ETS 622. -2047. 4114. 4114. 4152. 605. 608. 15.6 <NULL> <NULL>
## 3 ARIMA 602. -1516. 3039. 3039. 3054. NA NA NA <cpl [0]> <cpl [15]>
To come up with the forecast for May 2010 from this model, we need to reverse the Box Cox transformation.
The next month’s forecast for ATM1 cash withdrawals along with the prediction bands is shown in the plot above.
The plot above shows autocorrelation at lags = 7,14,21 indicating weekly seasonality.
ATM2 shows significant seasonality. Let’s transform the data.
## # A tibble: 3 x 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA Test -56.6 285. 243. -994. 1019. 0.688 0.618 -0.0264
## 2 ETS Test -50.3 308. 261. -1141. 1174. 0.740 0.669 0.124
## 3 SNAIVE Test -130. 301. 227. -374. 389. 0.644 0.653 0.0305
Let’s check the residuals.
## # A tibble: 1 x 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA 0.786 0.992
## # A tibble: 1 x 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 ETS 9.21 0.162
## # A tibble: 3 x 11
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE ar_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
## 1 SNAIVE 213314. NA NA NA NA NA NA NA <NULL>
## 2 ETS 112320. -2917. 5854. 5855. 5893. 109302. 109540. 264. <NULL>
## 3 ARIMA 121960. -2434. 4883. 4884. 4914. NA NA NA <cpl [10]>
## # ... with 1 more variable: ma_roots <list>
To come up with the forecast for May 2010 from this model, we need to reverse the Box Cox transformation.
The next month’s forecast for ATM1 cash withdrawals along with the prediction bands is shown in the plot above.
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward.
| CaseSequence | YYYY-MMM | KWH |
|---|---|---|
| 733 | 1998-Jan | 6862583 |
| 734 | 1998-Feb | 5838198 |
| 735 | 1998-Mar | 5420658 |
| 736 | 1998-Apr | 5010364 |
| 737 | 1998-May | 4665377 |
| 738 | 1998-Jun | 6467147 |
| 739 | 1998-Jul | 8914755 |
| 740 | 1998-Aug | 8607428 |
| 741 | 1998-Sep | 6989888 |
| 742 | 1998-Oct | 6345620 |
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CaseSequence | 1 | 192 | 828.5 | 55.56978 | 828.5 | 828.5 | 71.1648 | 733 | 924 | 191 | 0.0000000 | -1.2187656 | 4.010403 |
| YYYY-MMM* | 2 | 192 | 96.5 | 55.56978 | 96.5 | 96.5 | 71.1648 | 1 | 192 | 191 | 0.0000000 | -1.2187656 | 4.010403 |
| KWH | 3 | 191 | 6502474.6 | 1447570.88674 | 6283324.0 | 6439474.9 | 1543073.7714 | 770523 | 10655730 | 9885207 | 0.1683404 | 0.4547269 | 104742.553300 |
The results above show one missing value, for September 2008.
The boxplot shows one outlier for July 2010 which had a very low power usage of 770523 KWH. Let’s go ahead and impute the one missing value with the median KWH value, and create a tsibble object.
Let’s plot the time series.
The data shows seasonality and big spike downwards in July 2010 which could be related to an outage.
The ACF plot confirms that the data is seasonal, showing several significant autocorrelations that are 6 months apart from peak to trough, suggesting high and low usage during winter and summer months. There seems to also be a mild upward trend.
Let’s run a STL decomposition on the power time series.
The STL decomposition shows an upward trend that seems to have accelerated from 2012 onwards, as well as seasonality. In order to address this, we transform the data using a Box Cox transformation.
## [1] 0.1041666
The Box-Cox transformation does not seem to have made much difference to the time
Before fitting any models, we will split the data into training and test data sets. We can use the last 1 year as the test data set. Then we can try fitting different models including the seasonal naive model, exponential smoothing and ARIMA.
Let’s forecast for 12 months using the fitted models.
The forecasts look reasonable. Visually the SNAIVE and ETS forecasts look like a close fit. Let’s check the accuracy metrics.
## # A tibble: 3 x 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA Test 764755. 1516998. 1014416. 8.64 12.2 1.44 1.27 0.0659
## 2 ETS Test 405572. 1050779. 682558. 4.24 8.19 0.967 0.882 0.0846
## 3 SNAIVE Test 405195. 1035538. 618606. 4.55 7.06 0.877 0.869 -0.0313
Based on the above metrics, the SNAIVE model looks better. Let’s check the residuals from the fitted models.
## # A tibble: 1 x 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 SNAIVE 70.8 2.30e-10
## # A tibble: 1 x 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 ETS 22.6 0.0317
## # A tibble: 1 x 3
## .model bp_stat bp_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA 3.95 0.984
Based on the above, it seems that the residuals from the ARIMA model are not dependent and closest to white noise. So we go with the ARIMA model instead.
## Series: KWH
## Model: ARIMA(0,0,3)(2,1,1)[12] w/ drift
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 sma1 constant
## 0.2198 0.1094 0.1301 -0.6385 -0.5175 -0.2918 174498.9
## s.e. 0.0789 0.0804 0.0717 0.1630 0.1324 0.1813 70961.6
##
## sigma^2 estimated as 682673355129: log likelihood=-2531.73
## AIC=5079.46 AICc=5080.36 BIC=5104.45
The ARIMA model has parameters: (0,0,3)(2,1,1)[12] w/ drift.
The residuals look uncorrelated.
To come up with the 12-monthly forecast for 2014 from this model, we need to reverse the Box Cox transformation.
Write the forecast for 2014 to a file.