title: “DS624_Project1_JagdishChhabria” author: “Jagdish Chhabria” date: “10/29/2021” output: html_document: default

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Exploratory Data Analysis

The data set contains cash withdrawals from four different ATMs during the period from May 1, 2009 to April 30, 2010. We start by separating the single column of data into 4 separate time series so they can be analyzed separately.

Let’s inspect the data.

DATE ATM1 ATM2 ATM3 ATM4
2009-05-01 96 107 0 777
2009-05-02 82 89 0 524
2009-05-03 85 90 0 793
2009-05-04 90 55 0 908
2009-05-05 99 79 0 53
2009-05-06 88 19 0 52
2009-05-07 8 2 0 55
2009-05-08 104 103 0 559
2009-05-09 87 107 0 904
2009-05-10 93 118 0 879
2009-05-11 86 75 0 274
2009-05-12 111 111 0 396
2009-05-13 75 25 0 275
2009-05-14 6 16 0 16
2009-05-15 102 137 0 852
2009-05-16 73 95 0 380
2009-05-17 92 103 0 31
2009-05-18 82 80 0 492
2009-05-19 86 118 0 84
2009-05-20 73 30 0 129
2009-05-21 20 7 0 14
2009-05-22 100 118 0 815
2009-05-23 93 104 0 758
2009-05-24 90 59 0 601
2009-05-25 94 40 0 907
2009-05-26 98 106 0 503
2009-05-27 73 18 0 88
2009-05-28 10 9 0 35
2009-05-29 97 136 0 338
2009-05-30 102 118 0 5

Let’s explore the data further.

##       ATM1             ATM2             ATM3              ATM4      
##  Min.   :  1.00   Min.   :  0.00   Min.   : 0.0000   Min.   :    2  
##  1st Qu.: 73.00   1st Qu.: 25.50   1st Qu.: 0.0000   1st Qu.:  124  
##  Median : 91.00   Median : 67.00   Median : 0.0000   Median :  404  
##  Mean   : 83.89   Mean   : 62.58   Mean   : 0.7206   Mean   :  474  
##  3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 0.0000   3rd Qu.:  705  
##  Max.   :180.00   Max.   :147.00   Max.   :96.0000   Max.   :10920  
##  NA's   :3        NA's   :2
##      vars   n   mean     sd median trimmed    mad min   max range  skew
## DATE    1 365    NaN     NA     NA     NaN     NA Inf  -Inf  -Inf    NA
## ATM1    2 362  83.89  36.66     91   86.65  25.20   1   180   179 -0.71
## ATM2    3 363  62.58  38.90     67   62.24  48.93   0   147   147 -0.03
## ATM3    4 365   0.72   7.94      0    0.00   0.00   0    96    96 10.93
## ATM4    5 365 474.01 650.95    404  416.65 425.51   2 10920 10918 11.39
##      kurtosis    se
## DATE       NA    NA
## ATM1     0.19  1.93
## ATM2    -1.09  2.04
## ATM3   118.38  0.42
## ATM4   178.92 34.07

From the above, we notice the following unusual things: 1) ATM1 has 3 missing values and ATM2 has 2 missing values. 2) ATM2 has 2 days with withdrawals = 0, which indicates that this data represents missing values separately from no withdrawals (i.e. withdrawal amount = 0) 3) ATM3 seems to have been operational for only the last 3 days of the entire 1 year period. This could potentially indicate that it was newly installed in the last week of April 2010 or was inoperational uptil that date for some other reason. 4) ATM has an outlier i.e. a daily withdrawal of $1,092,000 which is far above the mean withdrawal amount. The date is February 9 which is not a public holiday or some other major event date. 5) The mean for ATM4 is much higher than the mean for ATM1 and ATM2 indicating that it is potentially in a very busy, commercial neighborhood or an affluent neighborhood.

Let’s check the 4 ATM time series for outliers.

From the above, we can see that ATM has almost 15% of the data points as outliers, ATM2 does not have any outliers, ATM3 data should not be considered representative, and ATM4 has 2 outliers of which one is quite significant.

From the above plots, we can see that ATM1 and ATM2 data shows some seasonality.ATM3 plot doesn’t provide much information as expected. ATM4 plot primarily shows the big spike for the outlier.

##       ATM1             ATM2             ATM3              ATM4       
##  Min.   :  1.00   Min.   :  0.00   Min.   : 0.0000   Min.   :   2.0  
##  1st Qu.: 73.00   1st Qu.: 26.00   1st Qu.: 0.0000   1st Qu.: 124.0  
##  Median : 91.00   Median : 67.00   Median : 0.0000   Median : 403.0  
##  Mean   : 84.15   Mean   : 62.59   Mean   : 0.7206   Mean   : 444.7  
##  3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 0.0000   3rd Qu.: 704.0  
##  Max.   :180.00   Max.   :147.00   Max.   :96.0000   Max.   :1712.0

From the above, we can see that the missing data points have now been imputed.

The above shows that there is weekly seasonality in the ATM1 data, and the PACF plot also confirms that the autocorrelation is restricted to weekly lags.

Based on the results of the ndiffs and nsdiffs function, it seems that no differencing is required, which seems contrary to what the plot was showing.

We now try to decompose the ATM1 time series.

## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.465      0.0495

Based on the above, we can say that the time series is not stationary. Let’s apply the Box Cox transformation.

## [1] 0.346298
## [1] 0.6833615
## [1] 0.4025564

The optimal lambda values for the Box Cox tranformations of ATM1, ATM2 and ATM4 are 0.3549, 0.6796 and 0.3995 respectively.

Let’s transform the data using these lambda values.

The classical decomposition shows a strong seasonal component and a spike in variance around March 2010.

We re-run the KPSS test for stationarity and can see that the data is now stationary.

## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1   0.00871         0.1

Based on this test, we can conclude that the differenced and Box Cox transformed time series is stationary. Let’s try fitting different models (seasonally naive, ETS and ARIMA) on the ATM1 data, and evaluate against a training period.

Let us compare the accuracy across the models.

## # A tibble: 3 x 10
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ARIMA  Test  -6.81  12.3  9.68 -82.9  86.0 0.523 0.429 -0.282 
## 2 ETS    Test  -5.80  11.6  9.13 -61.5  65.1 0.493 0.404 -0.255 
## 3 SNAIVE Test  -7.07  16.0 13.5  -66.2  73.4 0.731 0.557  0.0437

The ETS model seems to the best choice here.

Let’s check the residuals.

## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     7.91     0.245
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ETS       19.1   0.00399
## # A tibble: 3 x 11
##   .model sigma2 log_lik   AIC  AICc   BIC   MSE  AMSE   MAE ar_roots  ma_roots  
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>    <list>    
## 1 SNAIVE   831.     NA    NA    NA    NA    NA    NA   NA   <NULL>    <NULL>    
## 2 ETS      622.  -2047. 4114. 4114. 4152.  605.  608.  15.6 <NULL>    <NULL>    
## 3 ARIMA    602.  -1516. 3039. 3039. 3054.   NA    NA   NA   <cpl [0]> <cpl [15]>

The residuals are not correlated and look like white noise.

To come up with the forecast for May 2010 from this model, we need to reverse the Box Cox transformation.

The next month’s forecast for ATM1 cash withdrawals along with the prediction bands is shown in the plot above.

###ATM2

The plot above shows autocorrelation at lags = 7,14,21 indicating weekly seasonality.

ATM2 shows significant seasonality. Let’s transform the data.

## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1   0.00852         0.1

## # A tibble: 3 x 10
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA  Test  -22.1  32.2  25.4 -312.  315.  1.19  1.05 0.186
## 2 ETS    Test  -21.1  31.5  24.3 -256.  260.  1.14  1.03 0.159
## 3 SNAIVE Test  -22.3  32.0  26.9 -265.  270.  1.26  1.04 0.160

Let’s check the residuals.

## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     7.91     0.245
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ETS       19.1   0.00399
## # A tibble: 3 x 11
##   .model sigma2 log_lik   AIC  AICc   BIC   MSE  AMSE   MAE ar_roots  ma_roots  
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>    <list>    
## 1 SNAIVE   831.     NA    NA    NA    NA    NA    NA   NA   <NULL>    <NULL>    
## 2 ETS      622.  -2047. 4114. 4114. 4152.  605.  608.  15.6 <NULL>    <NULL>    
## 3 ARIMA    602.  -1516. 3039. 3039. 3054.   NA    NA   NA   <cpl [0]> <cpl [15]>

To come up with the forecast for May 2010 from this model, we need to reverse the Box Cox transformation.

The next month’s forecast for ATM1 cash withdrawals along with the prediction bands is shown in the plot above.

ATM4

The plot above shows autocorrelation at lags = 7,14,21 indicating weekly seasonality.

ATM2 shows significant seasonality. Let’s transform the data.

## # A tibble: 3 x 10
##   .model .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE    ACF1
##   <chr>  <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ARIMA  Test   -56.6  285.  243.  -994. 1019. 0.688 0.618 -0.0264
## 2 ETS    Test   -50.3  308.  261. -1141. 1174. 0.740 0.669  0.124 
## 3 SNAIVE Test  -130.   301.  227.  -374.  389. 0.644 0.653  0.0305

Let’s check the residuals.

## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA    0.786     0.992
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ETS       9.21     0.162
## # A tibble: 3 x 11
##   .model  sigma2 log_lik   AIC  AICc   BIC     MSE    AMSE   MAE ar_roots  
##   <chr>    <dbl>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl> <list>    
## 1 SNAIVE 213314.     NA    NA    NA    NA      NA      NA    NA  <NULL>    
## 2 ETS    112320.  -2917. 5854. 5855. 5893. 109302. 109540.  264. <NULL>    
## 3 ARIMA  121960.  -2434. 4883. 4884. 4914.     NA      NA    NA  <cpl [10]>
## # ... with 1 more variable: ma_roots <list>

To come up with the forecast for May 2010 from this model, we need to reverse the Box Cox transformation.

The next month’s forecast for ATM1 cash withdrawals along with the prediction bands is shown in the plot above.

B. Power Forecasting

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward.

CaseSequence YYYY-MMM KWH
733 1998-Jan 6862583
734 1998-Feb 5838198
735 1998-Mar 5420658
736 1998-Apr 5010364
737 1998-May 4665377
738 1998-Jun 6467147
739 1998-Jul 8914755
740 1998-Aug 8607428
741 1998-Sep 6989888
742 1998-Oct 6345620
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
vars n mean sd median trimmed mad min max range skew kurtosis se
CaseSequence 1 192 828.5 55.56978 828.5 828.5 71.1648 733 924 191 0.0000000 -1.2187656 4.010403
YYYY-MMM* 2 192 96.5 55.56978 96.5 96.5 71.1648 1 192 191 0.0000000 -1.2187656 4.010403
KWH 3 191 6502474.6 1447570.88674 6283324.0 6439474.9 1543073.7714 770523 10655730 9885207 0.1683404 0.4547269 104742.553300

The results above show one missing value, for September 2008.

The boxplot shows one outlier for July 2010 which had a very low power usage of 770523 KWH. Let’s go ahead and impute the one missing value with the median KWH value, and create a tsibble object.

Let’s plot the time series.

The data shows seasonality and big spike downwards in July 2010 which could be related to an outage.

The ACF plot confirms that the data is seasonal, showing several significant autocorrelations that are 6 months apart from peak to trough, suggesting high and low usage during winter and summer months. There seems to also be a mild upward trend.

Let’s run a STL decomposition on the power time series.

The STL decomposition shows an upward trend that seems to have accelerated from 2012 onwards, as well as seasonality. In order to address this, we transform the data using a Box Cox transformation.

## [1] 0.1041666

The Box-Cox transformation does not seem to have made much difference to the time

Before fitting any models, we will split the data into training and test data sets. We can use the last 1 year as the test data set. Then we can try fitting different models including the seasonal naive model, exponential smoothing and ARIMA.

Let’s forecast for 12 months using the fitted models.

The forecasts look reasonable. Visually the SNAIVE and ETS forecasts look like a close fit. Let’s check the accuracy metrics.

## # A tibble: 3 x 10
##   .model .type      ME     RMSE      MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>  <chr>   <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ARIMA  Test  764755. 1516998. 1014416.  8.64 12.2  1.44  1.27   0.0659
## 2 ETS    Test  405572. 1050779.  682558.  4.24  8.19 0.967 0.882  0.0846
## 3 SNAIVE Test  405195. 1035538.  618606.  4.55  7.06 0.877 0.869 -0.0313

Based on the above metrics, the SNAIVE model looks better. Let’s check the residuals from the fitted models.

## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 SNAIVE    70.8  2.30e-10
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ETS       22.6    0.0317
## # A tibble: 1 x 3
##   .model bp_stat bp_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ARIMA     3.95     0.984

Based on the above, it seems that the residuals from the ARIMA model are not dependent and closest to white noise. So we go with the ARIMA model instead.

## Series: KWH 
## Model: ARIMA(0,0,3)(2,1,1)[12] w/ drift 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2     sma1  constant
##       0.2198  0.1094  0.1301  -0.6385  -0.5175  -0.2918  174498.9
## s.e.  0.0789  0.0804  0.0717   0.1630   0.1324   0.1813   70961.6
## 
## sigma^2 estimated as 682673355129:  log likelihood=-2531.73
## AIC=5079.46   AICc=5080.36   BIC=5104.45

The ARIMA model has parameters: (0,0,3)(2,1,1)[12] w/ drift.

The residuals look uncorrelated.

To come up with the 12-monthly forecast for 2014 from this model, we need to reverse the Box Cox transformation.

Write the forecast for 2014 to a file.