Project 1
In the following project all missing values (N.A.'s) will replaced with linearly interpolated values using the na.approx() from the zoo library.
The Hampel filter will be used to identify as outliers the values outside the interval (I) formed by the median, plus or minus 3 median absolute deviations. Outliers will replaced with linearly interpolated values as well.
Part A – ATM Forecast, ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
After loading and joining the four ATM datasets, the head and tail of the data set looks like:
## Date ATM1_Cash ATM2_Cash ATM3_Cash ATM4_Cash
## 1 2009-05-03 96 107 0 776.99342
## 2 2009-05-04 82 89 0 524.41796
## 3 2009-05-05 85 90 0 792.81136
## 4 2009-05-06 90 55 0 908.23846
## 5 2009-05-07 99 79 0 52.83210
## 6 2009-05-08 88 19 0 52.20845
## 7 2009-05-09 8 2 0 55.47361
## 8 2009-05-10 104 103 0 558.50325
## 9 2009-05-11 87 107 0 904.34136
## 10 2009-05-12 93 118 0 879.49359
## Date ATM1_Cash ATM2_Cash ATM3_Cash ATM4_Cash
## 356 2010-04-23 100 100 0 556.792330
## 357 2010-04-24 69 103 0 386.175335
## 358 2010-04-25 85 44 0 165.294181
## 359 2010-04-26 85 61 0 5.451815
## 360 2010-04-27 109 89 0 542.280602
## 361 2010-04-28 74 11 0 403.839336
## 362 2010-04-29 4 2 0 13.697331
## 363 2010-04-30 96 107 96 348.201061
## 364 2010-05-01 82 89 82 44.245345
## 365 2010-05-02 85 90 85 482.287107
From the first 5 rows of observations we notice that we have a time stamp indicating year, month and day of usage. ATM1 and ATM2 have been continously used at least during the first 10 days. ATM3 has not been used at all during this time period. The data from ATM4 shows is reporting cash withdrawals 10 times larger than typical withdrawals from ATM1 and ATM2 which is possible if it is located in a high usage area (e.g. stadium, train station, etc). It is also reporting cash withdrawals with fractions of a cent which is not possible.
It is not possible to get a clarification on the validity of the observatiosn from ATM4. Probably there is a failure in the machine that is affecting the floating point calculations of the register. If this is the case, the range of cash withdrawals could be similar as that of ATM1 and ATM4.
## Date ATM1_Cash ATM2_Cash ATM3_Cash ATM4_Cash
## 1 2009-05-03 96 107 0 78
## 2 2009-05-04 82 89 0 52
## 3 2009-05-05 85 90 0 79
## 4 2009-05-06 90 55 0 91
## 5 2009-05-07 99 79 0 5
## 6 2009-05-08 88 19 0 5
## 7 2009-05-09 8 2 0 6
## 8 2009-05-10 104 103 0 56
## 9 2009-05-11 87 107 0 90
## 10 2009-05-12 93 118 0 88
Data Imputation
Data from ATM1 has 3 values missing from the series while ATM2 has 2 missing values:
##
## Summary of ATM1
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 73.00 91.00 83.89 108.00 180.00 3
##
## Summary of ATM2
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 25.50 67.00 62.58 93.00 147.00 2
The missing values will be replaced with interpolated values. For ATM1, we can see below how the Median has not changed while the Mean has increased by 0.31%. For ATM2, we see no change on the Median while the mean has increased by 0.02%.
##
## Summary of Imputed ATM1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 73.00 91.00 84.15 108.00 180.00
##
## Summary of Imputed ATM2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 26.00 67.00 62.59 93.00 147.00
Exploratory Data Analysis
First, let's look at some statistics of the cash withdrawals:
## Summary of values from ATM1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 73.00 91.00 84.15 108.00 180.00
##
## Summary of values from ATM2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 26.00 67.00 62.59 93.00 147.00
##
## Summary of values from ATM3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7206 0.0000 96.0000
##
## Summary of values from ATM4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 12.0 40.0 47.4 70.0 1092.0
After imputation, the data in ATM1 and ATM2 doesn't raise any red flags. ATM3 is filled with mostly zeroes and ATM4 has at least one outlier and the values don't make sense if the cash withdrawals are in hundreds of dollars.
ATM1
Cash withdrawals from ATM1 are bimodal with two peaks at around 20 and 90 hundreds of dollars. People tend to withdraw money near these two values. We can see these in the Scatter and Histogram plots below.
From the Autocorrelation Fuction (ACF) plot below we can see how the withdrawals are seasonal in the sense that they are correlated to those 7 days in the future and past. There is a weekly pattern.
ATM2
Cash withdrawals from ATM2 are also slighly bimodal with peaks near 20 and 70 hundreds of dollars. The bimodal distribution is not as defined as in the case of ATM1.
From the Autocorrelation Fuction (ACF) plot below we can see how the withdrawals from ATM2 are also seasonal. There is a weekly pattern just as we saw for ATM1.
ATM3
Cash withdrawals from ATM3 are mostly zero with a few withdrawals near a 100 hundred dollards.
The frequency distribution for the whole data set is the following:
##
## 0 82 85 96
## 362 1 1 1
There were only 3 non-zero withdrawals at the end of the series or only 0.82% of the observations. The last 10 ATM4 observations are:
## [1] 0 0 0 0 0 0 0 96 82 85
This may have been because the machine had no money on all but the last 3 days of the series. This will severely limit the ability to forecast future cash withdrawals from this machine.
ATM4
Cash withdrawals from ATM4 do not follow a normal distribution and are very right modal or positive skewed. There is also an extreme withdrawal value at 1092 (hundres of dollars) that we may have to consider an outlier. In the plot below, we see how taking the natural log of the values, we have brough back the skewed distribution to one resembling a normal distribution.
From the Autocorrelation Fuction (ACF) plot below we can see how the withdrawals from ATM4 appear to have seasonal autocorrelation. However, the spikes in the lags fall below the 95% treshold confidence intervals determined by the variance in the data set.
Forecasting ATM1
First, let's generate some simple non-seasonal models:
The models above don't capture at all the seasonality of the cash withdrawals.
Models such as ETS, Seasonal Naive and ARIMA do capture the seasonality of the training data and match the valley's when withdrawals drop to close to zero. From the three models, Seasonal Naive matches the phase of the valley while the other two models are out of phase with the actual data.
Accuracy of ATM1 Models
## # A tibble: 4 x 9
## .model .type ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Holt's method Training -1.01e+ 0 36.7 27.1 -177. 199. 1.53 0.0739
## 2 Damped Holt's method Training -4.90e- 1 36.6 27.1 -180. 203. 1.53 0.0657
## 3 Mean Training -6.97e-15 36.6 27.3 -175. 199. 1.54 0.0706
## 4 Naïve Training -3.02e- 2 49.9 37.2 -132. 167. 2.10 -0.356
All non-seasonal methods perform similary on the training and test data. The Naive method having the worst accuracy.
## # A tibble: 4 x 9
## .model .type ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Damped Holt's method Test -3.53 43.3 33.3 -495. 521. 1.88 0.00766
## 2 Holt's method Test -4.45 43.9 33.6 -492. 518. 1.89 0.0211
## 3 Mean Test -5.32 42.5 31.4 -469. 491. 1.77 0.0305
## 4 Naïve Test -4.85 48.6 34.4 -226. 252. 1.93 0.0451
Models that take into account seasonality (ETS, Seasonal Naive and ARIMA) perform much better compared to simpler models as we can see based on the accuracy metrics above and below. The ARIMA model performs slightly better than the rest from the perspective of the RMSE parameter.
## # A tibble: 3 x 9
## .model .type ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ets Training -0.0482 23.8 15.1 -107. 122. 0.852 0.144
## 2 arima Training -0.0742 23.3 14.5 -102. 117. 0.822 -0.00872
## 3 snaive Training -0.0363 27.7 17.7 -96.5 116. 1 0.188
On both training and test data, ARIMA had a better RMSE performance than ETS or Seasonal Naive.
## # A tibble: 3 x 9
## .model .type ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test -0.741 34.0 20.1 -440. 457. 1.13 0.0336
## 2 ets Test -3.62 40.0 24.9 -522. 541. 1.40 -0.0579
## 3 snaive Test -2.64 40.5 26.4 -534. 562. 1.49 0.0328
The less parsimonious model is the ARIMA model that when auto-generated created a model with non-seasonal and seasonal components. The parameters shown below indicate that the model differentiates only (seasonal differencing) and bases the seasonality on the length of the week in days. The ACF of the residual lags are below the treshold of statistical significance and the residual follow a normal distribution.
## Series: ATM1_Cash
## Model: ARIMA(0,0,1)(0,1,2)[7]
##
## Coefficients:
## ma1 sma1 sma2
## 0.1950 -0.5806 -0.1037
## s.e. 0.0546 0.0506 0.0494
##
## sigma^2 estimated as 556.4: log likelihood=-1640.01
## AIC=3288.03 AICc=3288.14 BIC=3303.55
The parameters of the ETS model below indicate no trend was found and the errors and seasonal componets were found to be additive. The model season component shows a section when the variance changes signficantly compared to the rest of the series around March 2010.
## Series: ATM1_Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.02126635
## gamma = 0.3298864
##
## Initial states:
## l s1 s2 s3 s4 s5 s6 s7
## 77.02956 -55.27235 2.531658 16.15413 4.529656 11.40245 12.92571 7.72875
##
## sigma^2: 580.2586
##
## AIC AICc BIC
## 4487.018 4487.639 4526.017
We can select the seasonal ARIMA(0,0,1)(0,1,2)[7] model as a good candidate to model the ATM1 cash withdrawals.
## .model Date ATM1_Cash .mean
## 1 arima 2010-05-03 N(87.16789, 556.3821) 87.167893
## 2 arima 2010-05-04 N(104.0559, 577.5304) 104.055911
## 3 arima 2010-05-05 N(73.14947, 577.5304) 73.149472
## 4 arima 2010-05-06 N(8.032036, 577.5304) 8.032036
## 5 arima 2010-05-07 N(100.0066, 577.5304) 100.006619
## 6 arima 2010-05-08 N(80.85748, 577.5304) 80.857483
Forecasting ATM2
The ETS, Seasonal Naive and ARIMA models have again captured the seasonality of the data and match the timing of the valley's when withdrawals drop to close to zero. From the three models, Seasonal Naive matches the phase of the valleys as well as the variance in the observed data better than the other two models.
Accuracy of ATM2 Models
## # A tibble: 4 x 9
## .model .type ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Holt's method Training 7.24e- 1 38.1 32.6 -Inf Inf 1.57 -0.0131
## 2 Damped Holt's method Training -2.19e+ 0 38.3 32.7 -Inf Inf 1.57 -0.00811
## 3 Mean Training -2.43e-16 38.8 33.2 -Inf Inf 1.59 0.0228
## 4 Naïve Training -4.67e- 2 54.2 42.5 -Inf Inf 2.05 -0.307
The metrics of Mean Absolute Error (MPE) and Mean Absolute Percentage Error (MAPE) above and below are showing the effects of division by zero (true obs is zero) giving infinite values.
The ARIMA model again performs slightly better than the rest from the perspective of the RMSE parameter. However, as we saw from the plot of the forecasts, Seasonal Naive better matched the phase and the range of variance of the historical data.
## # A tibble: 3 x 9
## .model .type ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ets Training -0.689 25.1 17.9 -Inf Inf 0.859 0.0199
## 2 arima Training -0.891 24.1 17.0 -Inf Inf 0.820 -0.00428
## 3 snaive Training 0.0223 30.2 20.8 -Inf Inf 1.00 0.0482
The auto-generated ARIMA model chose parameters for a model with non-seasonal and seasonal components. The weekly seasonality appears to be the strongest in determining the AR and MA components of the ARIMA model. The residual plots again indicate the goodness of fit for the ARIMA model.
## Series: ATM2_Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4320 -0.9130 0.4773 0.8048 -0.7547
## s.e. 0.0553 0.0407 0.0861 0.0556 0.0381
##
## sigma^2 estimated as 602.5: log likelihood=-1653.67
## AIC=3319.33 AICc=3319.57 BIC=3342.61
The parameters of the ETS model below indicate no trend was found and the erros and seasonal componets were found to be additive. As in the case for the ATM1 series, there is a section in the series when the variance changes signficantly around March 2010.
## Series: ATM2_Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000049
## gamma = 0.3594252
##
## Initial states:
## l s1 s2 s3 s4 s5 s6 s7
## 72.71479 -33.56563 -27.28811 12.45705 0.8474403 4.742114 17.29087 25.51626
##
## sigma^2: 648.1015
##
## AIC AICc BIC
## 4527.377 4527.998 4566.376
Seasonal Naive is a good choice to model cash withdrawals from ATM2 based on the matching of variance and phase of the historical data.
## .model Date ATM1_Cash .mean .model Date
## 1 arima 2010-05-03 N(87.16789, 556.3821) 87.167893 snaive 2010-05-03
## 2 arima 2010-05-04 N(104.0559, 577.5304) 104.055911 snaive 2010-05-04
## 3 arima 2010-05-05 N(73.14947, 577.5304) 73.149472 snaive 2010-05-05
## 4 arima 2010-05-06 N(8.032036, 577.5304) 8.032036 snaive 2010-05-06
## 5 arima 2010-05-07 N(100.0066, 577.5304) 100.006619 snaive 2010-05-07
## 6 arima 2010-05-08 N(80.85748, 577.5304) 80.857483 snaive 2010-05-08
## ATM2_Cash .mean
## 1 N(61, 911.8715) 61
## 2 N(89, 911.8715) 89
## 3 N(11, 911.8715) 11
## 4 N(2, 911.8715) 2
## 5 N(107, 911.8715) 107
## 6 N(89, 911.8715) 89
Forecasting ATM3
The data from ATM3 is filled almost completely with zeros as seen below.
Only the last 3 daily observations of the year long data set contain non-zero observations. 3 observations are not enough for any reasonable forecast that takes into account trend and/or seasonality.
## # A tsibble: 6 x 2 [1D]
## Date ATM3_Cash
## <date> <dbl>
## 1 2009-05-03 0
## 2 2009-05-04 0
## 3 2009-05-05 0
## 4 2009-05-06 0
## 5 2009-05-07 0
## 6 2009-05-08 0
## # A tsibble: 6 x 2 [1D]
## Date ATM3_Cash
## <date> <dbl>
## 1 2010-04-27 0
## 2 2010-04-28 0
## 3 2010-04-29 0
## 4 2010-04-30 96
## 5 2010-05-01 82
## 6 2010-05-02 85
A Naive forecast considering that the next withdrawal would be equal to the last one is viable considering the extremely limited number of observations. As we can see below, the confidence intervals quickly balloon out as well as the uncertainty for any future forecast.
## .model Date ATM1_Cash .mean .model Date
## 1 arima 2010-05-03 N(87.16789, 556.3821) 87.167893 snaive 2010-05-03
## 2 arima 2010-05-04 N(104.0559, 577.5304) 104.055911 snaive 2010-05-04
## 3 arima 2010-05-05 N(73.14947, 577.5304) 73.149472 snaive 2010-05-05
## 4 arima 2010-05-06 N(8.032036, 577.5304) 8.032036 snaive 2010-05-06
## 5 arima 2010-05-07 N(100.0066, 577.5304) 100.006619 snaive 2010-05-07
## 6 arima 2010-05-08 N(80.85748, 577.5304) 80.857483 snaive 2010-05-08
## ATM2_Cash .mean .model Date ATM3_Cash .mean
## 1 N(61, 911.8715) 61 Naïve 2010-05-03 N(85, 25.88187) 85
## 2 N(89, 911.8715) 89 Naïve 2010-05-04 N(85, 51.76374) 85
## 3 N(11, 911.8715) 11 Naïve 2010-05-05 N(85, 77.6456) 85
## 4 N(2, 911.8715) 2 Naïve 2010-05-06 N(85, 103.5275) 85
## 5 N(107, 911.8715) 107 Naïve 2010-05-07 N(85, 129.4093) 85
## 6 N(89, 911.8715) 89 Naïve 2010-05-08 N(85, 155.2912) 85
Forecasting ATM4
We can find outlier using a method knows as Hampel filter, considers as outliers the values outside the interval (I) formed by the median, plus or minus 3 median absolute deviations (MAD).
\[I=[\text {median}-3 \cdot M A D ; \text { median }+3 \cdot M A D]\]
where MAD is defined as:
\[M A D=\operatorname{median}\left(\left|X_{i}-\tilde{X}\right|\right)\]
The data from ATM4 contains an outlier easily seen in the plot below.
We'll detect the outlier, replace with an NA and finally use the na.approx() to linearly interpolate a new value.
The fit of the seasonal models to the historical data appears to be poor. Neither of the models capture the periodicity or the variance of the data. This is due to the non-normal distribution of the data its white-noise like properties. The ARIMA model fits the worst as the model eventuallly dampens to a constant value.
Accuracy of ATM4 Models
## # A tibble: 4 x 9
## .model .type ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Holt's method Training -6.08e- 1 35.5 29.6 -Inf Inf 0.853 0.0794
## 2 Damped Holt's method Training -1.96e- 1 35.1 29.2 -Inf Inf 0.843 0.0783
## 3 Mean Training -2.49e-15 35.1 29.2 -Inf Inf 0.844 0.0768
## 4 Naïve Training -8.24e- 2 47.7 38.7 -Inf Inf 1.12 -0.464
The accuracy metrics of all the models tested are similar from simple non-seasonal models such as the Mean of the series to more complex models such as seasonal ARIMA.
## # A tibble: 3 x 9
## .model .type ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ets Training -0.707 32.9 26.7 -Inf Inf 0.769 0.102
## 2 arima Training 0.00214 34.0 28.0 -Inf Inf 0.809 0.00478
## 3 snaive Training -0.372 45.3 34.7 -Inf Inf 1 0.0630
It is interesting the auto-selected ARIMA model is that of a seasonal ARIMA with a season length of 7 days with zero order differencing of both the non and seasonal components. The model tried to incorporate a seasonal component with limited influence on the non-seasonal component.
## Series: ATM4_Cash
## Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sar1 constant
## -0.8763 -0.6769 0.1701 0.9620 0.8072 0.2481 79.4710
## s.e. 0.1000 0.1194 0.0565 0.0901 0.1088 0.0556 4.8915
##
## sigma^2 estimated as 1177: log likelihood=-1805.01
## AIC=3626.02 AICc=3626.42 BIC=3657.22
Taking into account the following regarding the registered cash withdrawals from ATM4:
- Values show discrepancies on the withdrawal amounts compared to values from ATM1, ATM2 and the few non-zero observations from ATM3.
- Values indicate franctional dollar withdrawal.
- Values are highly skewed.
- Large standard deviation of the data induces very broad confidence intervals.
Model complexity has not increased accuracy so a simple model such as a the Mean of observations is a good candidate to forecasting future cash withdrawals for ATM4.
## .model Date ATM1_Cash .mean .model Date
## 1 arima 2010-05-03 N(87.16789, 556.3821) 87.167893 snaive 2010-05-03
## 2 arima 2010-05-04 N(104.0559, 577.5304) 104.055911 snaive 2010-05-04
## 3 arima 2010-05-05 N(73.14947, 577.5304) 73.149472 snaive 2010-05-05
## 4 arima 2010-05-06 N(8.032036, 577.5304) 8.032036 snaive 2010-05-06
## 5 arima 2010-05-07 N(100.0066, 577.5304) 100.006619 snaive 2010-05-07
## 6 arima 2010-05-08 N(80.85748, 577.5304) 80.857483 snaive 2010-05-08
## ATM2_Cash .mean .model Date ATM3_Cash .mean .model
## 1 N(61, 911.8715) 61 Naïve 2010-05-03 N(85, 25.88187) 85 Mean
## 2 N(89, 911.8715) 89 Naïve 2010-05-04 N(85, 51.76374) 85 Mean
## 3 N(11, 911.8715) 11 Naïve 2010-05-05 N(85, 77.6456) 85 Mean
## 4 N(2, 911.8715) 2 Naïve 2010-05-06 N(85, 103.5275) 85 Mean
## 5 N(107, 911.8715) 107 Naïve 2010-05-07 N(85, 129.4093) 85 Mean
## 6 N(89, 911.8715) 89 Naïve 2010-05-08 N(85, 155.2912) 85 Mean
## Date ATM4_Cash .mean
## 1 2010-05-03 N(47.40274, 4250.42) 47.40274
## 2 2010-05-04 N(47.40274, 4250.42) 47.40274
## 3 2010-05-05 N(47.40274, 4250.42) 47.40274
## 4 2010-05-06 N(47.40274, 4250.42) 47.40274
## 5 2010-05-07 N(47.40274, 4250.42) 47.40274
## 6 2010-05-08 N(47.40274, 4250.42) 47.40274
Outliers detection in R https://www.statsandr.com/blog/outliers-detection-in-r/
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
Exploratory Data Analysis
Let's examine the features of the residential power usage using exploratory data analysis (EDA).
From these exploratory plots we can notice the outlier occuring after 2010. The data has a normal distribution and shows uniform variance over the range of observations.
Identify and Deal N.A.'s
We have a missing observation in this data series of 192 observations. The missing observation is located at row number 705. These are the elements before and after the missing observation:
## [1] 7643987 8037137 NA 5101803 4555602
We will use linear interpolation to substitute the missing value.
## [1] 7643987 8037137 6569470 5101803 4555602
Identify and Remove Outlier
The outlier in this series is the minimum value of the series.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 770523 5434539 6314472 6502824 7608792 10655730
The outlier will be identified using
## [1] 770523
## [1] 151
Below we can the time series without the outlier or the missing values.
Transformations
There appears to be a trends in the data series but no large variability in the variance of the data. A BoxCox transformation did not improve the quality of the data.
Forecasting of Residential Power Consumption
The ETS, Seasonal Naive and ARIMA models have been successful in capturing the seasonality and variance of the historical data set. We see below that the ARIMA model has a slightly higher accuracy based on the RMSE parameter as compared to the other two models.
## # A tibble: 3 x 9
## .model .type ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ets Training 52868. 631229. 482033. 0.0715 7.27 0.772 0.220
## 2 arima Training -8679. 590208. 430362. -0.798 6.53 0.690 0.000127
## 3 snaive Training 94245. 825374. 624053. 0.681 9.28 1.00 0.298
The auto-selected ARIMA model performed best in the training and test data sets with a lower RMSE while Seasonal Naive has the highest RMSE value.
## # A tibble: 3 x 9
## .model .type ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test 190750. 672763. 536673. 1.61 8.23 0.858 0.136
## 2 ets Test 116692. 784947. 612111. 0.295 9.45 0.978 0.0438
## 3 snaive Test 215759. 906344. 663847. 1.67 10.1 1.06 -0.162
## 50.857 sec elapsed
ARIMA auto-selected a seasonal model with a season length of 12 (12 months in a year). Only the seasonal component was differenced with a Auto-Regressive coefficient of 2. The only coefficient in the non-seasonal component was the Moving Average compoment with a coefficient of 4.
## Series: value
## Model: ARIMA(0,0,4)(2,1,0)[12] w/ drift
##
## Coefficients:
## ma1 ma2 ma3 ma4 sar1 sar2 constant
## 0.3459 0.0665 0.2109 -0.0772 -0.7368 -0.4342 233796.06
## s.e. 0.0772 0.0838 0.0737 0.0819 0.0763 0.0784 73859.23
##
## sigma^2 estimated as 3.866e+11: log likelihood=-2657.65
## AIC=5331.31 AICc=5332.15 BIC=5356.85
Below we can see the 6 month forecast using the ARIMA(0,0,4)(2,1,0)[12] model. The periodicity and variance matches the historical data of the set.
## # A fable: 1 x 4 [1M]
## # Key: .model [1]
## .model index value .mean
## <chr> <mth> <dist> <dbl>
## 1 arima 2014 Jan N(1e+07, 3.9e+11) 10248709.
Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
Waterflow_Pipe1.xlsx
Exploratory Data Analysis
## Date Time WaterFlow
## Min. :42300 Min. : 1.067
## 1st Qu.:42302 1st Qu.:13.683
## Median :42305 Median :19.880
## Mean :42305 Mean :19.897
## 3rd Qu.:42307 3rd Qu.:26.159
## Max. :42310 Max. :38.913
The "Date Time" column shows times of observation in days and fractions fractions of a day. We need to convert them into number of seconds and fractions of second to use functions such as as_datetime or as.POSIXct to covert the observation time stamp into a date that we can use for our work.
We can extract from "year", "month", "day", "hour", "minutes", "seconds" parameters from the "Date Time" field. We can use these parameters for aggregating data by hour.
Data Aggregation at Different Time Scales
The lowest order seasonality is that at the hourly level. We see peak usage in the late-evening/early-morning period (people taking showers before bed) and in the afternoon at 3pm when people start to come back from work. The water flow measurments appear to follow a normal distribution and there are no outliers at the hourly aggregation level. The aggregated lowest water flow observation occurs at 10 AM as seen above in the plot at the bottom right corner.
We can also aggregate by day of the week (WeekDay) and hour as shown above. The aggregated data also shows a normal distribution. From the time series plot above we can see a dip in the aggregated data at the start/end of a day. This daily seasonality can also be seen in the series decomposition below. We an see that the seasonal component shows 7 distinctive cycles for the 7 days of the week.
Forecast
Three different models (ARIMA, ETS, Seasonal Naive) were chosen to forecast the seasonally varying data. Only the Seasonal Naive model was able to produce a forecast that represents the historical data.
The ETS model was only able to forecast a Mean value of the observations.
## ETS(A,N,N)
##
## Call:
## ets(y = WFP1_Data_agg2)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 19.8521
##
## sigma: 4.0452
##
## AIC AICc BIC
## 1334.384 1334.530 1343.756
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0002538585 4.021046 3.231715 -5.28105 18.49092 0.6690035
## ACF1
## Training set 0.0286744
ARIMA also completely failed to generate a model that fits the data.
What should be our forecast for this series. We can either choose the Seasonal Naive model that tries to fit the aggregated data or we can use the aggregated data per WeekDay and Hour that also shows the seasonality per day and per hour.
Reorder factor levels by day of the week in R
https://stackoverflow.com/questions/10309564/reorder-factor-levels-by-day-of-the-week-in-r
Second Data Set
The data set from Waterflow_Pipe2.xlsx showed the same characteristics as the previous set. The time stamp had to be converted form numerical time to DataTime that we can use to aggregate data per hour.
## year month day hour minutes seconds WaterFlow DateTime WeekDay
## 1 2015 10 25 01 00 00 18.81079 2015-10-25 01:00:00 Sunday
## 2 2015 10 25 02 00 00 43.08703 2015-10-25 02:00:00 Sunday
## 3 2015 10 25 03 00 00 37.98770 2015-10-25 03:00:00 Sunday
## 4 2015 10 25 04 00 00 36.12038 2015-10-25 04:00:00 Sunday
## 5 2015 10 25 05 00 00 31.85126 2015-10-25 05:00:00 Sunday
## 6 2015 10 25 06 00 00 28.23809 2015-10-25 06:00:00 Sunday
Data Aggregation at Different Time Scales
Both data sets show the lowest Water Flow during the 10 AM period of time. The distribution of aggregated flows appears to be normal and shows seasonality in the scale of hours and days of the week.
Forecast
The only forecast model that produced reasonable results matching periodicity of the historical data was Seasonal Naive.
The multiple potential seasonalities (hourly, WeekDay, Monthly) and noise in the data appeared to preclude more complex model such as ARIMA to find a set of parameters that could forecast the historical data.
Techniques such as data aggregation or models as Seasonal Naive are powerful tools when dealing with high frequency data.
## $method
## [1] "Seasonal naive method"
##
## $model
## Call: snaive(y = WFP2_Data_agg2, h = (24 * 8))
##
## Residual sd: 8.2805
##
## $lambda
## NULL
##
## $x
## Time Series:
## Start = c(0, 1)
## End = c(6, 24)
## Frequency = 24
## [1] 47.58080 45.06177 35.27366 47.68656 35.98560 39.73735 50.42331 39.45068
## [9] 36.59180 28.33674 46.52859 35.41536 41.40227 42.34403 50.91121 46.53228
## [17] 34.85761 47.18069 48.42610 39.99935 44.41102 33.12071 47.58077 47.38481
## [25] 38.82707 34.42901 41.07133 42.16208 34.66865 45.26408 44.70038 41.40555
## [33] 41.30034 38.84086 35.90252 40.05034 40.34892 35.74855 35.91561 34.72807
## [41] 38.05734 49.56629 44.90299 40.54453 42.30054 36.35438 48.42404 41.03465
## [49] 33.70027 44.38092 43.62371 38.46691 44.21530 47.08228 40.55499 35.92058
## [57] 38.10303 40.17602 34.26778 41.97120 41.81164 32.77803 43.39558 39.64293
## [65] 27.03206 27.06252 31.48747 39.29561 38.52652 31.63105 25.36962 33.12386
## [73] 38.23113 28.43512 53.42453 36.02557 43.94098 39.59530 44.92140 45.77981
## [81] 27.71829 40.32067 33.14742 42.64416 40.28602 41.36456 48.49877 36.52893
## [89] 26.39926 29.56739 38.25075 34.16781 44.82123 33.17212 49.11825 43.21395
## [97] 33.23872 36.43188 49.06736 40.23871 24.76925 29.91784 43.55657 26.53657
## [105] 34.98630 39.61490 20.59022 41.30538 38.99578 50.35201 37.05621 47.77958
## [113] 43.33413 31.90753 39.33891 42.07221 43.73603 53.01235 38.33874 39.02422
## [121] 40.78944 48.13487 47.82460 42.87733 37.13866 34.05907 31.95106 37.36704
## [129] 33.75836 33.16775 35.01480 39.24048 30.33179 46.63145 34.43842 43.80997
## [137] 43.91438 33.06041 43.80049 34.66572 38.25556 40.02876 41.96201 39.17842
## [145] 51.54758 30.17947 47.07923 38.53015 42.03357 35.17858 40.00331 28.21321
## [153] 40.27458 51.23303 31.77747 48.05283 38.42947 40.93348 29.56784 51.06131
## [161] 37.17026 51.55776 38.91904 34.39085 42.90945 41.22726 46.41242 41.08541
##
## $fitted
## Time Series:
## Start = c(0, 1)
## End = c(6, 24)
## Frequency = 24
## [1] NA NA NA NA NA NA NA NA
## [9] NA NA NA NA NA NA NA NA
## [17] NA NA NA NA NA NA NA NA
## [25] 47.58080 45.06177 35.27366 47.68656 35.98560 39.73735 50.42331 39.45068
## [33] 36.59180 28.33674 46.52859 35.41536 41.40227 42.34403 50.91121 46.53228
## [41] 34.85761 47.18069 48.42610 39.99935 44.41102 33.12071 47.58077 47.38481
## [49] 38.82707 34.42901 41.07133 42.16208 34.66865 45.26408 44.70038 41.40555
## [57] 41.30034 38.84086 35.90252 40.05034 40.34892 35.74855 35.91561 34.72807
## [65] 38.05734 49.56629 44.90299 40.54453 42.30054 36.35438 48.42404 41.03465
## [73] 33.70027 44.38092 43.62371 38.46691 44.21530 47.08228 40.55499 35.92058
## [81] 38.10303 40.17602 34.26778 41.97120 41.81164 32.77803 43.39558 39.64293
## [89] 27.03206 27.06252 31.48747 39.29561 38.52652 31.63105 25.36962 33.12386
## [97] 38.23113 28.43512 53.42453 36.02557 43.94098 39.59530 44.92140 45.77981
## [105] 27.71829 40.32067 33.14742 42.64416 40.28602 41.36456 48.49877 36.52893
## [113] 26.39926 29.56739 38.25075 34.16781 44.82123 33.17212 49.11825 43.21395
## [121] 33.23872 36.43188 49.06736 40.23871 24.76925 29.91784 43.55657 26.53657
## [129] 34.98630 39.61490 20.59022 41.30538 38.99578 50.35201 37.05621 47.77958
## [137] 43.33413 31.90753 39.33891 42.07221 43.73603 53.01235 38.33874 39.02422
## [145] 40.78944 48.13487 47.82460 42.87733 37.13866 34.05907 31.95106 37.36704
## [153] 33.75836 33.16775 35.01480 39.24048 30.33179 46.63145 34.43842 43.80997
## [161] 43.91438 33.06041 43.80049 34.66572 38.25556 40.02876 41.96201 39.17842
##
## $residuals
## Time Series:
## Start = c(0, 1)
## End = c(6, 24)
## Frequency = 24
## [1] NA NA NA NA NA NA
## [7] NA NA NA NA NA NA
## [13] NA NA NA NA NA NA
## [19] NA NA NA NA NA NA
## [25] -8.7537312 -10.6327572 5.7976661 -5.5244745 -1.3169465 5.5267331
## [31] -5.7229346 1.9548757 4.7085441 10.5041228 -10.6260623 4.6349795
## [37] -1.0533459 -6.5954804 -14.9956026 -11.8042160 3.1997236 2.3855947
## [43] -3.5231113 0.5451860 -2.1104809 3.2336736 0.8432673 -6.3501599
## [49] -5.1268049 9.9519102 2.5523820 -3.6951733 9.5466509 1.8181974
## [55] -4.1453827 -5.4849747 -3.1973139 1.3351631 -1.6347455 1.9208570
## [61] 1.4627106 -2.9705241 7.4799670 4.9148608 -11.0252776 -22.5037644
## [67] -13.4155172 -1.2489281 -3.7740142 -4.7233387 -23.0544188 -7.9107963
## [73] 4.5308575 -15.9458030 9.8008181 -2.4413405 -0.2743196 -7.4869806
## [79] 4.3664096 9.8592304 -10.3847352 0.1446432 -1.1203572 0.6729670
## [85] -1.5256111 8.5865321 5.1031927 -3.1139982 -0.6327959 2.5048673
## [91] 6.7632867 -5.1277974 6.2947117 1.5410725 23.7486277 10.0900963
## [97] -4.9924017 7.9967546 -4.3571742 4.2131427 -19.1717277 -9.6774550
## [103] -1.3648347 -19.2432354 7.2680053 -0.7057697 -12.5571960 -1.3387805
## [109] -1.2902464 8.9874478 -11.4425632 11.2506466 16.9348631 2.3401364
## [115] 1.0881541 7.9044053 -1.0852036 19.8402308 -10.7795073 -4.1897307
## [121] 7.5507162 11.7029931 -1.2427576 2.6386199 12.3694044 4.1412252
## [127] -11.6055117 10.8304731 -1.2279340 -6.4471460 14.4245766 -2.0649043
## [133] -8.6639830 -3.7205571 -2.6177858 -3.9696022 0.5802509 1.1528859
## [139] 4.4615788 -7.4064993 -5.4804707 -12.9835897 3.6232733 0.1541990
## [145] 10.7581444 -17.9554010 -0.7453725 -4.3471815 4.8949123 1.1195126
## [151] 8.0522536 -9.1538366 6.5162218 18.0652849 -3.2373259 8.8123517
## [157] 8.0976705 -5.6979668 -4.8705858 7.2513371 -6.7441184 18.4973517
## [163] -4.8814432 -0.2748684 4.6538894 1.1985058 4.4504110 1.9069859
## WFP1_Data_agg2 WFP2_Data_agg2
## 1 16.059214 47.58080
## 2 15.863804 45.06177
## 3 25.650593 35.27366
## 4 23.635124 47.68656
## 5 23.714303 35.98560
## 6 23.314339 39.73735
## 7 24.105044 50.42331
## 8 22.856268 39.45068
## 9 25.131327 36.59180
## 10 18.833089 28.33674
## 11 17.173694 46.52859
## 12 18.739112 35.41536
## 13 14.472489 41.40227
## 14 19.563607 42.34403
## 15 17.349672 50.91121
## 16 15.474739 46.53228
## 17 17.832126 34.85761
## 18 20.546048 47.18069
## 19 20.356969 48.42610
## 20 17.239913 39.99935
## 21 19.356167 44.41102
## 22 19.250355 33.12071
## 23 15.949138 47.58077
## 24 20.111429 47.38481
## 25 18.402730 38.82707
## 26 24.807050 34.42901
## 27 21.092540 41.07133
## 28 20.779714 42.16208
## 29 17.662120 34.66865
## 30 19.012419 45.26408
## 31 19.634759 44.70038
## 32 20.704561 41.40555
## 33 17.892379 41.30034
## 34 27.031242 38.84086
## 35 14.168328 35.90252
## 36 19.681238 40.05034
## 37 20.270096 40.34892
## 38 22.358185 35.74855
## 39 19.322248 35.91561
## 40 20.675480 34.72807
## 41 17.657527 38.05734
## 42 22.467979 49.56629
## 43 22.522688 44.90299
## 44 19.501629 40.54453
## 45 14.330054 42.30054
## 46 18.630484 36.35438
## 47 19.296505 48.42404
## 48 26.604516 41.03465
## 49 21.478724 33.70027
## 50 24.169753 44.38092
## 51 17.681047 43.62371
## 52 17.228764 38.46691
## 53 15.557495 44.21530
## 54 13.755156 47.08228
## 55 18.378993 40.55499
## 56 17.348098 35.92058
## 57 22.767248 38.10303
## 58 22.208604 40.17602
## 59 16.768433 34.26778
## 60 27.174478 41.97120
## 61 20.868027 41.81164
## 62 16.441789 32.77803
## 63 18.509076 43.39558
## 64 14.787636 39.64293
## 65 26.339307 27.03206
## 66 19.628814 27.06252
## 67 18.593013 31.48747
## 68 24.965091 39.29561
## 69 22.618109 38.52652
## 70 12.099414 31.63105
## 71 22.612621 25.36962
## 72 21.166484 33.12386
## 73 18.389080 38.23113
## 74 23.560504 28.43512
## 75 23.996702 53.42453
## 76 18.370869 36.02557
## 77 20.618550 43.94098
## 78 25.888403 39.59530
## 79 13.276622 44.92140
## 80 19.954973 45.77981
## 81 17.896071 27.71829
## 82 10.719391 40.32067
## 83 14.956229 33.14742
## 84 14.538538 42.64416
## 85 23.706073 40.28602
## 86 20.588211 41.36456
## 87 21.670380 48.49877
## 88 28.049444 36.52893
## 89 19.306950 26.39926
## 90 25.930345 29.56739
## 91 12.977001 38.25075
## 92 25.161390 34.16781
## 93 24.392668 44.82123
## 94 18.382795 33.17212
## 95 18.111355 49.11825
## 96 6.037540 43.21395
## 97 17.341505 33.23872
## 98 26.199455 36.43188
## 99 26.775857 49.06736
## 100 15.122721 40.23871
## 101 26.450472 24.76925
## 102 19.379246 29.91784
## 103 20.309258 43.55657
## 104 14.976784 26.53657
## 105 17.948196 34.98630
## 106 8.530415 39.61490
## 107 21.318989 20.59022
## 108 25.120167 41.30538
## 109 15.530935 38.99578
## 110 14.509383 50.35201
## 111 22.184786 37.05621
## 112 22.363508 47.77958
## 113 26.027013 43.33413
## 114 18.337608 31.90753
## 115 22.467321 39.33891
## 116 18.007411 42.07221
## 117 16.440147 43.73603
## 118 23.807593 53.01235
## 119 22.097793 38.33874
## 120 19.319161 39.02422
## 121 21.477792 40.78944
## 122 24.942278 48.13487
## 123 13.196225 47.82460
## 124 14.627345 42.87733
## 125 12.728187 37.13866
## 126 24.441168 34.05907
## 127 16.295212 31.95106
## 128 21.175908 37.36704
## 129 18.380724 33.75836
## 130 15.500801 33.16775
## 131 11.375779 35.01480
## 132 23.474667 39.24048
## 133 19.452844 30.33179
## 134 23.102799 46.63145
## 135 23.957343 34.43842
## 136 19.087196 43.80997
## 137 15.035653 43.91438
## 138 25.729928 33.06041
## 139 22.436649 43.80049
## 140 17.092190 34.66572
## 141 29.020963 38.25556
## 142 21.669312 40.02876
## 143 15.521515 41.96201
## 144 21.943044 39.17842
## 145 22.363424 51.54758
## 146 22.364263 30.17947
## 147 19.923735 47.07923
## 148 17.947958 38.53015
## 149 19.235762 42.03357
## 150 20.128081 35.17858
## 151 22.947803 40.00331
## 152 21.559265 28.21321
## 153 19.063454 40.27458
## 154 18.523604 51.23303
## 155 18.260121 31.77747
## 156 16.761564 48.05283
## 157 25.653080 38.42947
## 158 17.947947 40.93348
## 159 24.204276 29.56784
## 160 22.846984 51.06131
## 161 26.802653 37.17026
## 162 21.609274 51.55776
## 163 16.696297 38.91904
## 164 18.298961 34.39085
## 165 22.100456 42.90945
## 166 19.719495 41.22726
## 167 15.010076 46.41242
## 168 16.249880 41.08541
## [1] "/ATM" "/Power" "/Water_Flow"
## [[1]]
## [1] 0
##
## [[2]]
## [1] 0
##
## [[3]]
## [1] 0