Project 1

In the following project all missing values (N.A.'s) will replaced with linearly interpolated values using the na.approx() from the zoo library.
The Hampel filter will be used to identify as outliers the values outside the interval (I) formed by the median, plus or minus 3 median absolute deviations. Outliers will replaced with linearly interpolated values as well.

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

After loading and joining the four ATM datasets, the head and tail of the data set looks like:

##          Date ATM1_Cash ATM2_Cash ATM3_Cash ATM4_Cash
## 1  2009-05-03        96       107         0 776.99342
## 2  2009-05-04        82        89         0 524.41796
## 3  2009-05-05        85        90         0 792.81136
## 4  2009-05-06        90        55         0 908.23846
## 5  2009-05-07        99        79         0  52.83210
## 6  2009-05-08        88        19         0  52.20845
## 7  2009-05-09         8         2         0  55.47361
## 8  2009-05-10       104       103         0 558.50325
## 9  2009-05-11        87       107         0 904.34136
## 10 2009-05-12        93       118         0 879.49359

##           Date ATM1_Cash ATM2_Cash ATM3_Cash  ATM4_Cash
## 356 2010-04-23       100       100         0 556.792330
## 357 2010-04-24        69       103         0 386.175335
## 358 2010-04-25        85        44         0 165.294181
## 359 2010-04-26        85        61         0   5.451815
## 360 2010-04-27       109        89         0 542.280602
## 361 2010-04-28        74        11         0 403.839336
## 362 2010-04-29         4         2         0  13.697331
## 363 2010-04-30        96       107        96 348.201061
## 364 2010-05-01        82        89        82  44.245345
## 365 2010-05-02        85        90        85 482.287107

From the first 5 rows of observations we notice that we have a time stamp indicating year, month and day of usage. ATM1 and ATM2 have been continously used at least during the first 10 days. ATM3 has not been used at all during this time period. The data from ATM4 shows is reporting cash withdrawals 10 times larger than typical withdrawals from ATM1 and ATM2 which is possible if it is located in a high usage area (e.g. stadium, train station, etc). It is also reporting cash withdrawals with fractions of a cent which is not possible.

It is not possible to get a clarification on the validity of the observatiosn from ATM4. Probably there is a failure in the machine that is affecting the floating point calculations of the register. If this is the case, the range of cash withdrawals could be similar as that of ATM1 and ATM4.

##          Date ATM1_Cash ATM2_Cash ATM3_Cash ATM4_Cash
## 1  2009-05-03        96       107         0        78
## 2  2009-05-04        82        89         0        52
## 3  2009-05-05        85        90         0        79
## 4  2009-05-06        90        55         0        91
## 5  2009-05-07        99        79         0         5
## 6  2009-05-08        88        19         0         5
## 7  2009-05-09         8         2         0         6
## 8  2009-05-10       104       103         0        56
## 9  2009-05-11        87       107         0        90
## 10 2009-05-12        93       118         0        88

Data Imputation

Data from ATM1 has 3 values missing from the series while ATM2 has 2 missing values:

## 
## Summary of ATM1

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   73.00   91.00   83.89  108.00  180.00       3

## 
## Summary of ATM2

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   25.50   67.00   62.58   93.00  147.00       2

The missing values will be replaced with interpolated values. For ATM1, we can see below how the Median has not changed while the Mean has increased by 0.31%. For ATM2, we see no change on the Median while the mean has increased by 0.02%.

## 
## Summary of Imputed ATM1

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   73.00   91.00   84.15  108.00  180.00

## 
## Summary of Imputed ATM2

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   26.00   67.00   62.59   93.00  147.00

Exploratory Data Analysis

First, let's look at some statistics of the cash withdrawals:

## Summary of values from ATM1

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   73.00   91.00   84.15  108.00  180.00

## 
## Summary of values from ATM2

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   26.00   67.00   62.59   93.00  147.00

## 
## Summary of values from ATM3

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7206  0.0000 96.0000

## 
## Summary of values from ATM4

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    12.0    40.0    47.4    70.0  1092.0

After imputation, the data in ATM1 and ATM2 doesn't raise any red flags. ATM3 is filled with mostly zeroes and ATM4 has at least one outlier and the values don't make sense if the cash withdrawals are in hundreds of dollars.

ATM1

Cash withdrawals from ATM1 are bimodal with two peaks at around 20 and 90 hundreds of dollars. People tend to withdraw money near these two values. We can see these in the Scatter and Histogram plots below.

From the Autocorrelation Fuction (ACF) plot below we can see how the withdrawals are seasonal in the sense that they are correlated to those 7 days in the future and past. There is a weekly pattern.

ATM2

Cash withdrawals from ATM2 are also slighly bimodal with peaks near 20 and 70 hundreds of dollars. The bimodal distribution is not as defined as in the case of ATM1.

From the Autocorrelation Fuction (ACF) plot below we can see how the withdrawals from ATM2 are also seasonal. There is a weekly pattern just as we saw for ATM1.

ATM3

Cash withdrawals from ATM3 are mostly zero with a few withdrawals near a 100 hundred dollards.

The frequency distribution for the whole data set is the following:

## 
##   0  82  85  96 
## 362   1   1   1

There were only 3 non-zero withdrawals at the end of the series or only 0.82% of the observations. The last 10 ATM4 observations are:

##  [1]  0  0  0  0  0  0  0 96 82 85

This may have been because the machine had no money on all but the last 3 days of the series. This will severely limit the ability to forecast future cash withdrawals from this machine.

ATM4

Cash withdrawals from ATM4 do not follow a normal distribution and are very right modal or positive skewed. There is also an extreme withdrawal value at 1092 (hundres of dollars) that we may have to consider an outlier. In the plot below, we see how taking the natural log of the values, we have brough back the skewed distribution to one resembling a normal distribution.

From the Autocorrelation Fuction (ACF) plot below we can see how the withdrawals from ATM4 appear to have seasonal autocorrelation. However, the spikes in the lags fall below the 95% treshold confidence intervals determined by the variance in the data set.

Forecasting ATM1

First, let's generate some simple non-seasonal models:

The models above don't capture at all the seasonality of the cash withdrawals.

Models such as ETS, Seasonal Naive and ARIMA do capture the seasonality of the training data and match the valley's when withdrawals drop to close to zero. From the three models, Seasonal Naive matches the phase of the valley while the other two models are out of phase with the actual data.

Accuracy of ATM1 Models

## # A tibble: 4 x 9
##   .model               .type           ME  RMSE   MAE   MPE  MAPE  MASE    ACF1
##   <chr>                <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 Holt's method        Training -1.01e+ 0  36.7  27.1 -177.  199.  1.53  0.0739
## 2 Damped Holt's method Training -4.90e- 1  36.6  27.1 -180.  203.  1.53  0.0657
## 3 Mean                 Training -6.97e-15  36.6  27.3 -175.  199.  1.54  0.0706
## 4 Naïve                Training -3.02e- 2  49.9  37.2 -132.  167.  2.10 -0.356

All non-seasonal methods perform similary on the training and test data. The Naive method having the worst accuracy.

## # A tibble: 4 x 9
##   .model               .type    ME  RMSE   MAE   MPE  MAPE  MASE    ACF1
##   <chr>                <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 Damped Holt's method Test  -3.53  43.3  33.3 -495.  521.  1.88 0.00766
## 2 Holt's method        Test  -4.45  43.9  33.6 -492.  518.  1.89 0.0211 
## 3 Mean                 Test  -5.32  42.5  31.4 -469.  491.  1.77 0.0305 
## 4 Naïve                Test  -4.85  48.6  34.4 -226.  252.  1.93 0.0451

Models that take into account seasonality (ETS, Seasonal Naive and ARIMA) perform much better compared to simpler models as we can see based on the accuracy metrics above and below. The ARIMA model performs slightly better than the rest from the perspective of the RMSE parameter.

## # A tibble: 3 x 9
##   .model .type         ME  RMSE   MAE    MPE  MAPE  MASE     ACF1
##   <chr>  <chr>      <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>
## 1 ets    Training -0.0482  23.8  15.1 -107.   122. 0.852  0.144  
## 2 arima  Training -0.0742  23.3  14.5 -102.   117. 0.822 -0.00872
## 3 snaive Training -0.0363  27.7  17.7  -96.5  116. 1      0.188

On both training and test data, ARIMA had a better RMSE performance than ETS or Seasonal Naive.

## # A tibble: 3 x 9
##   .model .type     ME  RMSE   MAE   MPE  MAPE  MASE    ACF1
##   <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 arima  Test  -0.741  34.0  20.1 -440.  457.  1.13  0.0336
## 2 ets    Test  -3.62   40.0  24.9 -522.  541.  1.40 -0.0579
## 3 snaive Test  -2.64   40.5  26.4 -534.  562.  1.49  0.0328

The less parsimonious model is the ARIMA model that when auto-generated created a model with non-seasonal and seasonal components. The parameters shown below indicate that the model differentiates only (seasonal differencing) and bases the seasonality on the length of the week in days. The ACF of the residual lags are below the treshold of statistical significance and the residual follow a normal distribution.

## Series: ATM1_Cash 
## Model: ARIMA(0,0,1)(0,1,2)[7] 
## 
## Coefficients:
##          ma1     sma1     sma2
##       0.1950  -0.5806  -0.1037
## s.e.  0.0546   0.0506   0.0494
## 
## sigma^2 estimated as 556.4:  log likelihood=-1640.01
## AIC=3288.03   AICc=3288.14   BIC=3303.55

The parameters of the ETS model below indicate no trend was found and the errors and seasonal componets were found to be additive. The model season component shows a section when the variance changes signficantly compared to the rest of the series around March 2010.

## Series: ATM1_Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.02126635 
##     gamma = 0.3298864 
## 
##   Initial states:
##         l        s1       s2       s3       s4       s5       s6      s7
##  77.02956 -55.27235 2.531658 16.15413 4.529656 11.40245 12.92571 7.72875
## 
##   sigma^2:  580.2586
## 
##      AIC     AICc      BIC 
## 4487.018 4487.639 4526.017

We can select the seasonal ARIMA(0,0,1)(0,1,2)[7] model as a good candidate to model the ATM1 cash withdrawals.

##   .model       Date             ATM1_Cash      .mean
## 1  arima 2010-05-03 N(87.16789, 556.3821)  87.167893
## 2  arima 2010-05-04 N(104.0559, 577.5304) 104.055911
## 3  arima 2010-05-05 N(73.14947, 577.5304)  73.149472
## 4  arima 2010-05-06 N(8.032036, 577.5304)   8.032036
## 5  arima 2010-05-07 N(100.0066, 577.5304) 100.006619
## 6  arima 2010-05-08 N(80.85748, 577.5304)  80.857483

Forecasting ATM2

The ETS, Seasonal Naive and ARIMA models have again captured the seasonality of the data and match the timing of the valley's when withdrawals drop to close to zero. From the three models, Seasonal Naive matches the phase of the valleys as well as the variance in the observed data better than the other two models.

Accuracy of ATM2 Models

## # A tibble: 4 x 9
##   .model               .type           ME  RMSE   MAE   MPE  MAPE  MASE     ACF1
##   <chr>                <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 Holt's method        Training  7.24e- 1  38.1  32.6  -Inf   Inf  1.57 -0.0131 
## 2 Damped Holt's method Training -2.19e+ 0  38.3  32.7  -Inf   Inf  1.57 -0.00811
## 3 Mean                 Training -2.43e-16  38.8  33.2  -Inf   Inf  1.59  0.0228 
## 4 Naïve                Training -4.67e- 2  54.2  42.5  -Inf   Inf  2.05 -0.307

The metrics of Mean Absolute Error (MPE) and Mean Absolute Percentage Error (MAPE) above and below are showing the effects of division by zero (true obs is zero) giving infinite values.

The ARIMA model again performs slightly better than the rest from the perspective of the RMSE parameter. However, as we saw from the plot of the forecasts, Seasonal Naive better matched the phase and the range of variance of the historical data.

## # A tibble: 3 x 9
##   .model .type         ME  RMSE   MAE   MPE  MAPE  MASE     ACF1
##   <chr>  <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1 ets    Training -0.689   25.1  17.9  -Inf   Inf 0.859  0.0199 
## 2 arima  Training -0.891   24.1  17.0  -Inf   Inf 0.820 -0.00428
## 3 snaive Training  0.0223  30.2  20.8  -Inf   Inf 1.00   0.0482

The auto-generated ARIMA model chose parameters for a model with non-seasonal and seasonal components. The weekly seasonality appears to be the strongest in determining the AR and MA components of the ARIMA model. The residual plots again indicate the goodness of fit for the ARIMA model.

## Series: ATM2_Cash 
## Model: ARIMA(2,0,2)(0,1,1)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1
##       -0.4320  -0.9130  0.4773  0.8048  -0.7547
## s.e.   0.0553   0.0407  0.0861  0.0556   0.0381
## 
## sigma^2 estimated as 602.5:  log likelihood=-1653.67
## AIC=3319.33   AICc=3319.57   BIC=3342.61

The parameters of the ETS model below indicate no trend was found and the erros and seasonal componets were found to be additive. As in the case for the ATM1 series, there is a section in the series when the variance changes signficantly around March 2010.

## Series: ATM2_Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0001000049 
##     gamma = 0.3594252 
## 
##   Initial states:
##         l        s1        s2       s3        s4       s5       s6       s7
##  72.71479 -33.56563 -27.28811 12.45705 0.8474403 4.742114 17.29087 25.51626
## 
##   sigma^2:  648.1015
## 
##      AIC     AICc      BIC 
## 4527.377 4527.998 4566.376

Seasonal Naive is a good choice to model cash withdrawals from ATM2 based on the matching of variance and phase of the historical data.

##   .model       Date             ATM1_Cash      .mean .model       Date
## 1  arima 2010-05-03 N(87.16789, 556.3821)  87.167893 snaive 2010-05-03
## 2  arima 2010-05-04 N(104.0559, 577.5304) 104.055911 snaive 2010-05-04
## 3  arima 2010-05-05 N(73.14947, 577.5304)  73.149472 snaive 2010-05-05
## 4  arima 2010-05-06 N(8.032036, 577.5304)   8.032036 snaive 2010-05-06
## 5  arima 2010-05-07 N(100.0066, 577.5304) 100.006619 snaive 2010-05-07
## 6  arima 2010-05-08 N(80.85748, 577.5304)  80.857483 snaive 2010-05-08
##          ATM2_Cash .mean
## 1  N(61, 911.8715)    61
## 2  N(89, 911.8715)    89
## 3  N(11, 911.8715)    11
## 4   N(2, 911.8715)     2
## 5 N(107, 911.8715)   107
## 6  N(89, 911.8715)    89

Forecasting ATM3

The data from ATM3 is filled almost completely with zeros as seen below.

Only the last 3 daily observations of the year long data set contain non-zero observations. 3 observations are not enough for any reasonable forecast that takes into account trend and/or seasonality.

## # A tsibble: 6 x 2 [1D]
##   Date       ATM3_Cash
##   <date>         <dbl>
## 1 2009-05-03         0
## 2 2009-05-04         0
## 3 2009-05-05         0
## 4 2009-05-06         0
## 5 2009-05-07         0
## 6 2009-05-08         0

## # A tsibble: 6 x 2 [1D]
##   Date       ATM3_Cash
##   <date>         <dbl>
## 1 2010-04-27         0
## 2 2010-04-28         0
## 3 2010-04-29         0
## 4 2010-04-30        96
## 5 2010-05-01        82
## 6 2010-05-02        85

A Naive forecast considering that the next withdrawal would be equal to the last one is viable considering the extremely limited number of observations. As we can see below, the confidence intervals quickly balloon out as well as the uncertainty for any future forecast.

##   .model       Date             ATM1_Cash      .mean .model       Date
## 1  arima 2010-05-03 N(87.16789, 556.3821)  87.167893 snaive 2010-05-03
## 2  arima 2010-05-04 N(104.0559, 577.5304) 104.055911 snaive 2010-05-04
## 3  arima 2010-05-05 N(73.14947, 577.5304)  73.149472 snaive 2010-05-05
## 4  arima 2010-05-06 N(8.032036, 577.5304)   8.032036 snaive 2010-05-06
## 5  arima 2010-05-07 N(100.0066, 577.5304) 100.006619 snaive 2010-05-07
## 6  arima 2010-05-08 N(80.85748, 577.5304)  80.857483 snaive 2010-05-08
##          ATM2_Cash .mean .model       Date       ATM3_Cash .mean
## 1  N(61, 911.8715)    61  Naïve 2010-05-03 N(85, 25.88187)    85
## 2  N(89, 911.8715)    89  Naïve 2010-05-04 N(85, 51.76374)    85
## 3  N(11, 911.8715)    11  Naïve 2010-05-05  N(85, 77.6456)    85
## 4   N(2, 911.8715)     2  Naïve 2010-05-06 N(85, 103.5275)    85
## 5 N(107, 911.8715)   107  Naïve 2010-05-07 N(85, 129.4093)    85
## 6  N(89, 911.8715)    89  Naïve 2010-05-08 N(85, 155.2912)    85

Forecasting ATM4

We can find outlier using a method knows as Hampel filter, considers as outliers the values outside the interval (I) formed by the median, plus or minus 3 median absolute deviations (MAD).

\[I=[\text {median}-3 \cdot M A D ; \text { median }+3 \cdot M A D]\]

where MAD is defined as:

\[M A D=\operatorname{median}\left(\left|X_{i}-\tilde{X}\right|\right)\]

The data from ATM4 contains an outlier easily seen in the plot below.

We'll detect the outlier, replace with an NA and finally use the na.approx() to linearly interpolate a new value.

The fit of the seasonal models to the historical data appears to be poor. Neither of the models capture the periodicity or the variance of the data. This is due to the non-normal distribution of the data its white-noise like properties. The ARIMA model fits the worst as the model eventuallly dampens to a constant value.

Accuracy of ATM4 Models

## # A tibble: 4 x 9
##   .model               .type           ME  RMSE   MAE   MPE  MAPE  MASE    ACF1
##   <chr>                <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 Holt's method        Training -6.08e- 1  35.5  29.6  -Inf   Inf 0.853  0.0794
## 2 Damped Holt's method Training -1.96e- 1  35.1  29.2  -Inf   Inf 0.843  0.0783
## 3 Mean                 Training -2.49e-15  35.1  29.2  -Inf   Inf 0.844  0.0768
## 4 Naïve                Training -8.24e- 2  47.7  38.7  -Inf   Inf 1.12  -0.464

The accuracy metrics of all the models tested are similar from simple non-seasonal models such as the Mean of the series to more complex models such as seasonal ARIMA.

## # A tibble: 3 x 9
##   .model .type          ME  RMSE   MAE   MPE  MAPE  MASE    ACF1
##   <chr>  <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ets    Training -0.707    32.9  26.7  -Inf   Inf 0.769 0.102  
## 2 arima  Training  0.00214  34.0  28.0  -Inf   Inf 0.809 0.00478
## 3 snaive Training -0.372    45.3  34.7  -Inf   Inf 1     0.0630

It is interesting the auto-selected ARIMA model is that of a seasonal ARIMA with a season length of 7 days with zero order differencing of both the non and seasonal components. The model tried to incorporate a seasonal component with limited influence on the non-seasonal component.

## Series: ATM4_Cash 
## Model: ARIMA(3,0,2)(1,0,0)[7] w/ mean 
## 
## Coefficients:
##           ar1      ar2     ar3     ma1     ma2    sar1  constant
##       -0.8763  -0.6769  0.1701  0.9620  0.8072  0.2481   79.4710
## s.e.   0.1000   0.1194  0.0565  0.0901  0.1088  0.0556    4.8915
## 
## sigma^2 estimated as 1177:  log likelihood=-1805.01
## AIC=3626.02   AICc=3626.42   BIC=3657.22

Taking into account the following regarding the registered cash withdrawals from ATM4:

Values show discrepancies on the withdrawal amounts compared to values from ATM1, ATM2 and the few non-zero observations from ATM3.
Values indicate franctional dollar withdrawal.
Values are highly skewed.
Large standard deviation of the data induces very broad confidence intervals.

Model complexity has not increased accuracy so a simple model such as a the Mean of observations is a good candidate to forecasting future cash withdrawals for ATM4.

##   .model       Date             ATM1_Cash      .mean .model       Date
## 1  arima 2010-05-03 N(87.16789, 556.3821)  87.167893 snaive 2010-05-03
## 2  arima 2010-05-04 N(104.0559, 577.5304) 104.055911 snaive 2010-05-04
## 3  arima 2010-05-05 N(73.14947, 577.5304)  73.149472 snaive 2010-05-05
## 4  arima 2010-05-06 N(8.032036, 577.5304)   8.032036 snaive 2010-05-06
## 5  arima 2010-05-07 N(100.0066, 577.5304) 100.006619 snaive 2010-05-07
## 6  arima 2010-05-08 N(80.85748, 577.5304)  80.857483 snaive 2010-05-08
##          ATM2_Cash .mean .model       Date       ATM3_Cash .mean .model
## 1  N(61, 911.8715)    61  Naïve 2010-05-03 N(85, 25.88187)    85   Mean
## 2  N(89, 911.8715)    89  Naïve 2010-05-04 N(85, 51.76374)    85   Mean
## 3  N(11, 911.8715)    11  Naïve 2010-05-05  N(85, 77.6456)    85   Mean
## 4   N(2, 911.8715)     2  Naïve 2010-05-06 N(85, 103.5275)    85   Mean
## 5 N(107, 911.8715)   107  Naïve 2010-05-07 N(85, 129.4093)    85   Mean
## 6  N(89, 911.8715)    89  Naïve 2010-05-08 N(85, 155.2912)    85   Mean
##         Date            ATM4_Cash    .mean
## 1 2010-05-03 N(47.40274, 4250.42) 47.40274
## 2 2010-05-04 N(47.40274, 4250.42) 47.40274
## 3 2010-05-05 N(47.40274, 4250.42) 47.40274
## 4 2010-05-06 N(47.40274, 4250.42) 47.40274
## 5 2010-05-07 N(47.40274, 4250.42) 47.40274
## 6 2010-05-08 N(47.40274, 4250.42) 47.40274

Outliers detection in R https://www.statsandr.com/blog/outliers-detection-in-r/

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Exploratory Data Analysis

Let's examine the features of the residential power usage using exploratory data analysis (EDA).

From these exploratory plots we can notice the outlier occuring after 2010. The data has a normal distribution and shows uniform variance over the range of observations.

Identify and Deal N.A.'s

We have a missing observation in this data series of 192 observations. The missing observation is located at row number 705. These are the elements before and after the missing observation:

## [1] 7643987 8037137      NA 5101803 4555602

We will use linear interpolation to substitute the missing value.

## [1] 7643987 8037137 6569470 5101803 4555602

Identify and Remove Outlier

The outlier in this series is the minimum value of the series.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   770523  5434539  6314472  6502824  7608792 10655730

The outlier will be identified using

## [1] 770523

## [1] 151

Below we can the time series without the outlier or the missing values.

Transformations

There appears to be a trends in the data series but no large variability in the variance of the data. A BoxCox transformation did not improve the quality of the data.

Forecasting of Residential Power Consumption

The ETS, Seasonal Naive and ARIMA models have been successful in capturing the seasonality and variance of the historical data set. We see below that the ARIMA model has a slightly higher accuracy based on the RMSE parameter as compared to the other two models.

## # A tibble: 3 x 9
##   .model .type        ME    RMSE     MAE     MPE  MAPE  MASE     ACF1
##   <chr>  <chr>     <dbl>   <dbl>   <dbl>   <dbl> <dbl> <dbl>    <dbl>
## 1 ets    Training 52868. 631229. 482033.  0.0715  7.27 0.772 0.220   
## 2 arima  Training -8679. 590208. 430362. -0.798   6.53 0.690 0.000127
## 3 snaive Training 94245. 825374. 624053.  0.681   9.28 1.00  0.298

The auto-selected ARIMA model performed best in the training and test data sets with a lower RMSE while Seasonal Naive has the highest RMSE value.

## # A tibble: 3 x 9
##   .model .type      ME    RMSE     MAE   MPE  MAPE  MASE    ACF1
##   <chr>  <chr>   <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 arima  Test  190750. 672763. 536673. 1.61   8.23 0.858  0.136 
## 2 ets    Test  116692. 784947. 612111. 0.295  9.45 0.978  0.0438
## 3 snaive Test  215759. 906344. 663847. 1.67  10.1  1.06  -0.162

## 50.857 sec elapsed

ARIMA auto-selected a seasonal model with a season length of 12 (12 months in a year). Only the seasonal component was differenced with a Auto-Regressive coefficient of 2. The only coefficient in the non-seasonal component was the Moving Average compoment with a coefficient of 4.

## Series: value 
## Model: ARIMA(0,0,4)(2,1,0)[12] w/ drift 
## 
## Coefficients:
##          ma1     ma2     ma3      ma4     sar1     sar2   constant
##       0.3459  0.0665  0.2109  -0.0772  -0.7368  -0.4342  233796.06
## s.e.  0.0772  0.0838  0.0737   0.0819   0.0763   0.0784   73859.23
## 
## sigma^2 estimated as 3.866e+11:  log likelihood=-2657.65
## AIC=5331.31   AICc=5332.15   BIC=5356.85

Below we can see the 6 month forecast using the ARIMA(0,0,4)(2,1,0)[12] model. The periodicity and variance matches the historical data of the set.

## # A fable: 1 x 4 [1M]
## # Key:     .model [1]
##   .model    index             value     .mean
##   <chr>     <mth>            <dist>     <dbl>
## 1 arima  2014 Jan N(1e+07, 3.9e+11) 10248709.

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Waterflow_Pipe1.xlsx

Exploratory Data Analysis

##    Date Time       WaterFlow     
##  Min.   :42300   Min.   : 1.067  
##  1st Qu.:42302   1st Qu.:13.683  
##  Median :42305   Median :19.880  
##  Mean   :42305   Mean   :19.897  
##  3rd Qu.:42307   3rd Qu.:26.159  
##  Max.   :42310   Max.   :38.913

The "Date Time" column shows times of observation in days and fractions fractions of a day. We need to convert them into number of seconds and fractions of second to use functions such as as_datetime or as.POSIXct to covert the observation time stamp into a date that we can use for our work.

We can extract from "year", "month", "day", "hour", "minutes", "seconds" parameters from the "Date Time" field. We can use these parameters for aggregating data by hour.

Data Aggregation at Different Time Scales

The lowest order seasonality is that at the hourly level. We see peak usage in the late-evening/early-morning period (people taking showers before bed) and in the afternoon at 3pm when people start to come back from work. The water flow measurments appear to follow a normal distribution and there are no outliers at the hourly aggregation level. The aggregated lowest water flow observation occurs at 10 AM as seen above in the plot at the bottom right corner.

We can also aggregate by day of the week (WeekDay) and hour as shown above. The aggregated data also shows a normal distribution. From the time series plot above we can see a dip in the aggregated data at the start/end of a day. This daily seasonality can also be seen in the series decomposition below. We an see that the seasonal component shows 7 distinctive cycles for the 7 days of the week.

Forecast

Three different models (ARIMA, ETS, Seasonal Naive) were chosen to forecast the seasonally varying data. Only the Seasonal Naive model was able to produce a forecast that represents the historical data.

The ETS model was only able to forecast a Mean value of the observations.

## ETS(A,N,N) 
## 
## Call:
##  ets(y = WFP1_Data_agg2) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 19.8521 
## 
##   sigma:  4.0452
## 
##      AIC     AICc      BIC 
## 1334.384 1334.530 1343.756 
## 
## Training set error measures:
##                        ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 0.0002538585 4.021046 3.231715 -5.28105 18.49092 0.6690035
##                   ACF1
## Training set 0.0286744

ARIMA also completely failed to generate a model that fits the data.

What should be our forecast for this series. We can either choose the Seasonal Naive model that tries to fit the aggregated data or we can use the aggregated data per WeekDay and Hour that also shows the seasonality per day and per hour.

Reorder factor levels by day of the week in R

https://stackoverflow.com/questions/10309564/reorder-factor-levels-by-day-of-the-week-in-r

Second Data Set

The data set from Waterflow_Pipe2.xlsx showed the same characteristics as the previous set. The time stamp had to be converted form numerical time to DataTime that we can use to aggregate data per hour.

##   year month day hour minutes seconds WaterFlow            DateTime WeekDay
## 1 2015    10  25   01      00      00  18.81079 2015-10-25 01:00:00  Sunday
## 2 2015    10  25   02      00      00  43.08703 2015-10-25 02:00:00  Sunday
## 3 2015    10  25   03      00      00  37.98770 2015-10-25 03:00:00  Sunday
## 4 2015    10  25   04      00      00  36.12038 2015-10-25 04:00:00  Sunday
## 5 2015    10  25   05      00      00  31.85126 2015-10-25 05:00:00  Sunday
## 6 2015    10  25   06      00      00  28.23809 2015-10-25 06:00:00  Sunday

Data Aggregation at Different Time Scales

Both data sets show the lowest Water Flow during the 10 AM period of time. The distribution of aggregated flows appears to be normal and shows seasonality in the scale of hours and days of the week.

Forecast

The only forecast model that produced reasonable results matching periodicity of the historical data was Seasonal Naive.

The multiple potential seasonalities (hourly, WeekDay, Monthly) and noise in the data appeared to preclude more complex model such as ARIMA to find a set of parameters that could forecast the historical data.

Techniques such as data aggregation or models as Seasonal Naive are powerful tools when dealing with high frequency data.

## $method
## [1] "Seasonal naive method"
## 
## $model
## Call: snaive(y = WFP2_Data_agg2, h = (24 * 8)) 
## 
## Residual sd: 8.2805 
## 
## $lambda
## NULL
## 
## $x
## Time Series:
## Start = c(0, 1) 
## End = c(6, 24) 
## Frequency = 24 
##   [1] 47.58080 45.06177 35.27366 47.68656 35.98560 39.73735 50.42331 39.45068
##   [9] 36.59180 28.33674 46.52859 35.41536 41.40227 42.34403 50.91121 46.53228
##  [17] 34.85761 47.18069 48.42610 39.99935 44.41102 33.12071 47.58077 47.38481
##  [25] 38.82707 34.42901 41.07133 42.16208 34.66865 45.26408 44.70038 41.40555
##  [33] 41.30034 38.84086 35.90252 40.05034 40.34892 35.74855 35.91561 34.72807
##  [41] 38.05734 49.56629 44.90299 40.54453 42.30054 36.35438 48.42404 41.03465
##  [49] 33.70027 44.38092 43.62371 38.46691 44.21530 47.08228 40.55499 35.92058
##  [57] 38.10303 40.17602 34.26778 41.97120 41.81164 32.77803 43.39558 39.64293
##  [65] 27.03206 27.06252 31.48747 39.29561 38.52652 31.63105 25.36962 33.12386
##  [73] 38.23113 28.43512 53.42453 36.02557 43.94098 39.59530 44.92140 45.77981
##  [81] 27.71829 40.32067 33.14742 42.64416 40.28602 41.36456 48.49877 36.52893
##  [89] 26.39926 29.56739 38.25075 34.16781 44.82123 33.17212 49.11825 43.21395
##  [97] 33.23872 36.43188 49.06736 40.23871 24.76925 29.91784 43.55657 26.53657
## [105] 34.98630 39.61490 20.59022 41.30538 38.99578 50.35201 37.05621 47.77958
## [113] 43.33413 31.90753 39.33891 42.07221 43.73603 53.01235 38.33874 39.02422
## [121] 40.78944 48.13487 47.82460 42.87733 37.13866 34.05907 31.95106 37.36704
## [129] 33.75836 33.16775 35.01480 39.24048 30.33179 46.63145 34.43842 43.80997
## [137] 43.91438 33.06041 43.80049 34.66572 38.25556 40.02876 41.96201 39.17842
## [145] 51.54758 30.17947 47.07923 38.53015 42.03357 35.17858 40.00331 28.21321
## [153] 40.27458 51.23303 31.77747 48.05283 38.42947 40.93348 29.56784 51.06131
## [161] 37.17026 51.55776 38.91904 34.39085 42.90945 41.22726 46.41242 41.08541
## 
## $fitted
## Time Series:
## Start = c(0, 1) 
## End = c(6, 24) 
## Frequency = 24 
##   [1]       NA       NA       NA       NA       NA       NA       NA       NA
##   [9]       NA       NA       NA       NA       NA       NA       NA       NA
##  [17]       NA       NA       NA       NA       NA       NA       NA       NA
##  [25] 47.58080 45.06177 35.27366 47.68656 35.98560 39.73735 50.42331 39.45068
##  [33] 36.59180 28.33674 46.52859 35.41536 41.40227 42.34403 50.91121 46.53228
##  [41] 34.85761 47.18069 48.42610 39.99935 44.41102 33.12071 47.58077 47.38481
##  [49] 38.82707 34.42901 41.07133 42.16208 34.66865 45.26408 44.70038 41.40555
##  [57] 41.30034 38.84086 35.90252 40.05034 40.34892 35.74855 35.91561 34.72807
##  [65] 38.05734 49.56629 44.90299 40.54453 42.30054 36.35438 48.42404 41.03465
##  [73] 33.70027 44.38092 43.62371 38.46691 44.21530 47.08228 40.55499 35.92058
##  [81] 38.10303 40.17602 34.26778 41.97120 41.81164 32.77803 43.39558 39.64293
##  [89] 27.03206 27.06252 31.48747 39.29561 38.52652 31.63105 25.36962 33.12386
##  [97] 38.23113 28.43512 53.42453 36.02557 43.94098 39.59530 44.92140 45.77981
## [105] 27.71829 40.32067 33.14742 42.64416 40.28602 41.36456 48.49877 36.52893
## [113] 26.39926 29.56739 38.25075 34.16781 44.82123 33.17212 49.11825 43.21395
## [121] 33.23872 36.43188 49.06736 40.23871 24.76925 29.91784 43.55657 26.53657
## [129] 34.98630 39.61490 20.59022 41.30538 38.99578 50.35201 37.05621 47.77958
## [137] 43.33413 31.90753 39.33891 42.07221 43.73603 53.01235 38.33874 39.02422
## [145] 40.78944 48.13487 47.82460 42.87733 37.13866 34.05907 31.95106 37.36704
## [153] 33.75836 33.16775 35.01480 39.24048 30.33179 46.63145 34.43842 43.80997
## [161] 43.91438 33.06041 43.80049 34.66572 38.25556 40.02876 41.96201 39.17842
## 
## $residuals
## Time Series:
## Start = c(0, 1) 
## End = c(6, 24) 
## Frequency = 24 
##   [1]          NA          NA          NA          NA          NA          NA
##   [7]          NA          NA          NA          NA          NA          NA
##  [13]          NA          NA          NA          NA          NA          NA
##  [19]          NA          NA          NA          NA          NA          NA
##  [25]  -8.7537312 -10.6327572   5.7976661  -5.5244745  -1.3169465   5.5267331
##  [31]  -5.7229346   1.9548757   4.7085441  10.5041228 -10.6260623   4.6349795
##  [37]  -1.0533459  -6.5954804 -14.9956026 -11.8042160   3.1997236   2.3855947
##  [43]  -3.5231113   0.5451860  -2.1104809   3.2336736   0.8432673  -6.3501599
##  [49]  -5.1268049   9.9519102   2.5523820  -3.6951733   9.5466509   1.8181974
##  [55]  -4.1453827  -5.4849747  -3.1973139   1.3351631  -1.6347455   1.9208570
##  [61]   1.4627106  -2.9705241   7.4799670   4.9148608 -11.0252776 -22.5037644
##  [67] -13.4155172  -1.2489281  -3.7740142  -4.7233387 -23.0544188  -7.9107963
##  [73]   4.5308575 -15.9458030   9.8008181  -2.4413405  -0.2743196  -7.4869806
##  [79]   4.3664096   9.8592304 -10.3847352   0.1446432  -1.1203572   0.6729670
##  [85]  -1.5256111   8.5865321   5.1031927  -3.1139982  -0.6327959   2.5048673
##  [91]   6.7632867  -5.1277974   6.2947117   1.5410725  23.7486277  10.0900963
##  [97]  -4.9924017   7.9967546  -4.3571742   4.2131427 -19.1717277  -9.6774550
## [103]  -1.3648347 -19.2432354   7.2680053  -0.7057697 -12.5571960  -1.3387805
## [109]  -1.2902464   8.9874478 -11.4425632  11.2506466  16.9348631   2.3401364
## [115]   1.0881541   7.9044053  -1.0852036  19.8402308 -10.7795073  -4.1897307
## [121]   7.5507162  11.7029931  -1.2427576   2.6386199  12.3694044   4.1412252
## [127] -11.6055117  10.8304731  -1.2279340  -6.4471460  14.4245766  -2.0649043
## [133]  -8.6639830  -3.7205571  -2.6177858  -3.9696022   0.5802509   1.1528859
## [139]   4.4615788  -7.4064993  -5.4804707 -12.9835897   3.6232733   0.1541990
## [145]  10.7581444 -17.9554010  -0.7453725  -4.3471815   4.8949123   1.1195126
## [151]   8.0522536  -9.1538366   6.5162218  18.0652849  -3.2373259   8.8123517
## [157]   8.0976705  -5.6979668  -4.8705858   7.2513371  -6.7441184  18.4973517
## [163]  -4.8814432  -0.2748684   4.6538894   1.1985058   4.4504110   1.9069859

##     WFP1_Data_agg2 WFP2_Data_agg2
## 1        16.059214       47.58080
## 2        15.863804       45.06177
## 3        25.650593       35.27366
## 4        23.635124       47.68656
## 5        23.714303       35.98560
## 6        23.314339       39.73735
## 7        24.105044       50.42331
## 8        22.856268       39.45068
## 9        25.131327       36.59180
## 10       18.833089       28.33674
## 11       17.173694       46.52859
## 12       18.739112       35.41536
## 13       14.472489       41.40227
## 14       19.563607       42.34403
## 15       17.349672       50.91121
## 16       15.474739       46.53228
## 17       17.832126       34.85761
## 18       20.546048       47.18069
## 19       20.356969       48.42610
## 20       17.239913       39.99935
## 21       19.356167       44.41102
## 22       19.250355       33.12071
## 23       15.949138       47.58077
## 24       20.111429       47.38481
## 25       18.402730       38.82707
## 26       24.807050       34.42901
## 27       21.092540       41.07133
## 28       20.779714       42.16208
## 29       17.662120       34.66865
## 30       19.012419       45.26408
## 31       19.634759       44.70038
## 32       20.704561       41.40555
## 33       17.892379       41.30034
## 34       27.031242       38.84086
## 35       14.168328       35.90252
## 36       19.681238       40.05034
## 37       20.270096       40.34892
## 38       22.358185       35.74855
## 39       19.322248       35.91561
## 40       20.675480       34.72807
## 41       17.657527       38.05734
## 42       22.467979       49.56629
## 43       22.522688       44.90299
## 44       19.501629       40.54453
## 45       14.330054       42.30054
## 46       18.630484       36.35438
## 47       19.296505       48.42404
## 48       26.604516       41.03465
## 49       21.478724       33.70027
## 50       24.169753       44.38092
## 51       17.681047       43.62371
## 52       17.228764       38.46691
## 53       15.557495       44.21530
## 54       13.755156       47.08228
## 55       18.378993       40.55499
## 56       17.348098       35.92058
## 57       22.767248       38.10303
## 58       22.208604       40.17602
## 59       16.768433       34.26778
## 60       27.174478       41.97120
## 61       20.868027       41.81164
## 62       16.441789       32.77803
## 63       18.509076       43.39558
## 64       14.787636       39.64293
## 65       26.339307       27.03206
## 66       19.628814       27.06252
## 67       18.593013       31.48747
## 68       24.965091       39.29561
## 69       22.618109       38.52652
## 70       12.099414       31.63105
## 71       22.612621       25.36962
## 72       21.166484       33.12386
## 73       18.389080       38.23113
## 74       23.560504       28.43512
## 75       23.996702       53.42453
## 76       18.370869       36.02557
## 77       20.618550       43.94098
## 78       25.888403       39.59530
## 79       13.276622       44.92140
## 80       19.954973       45.77981
## 81       17.896071       27.71829
## 82       10.719391       40.32067
## 83       14.956229       33.14742
## 84       14.538538       42.64416
## 85       23.706073       40.28602
## 86       20.588211       41.36456
## 87       21.670380       48.49877
## 88       28.049444       36.52893
## 89       19.306950       26.39926
## 90       25.930345       29.56739
## 91       12.977001       38.25075
## 92       25.161390       34.16781
## 93       24.392668       44.82123
## 94       18.382795       33.17212
## 95       18.111355       49.11825
## 96        6.037540       43.21395
## 97       17.341505       33.23872
## 98       26.199455       36.43188
## 99       26.775857       49.06736
## 100      15.122721       40.23871
## 101      26.450472       24.76925
## 102      19.379246       29.91784
## 103      20.309258       43.55657
## 104      14.976784       26.53657
## 105      17.948196       34.98630
## 106       8.530415       39.61490
## 107      21.318989       20.59022
## 108      25.120167       41.30538
## 109      15.530935       38.99578
## 110      14.509383       50.35201
## 111      22.184786       37.05621
## 112      22.363508       47.77958
## 113      26.027013       43.33413
## 114      18.337608       31.90753
## 115      22.467321       39.33891
## 116      18.007411       42.07221
## 117      16.440147       43.73603
## 118      23.807593       53.01235
## 119      22.097793       38.33874
## 120      19.319161       39.02422
## 121      21.477792       40.78944
## 122      24.942278       48.13487
## 123      13.196225       47.82460
## 124      14.627345       42.87733
## 125      12.728187       37.13866
## 126      24.441168       34.05907
## 127      16.295212       31.95106
## 128      21.175908       37.36704
## 129      18.380724       33.75836
## 130      15.500801       33.16775
## 131      11.375779       35.01480
## 132      23.474667       39.24048
## 133      19.452844       30.33179
## 134      23.102799       46.63145
## 135      23.957343       34.43842
## 136      19.087196       43.80997
## 137      15.035653       43.91438
## 138      25.729928       33.06041
## 139      22.436649       43.80049
## 140      17.092190       34.66572
## 141      29.020963       38.25556
## 142      21.669312       40.02876
## 143      15.521515       41.96201
## 144      21.943044       39.17842
## 145      22.363424       51.54758
## 146      22.364263       30.17947
## 147      19.923735       47.07923
## 148      17.947958       38.53015
## 149      19.235762       42.03357
## 150      20.128081       35.17858
## 151      22.947803       40.00331
## 152      21.559265       28.21321
## 153      19.063454       40.27458
## 154      18.523604       51.23303
## 155      18.260121       31.77747
## 156      16.761564       48.05283
## 157      25.653080       38.42947
## 158      17.947947       40.93348
## 159      24.204276       29.56784
## 160      22.846984       51.06131
## 161      26.802653       37.17026
## 162      21.609274       51.55776
## 163      16.696297       38.91904
## 164      18.298961       34.39085
## 165      22.100456       42.90945
## 166      19.719495       41.22726
## 167      15.010076       46.41242
## 168      16.249880       41.08541

## [1] "/ATM"        "/Power"      "/Water_Flow"

## [[1]]
## [1] 0
## 
## [[2]]
## [1] 0
## 
## [[3]]
## [1] 0

DATA 624 - Project 1

Jose Mawyin

10/20/2020

Project 1

Part A – ATM Forecast, ATM624Data.xlsx

Data Imputation

Exploratory Data Analysis

ATM1

ATM2

ATM3

ATM4

Forecasting ATM1

Accuracy of ATM1 Models

Forecasting ATM2

Accuracy of ATM2 Models

Forecasting ATM3

Forecasting ATM4

Accuracy of ATM4 Models

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Exploratory Data Analysis

Identify and Deal N.A.'s

Identify and Remove Outlier

Transformations

Forecasting of Residential Power Consumption

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Waterflow_Pipe1.xlsx

Exploratory Data Analysis

Data Aggregation at Different Time Scales

Forecast

Second Data Set

Data Aggregation at Different Time Scales

Forecast