Libraries

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Part a. ATM Forecast

Data collection

atm <- fread("https://raw.githubusercontent.com/gpsingh12/Data624/master/Project1/atm_data.csv")

Data Exploration

table(atm$ATM)
## 
##      ATM1 ATM2 ATM3 ATM4 
##   14  365  365  365  365
## 14 null values, not assigned to any atm
summary(atm$Cash)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.5    73.0   155.6   114.0 10920.0      19
## zero's indicate no money withdrawn. There might be various reasons behind this, transaction cancelled, machine error. We will retain this in our records for prediction. Let's check which atm has most zero's. Is there a trend?
table(atm$Cash==0, atm$ATM)
##        
##             ATM1 ATM2 ATM3 ATM4
##   FALSE   0  362  361    3  365
##   TRUE    0    0    2  362    0
##ATM3 is the culprit, only three values of cash dispensing
max(atm$DATE)
## [1] "9/9/2009 12:00:00 AM"
min(atm$DATE)
## [1] "1/1/2010 12:00:00 AM"
atm %>% filter(!complete.cases(atm))
##                     DATE  ATM Cash
## 1  6/13/2009 12:00:00 AM ATM1   NA
## 2  6/16/2009 12:00:00 AM ATM1   NA
## 3  6/18/2009 12:00:00 AM ATM2   NA
## 4  6/22/2009 12:00:00 AM ATM1   NA
## 5  6/24/2009 12:00:00 AM ATM2   NA
## 6   5/1/2010 12:00:00 AM        NA
## 7   5/2/2010 12:00:00 AM        NA
## 8   5/3/2010 12:00:00 AM        NA
## 9   5/4/2010 12:00:00 AM        NA
## 10  5/5/2010 12:00:00 AM        NA
## 11  5/6/2010 12:00:00 AM        NA
## 12  5/7/2010 12:00:00 AM        NA
## 13  5/8/2010 12:00:00 AM        NA
## 14  5/9/2010 12:00:00 AM        NA
## 15 5/10/2010 12:00:00 AM        NA
## 16 5/11/2010 12:00:00 AM        NA
## 17 5/12/2010 12:00:00 AM        NA
## 18 5/13/2010 12:00:00 AM        NA
## 19 5/14/2010 12:00:00 AM        NA
# 19 records with either missing values from column ATM or Cash
aggr_plot <- aggr(atm, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable      Count
##      Cash 0.01289009
##      DATE 0.00000000
##       ATM 0.00000000

Data Preparation

atm <- atm %>% mutate(DATE=mdy_hms(DATE)) %>% filter(complete.cases(atm))
## missing values dropped and data converted to Date instead of date time
## convert dataframe into ts for all 4 atms
## As pr Hyndman "If the frequency of observations is greater than once per week, then there is usually more than one way of handling the frequency. For example, data with daily observations might have a weekly seasonality (frequency=7) or an annual seasonality (frequency=365.25)".
# also if series is not too long we might want to set frequesny to 7.
#https://robjhyndman.com/hyndsight/dailydata/ 
for (i in unique(atm$ATM)){
  temp <- atm %>%filter(ATM==i)%>% select(-ATM, -DATE)%>%ts(frequency=7)
  assign(paste0("ts_", tolower(i)),temp)
  rm(temp)
}

ATM1

boxplot(ts_atm1)

## only a few outlier at the upper end, we will include in the analysis.
autoplot(ts_atm1)

ggtsdisplay(ts_atm1)

ggseasonplot(ts_atm1, year.labels=TRUE, year.labels.left=TRUE) +
  ylab("Cash") +
  ggtitle("ATM1")

Variance seems constant throughout the time series, no transformation is required. In the seasonal, besides some different spikes, we can see the seasonal pattern. In addition, the significant lags at 7, 14, 21 in ACF and PACF plots eveals the presence of seasonality.

Build Models

Seasonality is reflected in the ts atm1. We wll use seasonal differencing and explore ACF and PACF plots to create seasonal arima model.

ts_atm1 %>% diff(lag=7) %>% ggtsdisplay()

Significant spike in PACF and ACF at lag 1 suggest AR(1) or MA(1) non-seasonal model. Seasonal - PCF Significant lag at 7,14,21 spikes at 14 and 21 are decreasing, the lag 7 spike is more significant in PACF suggest AR(1) model. (using same analogy we can try two seasonal arima models) as below:

ARIMA(1,0,0)(1,1,0) ARIMA(0,0,1)(1,1,0)

a_ar1 <- ts_atm1%>% Arima(order=c(1,0,0), seasonal=c(1,1,0)) 
a_ar2 <- ts_atm1%>% Arima(order=c(0,0,1), seasonal=c(0,1,1))
a_ar1$aic
## [1] 3354.58
a_ar2$aic
## [1] 3322.631

We can use seasonal decomposition and forecasting our ts data.

## additive decomp assumed by stl
decomp <- stl(ts_atm1[, "Cash"], t.window=13,s.window="periodic")
deseasonal <- seasadj(decomp)
autoplot(deseasonal)

We will apply naive model on the seasonal adjusted data

stl_fc <- stlf(deseasonal,method='naive', h=31) 

Model Selection

accuracy(a_ar1)
##                       ME     RMSE      MAE     MPE     MAPE      MASE
## Training set -0.06716054 26.73159 17.05704 -100.19 119.2043 0.9063389
##                   ACF1
## Training set 0.0106164
accuracy(a_ar2)
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.09700089 25.49555 16.17552 -106.2792 122.4165 0.8594986
##                     ACF1
## Training set -0.01116544
accuracy(stl_fc)
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.02165497 26.71567 17.82508 -1038.474 1214.219 0.9471491
##                    ACF1
## Training set -0.3853073
checkresiduals(a_ar1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(1,1,0)[7]
## Q* = 38.845, df = 12, p-value = 0.0001116
## 
## Model df: 2.   Total lags used: 14
checkresiduals(a_ar2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,1)[7]
## Q* = 11.267, df = 12, p-value = 0.5062
## 
## Model df: 2.   Total lags used: 14
checkresiduals(stl_fc)
## Warning in checkresiduals(stl_fc): The fitted degrees of freedom is based
## on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  Random walk
## Q* = 86.331, df = 14, p-value = 1.863e-12
## 
## Model df: 0.   Total lags used: 14

Model 2 ARIMA(0,0,1)(1,1,0) have better AIC and RMSE values, residuals seems to be normally distributed also.

unloadNamespace("aTSA")
atm1_fc <- forecast(a_ar1, 31)
head(atm1_fc)
## $method
## [1] "ARIMA(1,0,0)(1,1,0)[7]"
## 
## $model
## Series: . 
## ARIMA(1,0,0)(1,1,0)[7] 
## 
## Coefficients:
##          ar1     sar1
##       0.1360  -0.4018
## s.e.  0.0525   0.0480
## 
## sigma^2 estimated as 732.8:  log likelihood=-1674.29
## AIC=3354.58   AICc=3354.65   BIC=3366.2
## 
## $level
## [1] 80 95
## 
## $mean
## Time Series:
## Start = c(52, 6) 
## End = c(57, 1) 
## Frequency = 7 
##  [1]  84.871352  97.787871  79.629679   3.598927  97.607131  76.777142
##  [7]  85.000002  84.923038 102.292448  77.367903   3.760062  96.961451
## [13]  78.875474  85.000001  84.902273 100.482692  78.276592   3.695324
## [19]  97.220859  78.032449  85.000001  84.910615 101.209779  77.911518
## [25]   3.721333  97.116639  78.371142  85.000001  84.907263 100.917665
## [31]  78.058190
## 
## $lower
## Time Series:
## Start = c(52, 6) 
## End = c(57, 1) 
## Frequency = 7 
##                80%        95%
## 52.71429  50.17947  31.814696
## 52.85714  62.77683  44.243097
## 53.00000  44.61276  26.075924
## 53.14286 -31.41810 -49.954993
## 53.28571  62.59010  44.053207
## 53.42857  41.76011  23.223218
## 53.57143  49.98297  31.446078
## 53.71429  44.21768  22.669568
## 53.85714  61.48941  39.889587
## 54.00000  36.56306  14.962283
## 54.14286 -37.04481 -58.645609
## 54.28571  56.15658  34.555780
## 54.42857  38.07060  16.469802
## 54.57143  44.19513  22.594330
## 54.71429  36.32699  10.612791
## 54.85714  51.77545  25.991388
## 55.00000  29.56691   3.781563
## 55.14286 -45.01440 -70.799774
## 55.28571  48.51113  22.725759
## 55.42857  29.32272   3.537349
## 55.57143  36.29027  10.504901
## 55.71429  30.56319   1.793400
## 55.85714  46.76364  17.941591
## 56.00000  23.46355  -5.359458
## 56.14286 -50.72666 -79.549694
## 56.28571  42.66864  13.845611
## 56.42857  23.92314  -4.899886
## 56.57143  30.55200   1.728973
## 56.71429  24.99083  -6.727013
## 56.85714  40.90484   9.135971
## 57.00000  18.04359 -13.726227
## 
## $upper
## Time Series:
## Start = c(52, 6) 
## End = c(57, 1) 
## Frequency = 7 
##                80%       95%
## 52.71429 119.56323 137.92801
## 52.85714 132.79892 151.33264
## 53.00000 114.64660 133.18343
## 53.14286  38.61595  57.15285
## 53.28571 132.62416 151.16105
## 53.42857 111.79417 130.33107
## 53.57143 120.01703 138.55393
## 53.71429 125.62839 147.17651
## 53.85714 143.09549 164.69531
## 54.00000 118.17274 139.77352
## 54.14286  44.56494  66.16573
## 54.28571 137.76633 159.36712
## 54.42857 119.68035 141.28115
## 54.57143 125.80488 147.40567
## 54.71429 133.47755 159.19175
## 54.85714 149.18994 174.97400
## 55.00000 126.98627 152.77162
## 55.14286  52.40505  78.19042
## 55.28571 145.93059 171.71596
## 55.42857 126.74218 152.52755
## 55.57143 133.70973 159.49510
## 55.71429 139.25804 168.02783
## 55.85714 155.65592 184.47797
## 56.00000 132.35948 161.18249
## 56.14286  58.16933  86.99236
## 56.28571 151.56464 180.38767
## 56.42857 132.81914 161.64217
## 56.57143 139.44800 168.27103
## 56.71429 144.82370 176.54154
## 56.85714 160.93049 192.69936
## 57.00000 138.07279 169.84261
#write.csv(atm1_fc, "ATM1_forecast.csv")

ATM2

We will perform similar checks on time series for ATM2

boxplot(ts_atm2)

autoplot(ts_atm2)

ggtsdisplay(ts_atm2)

ggseasonplot(ts_atm2, year.labels=TRUE, year.labels.left=TRUE) +
  ylab("Cash") +
  ggtitle("ATM2")

No outliers present. Variance is constant throughout the series, no transformation required. Significant lags at 7,14 and 21 in PACF reflects seasonality. ACF plot reveals non-stationary data.

ndiffs(ts_atm2)
## [1] 1
nsdiffs(ts_atm2)
## [1] 1
## We will apply the seasonal differencing first to see if the ordinary differencing is still required
ts_atm2%>%diff(lag=7)%>%ggtsdisplay()

ts_atm2%>%diff(lag=7)%>%ndiffs()
## [1] 0
##ordinary differencing is not required
##We will select our seasonal arima model from here

Build Model

We see significant spikes at lag 5 in ACF and PACF plots. For non-seasonal part we can select AR(5) or MA(5). For seasonal we select AR(1) due to significant spike at 7 and decreasing significant spikes at 14, 21.

AR(5,0,0)(1,1,0)

AR(0,0,5)(0,1,1)

b_ar1 <- ts_atm2%>% Arima(order=c(5,0,0), seasonal=c(1,1,0)) 
b_ar2 <- ts_atm2%>% Arima(order=c(0,0,5), seasonal=c(0,1,1))
b_ar1$aic
## [1] 3378.869
b_ar2$aic
## [1] 3353.789

Seasonal Decomposition and forecasting with seasonal decomposition

## additive decomp assumed by stl
decomp2 <- stl(ts_atm2[, "Cash"], t.window=13,s.window="periodic")
deseasonal2 <- seasadj(decomp2)
autoplot(decomp2)

autoplot(deseasonal2)

We will apply naive model on the seasonal adjusted data

stl_fc2 <- stlf(deseasonal2,method='naive', h=31) 

Model Selection

accuracy(b_ar1)
##                      ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set -0.1352166 26.96241 19.07778 -Inf  Inf 0.8878028 0.004138287
accuracy(b_ar2)
##                      ME     RMSE      MAE  MPE MAPE      MASE         ACF1
## Training set -0.6299813 25.97515 18.59492 -Inf  Inf 0.8653322 0.0004582112
accuracy(stl_fc2)
##                      ME     RMSE     MAE      MPE     MAPE      MASE
## Training set 0.03366952 29.93007 21.3727 1583.463 1756.777 0.9945987
##                    ACF1
## Training set -0.4833479
checkresiduals(b_ar1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(5,0,0)(1,1,0)[7]
## Q* = 25.775, df = 8, p-value = 0.001147
## 
## Model df: 6.   Total lags used: 14
checkresiduals(b_ar2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,5)(0,1,1)[7]
## Q* = 6.4129, df = 8, p-value = 0.6011
## 
## Model df: 6.   Total lags used: 14
checkresiduals(stl_fc2)
## Warning in checkresiduals(stl_fc2): The fitted degrees of freedom is based
## on the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  Random walk
## Q* = 125.09, df = 14, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 14

Based on accuracy, aic and residuals, box-ljung test, model2 AR(0,0,5)(0,1,1) is the best model and we will forecast the time series based on this model.

Forecast

atm2_fc <- forecast(b_ar2, 31)
write.csv(atm2_fc, "ATM2_forecast.csv")

ATM3

Due to degeneracy, it might not be a good approach to create a forecasting model on ATM3, however we have the possibility of mean or median imputation. We can use the average method on the atm3 time series.

atm3_fc<-meanf(ts_atm3, 31)
checkresiduals(atm3_fc)

## 
##  Ljung-Box test
## 
## data:  Residuals from Mean
## Q* = 196.67, df = 13, p-value < 2.2e-16
## 
## Model df: 1.   Total lags used: 14

The model is not a better approach for forecasting. We can assume the forecast as the mean of three values for ATM3.

ATM4

boxplot(ts_atm4)

ts_atm4 <- ts_atm4[-285]
ts_atm4 <- ts(ts_atm4)
autoplot(ts_atm4)

ggtsdisplay(ts_atm4)

Stationary time series, no ordinary differencing required. Variance is constant, no transformation required. No seasonality present. Significant spike at lag 8. AR(8) or MA(8) non-seasonal model can be used. we will use auto arima to select the model

auto.arima(ts_atm4)
## Series: ts_atm4 
## ARIMA(0,0,1) with non-zero mean 
## 
## Coefficients:
##          ma1      mean
##       0.0746  445.3755
## s.e.  0.0522   19.7044
## 
## sigma^2 estimated as 123119:  log likelihood=-2648.7
## AIC=5303.4   AICc=5303.46   BIC=5315.09
accuracy(auto.arima(ts_atm4))
##                       ME    RMSE      MAE       MPE     MAPE      MASE
## Training set -0.04498806 349.918 291.6344 -543.2407 576.6388 0.7530519
##                      ACF1
## Training set 0.0003956452

Since there was no seasonality present and there seems to be no specific trend, we can use ses model

fc_ses <- ses(ts_atm4, h=31)
accuracy(fc_ses)
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.1742947 350.9191 292.5632 -550.0211 583.4599 0.7554501
##                   ACF1
## Training set 0.0751692
checkresiduals(fc_ses)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 24.468, df = 8, p-value = 0.001912
## 
## Model df: 2.   Total lags used: 10
checkresiduals(auto.arima(ts_atm4))

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1) with non-zero mean
## Q* = 21.413, df = 8, p-value = 0.006129
## 
## Model df: 2.   Total lags used: 10

Model one ARIMA(8,0,0) is our best model.

Forecast

atm4_fc <- forecast(fc_ses, 31)
head(fc_ses)
## $model
## Simple exponential smoothing 
## 
## Call:
##  ses(y = ts_atm4, h = 31) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 445.3416 
## 
##   sigma:  351.8871
## 
##      AIC     AICc      BIC 
## 6419.048 6419.115 6430.740 
## 
## $mean
## Time Series:
## Start = 365 
## End = 395 
## Frequency = 1 
##  [1] 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352
##  [8] 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352
## [15] 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352
## [22] 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352 445.3352
## [29] 445.3352 445.3352 445.3352
## 
## $level
## [1] 80 95
## 
## $x
## Time Series:
## Start = 1 
## End = 364 
## Frequency = 1 
##   [1]  777  524  793  908   53   52   55  559  904  879  274  396  275   16
##  [15]  852  380   31  492   84  129   14  815  758  601  907  503   88   35
##  [29]  338    5  123  150  721  443   17   15  741 1058  576 1484  194   27
##  [43] 1191  746 1221 1022  373  321   92  117  202  524   81   64   91   48
##  [57] 1026  424   61  540  174  393   42  310  110  682   55  214  738   16
##  [71]   16 1050  438  547  858  447   95  644  569  705  572  480  419   28
##  [85]  835  911  468  768 1089  267    7  704  495  143  429  895  610   71
##  [99]  594  342  735  463 1156  454  283  572  772  260  358   16  334   26
## [113]  255  357 1246  917  592  412   83  996  104 1117  817  914  648  141
## [127] 1495 1301  780  744  200  854   51    7 1061  715   35  492  343   21
## [141]  506   97  474  900 1712  281   26  329  761  629  236 1195  782  108
## [155]  847  576  442  319  154  543  124  449  615  219  946   10  696    8
## [169]  845  213    9   47   30  400   61  428    2    9   50  313  627    6
## [183]   78  338  212  690  596   65   77   44  964  835  637  927   76   44
## [197]  621  313  826  414  194  346   32  655  638   15  300  627  601   64
## [211]  563  317 1167   47  994  687   71 1047 1009  289  592  231  578   70
## [225]  581  149  404  125  230   19  129  328  532  877  662  301  668   15
## [239]  660  511  164  748  174   20  101  986  597  468  857  685  382  152
## [253]  272  135 1105  292 1141  141    9   85   67  710  568  487   17   49
## [267]  357  180  729  261  629  277   41 1575  670  980  426  153  275  136
## [281]  454  458  112  418   42  280  412  853  180  226  989  825  967  734
## [295]  121  288  503   10  258 1170  193  403 1276  820   26  894  361  860
## [309]  381   10  601   32  553  572  219  828  631  339   32  487  335  340
## [323]  291   46  201   10  878  778  708  351  711  503   23  493  405  818
## [337]  152  281  470    3  415  719  812  890  616   61   27  768  326  825
## [351]  384  195  711   30  557  386  165    5  542  404   14  348   44  482
## 
## $upper
## Time Series:
## Start = 365 
## End = 395 
## Frequency = 1 
##          80%      95%
## 365 896.2967 1135.021
## 366 896.2967 1135.021
## 367 896.2967 1135.021
## 368 896.2967 1135.021
## 369 896.2967 1135.021
## 370 896.2967 1135.021
## 371 896.2967 1135.021
## 372 896.2967 1135.021
## 373 896.2967 1135.021
## 374 896.2967 1135.021
## 375 896.2967 1135.021
## 376 896.2967 1135.021
## 377 896.2967 1135.021
## 378 896.2967 1135.021
## 379 896.2967 1135.021
## 380 896.2967 1135.021
## 381 896.2967 1135.021
## 382 896.2967 1135.021
## 383 896.2967 1135.021
## 384 896.2967 1135.021
## 385 896.2967 1135.021
## 386 896.2967 1135.021
## 387 896.2967 1135.021
## 388 896.2967 1135.021
## 389 896.2968 1135.021
## 390 896.2968 1135.021
## 391 896.2968 1135.021
## 392 896.2968 1135.021
## 393 896.2968 1135.021
## 394 896.2968 1135.021
## 395 896.2968 1135.021
## 
## $lower
## Time Series:
## Start = 365 
## End = 395 
## Frequency = 1 
##           80%       95%
## 365 -5.626271 -244.3509
## 366 -5.626273 -244.3509
## 367 -5.626275 -244.3509
## 368 -5.626277 -244.3509
## 369 -5.626280 -244.3509
## 370 -5.626282 -244.3509
## 371 -5.626284 -244.3509
## 372 -5.626286 -244.3509
## 373 -5.626289 -244.3509
## 374 -5.626291 -244.3509
## 375 -5.626293 -244.3509
## 376 -5.626295 -244.3509
## 377 -5.626298 -244.3509
## 378 -5.626300 -244.3509
## 379 -5.626302 -244.3509
## 380 -5.626304 -244.3509
## 381 -5.626307 -244.3509
## 382 -5.626309 -244.3509
## 383 -5.626311 -244.3509
## 384 -5.626313 -244.3509
## 385 -5.626316 -244.3509
## 386 -5.626318 -244.3509
## 387 -5.626320 -244.3509
## 388 -5.626322 -244.3509
## 389 -5.626325 -244.3509
## 390 -5.626327 -244.3509
## 391 -5.626329 -244.3509
## 392 -5.626332 -244.3510
## 393 -5.626334 -244.3510
## 394 -5.626336 -244.3510
## 395 -5.626338 -244.3510
write.csv(atm4_fc, "ATM4_forecast.csv")

Part B

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Data collection

pwr <- fread("https://raw.githubusercontent.com/gpsingh12/Data624/master/Project1/power_data.csv")

Data Exploration

dim(pwr)
## [1] 192   3
str(pwr)
## Classes 'data.table' and 'data.frame':   192 obs. of  3 variables:
##  $ CaseSequence: int  733 734 735 736 737 738 739 740 741 742 ...
##  $ YYYY-MMM    : chr  "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
##  $ KWH         : int  6862583 5838198 5420658 5010364 4665377 6467147 8914755 8607428 6989888 6345620 ...
##  - attr(*, ".internal.selfref")=<externalptr>
min(pwr$`YYYY-MMM`)
## [1] "1998-Apr"
summary(pwr)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
pwr%>%filter(!complete.cases(pwr))
##   CaseSequence YYYY-MMM KWH
## 1          861 2008-Sep  NA
pwr <- pwr %>%filter(complete.cases(pwr))

Data Manipulation

pwr_ts <- ts(pwr %>% select(-`YYYY-MMM`, -CaseSequence), start = c(1998, 1), frequency = 12)
boxplot(pwr_ts)

autoplot(pwr_ts)

ggseasonplot(pwr_ts, year.labels=TRUE, year.labels.left=TRUE) +
  ylab("KwH") +
  ggtitle("Residential Power")

We can detect seasonality from the plot. In addition the variance is increasing with increase in time. Transformation is good idea to stabilise the variance of the time series. Since data are seasonal, we will first take a seasonal difference

lambda1=BoxCox.lambda(pwr_ts)
lambda1
## [1] 0.06361748
pwr_bc <- BoxCox(pwr_ts, lambda=lambda1)
autoplot(pwr_bc)

pwr_bc %>% diff(lag=12) %>% ggtsdisplay()

#autoplot(pwr_bc_sd)

The data is stationary, evident from the time plot. No differencing required.

The significant spike at lag 1 in the ACF and PACF suggests a non-seasonal MA(1) or AR(1) component, while the significant spike at lag 1 in the ACF suggests a seasonal MA(1) component or lag 4 in ACF can suggest seasonal MA(4). Another case scenario, if we look at PACF and emphasize lag 12 and 24 (neglecting lag 4), AR(2) is another option. We can test this model in case our selected model is not a good approach.

ARIMA(0,0,1)(0,1,1) ARIMA(1,0,0)(1,1,0)

ar1 <- pwr_bc%>% Arima(order=c(0,0,1), seasonal=c(0,1,1))  
ar2 <- pwr_bc%>% Arima(order=c(1,0,0), seasonal=c(1,1,0)) 
ar3 <- pwr_bc%>% Arima(order=c(1,0,0), seasonal=c(2,1,0)) 
ar1$aic
## [1] 327.7296
ar2$aic
## [1] 354.5315
ar3$aic
## [1] 331.7403
accuracy(ar1)
##                      ME      RMSE     MAE       MPE    MAPE      MASE
## Training set 0.07237788 0.5618735 0.28481 0.2287857 1.07925 0.8390499
##                   ACF1
## Training set -0.010759
accuracy(ar2)
##                      ME     RMSE       MAE       MPE     MAPE     MASE
## Training set 0.04776387 0.614326 0.2867503 0.1372028 1.087465 0.844766
##                      ACF1
## Training set -0.006151287
accuracy(ar3)
##                      ME      RMSE       MAE       MPE     MAPE      MASE
## Training set 0.06559122 0.5679673 0.2774767 0.2047543 1.051591 0.8174461
##                     ACF1
## Training set -0.01227044
checkresiduals(ar1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,1)[12]
## Q* = 8.7718, df = 22, p-value = 0.9944
## 
## Model df: 2.   Total lags used: 24
checkresiduals(ar2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(1,1,0)[12]
## Q* = 32.896, df = 22, p-value = 0.06335
## 
## Model df: 2.   Total lags used: 24
checkresiduals(ar3)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(2,1,0)[12]
## Q* = 9.7066, df = 21, p-value = 0.9825
## 
## Model df: 3.   Total lags used: 24

Form Box-ljung test, we fail to reject null hypothesis. The residuals are uncorrelated and test validates our model.

Residuals have consant variance and are normally distributed.

pwr_fc <- forecast(ar1,12)
head(pwr_fc)
## $method
## [1] "ARIMA(0,0,1)(0,1,1)[12]"
## 
## $model
## Series: . 
## ARIMA(0,0,1)(0,1,1)[12] 
## 
## Coefficients:
##          ma1     sma1
##       0.1337  -0.7130
## s.e.  0.0735   0.0568
## 
## sigma^2 estimated as 0.3407:  log likelihood=-160.86
## AIC=327.73   AICc=327.87   BIC=337.29
## 
## $level
## [1] 80 95
## 
## $mean
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2013                                                               
## 2014 27.47164 26.94520 26.63214 26.48856 27.10573 26.92037 27.76407
##           Aug      Sep      Oct      Nov      Dec
## 2013                                     27.84462
## 2014 27.47573 26.64076 26.51737 27.22215         
## 
## $lower
##               80%      95%
## Dec 2013 27.09660 26.70063
## Jan 2014 26.71698 26.31748
## Feb 2014 26.19054 25.79104
## Mar 2014 25.87748 25.47799
## Apr 2014 25.73390 25.33440
## May 2014 26.35107 25.95158
## Jun 2014 26.16570 25.76621
## Jul 2014 27.00940 26.60991
## Aug 2014 26.72107 26.32157
## Sep 2014 25.88610 25.48660
## Oct 2014 25.76271 25.36321
## Nov 2014 26.46748 26.06799
## 
## $upper
##               80%      95%
## Dec 2013 28.59263 28.98860
## Jan 2014 28.22630 28.62579
## Feb 2014 27.69986 28.09936
## Mar 2014 27.38681 27.78630
## Apr 2014 27.24322 27.64272
## May 2014 27.86039 28.25989
## Jun 2014 27.67503 28.07452
## Jul 2014 28.51873 28.91822
## Aug 2014 28.23039 28.62988
## Sep 2014 27.39542 27.79492
## Oct 2014 27.27203 27.67153
## Nov 2014 27.97681 28.37630
write.csv(pwr_fc, "forecast_power.csv")

Part C

Water Flow Forecast

The data in waterflow 1 has data at different time stamps within the hour, however in waterflow 2 has hourly timestamps. We will bind two data frames together and update the column value of the water flow within an hour to mean of water flow within that hour.

w1 <- fread("https://raw.githubusercontent.com/gpsingh12/Data624/master/Project1/waterflow_pipe1.csv")
w2 <- fread("https://raw.githubusercontent.com/gpsingh12/Data624/master/Project1/waterflow_pipe2.csv")




wp1 <- w1 %>% setNames(gsub("[ ]", "_", tolower(colnames(w1)))) %>% mutate(date_time=mdy_hm(date_time))

wp2 <- w2 %>% setNames(gsub("[ ]", "_", tolower(colnames(w2)))) %>% mutate(date_time=mdy_hm(date_time)) 

wp_merge <- rbind(wp1,wp2) %>% mutate(hr = hour(date_time), dt= date(date_time), key=ymd_h(paste0(dt,hr))) %>% group_by(key) %>% 
                              mutate(wf=mean(waterflow)) %>% select(dateime=key, wf) %>% distinct(., .keep_all=T)

There were no outliers present indicating that waterflow was consistent. We applied seasonal differencing and ordinary differencing and reviewed the ACF and PACF plots. The results were not promising, the waterflow was going to be negative. We will let auto.arima to decide the model.

ts_wp<-ts(wp_merge$wf, frequency = 24)
boxplot(ts_wp)

autoplot(ts_wp)

ggseasonplot(ts_wp, year.labels=TRUE, year.labels.left=TRUE) +
  ylab("waterflow") +
  ggtitle("Waterpipeline")

ggtsdisplay(ts_wp)

ts_wp %>% diff(lag=24) %>% ggtsdisplay()

a1 <-auto.arima(ts_wp)
accuracy(a1)
##                     ME    RMSE      MAE       MPE    MAPE      MASE
## Training set 0.4974017 14.4913 11.08593 -26.26872 47.7588 0.7333519
##                    ACF1
## Training set -0.0160318
checkresiduals(a1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(0,0,1)[24]
## Q* = 66.378, df = 46, p-value = 0.02616
## 
## Model df: 2.   Total lags used: 48

Although we were expecting weak forecasting models for the water flow problem, we were not able to build the forecasting model as our waterflow was going negative in an order to make the series stationary.

Refrences:

https://robjhyndman.com/hyndsight/dailydata/ https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/ https://www.datascience.com/blog/introduction-to-forecasting-with-arima-in-r-learn-data-science-tutorials https://github.com/cran/aTSA/blob/master/R/forecast.R https://towardsdatascience.com/trend-seasonality-moving-average-auto-regressive-model-my-journey-to-time-series-data-with-edc4c0c8284b