Objective

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

library(pacman)
pacman::p_load(tidyverse, fpp2, plotly, readxl, knitr, kableExtra)

Part A - ATM Forecast

First of all, let’s read the ATM dataset and split it into four datasets, one per ATM.

Data Preparation

atm_full <- readxl::read_excel('ATM624Data.xlsx')

# Checking for missing values
colSums(is.na(atm_full))
## DATE  ATM Cash 
##    0   14   19
# Data Preparation
atm_full$DATE <- as.Date(atm_full$DATE, origin= "1899-12-30") # Converting into date 
atm_full <- atm_full[!is.na(atm_full$ATM),]
atm_full <- atm_full[!is.na(atm_full$Cash),]


# missing values after removing
colSums(is.na(atm_full))
## DATE  ATM Cash 
##    0    0    0
# Splitting the data
atm1 <- atm_full %>% filter(ATM == 'ATM1')
atm2 <- atm_full %>% filter(ATM == 'ATM2')
atm3 <- atm_full %>% filter(ATM == 'ATM3')
atm4 <- atm_full %>% filter(ATM == 'ATM4')

All rows with missing values were removed rather than imputed with the mean or median. There are only a handful of them, and since I may later aggregate the series weekly or monthly, dropping them should make little difference.
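For reference, if imputation were preferred over dropping rows, a per-ATM median fill is one alternative; the sketch below (not run here) shows how that could look with the dplyr verbs already loaded.

# Hypothetical alternative (not used): fill each ATM's missing Cash
# values with that ATM's own median instead of dropping the rows
atm_imputed <- atm_full %>%
  group_by(ATM) %>%
  mutate(Cash = ifelse(is.na(Cash), median(Cash, na.rm = TRUE), Cash)) %>%
  ungroup()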

Models - ATM 1

atm1_ts <- ts(atm1$Cash, frequency = 7)
ggseasonplot(atm1_ts)

ggtsdisplay(atm1_ts, main="ATM 1 Transactions - Weekly")

I set the frequency to 7 so the series is organized by week, which gives a better picture of the ATM’s pattern. There is some weekly seasonality, with dips mostly on Mondays and Wednesdays. The level and variance look stable and there is no trend, so no transformation is needed; note, though, that the ACF spikes at the weekly lags reflect that seasonality rather than pure white noise.
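To corroborate the visual impression rather than rely on the plots alone, the unit-root helpers in the forecast package can be consulted; a minimal sketch:

# Suggested number of ordinary and seasonal differences
# (0 supports treating the series as stationary on that dimension)
ndiffs(atm1_ts)   # KPSS-based suggestion for ordinary differencing
nsdiffs(atm1_ts)  # seasonal-strength-based suggestion at frequency 7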

Simple Exponential Smoothing

Simple exponential smoothing is mostly used where there is no seasonality, but the seasonality in our dataset is not very strong, so I will start with ses() as a baseline and see how it goes. I’m not expecting much from this method, though.

forecast_atm1_ses <- ses(atm1_ts)
autoplot(forecast_atm1_ses)+
  autolayer(fitted(forecast_atm1_ses), series='Fitted')

summary(forecast_atm1_ses)
## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = atm1_ts) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 83.7797 
## 
##   sigma:  36.7094
## 
##      AIC     AICc      BIC 
## 4745.365 4745.432 4757.040 
## 
## Error measures:
##                      ME     RMSE      MAE     MPE     MAPE     MASE       ACF1
## Training set 0.07802025 36.60783 27.31876 -175.78 199.2779 1.451603 0.06365945
## 
## Forecasts:
##          Point Forecast   Lo 80    Hi 80    Lo 95    Hi 95
## 52.71429       83.78256 36.7376 130.8275 11.83350 155.7316
## 52.85714       83.78256 36.7376 130.8275 11.83350 155.7316
## 53.00000       83.78256 36.7376 130.8275 11.83350 155.7316
## 53.14286       83.78256 36.7376 130.8275 11.83350 155.7316
## 53.28571       83.78256 36.7376 130.8275 11.83350 155.7316
## 53.42857       83.78256 36.7376 130.8275 11.83349 155.7316
## 53.57143       83.78256 36.7376 130.8275 11.83349 155.7316
## 53.71429       83.78256 36.7376 130.8275 11.83349 155.7316
## 53.85714       83.78256 36.7376 130.8275 11.83349 155.7316
## 54.00000       83.78256 36.7376 130.8275 11.83349 155.7316

ETS

forecast_atm1_ets <- ets(atm1_ts)
autoplot(forecast_atm1_ets)+
  autolayer(forecast(forecast_atm1_ets), series="Forecasted")

summary(forecast_atm1_ets)
## ETS(A,N,A) 
## 
## Call:
##  ets(y = atm1_ts) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     gamma = 0.415 
## 
##   Initial states:
##     l = 87.7888 
##     s = -54.0598 16.2223 21.048 -26.068 8.3991 35.2058
##            -0.7474
## 
##   sigma:  26.384
## 
##      AIC     AICc      BIC 
## 4513.138 4513.765 4552.055 
## 
## Training set error measures:
##                      ME     RMSE     MAE       MPE     MAPE      MASE      ACF1
## Training set -0.5473075 26.05397 16.9637 -110.6377 126.3744 0.9013793 0.1374786

Model accuracy improved slightly compared with simple exponential smoothing: RMSE, AIC and BIC all came down.

ARIMA

forecast_atm1_arima <- auto.arima(atm1_ts)
autoplot(forecast(forecast_atm1_arima))

checkresiduals(forecast_atm1_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,1)[7]
## Q* = 11.267, df = 12, p-value = 0.5062
## 
## Model df: 2.   Total lags used: 14
summary(forecast_atm1_arima)
## Series: atm1_ts 
## ARIMA(0,0,1)(0,1,1)[7] 
## 
## Coefficients:
##          ma1     sma1
##       0.1674  -0.5813
## s.e.  0.0565   0.0472
## 
## sigma^2 estimated as 666.6:  log likelihood=-1658.32
## AIC=3322.63   AICc=3322.7   BIC=3334.25
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.09700089 25.49555 16.17552 -106.2792 122.4165 0.8594986
##                     ACF1
## Training set -0.01116544

The ARIMA model gave the best result of the three: AIC and BIC dropped to about 3323 and 3334, from roughly 4513 and 4552 under ETS, and RMSE also fell with auto.arima(). No transformation was needed; note, however, that auto.arima() applied one seasonal difference, (0,1,1)[7], to handle the weekly pattern.
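To make the comparison explicit, the training-set accuracy of the three ATM1 candidates can be lined up in a single table; a small sketch using the objects fitted above:

# Collect training-set error measures for the three ATM1 models
rbind(
  SES   = accuracy(forecast_atm1_ses)[1, c("RMSE", "MAE", "MASE")],
  ETS   = accuracy(forecast_atm1_ets)[1, c("RMSE", "MAE", "MASE")],
  ARIMA = accuracy(forecast_atm1_arima)[1, c("RMSE", "MAE", "MASE")]
) %>% kable() %>% kable_styling(full_width = FALSE)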

Models - ATM 2

atm2_ts <- ts(atm2$Cash, frequency = 7)
ggseasonplot(atm2_ts)

ggtsdisplay(atm2_ts, main="ATM 2 Transactions - Weekly")

The ATM2 series also needs no transformation: its level and variance look stable over time. The ACF again shows spikes at the weekly lags, though, so the series is not pure white noise; there is a weekly seasonal pattern to account for.
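The white-noise question can also be settled formally with a Ljung-Box test on the raw series; a quick sketch (a small p-value argues against white noise):

# Portmanteau test for autocorrelation in the raw ATM2 series;
# lag = 14 covers two full weekly cycles
Box.test(atm2_ts, lag = 14, type = "Ljung-Box")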

Simple Exponential Smoothing

Again, I’ll start with simple exponential smoothing as a baseline.

forecast_atm2_ses <- ses(atm2_ts)
autoplot(forecast_atm2_ses)+
  autolayer(forecast(forecast_atm2_ses), series='Forecasted')

checkresiduals(forecast_atm2_ses)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 515.46, df = 12, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 14
summary(forecast_atm2_ses)
## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = atm2_ts) 
## 
##   Smoothing parameters:
##     alpha = 0.016 
## 
##   Initial states:
##     l = 72.14 
## 
##   sigma:  38.681
## 
##      AIC     AICc      BIC 
## 4797.445 4797.512 4809.128 
## 
## Error measures:
##                     ME     RMSE      MAE  MPE MAPE     MASE        ACF1
## Training set -2.563596 38.57427 33.09136 -Inf  Inf 1.539938 -0.02208177
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 52.85714       57.26713 7.695463 106.8388 -18.54619 133.0804
## 53.00000       57.26713 7.689132 106.8451 -18.55587 133.0901
## 53.14286       57.26713 7.682802 106.8514 -18.56555 133.0998
## 53.28571       57.26713 7.676473 106.8578 -18.57523 133.1095
## 53.42857       57.26713 7.670145 106.8641 -18.58491 133.1192
## 53.57143       57.26713 7.663817 106.8704 -18.59459 133.1288
## 53.71429       57.26713 7.657491 106.8768 -18.60426 133.1385
## 53.85714       57.26713 7.651165 106.8831 -18.61394 133.1482
## 54.00000       57.26713 7.644840 106.8894 -18.62361 133.1579
## 54.14286       57.26713 7.638515 106.8957 -18.63328 133.1675

The AIC and BIC values are 4797 and 4809 respectively, consistent with what SES produced on the ATM1 series, and RMSE is 38.6. I was not expecting much from this method anyway; let’s check these values against ETS and ARIMA.

ETS

forecast_atm2_ets <- ets(atm2_ts)
autoplot(forecast(forecast_atm2_ets))

checkresiduals(forecast_atm2_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 38.485, df = 5, p-value = 3.015e-07
## 
## Model df: 9.   Total lags used: 14
summary(forecast_atm2_ets)
## ETS(A,N,A) 
## 
## Call:
##  ets(y = atm2_ts) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     gamma = 0.4647 
## 
##   Initial states:
##     l = 76.1082 
##     s = 11.3136 -47.8461 -13.3718 -1.8486 18.6895 4.6056
##            28.4579
## 
##   sigma:  27.8902
## 
##      AIC     AICc      BIC 
## 4566.882 4567.507 4605.826 
## 
## Training set error measures:
##                      ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set -0.6349959 27.54227 19.09383 -Inf  Inf 0.8885496 -0.02291393

We already checked that the data need no transformation, so we can go straight to the fit. AIC and BIC have dropped to 4566 and 4605 respectively, a slight improvement over SES.

ARIMA Model

forecast_atm2_arima <- auto.arima(atm2_ts)
autoplot(forecast(forecast_atm2_arima))

checkresiduals(forecast_atm2_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,2)[7]
## Q* = 12.12, df = 8, p-value = 0.1459
## 
## Model df: 6.   Total lags used: 14
summary(forecast_atm2_arima)
## Series: atm2_ts 
## ARIMA(2,0,2)(0,1,2)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1    sma2
##       -0.4148  -0.8922  0.4008  0.7534  -0.7220  0.0801
## s.e.   0.0674   0.0510  0.1027  0.0688   0.0567  0.0550
## 
## sigma^2 estimated as 708.1:  log likelihood=-1671.98
## AIC=3357.96   AICc=3358.28   BIC=3385.08
## 
## Training set error measures:
##                      ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set -0.4583759 26.12942 18.09675 -Inf  Inf 0.8421495 0.009277035

No transformation is needed here either. According to the summary, AIC, BIC and RMSE all dropped substantially compared with the ETS and SES models (AIC from about 4567 to 3358), and the residual ACF is clean (Ljung-Box p = 0.15), so the ARIMA model again performs best. As with ATM1, auto.arima() used a seasonal difference, here (2,0,2)(0,1,2)[7], to capture the weekly pattern.
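Since AIC, BIC and RMSE are all in-sample measures, a holdout comparison is a useful sanity check; a hedged sketch of that workflow (not run here) using the last four weeks as a test set:

# Hypothetical holdout check: refit on all but the last 28 days and
# score the forecasts against the held-out observations
atm2_train <- subset(atm2_ts, end = length(atm2_ts) - 28)
atm2_test  <- subset(atm2_ts, start = length(atm2_ts) - 27)
fit_holdout <- auto.arima(atm2_train)
accuracy(forecast(fit_holdout, h = 28), atm2_test)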

Models - ATM 3

atm3_ts <- ts(atm3$Cash, frequency = 7)
ggseasonplot(atm3_ts)

ggtsdisplay(atm3_ts, main="ATM 3 Transactions - Weekly")

For the sake of this assignment I could go ahead and test simple exponential smoothing, ETS and ARIMA here as well, but it would be pointless: nearly all of the ATM3 values are 0, so there is not enough real data to model.

plot_ly(y = atm3$Cash, type='box', name="Cash Transactions - ATM3")
summary(atm3$Cash)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7206  0.0000 96.0000
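If a forecast is nevertheless required for ATM3, about the only defensible option is to lean on the few non-zero observations; a hedged sketch (the choice of meanf() over naive() is a judgment call):

# ATM3 is almost entirely zeros, so base a flat forecast
# on the few recent non-zero values only
atm3_recent <- tail(atm3$Cash[atm3$Cash > 0], 3)
atm3_forecast <- meanf(ts(atm3_recent), h = 5)
atm3_forecast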

Models - ATM 4

atm4_ts <- ts(atm4$Cash, frequency = 7)
ggseasonplot(atm4_ts)

ggtsdisplay(atm4_ts, main="ATM 4 Transactions - Weekly")

Simple Exponential Smoothing

Let’s build the first model on the ATM4 series using simple exponential smoothing.

forecast_atm4_ses <- ses(atm4_ts, lambda='auto')
autoplot(forecast_atm4_ses)+
  autolayer(forecast(forecast_atm4_ses), series='Forecasted')

checkresiduals(forecast_atm4_ses)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 68.994, df = 12, p-value = 4.939e-10
## 
## Model df: 2.   Total lags used: 14
summary(forecast_atm4_ses)
## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = atm4_ts, lambda = "auto") 
## 
##   Box-Cox transformation: lambda= -0.0737 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 4.4955 
## 
##   sigma:  0.9807
## 
##      AIC     AICc      BIC 
## 2143.230 2143.296 2154.930 
## 
## Error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 238.6781 692.4834 345.1213 -255.8507 328.5013 0.8587366
##                      ACF1
## Training set -0.009363965
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 53.14286       235.3574 40.47122 1780.582 17.34986 5955.898
## 53.28571       235.3574 40.47122 1780.582 17.34985 5955.898
## 53.42857       235.3574 40.47121 1780.582 17.34985 5955.898
## 53.57143       235.3574 40.47121 1780.582 17.34985 5955.898
## 53.71429       235.3574 40.47121 1780.582 17.34985 5955.898
## 53.85714       235.3574 40.47121 1780.582 17.34985 5955.898
## 54.00000       235.3574 40.47121 1780.582 17.34985 5955.899
## 54.14286       235.3574 40.47121 1780.582 17.34985 5955.899
## 54.28571       235.3574 40.47121 1780.582 17.34985 5955.899
## 54.42857       235.3574 40.47121 1780.582 17.34985 5955.899

The ATM4 data were noticeably skewed, so I used a Box-Cox transformation to stabilize them, and the result looks better than before. The ACF plot shows little autocorrelation. Before the transformation, AIC and BIC were around 6500; after it they dropped to around 2143, and RMSE also improved (bearing in mind that information criteria computed on differently transformed data are not strictly comparable). The Box-Cox transformation clearly helped, but I will select the final model only after comparing ETS and ARIMA.
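The lambda chosen automatically can also be inspected on its own, which confirms the transformation is close to a log (lambda near 0); a one-line sketch:

# lambda near zero means the automatic Box-Cox choice is
# approximately a log transformation of the ATM4 series
BoxCox.lambda(atm4_ts)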

ETS

forecast_atm4_ets <- ets(atm4_ts, lambda = 'auto')
autoplot(forecast(forecast_atm4_ets))

checkresiduals(forecast_atm4_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 18.809, df = 5, p-value = 0.002086
## 
## Model df: 9.   Total lags used: 14
summary(forecast_atm4_ets)
## ETS(A,N,A) 
## 
## Call:
##  ets(y = atm4_ts, lambda = "auto") 
## 
##   Box-Cox transformation: lambda= -0.0737 
## 
##   Smoothing parameters:
##     alpha = 0.0138 
##     gamma = 0.1424 
## 
##   Initial states:
##     l = 4.4335 
##     s = -1.3504 -0.0467 0.0246 0.3089 0.3808 0.2859
##            0.397
## 
##   sigma:  0.8982
## 
##      AIC     AICc      BIC 
## 2085.955 2086.577 2124.954 
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE    MAPE      MASE
## Training set 177.0986 664.9255 308.4432 -271.5265 333.966 0.7674736
##                      ACF1
## Training set -0.004544255

Model accuracy improved slightly: AIC and BIC came down to around 2085 and 2125, and RMSE fell from 692 to 665. As before, the data are skewed, so I applied the Box-Cox transformation here too. I can’t say yet that this is the best model, as ARIMA has given the better results so far; let’s check it next.

ARIMA

forecast_atm4_arima <- auto.arima(atm4_ts, lambda = 'auto')
autoplot(forecast(forecast_atm4_arima))

checkresiduals(forecast_atm4_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
## Q* = 17.625, df = 11, p-value = 0.09069
## 
## Model df: 3.   Total lags used: 14
summary(forecast_atm4_arima)
## Series: atm4_ts 
## ARIMA(0,0,0)(2,0,0)[7] with non-zero mean 
## Box Cox transformation: lambda= -0.07372558 
## 
## Coefficients:
##         sar1    sar2    mean
##       0.2487  0.1947  4.4864
## s.e.  0.0521  0.0525  0.0841
## 
## sigma^2 estimated as 0.8418:  log likelihood=-485.59
## AIC=979.18   AICc=979.29   BIC=994.78
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE    MAPE      MASE
## Training set 209.7822 678.9958 317.1943 -239.7643 304.041 0.7892482
##                      ACF1
## Training set -0.005100994

This model provides the best accuracy measures yet: AIC and BIC dropped from around 2085 to around 979, although RMSE (679) did not improve on ETS (665). I would still go with the ARIMA model, as AIC and BIC are the stronger criteria for model selection here.

Prediction

# ATM 1
test <- forecast(forecast_atm1_arima, h=5)
test1 <- data.frame(test)
test1
##          Point.Forecast     Lo.80     Hi.80     Lo.95     Hi.95
## 52.71429      86.805607  53.71784 119.89337  36.20224 137.40898
## 52.85714     100.640560  67.09247 134.18865  49.33319 151.94793
## 53.00000      74.714560  41.16647 108.26265  23.40719 126.02193
## 53.14286       4.762635 -28.78545  38.31072 -46.54474  56.07001
## 53.28571     100.063192  66.51510 133.61128  48.75582 151.37057
#xlsx::write.xlsx(test1, 'C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/atm1_forecast.xlsx', row.names=FALSE, col.names=TRUE)
# ATM 2
test <- forecast(forecast_atm2_arima, h=5)
test1 <- data.frame(test)
test1
##          Point.Forecast     Lo.80     Hi.80     Lo.95     Hi.95
## 52.85714      69.432448  35.33003 103.53487  17.27730 121.58759
## 53.00000      69.881366  35.77560 103.98713  17.72111 122.04163
## 53.14286      12.308924 -22.09671  46.71456 -40.30996  64.92780
## 53.28571       2.758906 -31.72395  37.24176 -49.97807  55.49588
## 53.42857      99.519624  64.89886 134.14038  46.57174 152.46751
#xlsx::write.xlsx(test1, 'C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/atm2_forecast.xlsx', row.names=FALSE, col.names=TRUE)
# ATM 4

test <- forecast(forecast_atm4_arima, h=5)
test1 <- data.frame(test)
test1
##          Point.Forecast    Lo.80     Hi.80     Lo.95    Hi.95
## 53.14286       90.23016 19.20879  517.3859  9.049304 1441.117
## 53.28571      275.70706 51.80183 1856.9650 23.087712 5748.059
## 53.42857      328.95842 60.54703 2275.9560 26.741692 7171.671
## 53.57143       72.60330 15.82002  404.1713  7.530810 1104.380
## 53.71429      303.15704 56.33390 2071.4399 24.985769 6473.389
#xlsx::write.xlsx(test1, 'C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/atm4_forecast.xlsx', row.names=FALSE, col.names=TRUE)
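Note that h = 5 only covers the first five days of May; to deliver the full month, h = 31 is needed. A sketch of producing whole-month forecasts and writing them to one Excel-readable workbook (writexl is my assumption here; any Excel writer would do):

# Forecast all 31 days of May 2010 for each modeled ATM and export
# the point forecasts with 80%/95% intervals as one multi-sheet file
may_forecasts <- list(
  ATM1 = data.frame(forecast(forecast_atm1_arima, h = 31)),
  ATM2 = data.frame(forecast(forecast_atm2_arima, h = 31)),
  ATM4 = data.frame(forecast(forecast_atm4_arima, h = 31))
)
# writexl::write_xlsx(may_forecasts, "atm_forecasts_may2010.xlsx")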

Summary of ATM dataset

The data had some missing values, which I removed rather than replacing with the mean or median. I split the data into the four required datasets: ATM1, ATM2, ATM3 and ATM4. I converted each into a ts object with weekly frequency and tested simple exponential smoothing, ETS and ARIMA. ARIMA performed best for ATM1, ATM2 and ATM4; ATM3 did not have enough non-zero data to analyze. The selection criteria were AIC, BIC and RMSE. The tables above show the first five daily forecasts (point forecast plus 80% and 95% interval bounds), and the h = 31 sketch covers the full month of May 2010 as required; the forecasts were exported to Excel.

Part B - Power


Let’s start by reading the dataset first.

power <- readxl::read_excel("C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/Power624Data.xlsx")
power %>% head(12) %>% kable() %>% kable_styling(full_width = FALSE)
CaseSequence YYYY-MMM KWH
733 1998-Jan 6862583
734 1998-Feb 5838198
735 1998-Mar 5420658
736 1998-Apr 5010364
737 1998-May 4665377
738 1998-Jun 6467147
739 1998-Jul 8914755
740 1998-Aug 8607428
741 1998-Sep 6989888
742 1998-Oct 6345620
743 1998-Nov 4640410
744 1998-Dec 4693479
Only the first year is shown here; the full series is printed below as a ts object.
summary(power)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

Data Preparation

# Checking for missing values
colSums(is.na(power))
## CaseSequence     YYYY-MMM          KWH 
##            0            0            1
# Replacing missing value with median
power$KWH[is.na(power$KWH)] = median(power$KWH, na.rm=TRUE)

# Converting the dataframe into timeseries object
power_ts <- ts(power$KWH, start=c(1998,1), frequency = 12)
power_ts
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1998  6862583  5838198  5420658  5010364  4665377  6467147  8914755  8607428
## 1999  7183759  5759262  4847656  5306592  4426794  5500901  7444416  7564391
## 2000  7068296  5876083  4807961  4873080  5050891  7092865  6862662  7517830
## 2001  7538529  6602448  5779180  4835210  4787904  6283324  7855129  8450717
## 2002  7099063  6413429  5839514  5371604  5439166  5850383  7039702  8058748
## 2003  7256079  6190517  6120626  4885643  5296096  6051571  6900676  8476499
## 2004  7584596  6560742  6526586  4831688  4878262  6421614  7307931  7309774
## 2005  8225477  6564338  5581725  5563071  4453983  5900212  8337998  7786659
## 2006  7793358  5914945  5819734  5255988  4740588  7052275  7945564  8241110
## 2007  8031295  7928337  6443170  4841979  4862847  5022647  6426220  7447146
## 2008  7964293  7597060  6085644  5352359  4608528  6548439  7643987  8037137
## 2009  8072330  6976800  5691452  5531616  5264439  5804433  7713260  8350517
## 2010  9397357  8390677  7347915  5776131  4919289  6696292   770523  7922701
## 2011  8394747  8898062  6356903  5685227  5506308  8037779 10093343 10308076
## 2012  8991267  7952204  6356961  5569828  5783598  7926956  8886851  9612423
## 2013 10655730  7681798  6517514  6105359  5940475  7920627  8415321  9080226
##           Sep      Oct      Nov      Dec
## 1998  6989888  6345620  4640410  4693479
## 1999  7899368  5358314  4436269  4419229
## 2000  8912169  5844352  5041769  6220334
## 2001  7112069  5242535  4461979  5240995
## 2002  8245227  5865014  4908979  5779958
## 2003  7791791  5344613  4913707  5756193
## 2004  6690366  5444948  4824940  5791208
## 2005  7057213  6694523  4313019  6181548
## 2006  7296355  5104799  4458429  6226214
## 2007  7666970  5785964  4907057  6047292
## 2008  6283324  5101803  4555602  6442746
## 2009  7583146  5566075  5339890  7089880
## 2010  7819472  5875917  4800733  6152583
## 2011  8943599  5603920  6154138  8273142
## 2012  7559148  5576852  5731899  6609694
## 2013  7968220  5759367  5769083  9606304

The data are monthly, from January 1998 through December 2013, and the task is to forecast every month of 2014. First I checked for missing values: there is only one, which I replaced with the median of the column, and then I converted the data frame into a ts object. Now the data are ready for the next steps.
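Because power usage is strongly seasonal, a defensible refinement would have been to impute with the median of the same calendar month rather than the overall median; a hedged sketch of that alternative (not used above; the month parsing assumes the ‘YYYY-MMM’ format shown):

# Hypothetical alternative: use the median of all September values
# for the missing 2008-Sep observation instead of the overall median
power$month <- substr(power$`YYYY-MMM`, 6, 8)
sep_median <- median(power$KWH[power$month == "Sep"], na.rm = TRUE)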

# Before Transformation
ggtsdisplay(power_ts, main="Before Transformation, Monthly Power Consumption")

# Box Cox Transformation
power_boxcox <- power_ts %>% BoxCox(lambda= 'auto')
ggtsdisplay(power_boxcox, main='BoxCox Transformation, Monthly Power Consumption')

# Seasonality
ggseasonplot(power_boxcox)

The Box-Cox transformation did not make much difference, and the data are still non-stationary, so I have to difference them before building any model. There is also clear seasonality: usage dips in April and May and rises again from June through early September. Let’s see how much differencing is required to make the data stationary.

print(paste0("Suggested number of differencing is ", ndiffs(power_boxcox)))
## [1] "Suggested number of differencing is 1"
print(paste0("Seasonal differencing: How many more differencing required ? ", ndiffs(diff(power_boxcox, lag=12))))
## [1] "Seasonal differencing: How many more differencing required ? 0"

So one difference is suggested for the transformed series, and after a seasonal difference at lag 12 no further differencing is indicated; the seasonal difference alone should be enough. Let’s take it.

power_diff <- power_boxcox %>% diff(lag = 12)
ggtsdisplay(power_diff, main= "BoxCox Transformation and seasonal differencing - Monthly power consumption")

After the seasonal difference the data look stationary in the ACF plot; it was the seasonality that made differencing necessary in the first place. We can double-check with ndiffs() again.

print(paste0("Suggested differencing : ", ndiffs(power_diff)))
## [1] "Suggested differencing : 0"

Now that the data are stationary, we can move forward, try different forecasting techniques to see which performs best, and then predict each month of 2014. Since the data are seasonal, I’ll first look at a 12-month moving-average smoother, and then at other methods.

Moving Average

autoplot(power_diff, series="Data")+
  autolayer(ma(power_diff, 12), series = "12-Month Moving Average")+ ggtitle("12-Month Moving Average of the Differenced Series")

Simple Exponential Smoothing

forecast_power_ses <- ses(power_diff, h=12)
autoplot(forecast_power_ses)+
  autolayer(fitted(forecast_power_ses), series="Fitted")

summary(forecast_power_ses)
## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = power_diff, h = 12) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 0.0714 
## 
##   sigma:  1.3548
## 
##      AIC     AICc      BIC 
## 1048.048 1048.184 1057.627 
## 
## Error measures:
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0005184573 1.347295 0.6003728 -761.7743 987.6887 0.5469261
##                   ACF1
## Training set 0.1202922
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Feb 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Mar 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Apr 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## May 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Jun 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Jul 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Aug 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Sep 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Oct 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Nov 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Dec 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815

The moving average still extracts some signal, but simple exponential smoothing is clearly not working here: its forecast is a constant line at 0.0714 (on the transformed, differenced scale). AIC is 1048, BIC is 1057, and RMSE is 1.35. Let’s fit an ARIMA model and then compare the two.

ARIMA

I’m going to try ARIMA, which worked well in Part A, and see how it does here.


forecast_power_arima <- auto.arima(power_diff)
autoplot(forecast(forecast_power_arima,12), main="Prediction for 2014 with ARIMA")

summary(forecast_power_arima)
## Series: power_diff 
## ARIMA(0,0,1)(0,0,2)[12] with non-zero mean 
## 
## Coefficients:
##          ma1     sma1    sma2    mean
##       0.1195  -0.9264  0.0890  0.0617
## s.e.  0.0712   0.0810  0.0811  0.0198
## 
## sigma^2 estimated as 0.9561:  log likelihood=-257.16
## AIC=524.31   AICc=524.66   BIC=540.28
## 
## Training set error measures:
##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -0.02555369 0.9668839 0.4542946 -1920.731 2160.928 0.4138522
##                     ACF1
## Training set 0.006469908

ARIMA provides better results: AIC dropped to 524 and BIC to 540, and RMSE fell from 1.35 to 0.97. I am comfortable with these results, so I’ll select the ARIMA model and export its predictions to Excel.


test <- forecast(forecast_power_arima, h=12)
test1 <- data.frame(test)
#xlsx::write.xlsx(test1, 'C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/POwer 12 months prediction.xlsx')
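One caveat: forecast_power_arima was fitted to the Box-Cox transformed, seasonally differenced series, so the exported numbers are on that transformed scale rather than in KWH. A hedged sketch of getting forecasts directly in KWH is to let auto.arima() work on the raw series and handle the differencing and back-transformation internally:

# Fit on the raw monthly series; auto.arima chooses the differencing and,
# with lambda set, back-transforms forecasts to the original KWH scale
fit_kwh <- auto.arima(power_ts, lambda = 'auto')
forecast_kwh <- forecast(fit_kwh, h = 12)
#xlsx::write.xlsx(data.frame(forecast_kwh), 'power_2014_forecast_kwh.xlsx')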

Summary for Power Dataset

The power dataset was given on a monthly basis from 1998 to 2013, and the task was to predict each month of 2014. The data had one missing value, which was replaced with the median of the column. The data were converted into a ts object, and I tried several techniques to see which performed best. ARIMA gave the best results based on AIC, BIC and RMSE, so I chose the ARIMA model to predict the power consumption for each month of 2014.