Objective

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

library(pacman)
pacman::p_load(tidyverse, fpp2, plotly, readxl, knitr, kableExtra)

Part A - ATM Forecast

First of all, let’s read the ATM dataset and split it into four datasets, one per ATM.

Data Preparation

atm_full <- readxl::read_excel('ATM624Data.xlsx')

# Checking for missing values
colSums(is.na(atm_full))
## DATE  ATM Cash 
##    0   14   19
# Data Preparation
atm_full$DATE <- as.Date(atm_full$DATE, origin= "1899-12-30") # Converting into date 
atm_full <- atm_full[!is.na(atm_full$ATM),]
atm_full <- atm_full[!is.na(atm_full$Cash),]


# missing values after removing
colSums(is.na(atm_full))
## DATE  ATM Cash 
##    0    0    0
# Splitting the data
atm1 <- atm_full %>% filter(ATM == 'ATM1')
atm2 <- atm_full %>% filter(ATM == 'ATM2')
atm3 <- atm_full %>% filter(ATM == 'ATM3')
atm4 <- atm_full %>% filter(ATM == 'ATM4')

All rows with missing values were removed rather than imputed with the mean or median. There are only a handful of them, and since I may later aggregate the series weekly or monthly, dropping them should make little difference.
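For reference, if imputation were preferred over dropping rows, a per-ATM median fill is one alternative; the sketch below (not run here) shows how that could look with the dplyr verbs already loaded.

# Hypothetical alternative (not used): fill each ATM's missing Cash
# values with that ATM's own median instead of dropping the rows
atm_imputed <- atm_full %>%
  group_by(ATM) %>%
  mutate(Cash = ifelse(is.na(Cash), median(Cash, na.rm = TRUE), Cash)) %>%
  ungroup()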

Models - ATM 1

atm1_ts <- ts(atm1$Cash, frequency = 7)
ggseasonplot(atm1_ts)

ggtsdisplay(atm1_ts, main="ATM 1 Transactions - Weekly")

I set the frequency to 7 so the series is organized by week, which gives a better picture of the ATM’s pattern. There is some weekly seasonality, with dips mostly on Mondays and Wednesdays. The level and variance look stable and there is no trend, so no transformation is needed; note, though, that the ACF spikes at the weekly lags reflect that seasonality rather than pure white noise.
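To corroborate the visual impression rather than rely on the plots alone, the unit-root helpers in the forecast package can be consulted; a minimal sketch:

# Suggested number of ordinary and seasonal differences
# (0 supports treating the series as stationary on that dimension)
ndiffs(atm1_ts)   # KPSS-based suggestion for ordinary differencing
nsdiffs(atm1_ts)  # seasonal-strength-based suggestion at frequency 7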

Simple Exponential Smoothing

Simple exponential smoothing is mostly used where there is no seasonality, but the seasonality in our dataset is not very strong, so I will start with ses() as a baseline and see how it goes. I’m not expecting much from this method, though.

forecast_atm1_ses <- ses(atm1_ts)
autoplot(forecast_atm1_ses)+
  autolayer(fitted(forecast_atm1_ses), series='Fitted')

summary(forecast_atm1_ses)
## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = atm1_ts) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 83.7797 
## 
##   sigma:  36.7094
## 
##      AIC     AICc      BIC 
## 4745.365 4745.432 4757.040 
## 
## Error measures:
##                      ME     RMSE      MAE     MPE     MAPE     MASE       ACF1
## Training set 0.07802025 36.60783 27.31876 -175.78 199.2779 1.451603 0.06365945
## 
## Forecasts:
##          Point Forecast   Lo 80    Hi 80    Lo 95    Hi 95
## 52.71429       83.78256 36.7376 130.8275 11.83350 155.7316
## 52.85714       83.78256 36.7376 130.8275 11.83350 155.7316
## 53.00000       83.78256 36.7376 130.8275 11.83350 155.7316
## 53.14286       83.78256 36.7376 130.8275 11.83350 155.7316
## 53.28571       83.78256 36.7376 130.8275 11.83350 155.7316
## 53.42857       83.78256 36.7376 130.8275 11.83349 155.7316
## 53.57143       83.78256 36.7376 130.8275 11.83349 155.7316
## 53.71429       83.78256 36.7376 130.8275 11.83349 155.7316
## 53.85714       83.78256 36.7376 130.8275 11.83349 155.7316
## 54.00000       83.78256 36.7376 130.8275 11.83349 155.7316

ETS

forecast_atm1_ets <- ets(atm1_ts)
autoplot(forecast_atm1_ets)+
  autolayer(forecast(forecast_atm1_ets), series="Forecasted")

summary(forecast_atm1_ets)
## ETS(A,N,A) 
## 
## Call:
##  ets(y = atm1_ts) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     gamma = 0.415 
## 
##   Initial states:
##     l = 87.7888 
##     s = -54.0598 16.2223 21.048 -26.068 8.3991 35.2058
##            -0.7474
## 
##   sigma:  26.384
## 
##      AIC     AICc      BIC 
## 4513.138 4513.765 4552.055 
## 
## Training set error measures:
##                      ME     RMSE     MAE       MPE     MAPE      MASE      ACF1
## Training set -0.5473075 26.05397 16.9637 -110.6377 126.3744 0.9013793 0.1374786

Model accuracy improved slightly compared with simple exponential smoothing: RMSE, AIC and BIC all came down.

ARIMA

forecast_atm1_arima <- auto.arima(atm1_ts)
autoplot(forecast(forecast_atm1_arima))

checkresiduals(forecast_atm1_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,1)[7]
## Q* = 11.267, df = 12, p-value = 0.5062
## 
## Model df: 2.   Total lags used: 14
summary(forecast_atm1_arima)
## Series: atm1_ts 
## ARIMA(0,0,1)(0,1,1)[7] 
## 
## Coefficients:
##          ma1     sma1
##       0.1674  -0.5813
## s.e.  0.0565   0.0472
## 
## sigma^2 estimated as 666.6:  log likelihood=-1658.32
## AIC=3322.63   AICc=3322.7   BIC=3334.25
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.09700089 25.49555 16.17552 -106.2792 122.4165 0.8594986
##                     ACF1
## Training set -0.01116544

The ARIMA model gave the best result of the three: AIC and BIC dropped to about 3323 and 3334, from roughly 4513 and 4552 under ETS, and RMSE also fell with auto.arima(). No transformation was needed; note, however, that auto.arima() applied one seasonal difference, (0,1,1)[7], to handle the weekly pattern.
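To make the comparison explicit, the training-set accuracy of the three ATM1 candidates can be lined up in a single table; a small sketch using the objects fitted above:

# Collect training-set error measures for the three ATM1 models
rbind(
  SES   = accuracy(forecast_atm1_ses)[1, c("RMSE", "MAE", "MASE")],
  ETS   = accuracy(forecast_atm1_ets)[1, c("RMSE", "MAE", "MASE")],
  ARIMA = accuracy(forecast_atm1_arima)[1, c("RMSE", "MAE", "MASE")]
) %>% kable() %>% kable_styling(full_width = FALSE)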

Models - ATM 2

atm2_ts <- ts(atm2$Cash, frequency = 7)
ggseasonplot(atm2_ts)

ggtsdisplay(atm2_ts, main="ATM 2 Transactions - Weekly")

The ATM2 series also needs no transformation: its level and variance look stable over time. The ACF again shows spikes at the weekly lags, though, so the series is not pure white noise; there is a weekly seasonal pattern to account for.
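The white-noise question can also be settled formally with a Ljung-Box test on the raw series; a quick sketch (a small p-value argues against white noise):

# Portmanteau test for autocorrelation in the raw ATM2 series;
# lag = 14 covers two full weekly cycles
Box.test(atm2_ts, lag = 14, type = "Ljung-Box")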

Simple Exponential Smoothing

Again, I’ll start with simple exponential smoothing as a baseline.

forecast_atm2_ses <- ses(atm2_ts)
autoplot(forecast_atm2_ses)+
  autolayer(forecast(forecast_atm2_ses), series='Forecasted')

checkresiduals(forecast_atm2_ses)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 515.46, df = 12, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 14
summary(forecast_atm2_ses)
## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = atm2_ts) 
## 
##   Smoothing parameters:
##     alpha = 0.016 
## 
##   Initial states:
##     l = 72.14 
## 
##   sigma:  38.681
## 
##      AIC     AICc      BIC 
## 4797.445 4797.512 4809.128 
## 
## Error measures:
##                     ME     RMSE      MAE  MPE MAPE     MASE        ACF1
## Training set -2.563596 38.57427 33.09136 -Inf  Inf 1.539938 -0.02208177
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 52.85714       57.26713 7.695463 106.8388 -18.54619 133.0804
## 53.00000       57.26713 7.689132 106.8451 -18.55587 133.0901
## 53.14286       57.26713 7.682802 106.8514 -18.56555 133.0998
## 53.28571       57.26713 7.676473 106.8578 -18.57523 133.1095
## 53.42857       57.26713 7.670145 106.8641 -18.58491 133.1192
## 53.57143       57.26713 7.663817 106.8704 -18.59459 133.1288
## 53.71429       57.26713 7.657491 106.8768 -18.60426 133.1385
## 53.85714       57.26713 7.651165 106.8831 -18.61394 133.1482
## 54.00000       57.26713 7.644840 106.8894 -18.62361 133.1579
## 54.14286       57.26713 7.638515 106.8957 -18.63328 133.1675

The AIC and BIC values are 4797 and 4809 respectively, consistent with what SES produced on the ATM1 series, and RMSE is 38.6. I was not expecting much from this method anyway; let’s check these values against ETS and ARIMA.

ETS

forecast_atm2_ets <- ets(atm2_ts)
autoplot(forecast(forecast_atm2_ets))

checkresiduals(forecast_atm2_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 38.485, df = 5, p-value = 3.015e-07
## 
## Model df: 9.   Total lags used: 14
summary(forecast_atm2_ets)
## ETS(A,N,A) 
## 
## Call:
##  ets(y = atm2_ts) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
##     gamma = 0.4647 
## 
##   Initial states:
##     l = 76.1082 
##     s = 11.3136 -47.8461 -13.3718 -1.8486 18.6895 4.6056
##            28.4579
## 
##   sigma:  27.8902
## 
##      AIC     AICc      BIC 
## 4566.882 4567.507 4605.826 
## 
## Training set error measures:
##                      ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set -0.6349959 27.54227 19.09383 -Inf  Inf 0.8885496 -0.02291393

We already checked that the data need no transformation, so we can go straight to the fit. AIC and BIC have dropped to 4566 and 4605 respectively, a slight improvement over SES.

ARIMA Model

forecast_atm2_arima <- auto.arima(atm2_ts)
autoplot(forecast(forecast_atm2_arima))

checkresiduals(forecast_atm2_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,2)(0,1,2)[7]
## Q* = 12.12, df = 8, p-value = 0.1459
## 
## Model df: 6.   Total lags used: 14
summary(forecast_atm2_arima)
## Series: atm2_ts 
## ARIMA(2,0,2)(0,1,2)[7] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2     sma1    sma2
##       -0.4148  -0.8922  0.4008  0.7534  -0.7220  0.0801
## s.e.   0.0674   0.0510  0.1027  0.0688   0.0567  0.0550
## 
## sigma^2 estimated as 708.1:  log likelihood=-1671.98
## AIC=3357.96   AICc=3358.28   BIC=3385.08
## 
## Training set error measures:
##                      ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set -0.4583759 26.12942 18.09675 -Inf  Inf 0.8421495 0.009277035

No transformation is needed here either. According to the summary, AIC, BIC and RMSE all dropped substantially compared with the ETS and SES models (AIC from about 4567 to 3358), and the residual ACF is clean (Ljung-Box p = 0.15), so the ARIMA model again performs best. As with ATM1, auto.arima() used a seasonal difference, here (2,0,2)(0,1,2)[7], to capture the weekly pattern.
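Since AIC, BIC and RMSE are all in-sample measures, a holdout comparison is a useful sanity check; a hedged sketch of that workflow (not run here) using the last four weeks as a test set:

# Hypothetical holdout check: refit on all but the last 28 days and
# score the forecasts against the held-out observations
atm2_train <- subset(atm2_ts, end = length(atm2_ts) - 28)
atm2_test  <- subset(atm2_ts, start = length(atm2_ts) - 27)
fit_holdout <- auto.arima(atm2_train)
accuracy(forecast(fit_holdout, h = 28), atm2_test)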

Models - ATM 3

atm3_ts <- ts(atm3$Cash, frequency = 7)
ggseasonplot(atm3_ts)

ggtsdisplay(atm3_ts, main="ATM 3 Transactions - Weekly")

For the sake of this assignment I could go ahead and test simple exponential smoothing, ETS and ARIMA here as well, but it would be pointless: nearly all of the ATM3 values are 0, so there is not enough real data to model.

plot_ly(y = atm3$Cash, type='box', name="Cash Transactions - ATM3")
summary(atm3$Cash)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7206  0.0000 96.0000
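If a forecast is nevertheless required for ATM3, about the only defensible option is to lean on the few non-zero observations; a hedged sketch (the choice of meanf() over naive() is a judgment call):

# ATM3 is almost entirely zeros, so base a flat forecast
# on the few recent non-zero values only
atm3_recent <- tail(atm3$Cash[atm3$Cash > 0], 3)
atm3_forecast <- meanf(ts(atm3_recent), h = 5)
atm3_forecast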

Models - ATM 4

atm4_ts <- ts(atm4$Cash, frequency = 7)
ggseasonplot(atm4_ts)

ggtsdisplay(atm4_ts, main="ATM 4 Transactions - Weekly")

Simple Exponential Smoothing

Let’s build the first model on the ATM4 series using simple exponential smoothing.

forecast_atm4_ses <- ses(atm4_ts, lambda='auto')
autoplot(forecast_atm4_ses)+
  autolayer(forecast(forecast_atm4_ses), series='Forecasted')

checkresiduals(forecast_atm4_ses)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 68.994, df = 12, p-value = 4.939e-10
## 
## Model df: 2.   Total lags used: 14
summary(forecast_atm4_ses)
## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = atm4_ts, lambda = "auto") 
## 
##   Box-Cox transformation: lambda= -0.0737 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 4.4955 
## 
##   sigma:  0.9807
## 
##      AIC     AICc      BIC 
## 2143.230 2143.296 2154.930 
## 
## Error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 238.6781 692.4834 345.1213 -255.8507 328.5013 0.8587366
##                      ACF1
## Training set -0.009363965
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 53.14286       235.3574 40.47122 1780.582 17.34986 5955.898
## 53.28571       235.3574 40.47122 1780.582 17.34985 5955.898
## 53.42857       235.3574 40.47121 1780.582 17.34985 5955.898
## 53.57143       235.3574 40.47121 1780.582 17.34985 5955.898
## 53.71429       235.3574 40.47121 1780.582 17.34985 5955.898
## 53.85714       235.3574 40.47121 1780.582 17.34985 5955.898
## 54.00000       235.3574 40.47121 1780.582 17.34985 5955.899
## 54.14286       235.3574 40.47121 1780.582 17.34985 5955.899
## 54.28571       235.3574 40.47121 1780.582 17.34985 5955.899
## 54.42857       235.3574 40.47121 1780.582 17.34985 5955.899

The ATM4 data were noticeably skewed, so I used a Box-Cox transformation to stabilize them, and the result looks better than before. The ACF plot shows little autocorrelation. Before the transformation, AIC and BIC were around 6500; after it they dropped to around 2143, and RMSE also improved (bearing in mind that information criteria computed on differently transformed data are not strictly comparable). The Box-Cox transformation clearly helped, but I will select the final model only after comparing ETS and ARIMA.
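The lambda chosen automatically can also be inspected on its own, which confirms the transformation is close to a log (lambda near 0); a one-line sketch:

# lambda near zero means the automatic Box-Cox choice is
# approximately a log transformation of the ATM4 series
BoxCox.lambda(atm4_ts)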

ETS

forecast_atm4_ets <- ets(atm4_ts, lambda = 'auto')
autoplot(forecast(forecast_atm4_ets))

checkresiduals(forecast_atm4_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 18.809, df = 5, p-value = 0.002086
## 
## Model df: 9.   Total lags used: 14
summary(forecast_atm4_ets)
## ETS(A,N,A) 
## 
## Call:
##  ets(y = atm4_ts, lambda = "auto") 
## 
##   Box-Cox transformation: lambda= -0.0737 
## 
##   Smoothing parameters:
##     alpha = 0.0138 
##     gamma = 0.1424 
## 
##   Initial states:
##     l = 4.4335 
##     s = -1.3504 -0.0467 0.0246 0.3089 0.3808 0.2859
##            0.397
## 
##   sigma:  0.8982
## 
##      AIC     AICc      BIC 
## 2085.955 2086.577 2124.954 
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE    MAPE      MASE
## Training set 177.0986 664.9255 308.4432 -271.5265 333.966 0.7674736
##                      ACF1
## Training set -0.004544255

Model accuracy improved slightly: AIC and BIC came down to around 2085 and 2125, and RMSE fell from 692 to 665. As before, the data are skewed, so I applied the Box-Cox transformation here too. I can’t say yet that this is the best model, as ARIMA has given the better results so far; let’s check it next.

ARIMA

forecast_atm4_arima <- auto.arima(atm4_ts, lambda = 'auto')
autoplot(forecast(forecast_atm4_arima))

checkresiduals(forecast_atm4_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
## Q* = 17.625, df = 11, p-value = 0.09069
## 
## Model df: 3.   Total lags used: 14
summary(forecast_atm4_arima)
## Series: atm4_ts 
## ARIMA(0,0,0)(2,0,0)[7] with non-zero mean 
## Box Cox transformation: lambda= -0.07372558 
## 
## Coefficients:
##         sar1    sar2    mean
##       0.2487  0.1947  4.4864
## s.e.  0.0521  0.0525  0.0841
## 
## sigma^2 estimated as 0.8418:  log likelihood=-485.59
## AIC=979.18   AICc=979.29   BIC=994.78
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE    MAPE      MASE
## Training set 209.7822 678.9958 317.1943 -239.7643 304.041 0.7892482
##                      ACF1
## Training set -0.005100994

This model provides the best accuracy measures yet: AIC and BIC dropped from around 2085 to around 979, although RMSE (679) did not improve on ETS (665). I would still go with the ARIMA model, as AIC and BIC are the stronger criteria for model selection here.

Prediction

# ATM 1
test <- forecast(forecast_atm1_arima, h=5)
test1 <- data.frame(test)
test1
##          Point.Forecast     Lo.80     Hi.80     Lo.95     Hi.95
## 52.71429      86.805607  53.71784 119.89337  36.20224 137.40898
## 52.85714     100.640560  67.09247 134.18865  49.33319 151.94793
## 53.00000      74.714560  41.16647 108.26265  23.40719 126.02193
## 53.14286       4.762635 -28.78545  38.31072 -46.54474  56.07001
## 53.28571     100.063192  66.51510 133.61128  48.75582 151.37057
#xlsx::write.xlsx(test1, 'C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/atm1_forecast.xlsx', row.names=FALSE, col.names=TRUE)
# ATM 2
test <- forecast(forecast_atm2_arima, h=5)
test1 <- data.frame(test)
test1
##          Point.Forecast     Lo.80     Hi.80     Lo.95     Hi.95
## 52.85714      69.432448  35.33003 103.53487  17.27730 121.58759
## 53.00000      69.881366  35.77560 103.98713  17.72111 122.04163
## 53.14286      12.308924 -22.09671  46.71456 -40.30996  64.92780
## 53.28571       2.758906 -31.72395  37.24176 -49.97807  55.49588
## 53.42857      99.519624  64.89886 134.14038  46.57174 152.46751
#xlsx::write.xlsx(test1, 'C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/atm2_forecast.xlsx', row.names=FALSE, col.names=TRUE)
# ATM 4

test <- forecast(forecast_atm4_arima, h=5)
test1 <- data.frame(test)
test1
##          Point.Forecast    Lo.80     Hi.80     Lo.95    Hi.95
## 53.14286       90.23016 19.20879  517.3859  9.049304 1441.117
## 53.28571      275.70706 51.80183 1856.9650 23.087712 5748.059
## 53.42857      328.95842 60.54703 2275.9560 26.741692 7171.671
## 53.57143       72.60330 15.82002  404.1713  7.530810 1104.380
## 53.71429      303.15704 56.33390 2071.4399 24.985769 6473.389
#xlsx::write.xlsx(test1, 'C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/atm4_forecast.xlsx', row.names=FALSE, col.names=TRUE)
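Note that h = 5 only covers the first five days of May; to deliver the full month, h = 31 is needed. A sketch of producing whole-month forecasts and writing them to one Excel-readable workbook (writexl is my assumption here; any Excel writer would do):

# Forecast all 31 days of May 2010 for each modeled ATM and export
# the point forecasts with 80%/95% intervals as one multi-sheet file
may_forecasts <- list(
  ATM1 = data.frame(forecast(forecast_atm1_arima, h = 31)),
  ATM2 = data.frame(forecast(forecast_atm2_arima, h = 31)),
  ATM4 = data.frame(forecast(forecast_atm4_arima, h = 31))
)
# writexl::write_xlsx(may_forecasts, "atm_forecasts_may2010.xlsx")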

Summary of ATM dataset

The data had some missing values, which I removed rather than replacing with the mean or median. I split the data into the four required datasets: ATM1, ATM2, ATM3 and ATM4. I converted each into a ts object with weekly frequency and tested simple exponential smoothing, ETS and ARIMA. ARIMA performed best for ATM1, ATM2 and ATM4; ATM3 did not have enough non-zero data to analyze. The selection criteria were AIC, BIC and RMSE. The tables above show the first five daily forecasts (point forecast plus 80% and 95% interval bounds), and the h = 31 sketch covers the full month of May 2010 as required; the forecasts were exported to Excel.

Part B - Power


Let’s start by reading the dataset first.

power <- readxl::read_excel("C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/Power624Data.xlsx")
power %>% head(12) %>% kable() %>% kable_styling(full_width = FALSE)
CaseSequence YYYY-MMM KWH
733 1998-Jan 6862583
734 1998-Feb 5838198
735 1998-Mar 5420658
736 1998-Apr 5010364
737 1998-May 4665377
738 1998-Jun 6467147
739 1998-Jul 8914755
740 1998-Aug 8607428
741 1998-Sep 6989888
742 1998-Oct 6345620
743 1998-Nov 4640410
744 1998-Dec 4693479
Only the first year is shown here; the full series is printed below as a ts object.
summary(power)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

Data Preparation

# Checking for missing values
colSums(is.na(power))
## CaseSequence     YYYY-MMM          KWH 
##            0            0            1
# Replacing missing value with median
power$KWH[is.na(power$KWH)] = median(power$KWH, na.rm=TRUE)

# Converting the dataframe into timeseries object
power_ts <- ts(power$KWH, start=c(1998,1), frequency = 12)
power_ts
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1998  6862583  5838198  5420658  5010364  4665377  6467147  8914755  8607428
## 1999  7183759  5759262  4847656  5306592  4426794  5500901  7444416  7564391
## 2000  7068296  5876083  4807961  4873080  5050891  7092865  6862662  7517830
## 2001  7538529  6602448  5779180  4835210  4787904  6283324  7855129  8450717
## 2002  7099063  6413429  5839514  5371604  5439166  5850383  7039702  8058748
## 2003  7256079  6190517  6120626  4885643  5296096  6051571  6900676  8476499
## 2004  7584596  6560742  6526586  4831688  4878262  6421614  7307931  7309774
## 2005  8225477  6564338  5581725  5563071  4453983  5900212  8337998  7786659
## 2006  7793358  5914945  5819734  5255988  4740588  7052275  7945564  8241110
## 2007  8031295  7928337  6443170  4841979  4862847  5022647  6426220  7447146
## 2008  7964293  7597060  6085644  5352359  4608528  6548439  7643987  8037137
## 2009  8072330  6976800  5691452  5531616  5264439  5804433  7713260  8350517
## 2010  9397357  8390677  7347915  5776131  4919289  6696292   770523  7922701
## 2011  8394747  8898062  6356903  5685227  5506308  8037779 10093343 10308076
## 2012  8991267  7952204  6356961  5569828  5783598  7926956  8886851  9612423
## 2013 10655730  7681798  6517514  6105359  5940475  7920627  8415321  9080226
##           Sep      Oct      Nov      Dec
## 1998  6989888  6345620  4640410  4693479
## 1999  7899368  5358314  4436269  4419229
## 2000  8912169  5844352  5041769  6220334
## 2001  7112069  5242535  4461979  5240995
## 2002  8245227  5865014  4908979  5779958
## 2003  7791791  5344613  4913707  5756193
## 2004  6690366  5444948  4824940  5791208
## 2005  7057213  6694523  4313019  6181548
## 2006  7296355  5104799  4458429  6226214
## 2007  7666970  5785964  4907057  6047292
## 2008  6283324  5101803  4555602  6442746
## 2009  7583146  5566075  5339890  7089880
## 2010  7819472  5875917  4800733  6152583
## 2011  8943599  5603920  6154138  8273142
## 2012  7559148  5576852  5731899  6609694
## 2013  7968220  5759367  5769083  9606304

The data are monthly, from January 1998 through December 2013, and the task is to forecast every month of 2014. First I checked for missing values: there is only one, which I replaced with the median of the column, and then I converted the data frame into a ts object. Now the data are ready for the next steps.
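Because power usage is strongly seasonal, a defensible refinement would have been to impute with the median of the same calendar month rather than the overall median; a hedged sketch of that alternative (not used above; the month parsing assumes the ‘YYYY-MMM’ format shown):

# Hypothetical alternative: use the median of all September values
# for the missing 2008-Sep observation instead of the overall median
power$month <- substr(power$`YYYY-MMM`, 6, 8)
sep_median <- median(power$KWH[power$month == "Sep"], na.rm = TRUE)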

# Before Transformation
ggtsdisplay(power_ts, main="Before Transformation, Monthly Power Consumption")

# Box Cox Transformation
power_boxcox <- power_ts %>% BoxCox(lambda= 'auto')
ggtsdisplay(power_boxcox, main='BoxCox Transformation, Monthly Power Consumption')

# Seasonality
ggseasonplot(power_boxcox)

The Box-Cox transformation did not make much difference, and the data are still non-stationary, so I have to difference them before building any model. There is also clear seasonality: usage dips in April and May and rises again from June through early September. Let’s see how much differencing is required to make the data stationary.

print(paste0("Suggested number of differencing is ", ndiffs(power_boxcox)))
## [1] "Suggested number of differencing is 1"
print(paste0("Seasonal differencing: How many more differencing required ? ", ndiffs(diff(power_boxcox, lag=12))))
## [1] "Seasonal differencing: How many more differencing required ? 0"

So one difference is suggested for the transformed series, and after a seasonal difference at lag 12 no further differencing is indicated; the seasonal difference alone should be enough. Let’s take it.

power_diff <- power_boxcox %>% diff(lag = 12)
ggtsdisplay(power_diff, main= "BoxCox Transformation and seasonal differencing - Monthly power consumption")

After the seasonal difference the data look stationary in the ACF plot; it was the seasonality that made differencing necessary in the first place. We can double-check with ndiffs() again.

print(paste0("Suggested differencing : ", ndiffs(power_diff)))
## [1] "Suggested differencing : 0"

Now that the data are stationary, we can move forward, try different forecasting techniques to see which performs best, and then predict each month of 2014. Since the data are seasonal, I’ll first look at a 12-month moving-average smoother, and then at other methods.

Moving Average

autoplot(power_diff, series="Data")+
  autolayer(ma(power_diff, 12), series = "12-Month Moving Average")+ ggtitle("12-Month Moving Average of the Differenced Series")

Simple Exponential Smoothing

forecast_power_ses <- ses(power_diff, h=12)
autoplot(forecast_power_ses)+
  autolayer(fitted(forecast_power_ses), series="Fitted")

summary(forecast_power_ses)
## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = power_diff, h = 12) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 0.0714 
## 
##   sigma:  1.3548
## 
##      AIC     AICc      BIC 
## 1048.048 1048.184 1057.627 
## 
## Error measures:
##                        ME     RMSE       MAE       MPE     MAPE      MASE
## Training set 0.0005184573 1.347295 0.6003728 -761.7743 987.6887 0.5469261
##                   ACF1
## Training set 0.1202922
## 
## Forecasts:
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## Jan 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Feb 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Mar 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Apr 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## May 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Jun 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Jul 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Aug 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Sep 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Oct 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Nov 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Dec 2014     0.07137174 -1.664929 1.807673 -2.584072 2.726815

The moving average still extracts some signal, but simple exponential smoothing is clearly not working here: its forecast is a constant line at 0.0714 (on the transformed, differenced scale). AIC is 1048, BIC is 1057, and RMSE is 1.35. Let’s fit an ARIMA model and then compare the two.

ARIMA

I’m going to try ARIMA, which worked well in Part A, and see how it does here.


forecast_power_arima <- auto.arima(power_diff)
autoplot(forecast(forecast_power_arima,12), main="Prediction for 2014 with ARIMA")

summary(forecast_power_arima)
## Series: power_diff 
## ARIMA(0,0,1)(0,0,2)[12] with non-zero mean 
## 
## Coefficients:
##          ma1     sma1    sma2    mean
##       0.1195  -0.9264  0.0890  0.0617
## s.e.  0.0712   0.0810  0.0811  0.0198
## 
## sigma^2 estimated as 0.9561:  log likelihood=-257.16
## AIC=524.31   AICc=524.66   BIC=540.28
## 
## Training set error measures:
##                       ME      RMSE       MAE       MPE     MAPE      MASE
## Training set -0.02555369 0.9668839 0.4542946 -1920.731 2160.928 0.4138522
##                     ACF1
## Training set 0.006469908

ARIMA provides better results: AIC dropped to 524 and BIC to 540, and RMSE fell from 1.35 to 0.97. I am comfortable with these results, so I’ll select the ARIMA model and export its predictions to Excel.


test <- forecast(forecast_power_arima, h=12)
test1 <- data.frame(test)
#xlsx::write.xlsx(test1, 'C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/POwer 12 months prediction.xlsx')
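One caveat: forecast_power_arima was fitted to the Box-Cox transformed, seasonally differenced series, so the exported numbers are on that transformed scale rather than in KWH. A hedged sketch of getting forecasts directly in KWH is to let auto.arima() work on the raw series and handle the differencing and back-transformation internally:

# Fit on the raw monthly series; auto.arima chooses the differencing and,
# with lambda set, back-transforms forecasts to the original KWH scale
fit_kwh <- auto.arima(power_ts, lambda = 'auto')
forecast_kwh <- forecast(fit_kwh, h = 12)
#xlsx::write.xlsx(data.frame(forecast_kwh), 'power_2014_forecast_kwh.xlsx')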

Summary for Power Dataset

The power dataset was given on a monthly basis from 1998 to 2013, and the task was to predict each month of 2014. The data had one missing value, which was replaced with the median of the column. The data were converted into a ts object, and I tried several techniques to see which performed best. ARIMA gave the best results based on AIC, BIC and RMSE, so I chose the ARIMA model to predict the power consumption for each month of 2014.