In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose, to give this a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link, along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
library(pacman)
pacman::p_load(tidyverse, fpp2, plotly, readxl, knitr, kableExtra)
First of all, let’s read the ATM dataset and split it into four datasets based on the ATM number.
atm_full <- readxl::read_excel('ATM624Data.xlsx')
# Checking for missing values
colSums(is.na(atm_full))
## DATE ATM Cash
## 0 14 19
# Data Preparation
atm_full$DATE <- as.Date(atm_full$DATE, origin= "1899-12-30") # Converting into date
atm_full <- atm_full[!is.na(atm_full$ATM),]
atm_full <- atm_full[!is.na(atm_full$Cash),]
# missing values after removing
colSums(is.na(atm_full))
## DATE ATM Cash
## 0 0 0
# Splitting the data
atm1 <- atm_full %>% filter(ATM == 'ATM1')
atm2 <- atm_full %>% filter(ATM == 'ATM2')
atm3 <- atm_full %>% filter(ATM == 'ATM3')
atm4 <- atm_full %>% filter(ATM == 'ATM4')
All rows with missing values were removed rather than imputed with the mean or median. If the series are later aggregated weekly or monthly, those few missing days matter even less, so dropping them should not make much of a difference.
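For completeness, if imputation were preferred over deletion, forecast::na.interp performs seasonality-aware interpolation. A minimal sketch (not used in the models below), re-reading the file so the NA rows are still present:

# Sketch only: seasonal interpolation as an alternative to dropping NA rows
atm_raw  <- readxl::read_excel('ATM624Data.xlsx')
atm1_raw <- atm_raw[atm_raw$ATM %in% 'ATM1', ]
atm1_filled <- na.interp(ts(atm1_raw$Cash, frequency = 7))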
atm1_ts <- ts(atm1$Cash, frequency = 7)
ggseasonplot(atm1_ts)
ggtsdisplay(atm1_ts, main="ATM 1 Transactions - Weekly")
I gave the daily series a weekly frequency (7) to get a better picture of the day-of-week pattern in ATM transactions. There appears to be some weekly seasonality, with dips mostly on Mondays and Wednesdays. Apart from that seasonality, the ACF plot and the time plot show no trend and roughly constant variance, so no transformation is needed.
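The visual read can be cross-checked numerically: ndiffs() and nsdiffs() report how many ordinary and seasonal differences the underlying unit-root tests recommend (0 would support the no-transformation call above).

# Unit-root based checks on the ATM1 series
ndiffs(atm1_ts)   # ordinary differences suggested
nsdiffs(atm1_ts)  # seasonal (lag-7) differences suggested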
Simple exponential smoothing is mostly used where there is no seasonality, but since the seasonality here is not very strong, I will start with ses() and see how it goes. I am not expecting much from this method, though.
forecast_atm1_ses <- ses(atm1_ts)
autoplot(forecast_atm1_ses)+
autolayer(fitted(forecast_atm1_ses), series='Fitted')
summary(forecast_atm1_ses)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = atm1_ts)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 83.7797
##
## sigma: 36.7094
##
## AIC AICc BIC
## 4745.365 4745.432 4757.040
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.07802025 36.60783 27.31876 -175.78 199.2779 1.451603 0.06365945
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 52.71429 83.78256 36.7376 130.8275 11.83350 155.7316
## 52.85714 83.78256 36.7376 130.8275 11.83350 155.7316
## 53.00000 83.78256 36.7376 130.8275 11.83350 155.7316
## 53.14286 83.78256 36.7376 130.8275 11.83350 155.7316
## 53.28571 83.78256 36.7376 130.8275 11.83350 155.7316
## 53.42857 83.78256 36.7376 130.8275 11.83349 155.7316
## 53.57143 83.78256 36.7376 130.8275 11.83349 155.7316
## 53.71429 83.78256 36.7376 130.8275 11.83349 155.7316
## 53.85714 83.78256 36.7376 130.8275 11.83349 155.7316
## 54.00000 83.78256 36.7376 130.8275 11.83349 155.7316
forecast_atm1_ets <- ets(atm1_ts)
autoplot(forecast_atm1_ets)+
autolayer(forecast(forecast_atm1_ets), series="Forecasted")
summary(forecast_atm1_ets)
## ETS(A,N,A)
##
## Call:
## ets(y = atm1_ts)
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 0.415
##
## Initial states:
## l = 87.7888
## s = -54.0598 16.2223 21.048 -26.068 8.3991 35.2058
## -0.7474
##
## sigma: 26.384
##
## AIC AICc BIC
## 4513.138 4513.765 4552.055
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.5473075 26.05397 16.9637 -110.6377 126.3744 0.9013793 0.1374786
The model accuracy improved slightly compared with simple exponential smoothing, judging by RMSE, AIC and BIC.
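A compact way to compare the two fits is to stack their training-set accuracy rows; a quick sketch using the objects fit above:

# SES vs ETS training accuracy for ATM1, side by side
rbind(SES = accuracy(forecast_atm1_ses)[1, ],
      ETS = accuracy(forecast_atm1_ets)[1, ])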
forecast_atm1_arima <- auto.arima(atm1_ts)
autoplot(forecast(forecast_atm1_arima))
checkresiduals(forecast_atm1_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(0,1,1)[7]
## Q* = 11.267, df = 12, p-value = 0.5062
##
## Model df: 2. Total lags used: 14
summary(forecast_atm1_arima)
## Series: atm1_ts
## ARIMA(0,0,1)(0,1,1)[7]
##
## Coefficients:
## ma1 sma1
## 0.1674 -0.5813
## s.e. 0.0565 0.0472
##
## sigma^2 estimated as 666.6: log likelihood=-1658.32
## AIC=3322.63 AICc=3322.7 BIC=3334.25
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.09700089 25.49555 16.17552 -106.2792 122.4165 0.8594986
## ACF1
## Training set -0.01116544
ARIMA gave the best result of the three: AIC and BIC dropped to about 3323 and 3334, from roughly 4513 and 4552 under ETS, and RMSE also fell with auto.arima(). Note that auto.arima() applied one seasonal difference, consistent with the weekly pattern seen earlier; no transformation was needed.
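In-sample AIC and RMSE can flatter a model, so a holdout check is worth sketching. Here, hypothetically, the last 28 days are held out and the ARIMA is refit on the remainder:

# Sketch: out-of-sample check for the ATM1 ARIMA on a 28-day holdout
atm1_train <- subset(atm1_ts, end = length(atm1_ts) - 28)
atm1_test  <- subset(atm1_ts, start = length(atm1_ts) - 27)
fit_train  <- auto.arima(atm1_train)
accuracy(forecast(fit_train, h = 28), atm1_test)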
atm2_ts <- ts(atm2$Cash, frequency = 7)
ggseasonplot(atm2_ts)
ggtsdisplay(atm2_ts, main="ATM 2 Transactions - Weekly")
The ATM2 series also looks stable: the time plot shows no trend, and beyond the weekly spikes the ACF suggests behaviour close to white noise, so no transformation is required.
Again, I’ll start with simple exponential smoothing.
forecast_atm2_ses <- ses(atm2_ts)
autoplot(forecast_atm2_ses)+
autolayer(forecast(forecast_atm2_ses), series='Forecasted')
checkresiduals(forecast_atm2_ses)
##
## Ljung-Box test
##
## data: Residuals from Simple exponential smoothing
## Q* = 515.46, df = 12, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 14
summary(forecast_atm2_ses)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = atm2_ts)
##
## Smoothing parameters:
## alpha = 0.016
##
## Initial states:
## l = 72.14
##
## sigma: 38.681
##
## AIC AICc BIC
## 4797.445 4797.512 4809.128
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -2.563596 38.57427 33.09136 -Inf Inf 1.539938 -0.02208177
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 52.85714 57.26713 7.695463 106.8388 -18.54619 133.0804
## 53.00000 57.26713 7.689132 106.8451 -18.55587 133.0901
## 53.14286 57.26713 7.682802 106.8514 -18.56555 133.0998
## 53.28571 57.26713 7.676473 106.8578 -18.57523 133.1095
## 53.42857 57.26713 7.670145 106.8641 -18.58491 133.1192
## 53.57143 57.26713 7.663817 106.8704 -18.59459 133.1288
## 53.71429 57.26713 7.657491 106.8768 -18.60426 133.1385
## 53.85714 57.26713 7.651165 106.8831 -18.61394 133.1482
## 54.00000 57.26713 7.644840 106.8894 -18.62361 133.1579
## 54.14286 57.26713 7.638515 106.8957 -18.63328 133.1675
AIC and BIC are 4797 and 4809 respectively, roughly in line with the ATM1 series, and RMSE is 38.6. I was not expecting much from this method anyway. Let’s check these values against ETS and ARIMA.
forecast_atm2_ets <- ets(atm2_ts)
autoplot(forecast(forecast_atm2_ets))
checkresiduals(forecast_atm2_ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 38.485, df = 5, p-value = 3.015e-07
##
## Model df: 9. Total lags used: 14
summary(forecast_atm2_ets)
## ETS(A,N,A)
##
## Call:
## ets(y = atm2_ts)
##
## Smoothing parameters:
## alpha = 1e-04
## gamma = 0.4647
##
## Initial states:
## l = 76.1082
## s = 11.3136 -47.8461 -13.3718 -1.8486 18.6895 4.6056
## 28.4579
##
## sigma: 27.8902
##
## AIC AICc BIC
## 4566.882 4567.507 4605.826
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.6349959 27.54227 19.09383 -Inf Inf 0.8885496 -0.02291393
As checked earlier, the series needs no transformation. Here, AIC and BIC have dropped to about 4567 and 4606 respectively, a slight improvement over SES.
forecast_atm2_arima <- auto.arima(atm2_ts)
autoplot(forecast(forecast_atm2_arima))
checkresiduals(forecast_atm2_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,2)[7]
## Q* = 12.12, df = 8, p-value = 0.1459
##
## Model df: 6. Total lags used: 14
summary(forecast_atm2_arima)
## Series: atm2_ts
## ARIMA(2,0,2)(0,1,2)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1 sma2
## -0.4148 -0.8922 0.4008 0.7534 -0.7220 0.0801
## s.e. 0.0674 0.0510 0.1027 0.0688 0.0567 0.0550
##
## sigma^2 estimated as 708.1: log likelihood=-1671.98
## AIC=3357.96 AICc=3358.28 BIC=3385.08
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.4583759 26.12942 18.09675 -Inf Inf 0.8421495 0.009277035
No transformation is needed here either. After auto.arima()’s seasonal difference the ACF dies off quickly, with even the first few lags small, so the residuals are close to white noise. Per the summary, AIC, BIC and RMSE all dropped substantially relative to the ETS and SES models, so the ARIMA model performs best for ATM2.
atm3_ts <- ts(atm3$Cash, frequency = 7)
ggseasonplot(atm3_ts)
ggtsdisplay(atm3_ts, main="ATM 3 Transactions - Weekly")
For the sake of this assignment I could go ahead and test simple exponential smoothing, ETS and ARIMA here as well, but it would be pointless: ATM3 has almost no usable history, since the vast majority of its Cash values are 0. The boxplot and summary below confirm this, and a minimal fallback is sketched after them.
plot_ly(y = atm3$Cash, type='box', name="Cash Transactions - ATM3")
summary(atm3$Cash)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7206 0.0000 96.0000
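Given that nearly every observation is zero (median and quartiles are all 0), about the only defensible forecast is one built from the few non-zero values. If, as the zeros suggest, the machine only recently became active, a mean forecast of that tail is a reasonable placeholder; a minimal sketch, not a definitive recommendation:

# Sketch: forecast ATM3 from its handful of non-zero observations only
atm3_nonzero <- atm3$Cash[atm3$Cash > 0]
meanf(ts(atm3_nonzero), h = 5)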
atm4_ts <- ts(atm4$Cash, frequency = 7)
ggseasonplot(atm4_ts)
ggtsdisplay(atm4_ts, main="ATM 4 Transactions - Weekly")
Let’s build the first model on the ATM4 series using simple exponential smoothing.
forecast_atm4_ses <- ses(atm4_ts, lambda='auto')
autoplot(forecast_atm4_ses)+
autolayer(forecast(forecast_atm4_ses), series='Forecasted')
checkresiduals(forecast_atm4_ses)
##
## Ljung-Box test
##
## data: Residuals from Simple exponential smoothing
## Q* = 68.994, df = 12, p-value = 4.939e-10
##
## Model df: 2. Total lags used: 14
summary(forecast_atm4_ses)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = atm4_ts, lambda = "auto")
##
## Box-Cox transformation: lambda= -0.0737
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 4.4955
##
## sigma: 0.9807
##
## AIC AICc BIC
## 2143.230 2143.296 2154.930
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 238.6781 692.4834 345.1213 -255.8507 328.5013 0.8587366
## ACF1
## Training set -0.009363965
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 53.14286 235.3574 40.47122 1780.582 17.34986 5955.898
## 53.28571 235.3574 40.47122 1780.582 17.34985 5955.898
## 53.42857 235.3574 40.47121 1780.582 17.34985 5955.898
## 53.57143 235.3574 40.47121 1780.582 17.34985 5955.898
## 53.71429 235.3574 40.47121 1780.582 17.34985 5955.898
## 53.85714 235.3574 40.47121 1780.582 17.34985 5955.898
## 54.00000 235.3574 40.47121 1780.582 17.34985 5955.899
## 54.14286 235.3574 40.47121 1780.582 17.34985 5955.899
## 54.28571 235.3574 40.47121 1780.582 17.34985 5955.899
## 54.42857 235.3574 40.47121 1780.582 17.34985 5955.899
The ATM4 data was noticeably skewed, which is why I applied a Box-Cox transformation, and the result looks better than before. The series is stationary, as the ACF plot above shows little autocorrelation. Before the transformation, AIC and BIC were around 6500; afterwards they dropped to about 2143 and 2155, and RMSE also improved. (Strictly, the AIC of a transformed model is not directly comparable to an untransformed one, so the RMSE on the original scale is the safer comparison.) The Box-Cox transformation clearly helped, but we will select the final model only after fitting ETS and ARIMA and comparing their measures.
forecast_atm4_ets <- ets(atm4_ts, lambda = 'auto')
autoplot(forecast(forecast_atm4_ets))
checkresiduals(forecast_atm4_ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 18.809, df = 5, p-value = 0.002086
##
## Model df: 9. Total lags used: 14
summary(forecast_atm4_ets)
## ETS(A,N,A)
##
## Call:
## ets(y = atm4_ts, lambda = "auto")
##
## Box-Cox transformation: lambda= -0.0737
##
## Smoothing parameters:
## alpha = 0.0138
## gamma = 0.1424
##
## Initial states:
## l = 4.4335
## s = -1.3504 -0.0467 0.0246 0.3089 0.3808 0.2859
## 0.397
##
## sigma: 0.8982
##
## AIC AICc BIC
## 2085.955 2086.577 2124.954
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 177.0986 664.9255 308.4432 -271.5265 333.966 0.7674736
## ACF1
## Training set -0.004544255
The model’s accuracy improved slightly: AIC and BIC came down to about 2086 and 2125, and RMSE fell from about 692 to 665. I cannot yet say this model is best, as ARIMA has given better results so far; let’s check it with ARIMA now. As before, the skewed data called for a Box-Cox transformation.
forecast_atm4_arima <- auto.arima(atm4_ts, lambda = 'auto')
autoplot(forecast(forecast_atm4_arima))
checkresiduals(forecast_atm4_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
## Q* = 17.625, df = 11, p-value = 0.09069
##
## Model df: 3. Total lags used: 14
summary(forecast_atm4_arima)
## Series: atm4_ts
## ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
## Box Cox transformation: lambda= -0.07372558
##
## Coefficients:
## sar1 sar2 mean
## 0.2487 0.1947 4.4864
## s.e. 0.0521 0.0525 0.0841
##
## sigma^2 estimated as 0.8418: log likelihood=-485.59
## AIC=979.18 AICc=979.29 BIC=994.78
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 209.7822 678.9958 317.1943 -239.7643 304.041 0.7892482
## ACF1
## Training set -0.005100994
This model looks best on the information criteria: AIC and BIC dropped from about 2086 and 2125 down to about 979 and 995, although RMSE did not improve. I would still go with the ARIMA model, since I am weighting AIC and BIC more heavily for model selection.
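Stacking the training accuracies makes the trade-off explicit; per the summaries above, RMSE slightly favors ETS while AIC/BIC favor ARIMA:

# ATM4 models side by side (errors reported on the original scale)
rbind(SES   = accuracy(forecast_atm4_ses)[1, c("RMSE", "MAE", "MASE")],
      ETS   = accuracy(forecast_atm4_ets)[1, c("RMSE", "MAE", "MASE")],
      ARIMA = accuracy(forecast_atm4_arima)[1, c("RMSE", "MAE", "MASE")])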
# ATM 1
test <- forecast(forecast_atm1_arima, h=5)
test1 <- data.frame(test)
test1
## Point.Forecast Lo.80 Hi.80 Lo.95 Hi.95
## 52.71429 86.805607 53.71784 119.89337 36.20224 137.40898
## 52.85714 100.640560 67.09247 134.18865 49.33319 151.94793
## 53.00000 74.714560 41.16647 108.26265 23.40719 126.02193
## 53.14286 4.762635 -28.78545 38.31072 -46.54474 56.07001
## 53.28571 100.063192 66.51510 133.61128 48.75582 151.37057
#xlsx::write.xlsx(test1, 'C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/atm1_forecast.xlsx', row.names=FALSE, col.names=TRUE)
# ATM 2
test <- forecast(forecast_atm2_arima, h=5)
test1 <- data.frame(test)
test1
## Point.Forecast Lo.80 Hi.80 Lo.95 Hi.95
## 52.85714 69.432448 35.33003 103.53487 17.27730 121.58759
## 53.00000 69.881366 35.77560 103.98713 17.72111 122.04163
## 53.14286 12.308924 -22.09671 46.71456 -40.30996 64.92780
## 53.28571 2.758906 -31.72395 37.24176 -49.97807 55.49588
## 53.42857 99.519624 64.89886 134.14038 46.57174 152.46751
#xlsx::write.xlsx(test1, 'C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/atm2_forecast.xlsx', row.names=FALSE, col.names=TRUE)
# ATM 4
test <- forecast(forecast_atm4_arima, h=5)
test1 <- data.frame(test)
test1
## Point.Forecast Lo.80 Hi.80 Lo.95 Hi.95
## 53.14286 90.23016 19.20879 517.3859 9.049304 1441.117
## 53.28571 275.70706 51.80183 1856.9650 23.087712 5748.059
## 53.42857 328.95842 60.54703 2275.9560 26.741692 7171.671
## 53.57143 72.60330 15.82002 404.1713 7.530810 1104.380
## 53.71429 303.15704 56.33390 2071.4399 24.985769 6473.389
#xlsx::write.xlsx(test1, 'C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/atm4_forecast.xlsx', row.names=FALSE, col.names=TRUE)
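The exports above cover only five days each; for the full month, the same pattern extends to h = 31 with calendar dates attached. A sketch for ATM1, assuming the series ends 2010-04-30 as in the source file:

# Sketch: full 31-day May 2010 forecast for ATM1, ready for Excel
may_dates <- seq(as.Date("2010-05-01"), as.Date("2010-05-31"), by = "day")
atm1_may  <- data.frame(DATE = may_dates, forecast(forecast_atm1_arima, h = 31))
#xlsx::write.xlsx(atm1_may, 'atm1_may2010_forecast.xlsx', row.names=FALSE, col.names=TRUE)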
The data had some missing values, which were removed rather than replaced with the mean or median. I split the data into the four required series: ATM1, ATM2, ATM3 and ATM4. Each was converted to a ts object with weekly frequency (7), and I tested simple exponential smoothing, ETS and ARIMA. ARIMA performed best for ATM1, ATM2 and ATM4, judged on AIC, BIC and RMSE; ATM3 did not have enough non-zero data for analysis. I exported the point forecasts along with the 80% and 95% prediction-interval bounds to Excel. Note the exports above use h = 5 (five days); the full May 2010 forecast follows the h = 31 pattern sketched above.
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Let’s start by reading the dataset first.
power <- readxl::read_excel("C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/Power624Data.xlsx")
power %>% kable() %>% kable_styling(full_width = FALSE)
CaseSequence | YYYY-MMM | KWH |
---|---|---|
733 | 1998-Jan | 6862583 |
734 | 1998-Feb | 5838198 |
735 | 1998-Mar | 5420658 |
736 | 1998-Apr | 5010364 |
737 | 1998-May | 4665377 |
738 | 1998-Jun | 6467147 |
739 | 1998-Jul | 8914755 |
740 | 1998-Aug | 8607428 |
741 | 1998-Sep | 6989888 |
742 | 1998-Oct | 6345620 |
743 | 1998-Nov | 4640410 |
744 | 1998-Dec | 4693479 |
745 | 1999-Jan | 7183759 |
746 | 1999-Feb | 5759262 |
747 | 1999-Mar | 4847656 |
748 | 1999-Apr | 5306592 |
749 | 1999-May | 4426794 |
750 | 1999-Jun | 5500901 |
751 | 1999-Jul | 7444416 |
752 | 1999-Aug | 7564391 |
753 | 1999-Sep | 7899368 |
754 | 1999-Oct | 5358314 |
755 | 1999-Nov | 4436269 |
756 | 1999-Dec | 4419229 |
757 | 2000-Jan | 7068296 |
758 | 2000-Feb | 5876083 |
759 | 2000-Mar | 4807961 |
760 | 2000-Apr | 4873080 |
761 | 2000-May | 5050891 |
762 | 2000-Jun | 7092865 |
763 | 2000-Jul | 6862662 |
764 | 2000-Aug | 7517830 |
765 | 2000-Sep | 8912169 |
766 | 2000-Oct | 5844352 |
767 | 2000-Nov | 5041769 |
768 | 2000-Dec | 6220334 |
769 | 2001-Jan | 7538529 |
770 | 2001-Feb | 6602448 |
771 | 2001-Mar | 5779180 |
772 | 2001-Apr | 4835210 |
773 | 2001-May | 4787904 |
774 | 2001-Jun | 6283324 |
775 | 2001-Jul | 7855129 |
776 | 2001-Aug | 8450717 |
777 | 2001-Sep | 7112069 |
778 | 2001-Oct | 5242535 |
779 | 2001-Nov | 4461979 |
780 | 2001-Dec | 5240995 |
781 | 2002-Jan | 7099063 |
782 | 2002-Feb | 6413429 |
783 | 2002-Mar | 5839514 |
784 | 2002-Apr | 5371604 |
785 | 2002-May | 5439166 |
786 | 2002-Jun | 5850383 |
787 | 2002-Jul | 7039702 |
788 | 2002-Aug | 8058748 |
789 | 2002-Sep | 8245227 |
790 | 2002-Oct | 5865014 |
791 | 2002-Nov | 4908979 |
792 | 2002-Dec | 5779958 |
793 | 2003-Jan | 7256079 |
794 | 2003-Feb | 6190517 |
795 | 2003-Mar | 6120626 |
796 | 2003-Apr | 4885643 |
797 | 2003-May | 5296096 |
798 | 2003-Jun | 6051571 |
799 | 2003-Jul | 6900676 |
800 | 2003-Aug | 8476499 |
801 | 2003-Sep | 7791791 |
802 | 2003-Oct | 5344613 |
803 | 2003-Nov | 4913707 |
804 | 2003-Dec | 5756193 |
805 | 2004-Jan | 7584596 |
806 | 2004-Feb | 6560742 |
807 | 2004-Mar | 6526586 |
808 | 2004-Apr | 4831688 |
809 | 2004-May | 4878262 |
810 | 2004-Jun | 6421614 |
811 | 2004-Jul | 7307931 |
812 | 2004-Aug | 7309774 |
813 | 2004-Sep | 6690366 |
814 | 2004-Oct | 5444948 |
815 | 2004-Nov | 4824940 |
816 | 2004-Dec | 5791208 |
817 | 2005-Jan | 8225477 |
818 | 2005-Feb | 6564338 |
819 | 2005-Mar | 5581725 |
820 | 2005-Apr | 5563071 |
821 | 2005-May | 4453983 |
822 | 2005-Jun | 5900212 |
823 | 2005-Jul | 8337998 |
824 | 2005-Aug | 7786659 |
825 | 2005-Sep | 7057213 |
826 | 2005-Oct | 6694523 |
827 | 2005-Nov | 4313019 |
828 | 2005-Dec | 6181548 |
829 | 2006-Jan | 7793358 |
830 | 2006-Feb | 5914945 |
831 | 2006-Mar | 5819734 |
832 | 2006-Apr | 5255988 |
833 | 2006-May | 4740588 |
834 | 2006-Jun | 7052275 |
835 | 2006-Jul | 7945564 |
836 | 2006-Aug | 8241110 |
837 | 2006-Sep | 7296355 |
838 | 2006-Oct | 5104799 |
839 | 2006-Nov | 4458429 |
840 | 2006-Dec | 6226214 |
841 | 2007-Jan | 8031295 |
842 | 2007-Feb | 7928337 |
843 | 2007-Mar | 6443170 |
844 | 2007-Apr | 4841979 |
845 | 2007-May | 4862847 |
846 | 2007-Jun | 5022647 |
847 | 2007-Jul | 6426220 |
848 | 2007-Aug | 7447146 |
849 | 2007-Sep | 7666970 |
850 | 2007-Oct | 5785964 |
851 | 2007-Nov | 4907057 |
852 | 2007-Dec | 6047292 |
853 | 2008-Jan | 7964293 |
854 | 2008-Feb | 7597060 |
855 | 2008-Mar | 6085644 |
856 | 2008-Apr | 5352359 |
857 | 2008-May | 4608528 |
858 | 2008-Jun | 6548439 |
859 | 2008-Jul | 7643987 |
860 | 2008-Aug | 8037137 |
861 | 2008-Sep | NA |
862 | 2008-Oct | 5101803 |
863 | 2008-Nov | 4555602 |
864 | 2008-Dec | 6442746 |
865 | 2009-Jan | 8072330 |
866 | 2009-Feb | 6976800 |
867 | 2009-Mar | 5691452 |
868 | 2009-Apr | 5531616 |
869 | 2009-May | 5264439 |
870 | 2009-Jun | 5804433 |
871 | 2009-Jul | 7713260 |
872 | 2009-Aug | 8350517 |
873 | 2009-Sep | 7583146 |
874 | 2009-Oct | 5566075 |
875 | 2009-Nov | 5339890 |
876 | 2009-Dec | 7089880 |
877 | 2010-Jan | 9397357 |
878 | 2010-Feb | 8390677 |
879 | 2010-Mar | 7347915 |
880 | 2010-Apr | 5776131 |
881 | 2010-May | 4919289 |
882 | 2010-Jun | 6696292 |
883 | 2010-Jul | 770523 |
884 | 2010-Aug | 7922701 |
885 | 2010-Sep | 7819472 |
886 | 2010-Oct | 5875917 |
887 | 2010-Nov | 4800733 |
888 | 2010-Dec | 6152583 |
889 | 2011-Jan | 8394747 |
890 | 2011-Feb | 8898062 |
891 | 2011-Mar | 6356903 |
892 | 2011-Apr | 5685227 |
893 | 2011-May | 5506308 |
894 | 2011-Jun | 8037779 |
895 | 2011-Jul | 10093343 |
896 | 2011-Aug | 10308076 |
897 | 2011-Sep | 8943599 |
898 | 2011-Oct | 5603920 |
899 | 2011-Nov | 6154138 |
900 | 2011-Dec | 8273142 |
901 | 2012-Jan | 8991267 |
902 | 2012-Feb | 7952204 |
903 | 2012-Mar | 6356961 |
904 | 2012-Apr | 5569828 |
905 | 2012-May | 5783598 |
906 | 2012-Jun | 7926956 |
907 | 2012-Jul | 8886851 |
908 | 2012-Aug | 9612423 |
909 | 2012-Sep | 7559148 |
910 | 2012-Oct | 5576852 |
911 | 2012-Nov | 5731899 |
912 | 2012-Dec | 6609694 |
913 | 2013-Jan | 10655730 |
914 | 2013-Feb | 7681798 |
915 | 2013-Mar | 6517514 |
916 | 2013-Apr | 6105359 |
917 | 2013-May | 5940475 |
918 | 2013-Jun | 7920627 |
919 | 2013-Jul | 8415321 |
920 | 2013-Aug | 9080226 |
921 | 2013-Sep | 7968220 |
922 | 2013-Oct | 5759367 |
923 | 2013-Nov | 5769083 |
924 | 2013-Dec | 9606304 |
summary(power)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
# Checking for missing values
colSums(is.na(power))
## CaseSequence YYYY-MMM KWH
## 0 0 1
# Replacing missing value with median
power$KWH[is.na(power$KWH)] = median(power$KWH, na.rm=TRUE)
# Converting the dataframe into timeseries object
power_ts <- ts(power$KWH, start=c(1998,1), frequency = 12)
power_ts
## Jan Feb Mar Apr May Jun Jul Aug
## 1998 6862583 5838198 5420658 5010364 4665377 6467147 8914755 8607428
## 1999 7183759 5759262 4847656 5306592 4426794 5500901 7444416 7564391
## 2000 7068296 5876083 4807961 4873080 5050891 7092865 6862662 7517830
## 2001 7538529 6602448 5779180 4835210 4787904 6283324 7855129 8450717
## 2002 7099063 6413429 5839514 5371604 5439166 5850383 7039702 8058748
## 2003 7256079 6190517 6120626 4885643 5296096 6051571 6900676 8476499
## 2004 7584596 6560742 6526586 4831688 4878262 6421614 7307931 7309774
## 2005 8225477 6564338 5581725 5563071 4453983 5900212 8337998 7786659
## 2006 7793358 5914945 5819734 5255988 4740588 7052275 7945564 8241110
## 2007 8031295 7928337 6443170 4841979 4862847 5022647 6426220 7447146
## 2008 7964293 7597060 6085644 5352359 4608528 6548439 7643987 8037137
## 2009 8072330 6976800 5691452 5531616 5264439 5804433 7713260 8350517
## 2010 9397357 8390677 7347915 5776131 4919289 6696292 770523 7922701
## 2011 8394747 8898062 6356903 5685227 5506308 8037779 10093343 10308076
## 2012 8991267 7952204 6356961 5569828 5783598 7926956 8886851 9612423
## 2013 10655730 7681798 6517514 6105359 5940475 7920627 8415321 9080226
## Sep Oct Nov Dec
## 1998 6989888 6345620 4640410 4693479
## 1999 7899368 5358314 4436269 4419229
## 2000 8912169 5844352 5041769 6220334
## 2001 7112069 5242535 4461979 5240995
## 2002 8245227 5865014 4908979 5779958
## 2003 7791791 5344613 4913707 5756193
## 2004 6690366 5444948 4824940 5791208
## 2005 7057213 6694523 4313019 6181548
## 2006 7296355 5104799 4458429 6226214
## 2007 7666970 5785964 4907057 6047292
## 2008 6283324 5101803 4555602 6442746
## 2009 7583146 5566075 5339890 7089880
## 2010 7819472 5875917 4800733 6152583
## 2011 8943599 5603920 6154138 8273142
## 2012 7559148 5576852 5731899 6609694
## 2013 7968220 5759367 5769083 9606304
The data is monthly, from 1998 through 2013, and we are asked to forecast every month of 2014. First, I checked for missing values and found only one (September 2008), which I replaced with the median of the KWH column before converting the data frame into a ts object. The data is now ready for the next steps.
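As an aside, forecast::na.interp would have filled that single gap with a seasonality-aware estimate rather than the overall median; a sketch of the alternative, which would replace the median step above:

# Sketch: seasonal interpolation instead of the overall median (run in place
# of the median replacement, while the NA is still present)
# power$KWH <- as.numeric(na.interp(ts(power$KWH, start = c(1998, 1), frequency = 12)))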
# Before Transformation
ggtsdisplay(power_ts, main="Before Transformation, Monthly Power Consumption")
# Box Cox Transformation
power_boxcox <- power_ts %>% BoxCox(lambda= 'auto')
ggtsdisplay(power_boxcox, main='BoxCox Transformation, Monthly Power Consumption')
# Seasonality
ggseasonplot(power_boxcox)
The Box-Cox transformation did not make much of an improvement, and the series is still non-stationary, so I have to make it stationary before fitting any model. There is also clear seasonality: consumption dips in April and May, then rises again from June through early September. Let’s see how much differencing is required to make the series stationary.
print(paste0("Suggested number of differencing is ", ndiffs(power_boxcox)))
## [1] "Suggested number of differencing is 1"
print(paste0("Seasonal differencing: How many more differencing required ? ", ndiffs(diff(power_boxcox, lag=12))))
## [1] "Seasonal differencing: How many more differencing required ? 0"
ndiffs() suggests one difference overall, and after a seasonal (lag-12) difference no further differencing is needed, so a single seasonal difference should be enough. Let’s apply it.
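forecast also answers the seasonal question directly through nsdiffs(), which is worth running as a cross-check on the logic above:

# Number of seasonal differences recommended for the transformed series
nsdiffs(power_boxcox)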
power_diff <- power_boxcox %>% diff(lag = 12)
ggtsdisplay(power_diff, main= "BoxCox Transformation and seasonal differencing - Monthly power consumption")
After the seasonal difference the series looks stationary: the ACF no longer shows the slow seasonal decay. The seasonality in the data is what required the seasonal differencing. We can double-check by running ndiffs() again.
print(paste0("Suggested differencing : ", ndiffs(power_diff)))
## [1] "Suggested differencing : 0"
Now that the series is stationary, we can try different forecasting techniques and pick the one that performs best for predicting each month of 2014. I’ll first overlay a 12-month moving average to smooth the differenced series, then fit SES and ARIMA to see which performs better.
autoplot(power_diff, series="Data")+
autolayer(ma(power_diff, 12), series = "12-Month Moving Average")+ ggtitle("12-Month Moving Average of the Differenced Series")
forecast_power_ses <- ses(power_diff, h=12)
autoplot(forecast_power_ses)+
autolayer(fitted(forecast_power_ses), series="Fitted")
summary(forecast_power_ses)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = power_diff, h = 12)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 0.0714
##
## sigma: 1.3548
##
## AIC AICc BIC
## 1048.048 1048.184 1057.627
##
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0005184573 1.347295 0.6003728 -761.7743 987.6887 0.5469261
## ACF1
## Training set 0.1202922
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Feb 2014 0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Mar 2014 0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Apr 2014 0.07137174 -1.664929 1.807673 -2.584072 2.726815
## May 2014 0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Jun 2014 0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Jul 2014 0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Aug 2014 0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Sep 2014 0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Oct 2014 0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Nov 2014 0.07137174 -1.664929 1.807673 -2.584072 2.726815
## Dec 2014 0.07137174 -1.664929 1.807673 -2.584072 2.726815
The moving average still shows some structure, but simple exponential smoothing is clearly not useful here: its point forecast is a constant 0.0714 on the transformed, differenced scale, which is expected, since SES always projects the last estimated level forward flat. AIC is 1048, BIC is 1057 and RMSE is 1.347. Let’s fit ARIMA and then compare the two.
Next I’ll try ARIMA, which worked very well throughout Part A, and see how it goes here.
forecast_power_arima <- auto.arima(power_diff)
autoplot(forecast(forecast_power_arima,12), main="Prediction for 2014 with ARIMA")
summary(forecast_power_arima)
## Series: power_diff
## ARIMA(0,0,1)(0,0,2)[12] with non-zero mean
##
## Coefficients:
## ma1 sma1 sma2 mean
## 0.1195 -0.9264 0.0890 0.0617
## s.e. 0.0712 0.0810 0.0811 0.0198
##
## sigma^2 estimated as 0.9561: log likelihood=-257.16
## AIC=524.31 AICc=524.66 BIC=540.28
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.02555369 0.9668839 0.4542946 -1920.731 2160.928 0.4138522
## ACF1
## Training set 0.006469908
ARIMA provided better results: AIC dropped to 524 and BIC to 540, and RMSE fell from 1.347 to 0.967. I’ll select the ARIMA model here and export its 12-month prediction to Excel, as I am comfortable with these results.
test <- forecast(forecast_power_arima, h=12)
test1 <- data.frame(test)
#xlsx::write.xlsx(test1, 'C:/Users/hukha/Desktop/MS - Data Science/Data 624 - Predictive Analysis/POwer 12 months prediction.xlsx')
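One caveat: the ARIMA above was fit to the Box-Cox-transformed, seasonally differenced series, so the values exported here are on that differenced scale, not in kilowatt hours. A sketch of a route that yields 2014 forecasts directly in KWH is to let auto.arima() choose the differencing and invert the transformation itself:

# Sketch: fit on the raw monthly series; forecast() back-transforms the
# Box-Cox, so the point forecasts and intervals come out in kilowatt hours
fit_kwh <- auto.arima(power_ts, lambda = "auto")
fc_2014 <- forecast(fit_kwh, h = 12)
autoplot(fc_2014)
#xlsx::write.xlsx(data.frame(fc_2014), 'power_2014_forecast_kwh.xlsx')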
The power dataset was monthly, from 1998 to 2013, and the task was to predict each month of 2014. The data had one missing value, which was replaced with the median of the column. The data was converted into a ts object, and I compared several techniques to see which performed better. ARIMA gave the best results based on AIC, BIC and RMSE, which is why I chose the ARIMA model to predict the power consumption for each month of 2014 (see the back-transformation caveat above for reporting the forecasts in KWH).