In Part A, I want you to forecast how much cash is taken out of four different ATMs for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose, to give this a little more of a business feel.
#Libraries required
library(openxlsx)
library(dplyr)
library(tidyverse)
library(lubridate)
library(forecast)
library(tsibble)
library(feasts)
library(TSstudio)
library(data.table)
library(Hmisc)
#Import and read the data
ATM_Data <- read.xlsx("ATM624Data.xlsx")
head(ATM_Data)
## DATE ATM Cash
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
The data consists of 3 columns (DATE, ATM, and Cash). The DATE column is stored as an Excel serial number, so we will need to convert it to a proper date. First, I would like to check if there are missing values.
Check if there are NA values
sum(is.na(ATM_Data))
## [1] 33
There are 33 missing values in the data.
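A quick way to see which columns those missing values fall in (a small sketch; the per-column counts are not shown here):
#Count missing values per column
colSums(is.na(ATM_Data))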
Change the format of the date and drop the NA values
ATM_Data <- ATM_Data %>% drop_na() %>%
mutate(DATE = as.Date(DATE, origin = "1899-12-30"))
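The origin 1899-12-30 is the standard Excel serial-date epoch. As a sanity check, the first serial number in the raw data maps to the expected start of the series:
#Verify the Excel serial-date conversion on the first observation
as.Date(39934, origin = "1899-12-30")
## [1] "2009-05-01"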
Check the unique values of the ATM variable.
unique(ATM_Data$ATM)
## [1] "ATM1" "ATM2" "ATM3" "ATM4"
There are four ATMs.
summary(ATM_Data)
## DATE ATM Cash
## Min. :2009-05-01 Length:1455 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-10-31 Mode :character Median : 73.0
## Mean :2009-10-30 Mean : 155.6
## 3rd Qu.:2010-01-29 3rd Qu.: 114.0
## Max. :2010-04-30 Max. :10919.8
The summary statistics indicate the following:
The dates range from 2009-05-01 to 2010-04-30.
In the Cash column, the maximum (10,919.8) is far above the mean (155.6), median (73.0), and 3rd quartile (114.0), which points to at least one extreme outlier.
Check the total amount of cash withdrawn from each ATM
Cash_Withdrawn <- aggregate(ATM_Data$Cash, by=list(Group=ATM_Data$ATM), FUN=sum)
Cash_Withdrawn
## Group x
## 1 ATM1 30367.0
## 2 ATM2 22716.0
## 3 ATM3 263.0
## 4 ATM4 173025.8
ATM4 has the highest sum of cash withdrawn.
ATM3 has the lowest sum of cash withdrawn.
To plot each ATM separately, we spread the ATM column into one column per machine using the spread() function.
spread_data <- ATM_Data %>% spread(ATM, Cash)
head(spread_data)
## DATE ATM1 ATM2 ATM3 ATM4
## 1 2009-05-01 96 107 0 776.99342
## 2 2009-05-02 82 89 0 524.41796
## 3 2009-05-03 85 90 0 792.81136
## 4 2009-05-04 90 55 0 908.23846
## 5 2009-05-05 99 79 0 52.83210
## 6 2009-05-06 88 19 0 52.20845
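Note that spread() has since been superseded in tidyr; an equivalent call with the newer API would be:
#Equivalent reshape with the newer tidyr API
spread_data <- ATM_Data %>% pivot_wider(names_from = ATM, values_from = Cash)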
hist.data.frame(spread_data)
From the histograms above, we can observe:
ATM1 displays an approximately normal distribution with a secondary peak near zero.
ATM2 displays an approximately normal distribution, also with a secondary peak near zero.
ATM3, which has the lowest total cash withdrawn, is almost entirely zeros.
ATM4 displays a unimodal but right-skewed distribution.
#Time Series plot
ATM_Data %>% ggplot(aes(x = DATE, y = Cash, col = ATM)) +
geom_line(show.legend = FALSE) +
facet_wrap(~ ATM, ncol = 1, scales = "free_y") +
labs(title = "Cash Withdrawn per ATM per Day", x = "Date") +
scale_y_continuous("Cash Withdrawn ($100's)")
From the time series plots above, we can observe:
ATM1 and ATM2 have a fairly stable range of daily withdrawals.
ATM3 is again almost all zeros, with its only withdrawals in the last days of the series.
ATM4 shows one extreme outlier but otherwise behaves similarly to ATM1 and ATM2.
#boxplot
par(mfrow = c(1,4))
for(i in 2:5){ #iterate over atm columns
boxplot(spread_data[i],
main = sprintf("%s", names(spread_data)[i]),
col = "lightblue")
}
From the boxplots above, we observe:
ATM1 has many outliers,
ATM2 does not have any outliers,
ATM3 has 3 outliers,
ATM4 has 3 outliers, one of which is very extreme.
It is important to handle outliers for better forecasting.
First, we exclude ATM3 from the modeling section below, since with so few non-zero observations it will get a simple mean forecast instead.
spread_data2 <- dplyr::select(spread_data, -c(ATM3))
Second, we will handle outliers before building the model.
#Replace outliers (and any missing values) in ATM1, ATM2, and ATM4.
spread_data2$ATM1 <- tsclean(spread_data2$ATM1, replace.missing = TRUE)
spread_data2$ATM2 <- tsclean(spread_data2$ATM2, replace.missing = TRUE)
spread_data2$ATM4 <- tsclean(spread_data2$ATM4, replace.missing = TRUE)
summary(spread_data2)
## DATE ATM1 ATM2 ATM4
## Min. :2009-05-01 Min. : 1.00 Min. : 0.00 Min. : 1.563
## 1st Qu.:2009-07-31 1st Qu.: 73.00 1st Qu.: 26.00 1st Qu.: 124.334
## Median :2009-10-30 Median : 91.00 Median : 67.00 Median : 402.770
## Mean :2009-10-30 Mean : 84.15 Mean : 62.59 Mean : 444.757
## 3rd Qu.:2010-01-29 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 704.192
## Max. :2010-04-30 Max. :180.00 Max. :147.00 Max. :1712.075
We can see that the maximum value for ATM4 has been reduced from 10,919.8 to 1,712.1.
#convert to tsibble object
TS_SS <- spread_data2 %>%
as_tsibble(index = DATE)
For each remaining ATM, we build an ETS model and an ARIMA model and select the stronger performer based on AIC. Since AICs are only strictly comparable between models fitted to the same transformed scale, residual diagnostics serve as a secondary check.
Decomposing the ATM1 series:
#decompose series
TS_SS %>% model(classical_decomposition(ATM1, type = "add")) %>%
components() %>%
autoplot() +
labs(title = "Classical additive decomposition for ATM1 cash withdrawal")
## Warning: Removed 3 row(s) containing missing values (geom_path).
There is not a clear trend in ATM1. We explore the corresponding time series, ACF, and PACF:
#Plot time series, ACF & PACF
TS_SS %>%
gg_tsdisplay(ATM1, plot_type = 'partial') +
labs(title = "Untransformed time series, ACF, and PACF for ATM1 cash withdrawal")
From the plots above, we observe weekly seasonality: the ACF shows strong spikes at lags 7, 14, and 21, and the PACF decays at the seasonal lags. A stationary series has no predictable patterns in the long term, meaning no trend and no seasonality, so the weekly pattern will need to be differenced away.
ndiffs(TS_SS$ATM1)
## [1] 0
ATM1 does NOT require a Box-Cox transformation or non-seasonal differencing. This points to an ARIMA(p,0,q)(P,1,Q)[7] model, where p, q, P, and Q are to be determined.
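As a complementary check, nsdiffs() on a weekly-frequency copy of the series should confirm the single seasonal difference (a quick sketch):
#Seasonal differencing check on a frequency-7 copy of ATM1
nsdiffs(ts(TS_SS$ATM1, frequency = 7))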
ETS Model
TS_SS_Model <- ets(TS_SS$ATM1, model = 'ZZZ')
summary(TS_SS_Model)
## ETS(M,N,N)
##
## Call:
## ets(y = TS_SS$ATM1, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 0.0326
##
## Initial states:
## l = 79.7942
##
## sigma: 0.4352
##
## AIC AICc BIC
## 4785.562 4785.628 4797.262
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1295602 36.75059 27.45335 -179.0177 202.7145 0.7378734
## ACF1
## Training set 0.04828029
The automatic selection method of ets() has chosen an ETS(M,N,N) model: multiplicative error terms, no trend, and no seasonal component.
#Analyze residuals
checkresiduals(TS_SS_Model)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,N)
## Q* = 240.53, df = 8, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 10
The Ljung-Box test strongly rejects white-noise residuals (p < 2.2e-16), so the ETS model leaves the weekly seasonality unmodeled.
ARIMA
Arima.Fit <- function(ts){
  #Candidate Box-Cox lambda, estimated on the original series
  bc.lambda <- BoxCox.lambda(ts)
  #Rebuild as a weekly series (frequency 7)
  ts <- ts %>% as.vector() %>% ts(frequency = 7)
  aic <- Inf
  #Search all four combinations of seasonal/non-seasonal and
  #Box-Cox/untransformed fits; AICs of transformed and untransformed
  #fits are only a rough guide to each other
  for (s in c(FALSE, TRUE)){
    for (bc in c(FALSE, TRUE)){
      if (bc){
        fit <- auto.arima(ts, seasonal = s, lambda = bc.lambda, biasadj = TRUE,
                          stepwise = FALSE, approximation = FALSE)
      } else {
        fit <- auto.arima(ts, seasonal = s, stepwise = FALSE, approximation = FALSE)
      }
      #Keep the lowest-AIC fit seen so far
      if (fit$aic < aic){
        aic <- fit$aic
        best.fit <- fit
      }
    }
  }
  return(best.fit)
}
(Arima.ATM1 <- Arima.Fit(TS_SS$ATM1))
## Series: ts
## ARIMA(0,0,1)(1,1,1)[7]
##
## Coefficients:
## ma1 sar1 sma1
## 0.1968 0.1848 -0.7537
## s.e. 0.0546 0.0793 0.0552
##
## sigma^2 = 555.1: log likelihood = -1639.63
## AIC=3287.26 AICc=3287.37 BIC=3302.78
checkresiduals(Arima.ATM1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(1,1,1)[7]
## Q* = 15.095, df = 11, p-value = 0.1782
##
## Model df: 3. Total lags used: 14
Based on the output above, auto ARIMA led to an ARIMA(0,0,1)(1,1,1)[7] model with an AIC of 3287.26, well below the ETS model's 4785.56, and its residuals pass the Ljung-Box test (p = 0.178) where the ETS residuals did not.
Since a lower AIC is indicative of a stronger model, I choose the ARIMA model to forecast ATM1.
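For the record, both AICs can be pulled straight from the fitted objects (a small sketch; as noted earlier, this comparison is a rough guide rather than a formal test):
#Side-by-side AICs of the two fitted ATM1 models
c(ETS = TS_SS_Model$aic, ARIMA = Arima.ATM1$aic)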
We repeat the same workflow for ATM2.
#decompose series
TS_SS %>% model(classical_decomposition(ATM2, type = "add")) %>%
components() %>%
autoplot() +
labs(title = "Classical additive decomposition for ATM2 cash withdrawal")
## Warning: Removed 3 row(s) containing missing values (geom_path).
There is a slight downward trend in ATM2, a random component that looks like white noise, and what appears to be weekly seasonality. To confirm or deny, we consult the time series, ACF, and PACF:
#plot time series / ACF / PACF
TS_SS %>%
gg_tsdisplay(ATM2, plot_type = 'partial')
From the plots above, we again observe weekly seasonality (lags 7, 14, and 21), confirmed by strong autocorrelation spikes and decay at the seasonal lags of the PACF. In addition, we observe spikes at lags 2 and 5 that repeat in both the ACF and PACF plots.
While the case for a Box-Cox transformation, to normalize and stabilize variance, is weak, the case for seasonal differencing (D=1) is strong. We check whether non-seasonal differencing is also indicated:
ndiffs(TS_SS$ATM2)
## [1] 1
ndiffs() suggests one non-seasonal difference (d=1) for ATM2; the automated search below will weigh this against seasonal differencing and a Box-Cox transformation.
ETS Model
Fit2 = ets(TS_SS$ATM2, model = 'ZZZ')
summary(Fit2)
## ETS(A,A,N)
##
## Call:
## ets(y = TS_SS$ATM2, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 7e-04
## beta = 7e-04
##
## Initial states:
## l = 78.8479
## b = -0.0342
##
## sigma: 38.286
##
## AIC AICc BIC
## 4820.352 4820.519 4839.851
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.7238217 38.07562 32.60903 -Inf Inf 0.7664292 -0.01310284
The automatic selection method of ets() has chosen an ETS(A,A,N) model: additive errors, additive trend, and no seasonal component.
#Analyze residuals
checkresiduals(Fit2)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,A,N)
## Q* = 360.74, df = 6, p-value < 2.2e-16
##
## Model df: 4. Total lags used: 10
Again the Ljung-Box test rejects white-noise residuals for the ETS model (p < 2.2e-16).
ARIMA
(Arima.ATM2 <- Arima.Fit(TS_SS$ATM2))
## Warning in guerrero(x, lower, upper): Guerrero's method for selecting a Box-Cox
## parameter (lambda) is given for strictly positive data.
## Series: ts
## ARIMA(2,0,2)(0,1,1)[7]
## Box Cox transformation: lambda= 0.1870864
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4000 -0.9081 0.4140 0.8224 -0.6914
## s.e. 0.0709 0.0584 0.1018 0.0742 0.0449
##
## sigma^2 = 2.032: log likelihood = -634.27
## AIC=1280.55 AICc=1280.79 BIC=1303.83
checkresiduals(Arima.ATM2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 12.47, df = 9, p-value = 0.1881
##
## Model df: 5. Total lags used: 14
Auto ARIMA led to an ARIMA(2,0,2)(0,1,1)[7] model with a Box-Cox transformation. Its AIC of 1280.55 is on the transformed scale and therefore not directly comparable with the ETS model's, but its residuals pass the Ljung-Box test (p = 0.188) while the ETS residuals do not, so I select the ARIMA model to forecast ATM2.
ATM3 had only 3 non-zero data points to work with, so I elected to take the mean of these 3 values to provide a forecast:
#select ATM3 column
spread_ATM3 <- dplyr::select(spread_data, c(ATM3))
#select only non-zero rows
simple_atm3 <- tail(spread_ATM3, n=3)
#take the mean
ATM3_mean <- mean(simple_atm3$ATM3)
ATM3_mean
## [1] 87.66667
#forecast
ATM3_forecast <- ts(rep(ATM3_mean, 31), start = 1, end = 31)
The average value for ATM3 was 87.66667, so we forecast this flat value for the entire month of May.
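For parity with the other ATMs below, this flat forecast can be written out as well (a sketch; the file name simply follows the pattern used for the other machines):
#Write the flat ATM3 forecast to csv (file name assumed)
fwrite(data.frame(ATM3_forecast), "ATM3Forecast.csv", sep=",", col.names = TRUE, row.names = FALSE)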
Finally, we repeat the workflow for ATM4.
#decompose series
TS_SS %>% model(classical_decomposition(ATM4, type = "add")) %>%
components() %>%
autoplot() +
labs(title = "Classical additive decomposition for ATM4 cash withdrawal")
## Warning: Removed 3 row(s) containing missing values (geom_path).
There is no clear trend in ATM4, the random component looks like white noise, and any weekly seasonality is far weaker than for ATM1 and ATM2. To confirm or deny, we consult the time series, ACF, and PACF:
#plot time series / ACF / PACF
TS_SS %>%
gg_tsdisplay(ATM4, plot_type = 'partial')
ATM4 looks different from ATM1 and ATM2: the ACF and PACF still hint at weekly seasonality, so there may be a case for seasonal differencing, but the spikes are larger, as is the mean of the series.
ndiffs(TS_SS$ATM4)
## [1] 0
There is no need for non-seasonal differencing.
ETS Model
fit4 = ets(TS_SS$ATM4, model = 'ZZZ')
summary(fit4)
## ETS(A,N,N)
##
## Call:
## ets(y = TS_SS$ATM4, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 444.8009
##
## sigma: 351.5848
##
## AIC AICc BIC
## 6437.046 6437.113 6448.746
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1992344 350.6202 292.3004 -573.1566 606.5751 0.7568655
## ACF1
## Training set 0.07683212
The automatic selection method of ets() has chosen an ETS(A,N,N) model: additive errors, no trend, and no seasonal component.
#Analyze residuals
checkresiduals(fit4)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 24.975, df = 8, p-value = 0.00157
##
## Model df: 2. Total lags used: 10
The ETS residuals show significant remaining autocorrelation (p = 0.0016).
ARIMA
(Arima.ATM4 <- Arima.Fit(TS_SS$ATM4))
## Series: ts
## ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
## Box Cox transformation: lambda= 0.238482
##
## Coefficients:
## sar1 sar2 mean
## 0.2328 0.2112 12.1713
## s.e. 0.0516 0.0520 0.3862
##
## sigma^2 = 17.72: log likelihood = -1041.71
## AIC=2091.42 AICc=2091.53 BIC=2107.02
For ATM 4, the model selected is ARIMA(0,0,0)(2,0,0)[7] with non-zero mean and Box-Cox transformation.
checkresiduals(Arima.ATM4)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
## Q* = 20.675, df = 11, p-value = 0.03689
##
## Model df: 3. Total lags used: 14
Auto ARIMA led to an ARIMA(0,0,0)(2,0,0)[7] model with non-zero mean. Its AIC of 2091.42 is on a Box-Cox transformed scale and so not directly comparable with the ETS model's, but its residual autocorrelation (p = 0.037) is much milder than the ETS model's, so I select the ARIMA model to forecast ATM4.
ATM1_Fcst = forecast(Arima.ATM1, h=30)
print(ATM1_Fcst)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 53.14286 87.040565 56.84665 117.23448 40.86296 133.21817
## 53.28571 104.202236 73.42908 134.97540 57.13875 151.26572
## 53.42857 73.512143 42.73898 104.28530 26.44866 120.57563
## 53.57143 9.181767 -21.59139 39.95493 -37.88172 56.24525
## 53.71429 99.565861 68.79270 130.33902 52.50237 146.62935
## 53.85714 80.483913 49.71075 111.25707 33.42043 127.54740
## 54.00000 87.019776 56.24662 117.79294 39.95629 134.08326
## 54.14286 87.993527 54.58104 121.40601 36.89354 139.09352
## 54.28571 103.315610 69.80507 136.82615 52.06566 154.56556
## 54.42857 73.421988 39.91144 106.93253 22.17203 124.67194
## 54.57143 10.139357 -23.37119 43.64990 -41.11060 61.38931
## 54.71429 100.224832 66.71429 133.73538 48.97488 151.47479
## 54.85714 80.203740 46.69320 113.71428 28.95379 131.45369
## 55.00000 87.393031 53.88249 120.90357 36.14308 138.64298
## 55.14286 88.169634 53.24395 123.09532 34.75540 141.58387
## 55.28571 103.151761 68.17241 138.13112 49.65545 156.64807
## 55.42857 73.405327 38.42597 108.38468 19.90902 126.90163
## 55.57143 10.316320 -24.66304 45.29567 -43.17999 63.81263
## 55.71429 100.346610 65.36725 135.32596 46.85030 153.84292
## 55.85714 80.151964 45.17261 115.13132 26.65566 133.64827
## 56.00000 87.462008 52.48265 122.44136 33.96570 140.95831
## 56.14286 88.202178 52.01931 124.38505 32.86526 143.53910
## 56.28571 103.121482 66.89280 139.35017 47.71449 158.52847
## 56.42857 73.402248 37.17356 109.63093 17.99526 128.80924
## 56.57143 10.349022 -25.87966 46.57771 -45.05797 65.75601
## 56.71429 100.369114 64.14043 136.59780 44.96212 155.77610
## 56.85714 80.142396 43.91371 116.37108 24.73541 135.54939
## 57.00000 87.474755 51.24607 123.70344 32.06776 142.88175
## 57.14286 88.208193 50.84287 125.57352 31.06286 145.35352
## 57.28571 103.115887 65.70723 140.52454 45.90428 160.32749
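The index above is in weekly ts units rather than calendar dates. Before handing the numbers off, they can be mapped to dates (a sketch; the horizon is assumed to start 2010-05-01, and note that h = 30 leaves May's 31st day uncovered):
#Attach May 2010 calendar dates to the point forecasts (dates assumed)
ATM1_May <- data.frame(DATE = seq(as.Date("2010-05-01"), by = "day", length.out = 30),
                       Cash = as.numeric(ATM1_Fcst$mean))
head(ATM1_May)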
fwrite(data.frame(ATM1_Fcst), "ATM1Forecast.csv", sep=",", col.names = TRUE, row.names = FALSE)
autoplot(ATM1_Fcst, pi=TRUE) +
ggtitle("Monthly withdrawal forecast from ATM1")
ATM2_Fcst = forecast(Arima.ATM2, h=30)
fwrite(data.frame(ATM2_Fcst), "ATM2Forecast.csv", sep=",", col.names = TRUE, row.names = FALSE)
autoplot(ATM2_Fcst, pi=TRUE) +
ggtitle("Monthly withdrawal forecast from ATM2")
ATM4_Fcst = forecast(Arima.ATM4, h=30)
print(ATM4_Fcst)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 53.14286 265.8087 22.66715 645.3423 4.400024 1141.0761
## 53.28571 457.9805 61.55498 1041.9974 17.915335 1738.7159
## 53.42857 545.8291 82.96460 1216.3073 26.618470 1994.3830
## 53.57143 191.2427 11.55878 483.0208 1.557718 887.0983
## 53.71429 503.6930 72.45797 1133.1387 22.267342 1872.8199
## 53.85714 333.1698 34.83824 787.3177 8.185889 1358.0745
## 54.00000 434.3943 56.14645 994.5645 15.827425 1668.5228
## 54.14286 240.8718 16.62462 599.6557 2.597004 1085.8838
## 54.28571 500.5525 66.21057 1141.8068 19.021487 1909.4756
## 54.42857 492.4035 64.34569 1125.4116 18.302839 1885.2064
## 54.57143 242.1330 16.80509 602.4175 2.641001 1090.2237
## 54.71429 470.0453 59.31522 1080.2652 16.392350 1818.2149
## 54.85714 312.3895 27.94814 753.8311 5.683043 1325.3819
## 55.00000 483.1868 62.25651 1106.8304 17.504325 1857.6635
## 55.14286 357.3754 31.94206 862.4940 6.489185 1516.6060
## 55.28571 470.1714 53.34867 1098.1157 13.471469 1874.4898
## 55.42857 486.1112 56.65791 1130.8391 14.642173 1923.5948
## 55.57143 335.9058 28.31227 816.7173 5.433225 1446.0661
## 55.71429 472.7015 53.86962 1103.3186 13.654386 1882.3061
## 55.85714 395.5617 38.76907 943.1258 8.588871 1639.9994
## 56.00000 461.1931 51.51355 1079.6272 12.831355 1846.6870
## 56.14286 380.5345 35.28929 914.0025 7.425002 1600.0985
## 56.28571 469.9577 52.32086 1100.6690 12.995224 1883.4746
## 56.42857 471.9786 52.73109 1104.8360 13.137424 1889.7469
## 56.57143 375.5623 34.41112 903.4810 7.158194 1583.9728
## 56.71429 464.3149 51.18097 1089.0220 12.601860 1865.9322
## 56.85714 409.7601 40.60390 975.5257 9.086580 1694.0460
## 57.00000 464.3566 51.18937 1089.1081 12.604747 1866.0620
## 57.14286 415.8014 41.28640 989.6346 9.256551 1718.1152
## 57.28571 462.2099 50.24121 1086.2696 12.219139 1864.5560
fwrite(data.frame(ATM4_Fcst), "ATM4Forecast.csv", sep=",", col.names = TRUE, row.names = FALSE)
autoplot(ATM4_Fcst, pi=TRUE) +
ggtitle("Monthly withdrawal forecast from ATM4")
Part B consists of a simple dataset of residential power usage from January 1998 through December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
#Read the data
Residential_Data <- read.xlsx("ResidentialCustomerForecastLoad-624.xlsx")
head(Residential_Data)
## CaseSequence YYYY-MMM KWH
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
summary(Residential_Data)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
There is one NA value in KWH. I will fill that value with the median of KWH.
#Replace the NA value with the median.
Residential_Data$KWH[is.na(Residential_Data$KWH)] = median(Residential_Data$KWH, na.rm = TRUE)
summary(Residential_Data)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5434539
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6501333
## 3rd Qu.:876.2 3rd Qu.: 7608792
## Max. :924.0 Max. :10655730
#Copy the data
Data <- Residential_Data
#convert to time series
Data <- ts(Data$KWH, start = c(1998,1), frequency = 12)
head(Data)
## Jan Feb Mar Apr May Jun
## 1998 6862583 5838198 5420658 5010364 4665377 6467147
summary(Data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 770523 5434539 6283324 6501333 7608792 10655730
The series runs from January of 1998 through December of 2013.
autoplot(Data) +
  ggtitle("Residential Power Consumption") +
  xlab("Date") +
  ylab("KWH")
It looks like there is an extreme low outlier in 2010, and the data are seasonal, with some trend.
Outlier handling will be important. I will use the tsclean() function again to replace this outlier with a more reasonable value.
#replace outlier using tsclean()
Data_ts <- tsclean(Data, replace.missing = TRUE)
Plot again after replacing the outlier
autoplot(Data_ts) +
xlab('Month') +
ylab('KWH') +
ggtitle('Monthly power consumption in KWH')
STL decomposition
res = stl(Data_ts, s.window = 12)
plot(res)
The STL plot above shows a strong seasonal pattern. There is no clear underlying trend, although there is a surge in power consumption after 2010. The unusually low power consumption in 2010 is apparent in the remainder panel. This is also a non-stationary time series.
ggtsdisplay(Data_ts, main='Power Usage in KWH')
Above is another time series plot, after the outlier has been imputed. The gap and the spike are now gone, while the seasonal structure in the ACF remains unchanged. Seasonal and seasonal sub-series plots would give a complementary view.
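A quick sketch of those two plots using the forecast package (output not reproduced here):
#Seasonal and seasonal sub-series plots
ggseasonplot(Data_ts)
ggsubseriesplot(Data_ts)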
We perform a box cox transformation and consult the resulting plot:
Data_ptc <- BoxCox(Data_ts, lambda=BoxCox.lambda(Data_ts))
autoplot(Data_ptc)
Let’s see what order of differencing is required to make the time series stationary:
nsdiffs(Data_ts)
## [1] 1
ndiffs(Data_ts)
## [1] 1
Both one seasonal difference (D=1) and one non-seasonal difference (d=1) are indicated.
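As a quick visual check of these suggestions, we can display the series after one seasonal and one first difference (a sketch):
#Apply one seasonal difference (lag 12) then one first difference
Data_ts %>% diff(lag = 12) %>% diff() %>%
  ggtsdisplay(main = "Twice-differenced power usage")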
To compare an ETS (exponential smoothing) model against an auto-ARIMA model (both seasonal and non-seasonal variants are searched within the ARIMA fit), we train/test split our data and use the MAPE (mean absolute percent error) on the test set for model selection.
We split the data, build an ETS model, and output the model’s training statistics; the test-set MAPE comparison follows the ARIMA fit below:
#train test split time series - using 2013
Data_split <- ts_split(Data_ts, sample.out = 12)
Train <- Data_split$train
Test <- Data_split$test
#ets model
ets_model <- ets(Train, lambda=BoxCox.lambda(Data_ts))
summary(ets_model)
## ETS(A,A,A)
##
## Call:
## ets(y = Train, lambda = BoxCox.lambda(Data_ts))
##
## Box-Cox transformation: lambda= -0.1484
##
## Smoothing parameters:
## alpha = 0.2155
## beta = 1e-04
## gamma = 1e-04
##
## Initial states:
## l = 6.0741
## b = 1e-04
## s = -0.0058 -0.026 -0.0112 0.0166 0.025 0.0197
## 0.0025 -0.0233 -0.0189 -0.0062 0.0083 0.0192
##
## sigma: 0.009
##
## AIC AICc BIC
## -743.6583 -739.8805 -689.3780
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 26852.27 581419.8 446726.4 -0.3521302 6.919776 0.7167728 0.2353739
checkresiduals(ets_model)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,A,A)
## Q* = 46.642, df = 8, p-value = 1.794e-07
##
## Model df: 16. Total lags used: 24
The ETS residuals fail the Ljung-Box test (p ≈ 1.8e-07), indicating substantial remaining autocorrelation.
ARIMA
arima.fit <- function(ts, bc.lambda = -0.2163697){  #hard-coded candidate Box-Cox lambda
  aic <- Inf
  #Search seasonal/non-seasonal and Box-Cox/untransformed fits,
  #keeping the lowest-AIC model (a rough heuristic across transformations)
  for (s in c(FALSE, TRUE)){
    for (bc in c(FALSE, TRUE)){
      if (bc){
        fit <- auto.arima(ts, seasonal = s, lambda = bc.lambda, biasadj = TRUE,
                          stepwise = FALSE, approximation = FALSE)
      } else {
        fit <- auto.arima(ts, seasonal = s, stepwise = FALSE, approximation = FALSE)
      }
      if (fit$aic < aic){
        aic <- fit$aic
        best.fit <- fit
      }
    }
  }
  return(best.fit)
}
(arima.power <- arima.fit(Train))
## Series: ts
## ARIMA(0,0,3)(2,1,0)[12] with drift
## Box Cox transformation: lambda= -0.2163697
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 drift
## 0.2824 0.0855 0.2364 -0.7569 -0.4200 0e+00
## s.e. 0.0756 0.0815 0.0690 0.0753 0.0801 1e-04
##
## sigma^2 = 1.037e-05: log likelihood = 737.16
## AIC=-1460.32 AICc=-1459.62 BIC=-1438.46
checkresiduals(arima.power)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,3)(2,1,0)[12] with drift
## Q* = 26.058, df = 18, p-value = 0.09844
##
## Model df: 6. Total lags used: 24
The ARIMA model is ARIMA(0,0,3)(2,1,0)[12] with drift and Box-Cox transformation with lambda of -0.2164.
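The test-set MAPE comparison described above can be computed as below (a sketch; since the ETS model is the one retrained on the full series next, it is presumably the one this comparison favors):
#Out-of-sample MAPE on the 2013 hold-out for each candidate model
ets_mape   <- accuracy(forecast(ets_model, h = 12), Test)["Test set", "MAPE"]
arima_mape <- accuracy(forecast(arima.power, h = 12), Test)["Test set", "MAPE"]
c(ETS = ets_mape, ARIMA = arima_mape)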
Finally, we retrain the chosen ETS model on all the data, forecast 2014, output the corresponding plot, and send the forecast to csv:
#retrain the model
ets_fit <- ets(Data_ts, lambda=BoxCox.lambda(Data_ts))
#forecast
ets_fit %>% forecast(h=12) -> power_forecast
#output forecast plot
autoplot(power_forecast)
power_forecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2014 9186406 8129409 10404311 7626789 11123582
## Feb 2014 7824540 6938720 8842718 6516725 9442890
## Mar 2014 6865850 6097969 7746874 5731663 8265469
## Apr 2014 5951986 5295523 6703645 4981888 7145390
## May 2014 5823836 5179710 6561666 4872063 6995422
## Jun 2014 7463610 6604350 8453763 6195783 9038568
## Jul 2014 8954614 7891429 10185622 7387706 10915396
## Aug 2014 9378191 8251306 10685441 7718161 11461563
## Sep 2014 8867967 7804701 10100985 7301526 10832839
## Oct 2014 6305042 5579875 7140551 5235030 7633962
## Nov 2014 5565576 4932973 6293136 4631743 6722201
## Dec 2014 6988636 6164456 7941901 5773659 8506554
The plot above highlights the 2014 forecast. It appears that an exponential smoothing model’s greater weight on recent data works well in this instance of power consumption. While there is greater variability in the data prior to the forecast window, the forecast captures the seasonality, with what appears to be higher demand in the summer and winter, while sitting between the peaks and troughs of more recent years.
fwrite(data.frame(power_forecast), "PowerForecast.csv", sep=",", col.names = TRUE, row.names = FALSE)