The solo project consists of two parts. Part A tasks us with forecasting the amount of cash withdrawn from four ATMs. Part B forecasts residential power usage. The data for this project come in two Excel files. My approach is to examine the data, clean and prepare it for analysis, and compare a range of forecasting models against it. The model that performs best on RMSE will be the one chosen to forecast the next time period.
Part A III. Summary of Findings and Forecast for May 2010
Part B III. Summary of Findings and Monthly Forecast for 2014
Key Assumption
We are given no details about the ATM data. We don't know whether the machines belong to the same bank, sit in the same building, or are even in the same city. Thus, I made the assumption that these are four distinct, unrelated ATMs managed by an independent, non-bank, solo entrepreneur. These are the kinds of ATMs one finds in a convenience store or at a gas station. The entrepreneur services these ATMs by keeping them stocked with cash, and revenue comes from the user fee on each withdrawal. If the ATMs are out of cash, the entrepreneur loses money because no withdrawals can be made; if they hold too much cash, the money sits idle and the business loses money as well. So, the entrepreneur needs accurate withdrawal forecasts to maximize profits.
I begin by taking a glimpse at the raw data. There are 1,474 observations and three variables. The DATE variable is stored as an integer, not a date, and the data for all four ATMs are commingled in one dataset. There are 14 missing values in the ATM variable and 19 missing values in the Cash variable. I handle the missing data in the next section.
## Observations: 1,474
## Variables: 3
## $ DATE <dbl> 39934, 39934, 39935, 39935, 39936, 39936, 39937, 39937, 39938,...
## $ ATM <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2"...
## $ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 10...
| DATE | ATM | Cash |
|---|---|---|
| 39934 | ATM1 | 96 |
| 39934 | ATM2 | 107 |
| 39935 | ATM1 | 82 |
| 39935 | ATM2 | 89 |
| 39936 | ATM1 | 85 |
| 39936 | ATM2 | 90 |
Using the excel_numeric_to_date function from the janitor package, the dates were converted to the 'YYYY-MM-DD' format.
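A minimal sketch of that conversion, assuming the raw data frame is named ATM624Data_Raw (the report later references ATM624Data_Raw1):

library(janitor)
# Excel stores dates as serial day counts; convert them to Date objects
ATM624Data_Raw$DATE <- excel_numeric_to_date(ATM624Data_Raw$DATE)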
| DATE | ATM | Cash |
|---|---|---|
| 2009-05-01 | ATM1 | 96 |
| 2009-05-01 | ATM2 | 107 |
| 2009-05-02 | ATM1 | 82 |
| 2009-05-02 | ATM2 | 89 |
| 2009-05-03 | ATM1 | 85 |
The summary of the data shows that there are 19 observations with missing data. Further inspection shows 14 observations without any ATM or Cash information and five observations with ATM information only. Since we have no details on how this data was collected, there is no logical answer as to why these 14 values are missing. They fall on consecutive days, so attributing the missingness to holidays, or to some other explanation like an equipment replacement, would be speculation.
Key Assumption
My next key assumption regards handling missing data. For these 14 rows, both the ATM identifier and the cash amount are missing. I am going to assume the values are Missing Completely at Random (MCAR), which means the missingness has no relationship to the other data. I cannot use data from any of the other ATMs because, as I stated earlier, they are all independent, and I cannot assume the rows came from one particular ATM given the uniqueness of each.
The link below goes into details about MCAR.
https://www.theanalysisfactor.com/mar-and-mcar-missing-data/
The first step in cleaning up this data will be to remove these 14 rows with no information.
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.6
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.8
## NA's :19
| DATE | ATM | Cash |
|---|---|---|
| 2009-06-13 | ATM1 | NA |
| 2009-06-16 | ATM1 | NA |
| 2009-06-18 | ATM2 | NA |
| 2009-06-22 | ATM1 | NA |
| 2009-06-24 | ATM2 | NA |
| 2010-05-01 | NA | NA |
The 14 rows with Missing Data
| DATE | ATM | Cash |
|---|---|---|
| 2010-05-01 | NA | NA |
| 2010-05-02 | NA | NA |
| 2010-05-03 | NA | NA |
| 2010-05-04 | NA | NA |
| 2010-05-05 | NA | NA |
| 2010-05-06 | NA | NA |
| 2010-05-07 | NA | NA |
| 2010-05-08 | NA | NA |
| 2010-05-09 | NA | NA |
| 2010-05-10 | NA | NA |
| 2010-05-11 | NA | NA |
| 2010-05-12 | NA | NA |
| 2010-05-13 | NA | NA |
| 2010-05-14 | NA | NA |
I created a new data frame, ATM624Data_Raw1, with the 14 rows removed.
Next, I created four new data frames, one per ATM, again because of the independence of each machine; a sketch of both steps follows.
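A minimal sketch of both steps, assuming the dplyr verbs used elsewhere in this report:

library(dplyr)
# drop the 14 rows that are missing both the ATM and Cash fields
ATM624Data_Raw1 <- ATM624Data_Raw %>% filter(!(is.na(ATM) & is.na(Cash)))
# one data frame per machine, since the ATMs are treated as independent
ATM1 <- ATM624Data_Raw1 %>% filter(ATM == "ATM1")
ATM2 <- ATM624Data_Raw1 %>% filter(ATM == "ATM2")
ATM3 <- ATM624Data_Raw1 %>% filter(ATM == "ATM3")
ATM4 <- ATM624Data_Raw1 %>% filter(ATM == "ATM4")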
ATM1
A summary of ATM1 shows that there are three NA values.
## DATE ATM Cash
## Min. :2009-05-01 Length:365 Min. : 1.00
## 1st Qu.:2009-07-31 Class :character 1st Qu.: 73.00
## Median :2009-10-30 Mode :character Median : 91.00
## Mean :2009-10-30 Mean : 83.89
## 3rd Qu.:2010-01-29 3rd Qu.:108.00
## Max. :2010-04-30 Max. :180.00
## NA's :3
When I isolated the three missing Cash values for ATM1, I saw that they all fall within June 2009.
| DATE | ATM | Cash |
|---|---|---|
| 2009-06-13 | ATM1 | NA |
| 2009-06-16 | ATM1 | NA |
| 2009-06-22 | ATM1 | NA |
k-Nearest Neighbour Imputation
I used a kNN imputation strategy to fill in the missing Cash values for ATM1, with the default value of k = 5.
“One popular technique for imputation is a K-nearest neighbor model. A new sample is imputed by finding the samples in the training set ‘closest’ to it and averages these nearby points to fill in the value.”
Kuhn, Max and Johnson, Kjell, Applied Predictive Modeling, pg. 42
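A minimal sketch of the imputation, assuming the VIM package's kNN() function, which appends an indicator column (Cash_imp) like the one shown below:

library(VIM)
# impute only the Cash column; k = 5 neighbors is the default
ATM1_Imputed <- kNN(ATM1, variable = "Cash", k = 5)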
| | DATE | ATM | Cash | Cash_imp |
|---|---|---|---|---|
| 44 | 2009-06-13 | ATM1 | 90 | TRUE |
| 47 | 2009-06-16 | ATM1 | 90 | TRUE |
| 53 | 2009-06-22 | ATM1 | 90 | TRUE |
ATM2
For ATM2, there are two NA values, and I used the same imputation strategy.
| | DATE | ATM | Cash | Cash_imp |
|---|---|---|---|---|
| 49 | 2009-06-18 | ATM2 | 89 | TRUE |
| 55 | 2009-06-24 | ATM2 | 89 | TRUE |
The box-and-whisker plots below show that ATM1 and ATM4 have outliers, while ATM3 contains no data at all except for three non-zero entries.
par(mfrow=c(2,2))
boxplot(ATM1_Imputed$Cash , col="red" , xlab="ATM1 Cash")
boxplot(ATM2_Imputed$Cash , col="blue" , xlab="ATM2 Cash")
boxplot(ATM3$Cash, col="yellow" , xlab="ATM3 Cash")
boxplot(ATM4$Cash , col="orange" , xlab="ATM4 Cash")

The Tukey Method
Invented by the mathematician John Tukey. “Tukey Fences uses the interquartile range (‘IQR’) to flag observations (financial transactions in this case) that are considered outliers. Tukey Fences defines an outlier based on the below formula: Outlier = Transaction > Q3 + 1.5 x IQR OR Transaction < Q1 - 1.5 x IQR.”
In this section, I created two functions to identify outliers. The first, Identify_Outlier, uses the Tukey method, flagging values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR. The second, tag_outlier, returns a vector of labels, "Acceptable" or "Outlier", that is added to the data frame.
Identify_Outlier <- function(value){
interquartile_range = IQR(sort(value),na.rm = TRUE)
q1 = matrix(c(quantile(sort(value),na.rm = TRUE)))[2]
q3 = matrix(c(quantile(sort(value),na.rm = TRUE)))[4]
lower = q1-(1.5*interquartile_range)
upper = q3+(1.5*interquartile_range)
bound <- c(lower, upper)
return (bound)
}

tag_outlier <- function(value) {
boundaries <- Identify_Outlier(value)
tags <- c()
counter = 1
for (i in as.numeric(value))
{
if (i >= boundaries[1] & i <= boundaries[2]){
tags[counter] <- "Acceptable"
} else{
tags[counter] <- "Outlier"
}
counter = counter +1
}
return (tags)
}

tags1 <- tag_outlier(ATM1_Imputed$Cash)
ATM1_Imputed$Cash_Outlier <- tags1
tags2<- tag_outlier(ATM2_Imputed$Cash)
ATM2_Imputed$Cash_Outlier <- tags2
tags3<- tag_outlier(ATM3$Cash)
ATM3$Cash_Outlier <- tags3
tags4<- tag_outlier(ATM4$Cash)
ATM4$Cash_Outlier <- tags4

ATM1
After applying the functions and tagging the outliers, we see that there are 51 outliers in ATM1, most of them below the first quartile. To handle the outliers, I will Winsorize the data: values outside 1.5 x IQR will be replaced with the nearer of the lower or upper boundary.
“Winsorizing or winsorization is the transformation of statistics by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers. It is named after the engineer-turned-biostatistician Charles P. Winsor (1895-1951). The effect is the same as clipping in signal processing.”
https://en.wikipedia.org/wiki/Winsorizing
| | DATE | ATM | Cash | Cash_imp | Cash_Outlier |
|---|---|---|---|---|---|
| 7 | 2009-05-07 | ATM1 | 8 | FALSE | Outlier |
| 14 | 2009-05-14 | ATM1 | 6 | FALSE | Outlier |
| 21 | 2009-05-21 | ATM1 | 20 | FALSE | Outlier |
| 28 | 2009-05-28 | ATM1 | 10 | FALSE | Outlier |
| 35 | 2009-06-04 | ATM1 | 14 | FALSE | Outlier |
| 36 | 2009-06-05 | ATM1 | 3 | FALSE | Outlier |
boundaries <- Identify_Outlier(ATM1_Imputed$Cash)
for(i in ATM1_Imputed$Cash){
if (i < boundaries[1]){
ATM1_Imputed$Cash[ATM1_Imputed$Cash == i] <- boundaries[1]
}
else if (i > boundaries[2]){
ATM1_Imputed$Cash[ATM1_Imputed$Cash==i] <- boundaries[2]
}
}

The box plot shows that the outliers in ATM1 have been replaced and ATM1 is cleaned:
## DATE ATM Cash Cash_imp Cash_Outlier
## 1 2009-05-01 ATM1 96 FALSE Acceptable
## 2 2009-05-02 ATM1 82 FALSE Acceptable
## 3 2009-05-03 ATM1 85 FALSE Acceptable
## 4 2009-05-04 ATM1 90 FALSE Acceptable
## 5 2009-05-05 ATM1 99 FALSE Acceptable
## 6 2009-05-06 ATM1 88 FALSE Acceptable
ATM2
We see that the two functions did not identify any outliers in ATM2. The box-and-whisker plot confirms this, and even though the histogram for ATM2 is not perfectly normally distributed, the skew is small enough.
## [1] DATE ATM Cash Cash_imp Cash_Outlier
## <0 rows> (or 0-length row.names)
## [1] -0.04273822
ATM2 has been cleaned:
ATM3
Key Assumption
About 99% of the values for ATM3 are zeroes even though there are date entries for the ATM. I am making the assumption that ATM3 was malfunctioning or blocked in some way until late April 2010. Thus, I cannot remove the outliers, nor can I make any reasonable forecast. If I took the average of the last three days ($87.66), ATM3 would not have been able to fulfill the 04/28/2010 withdrawal of $96. I am going to need more data to make any assumption for this machine, so I am going to use a default value of 100.
## [1] 0.9917808
| DATE | ATM | Cash | Cash_Outlier |
|---|---|---|---|
| 2010-04-26 | ATM3 | 0 | Acceptable |
| 2010-04-27 | ATM3 | 0 | Acceptable |
| 2010-04-28 | ATM3 | 96 | Outlier |
| 2010-04-29 | ATM3 | 82 | Outlier |
| 2010-04-30 | ATM3 | 85 | Outlier |
## [1] 10.92911
ATM4
I see that there are two very large outliers for ATM4, which skew the data to the right. I used the same Winsorizing approach for handling outliers as I did with ATM1.
## # A tibble: 3 x 4
## DATE ATM Cash Cash_Outlier
## <date> <chr> <dbl> <chr>
## 1 2010-01-29 ATM4 1575. Outlier
## 2 2009-09-22 ATM4 1712. Outlier
## 3 2010-02-09 ATM4 10920. Outlier
boundaries <- Identify_Outlier(ATM4$Cash)
for(i in ATM4$Cash){
if (i < boundaries[1]){
ATM4$Cash[ATM4$Cash == i] <- boundaries[1]
}
else if (i > boundaries[2]){
ATM4$Cash[ATM4$Cash == i] <- boundaries[2]
}
}
## # A tibble: 3 x 4
## DATE ATM Cash Cash_Outlier
## <date> <chr> <dbl> <chr>
## 1 2009-09-22 ATM4 1575. Outlier
## 2 2010-01-29 ATM4 1575. Outlier
## 3 2010-02-09 ATM4 1575. Outlier
My custom function below converts the three cleaned datasets to time series.
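A minimal sketch of such a conversion, assuming daily data with a weekly seasonal period (frequency = 7); the helper name to_weekly_ts is hypothetical:

# hypothetical helper: daily observations, weekly seasonality
to_weekly_ts <- function(df) ts(df$Cash, start = c(1, 1), frequency = 7)
ATM1_ts <- to_weekly_ts(ATM1_Imputed)
ATM2_ts <- to_weekly_ts(ATM2_Imputed)
ATM4_ts <- to_weekly_ts(ATM4)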
ATM1_ts
The autoplot below shows that ATM1 cash withdrawals tick up over the first 10 weeks, dip slightly from weeks 20 to 30, and level off from weeks 38 to 53.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The seasonal, polar, and subseries plots of ATM1 below show that withdrawals run at the same or nearly the same level on most days of the week, except Saturdays, when there is a huge drop-off. This may mean the schedule for restocking ATM1 needs to be revisited.
The Seasonal and Trend decomposition using Loess (STL) shows no discernible trend over the length of the time series apart from the low points at weeks 25 and 43. We also see less seasonal variation between weeks 22 and 44.
These plots show that the data is not stationary. There are also strong positive correlations at lags of 7, 14, and 21 days.
ATM2_ts
The autoplot for ATM2 shows a slight downward trend over the span of the time series along with a decrease in the seasonal variation.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The seasonal plot of ATM2 shows that withdrawals change over time. In the later weeks, Wednesdays and Thursdays see fewer withdrawals while Fridays and Saturdays see peaks, a reversal of the pattern in the earlier weeks.
The polar and subseries plots show a big drop off in Fridays and Saturday withdrawals.
The seasonal plot for ATM2 does not show an increase or decrease in the trend over time.
These plots show that the data is not stationary. There are also strong positive correlations at lags of 7, 14, and 21 days.
The lag plots show the strong correlation at lags 7, 14, and 21.
ATM4_ts
The autoplot for ATM4 shows a very slight downward trend beginning at week 20; the trend then evens out after week 35.
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The seasonal, polar, and subseries plots show the most activity on Sundays and Thursdays, with a huge drop-off on Saturdays.
There is no discernible trend for ATM4, although the seasonal variation around week 40 seems to tighten.
These plots show that the data is not stationary. There are also strong positive correlations at lags of 7 and 14 days.
The lag plots show the strong correlation at lags 7 and 14.
In this section, I split the data into training and test sets and then evaluated a range of time series models.
Key Assumption
I used different classes of models from the basic to the more advanced to gauge which one would have the best RMSE score.
Below, I split each time series into training and test sets and plotted each one.
Plot the Split Time Series
Each time series has been split into 43 seven-day periods for the training set and roughly 10 seven-day periods for the test set. I set the forecast horizon to h = 71 during training of each model, and then evaluate each model's 71-day forecast against the test set.
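A minimal sketch of the split, assuming the last 71 observations are held out as the test set:

h <- 71
n <- length(ATM1_ts)
# train on everything up to the final h days; test on the final h days
ATM1_train <- window(ATM1_ts, end = time(ATM1_ts)[n - h])
ATM1_test <- window(ATM1_ts, start = time(ATM1_ts)[n - h + 1])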
ATM1
SEASONAL NAIVE FORECAST
I begin with the seasonal naive forecast, where each forecast is set equal to the last observed value from the same season. This is one of the more basic time series models, and it provides a ballpark estimate of the anticipated withdrawals.
I trained the model to forecast the next 71 days (the length of the test set) and then evaluated its accuracy. The seasonal naive method yields a test RMSE of 54.37. A check of the residuals shows that there is still unmodeled information in them, and there is correlation in the lags. The very small p-value from the Ljung-Box test shows that we can reject the null hypothesis that the residuals are independent.
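A minimal sketch of this fit and evaluation, assuming the forecast package and the train/test objects above (the object name is hypothetical):

library(forecast)
Fit_ATM1_SNaive <- snaive(ATM1_train, h = h) # repeat last same-weekday value
accuracy(Fit_ATM1_SNaive, ATM1_test) # the accuracy table below
checkresiduals(Fit_ATM1_SNaive) # residual plots and Ljung-Box test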
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | 0.4320557 | 26.24980 | 16.87456 | -7.745595 | 23.67519 | 1.000000 | 0.1366565 | NA |
| Test set | -15.5070423 | 54.37002 | 44.94366 | -64.370018 | 92.52089 | 2.663397 | 0.2410099 | 0.7745795 |
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 57.005, df = 14, p-value = 3.904e-07
##
## Model df: 0. Total lags used: 14
SIMPLE EXPONENTIAL SMOOTHING (SES)
The next model, Simple Exponential Smoothing (SES), assumes neither trend nor seasonality in the data. SES assigns larger weights to more recent observations, and the weights decrease exponentially the further back an observation is. SES's test RMSE of 29.99595 is already lower than the seasonal naive model's, which suggests more advanced models may get us closer to a better forecast.
A check of the residuals shows that there is still unmodeled information in the SES residuals. There are strong correlations at lags 7 and 14, which means the residuals are not white noise. The very small p-value from the Ljung-Box test shows that we can reject the null hypothesis that the residuals are independent.
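The corresponding sketch under the same assumptions:

Fit_ATM1_SES <- ses(ATM1_train, h = h) # exponentially decaying weights
accuracy(Fit_ATM1_SES, ATM1_test)
checkresiduals(Fit_ATM1_SES)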
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | 1.493721 | 34.26072 | 26.56414 | -39.15612 | 63.15481 | 1.574212 | 0.0800819 | NA |
| Test set | -1.900756 | 29.99595 | 21.42742 | -39.03785 | 56.65852 | 1.269805 | -0.0532251 | 0.3660872 |
##
## Ljung-Box test
##
## data: Residuals from Simple exponential smoothing
## Q* = 316.42, df = 12, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 14
HOLT WINTERS
The next model, Holt-Winters, captures trend and seasonality, a departure from the earlier models, which were based only on past observations. I used the additive method because, as we saw earlier, neither the trend nor the seasonal variation grows over time. The Holt-Winters approach yields a test RMSE of 48.66, higher than SES though lower than the seasonal naive method.
Given the small p-value again, we have to reject the null hypothesis and conclude that the model shows a lack of fit.
Forecast_ATM1_Withdrawal_Additive <- hw(ATM1_train, h = h, seasonal = "additive")
autoplot(Forecast_ATM1_Withdrawal_Additive)

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -0.3136776 | 20.62943 | 13.31811 | -9.611706 | 23.31315 | 0.789242 | 0.1421814 | NA |
| Test set | -16.4063009 | 48.66144 | 37.35665 | -75.744872 | 97.35845 | 2.213785 | 0.0001390 | 0.462075 |
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 14.96, df = 3, p-value = 0.001851
##
## Model df: 11. Total lags used: 14
HOLT WINTERS with DAMPED METHOD
Holt-Winters with a damped trend flattens the trend over time. This approach yields a test RMSE of 45.76854. The residuals still do not show independence.
The Holt-Winters approaches may not be performing as well as the other models primarily due to the lack of trend and seasonality in the data.
Forecast_ATM1_Withdrawal_ADDITIVE_Damped <- hw(ATM1_train, h = h, seasonal = "additive", damped=TRUE)
autoplot(Forecast_ATM1_Withdrawal_ADDITIVE_Damped)

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -0.0633654 | 20.59383 | 13.28339 | -10.08907 | 23.10647 | 0.7871842 | 0.1480174 | NA |
| Test set | -11.1128971 | 45.76854 | 34.61795 | -65.54882 | 89.83930 | 2.0514871 | 0.0094684 | 0.4562158 |
##
## Ljung-Box test
##
## data: Residuals from Damped Holt-Winters' additive method
## Q* = 14.987, df = 3, p-value = 0.001827
##
## Model df: 12. Total lags used: 15
ARIMA Models
The next class to try is ARIMA (Auto Regressive Integrated Moving Average) models. I first tested whether ATM1_train was stationary: the KPSS test statistic of 0.4081 is below the 5% critical value of 0.463, so the series is stationary and no differencing is needed. I then fit several candidate ARIMA models; the non-seasonal model with the lowest AICc was ARIMA(1,0,1). Using auto.arima, which added a seasonal component, the best model was ARIMA(1,0,1)(0,1,1)[7], and its residuals show independence.
The test RMSE for the auto.arima model, 43.66641, is higher than the SES model's. However, the SES model is a poorer fit because its residuals are not independent.
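A minimal sketch of the procedure, assuming the urca package for the KPSS test and forecast for the model search:

library(urca)
summary(ur.kpss(ATM1_train)) # KPSS statistic vs. critical values
ndiffs(ATM1_train) # number of differences needed: 0
Fit_ATM1_ARIMA <- auto.arima(ATM1_train) # selects ARIMA(1,0,1)(0,1,1)[7]
checkresiduals(Fit_ATM1_ARIMA)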
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 0.4081
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
## [1] 0.09231135
## [1] 0
## Series: ATM1_train
## ARIMA(1,0,1) with non-zero mean
##
## Coefficients:
## ar1 ma1 mean
## -0.7253 0.8820 85.7531
## s.e. 0.0696 0.0434 2.1222
##
## sigma^2 estimated as 1125: log likelihood=-1448.48
## AIC=2904.96 AICc=2905.09 BIC=2919.69
## Series: ATM1_train
## ARIMA(0,0,1)(1,0,1)[7] with non-zero mean
##
## Coefficients:
## ma1 sar1 sma1 mean
## 0.2090 0.9950 -0.8762 85.2799
## s.e. 0.0548 0.0061 0.0659 11.6250
##
## sigma^2 estimated as 458: log likelihood=-1322.66
## AIC=2655.31 AICc=2655.52 BIC=2673.73
## Series: ATM1_train
## ARIMA(0,0,1)(0,0,1)[7] with non-zero mean
##
## Coefficients:
## ma1 sma1 mean
## 0.1107 0.4921 85.9188
## s.e. 0.0633 0.0411 2.7123
##
## sigma^2 estimated as 808.6: log likelihood=-1400.84
## AIC=2809.68 AICc=2809.81 BIC=2824.41
## Series: ATM1_train
## ARIMA(1,0,1)(0,1,1)[7]
##
## Coefficients:
## ar1 ma1 sma1
## 0.5466 -0.2967 -0.9140
## s.e. 0.2016 0.2260 0.0441
##
## sigma^2 estimated as 445.9: log likelihood=-1287.43
## AIC=2582.85 AICc=2582.99 BIC=2597.49
## Series: ATM1_train
## ARIMA(1,0,1)(0,1,1)[7]
##
## Coefficients:
## ar1 ma1 sma1
## 0.5466 -0.2967 -0.9140
## s.e. 0.2016 0.2260 0.0441
##
## sigma^2 estimated as 445.9: log likelihood=-1287.43
## AIC=2582.85 AICc=2582.99 BIC=2597.49
## RMSE MAE MAPE MASE
## Training set 20.75324 13.54075 21.46173 0.8024353
## Test set 43.66641 34.03410 83.13523 2.0168875
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,1)[7]
## Q* = 8.5273, df = 11, p-value = 0.6654
##
## Model df: 3. Total lags used: 14
ATM2
I followed the same approach for ATMs 2 and 4.
SEASONAL NAIVE FORECAST
The seasonal naive method yields a test RMSE of 40.13551. A check of the residuals shows that there is still unmodeled information in the seasonal naive model for ATM2, and there is a strong negative correlation at lag 7. The very small p-value from the Ljung-Box test shows that we can reject the null hypothesis that the residuals are independent.
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -0.0836237 | 31.74254 | 21.90941 | -Inf | Inf | 1.000000 | 0.0807212 | NA |
| Test set | -3.2253521 | 40.13551 | 32.12676 | -Inf | Inf | 1.466346 | 0.0907574 | 0 |
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 88.664, df = 14, p-value = 6.781e-13
##
## Model df: 0. Total lags used: 14
SIMPLE EXPONENTIAL SMOOTHING (SES)
The next model is SES, which has a lower RMSE than the seasonal naive method: 39.19669. A check of the residuals shows strong correlations in the lags, and the very small p-value from the Ljung-Box test means we can reject the null hypothesis that the residuals are independent.
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -3.1817681 | 38.23963 | 32.16199 | -Inf | Inf | 1.467953 | -0.0253341 | NA |
| Test set | 0.5218097 | 39.19669 | 35.99792 | -Inf | Inf | 1.643035 | 0.0262781 | 0 |
##
## Ljung-Box test
##
## data: Residuals from Simple exponential smoothing
## Q* = 415.25, df = 12, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 14
HOLT WINTERS
The Holt-Winters model has a test RMSE of 61.04838, but once again the residuals are not independent.
Forecast_ATM2_Withdrawal_Additive <- hw(ATM2_train, h=h, seasonal = "additive")
autoplot(Forecast_ATM2_Withdrawal_Additive)

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | 0.306212 | 23.43502 | 16.03738 | -Inf | Inf | 0.731986 | 0.0833972 | NA |
| Test set | 7.155758 | 61.04838 | 55.86088 | -Inf | Inf | 2.549630 | -0.0349627 | 0 |
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 12.188, df = 3, p-value = 0.006766
##
## Model df: 11. Total lags used: 14
HOLT WINTERS with Damped
The RMSE for Holt Winters with the Damped method is 60.67069.
Forecast_ATM2_Withdrawal_Multi_Damped <- hw(ATM2_train, h = h, seasonal = "additive", damped=TRUE)
autoplot(Forecast_ATM2_Withdrawal_Multi_Damped)

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -1.748178 | 23.51733 | 16.40822 | -Inf | Inf | 0.7489122 | 0.0845853 | NA |
| Test set | 1.591616 | 60.67069 | 55.37023 | -Inf | Inf | 2.5272352 | -0.0357414 | 0 |
##
## Ljung-Box test
##
## data: Residuals from Damped Holt-Winters' additive method
## Q* = 12.117, df = 3, p-value = 0.006993
##
## Model df: 12. Total lags used: 15
ARIMA Models
Running through the same procedure as for ATM1, the best ARIMA model is ARIMA(0,0,1)(0,1,1)[7] with drift, which includes a seasonal component. The residuals show independence as well.
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 5 lags.
##
## Value of test-statistic is: 1.9423
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
## [1] 0.6073268
## [1] 1
## Series: ATM2_train
## ARIMA(1,1,0)(1,1,1)[7]
##
## Coefficients:
## ar1 sar1 sma1
## -0.4670 -0.0212 -0.8798
## s.e. 0.0521 0.0673 0.0321
##
## sigma^2 estimated as 785.7: log likelihood=-1363.1
## AIC=2734.2 AICc=2734.34 BIC=2748.82
## Series: ATM2_train
## ARIMA(2,1,2)(1,0,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1
## -0.6721 0.1267 -0.2272 -0.7644 0.9966 -0.8803
## s.e. 0.2037 0.0610 0.1963 0.1951 0.0027 0.0324
##
## sigma^2 estimated as 564.8: log likelihood=-1349.76
## AIC=2713.53 AICc=2713.92 BIC=2739.29
## Series: ATM2_train
## ARIMA(1,1,1)(0,1,1)[7]
##
## Coefficients:
## ar1 ma1 sma1
## 0.0901 -0.9999 -0.8877
## s.e. 0.0607 0.0141 0.0300
##
## sigma^2 estimated as 561.5: log likelihood=-1319.7
## AIC=2647.4 AICc=2647.54 BIC=2662.03
## Series: ATM2_train
## ARIMA(1,1,1)(1,1,1)[7]
##
## Coefficients:
## ar1 ma1 sar1 sma1
## 0.0880 -1.000 -0.0624 -0.8765
## s.e. 0.0608 0.014 0.0680 0.0340
##
## sigma^2 estimated as 561.6: log likelihood=-1319.28
## AIC=2648.57 AICc=2648.78 BIC=2666.85
## Series: ATM2_train
## ARIMA(0,0,1)(0,1,1)[7] with drift
##
## Coefficients:
## ma1 sma1 drift
## 0.0828 -0.8934 -0.0829
## s.e. 0.0578 0.0289 0.0296
##
## sigma^2 estimated as 558.9: log likelihood=-1319.1
## AIC=2646.2 AICc=2646.34 BIC=2660.84
## Series: ATM2_train
## ARIMA(0,0,1)(0,1,1)[7] with drift
##
## Coefficients:
## ma1 sma1 drift
## 0.0828 -0.8934 -0.0829
## s.e. 0.0578 0.0289 0.0296
##
## sigma^2 estimated as 558.9: log likelihood=-1319.1
## AIC=2646.2 AICc=2646.34 BIC=2660.84
## RMSE MAE MAPE MASE
## Training set 23.23479 15.45587 Inf 0.7054445
## Test set 60.02590 55.58951 Inf 2.5372440
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1)(0,1,1)[7]
## Q* = 8.5273, df = 11, p-value = 0.6654
##
## Model df: 3. Total lags used: 14
ATM4
SEASONAL NAIVE FORECAST
The seasonal naive method yields a test RMSE of 537.9173. A check of the residuals shows that there is still unmodeled information in the seasonal naive model for ATM4 and, as with ATM2, a strong negative correlation at lag 7. The very small p-value from the Ljung-Box test shows that we can reject the null hypothesis that the residuals are independent.
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | 4.490081 | 462.9165 | 351.3659 | -364.6133 | 420.9499 | 1.000000 | 0.0507322 | NA |
| Test set | -208.441862 | 537.9173 | 443.9707 | -1546.3391 | 1580.9963 | 1.263556 | -0.2048002 | 0.8476732 |
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 80.29, df = 14, p-value = 2.5e-11
##
## Model df: 0. Total lags used: 14
SIMPLE EXPONENTIAL SMOOTHING (SES)
The next model is SES, which has a lower RMSE than the seasonal naive method at 309.5954. A check of the residuals shows some correlation in the lags, and the small p-value from the Ljung-Box test means we can reject the null hypothesis that the residuals are independent.
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | -0.00023 | 364.0978 | 304.0387 | -538.9422 | 572.6629 | 0.8653051 | 0.0767115 | NA |
| Test set | -29.41341 | 309.5954 | 258.8207 | -781.1552 | 809.4647 | 0.7366129 | -0.0538860 | 0.2651169 |
##
## Ljung-Box test
##
## data: Residuals from Simple exponential smoothing
## Q* = 23.797, df = 12, p-value = 0.02167
##
## Model df: 2. Total lags used: 14
HOLT WINTERS
The Holt-Winters model has a test RMSE of 405.0530. The small p-value from the Ljung-Box test means we must reject the null hypothesis that the residuals are independent.
Forecast_ATM4_Withdrawal_Additive <- hw(ATM4_train, h=h, seasonal = "additive")
autoplot(Forecast_ATM4_Withdrawal_Additive)

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | 23.75904 | 330.9405 | 260.7417 | -347.7947 | 412.2755 | 0.7420802 | 0.0543639 | NA |
| Test set | -147.00463 | 405.0530 | 330.1076 | -1178.4626 | 1203.6448 | 0.9394981 | 0.0121899 | 0.2934021 |
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 17.254, df = 3, p-value = 0.0006265
##
## Model df: 11. Total lags used: 14
HOLT WINTERS with Damped
The RMSE for Holt-Winters with the damped method is 384.4139, and there is still unmodeled information in the residuals.
Forecast_ATM4_Withdrawal_Additive_Damped <- hw(ATM4_train, h = h, seasonal = "additive", damped=TRUE)
autoplot(Forecast_ATM4_Withdrawal_Additive_Damped)

| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil’s U |
|---|---|---|---|---|---|---|---|---|
| Training set | 1.12895 | 328.4283 | 258.1154 | -409.0397 | 442.8975 | 0.7346056 | 0.0845203 | NA |
| Test set | -74.74384 | 384.4139 | 317.7409 | -1036.1018 | 1070.9761 | 0.9043019 | 0.0117797 | 0.2592924 |
##
## Ljung-Box test
##
## data: Residuals from Damped Holt-Winters' additive method
## Q* = 20.41, df = 3, p-value = 0.0001395
##
## Model df: 12. Total lags used: 15
ARIMA Models
Following the same procedure, the best ARIMA model was ARIMA(0,0,0)(1,0,0)[7] with a test RMSE of 311.767, essentially on par with SES but, unlike SES, with independent residuals.
## [1] 0.423857
## [1] 0
#Transform the series with BoxCox function and the best lambda
ATM4_train_trans <- BoxCox(ATM4_train, lambda)
## Series: ATM4_train
## ARIMA(1,0,1) with non-zero mean
##
## Coefficients:
## ar1 ma1 mean
## 0.1942 -0.1178 454.0816
## s.e. 0.7614 0.7711 23.1723
##
## sigma^2 estimated as 133118: log likelihood=-2150.11
## AIC=4308.23 AICc=4308.37 BIC=4322.96
## Series: ATM4_train
## ARIMA(0,0,1)(0,0,1)[7] with non-zero mean
##
## Coefficients:
## ma1 sma1 mean
## 0.0671 0.1630 454.6767
## s.e. 0.0579 0.0542 25.7854
##
## sigma^2 estimated as 129107: log likelihood=-2145.71
## AIC=4299.42 AICc=4299.56 BIC=4314.16
## Series: ATM4_train
## ARIMA(0,0,0)(0,1,0)[7]
##
## sigma^2 estimated as 214292: log likelihood=-2168.71
## AIC=4339.42 AICc=4339.44 BIC=4343.08
## Series: ATM4_train
## ARIMA(0,0,0)(1,0,0)[7] with non-zero mean
##
## Coefficients:
## sar1 mean
## 0.1915 454.8773
## s.e. 0.0579 25.6351
##
## sigma^2 estimated as 128574: log likelihood=-2145.64
## AIC=4297.28 AICc=4297.37 BIC=4308.33
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(1,0,0)[7] with non-zero mean
## Q* = 12.298, df = 12, p-value = 0.422
##
## Model df: 2. Total lags used: 14
## RMSE MAE MAPE MASE
## Training set 357.351 299.1113 539.7646 0.8512815
## Test set 311.767 261.8851 828.3793 0.7453343
In all three experiments across different classes of models, the auto.arima model gave the best overall forecast for each ATM (for ATM4 its RMSE was essentially tied with SES, but its residuals were independent), with the exception of ATM3, which did not have enough data to make an accurate forecast. For that ATM, I would use a default value of 100 until more data is collected.
The final step is to build auto.arima models across the entire time series for ATM1, ATM2, and ATM4, and then forecast for the next 31 days.
May2010 <- data.frame(seq(as.Date("2010-05-01"), by = "day", length.out = 31))
colnames(May2010) <- c("Date")

ATM1
## Series: ATM1_ts
## ARIMA(0,0,1)(0,1,2)[7]
##
## Coefficients:
## ma1 sma1 sma2
## 0.1519 -0.5777 -0.1191
## s.e. 0.0544 0.0508 0.0501
##
## sigma^2 estimated as 476.7: log likelihood=-1612.43
## AIC=3232.85 AICc=3232.97 BIC=3248.38
ATM1_MAY2010 <- as.data.frame(ATM1.final.fit.arima %>% forecast(h = 31))
ATM1_MAY2010 <- cbind(May2010, ATM1_MAY2010)
ATM1_MAY2010_head <- head(ATM1_MAY2010, 10)

##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(0,1,2)[7]
## Q* = 18.031, df = 11, p-value = 0.08087
##
## Model df: 3. Total lags used: 14
| | Date | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
|---|---|---|---|---|---|---|
| 53.14286 | 2010-05-01 | 87.40152 | 59.420743 | 115.38230 | 44.60861 | 130.19444 |
| 53.28571 | 2010-05-02 | 104.47097 | 76.169174 | 132.77276 | 61.18711 | 147.75482 |
| 53.42857 | 2010-05-03 | 72.99315 | 44.691363 | 101.29494 | 29.70930 | 116.27701 |
| 53.57143 | 2010-05-04 | 23.96954 | -4.332252 | 52.27133 | -19.31432 | 67.25339 |
| 53.71429 | 2010-05-05 | 99.70527 | 71.403482 | 128.00706 | 56.42142 | 142.98913 |
| 53.85714 | 2010-05-06 | 80.97688 | 52.675089 | 109.27867 | 37.69303 | 124.26073 |
| 54.00000 | 2010-05-07 | 86.88567 | 58.583884 | 115.18747 | 43.60182 | 130.16953 |
| 54.14286 | 2010-05-08 | 88.16500 | 57.495003 | 118.83499 | 41.25929 | 135.07070 |
| 54.28571 | 2010-05-09 | 102.97742 | 72.254932 | 133.69991 | 55.99143 | 149.96341 |
| 54.42857 | 2010-05-10 | 73.30308 | 42.580593 | 104.02557 | 26.31709 | 120.28907 |
The forecast for ATM1 for May 2010 is exported here:
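A sketch of that export, assuming write.csv; the file name is hypothetical:

write.csv(ATM1_MAY2010, "ATM1_May2010_Forecast.csv", row.names = FALSE)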
ATM2
## Series: ATM2_ts
## ARIMA(2,0,2)(0,1,1)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4322 -0.9172 0.4729 0.8135 -0.7571
## s.e. 0.0550 0.0398 0.0860 0.0546 0.0382
##
## sigma^2 estimated as 620.1: log likelihood=-1658.84
## AIC=3329.68 AICc=3329.92 BIC=3352.96
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,2)(0,1,1)[7]
## Q* = 11.228, df = 9, p-value = 0.2604
##
## Model df: 5. Total lags used: 14
ATM2_MAY2010 <- as.data.frame(ATM2.final.fit.arima %>% forecast(h = 31))
ATM2_MAY2010 <- cbind(May2010, ATM2_MAY2010)
ATM2_MAY2010_head <- head(ATM2_MAY2010, 10)

| | Date | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
|---|---|---|---|---|---|---|
| 53.14286 | 2010-05-01 | 66.372742 | 34.46012 | 98.28536 | 17.56661 | 115.17888 |
| 53.28571 | 2010-05-02 | 72.889482 | 40.95043 | 104.82854 | 24.04291 | 121.73605 |
| 53.42857 | 2010-05-03 | 13.711143 | -18.46153 | 45.88381 | -35.49271 | 62.91500 |
| 53.57143 | 2010-05-04 | 5.262535 | -26.91373 | 37.43880 | -43.94682 | 54.47189 |
| 53.71429 | 2010-05-05 | 97.633753 | 65.28442 | 129.98309 | 48.15971 | 147.10780 |
| 53.85714 | 2010-05-06 | 88.225388 | 55.82115 | 120.62963 | 38.66738 | 137.78340 |
| 54.00000 | 2010-05-07 | 64.753541 | 32.27126 | 97.23582 | 15.07619 | 114.43089 |
| 54.14286 | 2010-05-08 | 66.381329 | 32.25868 | 100.50398 | 14.19524 | 118.56742 |
| 54.28571 | 2010-05-09 | 72.892832 | 38.74868 | 107.03698 | 20.67386 | 125.11181 |
| 54.42857 | 2010-05-10 | 13.701819 | -20.65375 | 48.05739 | -38.84049 | 66.24413 |
The forecast for ATM2 for May 2010 is exported here:
ATM3
As stated previously, there is not nearly enough data to make a realistic forecast for ATM3, and I did not want to use the average of the three observed values, as that would leave the ATM with insufficient funds to complete a withdrawal like the one on 04/28/2010. Instead, I will use a default value of 100 until I can collect more data.
default <-rep(c(100),times=c(31))
ATM3_MAY2010 <- as.data.frame(cbind(May2010, default))
ATM3_MAY2010

## Date default
## 1 2010-05-01 100
## 2 2010-05-02 100
## 3 2010-05-03 100
## 4 2010-05-04 100
## 5 2010-05-05 100
## 6 2010-05-06 100
## 7 2010-05-07 100
## 8 2010-05-08 100
## 9 2010-05-09 100
## 10 2010-05-10 100
## 11 2010-05-11 100
## 12 2010-05-12 100
## 13 2010-05-13 100
## 14 2010-05-14 100
## 15 2010-05-15 100
## 16 2010-05-16 100
## 17 2010-05-17 100
## 18 2010-05-18 100
## 19 2010-05-19 100
## 20 2010-05-20 100
## 21 2010-05-21 100
## 22 2010-05-22 100
## 23 2010-05-23 100
## 24 2010-05-24 100
## 25 2010-05-25 100
## 26 2010-05-26 100
## 27 2010-05-27 100
## 28 2010-05-28 100
## 29 2010-05-29 100
## 30 2010-05-30 100
## 31 2010-05-31 100
The forecast for ATM3 for May 2010 is exported here:
ATM4
## Series: ATM4_ts
## ARIMA(3,0,2)(1,0,0)[7] with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sar1 mean
## -0.9610 -0.6672 0.1415 1.0318 0.7564 0.2402 446.9595
## s.e. 0.1409 0.1316 0.0576 0.1355 0.1217 0.0550 26.4231
##
## sigma^2 estimated as 120826: log likelihood=-2650.22
## AIC=5316.44 AICc=5316.84 BIC=5347.64
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,2)(1,0,0)[7] with non-zero mean
## Q* = 5.2881, df = 7, p-value = 0.6249
##
## Model df: 7. Total lags used: 14
ATM4_MAY2010 <- as.data.frame(ATM4.final.fit.arima %>% forecast(h = 31))
ATM4_MAY2010 <- cbind(May2010, ATM4_MAY2010)
ATM4_MAY2010_head <- head(ATM4_MAY2010, 10)

| | Date | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
|---|---|---|---|---|---|---|
| 53.14286 | 2010-05-01 | 367.2026 | -78.26566 | 812.6708 | -314.0823 | 1048.487 |
| 53.28571 | 2010-05-02 | 395.7143 | -50.86834 | 842.2969 | -287.2749 | 1078.703 |
| 53.42857 | 2010-05-03 | 504.8763 | 58.19368 | 951.5589 | -178.2658 | 1188.018 |
| 53.57143 | 2010-05-04 | 330.4565 | -117.43716 | 778.3501 | -354.5377 | 1015.451 |
| 53.71429 | 2010-05-05 | 379.1307 | -70.01214 | 828.2735 | -307.7740 | 1066.035 |
| 53.85714 | 2010-05-06 | 410.5516 | -38.73962 | 859.8428 | -276.5800 | 1097.683 |
| 54.00000 | 2010-05-07 | 425.1336 | -24.43869 | 874.7059 | -262.4279 | 1112.695 |
| 54.14286 | 2010-05-08 | 410.4359 | -46.07617 | 866.9480 | -287.7391 | 1108.611 |
| 54.28571 | 2010-05-09 | 480.0971 | 22.88792 | 937.3062 | -219.1440 | 1179.338 |
| 54.42857 | 2010-05-10 | 424.4931 | -32.75584 | 881.7421 | -274.8088 | 1123.795 |
The forecast for ATM4 for May 2010 is exported here:
For Part B, we are given a simple dataset of residential power usage from January 1998 through December 2013. We are to model this data and provide a monthly forecast for 2014. The import shows 192 observations and three variables. The date column needs to be renamed and reformatted, and there is one missing value.
ResidentialCustomerForecastLoad624 <- readxl::read_excel("ResidentialCustomerForecastLoad-624.xlsx")
print("ResidentialCustomerForecastLoad624 has been loaded.")## [1] "ResidentialCustomerForecastLoad624 has been loaded."
## Classes 'tbl_df', 'tbl' and 'data.frame': 192 obs. of 3 variables:
## $ CaseSequence: num 733 734 735 736 737 738 739 740 741 742 ...
## $ YYYY-MMM : chr "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
## $ KWH : num 6862583 5838198 5420658 5010364 4665377 ...
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
I renamed the date column to "Date" and appended "-01" to each value so the strings parse as full dates.
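A minimal sketch of those two steps; the date format string is an assumption based on the "1998-Jan" values shown above:

library(dplyr)
ResidentialCustomerForecastLoad624 <- ResidentialCustomerForecastLoad624 %>%
  rename(Date = `YYYY-MMM`) %>%
  mutate(Date = as.Date(paste0(Date, "-01"), format = "%Y-%b-%d"))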
| CaseSequence | Date | KWH |
|---|---|---|
| 733 | 1998-01-01 | 6862583 |
| 734 | 1998-02-01 | 5838198 |
| 735 | 1998-03-01 | 5420658 |
| 736 | 1998-04-01 | 5010364 |
| 737 | 1998-05-01 | 4665377 |
| 738 | 1998-06-01 | 6467147 |
| 739 | 1998-07-01 | 8914755 |
| 740 | 1998-08-01 | 8607428 |
| 741 | 1998-09-01 | 6989888 |
| 742 | 1998-10-01 | 6345620 |
As was done in Part A, I will use kNN to impute the missing value.
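Under the same VIM assumption as in Part A:

library(VIM)
# impute the single missing KWH value; kNN() adds the KWH_imp flag below
ResidentialCustomerForecastLoad624 <- kNN(ResidentialCustomerForecastLoad624, variable = "KWH", k = 5)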
| | CaseSequence | Date | KWH | KWH_imp |
|---|---|---|---|---|
| 129 | 861 | 2008-09-01 | 6548439 | TRUE |
The box plot shows a single outlier. As in Part A, I handled it by Winsorizing; however, since there is only one outlier, I replaced it with the difference between the upper and lower Tukey boundaries, as the code below shows.
tags5 <- tag_outlier(ResidentialCustomerForecastLoad624$KWH)
ResidentialCustomerForecastLoad624$Outlier <- tags5

## CaseSequence Date KWH KWH_imp Outlier
## 151 883 2010-07-01 770523 FALSE Outlier
boundaries <- Identify_Outlier(ResidentialCustomerForecastLoad624$KWH)
for(i in ResidentialCustomerForecastLoad624$KWH){
if (i < boundaries[1]){
ResidentialCustomerForecastLoad624$KWH[ResidentialCustomerForecastLoad624$KWH == i ] <- boundaries[2] - boundaries[1]
}
else if (i > boundaries[2]){
ResidentialCustomerForecastLoad624$KWH[ResidentialCustomerForecastLoad624$KWH == i ] <- boundaries[2] - boundaries[1]
}
}

The box plot after the outlier has been adjusted:
We see the strong seasonality and a distinct upward trend in energy usage from 1998 to 2013.
tsKWH <- ResidentialCustomerForecastLoad624 %>%
select(Date, KWH)
tsKWH$Date <- as.Date(tsKWH$Date)
KWH_ts <- ts(tsKWH$KWH, start=c(1998, 1), frequency=12)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Decomposition shows the upward trend in usage over the time series as well as a strongly defined seasonality.
KWH_ts %>% decompose(type="multiplicative") %>%
autoplot() + xlab("Year")+
ggtitle("Classical multiplicative decomposition of KWH Usage")The polar, seasonal, and series plots show that the periods of June to Aug and Nov to Feb are the heaviest usage months.
The seasonal and subseries plots confirm these peak usage months.
The gglagplot shows the highest lag correlation at month 12.
The data will be split into training and test sets, with the final three years (h = 36 months) held out as the test set.
I followed the same experimentation procedure as I did for Part A.
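A minimal sketch of the split; the cut at December 2010 is an assumption implied by the 36-month test horizon:

h <- 36
KWH_train <- window(KWH_ts, end = c(2010, 12)) # 156 months for training
KWH_test <- window(KWH_ts, start = c(2011, 1)) # final 36 months for testing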
SEASONAL NAIVE FORECAST
The seasonal naive model has a test RMSE of 1145248.6, and the extremely small p-value from the Ljung-Box test means the residuals are not independent.
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 64862.3 731206.9 580023.4 0.3531193 9.135197 1.000000 0.2742628
## Test set 523464.3 1145248.6 906533.9 6.1475757 11.672843 1.562926 0.1745470
## Theil's U
## Training set NA
## Test set 0.6984242
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 73.802, df = 24, p-value = 5.723e-07
##
## Model df: 0. Total lags used: 24
SIMPLE EXPONENTIAL SMOOTHING
The simple exponential smoothing test RMSE is 1964548, again with non-independent residuals.
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 601.9718 1241146 1067381 -3.903186 17.49222 1.840238 0.4620137
## Test set 1184397.2306 1964548 1555443 12.042544 18.56283 2.681691 0.3618090
## Theil's U
## Training set NA
## Test set 1.04319
##
## Ljung-Box test
##
## data: Residuals from Simple exponential smoothing
## Q* = 708.51, df = 22, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 24
HOLT WINTERS
For Holt-Winters, I set the seasonal parameter to "multiplicative" because of the increasing variance of the trend and seasonality. The resulting test RMSE of 1136637.5 is the best so far because the model captures both the trend and the seasonality. However, the residuals are not independent.
Forecast_KWH_Withdrawal_Multi <- hw(KWH_train, h = h, seasonal = "multiplicative")
autoplot(Forecast_KWH_Withdrawal_Multi)

## ME RMSE MAE MPE MAPE MASE
## Training set -39611.18 546618.7 422413.1 -1.313133 6.723668 0.728269
## Test set 761002.17 1136637.5 863245.9 9.081672 10.752058 1.488295
## ACF1 Theil's U
## Training set 0.3480073 NA
## Test set 0.1204219 0.7026044
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' multiplicative method
## Q* = 55.456, df = 8, p-value = 3.6e-09
##
## Model df: 16. Total lags used: 24
HOLT WINTERS with Damped
The damped Holt-Winters test RMSE is 1206318.5, with non-independent residuals.
Forecast_KWH_Multi_Damped <- hw(KWH_train, h = h, seasonal = "multiplicative", damped=TRUE)
autoplot(Forecast_KWH_Multi_Damped)

## ME RMSE MAE MPE MAPE MASE
## Training set 3682.852 539719.6 426777.0 -0.5395034 6.814871 0.7357927
## Test set 857755.126 1206318.5 922850.2 10.3659723 11.439927 1.5910568
## ACF1 Theil's U
## Training set 0.1318510 NA
## Test set 0.1169638 0.745225
##
## Ljung-Box test
##
## data: Residuals from Damped Holt-Winters' multiplicative method
## Q* = 53.789, df = 7, p-value = 2.589e-09
##
## Model df: 17. Total lags used: 24
ARIMA Models
Using the same procedure as before, I found that the best ARIMA model is ARIMA(0,0,1)(2,1,0)[12] with drift. Its test RMSE of 1047126.6 is the lowest of all the models above, although the Ljung-Box p-value of 0.013 suggests some correlation remains in the residuals.
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.3845
## [1] 0.3474603
## [1] 0
#Transform the series with BoxCox function and the best lambda
KWH_train_trans <- BoxCox(KWH_train, lambda)

## Series: KWH_train
## ARIMA(0,0,1)(2,1,1)[12] with drift
##
## Coefficients:
## ma1 sar1 sar2 sma1 drift
## 0.3109 -0.2827 -0.2105 -0.5307 4421.573
## s.e. 0.1009 0.1899 0.1390 0.1851 1827.769
##
## sigma^2 estimated as 3.067e+11: log likelihood=-2111.32
## AIC=4234.63 AICc=4235.25 BIC=4252.45
## Series: KWH_train
## ARIMA(1,0,1) with non-zero mean
##
## Coefficients:
## ar1 ma1 mean
## 0.1545 0.6327 6327446.1
## s.e. 0.0981 0.0660 149251.4
##
## sigma^2 estimated as 9.57e+11: log likelihood=-2371.99
## AIC=4751.98 AICc=4752.24 BIC=4764.18
## Series: KWH_train
## ARIMA(1,0,1)(0,0,1)[12] with non-zero mean
##
## Coefficients:
## ar1 ma1 sma1 mean
## 0.0586 0.6102 0.5511 6351418.0
## s.e. 0.1154 0.0854 0.0648 164964.4
##
## sigma^2 estimated as 6.558e+11: log likelihood=-2344.08
## AIC=4698.15 AICc=4698.55 BIC=4713.4
## Series: KWH_train
## ARIMA(0,0,1)(2,1,0)[12] with drift
##
## Coefficients:
## ma1 sar1 sar2 drift
## 0.2821 -0.7137 -0.4286 4979.243
## s.e. 0.1057 0.0853 0.0835 2466.045
##
## sigma^2 estimated as 3.185e+11: log likelihood=-2113.55
## AIC=4237.11 AICc=4237.54 BIC=4251.96
## Series: KWH_train
## ARIMA(0,0,1)(2,1,0)[12] with drift
##
## Coefficients:
## ma1 sar1 sar2 drift
## 0.2821 -0.7137 -0.4286 4979.243
## s.e. 0.1057 0.0853 0.0835 2466.045
##
## sigma^2 estimated as 3.185e+11: log likelihood=-2113.55
## AIC=4237.11 AICc=4237.54 BIC=4251.96
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,1)(2,1,0)[12] with drift
## Q* = 36.622, df = 20, p-value = 0.01298
##
## Model df: 4. Total lags used: 24
## RMSE MAE MAPE MASE
## Training set 534637.1 398144.7 6.358980 0.6864288
## Test set 1047126.6 765875.5 9.428987 1.3204217
Applying the auto.arima model across the entire time series to project the next 12 months shows that the best model is ARIMA(0,0,3)(2,1,0)[12] with drift.
## Series: KWH_ts
## ARIMA(0,0,3)(2,1,0)[12] with drift
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 drift
## 0.3308 0.0452 0.2391 -0.6834 -0.4000 9003.910
## s.e. 0.0789 0.0871 0.0757 0.0772 0.0793 3121.791
##
## sigma^2 estimated as 3.96e+11: log likelihood=-2659.68
## AIC=5333.35 AICc=5334 BIC=5355.7
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,3)(2,1,0)[12] with drift
## Q* = 15.236, df = 18, p-value = 0.6457
##
## Model df: 6. Total lags used: 24
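A minimal sketch of the final fit and 12-month forecast; the object name KWH.final.fit.arima follows the naming style of Part A and is hypothetical:

KWH.final.fit.arima <- auto.arima(KWH_ts) # ARIMA(0,0,3)(2,1,0)[12] with drift
KWH_2014 <- forecast(KWH.final.fit.arima, h = 12) # the table below
checkresiduals(KWH.final.fit.arima)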
| | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
|---|---|---|---|---|---|
| Jan 2014 | 10369540 | 9563069 | 11176012 | 9136149 | 11602932 |
| Feb 2014 | 8598539 | 7749091 | 9447987 | 7299420 | 9897658 |
| Mar 2014 | 7232369 | 6382139 | 8082599 | 5932054 | 8532684 |
| Apr 2014 | 6010640 | 5138824 | 6882456 | 4677313 | 7343967 |
| May 2014 | 5947452 | 5075636 | 6819268 | 4614125 | 7280779 |
| Jun 2014 | 8194390 | 7322574 | 9066206 | 6861063 | 9527717 |
| Jul 2014 | 9445287 | 8573472 | 10317103 | 8111960 | 10778615 |
| Aug 2014 | 9947311 | 9075496 | 10819127 | 8613984 | 11280638 |
| Sep 2014 | 8467560 | 7595744 | 9339376 | 7134233 | 9800887 |
| Oct 2014 | 5870569 | 4998753 | 6742385 | 4537242 | 7203896 |
| Nov 2014 | 6137679 | 5265863 | 7009495 | 4804352 | 7471006 |
| Dec 2014 | 8448896 | 7577080 | 9320712 | 7115569 | 9782223 |
The 2014 monthly forecasts are exported here: