This project consists of 3 parts - two required and one bonus - and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, Apr 11. I will accept late submissions with a penalty until the meetup after that, when we review some projects. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.
```r
library(fpp3)
library(forecast)
library(DataExplorer)
library(inspectdf)
library(imputeTS)
library(knitr)

atm <- read.csv('./ATM624Data.csv')
power <- read.csv('./ResidentialCustomerForecastLoad-624.csv')
waterflow1 <- read.csv('./Waterflow_Pipe1.csv')
waterflow2 <- read.csv('./Waterflow_Pipe2.csv')
```

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable 'Cash' is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast.
There is about 1.29% missing data. Most of the missing data is toward the beginning of the data set (rows 0-125), and only a fraction of it falls after that first interval, within rows 375-500. Let's explore the missing data a little further.
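A summary along the lines of the table below can be produced with a quick dplyr pass over the raw file. This is a sketch: it assumes the ATM file has columns named ATM and Cash, and that rows with no ATM label come through as empty strings or NA.

```r
library(fpp3)  # attaches dplyr and tidyr, among others

# Count the missing Cash values per ATM label and the share of all
# missing values each label accounts for. Rows with a blank or NA
# ATM label are grouped together on their own row.
atm |>
  filter(is.na(Cash)) |>
  count(ATM, name = "na_rows") |>
  mutate(na_rate = na_rows / sum(na_rows)) |>
  knitr::kable()
```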
| ATM | NA Rows | NA Rate |
|---|---|---|
| (no ATM label) | 14 | 0.7368421 |
| ATM1 | 3 | 0.1578947 |
| ATM2 | 2 | 0.1052632 |
The rows with a missing ATM label will be very hard to impute because there is no way to know which ATM the withdrawals belong to. The ATMs are evenly represented throughout the data set: each ATM has 365 rows. Therefore, assigning an ATM label based on how often each ATM appears in the data set is not possible. For this reason the best option is to remove those data points.
For the remaining five missing data points there are several options. Given that there are 125 to 166 rows for each ATM before the first missing value, we could let R estimate the values using an ARIMA model or a moving average model. Alternatively, we could simply take the median withdrawal for each ATM, but that method would not account for any seasonality or trend in the data. Therefore, before we impute the data we need to understand the trend and seasonality of the series.
Based on the plots above, a moving average imputation method for ATM 1 and ATM 2 would be the best option. Additionally, it seems that ATM 3 was only recently put into circulation, as the data before April 28th shows no cash withdrawals. These values should be removed from the data set as they will not be useful in forecasting future ATM 3 withdrawals. Finally, ATM 4 has an extremely large outlier that lends itself to being replaced using the same moving average imputation method.
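A sketch of how this cleaning could be coded with imputeTS::na_ma(). The reshaping, the mdy() date parser, and the outlier cutoff for ATM 4 are assumptions; adjust them to the actual raw format.

```r
library(fpp3)       # attaches dplyr, tidyr, and lubridate
library(imputeTS)   # na_ma(): moving-average imputation

atm_clean <- atm |>
  # Drop the rows with no ATM label, as decided above.
  filter(!is.na(ATM) & ATM != "") |>
  # Assumed date format; use mdy_hms() or similar if a time component exists.
  mutate(DATE = mdy(DATE)) |>
  # One column per ATM, one row per day.
  pivot_wider(names_from = ATM, values_from = Cash) |>
  mutate(
    ATM1 = na_ma(ATM1, k = 4, weighting = "exponential"),
    ATM2 = na_ma(ATM2, k = 4, weighting = "exponential"),
    # Hypothetical cutoff: flag the single extreme ATM4 value as NA,
    # then impute it with the same moving average.
    ATM4 = na_ma(if_else(ATM4 > 1500, NA_real_, ATM4), k = 4)
  )
# ATM3's all-zero history before late April is simply filtered out
# before modelling, as discussed above.
```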
The ATM data looks better and is now ready to be forecasted.
We start with ATM 3. Given that this ATM has only three days of activity, the best option to consider is the MEAN method.
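A minimal sketch of the MEAN forecast, assuming a daily tsibble named atm3_ts that holds only ATM 3's non-zero history (the object name is hypothetical):

```r
# Fit the mean model on ATM 3's short history and forecast May 2010.
atm3_fit <- atm3_ts |>
  model(mean_model = MEAN(Cash))

atm3_fc <- atm3_fit |>
  forecast(h = 31)

atm3_fc |> autoplot(atm3_ts)   # point forecast with 80%/95% intervals
atm3_fc |> hilo(level = 95)    # the 95% interval quoted below
```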
We see that the model is 95% confident that ATM 3 will have daily withdrawals between 70.98 and 104.35 (in hundreds of dollars) for the month of May. Given that there are only three days of data for this ATM prior to May 2010, there is little reason to doubt that this range is as accurate as we can get.
The first step to forecasting ATM 1 is to understand each component of the ATM 1 Cash graph. What we are looking for is whether the ATM Cash withdrawal has any trend or seasonality to it. To accomplish this we will use STL decomposition.
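The decomposition can be produced directly from the tsibble (a sketch; atm1_ts is an assumed daily tsibble with a Cash column):

```r
# STL decomposition of ATM 1: trend, weekly seasonality, and remainder.
atm1_ts |>
  model(STL(Cash)) |>
  components() |>
  autoplot()
```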
Looking at the decomposition of the ATM 1 cash withdrawals we see a few interesting things. First, there seems to be no trend in the data; the trend component looks like white noise, neither increasing nor decreasing at a steady rate over time. Secondly, there is some kind of weekly seasonality: the cash withdrawals drop dramatically every 7 days. Finally, the remainder becomes more variable for the more recent withdrawals. This kind of data suggests that a simple SNAIVE model would work, so we will try a SNAIVE and an ETS model and see which makes the most sense (we will also add an ARIMA model to the accuracy table to show that the extra complexity is not worth it).
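The three candidates and their accuracy comparison could be set up as follows (a sketch; the model labels match the accuracy table below, and atm1_ts is the assumed tsibble from above):

```r
atm1_fit <- atm1_ts |>
  model(
    snaive      = SNAIVE(Cash),
    etsANA      = ETS(Cash ~ error("A") + trend("N") + season("A")),
    arima001012 = ARIMA(Cash ~ pdq(0, 0, 1) + PDQ(0, 1, 2))
  )

# Plot the May forecasts side by side and compare training accuracy.
atm1_fit |> forecast(h = 31) |> autoplot(atm1_ts, level = NULL)
atm1_fit |> accuracy() |> select(.model, RMSE, RMSSE, MAE, MAPE)
```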
Given that all three models look reasonable when their forecasts are plotted, the best model should be the one with the lowest RMSE. Based on the table below the ETS(A,N,A) model seems to be the best fit. Not only does it have a smaller RMSE than the SNAIVE model, but the difference in RMSE from the ARIMA(0,0,1)(0,1,2) model is not large enough to warrant a more complex model than the ETS(A,N,A).
| Model | RMSE | RMSSE | MAE | MAPE |
|---|---|---|---|---|
| snaive | 27.75 | 1.00 | 17.70 | 116.23 |
| etsANA | 23.79 | 0.86 | 15.09 | 122.03 |
| arima001012 | 23.26 | 0.84 | 14.54 | 117.15 |
Finally, we want to check the residuals of the ETS(A,N,A) model and make sure they look like white noise.
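The diagnostics come straight from gg_tsresiduals() (reusing the hypothetical atm1_fit mable from the sketch above):

```r
# Residual time plot, ACF, and histogram for the ETS(A,N,A) model.
atm1_fit |>
  select(etsANA) |>
  gg_tsresiduals()
```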
Based on the graphs above the model's residuals look fairly normal. The histogram shows a roughly normal distribution with a few outliers, the residual plot looks like white noise, and the ACF shows no issues with autocorrelation in the series.
The ETS(A,N,A) model is the simplest and best fit model for ATM 1.
ATM 2 seems to have a similar pattern to ATM 1, but we want to check the components nonetheless to make sure that our intuition is correct. We do this again by using the STL Decomposition method.
Unlike ATM 1, there seems to be a very slight downward trend in cash withdrawals from ATM 2. There is an odd peak and drop around February 2010, but the data still seems to suggest a downward trend. We should not rely on SNAIVE or other simple methods here because they do not do a great job of capturing trend. Therefore, we will compare an ETS and an ARIMA model for this ATM (we will keep the SNAIVE model in the accuracy table for comparison).
Once again, it is hard to select a model based on the graph alone as all three models seem to be reasonable. Therefore, we need to select the model with the best accuracy scores.
| Model | RMSE | RMSSE | MAE |
|---|---|---|---|
| snaive | 30.20 | 1.00 | 20.79 |
| etsANA | 25.14 | 0.83 | 17.87 |
| arima202011 | 24.14 | 0.80 | 17.04 |
Looking at the accuracy scores above we can see that the ARIMA(2,0,2)(0,1,1) model performs slightly better than the ETS(A,N,A) model and significantly better than the SNAIVE model. For this reason, we will use the ARIMA(2,0,2)(0,1,1) model to forecast ATM 2. The last step is to check the residuals.
The residuals look like white noise. The histogram is normal, with a few outliers and a slight right skew. Additionally, the ACF plot has no values outside of the critical values. This supports our selection of the ARIMA(2,0,2)(0,1,1) model to forecast ATM 2.
For ATM 4, the first step in the process is again to decompose the data to get a better sense of any trend or seasonality.
ATM 4 seems very similar to ATM 1, in that the trend component seems to be white noise. With a weekly seasonality it seems most appropriate to test SNAIVE and an ETS model.
This data also seems fairly stationary to begin with as the ACF and PACF seem to suggest that the data is white noise. It’s worth checking an ARIMA model as well to make sure it doesn’t outperform the other two simpler models.
Before we make our decision on the model let’s check the accuracy of each model.
| Model | RMSE | MAPE | MASE | RMSSE |
|---|---|---|---|---|
| snaive | 453.03 | 438.07 | 1.00 | 1.00 |
| etsMNM | 354.09 | 466.18 | 0.80 | 0.78 |
| arima302100 | 339.52 | 519.88 | 0.81 | 0.75 |
The SNAIVE model seems to be the weakest of the three models as the RMSE is much larger than either of the other two models. Let’s take a closer look at the ETS(M,N,M) and ARIMA(3,0,2)(1,0,0) models.
The more realistic model seems to be the ETS(M,N,M) model as it keeps a similar seasonality throughout the month of May. Additionally, the ETS(M,N,M) model doesn’t reduce the variation in the cash withdrawals over the course of the month. For this reason the best option is to go with the simpler ETS(M,N,M) model.
The residuals are not normally distributed. It looks like the data could benefit from some kind of Box-Cox or log transformation.
We transform the data using a Box-Cox transformation using the Guerrero method, which provides us with a \(\lambda\) of 0.4. We re-run all the models using the transformed data to make sure that the ETS(M,N,M) model is still the most accurate.
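A sketch of the transformation step, assuming a daily tsibble atm4_ts with a Cash column. The ARIMA order is taken from the table label below, and the log variants are analogous with log(Cash):

```r
# Estimate the Box-Cox lambda with the Guerrero method (reported as ~0.4).
lambda <- atm4_ts |>
  features(Cash, features = guerrero) |>
  pull(lambda_guerrero)

# Refit the candidates on the transformed scale; fable back-transforms
# the forecasts automatically.
atm4_bx_fit <- atm4_ts |>
  model(
    snaive_bx      = SNAIVE(box_cox(Cash, lambda)),
    ets_bx         = ETS(box_cox(Cash, lambda)),   # reported as ETS(M,N,M)
    arima001200_bx = ARIMA(box_cox(Cash, lambda) ~ pdq(0, 0, 1) + PDQ(2, 0, 0))
  )
```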
| Model | RMSE | MAPE | MASE | RMSSE |
|---|---|---|---|---|
| snaive_bx | 453.03 | 438.07 | 1.00 | 1.00 |
| snaive_log | 453.03 | 438.07 | 1.00 | 1.00 |
| etsMNM_bx | 345.22 | 384.21 | 0.76 | 0.76 |
| etsANA_log | 365.20 | 323.08 | 0.79 | 0.81 |
| arima001200_bx | 354.28 | 384.58 | 0.79 | 0.78 |
| arima001200_log | 379.24 | 297.68 | 0.82 | 0.84 |
Checking the accuracy scores for the models fit to the transformed data, we can eliminate the log transformation, as it performs worse across the board except for the SNAIVE model (which is effectively unchanged by the transformation). Based on the accuracy scores, the ETS(M,N,M) on the Box-Cox transformed data seems to be the most accurate. Let's confirm this by looking at a plot of the forecasts.
Surprisingly, the SNAIVE forecast looks most in line with the data from previous months, while the ARIMA(0,0,1)(2,0,0) forecast seems the least in line. Let's examine these three forecasts more closely.
Based on the residual analysis it looks like the ARIMA(0,0,1)(2,0,0) and the SNAIVE models are the best fits; both seem to show white-noise residuals with a roughly normal distribution. Therefore, we won't consider the ETS(M,N,M) model any further.
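The Ljung-Box statistics in the table below can be computed from the innovation residuals (a sketch reusing the hypothetical atm4_bx_fit mable from above; the lag of 14, two weekly periods, is an assumption):

```r
# Portmanteau test on the residuals of the two remaining candidates.
atm4_bx_fit |>
  select(arima001200_bx, snaive_bx) |>
  augment() |>
  features(.innov, ljung_box, lag = 14)
```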
| Model | lb_stat | lb_pvalue |
|---|---|---|
| arima001200_bx | 5.836 | 0.442 |
| snaive_bx | 100.723 | 0.000 |
Finally, applying the Ljung-Box test we can see that the p-value for the ARIMA(0,0,1)(2,0,0) model is much larger than that of the SNAIVE model, whose residuals clearly fail the test. We therefore select the ARIMA(0,0,1)(2,0,0) as the final model for forecasting ATM 4 cash withdrawals.
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
We start by plotting the power data.
We immediately notice the dip around July 2010, which seems like an outlier that might need to be imputed. Additionally, taking a closer look we see a break in the data around September 2008, indicating one missing value that will be imputed.
We can now check a histogram of the data and see if it’s normally distributed or if it needs some kind of transformation. We will also check a boxplot to see if there are any other outliers we might have missed on the first KWH plot.
The boxplot seems to show that imputing the one outlier for July 2010 removed all outliers from the data. The histogram plot suggests that there might be room for a log or Box-Cox transformation.
Using the Guerrero method we get a \(\lambda\) of -0.21. This is not close enough to 0 to justify a simple log transformation, so we will stick with the lambda provided by the Guerrero method.
The histogram looks a little more normal than the original KWH histogram.
Before we begin forecasting we need to decompose the data and determine its trend and seasonality, which will help us come up with a plan.
We notice that the trend in power usage is increasing gradually. There also seems to be seasonality, but it is hard to interpret from the decomposition plot alone. We can use the gg_season() and gg_subseries() plots to get a better sense of the seasonal pattern in this data set.
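Both plots are one-liners on the monthly tsibble (assumed here to be called power_ts, with index Month and a KWH column):

```r
# Seasonal plot: one line per year, months on the x-axis.
power_ts |> gg_season(KWH)

# Subseries plot: one panel per month with the monthly mean overlaid.
power_ts |> gg_subseries(KWH)
```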
The plots above show a clear seasonality in the data. At the beginning of the year power usage is high, then it dips during the spring months, increases again during the summer months, and takes another dip during the fall months.
We will forecast models using both the original and the transformed data taking into consideration the trend and the seasonality of the data.
We begin our forecast by considering the ACF and PACF plots.
We see that the data is clearly non-stationary with strong seasonality. Therefore, we need to seasonally difference the data once.
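A sketch of this check, taking one seasonal difference at lag 12 for the monthly data:

```r
# ACF and PACF of the seasonally differenced KWH series.
power_ts |>
  gg_tsdisplay(difference(KWH, 12), plot_type = "partial")
```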
The significant spike at lag 1 in the ACF suggests a non-seasonal MA(1) component, while the significant spike at lag 12 of the ACF suggests a seasonal MA(1) component. Following the ACF we would start with an ARIMA(0,0,1)(0,1,1) model. We will also include a model based on the PACF plot: the two significant spikes at lags 1 and 12 in the ACF suggest a non-seasonal MA(2) component, and the spike at lag 24 in the PACF suggests a seasonal AR(2) component, giving an initial ARIMA(0,0,2)(2,1,0) model. Let's also add an automatic ARIMA model and see what the computer thinks is the best model.
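The two hand-picked specifications and an automatic search can be fit together (a sketch; the first two labels mirror the table below, and the automatic model is the one reported as ARIMA(0,0,4)(2,1,0)):

```r
power_fit <- power_ts |>
  model(
    arima001011 = ARIMA(KWH ~ pdq(0, 0, 1) + PDQ(0, 1, 1)),
    arima002210 = ARIMA(KWH ~ pdq(0, 0, 2) + PDQ(2, 1, 0)),
    auto        = ARIMA(KWH)   # let fable search for its own model
  )

power_fit |> accuracy() |> select(.model, RMSE)
power_fit |> glance() |> select(.model, AICc)
```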
| .model | RMSE |
|---|---|
| arima001011 | 612143 |
| arima002210 | 608519 |
| arima004210 | 590208 |
Initially we can remove the ARIMA(0,0,1)(0,1,1) model, as its RMSE is much higher than the others. Additionally, given that the automatic ARIMA came out to be an ARIMA(0,0,4)(2,1,0) model, we decided to compare an ARIMA(0,0,3)(2,1,0) model as well.
| .model | RMSE |
|---|---|
| arima002210 | 608519 |
| arima003210 | 591894 |
| arima004210 | 590208 |
We can remove the ARIMA(0,0,2)(2,1,0) model as its RMSE is still fairly high compared to the other two. Given that the final two models, ARIMA(0,0,3)(2,1,0) and ARIMA(0,0,4)(2,1,0), have very similar RMSEs, we need additional metrics to decide between them. Comparing the AICc of each model, the ARIMA(0,0,3)(2,1,0) model has an AICc of 5330.84, while the ARIMA(0,0,4)(2,1,0) model has an AICc of 5332.15. There is no reason to complicate the model further when the added term isn't improving the fit, so we select the model with the lower AICc, ARIMA(0,0,3)(2,1,0).
Again, we begin by plotting the ACF and PACF plots.
We will again seasonally difference the data once.
We will use the same model types as in the non-transformed data section, i.e. an ARIMA(0,0,1)(0,1,1), ARIMA(0,0,2)(2,1,0), ARIMA(0,0,3)(2,1,0), and an automatic ARIMA.
| .model | RMSE |
|---|---|
| arima003210 | 621126 |
| arima100210 | 633777 |
The original ARIMA(0,0,3)(2,1,0) model has a better RMSE than the automatic ARIMA model even on the transformed data, and its AICc (-1504.52) is slightly lower than that of the new ARIMA(1,0,0)(2,1,0) model (-1500.26). On the other hand, the ARIMA(1,0,0)(2,1,0) model is simpler, requiring only one autoregressive term instead of three moving average terms, so it is also a strong candidate for our forecast.
Let’s finally check the residuals of our two finalist models.
The residuals for ARIMA(1,0,0)(2,1,0) look normally distributed based on the histogram above. However, the ACF plot seems to have some kind of pattern in it, with two spikes at lags 3 and 6 that are not consistent with white noise. A Ljung-Box test p-value of 0.003 supports this conclusion.
Next we will consider the residuals of ARIMA(0,0,3)(2,1,0).
The residuals for ARIMA(0,0,3)(2,1,0) also look normally distributed based on the histogram above, and its ACF plot is consistent with white noise.
Even though the ARIMA(0,0,3)(2,1,0) model is more complex than the ARIMA(1,0,0)(2,1,0) model we will select it as our final model based on the residual analysis.
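A sketch of fitting the selected model on the transformed data and exporting the 2014 forecast to an Excel-readable file (the lambda value is the Guerrero estimate from above; the output filename is hypothetical):

```r
lambda_kwh <- -0.21   # Guerrero estimate reported above

power_final <- power_ts |>
  model(arima003210 = ARIMA(box_cox(KWH, lambda_kwh) ~ pdq(0, 0, 3) + PDQ(2, 1, 0)))

# Twelve monthly forecasts for 2014, back-transformed automatically.
power_fc <- power_final |> forecast(h = "1 year")
power_fc |> autoplot(power_ts)

# Export the point forecasts as a CSV, which Excel opens directly.
power_fc |>
  as_tibble() |>
  select(Month, .mean) |>
  rename(KWH_forecast = .mean) |>
  write.csv("PowerForecast2014.csv", row.names = FALSE)
```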
Part C consists of two data sets. These are simple 2-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and can be forecast. If so, provide a week-forward forecast and present the results via RPubs and .rmd, with the forecast in an Excel-readable file.
The first step is to understand what type of data we have.
| Date.Time | WaterFlow |
|---|---|
| 10/23/15 12:24 AM | 23.37 |
| 10/23/15 12:40 AM | 28.00 |
| 10/23/15 12:53 AM | 23.07 |
| 10/23/15 12:55 AM | 29.97 |
| 10/23/15 1:19 AM | 6.00 |
| 10/23/15 1:23 AM | 15.94 |
| Date.Time | WaterFlow |
|---|---|
| 10/23/15 1:00 AM | 18.81 |
| 10/23/15 2:00 AM | 43.09 |
| 10/23/15 3:00 AM | 37.99 |
| 10/23/15 4:00 AM | 36.12 |
| 10/23/15 5:00 AM | 31.85 |
| 10/23/15 6:00 AM | 28.24 |
Looking at the two tables above, we see that the Waterflow Pipe 1 data has multiple entries per date and hour; each data point is recorded at a specific minute of the day. The Waterflow Pipe 2 data appears to be recorded by the hour. To join the two tables together we first need to aggregate Waterflow Pipe 1 to the hour, averaging the WaterFlow column within each hour.
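A sketch of the hourly aggregation for Pipe 1. The parser assumes timestamps in the '10/23/15 12:24 AM' form shown above:

```r
library(fpp3)   # attaches lubridate and dplyr

waterflow1_hourly <- waterflow1 |>
  mutate(
    # Parse and floor to the hour. If the AM/PM parsing misbehaves,
    # as.POSIXct(Date.Time, format = "%m/%d/%y %I:%M %p") is an alternative.
    Date.Time = floor_date(mdy_hm(Date.Time), unit = "hour")
  ) |>
  group_by(Date.Time) |>
  summarise(WaterFlow = mean(WaterFlow), .groups = "drop")
```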
| Date.Time | WaterFlow |
|---|---|
| 2015-10-23 00:00:00 | 26.10280 |
| 2015-10-23 01:00:00 | 18.85202 |
| 2015-10-23 02:00:00 | 15.15857 |
| 2015-10-23 03:00:00 | 23.07886 |
| 2015-10-23 04:00:00 | 15.48219 |
| 2015-10-23 05:00:00 | 22.72539 |
Now that we have the Waterflow Pipe 1 data by the hour, we can combine it with the Waterflow Pipe 2 data and take the mean again for readings that fall in the same hour.
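Combining the two pipes could then look like this (a sketch; Pipe 2's timestamps are parsed and floored the same way):

```r
waterflow2_hourly <- waterflow2 |>
  mutate(Date.Time = floor_date(mdy_hm(Date.Time), unit = "hour"))

# Stack the two hourly series and average readings that share an hour.
waterflow_hourly <- bind_rows(waterflow1_hourly, waterflow2_hourly) |>
  group_by(Date.Time) |>
  summarise(WaterFlow = mean(WaterFlow), .groups = "drop")
```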
| Date.Time | WaterFlow |
|---|---|
| 2015-10-23 00:00:00 | 26.103 |
| 2015-10-23 01:00:00 | 18.831 |
| 2015-10-23 02:00:00 | 29.123 |
| 2015-10-23 03:00:00 | 30.533 |
| 2015-10-23 04:00:00 | 25.801 |
| 2015-10-23 05:00:00 | 27.288 |
We can finally create a tsibble, confirm what kind of data we are dealing with, and plot it to see what it looks like.
The first thing we notice is that the variation of the WaterFlow data increases over time. This suggests the data is non-stationary and will have to be differenced. Let's also check the distribution of the data and see whether there are any outliers or missing values.
There is no missing data in the Pipe WaterFlow table, which is good. Next we will check the outliers in the data.
There do not seem to be any outliers in the data. Using the unit root tests we can see that the data needs no seasonal differencing and one first-order difference.
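The unit root checks are available as features (a sketch, assuming the hourly tsibble is called waterflow_ts):

```r
# Differences suggested by the unit root tests.
waterflow_ts |> features(WaterFlow, unitroot_nsdiffs)   # seasonal differences
waterflow_ts |> features(WaterFlow, unitroot_ndiffs)    # first differences
```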
Looking at the differenced data we see that it appears much more stationary, with an exponentially decaying PACF and a significant spike at lag 1 in the ACF. This suggests an ARIMA(0,d,q) model; given that we differenced the data once and the ACF spike is at lag 1, we should consider an ARIMA(0,1,1) model.
As we noted in the previous section we want to try an ARIMA(0,1,1) model and we will compare it with the automatically selected model using the ARIMA() function.
| ACF/PACF ARIMA model | Automatic ARIMA model |
|---|---|
| <ARIMA(0,1,1)> | <ARIMA(0,1,1)(0,0,1)[24]> |
The ARIMA() function selected an ARIMA(0,1,1)(0,0,1)[24] model, noticing some daily seasonality to the data. Let’s see which model works best for the data.
| .model | RMSE | MAPE | RMSSE |
|---|---|---|---|
| arima011 | 14.756 | 49.729 | 0.741 |
| autoarima | 14.714 | 49.601 | 0.738 |
Comparing the RMSE of the ARIMA(0,1,1) model and the ARIMA(0,1,1)(0,0,1)[24] model, there doesn't seem to be much difference, and certainly not enough to justify the more complex model. Comparing AICc, the ARIMA(0,1,1) has an AICc of 8237.33 and the ARIMA(0,1,1)(0,0,1)[24] model has an AICc of 8233.73. These numbers are too close to justify the added complexity. Therefore, we will forecast the data with the ARIMA(0,1,1) model.
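A sketch of the final fit, the week-ahead forecast, and the Excel-readable export (the output filename is hypothetical):

```r
water_fit <- waterflow_ts |>
  model(arima011 = ARIMA(WaterFlow ~ pdq(0, 1, 1) + PDQ(0, 0, 0)))

# One week of hourly forecasts: 7 days * 24 hours = 168 steps.
water_fc <- water_fit |> forecast(h = 168)
water_fc |> autoplot(waterflow_ts)

water_fc |>
  as_tibble() |>
  select(Date.Time, .mean) |>
  rename(WaterFlow_forecast = .mean) |>
  write.csv("WaterflowForecast.csv", row.names = FALSE)
```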
The ARIMA(0,1,1) forecast is essentially flat at 43.2 for the week ahead, with a 95% prediction interval of roughly 15 to 70. Let's take a look at the residuals and see if this model is valid.
The residuals fit the definition of white noise and have a roughly normal distribution. Therefore, the ARIMA(0,1,1) model is a reasonable and effective model for forecasting this data.