project

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

Introduction

For this part, we need to forecast how much cash will be taken out of 4 different ATM machines for May 2010.
Data was provided for this project. The cash is in hundreds of dollars.

Data Exploration

I will begin by plotting the cash withdrawn by each ATM.

ATM4 was used at a higher rate than every other ATM in the data series. There is an unusual spike in ATM4. There are some missing ATM labels in the data. Let’s remove ATM4 and the missing label from the dataset and replot the visualization:

It seems that ATM3 was not used until recently. ATM #1 and #2 seem similar to each other. It doesn’t seem like there is much of a trend in the data. Let’s explore the distributions further:

ATM	Minimum	1st Qu.	Mean	Median	3rd Qu.	Maximum	NA’s
ATM1	1.00000	73.0000	83.8867403	91.0000	108.000	180.00	3
ATM2	0.00000	25.5000	62.5785124	67.0000	93.000	147.00	2
ATM3	0.00000	0.0000	0.7205479	0.0000	0.000	96.00	0
ATM4	1.56326	124.3344	474.0433449	403.8393	704.507	10919.76	0

There is a pattern in the time series but fluctuations are very tight. I wonder if it is explained by the day of the week. For example, Friday and Saturday might have heavier usage and a Tuesday night might be calm. Let’s examine this hypothesis:

It looks like there is some varriation based on the day of the week. Thursday looks to be less busy than the other days of the week. There is so much variation between the 4 ATMs that each will be modeled separately.

We will clean up the data to produce the forecasts. Then explore some candidate modeling techniques. Once we have a handful of candidates, we will prefrom a cross-validation on the models and get the RMSE. The model that minimizes the RMSE during cross-validation will be selected as the model of choice.

ATM_1

Data Cleanup

There are 3 missing observations that need to be cleaned up.

Now that we have a complete data set I will create time series objects. Since there is a weekly effect I will be using a frequency of 7. This does mess up the dates in the plot however so please pay no reguard to them.

One readily observes the repeating peaks on every 7th lag. There also seems to be an interesting negative correlation between the 1st and both the 3rd and the 5th lag.

Model Creation

STL Decomposition Models

I will try a couple of seasonal decomposition models. I will set the seasonal window to 7 so it picks up the day of the week variation.

This seems promising to me. I will do STL decomposition forecasts using both the ETS and ARIMA models and check their residual plots

STL + ETS

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from STL +  ETS(A,N,N)
#> Q* = 13.759, df = 12, p-value = 0.3164
#> 
#> Model df: 2.   Total lags used: 14

This model seems to hold merit and should be taken under consideration.

STL + ARIMA

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from STL +  ARIMA(0,1,2)
#> Q* = 8.4479, df = 12, p-value = 0.7492
#> 
#> Model df: 2.   Total lags used: 14

This model also preformed well. It is definately a candidate for the cross-validation stage.

Holt-Winters

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from Holt-Winters' additive method
#> Q* = 28.877, df = 3, p-value = 2.376e-06
#> 
#> Model df: 11.   Total lags used: 14

This method did a fairly good job. I will move it on to the next phase.

Holt-Winters with Box Cox Adjustment

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from Holt-Winters' additive method
#> Q* = 18.824, df = 3, p-value = 0.0002973
#> 
#> Model df: 11.   Total lags used: 14

This seems to be a strong candidate. We will see how if fares in the cross-validation.

ARIMA

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from ARIMA(0,0,1)(0,1,2)[7]
#> Q* = 15.247, df = 11, p-value = 0.1715
#> 
#> Model df: 3.   Total lags used: 14

This model looks like it is preforming well. Let’s see how all of them stack up.

Cross Validation

In order to understand how well a model is likely to preform at predicting out of sample data, I will use the tsCV function and evaluate the models. As prevously noted my goal is to minimize the RMSE. First I will get the errors from the cross validation process, then I will compute the RMSE.

Model	RMSE
ARIMA	29.67002
STL ARIMA	31.80636
STL ETS	31.81959
Adjusted Holt-Winters	34.30067
Holt-Winters	39.49958

It looks like the ARIMA model perform good job predicting on the out of sample data set.

Model Selection

The ARIMA model would be selected to produce the forecast for ATM #1. This because it is the model has more generalization and robustness across the cross validation.

ATM 2

We will repeat the same process for ATM 2, then select the most generalized robust model using the cross validation.

Data Cleanup

There are 2 missing observations. This will be cleaned up using the tsclean function again.

Once again the ACF plot has the regular spikes on evry multiple of 7.

Model Creation

STL Decomposition Models

Again I will try a couple of seasonal decomposition models. I will set the seasonal window to 7 so it picks up the day of the week variation.

This seems very similar to ATM #1.

STL + ETS

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from STL +  ETS(A,N,N)
#> Q* = 9.2372, df = 12, p-value = 0.6825
#> 
#> Model df: 2.   Total lags used: 14

This model preformed much better than the it did for the ATM #1 time series. It will be interesting to see how it does in cross-validation

STL + ARIMA

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from STL +  ARIMA(2,1,2)
#> Q* = 7.4932, df = 10, p-value = 0.6782
#> 
#> Model df: 4.   Total lags used: 14

Interesting. The residuals have a bit more spread but the left tail is shorter than the STL+ETS model.

Holt-Winters

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from Holt-Winters' additive method
#> Q* = 37.435, df = 3, p-value = 3.722e-08
#> 
#> Model df: 11.   Total lags used: 14

This method did not preform as well as the others. I will, however keep it in so I can compare it to ATM #1’s statistics.

Holt-Winters with Box Cox Adjustment

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from Holt-Winters' additive method
#> Q* = 34.776, df = 3, p-value = 1.358e-07
#> 
#> Model df: 11.   Total lags used: 14

ARIMA

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from ARIMA(2,0,2)(0,1,1)[7]
#> Q* = 10.231, df = 9, p-value = 0.3321
#> 
#> Model df: 5.   Total lags used: 14

This model looks like it is preforming well. Let’s see how all of them stack up.

Cross Validation

In order to understand how well a model is likely to preform at predicting out of sample data, I will use the tsCV function and evaluate the models. As previously noted my goal is to minimize the RMSE. First I will get the errors from the cross validation process, then I will compute the RMSE.

Model	RMSE
ARIMA	33.52037
STL ARIMA	34.17286
STL ETS	34.96679
Adjusted Holt-Winters	41.53970
Holt-Winters	55.33134

Interesting. It looks like the ARIMA model was the top preformer again. I find it interesting that the RMSE is higher and more spread out for ATM #2. ATM #1’s RMSEs ranged from roughly ~ 30 to 39. The RMSEs for ATM #2 ranged from ~ 33 to 55. There is more variability in this data that isn’t captured by the model.

Model Selection

As stated in the previous section I will use the ARIMA model to produce the forecast for ATM #2.

ATM #3

This model is quite different from the two proceeding cases. You can see it in the plot:

Most of the values are zeros (points in red above), except for 3 points (shown in blue). The three non-zero points are the most current. This presents a serious challenge.

Model Creation/Selection

There is one fundamental question with this dataset. Are the three points an outlier, or is it the begining of the new normal? If they are outliers one would expect the cash value to return to zero. If the three points are an indication of change, then the historical data have little relevance.

I will be assuming the new data are the begining of the new normal. The challenge is we only have three data points. In the absense of more data I will calculate the average of these three points and use it for the forecast with the recommendation to revise it frequently. As the average is only based off of three data points it should not be considered stable.

ATM #4

This ATM is different too, but not as radically different as ATM #3. I will be using the same approach used for ATM #1 and #2 with this ATM.

Data Cleanup

There is one major outlier in the data set. We will clean it up by simply by using the tsclean function once again.

Once again there is the familiar patter of peaks on the multiples of seven.

Model Creation

STL Decomposition Models

Again I will try a couple of seasonal decomposition models. I will set the seasonal window to 7 so it picks up the day of the week variation.

This seems very similar to ATM #1.

STL + ETS

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from STL +  ETS(A,N,N)
#> Q* = 30.61, df = 12, p-value = 0.002258
#> 
#> Model df: 2.   Total lags used: 14

This model did a fair job. There are a couple of ACF spikes that are outside the bands. Let’s see if the ARIMA model does better:

STL + ARIMA

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from STL +  ARIMA(1,0,1) with non-zero mean
#> Q* = 21.78, df = 11, p-value = 0.02614
#> 
#> Model df: 3.   Total lags used: 14

This is an improvement over the STL+ETS model. There are still a couple of spikes on the ACF that falls outside the threshold.

Holt-Winters

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from Holt-Winters' additive method
#> Q* = 18.279, df = 3, p-value = 0.0003852
#> 
#> Model df: 11.   Total lags used: 14

This method seems to have preformed better than any of the STL models.

Holt-Winters with Box Cox Adjustment

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from Holt-Winters' additive method
#> Q* = 18.934, df = 3, p-value = 0.0002822
#> 
#> Model df: 11.   Total lags used: 14

This model preformed very well. It doesn’t look like any of the ACF spikes are outside the bands. This is a contender for sure.

ARIMA

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from ARIMA(0,0,3)(1,0,0)[7] with non-zero mean
#> Q* = 16.96, df = 9, p-value = 0.04934
#> 
#> Model df: 5.   Total lags used: 14

This model did fairly well, but it seems like the Holt-Winters with Box-Cox adjustment did better. It’s time to cross-validate and see how all of them preform.

Cross Validation

Once again, I will use the tsCV function and evaluate the models. My goal to minimize the RMSE remains for this dataset.

Model	RMSE
ARIMA	346.7434
STL ARIMA	353.7169
STL ETS	366.0626
Adjusted Holt-Winters	388.4804
Holt-Winters	504.2440

That’s an unexpected result. The Adjusted Holt-Winters (the one I thought would preform the best), did not preform as well. The ARIMA model, once again, rose to the top.

Model Selection

Once again I will use the forecasts produced by the ARIMA model as the projections for ATM #4.

Summary

I set out to create predictions for 4 different ATMs. After testing multiple approaches using cross-validation, I selected ARIMA models for ATM #1, #2 and #4, as it was the modeling technique with the lowest RMSE. For ATM #3 we used the mean of all non-zero data (3 observations). This model needs to be updated once more data becomes available. I will finish this project by exporting my forcasts in the same file format as the original data.

—–

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

The generated forcates are on Github

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Introduction

I am to forecast how much power will be consumed by residential customers in kilowatt hours for the next 12 months. Data was provided for this project spaning January 1998 until December 2013.

Data Exploration

I will begin by reading in the data and ploting the series.

	KWH
	Min. : 770523
	1st Qu.: 5429912
	Median : 6283324
	Mean : 6502475
	3rd Qu.: 7620524
	Max. :10655730
	NA’s :1

We have outliers and missing data. We will clean them up with tsclean from the forcasts package, and replot the data with the ACF and PACF:

This looks much better. There is a seasonal component that looks to be additive. I will need to use a model that captures seasonality.

Candidate Models

Since the data is highly seasonal, I we will need to use an ARIMA or Holt-Winters model. Let’s fit an models using auto.arima() and hw() and visualize their projections. I will use a couple of different parameters with the Holt-Winters model.

Residual Analysis

All models do a good job picking up the seasonality. Let’s look at the residuals to see if any of the models should not be used.

Holt-Winters Model

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from Holt-Winters' additive method
#> Q* = 61.741, df = 8, p-value = 2.121e-10
#> 
#> Model df: 16.   Total lags used: 24

There are some spikes in the ACF that indicate this is not the best model.

Adjusted Holt-Winters Model

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from Holt-Winters' additive method
#> Q* = 48.973, df = 8, p-value = 6.433e-08
#> 
#> Model df: 16.   Total lags used: 24

This is a bit better than the previous model but there are still ACF spikes outside of the bands.

Damped & Adjusted Holt-Winters Model

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from Damped Holt-Winters' additive method
#> Q* = 47.391, df = 7, p-value = 4.682e-08
#> 
#> Model df: 17.   Total lags used: 24

The preformance of this model is about the same as above.

ARIMA Model

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from ARIMA(3,0,1)(2,1,0)[12] with drift
#> Q* = 11.549, df = 17, p-value = 0.8266
#> 
#> Model df: 7.   Total lags used: 24

This model seems to be the best. There is only one ACF spike outside of the threshold and it only by a small amount. My working theory is this is the model I should use, but I will check this using cross validation.

Cross Validation

In order to understand how well a model is likely to preform at predicting out of sample data, I will use the tsCV function and evaluate the models. My goal is to minimize the RMSE. First I will get the errors from the cross validation process, then I will compute the RMSE.

Model	RMSE
ARIMA	659529.8
Holt-Winters	1010721.9
Adjusted Holt-Winters	1373617.3
Damped & Adjusted Holt-Winters	1547751.3

The ARIMA model had the best cross validation results.

Model Selection

Given that the ARIMA model minimized the RMSE in the cross validation, I will use it to forecast residential power consumption.

Summary

I set out to forecast residential power consumption. Four modeling techniques were tested for accuracy using cross-validation. In the end an ARIMA model was selected bercause it minimized the RMSE. Here’s my forcast for the next 12 months of power usage:

	Point Forecast	Lo 80	Hi 80	Lo 95	Hi 95
Jan 2014	9433336	8688058	10178615	8293531	10573141
Feb 2014	8570219	7785677	9354761	7370366	9770072
Mar 2014	6615312	5830157	7400467	5414521	7816102
Apr 2014	6005538	5196071	6815004	4767565	7243510
May 2014	5917569	5108101	6727036	4679595	7155543
Jun 2014	8187387	7377114	8997660	6948182	9426592
Jul 2014	9528779	8717809	10339750	8288507	10769052
Aug 2014	9991953	9180963	10802943	8751650	11232255
Sep 2014	8546843	7735715	9357971	7306330	9787356
Oct 2014	5856327	5045196	6667458	4615809	7096845
Nov 2014	6153245	5342114	6964376	4912727	7393763
Dec 2014	7732110	6920971	8543250	6491580	8972641

—–

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

Introduction

For this project I have data on water flows from two pipes. The data are captured with timestamps. I am to forecast a week forward. In the instructions for this project I am given the following hint: For multiple recordings within an hour, take the mean. I will be on the lookout for data under the hour.

Data Exploration

I will begin by plotting the datasets:

The two pipes are quite different in terms of the data they offer. Pipe 1 has a median waterflow of about 20, while pipe 2 has a median of 40. Pipe 1 also has what looks like over a week of data, while pipe 2 has over a month. Both datasets have 1000 observations. This suggests that pipe one has more fine grain observations (i.e. multiple observations in a hour or minute) while pipe 2’s data captures a longer period of time. That would explain the difference in median and the length of time covered by the respective data sets. This also is hinted at in the instructions.

Closer Look at Pipe 1

I will look at the head and tail of pipe 1’s data to get a sense of what’s included:

Date Time	WaterFlow
2015-10-23 00:24:06	23.369599
2015-10-23 00:40:02	28.002881
2015-10-23 00:53:51	23.065895
2015-10-23 00:55:40	29.972809
2015-10-23 01:19:17	5.997953
2015-10-23 01:23:58	15.935223

There is multiple observations per minute which confirms my suspicions. I will need to create the averages per minute as directed.

Closer Look at Pipe 2

I will also take a look at pipe 2’s data to see if there is anything I need to be aware of:

Date Time	WaterFlow
2015-10-23 01:00:00	18.81079
2015-10-23 01:59:59	43.08703
2015-10-23 03:00:00	37.98770
2015-10-23 04:00:00	36.12038
2015-10-23 04:59:59	31.85126
2015-10-23 06:00:00	28.23809

Some of the timestamps are just a bit off from the hour mark. I will need to clean that up.

Data Cleanup

Pipe 1

I will clean up the pipe 1 data by creating the averages per minute.

After doing this I have 236 observations.

Pipe 2

I will clean up pipe 2’s Date Time field by rounding it to the nearest hour.

Model Creation

A Note on Process

I will develop a variety of models using different methods and will validate their performance using cross validation. The model that minimizes the error will be the model of choice.

Pipe 1 Candidates

STL + ETS

I will begin with a STL decomposition based model.

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from STL +  ETS(M,N,N)
#> Q* = 37.291, df = 45, p-value = 0.7861
#> 
#> Model df: 2.   Total lags used: 47

This model appears to have preformed fairly well. The residuals are normal and there’s only one minor ACF spike. I’ll see how this preforms in the cross validation stage.

STL + ARIMA

Next I will create an ARIMA model on the STL decomposed time series.

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from STL +  ARIMA(0,0,0) with non-zero mean
#> Q* = 37.292, df = 46, p-value = 0.8164
#> 
#> Model df: 1.   Total lags used: 47

That’s strange. It is a ARIMA(0,0,0) model. That is a white-noise model. Let’s see if this is an artifact of the STL decomposition or if we get a similar result training the ARIMA model.

ARIMA

Last of all I will try an ARIMA model using the auto.arima function to define the (p,d,q) parameters.

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from ARIMA(0,0,0) with non-zero mean
#> Q* = 37.178, df = 46, p-value = 0.82
#> 
#> Model df: 1.   Total lags used: 47

Again the best model was an ARIMA(0,0,0) model! This suggests there is no reliable way to forecast the waterflow.

Pipe 2 Candidates

STL + ARIMA

I will first check with a STL decomposition and ARIMA to see if it results in a white noise model.

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from STL +  ARIMA(0,0,0) with non-zero mean
#> Q* = 58.308, df = 47, p-value = 0.1247
#> 
#> Model df: 1.   Total lags used: 48

It did. Let’s see what kind of model auto.arima comes up with.

ARIMA

#> 
#>  Ljung-Box test
#> 
#> data:  Residuals from ARIMA(0,0,0)(0,0,1)[24] with non-zero mean
#> Q* = 55.158, df = 46, p-value = 0.1669
#> 
#> Model df: 2.   Total lags used: 48

We have another white noise model. The water flow from pipe 2 cannot be forecasted reliably.

Summary

White noise models were recommended when modeling the water flow of both pipes using the auto.arima function. This suggests that there is not a reliable way to model the waterflow.

project_1

Salma Elshahawy

2021-04-11

Part A – ATM Forecast, ATM624Data.xlsx

Introduction

Data Exploration

ATM_1

Data Cleanup

Model Creation

STL Decomposition Models

STL + ETS

STL + ARIMA

Holt-Winters

Holt-Winters with Box Cox Adjustment

ARIMA

Cross Validation

Model Selection

ATM 2

Data Cleanup

Model Creation

STL Decomposition Models

STL + ETS

STL + ARIMA

Holt-Winters

Holt-Winters with Box Cox Adjustment

ARIMA

Cross Validation

Model Selection

ATM #3

Model Creation/Selection

ATM #4

Data Cleanup

Model Creation

STL Decomposition Models

STL + ETS

STL + ARIMA

Holt-Winters

Holt-Winters with Box Cox Adjustment

ARIMA

Cross Validation

Model Selection

Summary

—–

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Introduction

Data Exploration

Candidate Models

Residual Analysis

Holt-Winters Model

Adjusted Holt-Winters Model

Damped & Adjusted Holt-Winters Model

ARIMA Model

Cross Validation

Model Selection

Summary

—–

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Introduction

Data Exploration

Closer Look at Pipe 1

Closer Look at Pipe 2

Data Cleanup

Pipe 1

Sidebar: Is there any pattern in the usage by hour?

Pipe 2

Sidebar: Is there any pattern in the usage by hour?

Model Creation

A Note on Process

Pipe 1 Candidates

STL + ETS

STL + ARIMA

ARIMA

Pipe 2 Candidates

STL + ARIMA

ARIMA

Summary