Part A – ATM Forecast, ATM624Data.xlsx

Instruction

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

About the Data

The data represents a daily time series from May 2009 to April 2010. The target variable is Cash, which gives the aggregate cash withdrawals in hundreds of dollars for the indicated ATM machine (ATM) and date (DATE).

Data Cleansing

First let’s review the data. We note the following:

Outliers: one observation (10920 for ATM4 on 2010/2/9) appears to be an outlier. This is likely a data measurement or recording error, as the value is larger by an order of magnitude compared to the distribution of the remaining ATM4 observations. Missing data: there are 14 rows in the data file without ATM and Cash data, which we exclude. In addition, there are 3 missing Cash values for ATM1 and 2 missing Cash values for ATM2. Zero values: the Cash data for ATM3 is mostly zeros; in fact ATM3 has non-zero Cash data for only the last 3 days starting on 2010/4/28. This may indicate a new ATM machine that came online at the end of the measurement period. Regarding the outlier for ATM4, ideally we would investigate the outlier with the source of the data and then decide how to treat it. For instance, if we knew the outlier was explained by a holiday or other special circumstance, then we would keep it and note it as an outlier when reviewing the results of our analysis. On the other hand, if we knew the outlier to be a data error, we would either remove it or replace it with an estimate (imputation). However, in this case we don’t know for sure, so we go with our intuition that this is a data error, in which case we treat it as missing data.

Regarding the missing data for ATM1 and ATM2 (as well as the outlier removed for ATM4), the simplest option would be to ignore them since we’re only missing 2-3 values out of 365 for each machine. However, the ETS and ARIMA model-fitting functions that we use in the modeling phase will throw errors if there are missing data in the interior of the time series. Consequently we need to impute values to the missing data, which we do by simply taking the average of the prior week and the following week, i.e., we interpolate the adjacent weekly measurements. If we knew more about the data and believed that the ATMs are subject to the same drivers of consumer behavior (for instance, if the machines are all situated in the same shopping mall), then we could undertake a more sophisticated imputation method like regresssion, in which we fit the relationship between one machine and the other three machines, and then use the regression to estimate the missing observation for the first machine as a function of the observations from the other three machines. In any case, because there are only 2-3 missing observations out of 365 for each machine, whichever imputation method we choose is not likely to bias the results in any meaningful way.

After making adjustments for the outlier and missing data, we define our time series for model fitting, which we call cashts. We set the seasonal period to 7, since we expect there may be a weekly “seasonal” pattern.

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Instruction

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Instruction

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.