Introduction

This project consists of 3 parts (two required and one bonus) and is worth 15% of your grade.

Part A – ATM Forecast, ATM624Data.xlsx

In Part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that, it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast, which you will put in an Excel-readable file.

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note that for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and .rmd, and the forecast in an Excel-readable file.

Part A

Data Exploration

Data Summary

##       DATE                       ATM           Cash        
##  Min.   :2009-05-01 00:00:00   ATM1:365   Min.   :    0.0  
##  1st Qu.:2009-08-01 00:00:00   ATM2:365   1st Qu.:    0.5  
##  Median :2009-11-01 00:00:00   ATM3:365   Median :   73.0  
##  Mean   :2009-10-31 19:11:48   ATM4:365   Mean   :  155.6  
##  3rd Qu.:2010-02-01 00:00:00   NA's: 14   3rd Qu.:  114.0  
##  Max.   :2010-05-14 00:00:00              Max.   :10919.8  
##                                           NA's   :19

Missing-value Check

There are 5 missing [Cash] values in series ATM1 and ATM2 before May 2010, and all [Cash] values dated in May 2010 are missing. Because the task is to forecast how much cash is withdrawn in May 2010, the existing May 2010 rows are removed.

##       DATE              ATM           Cash        
##  Min.   :2009-05-01   ATM1:365   Min.   :    0.0  
##  1st Qu.:2009-07-31   ATM2:365   1st Qu.:    0.5  
##  Median :2009-10-30   ATM3:365   Median :   73.0  
##  Mean   :2009-10-30   ATM4:365   Mean   :  155.6  
##  3rd Qu.:2010-01-29              3rd Qu.:  114.0  
##  Max.   :2010-04-30              Max.   :10919.8  
##                                  NA's   :5
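The row removal and the per-ATM missing-value count can be sketched in base R as follows. The data frame below is a toy stand-in with the same DATE / ATM / Cash columns; its values and the name `atm_hist` are assumptions made for illustration, not the assignment data.

```r
# Toy data frame mirroring the DATE / ATM / Cash columns (values are made up).
atm <- data.frame(
  DATE = as.Date(c("2010-04-29", "2010-04-30", "2010-05-01", "2010-05-02")),
  ATM  = factor(c("ATM1", "ATM1", NA, NA)),
  Cash = c(96, NA, NA, NA)
)

# Keep only the history before the forecast month.
atm_hist <- atm[atm$DATE < as.Date("2010-05-01"), ]

# Count the remaining missing Cash values per ATM.
na_by_atm <- tapply(is.na(atm_hist$Cash), droplevels(atm_hist$ATM), sum)
print(na_by_atm)
```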

Timeline Check

The daily series were checked for continuity: there are no missing days in any of the daily time series.
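One way to run such a gap check, shown here as a minimal sketch: the helper name `has_no_daily_gaps` is hypothetical, and the date range is taken from the data summary above.

```r
# Returns TRUE when consecutive (sorted, unique) dates are exactly one day apart.
has_no_daily_gaps <- function(dates) {
  dates <- sort(unique(dates))
  all(as.numeric(diff(dates)) == 1)
}

# One full year of daily dates, 2009-05-01 to 2010-04-30.
complete <- seq(as.Date("2009-05-01"), as.Date("2010-04-30"), by = "day")

has_no_daily_gaps(complete)        # complete calendar
has_no_daily_gaps(complete[-100])  # same calendar with one day dropped
```

In practice the check would be applied to the DATE column of each ATM separately.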

Outlier Check

A significant outlier exists in the ATM4 series.
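The report does not state which outlier rule was used. As one possible approach, the sketch below flags points beyond the boxplot whiskers with base R's `boxplot.stats`; the series is a toy vector, with only the extreme value 10919.8 borrowed from the data summary.

```r
# Toy cash series: regular values plus the extreme maximum from the summary.
cash <- c(seq(50, 150, by = 2), 10919.8)

# boxplot.stats() reports points lying beyond 1.5 * IQR from the hinges.
out_values <- boxplot.stats(cash)$out
out_idx <- which(cash %in% out_values)

print(out_idx)     # position(s) of the flagged observation(s)
print(out_values)  # the flagged value(s)
```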

Data Visualization

The outliers are suppressed in the final data set.

ATM 1

Observation on Raw Data

  1. Significant weekly seasonality exists.

  2. There is no steady trend, only small fluctuations over time.

  3. The ACF decreases across the seasonal lags, and the PACF drops off after the first seasonal lag.

  4. In both the ACF and PACF, the non-seasonal lags are either within the critical limits or slightly above them.

  5. Based on the above observations, the time series atm_1 is non-stationary with significant seasonality and little trend. Seasonal differencing is required to transform atm_1 into a stationary series.

Time Series Transformation

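  1. Perform seasonal differencing with lag = 7.

  2. Check stationarity with the KPSS unit root test. The null hypothesis of the KPSS test is that the series is stationary. The test statistic (0.0153) is well below the 5% critical value (0.463), so stationarity is not rejected and the transformed series can be treated as stationary.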

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0153 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
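The two steps can be illustrated with a short base-R sketch. The helper `kpss_rejects_stationarity` is hypothetical and only encodes how a KPSS statistic is read against a critical value (the KPSS null hypothesis is stationarity, so a large statistic is evidence against it); the weekly series is a toy example.

```r
# Hypothetical helper: TRUE means the KPSS null of stationarity is rejected.
kpss_rejects_stationarity <- function(stat, crit) {
  stat > crit
}

# Statistic and 5% critical value copied from the test output above.
kpss_rejects_stationarity(0.0153, 0.463)

# Seasonal differencing at lag 7 on a toy series with a pure weekly pattern.
y <- rep(c(10, 12, 9, 11, 15, 20, 18), times = 8)
y_sdiff <- diff(y, lag = 7)

# A perfectly repeating weekly pattern is removed completely.
print(range(y_sdiff))
```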

Observation on Transformed Data

  1. The seasonal effect is eliminated after differencing; the transformed data shows no significant seasonality or trend.

  2. As the data set becomes stationary after seasonal differencing, no further differencing is needed.

  3. Because atm_1 is non-stationary with seasonality and becomes stationary after seasonal differencing, an ARIMA model with seasonal difference order D = 1 is appropriate. Because no further differencing is needed, the non-seasonal difference order is d = 0.

  4. The PACF decreases across the seasonal lags and the ACF drops off after the first seasonal lag, therefore the seasonal AR order P = 0 and the seasonal MA order Q = 1.

  5. The PACF decreases across the non-seasonal lags, with multiple lags above the critical limit, and the ACF drops off after the first non-seasonal lag, therefore the AR order p = 0 and the MA order q >= 1.

  6. From the analysis above, the suggested ARIMA models are ARIMA(0,0,>=1)(0,1,1)[7].

Build ARIMA Model

Use the auto.arima function to select the model with the lowest AICc. The result is consistent with the suggested family ARIMA(0,0,>=1)(0,1,1)[7]; the value of q obtained by auto.arima is 2.

The final model is ARIMA(0,0,2)(0,1,1)[7].

The p-value of the Ljung-Box test is greater than 0.05, so there is no evidence of remaining autocorrelation in the model residuals.

## Series: atm_1 
## ARIMA(0,0,2)(0,1,1)[7] 
## Box Cox transformation: lambda= 0.2615708 
## 
## Coefficients:
##          ma1      ma2     sma1
##       0.1126  -0.1094  -0.6418
## s.e.  0.0524   0.0520   0.0432
## 
## sigma^2 estimated as 1.764:  log likelihood=-609.99
## AIC=1227.98   AICc=1228.09   BIC=1243.5

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,2)(0,1,1)[7]
## Q* = 9.8626, df = 11, p-value = 0.5428
## 
## Model df: 3.   Total lags used: 14
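For readers without the forecast package, a model of the same orders can be fitted and checked with base R alone. This is a sketch under stated assumptions: the series is simulated (not the ATM data), `stats::arima` is used in place of auto.arima, and the Box-Cox transformation is omitted.

```r
set.seed(624)
n <- 210

# Simulate a toy seasonal series: y[t] = y[t-7] + w[t],
# where w is a seasonal MA(1) process with coefficient -0.6.
e <- rnorm(n + 7)
w <- e[8:(n + 7)] - 0.6 * e[1:n]
y <- as.numeric(stats::filter(w, filter = c(rep(0, 6), 1), method = "recursive"))
y <- ts(100 + y, frequency = 7)

# ARIMA(0,0,2)(0,1,1)[7]: the same orders as the model reported above.
fit <- arima(y, order = c(0, 0, 2),
             seasonal = list(order = c(0, 1, 1), period = 7))

# Ljung-Box test on the residuals; fitdf is the number of estimated
# ARMA coefficients, so the test has 14 - 3 = 11 degrees of freedom.
lb <- Box.test(residuals(fit), lag = 14, type = "Ljung-Box", fitdf = 3)

print(fit)
print(lb)
```

The degrees of freedom match the "df = 11" line in the output above: 14 lags minus 3 model coefficients.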

ATM 2

Observation on Raw Data

  1. Significant weekly seasonality exists.

  2. There is a slightly decreasing trend over time.

  3. The ACF is positive and decreases across the seasonal lags, and the PACF drops off after the first two seasonal lags.

  4. The ACF decreases slightly across the non-seasonal lags, and the PACF drops off after the first two lags.

  5. Based on the above observations, the time series atm_2 is non-stationary with significant seasonality and a slightly decreasing trend. Seasonal differencing is required to transform atm_2 into a stationary series.

Time Series Transformation

  1. Perform seasonal differencing with lag = 7.

  2. Check stationarity with the KPSS unit root test. The test statistic (0.0162) is well below the 5% critical value (0.463), so the null hypothesis of stationarity is not rejected and the transformed series can be treated as stationary.

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0162 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Observation on Transformed Data

  1. The seasonal effect is eliminated after differencing; the transformed data shows no significant seasonality or trend.

  2. As the data set becomes stationary after seasonal differencing, no further trend differencing is needed.

  3. Because atm_2 is non-stationary with seasonality and becomes stationary after seasonal differencing, an ARIMA model with seasonal difference order D = 1 is appropriate. Because no further differencing is needed, the non-seasonal difference order is d = 0.

  4. The PACF decreases across the seasonal lags and the ACF drops off after the first seasonal lag, therefore the seasonal AR order P = 0 and the seasonal MA order Q = 1.

  5. Both the ACF and PACF vary within or slightly above the critical limits, so neither the AR nor the MA term can be omitted: the AR order p >= 1 and the MA order q >= 1.

  6. From the analysis above, the suggested ARIMA models are ARIMA(>=1,0,>=1)(0,1,1)[7].

Build ARIMA Model

Use the auto.arima function to select the model with the lowest AICc. The result is consistent with the suggested family ARIMA(>=1,0,>=1)(0,1,1)[7]; the values of p and q obtained by auto.arima are both 3.

The final model is ARIMA(3,0,3)(0,1,1)[7] with drift.

The p-value of the Ljung-Box test is greater than 0.05, so there is no evidence of remaining autocorrelation in the model residuals.

## Series: atm_2 
## ARIMA(3,0,3)(0,1,1)[7] with drift 
## Box Cox transformation: lambda= 0.7242585 
## 
## Coefficients:
##          ar1      ar2     ar3      ma1     ma2      ma3     sma1    drift
##       0.4902  -0.4948  0.8326  -0.4823  0.3203  -0.7837  -0.7153  -0.0203
## s.e.  0.0863   0.0743  0.0614   0.1060  0.0941   0.0621   0.0453   0.0072
## 
## sigma^2 estimated as 67.52:  log likelihood=-1260.59
## AIC=2539.18   AICc=2539.69   BIC=2574.1

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,3)(0,1,1)[7] with drift
## Q* = 8.944, df = 6, p-value = 0.1768
## 
## Model df: 8.   Total lags used: 14

ATM 3

Observation on Raw Data

  1. Only 3 valid data points exist in the time series.

  2. There is not enough information to infer trend or seasonality, so developing an advanced forecast model is not possible.

  3. Instead, use the average method as the forecasting model.
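The average method simply projects the mean of the available observations over the whole horizon. A minimal sketch, in which the three values are placeholders rather than the actual ATM3 readings:

```r
# Placeholder values standing in for the three valid ATM3 observations.
atm_3_valid <- c(90, 80, 85)

# Forecast horizon: the 31 days of May 2010.
h <- 31

# Average method: every future value equals the historical mean.
atm_3_fc <- rep(mean(atm_3_valid), h)

print(head(atm_3_fc))
```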

ATM 4

Observation on Raw Data

  1. There is no stable seasonality over time.

  2. There is no stable trend over time.

  3. The fluctuation over time appears to be random.

  4. Neither the ACF nor the PACF shows a significant spike at the seasonal lags.

  5. Both the ACF and PACF stay within the critical limits except for a few spikes at the earliest lags.

  6. Based on the above observations, the time series atm_4 is stationary with no seasonality and no stable trend. Differencing is not required to transform atm_4.

Time Series Transformation

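  1. No differencing is performed because there is no seasonality; however, a Box-Cox transformation is applied to stabilize the fluctuation to some degree.

  2. Check stationarity with the KPSS unit root test. The test statistic (0.0797) is below the 10% critical value (0.347), so the null hypothesis of stationarity is not rejected.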

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0797 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Observation on Transformed Data

  1. Because the atm_4 series can be treated as stationary, an ARIMA model with difference orders D = 0 and d = 0 is appropriate.

  2. Both the ACF and PACF decrease across the seasonal lags, but the PACF decreases more sharply than the ACF and drops off after lag 21; therefore the seasonal AR order P >= 1 and the seasonal MA order Q >= 0.

  3. Both the ACF and PACF vary within or slightly above the critical limits, and the PACF shows multiple spikes above the critical limit; therefore the AR order p >= 0 and the MA order q >= 0.

  4. From the analysis above, the suggested ARIMA models are ARIMA(>=0,0,>=0)(>=1,0,>=0)[7].

Build ARIMA Model

Use the auto.arima function to select the model with the lowest AICc. The result is consistent with the suggested family ARIMA(>=0,0,>=0)(>=1,0,>=0)[7]; the values of p, q, P and Q obtained by auto.arima are 1, 0, 2 and 0 respectively.

The final model is ARIMA(1,0,0)(2,0,0)[7] with non-zero mean.

The p-value of the Ljung-Box test is greater than 0.05, so there is no evidence of remaining autocorrelation in the model residuals.

## Series: atm_4 
## ARIMA(1,0,0)(2,0,0)[7] with non-zero mean 
## Box Cox transformation: lambda= 0.4492823 
## 
## Coefficients:
##          ar1    sar1    sar2     mean
##       0.0801  0.2076  0.2031  28.5695
## s.e.  0.0526  0.0516  0.0524   1.2477
## 
## sigma^2 estimated as 175.6:  log likelihood=-1459.6
## AIC=2929.2   AICc=2929.37   BIC=2948.7

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(2,0,0)[7] with non-zero mean
## Q* = 16.891, df = 10, p-value = 0.07681
## 
## Model df: 4.   Total lags used: 14

Export Forecast to CSV
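A minimal export sketch using base R's `write.csv`. The data frame layout, the constant forecast value and the file name `atm_forecast.csv` are assumptions for illustration; the file is written to a temporary directory.

```r
# Daily dates of the forecast month.
may_dates <- seq(as.Date("2010-05-01"), as.Date("2010-05-31"), by = "day")

# Placeholder forecast table: one row per day for a single ATM.
forecast_df <- data.frame(DATE = may_dates, ATM = "ATM3", Cash = 85)

# Write an Excel-readable CSV file without row names.
out_file <- file.path(tempdir(), "atm_forecast.csv")
write.csv(forecast_df, out_file, row.names = FALSE)

print(out_file)
```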

Part B

Data Exploration

Data Summary

##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

Missing-value Check

There is only one missing value, in Sep 2008.

Timeline Check

The monthly series was checked for continuity: there are no missing months in the time series. In total, the series contains 16 years of monthly data (192 observations, January 1998 to December 2013).
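A monthly gap check can be sketched by converting each label to a running month index. The label format "1998-Jan" is an assumption based on the YYYY-MMM column name, and the labels are generated here rather than read from the file.

```r
# Assumed label format "1998-Jan"; 192 labels as in the data summary.
labels <- paste(rep(1998:2013, each = 12), month.abb, sep = "-")

# Split each label into year and month number.
yr  <- as.integer(substr(labels, 1, 4))
mon <- match(substr(labels, 6, 8), month.abb)

# Running month index; consecutive months differ by exactly 1.
idx <- yr * 12 + (mon - 1)
no_gaps <- all(diff(sort(idx)) == 1)

# Number of full years covered by the series.
n_years <- length(labels) / 12

print(no_gaps)
print(n_years)
```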

Outlier Check

There is one outlier, at case sequence 883.

Data Manipulation

Imputing Missing Values & Handling Outliers

Impute the missing value and suppress the outlier using the function tsclean.
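To make the idea concrete without the forecast package, the sketch below uses a simplified stand-in: outliers flagged by the boxplot rule are set to missing, and all gaps are then filled by linear interpolation. This is not the tsclean algorithm itself, and the values are a toy vector.

```r
# Toy monthly values with one missing entry and one implausibly low reading.
kwh <- c(6.8e6, 7.1e6, NA, 6.5e6, 7.7e5, 6.9e6, 7.3e6)

# Flag values beyond the boxplot whiskers and treat them as missing.
kwh_clean <- kwh
kwh_clean[kwh_clean %in% boxplot.stats(kwh)$out] <- NA

# Fill every gap by linear interpolation between neighbouring observations.
t_idx <- seq_along(kwh_clean)
kwh_clean <- approx(t_idx, kwh_clean, xout = t_idx)$y

print(kwh_clean)
```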

Data Visualization

The outliers are suppressed in the final data set.

Observation on Raw Data

  1. Significant annual (12-month) seasonality exists.

  2. There is no steady trend, only small fluctuations over time.

  3. The ACF decreases across the seasonal lags, and the PACF drops off after the first seasonal lag.

  4. In both the ACF and PACF, the non-seasonal lags are either within the critical limits or slightly above them.

  5. Based on the above observations, the time series res_ts is non-stationary with significant seasonality and little trend. Seasonal differencing is required to transform res_ts into a stationary series.

Time Series Transformation

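  1. Perform seasonal differencing with lag = 12.

  2. Check stationarity with the KPSS unit root test. The test statistic (0.1049) is below the 10% critical value (0.347), so the p-value is greater than 0.05 and the null hypothesis of stationarity is not rejected. The seasonally differenced series can therefore be treated as stationary.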

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 0.1049 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

Observation on Transformed Data

  1. The seasonal effect is eliminated after differencing; the transformed data shows no significant seasonality or trend.

  2. As the data set becomes stationary after seasonal differencing, no further differencing is needed.

  3. Because this data set is non-stationary with seasonality and becomes stationary after seasonal differencing, an ARIMA model with seasonal difference order D = 1 is appropriate. Because no further differencing is needed, the non-seasonal difference order is d = 0.

  4. The PACF decreases across the seasonal lags, with two spikes above the critical limit, and the ACF drops off after the first seasonal lag, therefore the seasonal AR order P = 0 and the seasonal MA order Q = 1.

  5. The PACF decreases across the non-seasonal lags, with multiple lags above the critical limit, and the ACF varies within the critical limit after the first non-seasonal lag, therefore the AR order p >= 1 and the MA order q = 0.

  6. From the analysis above, the suggested ARIMA models are ARIMA(>=1,0,0)(0,1,1)[12].

Build ARIMA Model

Use the auto.arima function to select the model with the lowest AICc. The result is consistent with the suggested family ARIMA(>=1,0,0)(0,1,1)[12]; the value of p obtained by auto.arima is 1.

The final model is ARIMA(1,0,0)(0,1,1)[12] with drift.

The p-value of the Ljung-Box test is greater than 0.05, so there is no evidence of remaining autocorrelation in the model residuals.

## Series: res_ts %>% tsclean() 
## ARIMA(1,0,0)(0,1,1)[12] with drift 
## Box Cox transformation: lambda= -0.1442665 
## 
## Coefficients:
##          ar1     sma1  drift
##       0.2903  -0.7349  1e-04
## s.e.  0.0724   0.0698  1e-04
## 
## sigma^2 estimated as 8.731e-05:  log likelihood=585.27
## AIC=-1162.55   AICc=-1162.32   BIC=-1149.78

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(0,1,1)[12] with drift
## Q* = 25.496, df = 21, p-value = 0.2263
## 
## Model df: 3.   Total lags used: 24

Forecast

To forecast the monthly power consumption for 2014, we set h = 12.

##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
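A 12-step forecast with the same model orders can be sketched in base R. Assumptions: the series is simulated in toy units (not the KWH data), `stats::arima` and `predict` stand in for the forecast package, and the drift term and Box-Cox transformation are omitted for brevity.

```r
set.seed(624)
n <- 192  # 192 monthly observations, January 1998 to December 2013

# Simulate a toy monthly series: y[t] = y[t-12] + w[t],
# where w is a seasonal MA(1) process with coefficient -0.7.
e <- rnorm(n + 12, sd = 0.3)
w <- e[13:(n + 12)] - 0.7 * e[1:n]
y <- as.numeric(stats::filter(w, filter = c(rep(0, 11), 1), method = "recursive"))
y <- ts(65 + y, start = c(1998, 1), frequency = 12)

# ARIMA(1,0,0)(0,1,1)[12]: the same orders as the final model.
fit <- arima(y, order = c(1, 0, 0),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast h = 12 months ahead, i.e. January to December 2014.
fc <- predict(fit, n.ahead = 12)
fc_table <- data.frame(Month = month.abb[as.integer(cycle(fc$pred))],
                       Forecast = as.numeric(fc$pred))

print(fc_table)
```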

Part C

Aggregate wf_p1 Based on Hour
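The hourly aggregation can be sketched in base R as follows. The column names `DateTime` and `WaterFlow` and all readings are assumptions for illustration; multiple readings within an hour are averaged, as the assignment requires.

```r
# Toy readings with irregular time stamps (names and values are assumed).
wf_p1 <- data.frame(
  DateTime  = as.POSIXct(c("2015-10-23 00:24:06", "2015-10-23 00:40:02",
                           "2015-10-23 01:07:58", "2015-10-23 01:50:05"),
                         tz = "UTC"),
  WaterFlow = c(20, 30, 10, 40)
)

# Truncate each time stamp to the start of its hour.
wf_p1$Hour <- as.POSIXct(trunc(wf_p1$DateTime, units = "hours"))

# Average all readings that fall within the same hour.
wf_p1_hourly <- aggregate(WaterFlow ~ Hour, data = wf_p1, FUN = mean)

print(wf_p1_hourly)
```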

Sum Up wf_p1 and wf_p2 WaterFlow Readings
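Once both pipes are on an hourly grid, they can be joined on the hour and summed. In this sketch the hourly tables are toy data, and treating an hour that is missing from one pipe as zero flow is an assumption, not something the report states.

```r
# Toy hourly tables for the two pipes (values are made up).
hours <- as.POSIXct("2015-10-23 00:00:00", tz = "UTC") + 3600 * 0:2
p1 <- data.frame(Hour = hours,      WaterFlow = c(25, 25, 30))
p2 <- data.frame(Hour = hours[2:3], WaterFlow = c(40, 35))

# Full outer join so that hours present in only one pipe are kept.
wf <- merge(p1, p2, by = "Hour", all = TRUE, suffixes = c("_p1", "_p2"))

# Sum the two pipes; a missing reading is treated as zero flow (assumption).
wf$WaterFlow <- rowSums(wf[, c("WaterFlow_p1", "WaterFlow_p2")], na.rm = TRUE)

print(wf)
```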

Observation on Data

  1. A slightly decreasing trend is observed; the decreasing ACF, which stays above the critical limit, supports the presence of a trend.

  2. No obvious seasonality is present according to the ACF and PACF.

  3. No significant outliers are observed.

  4. The data is non-stationary; differencing is needed in the next step.

Data Transformation

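  1. A Box-Cox transformation is applied to stabilize the variation.

  2. First-order differencing is performed.

  3. In the KPSS unit root test, the test statistic (0.0098) is well below the 5% critical value (0.463), so the null hypothesis of stationarity is not rejected and the differenced series can be treated as stationary.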

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0098 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
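The transformation steps can be written out in base R. The functions `box_cox` and `inv_box_cox` are hypothetical helpers implementing the standard Box-Cox formula, the flow values are a toy vector, and the lambda is the value printed in the model output of this report.

```r
# Box-Cox transform; lambda = 0 corresponds to the natural log.
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Inverse transform, used to bring forecasts back to the original scale.
inv_box_cox <- function(z, lambda) {
  if (lambda == 0) exp(z) else (lambda * z + 1)^(1 / lambda)
}

lambda <- 0.877383               # lambda reported for the wf_ts model
y <- c(40, 38, 41, 36, 35, 33)   # toy hourly flow values

# Step 1: stabilize the variance. Step 2: first-order differencing.
y_bc   <- box_cox(y, lambda)
y_diff <- diff(y_bc, differences = 1)

print(y_diff)
```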

Observation on Transformed Data

  1. The trend effect is eliminated after differencing. Modeling with ARIMA is applicable with difference order d = 1 and seasonal difference order D = 0.

  2. Decreasing non-seasonal lags in the PACF and stable non-seasonal lags within the critical limit in the ACF hint at AR order p = 0 and MA order q >= 1.

  3. Multiple spikes at the seasonal lags in the ACF and stable seasonal lags within the critical limit in the PACF hint at seasonal AR order P >= 1 and seasonal MA order Q = 0.

  4. Suggested model: ARIMA(0,1,>=1)(>=1,0,0)[24].

Build ARIMA Model

The auto.arima function selects a model that is consistent with the suggested family ARIMA(0,1,>=1)(>=1,0,0)[24]; q and P are both estimated to be 1.

Final Model: ARIMA(0,1,1)(1,0,0)[24].

## Series: wf_ts 
## ARIMA(0,1,1)(1,0,0)[24] 
## Box Cox transformation: lambda= 0.877383 
## 
## Coefficients:
##           ma1    sar1
##       -0.9577  0.0771
## s.e.   0.0104  0.0322
## 
## sigma^2 estimated as 109.2:  log likelihood=-3765.62
## AIC=7537.24   AICc=7537.27   BIC=7551.97

Export Forecast to xlsx