Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.

Reading the data, renaming columns, removing NA values, and adjusting the date column.

df <- read_excel('../Project1/ATM624Data.xlsx', col_names=TRUE,
                 col_types=c('date', 'text', 'numeric')) %>%
                  rename(date=DATE, atm=ATM, cash=Cash) %>%
                  mutate(date = as_date(date)) %>%
                  na.omit %>%
                  arrange(date, atm)

head(df)
## # A tibble: 6 x 3
##   date       atm    cash
##   <date>     <chr> <dbl>
## 1 2009-05-01 ATM1    96 
## 2 2009-05-01 ATM2   107 
## 3 2009-05-01 ATM3     0 
## 4 2009-05-01 ATM4   777.
## 5 2009-05-02 ATM1    82 
## 6 2009-05-02 ATM2    89

Finding number of unique ATMs

unique(df$atm)
## [1] "ATM1" "ATM2" "ATM3" "ATM4"

Using the group_map function, each ATM’s data is converted into its own time series with weekly seasonality:

ts_atm <- df %>%
  group_by(atm) %>%
  group_map(~ ts(.x$cash, start=c(2009, 05, 01), frequency=7))

str(ts_atm)
## List of 4
##  $ : Time-Series [1:362] from 2010 to 2061: 96 82 85 90 99 88 8 104 87 93 ...
##  $ : Time-Series [1:363] from 2010 to 2061: 107 89 90 55 79 19 2 103 107 118 ...
##  $ : Time-Series [1:365] from 2010 to 2062: 0 0 0 0 0 0 0 0 0 0 ...
##  $ : Time-Series [1:365] from 2010 to 2062: 777 524.4 792.8 908.2 52.8 ...
autoplot(ts_atm[[1]])

autoplot(ts_atm[[2]])

autoplot(ts_atm[[3]])

autoplot(ts_atm[[4]])

As we can see from the graphs above, the first and second series look like ordinary, unremarkable time series. The third ATM shows withdrawals only in the most recent days. The fourth one contains a large outlier; the tsoutliers function can suggest a correction.

tsoutliers(ts_atm[[4]])
## $index
## [1] 285
## 
## $replacements
## [1] 226.6204
ts_atm[[4]][285] <- 226.6204
autoplot(ts_atm[[4]])

The above graph looks much better after replacement.

ANALYSIS

ggtsdisplay(ts_atm[[1]], points = FALSE, plot.type = "histogram") 

The ACF plot shows noticeable autocorrelation, with spikes every 7 days reflecting weekly seasonality, and the histogram suggests the values are roughly normally distributed.

Box.test(ts_atm[[1]], lag = 24, fitdf = 0, type = "Lj")
## 
##  Box-Ljung test
## 
## data:  ts_atm[[1]]
## X-squared = 498.84, df = 24, p-value < 2.2e-16
adf.test(ts_atm[[1]])
## Warning in adf.test(ts_atm[[1]]): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_atm[[1]]
## Dickey-Fuller = -4.6043, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

Since the series is stationary and the data is daily with weekly seasonality, an exponential smoothing state space (ETS) model is a good choice for forecasting this data.

atm1<-ts_atm[[1]]
head(atm1)
## Time Series:
## Start = c(2009, 5) 
## End = c(2010, 3) 
## Frequency = 7 
## [1] 96 82 85 90 99 88
model1<- ets(atm1)
model1 <- forecast(model1, 31)
autoplot(model1, 31)

The Ljung-Box test and diagnostic plotting are used to investigate the residuals.

checkresiduals(model1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 17.703, df = 5, p-value = 0.003342
## 
## Model df: 9.   Total lags used: 14
accuracy(model1)
##                      ME     RMSE     MAE       MPE     MAPE      MASE      ACF1
## Training set -0.5473075 26.05397 16.9637 -110.6377 126.3744 0.9013793 0.1374786

With the Ljung-Box test, the p-value is less than .05, so we reject the null hypothesis that the residuals are uncorrelated and conclude that some autocorrelation remains that the model has not captured. The fit is still reasonable overall, so this model will be used to predict the cash taken out of ATM1.

ATM2

It appears that there is some seasonality in the data for ATM2. As with ATM1, the ACF plot for ATM2 shows significant autocorrelation, and the values appear roughly normally distributed.

Box.test(ts_atm[[2]], lag = 24, fitdf = 0, type = "Lj")
## 
##  Box-Ljung test
## 
## data:  ts_atm[[2]]
## X-squared = 700.46, df = 24, p-value < 2.2e-16
adf.test(ts_atm[[2]])
## Warning in adf.test(ts_atm[[2]]): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_atm[[2]]
## Dickey-Fuller = -6.009, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

The ADF test indicates the data is stationary, similar to ATM1.

atm2<-ts_atm[[2]]
head(atm2)
## Time Series:
## Start = c(2009, 5) 
## End = c(2010, 3) 
## Frequency = 7 
## [1] 107  89  90  55  79  19
ggtsdisplay(atm2, points = FALSE, plot.type = "histogram")

The weekly seasonality of the time series is very clear in the graph above. An ARIMA model is appropriate here because its main requirements are met: the data is stationary, meaning the mean and variance do not vary with time, and the input is a univariate series, since ARIMA uses the past values of the series to predict its future values.

Training Data

str(atm2)
##  Time-Series [1:363] from 2010 to 2061: 107 89 90 55 79 19 2 103 107 118 ...
train = atm2[1:330]
test  = atm2[330:363]
model2 = auto.arima(train)
summary(model2)
## Series: train 
## ARIMA(3,1,1) 
## 
## Coefficients:
##           ar1      ar2     ar3      ma1
##       -0.0413  -0.3903  0.0442  -0.9727
## s.e.   0.0560   0.0510  0.0559   0.0097
## 
## sigma^2 estimated as 1269:  log likelihood=-1642.32
## AIC=3294.64   AICc=3294.83   BIC=3313.62
## 
## Training set error measures:
##                     ME    RMSE      MAE  MPE MAPE      MASE         ACF1
## Training set -3.435315 35.3568 29.53137 -Inf  Inf 0.6815728 -0.007701287
fc2 <- predict(model2, 30)

accuracy(test, fc2$pred)
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -2.732223 36.97182 33.03938 -4.148932 61.07952 0.1467459  10.22849

Running auto.arima on the training set suggests an ARIMA(3,1,1) model for the data. To forecast May’s cash withdrawals, a weekly seasonal term and a Box-Cox transformation are added when the model is refit on the full series.

lambda <- BoxCox.lambda(atm2)
## Warning in guerrero(x, lower, upper): Guerrero's method for selecting a Box-Cox
## parameter (lambda) is given for strictly positive data.
model3 <- Arima(atm2, order = c(3, 1, 1), seasonal = c(0,1,1), lambda =lambda )
model3 <- forecast(model3, 31)
autoplot(model3, 31)

Checking the accuracy.

checkresiduals(model3)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,1)(0,1,1)[7]
## Q* = 29.991, df = 9, p-value = 0.0004402
## 
## Model df: 5.   Total lags used: 14
accuracy(model3)
##                     ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 0.5572441 27.21348 18.69743 -Inf  Inf 0.8701024 -0.01881647

ATM3

ggtsdisplay(ts_atm[[3]], points = FALSE, plot.type = "histogram") 

ATM3 has only a few non-zero observations.

length(unique(ts_atm[[3]]))
## [1] 4

When it comes to ATM3, almost every day in the series is zero, with just a handful of non-zero withdrawals at the very end of the year (only 4 distinct values in total). That is not enough history to fit a proper model, and fitting to the long run of zeros would simply give a flat forecast near zero; a minimal fallback is sketched after the code below.

atm3<-ts_atm[[3]]
head(atm3)
## Time Series:
## Start = c(2009, 5) 
## End = c(2010, 3) 
## Frequency = 7 
## [1] 0 0 0 0 0 0
str(atm3)
##  Time-Series [1:365] from 2010 to 2062: 0 0 0 0 0 0 0 0 0 0 ...
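Since a statistical model cannot reasonably be fit to what is essentially a string of zeros, below is a minimal sketch of one possible fallback, not a fitted model from the sections above. It assumes the few non-zero days at the end of ATM3’s history are the only usable signal and simply projects their mean forward; the object name atm3_fc is introduced here purely for illustration.

# A minimal sketch: keep only the non-zero withdrawals and project their
# average forward for the 31 days of May 2010 using a mean forecast.
atm3_nonzero <- atm3[atm3 > 0]
atm3_fc <- meanf(atm3_nonzero, h = 31)
autoplot(atm3_fc)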

ATM4

Moving along to ATM4, which is a bit different from previous examples.

adf.test(ts_atm[[4]])
## Warning in adf.test(ts_atm[[4]]): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_atm[[4]]
## Dickey-Fuller = -5.6307, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
atm4 <- ts_atm[[4]]
head(atm4)
## Time Series:
## Start = c(2009, 5) 
## End = c(2010, 3) 
## Frequency = 7 
## [1] 776.99342 524.41796 792.81136 908.23846  52.83210  52.20845
ggtsdisplay(atm4, points = FALSE, plot.type = "histogram") 

It looks like ATM4 does not show the same clear weekly pattern as ATM1 and ATM2, and the remaining outliers seem to skew the data. There could be a data error: one day shows a total above 1,500 (keeping in mind that cash is recorded in hundreds of dollars), which may be a mistake or simply an unusually busy day; it is hard to say. One optional way to tame such spikes is sketched below.
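As a hedged sketch only, and not applied in the models that follow, the tsclean function from the forecast package replaces identified outliers (and any missing values) with interpolated estimates, which would damp the effect of such a spike on the fit.

# A minimal sketch, not used below: replace identified outliers in the ATM4
# series with interpolated values before any modeling.
atm4_clean <- tsclean(atm4)
autoplot(atm4_clean)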

Further exploration of the data set with auto.arima should give us some model options.

str(atm4)
##  Time-Series [1:365] from 2010 to 2062: 777 524.4 792.8 908.2 52.8 ...
train4= atm4[1:335]
test4 = atm4[336:365]
model4 = auto.arima(train4)
summary(model4)
## Series: train4 
## ARIMA(0,0,1) with non-zero mean 
## 
## Coefficients:
##          ma1      mean
##       0.0795  449.2427
## s.e.  0.0547   20.9186
## 
## sigma^2 estimated as 126596:  log likelihood=-2442.26
## AIC=4890.52   AICc=4890.6   BIC=4901.97
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.04460359 354.7402 295.9304 -523.4147 556.8516 0.7594607
##                      ACF1
## Training set 0.0001446206
fc4 <- predict(model4, 30)
lambda4 <- BoxCox.lambda(atm4)
model4 <- Arima(atm4, order = c(0, 0, 0), seasonal = c(0, 1, 1), lambda =lambda4 )
model4 <- forecast(model4, 31)
autoplot(model4, 31)

checkresiduals(model4)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(0,1,1)[7]
## Q* = 20.901, df = 13, p-value = 0.07491
## 
## Model df: 1.   Total lags used: 14
accuracy(model4)
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 63.24085 339.2724 256.8838 -371.6286 415.3023 0.7415465 0.09955737

I’d like to see if I can get better results with another model, so I will try the ETS method.

model44<- ets(atm4)
model44 <- forecast(model44, 31)
autoplot(model44, 31)

Checking the residuals

checkresiduals(model44)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 19.372, df = 5, p-value = 0.001638
## 
## Model df: 9.   Total lags used: 14

The p-value (0.0016) suggests that there is non-zero autocorrelation remaining in the residuals.

accuracy(model44)
##                     ME     RMSE      MAE       MPE    MAPE      MASE      ACF1
## Training set -6.218025 328.8349 264.8216 -516.3996 546.331 0.7644608 0.1033497

The RMSE value is slightly lower for this ETS model, but the first (seasonal ARIMA) model still seems the better prediction for ATM4, as its forecast did not include any implausibly large withdrawals (over 1,000); the fact that the forecasts also dip toward negative values might indicate that much of the remaining variation is white noise.

Write results to a CSV.

atms_forecasts = rbind(model1,model2,model3,model4)
## Warning in rbind(model1, model2, model3, model4): number of columns of result is
## not a multiple of vector length (arg 1)
write.csv(atms_forecasts,"atms_forecasts.csv")
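Because the objects above are a mix of model fits and forecasts, the combined file is messy (hence the warning). A hedged alternative is sketched below that keeps only the mean point forecasts for each ATM; the data frame layout and file name are illustrative, and ATM3 is filled with the average of its few non-zero days.

# A minimal sketch with illustrative names: one row per day of May 2010 with
# the point forecast for each ATM, written to a single Excel-readable CSV.
atm_may <- data.frame(
  date = seq(as.Date("2010-05-01"), by = "day", length.out = 31),
  ATM1 = as.numeric(model1$mean)[1:31],   # ETS(A,N,A) forecast
  ATM2 = as.numeric(model3$mean)[1:31],   # ARIMA(3,1,1)(0,1,1)[7] forecast
  ATM3 = mean(atm3[atm3 > 0]),            # only a few non-zero days to go on
  ATM4 = as.numeric(model4$mean)[1:31]    # ARIMA(0,0,0)(0,1,1)[7] forecast
)
write.csv(atm_may, "atms_forecasts_tidy.csv", row.names = FALSE)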

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

df2 <- read_excel('../Project1//ResidentialCustomerForecastLoad-624.xlsx', col_names=TRUE) %>%
  rename(case=CaseSequence, date=`YYYY-MMM`, kwh=KWH) %>%
  mutate(Date=as_date(paste(date, '01', sep='-'), format='%Y-%b-%d'))

summary(df2)
##       case           date                kwh                Date           
##  Min.   :733.0   Length:192         Min.   :  770523   Min.   :1998-01-01  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912   1st Qu.:2001-12-24  
##  Median :828.5   Mode  :character   Median : 6283324   Median :2005-12-16  
##  Mean   :828.5                      Mean   : 6502475   Mean   :2005-12-15  
##  3rd Qu.:876.2                      3rd Qu.: 7620524   3rd Qu.:2009-12-08  
##  Max.   :924.0                      Max.   :10655730   Max.   :2013-12-01  
##                                     NA's   :1

There is one NA value; the median of KWH is imputed in its place.

df2$kwh[is.na(df2$kwh)] = median(df2$kwh, na.rm = TRUE)
summary(df2)
##       case           date                kwh                Date           
##  Min.   :733.0   Length:192         Min.   :  770523   Min.   :1998-01-01  
##  1st Qu.:780.8   Class :character   1st Qu.: 5434539   1st Qu.:2001-12-24  
##  Median :828.5   Mode  :character   Median : 6283324   Median :2005-12-16  
##  Mean   :828.5                      Mean   : 6501333   Mean   :2005-12-15  
##  3rd Qu.:876.2                      3rd Qu.: 7608792   3rd Qu.:2009-12-08  
##  Max.   :924.0                      Max.   :10655730   Max.   :2013-12-01
power <- ts(df2$kwh, start=c(1998, 1), frequency=12)
ggtsdisplay(power, points = FALSE, plot.type = "histogram") 

ggseasonplot(power)

The graphs above indicate clear seasonality in the data. There are visible spikes during the warm months, likely due to air-conditioning usage, and another increase around December, possibly due to holiday lighting and heating equipment. There is also a noticeable dip in July 2010, perhaps a blackout or a data error; imputing the median in place of the NA did not address this value, since it is an actual recorded observation. A sketch of how that dip could be identified and smoothed follows.
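As a hedged side note, not applied in the decomposition and model below, the same outlier tools used for the ATM data could flag and replace the July 2010 dip:

# A minimal sketch, not applied below: tsoutliers() flags unusual observations
# and suggests replacements; tsclean() applies such replacements automatically.
tsoutliers(power)
power_clean <- tsclean(power)
autoplot(power_clean)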

adf.test(power)
## Warning in adf.test(power): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  power
## Dickey-Fuller = -4.7719, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

Decomposition will give us a better understanding of the series.

ts_pw<-decompose(power)
plot(ts_pw)

The trend panel indicates an upward movement in the data, and the remainder panel shows the July 2010 outlier. An STL decomposition seems like the better option for forecasting power usage in 2014: the seasonal component is allowed to change over time, with the rate of change controlled by the user; the smoothness of the trend-cycle can also be controlled; and the decomposition can be made robust to outliers, so that occasional unusual observations do not distort the estimates of the trend-cycle and seasonal components.

modelpw <- stl(power, s.window='periodic', robust=TRUE) 
modelpw <- forecast(modelpw, h = 12)
autoplot(modelpw, 12)

Checking the model residuals and accuracy.

checkresiduals(modelpw)
## Warning in checkresiduals(modelpw): The fitted degrees of freedom is based on
## the model used for the seasonally adjusted data.

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(A,N,N)
## Q* = 28.969, df = 22, p-value = 0.1457
## 
## Model df: 2.   Total lags used: 24

With a p-value of 0.15 we fail to reject the Ljung-Box null hypothesis, so the residuals show no significant autocorrelation and behave roughly like white noise, which supports the model. Treating the July 2010 drop as an outlier might still improve the fit. The RMSE looks large in absolute terms, but relative to monthly usage in the millions of KWH (a MAPE of about 12%) the forecast is still a reasonable prediction.

accuracy(modelpw)
##                    ME     RMSE      MAE       MPE     MAPE      MASE     ACF1
## Training set 69834.05 843670.1 512067.7 -4.243142 12.03155 0.7316422 0.209786
write.csv(modelpw,"pw_forecast.csv")
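As with Part A, a hedged sketch of a tidier Excel-readable export that keeps just the twelve 2014 point forecasts (the column and file names are illustrative):

# A minimal sketch with illustrative names: one row per month of 2014 with the
# STL + ETS point forecast in KWH.
pw_2014 <- data.frame(
  month = format(seq(as.Date("2014-01-01"), by = "month", length.out = 12), "%Y-%b"),
  kwh   = as.numeric(modelpw$mean)[1:12]
)
write.csv(pw_2014, "pw_forecast_2014.csv", row.names = FALSE)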