Data-624 Project-1

—————————————————————————

Student Name : Sachid Deshmukh

Date : 03/29/2020

RPubs location of published file

—————————————————————————

Part-A

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file. Also please submit the forecast which you will put in an Excel readable file.

Load data from excel and convert date column to Date format

atm.df = fread("ATM624Data.txt", sep='\t', header=TRUE, stringsAsFactors =FALSE)
atm.df$DATE = as.Date(atm.df$DATE, format='%m/%d/%Y 12:00:00 AM', origin = '1990-01-01')

View loaded data frame

head(atm.df)

##          DATE  ATM Cash
## 1: 2009-05-01 ATM1   96
## 2: 2009-05-01 ATM2  107
## 3: 2009-05-02 ATM1   82
## 4: 2009-05-02 ATM2   89
## 5: 2009-05-03 ATM1   85
## 6: 2009-05-03 ATM2   90

As we can see in the dataframe above there are three columns

Date : Date of cash withdrawal
ATM : ATM machine from which cash is withdrawn
Cash : Cash amount withdrawn

Our aim is to forecast monthly withdrawal of cash from all four different ATMs. For this purpose we will take monthly sum of cash withdrawn from all four ATMs and convert that into monthly time series. Also note that we have NA records for May-2010 we will remove those records because we are anyway forecasting for month May-2010

atm.df$DateStr = paste0(year(atm.df$DATE), '-', month(atm.df$DATE))
atm.df = atm.df %>% dplyr::filter(!is.na(Cash)) %>% dplyr::group_by(DateStr) %>% dplyr::summarise(Cash = sum(Cash)) %>% dplyr::select(-DateStr)
atm.ts = ts(atm.df, start=c(2009,5), end=c(2010,4), frequency=12)

Let’s plot the monthly time series

autoplot(atm.ts) +
  xlab('Time') +
  ylab('MonthlyCash') +
  ggtitle ('Time series plot showing total monthly cash withdrawn from all four ATMs')

Above time series plot represents total monthly cash withdrawn from all four ATMs. From the time series plot we can see that time series doesn’t show any seasonal pattern. There is a upword trend till Jan-2010 and downward trend post that. There seems to be outlier indicating huge cash withdrawn in the month of Feb-2010.

Let’s try following forecating models on this time series and assess which one is better in terms of RMSE. We will use time series cross validation function to calculate RMSE for eash model

Seasonal Naive Model
Random walk model
ETS model
ARIMA model

calcRMSE = function(modelName, modelFunc)
{
  res = tsCV(atm.ts,modelFunc)
  print(paste('RMSE for model', modelName, '=', sqrt(mean(res^2, na.rm=TRUE))))
}

Fit seasonal naive model and calc RMSE

calcRMSE('Seasonal Naive', snaive)

## [1] "RMSE for model Seasonal Naive = 6445.52777301652"

Fit Random Walk model and calc RMSE

calcRMSE('Random Walk', rwf)

## [1] "RMSE for model Random Walk = 4085.96603688105"

Fit ARIMA model and calc RMSE

func <- function(x, h){forecast(auto.arima(x), h=h)}
calcRMSE('ARIMA', func)

## [1] "RMSE for model ARIMA = 4326.07477846553"

Fit ETS model and calc RMSE

func <- function(x, h){forecast(ets(x, model="ZZZ" ), h=h)}
calcRMSE('ETS', func)

## [1] "RMSE for model ETS = 3902.99872575104"

We can see that ETS model gives us best RMSE. Let’s analyze model details

fit = ets(atm.ts, model = 'ZZZ')
summary(fit)

## ETS(M,N,N) 
## 
## Call:
##  ets(y = atm.ts, model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.5293 
## 
##   Initial states:
##     l = 15080.8623 
## 
##   sigma:  0.1853
## 
##      AIC     AICc      BIC 
## 228.5467 231.5467 230.0015 
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE MASE       ACF1
## Training set 494.8631 3420.644 2829.518 0.864177 14.76745  NaN -0.1823195

We can see that auto selection method of ETS has chosen ETS (M,N,N) model indicating multiplicative error terms and No trend and seasonality

Let’s analyze residuals

checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,N)
## Q* = 1.1929, df = 3, p-value = 0.7547
## 
## Model df: 2.   Total lags used: 5

Large value of Ljung-Box test shows that residuals are not correlated. ACF plot also confirms the same. Histogram suggests that residuals are normally distributed. The Residual plot indicates a huge outlier presense, which is in line with our finding in the time series

Let’s use ETS(M,N,N) model to generate forecat for May-2010

fcst = forecast(fit, h=1)
print(fcst)

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## May 2010       18224.17 13897.03 22551.31 11606.38 24841.96

fwrite(data.frame(fcst), "ATMForecast.csv", sep=",", col.names = TRUE, row.names = FALSE)

From the above forecast we can expect the withdrwal from all four ATMs in the month of May-2010 as $18224. The lower bound with 95% confidence interval is $11606 and upper bound is $22551

Let’s plot the forecast

autoplot(fcst, pi=TRUE) +
  ggtitle("Monthly withdrawal forecast from all four ATMs")

Autoplot above shows point forecast along with 95% confidence interval

Part-B

Part B consists of a simple data set of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

Load data

df.power  = fread("ResidentialCustomerForecastLoad-624.txt", header=TRUE, sep = '\t', stringsAsFactors = FALSE)

View dataframe preview

head(df.power)

##    CaseSequence YYYY-MMM     KWH
## 1:          733 1998-Jan 6862583
## 2:          734 1998-Feb 5838198
## 3:          735 1998-Mar 5420658
## 4:          736 1998-Apr 5010364
## 5:          737 1998-May 4665377
## 6:          738 1998-Jun 6467147

summary(df.power)

##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1

Note that KWH column has one NA value. Lets fill that value with median value of KWH

df.power$KWH[is.na(df.power$KWH)] = median(df.power$KWH, na.rm = TRUE)
summary(df.power)

##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5434539  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6501333  
##  3rd Qu.:876.2                      3rd Qu.: 7608792  
##  Max.   :924.0                      Max.   :10655730

Let’s remove Year and Month string column and convert the time series into monthly frequency

power.ts = ts(df.power$KWH, start = c(1998,1), end=c(2013,12), frequency = 12)

Let’s plot the time series

autoplot(power.ts) +
  xlab('Month') +
  ylab('KWH') +
  ggtitle('Monthly power consumption in KWH')

From the above time series we can see an outlier in year 2010. The Time series exhibits seasonal pattern indicating high power consumption in winter and summar time and low power consumption in rest of the period. There is no upward ot downward trend exhibited by the time series indicating the demand for power is relatively stable from 2000 to 2013. We can see the surge in peaks has increased after 2010 indicating high usage of power during winter and summer season post 2010

Let’s use STL decomposition to confirm our findings

res = stl(power.ts, s.window = 12)
plot(res)

We can see that STL plot above shows strong seasonal pattern. There is no underlying trend however there is a surge in power consumption post 2010. The unusually low power consumption in the year 2010 is appearent in the error plot above

Above is also a non-stationary time series. Let’s see what order differencing is required to make it stationary

print(paste('Order of seasonal difference required = ', nsdiffs(power.ts)))

## [1] "Order of seasonal difference required =  1"

print(paste('Order of difference required = ', ndiffs(power.ts)))

## [1] "Order of difference required =  1"

We can see that we need first order seasonal difference followed by first order regular difference to make this time series stationary

Let’s try following forecating models on this time series and assess which one is better in terms of RMSE. We will use time series cross validation function to calculate RMSE for eash model

Seasonal Naive Model
Random walk model
ETS model
ARIMA model

calcRMSE = function(modelName, modelFunc)
{
  res = tsCV(power.ts,modelFunc)
  print(paste('RMSE for model', modelName, '=', sqrt(mean(res^2, na.rm=TRUE))))
}

Fit seasonal naive model and calc RMSE

calcRMSE('Seasonal Naive', snaive)

## [1] "RMSE for model Seasonal Naive = 1210609.88804598"

Fit Random Walk model and calc RMSE

calcRMSE('Random Walk', rwf)

## [1] "RMSE for model Random Walk = 1539633.85951456"

Fit Seasonal ARIMA model and calc RMSE

func <- function(x, h){forecast(auto.arima(x), h=h, seasonal=TRUE)}
calcRMSE('ARIMA', func)

## [1] "RMSE for model ARIMA = 1115003.87958045"

Fit ETS model and calc RMSE

func <- function(x, h){forecast(ets(x, model="ZZZ"), h=h)}
calcRMSE('ETS', func)

## [1] "RMSE for model ETS = 1037684.72933356"

We can see that ETS model gives us best RMSE. Let’s analyze model details

fit = ets(power.ts, model = 'ZZZ')
summary(fit)

## ETS(M,N,M) 
## 
## Call:
##  ets(y = power.ts, model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.1167 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 6133758.2586 
##     s = 0.9568 0.7524 0.8702 1.1849 1.2617 1.1857
##            1.0001 0.7743 0.8154 0.9116 1.0651 1.222
## 
##   sigma:  0.1201
## 
##      AIC     AICc      BIC 
## 6224.787 6227.514 6273.649 
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 61969.44 833321.3 505622.3 -4.356712 12.01361 0.722433
##                   ACF1
## Training set 0.1757877

We can see that auto selection method of ETS has chosen ETS (M,N,M) model indicating multiplicative error terms and No trend and multiplicative seasonality. This make sense based on time series and STL dcomposition results

Let’s analyze residuals

checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,M)
## Q* = 30.705, df = 10, p-value = 0.0006562
## 
## Model df: 14.   Total lags used: 24

Residual plots are not satisfactory. small value of Ljung-Box test shows that residuals are correlated. ACF plot also confirms the same. Histogram suggests that residuals are not normally distributed. It shows huge left skew in the residual distribution. The Residual plot indicates a huge outlier presense in year 2010, which is in line with our finding in the time series

Let’s apply BoxCox transformation on the time series and see the results

fit = ets(power.ts, model = 'ZZZ', lambda = BoxCox.lambda(power.ts))
summary(fit)

## ETS(A,N,A) 
## 
## Call:
##  ets(y = power.ts, model = "ZZZ", lambda = BoxCox.lambda(power.ts)) 
## 
##   Box-Cox transformation: lambda= 0.1042 
## 
##   Smoothing parameters:
##     alpha = 0.0538 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 39.2123 
##     s = -0.0972 -1.2756 -0.575 0.9225 1.3681 0.4026
##            0.1313 -1.2026 -0.9124 -0.3817 0.389 1.231
## 
##   sigma:  0.9553
## 
##      AIC     AICc      BIC 
## 1007.327 1010.054 1056.189 
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 181549.3 876724.3 565471.4 -2.128111 12.02496 0.8079454
##                   ACF1
## Training set 0.2409174

We can see that RMSE is greatly improved using BoxCox transformation. We can also see that BoxCox transformation changes ETS model to ETS(A,N,A) where error and seasonality are additive and there is no trend

Let’s visualize residuals

checkresiduals(fit)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 13.696, df = 10, p-value = 0.1873
## 
## Model df: 14.   Total lags used: 24

Residual plots are satisfactory. Large value of Ljung-Box test shows that residuals are not correlated. ACF plot also confirms the same. Histogram suggests that residuals are not normally distributed however it appears to be acceptable

Let’s use ETS(A,N,A) model to generate forecast for 2014

fcst = forecast(fit, h=12, lambda = BoxCox.lambda(power.ts))

## Warning in forecast.ets(fit, h = 12, lambda = BoxCox.lambda(power.ts)):
## biasadj information not found, defaulting to FALSE.

print(fcst)

##          Point Forecast   Lo 80    Hi 80   Lo 95    Hi 95
## Jan 2014        9072736 7183518 11395443 6333493 12829171
## Feb 2014        7731631 6095283  9751051 5361280 11000973
## Mar 2014        6662622 5230951  8435807 4590628  9536172
## Apr 2014        6005531 4700833  7625719 4118547  8633056
## May 2014        5671328 4431064  7213978 3878258  8174218
## Jun 2014        7358020 5785457  9303198 5081382 10509173
## Jul 2014        7751644 6100889  9791808 5361297 11055897
## Aug 2014        9309851 7358433 11712684 6481503 13197502
## Sep 2014        8558822 6748361 10792837 5936177 12175460
## Oct 2014        6416036 5020103  8150139 4397280  9228578
## Nov 2014        5589925 4356484  7127459 3807699  8086030
## Dec 2014        7040548 5518105  8929009 4838018 10102189

fwrite(data.frame(fcst), "PowerForecast.csv", sep=",", col.names = TRUE, row.names = FALSE)

Let’s plot the forecast

autoplot(fcst, pi=TRUE) +
  ggtitle("Forecast - Power consumption in KWH")

Autoplot above shows point forecast along with 95% confidence interval for 2014 energy consumption in KWH using ETS(A,N,A) model with BoxCox transformation