Part A – ATM Forecast, ATM624Data.xlsx

Okay! It looks like this is a dataframe I can work with now, and we know that this is one year's worth of data. First, I will separate each of the unique ATMs into its own dataframe. (reference: https://stackoverflow.com/questions/33180753/create-multiple-data-frames-from-one-based-off-values-with-a-for-loop)

for(i in unique(df_atm$ATM)) {
        nam <- paste("df", i, sep = ".")
        assign(nam, df_atm[df_atm$ATM==i,])
}
Now, I will drop the ATM column from each new dataframe.
drops <- c("ATM")
df.ATM1<-df.ATM1[ , !(names(df.ATM1) %in% drops)]
df.ATM2<-df.ATM2[ , !(names(df.ATM2) %in% drops)]
df.ATM3<-df.ATM3[ , !(names(df.ATM3) %in% drops)]
df.ATM4<-df.ATM4[ , !(names(df.ATM4) %in% drops)]
Now, I will transform each of the new dataframes into a time series.
# the data runs from May 1, 2009 (day 121 of the year) through the end of April 2010
ts_ATM1 <- ts(df.ATM1["Cash"], start = c(2009, 121), frequency = 365)
ts_ATM2 <- ts(df.ATM2["Cash"], start = c(2009, 121), frequency = 365)
ts_ATM3 <- ts(df.ATM3["Cash"], start = c(2009, 121), frequency = 365)
ts_ATM4 <- ts(df.ATM4["Cash"], start = c(2009, 121), frequency = 365)
I will visualize each time series, perform any needed transformations, and make a forecast.

ATM 1

ggtsdisplay(ts_ATM1, points = FALSE, plot.type = "histogram") 

#####We see significant autocorrelation in the ACF plot, and the histogram shows the series is not normally distributed. There is a repeating spike in the ACF every 7 lags, which suggests a weekly pattern.
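A quick optional check of that weekly pattern (a sketch; findfrequency() comes from the forecast package and estimates the dominant seasonal period from the data itself, so a value near 7 would support a weekly cycle):

# estimate the dominant seasonal period of ATM1's cash series
findfrequency(df.ATM1[["Cash"]])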

Box.test(ts_ATM1, lag = 24, fitdf = 0, type = "Lj")
## 
##  Box-Ljung test
## 
## data:  ts_ATM1
## X-squared = 498.84, df = 24, p-value < 2.2e-16
adf.test(ts_ATM1)
## Warning in adf.test(ts_ATM1): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_ATM1
## Dickey-Fuller = -4.6043, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
Since the series is stationary and we’re looking at daily data, I will reset the frequency to 7 (weekly) and use an Exponential Smoothing State Space (ETS) model to forecast this data, following the approaches referenced here (https://otexts.com/fpp2/weekly.html) and here (https://robjhyndman.com/hyndsight/dailydata/). Note that ets() ignores seasonality when the frequency is greater than 24, so the daily frequency of 365 cannot be used directly.
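As a side-by-side sketch (not used for the forecast submitted below), an STL-based forecast of the same weekly-frequency series; stlf() from the forecast package decomposes the weekly seasonality and forecasts the seasonally adjusted series:

# sketch only: STL decomposition plus a 31-day forecast of the seasonally adjusted series
autoplot(stlf(ts(as.numeric(df.ATM1[["Cash"]]), frequency = 7), h = 31))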
ts_ATM1_2 <- ts(df.ATM1["Cash"], frequency=7)
head(ts_ATM1_2)
## Time Series:
## Start = c(1, 1) 
## End = c(1, 6) 
## Frequency = 7 
##      Cash
## [1,]   96
## [2,]   82
## [3,]   85
## [4,]   90
## [5,]   99
## [6,]   88
model_ATM1<- ets(ts_ATM1_2)
model_ATM1 <- forecast(model_ATM1, 31)
autoplot(model_ATM1, 31)

#####The residuals are investigated using a Ljung-Box test and diagnostic plotting:

checkresiduals(model_ATM1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 17.703, df = 5, p-value = 0.003342
## 
## Model df: 9.   Total lags used: 14
accuracy(model_ATM1)
##                      ME     RMSE     MAE       MPE     MAPE      MASE
## Training set -0.5473075 26.05397 16.9637 -110.6377 126.3744 0.9013793
##                   ACF1
## Training set 0.1374786
From the Ljung-Box test we get a p-value below .05, so we reject the null hypothesis that the residuals are independently distributed; in other words, some autocorrelation remains in the residuals. The fit is not perfect, but this model can still be used for predicting the cash taken out of ATM 1.
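Before moving on, a small holdout check (a sketch; the object names below are my own) comparing the ETS fit with a seasonal naive benchmark over the final 31 days:

# hold out the last 31 days and compare test-set RMSE of ETS vs. a seasonal naive forecast
y_atm1   <- ts(as.numeric(ts_ATM1_2), frequency = 7)              # plain univariate copy
train_a1 <- window(y_atm1, end = time(y_atm1)[length(y_atm1) - 31])
test_a1  <- window(y_atm1, start = time(y_atm1)[length(y_atm1) - 30])
accuracy(forecast(ets(train_a1), h = 31), test_a1)["Test set", "RMSE"]
accuracy(snaive(train_a1, h = 31), test_a1)["Test set", "RMSE"]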
ATM 2

ggtsdisplay(ts_ATM2, points = FALSE, plot.type = "histogram") 

#####There appears to be some sort of seasonality within the data from ATM2.

Box.test(ts_ATM2, lag = 24, fitdf = 0, type = "Lj")
## 
##  Box-Ljung test
## 
## data:  ts_ATM2
## X-squared = 700.46, df = 24, p-value < 2.2e-16
As with ATM1, we see significant autocorrelation in the ACF plot for ATM2, and the histogram shows the series is not normally distributed.
adf.test(ts_ATM2)
## Warning in adf.test(ts_ATM2): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_ATM2
## Dickey-Fuller = -6.009, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
The data looks stationary according to the augmented Dickey-Fuller test. Following what I did with the data from ATM1, I will change the frequency of the series to 7.
ts_ATM2_1 <- ts(df.ATM2["Cash"], frequency=7)
ggtsdisplay(ts_ATM2_1, points = FALSE, plot.type = "histogram")

#####The 7-day seasonality of the adjusted time series becomes very clear in this graph. I will use an ARIMA model, since these models work on the following assumptions: (A) the series is stationary, meaning its mean and variance should not vary with time (a series can be made stationary by a log transformation or by differencing, as in the sketch below); and (B) the input must be a univariate series, since ARIMA uses past values of the series to predict future values. First, I'll train on the first 333 observations (approximately 11 months) and use the remaining month of observations to test the accuracy of a 30-day forecast.
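A quick differencing check before fitting (a sketch; cash_atm2 is just my own plain univariate copy of the series). ndiffs() and nsdiffs() from the forecast package estimate how many regular and seasonal differences the series appears to need:

cash_atm2 <- ts(as.numeric(ts_ATM2_1), frequency = 7)   # plain univariate copy
ndiffs(cash_atm2)    # suggested order of regular differencing, d
nsdiffs(cash_atm2)   # suggested order of seasonal differencing, D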

train_ts_ATM2_1 = ts_ATM2_1[1:333,]
valid_ts_ATM2_1 = ts_ATM2_1[334:nrow(ts_ATM2_1),]
model = auto.arima(train_ts_ATM2_1)
summary(model)
## Series: train_ts_ATM2_1 
## ARIMA(3,1,1) 
## 
## Coefficients:
##           ar1      ar2     ar3      ma1
##       -0.0390  -0.3886  0.0481  -0.9724
## s.e.   0.0558   0.0511  0.0559   0.0097
## 
## sigma^2 estimated as 1281:  log likelihood=-1658.84
## AIC=3327.67   AICc=3327.86   BIC=3346.7
## 
## Training set error measures:
##                     ME     RMSE      MAE  MPE MAPE      MASE         ACF1
## Training set -3.627067 35.52467 29.70381 -Inf  Inf 0.6838406 -0.008466317
fc_atm2 <- predict(model, 30)   # renamed so it does not shadow forecast()
accuracy(valid_ts_ATM2_1, fc_atm2$pred)
##                 ME     RMSE      MAE       MPE     MAPE      ACF1
## Test set -8.791866 37.30711 33.79654 -16.48248 64.44034 0.1795273
##          Theil's U
## Test set  4.312127

After running the training set, it looks like an ARIMA(3,1,1) model is a good fit for this data. To forecast the cash withdrawals for May, I will refit it on the full series with a seasonal (0,1,1) term and a Box-Cox transformation.
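As a cross-check (a sketch, not used for the exported forecast), auto.arima() can also be allowed to search the seasonal terms itself on the full weekly series, with the same kind of Box-Cox transformation:

# sketch: let auto.arima pick the seasonal terms on the full weekly series
auto.arima(ts(as.numeric(ts_ATM2_1), frequency = 7), D = 1,
           lambda = BoxCox.lambda(ts_ATM2_1))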

ATM2_lambda <- BoxCox.lambda(ts_ATM2_1)
model_ATM2 <- Arima(ts_ATM2_1, order = c(3, 1, 1), seasonal = c(0,1,1), lambda =ATM2_lambda )
model_ATM2 <- forecast(model_ATM2, 31)
autoplot(model_ATM2, 31)

#####Check the model's accuracy

checkresiduals(model_ATM2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,1)(0,1,1)[7]
## Q* = 29.991, df = 9, p-value = 0.0004402
## 
## Model df: 5.   Total lags used: 14
accuracy(model_ATM2)
##                     ME     RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 0.5572441 27.21348 18.69743 -Inf  Inf 0.8701024 -0.01881647
ATM 3

ggtsdisplay(ts_ATM3, points = FALSE, plot.type = "histogram") 

#####ATM3 appears to only have a few observations. Let’s investigate further …

length(unique(ts_ATM3)) 
## [1] 4
With only four unique values (and just three non-zero observations), I think that a simple average method will suffice; there isn't enough data to make a proper forecast (reference: https://otexts.com/fpp2/simple-methods.html). I will also remove the zero rows from the data so the forecast is based only on the three real observations. This needs to happen before I turn the data into a time series.
df.ATM3[df.ATM3==0] <- NA
df.ATM3<-df.ATM3[complete.cases(df.ATM3),]
ts_ATM3 <- ts(df.ATM3["Cash"], frequency = 3)   # only three observations remain, so a calendar start is not meaningful here
model_ATM3<- meanf(ts_ATM3, 31)
autoplot(model_ATM3, 31)

#####The forecast is flat, but I think that this is the best I am going to be able to do. Let’s check the residuals and accuracy.

checkresiduals(model_ATM3)

## 
##  Ljung-Box test
## 
## data:  Residuals from Mean
## Q* = NA, df = 3, p-value = NA
## 
## Model df: 1.   Total lags used: 4
accuracy(model_ATM3)
##                         ME    RMSE      MAE        MPE     MAPE MASE
## Training set -4.737024e-15 6.01849 5.555556 -0.4557562 6.242793  NaN
##                   ACF1
## Training set -0.295501
ATM 4

Data from ATM4 is much different from ATM1 and ATM2.
ggtsdisplay(ts_ATM4, points = FALSE, plot.type = "histogram") 

adf.test(ts_ATM4)
## Warning in adf.test(ts_ATM4): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_ATM4
## Dickey-Fuller = -6.3253, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
As with the data from ATM1 and ATM2, I will transform this series to have a frequency of 7.
ts_ATM4_1 <- ts(df.ATM4["Cash"], frequency=7)
ggtsdisplay(ts_ATM4_1, points = FALSE, plot.type = "histogram") 

#####There does not appear to be the kind of weekly seasonality seen in ATM1 & ATM2. There is also one outlier that I imagine will skew the forecast for this data. There must be an error in reporting, because it looks like over $1,000,000 was withdrawn from that ATM in a single day, and I don't believe a single ATM would hold that much cash (one optional way to handle this is sketched below).
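One optional way to handle that outlier (a sketch, not applied in the analysis below): forecast::tsclean() replaces outliers and missing values with fitted values from a robust decomposition.

# sketch only: a cleaned copy of the ATM4 series with the outlier replaced
ts_ATM4_clean <- tsclean(ts(as.numeric(df.ATM4[["Cash"]]), frequency = 7))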

I will explore the auto.arima options for this dataset.

train_ts_ATM4_1 = ts_ATM4_1[1:333,]
valid_ts_ATM4_1 = ts_ATM4_1[334:nrow(ts_ATM4_1),]
model_ATM4 = auto.arima(train_ts_ATM4_1)
summary(model_ATM4)
## Series: train_ts_ATM4_1 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##           mean
##       482.4379
## s.e.   36.9573
## 
## sigma^2 estimated as 456195:  log likelihood=-2641.61
## AIC=5287.23   AICc=5287.26   BIC=5294.84
## 
## Training set error measures:
##                        ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 5.218616e-14 674.4074 332.1155 -573.6331 603.8937 0.734099
##                     ACF1
## Training set -0.01069726
fc_atm4 <- predict(model_ATM4, 30)   # 30-day prediction from the auto.arima fit
ATM4_lambda <- BoxCox.lambda(ts_ATM4_1)
model_ATM4 <- Arima(ts_ATM4_1, order = c(0, 0, 0), seasonal = c(0, 1, 1), lambda =ATM4_lambda )
model_ATM4 <- forecast(model_ATM4, 31)
autoplot(model_ATM4, 31)

checkresiduals(model_ATM4)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(0,1,1)[7]
## Q* = 21.265, df = 13, p-value = 0.06785
## 
## Model df: 1.   Total lags used: 14
accuracy(model_ATM4)
##                    ME     RMSE     MAE      MPE     MAPE      MASE
## Training set 168.6943 664.5617 303.076 -280.661 341.3589 0.7541188
##                     ACF1
## Training set 0.001571149
I’d like to see if I can get better results with another model, so I will try the ETS method.
model_ATM4_1<- ets(ts_ATM4_1)
model_ATM4_1 <- forecast(model_ATM4_1, 31)
autoplot(model_ATM4_1, 31)

#####Let’s check the residuals and accuracy of this model.

checkresiduals(model_ATM4_1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(0,1,1)[7]
## Q* = 21.265, df = 13, p-value = 0.06785
## 
## Model df: 1.   Total lags used: 14
The p-value of 0.068 is above .05, which suggests the residuals resemble white noise.
accuracy(model_ATM4_1)
##                    ME     RMSE     MAE      MPE     MAPE      MASE
## Training set 168.6943 664.5617 303.076 -280.661 341.3589 0.7541188
##                     ACF1
## Training set 0.001571149
The RMSE does not improve with this model, so I will keep the ARIMA forecast for ATM4.
Let’s write the point forecasts for each ATM to a CSV.
# collect the 31 daily point forecasts for each ATM into one data frame
atm_forecasts <- data.frame(ATM1 = model_ATM1$mean, ATM2 = model_ATM2$mean,
                            ATM3 = model_ATM3$mean, ATM4 = model_ATM4$mean)
write.csv(atm_forecasts, "ATM_Forecasts.csv")

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Read in ResidentialCustomerForecastLoad-624.xlsx:
df_pwr <- read.xlsx("/Users/ntlrsmllghn/Dropbox/Data/Data 624/Project 1 Data/ResidentialCustomerForecastLoad-624.xlsx")
head(df_pwr)
##   CaseSequence YYYY-MMM     KWH
## 1          733 1998-Jan 6862583
## 2          734 1998-Feb 5838198
## 3          735 1998-Mar 5420658
## 4          736 1998-Apr 5010364
## 5          737 1998-May 4665377
## 6          738 1998-Jun 6467147
summary(df_pwr)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
df_pwr <- drop_na(df_pwr)   # note: dropping the single NA row shifts every later month back one position in the monthly ts built below
colnames(df_pwr)[2] <- "date"
summary(df_pwr)
##   CaseSequence       date                KWH          
##  Min.   :733.0   Length:191         Min.   :  770523  
##  1st Qu.:780.5   Class :character   1st Qu.: 5429912  
##  Median :828.0   Mode  :character   Median : 6283324  
##  Mean   :828.3                      Mean   : 6502475  
##  3rd Qu.:876.5                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730
ts_pwr <- ts(df_pwr[, "KWH"], start = c(1998, 1), frequency = 12)
ggtsdisplay(ts_pwr, points = FALSE, plot.type = "histogram") 

ggseasonplot(ts_pwr)

This shows clear seasonality within the data: there is a spike from roughly May through August (presumably from air-conditioning usage) and again around November and December, so the two peaks each year arrive roughly six months apart. There is one sharp dip in June of 2010, which could be because of a cold spell.
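A complementary view (a sketch): ggsubseriesplot() from the forecast package draws each calendar month's values side by side with its monthly mean, which makes the summer and winter peaks easier to compare.

ggsubseriesplot(ts_pwr)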
adf.test(ts_pwr)
## Warning in adf.test(ts_pwr): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_pwr
## Dickey-Fuller = -4.8347, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
components.ts_pwr = decompose(ts_pwr)
plot(components.ts_pwr)

#####The trend panel shows an overall upward movement in the data, and the random component further illustrates the outlier in June 2010. For this dataset, I will use an STL decomposition for forecasting the power usage in 2014. This is because the seasonal component is allowed to change over time, and the rate of change can be controlled by the user; the smoothness of the trend-cycle can also be controlled (both are sketched below). The decomposition can also be made robust to outliers, so occasional unusual observations will not affect the estimates of the trend-cycle and seasonal components.
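To illustrate the tuning mentioned above (a sketch; the window values are arbitrary choices of mine): s.window controls how quickly the seasonal pattern is allowed to change, t.window controls the smoothness of the trend-cycle, and both are typically odd numbers.

# sketch: an STL fit with explicit (arbitrary) seasonal and trend windows
plot(stl(ts_pwr, s.window = 13, t.window = 21, robust = TRUE))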

power_model <- stl(ts_pwr, s.window='periodic', robust=TRUE) 
power_model <- forecast(power_model, h = 12)   # monthly forecast for 2014
autoplot(power_model, 12)

#####Checking the model residuals and accuracy

checkresiduals(power_model)

## 
##  Ljung-Box test
## 
## data:  Residuals from STL +  ETS(M,N,N)
## Q* = 161.42, df = 22, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 24
The Ljung-Box p-value is very small, so we reject the null hypothesis that the residuals are independent: the model has not captured all of the autocorrelation in the data, and the prediction intervals should be read with some caution. Even so, I will use this model to produce the 2014 forecast (an alternative model is sketched below for comparison).
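For comparison (a sketch, not used for the exported forecast), a seasonal ARIMA chosen by auto.arima() with a Box-Cox transformation; its residual diagnostics can be checked against the STL + ETS model above.

# sketch: alternative seasonal ARIMA model for the power data
alt_pwr <- auto.arima(ts_pwr, lambda = BoxCox.lambda(ts_pwr))
checkresiduals(alt_pwr)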
write.csv(power_model,"Power_Forecasts.csv")