Predictive Analytics - Project #1

Oluwakemi Omotunde

2019-03-30

Project 1

Part A.

We are asked to forecast how much cash is taken out of four different ATMs for May 2010. We are given data in a single file with the variable Cash provided in hundreds of dollars. Explain and demonstrate your process, techniques used and not used, and your actual forecast.

atm.data <- readxl::read_excel("ATM624Data.xlsx")
head(atm.data)
## # A tibble: 6 x 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

Upon looking at the data, the first thing I noticed was that there are 4 different ATM's, which immediately led me to believe that I will need to subset by ATM at some point. I also noticed that the dates are not reading properly; they came in as Excel serial day counts. I will go ahead and take care of that.

#Excel stores dates as serial day counts; origin "1899-12-30" converts them correctly
atm.data$DATE <- as.Date(atm.data$DATE, origin = "1899-12-30")
head(atm.data)
## # A tibble: 6 x 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-05-01 ATM1     96
## 2 2009-05-01 ATM2    107
## 3 2009-05-02 ATM1     82
## 4 2009-05-02 ATM2     89
## 5 2009-05-03 ATM1     85
## 6 2009-05-03 ATM2     90
tail(atm.data)
## # A tibble: 6 x 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2010-04-25 ATM4  542. 
## 2 2010-04-26 ATM4  404. 
## 3 2010-04-27 ATM4   13.7
## 4 2010-04-28 ATM4  348. 
## 5 2010-04-29 ATM4   44.2
## 6 2010-04-30 ATM4  482.

We were able to get the date to display properly, and see that our data spans from May 1, 2009 until April 30, 2010. I'm going to go ahead and subset the data by ATM and do some summary statistics.

library(dplyr)
atm1.data <- subset(atm.data, ATM == "ATM1")
atm2.data <- subset(atm.data, ATM == "ATM2")
atm3.data <- subset(atm.data, ATM == "ATM3" & Cash >0) #came back to add that I wanted the cash value to be greater than 0.
atm4.data <- subset(atm.data, ATM == "ATM4")
#I found out about tsclean and came back to add it here. It will impute the few NA's we have as well as take care of the outliers that we have.
library(forecast)
atm1.data <- tsclean(atm1.data$Cash, replace.missing = T)
atm2.data <- tsclean(atm2.data$Cash, replace.missing = T)
atm3.data <- tsclean(atm3.data$Cash, replace.missing = T)
atm4.data <- tsclean(atm4.data$Cash, replace.missing = T)

#since tsclean was applied to the Cash column alone, each object is now just the cash series; the DATE and ATM columns are already gone, so no further column removal is needed

#convert to time series with a weekly (frequency = 7) seasonal period
atm1.data <- ts(atm1.data, start = c(2009, as.numeric(format(as.Date("2009-05-01"), "%j"))), frequency = 7)
atm2.data <- ts(atm2.data, start = c(2009, as.numeric(format(as.Date("2009-05-01"), "%j"))), frequency = 7)
atm3.data <- ts(atm3.data, start = c(2010, as.numeric(format(as.Date("2010-04-28"), "%j"))), end = c(2010, as.numeric(format(as.Date("2010-04-30"), "%j"))), frequency = 7) #when I examined the data, there were only three entries in this date range, which leads me to believe that there were no previous transactions
atm4.data <- ts(atm4.data, start = c(2009, as.numeric(format(as.Date("2009-05-01"), "%j"))), frequency = 7)

head(atm1.data)
## Time Series:
## Start = c(2026, 2) 
## End = c(2026, 7) 
## Frequency = 7 
## [1]  85  90  99  88   8 104
head(atm2.data)
## Time Series:
## Start = c(2026, 2) 
## End = c(2026, 7) 
## Frequency = 7 
## [1]  90  55  79  19   2 103
head(atm3.data)
## Time Series:
## Start = c(2026, 6) 
## End = c(2027, 1) 
## Frequency = 7 
## [1] 85 85 85
head(atm4.data)
## Time Series:
## Start = c(2026, 2) 
## End = c(2026, 7) 
## Frequency = 7 
## [1] 792.81136 908.23846  52.83210  52.20845  55.47361 558.50325

Since I subset by ATM and passed only the Cash column to tsclean, the ATM and DATE columns are no longer part of each series. I'll now take a look at the different summary statistics.

summary(atm1.data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   73.00   91.00   84.13  108.00  180.00
summary(atm2.data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    25.5    66.0    62.4    92.5   147.0
summary(atm3.data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      85      85      85      85      85      85
summary(atm4.data)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.563  123.501  400.485  443.622  700.219 1712.075

At first glance, I don't see too many NA's, with only ATM 1 and 2 having 3 and 2 respectively. Perhaps I will replace them with the mean of the ATM. I noticed right away that ATM 3 and 4 look a little wonky. ATM 3 only has 3 entries (verified below), with the rest being 0's. ATM 4 looks as though it has a few outliers, and there are decimal transactions, which raises the question of how that is possible. Perhaps it's a POS terminal and not an ATM? We will have to come up with a solution for the outliers. Here's a list of tasks (with a quick NA check sketched below):

+ verify NA's and impute the few that we have (there are no longer any NA's since we used tsclean)
+ plots to look at outliers (outliers are also no longer an issue with tsclean)
+ autoplot for all the ATM's, particularly ATM 3 and ATM 4
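A minimal sketch of that NA check on the raw data, using the dplyr we already loaded (counts come from the original file, before tsclean imputed anything):

#count missing values and observations per ATM in the raw data
atm.data %>%
  dplyr::group_by(ATM) %>%
  dplyr::summarise(n_missing = sum(is.na(Cash)), n_obs = dplyr::n())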

library(fpp2)
autoplot(atm1.data)+labs(title = "ATM-1 WITHDRAWALS", subtitle = "5/1/2009 - 4/30/2010",
       x = "Date",y = "Cash")

ggseasonplot(atm1.data)

ggseasonplot(atm1.data, polar = TRUE)

tsdisplay(atm1.data)

ATM 1 shows no trend but there is obvious seasonality, shown by the spikes at 7, 14, and 21 in the ACF plot.

autoplot(atm2.data)+labs(title = "ATM-2 WITHDRAWALS", subtitle = "5/1/2009 - 4/30/2010",
       x = "Date",y = "Cash")

ggseasonplot(atm2.data)

ggseasonplot(atm2.data, polar = TRUE)

tsdisplay(atm2.data)

ATM 2 shows the same spikes at lags 7, 14, and 21. One thing I notice, and will try to come back to fix, is the 2030, 2040, etc. labels on my plot: because I set frequency = 7 with a day-of-year start, the "year" axis actually counts weeks since the origin rather than calendar years. A possible fix is sketched below.
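One way to get readable axis labels (a sketch; it re-indexes by week number since the start of the series, leaving the values and the frequency-7 seasonality unchanged):

#re-index so the time axis counts weeks since the start of the data
atm2.weeks <- ts(as.numeric(atm2.data), start = c(1, 1), frequency = 7)
autoplot(atm2.weeks) + labs(title = "ATM-2 WITHDRAWALS", x = "Week", y = "Cash")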

#autoplot(atm3.data)+labs(title = "ATM-3 WITHDRAWALS", subtitle = "5/1/2009 - 4/30/2010", x = "Date",y = "Cash")
#ggseasonplot(atm3.data) 
#ggseasonplot(atm3.data, polar = TRUE)
#tsdisplay(atm3.data)

I will not be forecasting ATM 3 as it only contains 3 points, which indicates to me that this is a newly installed ATM. I would say to wait at least 3 months and then attempt to forecast at that time. Imputing the 3 entries was not fruitful because there were only 3 entries.

autoplot(atm4.data)+labs(title = "ATM-4 WITHDRAWALS", subtitle = "5/1/2009 - 4/30/2010",
       x = "Date",y = "Cash")

ggseasonplot(atm4.data)

ggseasonplot(atm4.data, polar = TRUE)

tsdisplay(atm4.data)

ATM 4 shows the same seasonality information. The spike from the outlier is no longer present because we used tsclean. I wonder if it would be worth it to show the before and after.

I would like to take a look at the decomposed plots for each of the 3 ATMs we'll be forecasting, then check whether each series is stationary or contains a stochastic trend. This will help us determine which way to go with our modeling; the regression behind the Dickey-Fuller test is written out after the decomposition plots.

plot(decompose(atm1.data))

plot(decompose(atm2.data))

plot(decompose(atm4.data))
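For reference, the Augmented Dickey-Fuller test that follows fits a regression of the form

\[\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \sum_{i=1}^{k} \delta_i \Delta y_{t-i} + \varepsilon_t,\]

with the null hypothesis \(\gamma = 0\) (a unit root, i.e. a stochastic trend) against the alternative \(\gamma < 0\) (stationary).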

library(tseries)
adf.test(atm1.data)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  atm1.data
## Dickey-Fuller = -4.5115, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
adf.test(atm2.data)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  atm2.data
## Dickey-Fuller = -5.9039, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
adf.test(atm4.data)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  atm4.data
## Dickey-Fuller = -5.636, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

The p-values are all below .05, which tells us that we can reject the null hypothesis of a unit root and accept the alternative: the three series are stationary, so no non-seasonal differencing is required before modeling. Now we can start building our models. First let's see if we can find lambda values that we will use to transform our data and stabilize the variance. Once we have the lambda values, I'll go ahead and split my data, fit a model on the training set, then evaluate it on the test set.
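As a reminder, the BoxCox() function applies the transformation

\[w_t = \begin{cases} \ln(y_t) & \text{if } \lambda = 0,\\ (y_t^{\lambda} - 1)/\lambda & \text{otherwise,} \end{cases}\]

so the lambda values printed below describe how strongly each series gets compressed.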

lambda1 <- BoxCox.lambda(atm1.data)
lambda1   # the optimal lambda chosen for ATM 1
## [1] 0.2688212
atm1.tran <- BoxCox(atm1.data,lambda1)

lambda2 <- BoxCox.lambda(atm2.data)
lambda2   # the optimal lambda chosen for ATM 2
## [1] 0.7238199
atm2.tran <- BoxCox(atm2.data,lambda2)

lambda4 <- BoxCox.lambda(atm4.data)
lambda4   # the optimal lambda chosen for ATM 4
## [1] 0.4498458
atm4.tran <- BoxCox(atm4.data,lambda4)
library(CombMSC)
#each ATM series has 365 daily observations; hold out the last 58 (about 15%) for testing
atm1.split <- splitTrainTest(atm1.tran, numTrain = length(atm1.tran) - 58)
atm2.split <- splitTrainTest(atm2.tran, numTrain = length(atm2.tran) - 58)
atm4.split <- splitTrainTest(atm4.tran, numTrain = length(atm4.tran) - 58)

atm1.test <- atm1.split$test
atm2.test <- atm2.split$test
atm4.test <- atm4.split$test

atm1.train <- atm1.split$train
atm2.train <- atm2.split$train
atm4.train <- atm4.split$train

Now that we have split our data, with about 85% in the training set and 15% in the test set, we will build our models.
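If CombMSC weren't available, the same split could be sketched in base R (note that head and tail drop the ts attributes here, so the pieces would need re-wrapping with ts() before modeling):

#keep the last 58 points for testing, everything before them for training
n.test <- 58
atm1.train.alt <- head(atm1.tran, length(atm1.tran) - n.test)
atm1.test.alt  <- tail(atm1.tran, n.test)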

(atm1.fit <- auto.arima(atm1.train, stepwise = F, approximation = F, d = 0))
## Series: atm1.train 
## ARIMA(0,0,2)(1,1,1)[7] 
## 
## Coefficients:
##          ma1      ma2    sar1     sma1
##       0.1115  -0.1128  0.1953  -0.8785
## s.e.  0.0578   0.0582  0.0787   0.0587
## 
## sigma^2 estimated as 2.041:  log likelihood=-531.16
## AIC=1072.31   AICc=1072.52   BIC=1090.8

\[ARIMA(0,0,2)(1,1,1)_7\] is the suggested model for ATM 1; its backshift form is written out below.
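For concreteness, this model in backshift-operator notation (with \(B\) the backshift operator and \(\varepsilon_t\) white noise) is

\[(1 - \Phi_1 B^7)(1 - B^7)\,y_t = (1 + \theta_1 B + \theta_2 B^2)(1 + \Theta_1 B^7)\,\varepsilon_t,\]

with the fitted values \(\theta_1 = 0.1115\), \(\theta_2 = -0.1128\), \(\Phi_1 = 0.1953\), and \(\Theta_1 = -0.8785\) from the output above.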

(atm2.fit <- auto.arima(atm2.train, stepwise = F, approximation = F, d = 0))
## Series: atm2.train 
## ARIMA(1,0,3)(0,1,1)[7] with drift 
## 
## Coefficients:
##           ar1     ma1      ma2      ma3     sma1    drift
##       -0.6413  0.7005  -0.1109  -0.2039  -0.7705  -0.0254
## s.e.   0.2112  0.2117   0.0775   0.0619   0.0684   0.0149
## 
## sigma^2 estimated as 73.35:  log likelihood=-1063.15
## AIC=2140.3   AICc=2140.69   BIC=2166.18

\[ARIMA(1,0,3)(0,1,1)_7\] with drift \(-0.0254\) is the suggested model for ATM 2.

(atm4.fit <- auto.arima(atm4.train, stepwise = F, approximation = F, d = 0))
## Series: atm4.train 
## ARIMA(0,0,0)(2,0,0)[7] with non-zero mean 
## 
## Coefficients:
##         sar1    sar2     mean
##       0.2211  0.1318  28.9678
## s.e.  0.0575  0.0581   1.1847
## 
## sigma^2 estimated as 187.2:  log likelihood=-1229.52
## AIC=2467.04   AICc=2467.17   BIC=2481.92

\[ARIMA(0,0,0)(2,0,0)_7\] with non-zero mean is suggested for ATM 4.

Next, let's take a look at the residual ACF and PACF plots.

acf(residuals(atm1.fit), ylab = "ACF ATM 1")

pacf(residuals(atm1.fit), ylab = "PACF ATM 1")

acf(residuals(atm2.fit), ylab = "ACF ATM 2")

pacf(residuals(atm2.fit), ylab = "PACF ATM 2")

acf(residuals(atm4.fit), ylab = "ACF ATM 4")

pacf(residuals(atm4.fit), ylab = "PACF ATM 4")

The ACF and PACF plots look like white noise for all three ATMs, which I will confirm with the Ljung-Box test; the test statistic is given below.
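For reference, the Ljung-Box statistic computed by Box.test is

\[Q = n(n+2)\sum_{k=1}^{h}\frac{\hat{\rho}_k^2}{n-k},\]

which is compared against a \(\chi^2\) distribution whose degrees of freedom are reduced by fitdf, the number of AR and MA parameters the model estimated.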

Box.test(residuals(atm1.fit), lag = 7, fitdf = sum(atm1.fit$arma[1:2]), type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(atm1.fit)
## X-squared = 3.8097, df = 5, p-value = 0.5771
Box.test(residuals(atm2.fit), lag = 7, fitdf = sum(atm2.fit$arma[1:2]), type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(atm2.fit)
## X-squared = 11.51, df = 3, p-value = 0.009265
Box.test(residuals(atm4.fit), lag = 7, fitdf = sum(atm4.fit$arma[1:2]), type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(atm4.fit)
## X-squared = 6.1016, df = 7, p-value = 0.5279

The residuals for ATM 1 and ATM 4 look like white noise (p-values well above .05). ATM 2's p-value of 0.009 suggests some autocorrelation remains, so that model is imperfect, but I will proceed with it and keep the caveat in mind. We can now use these models to forecast.
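As an aside, forecast::checkresiduals bundles the residual time plot, the ACF, and the Ljung-Box test (with the degrees-of-freedom adjustment filled in automatically) into one call; a minimal sketch:

#one-call residual diagnostics for each fitted model
checkresiduals(atm1.fit)
checkresiduals(atm2.fit)
checkresiduals(atm4.fit)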

forecast1 <- forecast(atm1.fit, h = 30)
forecast2 <- forecast(atm2.fit, h = 30)
forecast4 <- forecast(atm4.fit, h = 30)
plot(forecast1, ylab = "ATM 1 CASH TAKEN OUT");
lines(stats::lag(atm1.test, -length(atm1.train)), col = "red")

plot(forecast2, ylab = "ATM 2 CASH TAKEN OUT");
lines(stats::lag(atm2.test, -length(atm2.train)), col = "red")

plot(forecast4, ylab = "ATM 4 CASH TAKEN OUT");
lines(stats::lag(atm4.test, -length(atm4.train)), col = "red")

The red lines in the plots show the held-out test data against each forecast. Lastly, let's check the accuracy of the models and export the forecasts to an Excel file.

round(accuracy(forecast1, atm1.test), 3)
##                  ME   RMSE    MAE    MPE   MAPE   MASE   ACF1
## Training set  0.016  1.403  0.763   -Inf    Inf  0.373 -0.001
## Test set     50.733 50.733 50.733 87.471 87.471 24.816     NA
round(accuracy(forecast2, atm2.test), 3)
##                  ME   RMSE    MAE    MPE   MAPE MASE   ACF1
## Training set  0.131  8.380  5.804   -Inf    Inf 0.40 -0.007
## Test set     38.879 38.879 38.879 67.033 67.033 2.68     NA
round(accuracy(forecast4, atm4.test), 3)
##                  ME   RMSE    MAE     MPE    MAPE  MASE ACF1
## Training set -0.001 13.614 11.467 -75.669 102.228 0.719 0.08
## Test set     21.789 21.789 21.789  37.568  37.568 1.367   NA

Judging by the training-set errors, the ATM 1 model was the most accurate and the ATM 4 model the worst.

Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre1.8.0_201') 
library(rJava)
library(xlsx)

#I got an error that my forecasts were not data frames, so I re-fit each model on the full series to create new forecast objects for the export. Passing lambda makes Arima transform the full series the same way the training data was transformed, and forecast() then back-transforms the predictions to the original scale.

export.forecast1 <- forecast::forecast(forecast::Arima(atm1.data, model = atm1.fit, lambda = lambda1), h = 30)
export.forecast2 <- forecast::forecast(forecast::Arima(atm2.data, model = atm2.fit, lambda = lambda2), h = 30)
export.forecast4 <- forecast::forecast(forecast::Arima(atm4.data, model = atm4.fit, lambda = lambda4), h = 30)


xlsx::write.xlsx(export.forecast1, file = "Predictive Analytics Project 1.xlsx", sheetName = "ATM 1", col.names = TRUE, row.names = TRUE, append = FALSE)

xlsx::write.xlsx(export.forecast2, file = "Predictive Analytics Project 1.xlsx", sheetName = "ATM 2", col.names = TRUE, row.names = TRUE, append = TRUE)

xlsx::write.xlsx(export.forecast4, file = "Predictive Analytics Project 1.xlsx", sheetName = "ATM 4", col.names = TRUE, row.names = TRUE, append = TRUE)

Part B

Part B consists of a simple data set of residential power usage from January 1998 through December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data are given in a single file. The variable 'KWH' is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

res.power <- readxl::read_excel("ResidentialCustomerForecastLoad.xlsx")
head(res.power)
## # A tibble: 6 x 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147

I notice that the `YYYY-MMM` date column was read as a character value and needs to be converted.

res.power$`YYYY-MMM` <- as.Date(paste0(res.power$`YYYY-MMM`, "-01"), format = "%Y-%b-%d") # %b is locale-dependent; this assumes an English locale
head(res.power)
## # A tibble: 6 x 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <date>       <dbl>
## 1          733 1998-01-01 6862583
## 2          734 1998-02-01 5838198
## 3          735 1998-03-01 5420658
## 4          736 1998-04-01 5010364
## 5          737 1998-05-01 4665377
## 6          738 1998-06-01 6467147

Let’s do some summary statistics.

summary(res.power)
##   CaseSequence      YYYY-MMM               KWH          
##  Min.   :733.0   Min.   :1998-01-01   Min.   :  770523  
##  1st Qu.:780.8   1st Qu.:2001-12-24   1st Qu.: 5429912  
##  Median :828.5   Median :2005-12-16   Median : 6283324  
##  Mean   :828.5   Mean   :2005-12-15   Mean   : 6502475  
##  3rd Qu.:876.2   3rd Qu.:2009-12-08   3rd Qu.: 7620524  
##  Max.   :924.0   Max.   :2013-12-01   Max.   :10655730  
##                                       NA's   :1

I notice that there is 1 NA value. Our date range is January 1, 1998 through December 1, 2013. Next, let's convert our data into a time series, use the tsclean function to replace the NA value, manually handle the remaining outlier, then complete some plots.

ts.res.power <- ts(tsclean(res.power$KWH, replace.missing = T), frequency = 12, start = c(1998, 1))
#manually replace outlier with mean
ts.res.power[ts.res.power == min(ts.res.power)] <- mean(ts.res.power[ts.res.power != min(ts.res.power)])
summary(ts.res.power)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  4313019  5443502  6351262  6532836  7608792 10655730
autoplot(ts.res.power) + labs(title = "Residential Power Usage")

ggseasonplot(ts.res.power)

ggseasonplot(ts.res.power, polar = TRUE)

tsdisplay(ts.res.power)

plot(decompose(ts.res.power))

It does not look as though tsclean replaced our low outlier like I thought it would, so I went back and manually changed it to the average in the code above (a quick before/after check is sketched below). The decomposition plot shows an upward trend. Once again, I'll use the Dickey-Fuller test to check for stationarity.
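A minimal sketch to confirm the replacement took, comparing the raw minimum to the cleaned one:

#the raw series bottoms out at the 770523 outlier; the cleaned series should not
min(res.power$KWH, na.rm = TRUE)
min(ts.res.power)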

adf.test(ts.res.power)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts.res.power
## Dickey-Fuller = -4.5454, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

Results are similar to our ATM results: the series is stationary. I'm essentially going to repeat the same steps from here; a sketch of how they could be wrapped into a reusable helper follows.
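Since the pipeline is identical, the repeated steps could be collected into a small helper rather than copied and pasted. A sketch (fit.ts is a hypothetical name, not part of the original code):

#transform, split, and fit one series; returns everything needed downstream
fit.ts <- function(y, n.test) {
  lambda <- BoxCox.lambda(y)
  y.tran <- BoxCox(y, lambda)
  split  <- splitTrainTest(y.tran, numTrain = length(y.tran) - n.test)
  fit    <- auto.arima(split$train, stepwise = FALSE, approximation = FALSE, d = 0)
  list(lambda = lambda, fit = fit, train = split$train, test = split$test)
}
power.pipeline <- fit.ts(ts.res.power, n.test = 30)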

lambda.respower <- BoxCox.lambda(ts.res.power)
lambda.respower
## [1] -0.2103898
ts.power.trans <- BoxCox(ts.res.power, lambda.respower)
ts.power.split <- splitTrainTest(ts.power.trans, numTrain = length(ts.power.trans) - 30)

ts.power.test <- ts.power.split$test

ts.power.train <- ts.power.split$train
(res.power.fit <- auto.arima(ts.power.train, stepwise = F, approximation = F, d = 0))
## Series: ts.power.train 
## ARIMA(0,0,3)(2,1,0)[12] with drift 
## 
## Coefficients:
##          ma1     ma2     ma3     sar1     sar2  drift
##       0.1894  0.0449  0.2198  -0.7747  -0.4487  0e+00
## s.e.  0.0812  0.0875  0.0806   0.0818   0.0823  1e-04
## 
## sigma^2 estimated as 1.215e-05:  log likelihood=645.91
## AIC=-1277.82   AICc=-1277.03   BIC=-1256.75

\[ARIMA(0,0,3)(2,1,0)_{12}\] with drift is the suggested model.

acf(residuals(res.power.fit), ylab = "ACF Residential Power")

pacf(residuals(res.power.fit), ylab = "PACF Residential Power")

Looks like we have white noise. We'll perform the Ljung-Box test to verify; note below that the p-value of 0.045 falls marginally under .05, so a little autocorrelation may remain, but I will proceed with this model.

Box.test(residuals(res.power.fit), lag = 7, fitdf = sum(res.power.fit$arma[1:2]), type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(res.power.fit)
## X-squared = 9.7239, df = 4, p-value = 0.04534
forecast.respower <- forecast(res.power.fit, h = 12)
plot(forecast.respower, ylab = "Residential Power");
lines(stats::lag(ts.power.test, -length(ts.power.train)), col = "red")

round(accuracy(forecast.respower, ts.power.test), 3)
##                  ME   RMSE    MAE    MPE   MAPE     MASE ACF1
## Training set  0.000  0.003  0.003  0.006  0.058    0.413 0.13
## Test set     25.415 25.415 25.415 84.717 84.717 3939.391   NA
export.forecastpower <- forecast::forecast(forecast::Arima(ts.res.power, model = res.power.fit, lambda = lambda.respower), h = 12) # as with the ATMs, lambda re-applies the Box-Cox transform and back-transforms the forecasts

xlsx::write.xlsx(export.forecastpower, file = "Predictive Analytics Project 1.xlsx", sheetName = "Power", col.name = T, row.names = T, append = T)

I'm currently out of steam, otherwise I would try my hand at the extra credit. I think once the irregular time stamps were taken care of, I'd be able to follow the same steps I took for the two previous parts.