Project 1
Part A.
We are asked to forecast how much cash is taken out of 4 different ATMs for May 2010. We are given data in a single file, with the variable Cash provided in hundreds of dollars. Explain and demonstrate your process, the techniques used and not used, and your actual forecast.
atm.data <- readxl::read_excel("ATM624Data.xlsx")
head(atm.data)
## # A tibble: 6 x 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
Upon looking at the data, the first thing I noticed was that there are 4 different ATMs, which immediately led me to believe that I will need to subset by ATM at some point. I also noticed that the dates are not reading properly. I will go ahead and take care of that.
# Excel stores dates as serial numbers; converting with origin 1899-12-30 recovers the real dates
atm.data$DATE <- as.Date(atm.data$DATE, origin = "1899-12-30")
head(atm.data)
## # A tibble: 6 x 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-01 ATM2 107
## 3 2009-05-02 ATM1 82
## 4 2009-05-02 ATM2 89
## 5 2009-05-03 ATM1 85
## 6 2009-05-03 ATM2 90
tail(atm.data)
## # A tibble: 6 x 3
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2010-04-25 ATM4 542.
## 2 2010-04-26 ATM4 404.
## 3 2010-04-27 ATM4 13.7
## 4 2010-04-28 ATM4 348.
## 5 2010-04-29 ATM4 44.2
## 6 2010-04-30 ATM4 482.
We were able to get the dates to display properly, and we see that our data spans from May 1, 2009 until April 30, 2010. I'm going to go ahead and subset the data by ATM and compute some summary statistics.
library(dplyr)
# subset() is base R, so attach() isn't needed here (attach can mask variables and cause subtle bugs)
atm1.data <- subset(atm.data, ATM == "ATM1")
atm2.data <- subset(atm.data, ATM == "ATM2")
atm3.data <- subset(atm.data, ATM == "ATM3" & Cash > 0) # came back to add that I wanted the cash value to be greater than 0
atm4.data <- subset(atm.data, ATM == "ATM4")
# I found out about tsclean and came back to add it here. It will impute the few NA's we have as well as take care of the outliers.
library(forecast)
atm1.data <- tsclean(atm1.data$Cash, replace.missing = T)
atm2.data <- tsclean(atm2.data$Cash, replace.missing = T)
atm3.data <- tsclean(atm3.data$Cash, replace.missing = T)
atm4.data <- tsclean(atm4.data$Cash, replace.missing = T)
# tsclean() returned plain numeric vectors, so the DATE and ATM columns are already gone;
# no column removal is needed (indexing a vector with c(-1, -2) would instead drop the
# first two observations).
#convert to time series
atm1.data <- ts(atm1.data, start = c(2009, as.numeric(format(as.Date("2009-05-01"), "%j"))), frequency = 7)
atm2.data <- ts(atm2.data, start = c(2009, as.numeric(format(as.Date("2009-05-01"), "%j"))), frequency = 7)
atm3.data <- ts(atm3.data, start = c(2010, as.numeric(format(as.Date("2010-04-28"), "%j"))), end = c(2010, as.numeric(format(as.Date("2010-04-30"), "%j"))), frequency = 7) # when I examined the data, there were only three entries in this date range, which leads me to believe there were no earlier transactions
atm4.data <- ts(atm4.data, start = c(2009, as.numeric(format(as.Date("2009-05-01"), "%j"))), frequency = 7)
head(atm1.data)
## Time Series:
## Start = c(2026, 2)
## End = c(2026, 7)
## Frequency = 7
## [1] 85 90 99 88 8 104
head(atm2.data)
## Time Series:
## Start = c(2026, 2)
## End = c(2026, 7)
## Frequency = 7
## [1] 90 55 79 19 2 103
head(atm3.data)
## Time Series:
## Start = c(2026, 6)
## End = c(2027, 1)
## Frequency = 7
## [1] 85 85 85
head(atm4.data)
## Time Series:
## Start = c(2026, 2)
## End = c(2026, 7)
## Frequency = 7
## [1] 792.81136 908.23846 52.83210 52.20845 55.47361 558.50325
The ATM label and DATE columns dropped out when tsclean was applied to the Cash values, so each object now holds only the cash series for its ATM. I'll now take a look at the different summary statistics.
summary(atm1.data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 73.00 91.00 84.13 108.00 180.00
summary(atm2.data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 25.5 66.0 62.4 92.5 147.0
summary(atm3.data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 85 85 85 85 85 85
summary(atm4.data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.563 123.501 400.485 443.622 700.219 1712.075
At first glance, I don't see too many NA's: only ATM 1 and ATM 2 have any, with 3 and 2 respectively. My first instinct was to replace them with each ATM's mean, but tsclean handled that. I noticed right away that ATM 3 and ATM 4 look a little wonky. ATM 3 only has 3 non-zero entries (verified below), with the rest being 0's. ATM 4 looks as though it has a few outliers, and there are decimal transactions, which raises the question of how that is possible; perhaps it's a POS terminal and not an ATM? Here's a list of tasks:

+ verify NA's and impute the few that we have (no longer needed since we used tsclean; a quick check follows below)
+ plots to look at the outliers (also no longer an issue thanks to tsclean)
+ autoplot for all the ATMs, particularly ATM 3 and ATM 4
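As that quick check on the first task, here is a sketch that counts the missing Cash values per ATM, assuming atm.data still holds the raw three-column data read in above:
# count missing Cash values for each ATM in the raw data
sapply(split(atm.data$Cash, atm.data$ATM), function(x) sum(is.na(x)))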
library(fpp2)
autoplot(atm1.data)+labs(title = "ATM-1 WITHDRAWALS", subtitle = "5/1/2009 - 4/30/2010",
x = "Date",y = "Cash")
ggseasonplot(atm1.data)
ggseasonplot(atm1.data, polar = TRUE)
tsdisplay(atm1.data)
ATM 1 shows no trend but there is obvious seasonality, shown by the spikes at lags 7, 14, and 21 in the ACF plot.
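To make the day-of-week pattern easier to see, a subseries plot groups the observations by day; a sketch using ggsubseriesplot from the already-loaded forecast/fpp2 packages:
ggsubseriesplot(atm1.data) + labs(title = "ATM-1 by day of week", y = "Cash")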
autoplot(atm2.data)+labs(title = "ATM-2 WITHDRAWALS", subtitle = "5/1/2009 - 4/30/2010",
x = "Date",y = "Cash")
ggseasonplot(atm2.data)
ggseasonplot(atm2.data, polar = TRUE)
tsdisplay(atm2.data)
ATM 2 shows the same spikes at lags 7, 14, and 21. One thing I noticed and will try to come back and change is the 2030, 2040, etc. labels on the time axis of my plots.
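Those labels appear because the series starts at c(2009, 121) with frequency = 7, so every 7 observations advance the "year" by one and 2009 + 121/7 lands near 2026. A sketch of one possible fix, re-indexing by week purely for plotting (atm2.weekly is a hypothetical helper object, not used elsewhere):
atm2.weekly <- ts(as.numeric(atm2.data), start = c(1, 1), frequency = 7) # week 1, day 1
autoplot(atm2.weekly) + labs(title = "ATM-2 WITHDRAWALS", x = "Week", y = "Cash")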
#autoplot(atm3.data)+labs(title = "ATM-3 WITHDRAWALS", subtitle = "5/1/2009 - 4/30/2010", x = "Date",y = "Cash")
#ggseasonplot(atm3.data)
#ggseasonplot(atm3.data, polar = TRUE)
#tsdisplay(atm3.data)
I will not be forecasting ATM 3, as it only contains 3 points, which suggests to me that this is a newly installed ATM. I would wait at least 3 months and then attempt a forecast at that time. Imputing was not fruitful because there were only 3 entries.
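If a number for ATM 3 were absolutely required, about the only defensible option with 3 observations is a trivially simple method; a sketch (not part of my submitted forecasts):
meanf(atm3.data, h = 30)  # flat forecast at the mean of the 3 observed values
naive(atm3.data, h = 30)  # or simply carry the last observation forward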
autoplot(atm4.data)+labs(title = "ATM-4 WITHDRAWALS", subtitle = "5/1/2009 - 4/30/2010",
x = "Date",y = "Cash")
ggseasonplot(atm4.data)
ggseasonplot(atm4.data, polar = TRUE)
tsdisplay(atm4.data)
ATM 4 shows the same seasonality. The spike from the outlier is no longer present because we used tsclean. I wonder if it would be worth showing the before and after.
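To answer my own question, the before/after view is cheap to produce; a sketch that pulls the raw ATM 4 cash back out of atm.data and overlays the cleaned version:
atm4.raw <- ts(subset(atm.data, ATM == "ATM4")$Cash, frequency = 7)
autoplot(atm4.raw) +
  autolayer(tsclean(atm4.raw), series = "tsclean") +
  labs(title = "ATM-4 before and after tsclean", y = "Cash")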
I would like to take a look at the decomposed plots for each of the 3 ATMs we'll be forecasting, then check whether the data is stationary or has a deterministic or stochastic trend. This will help us determine which way to go with our modeling.
plot(decompose(atm1.data))
plot(decompose(atm2.data))
plot(decompose(atm4.data))
library(tseries)
adf.test(atm1.data)
##
## Augmented Dickey-Fuller Test
##
## data: atm1.data
## Dickey-Fuller = -4.5115, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
adf.test(atm2.data)
##
## Augmented Dickey-Fuller Test
##
## data: atm2.data
## Dickey-Fuller = -5.9039, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
adf.test(atm4.data)
##
## Augmented Dickey-Fuller Test
##
## data: atm4.data
## Dickey-Fuller = -5.636, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
The p-values are all below .05, which tells us we can reject the null hypothesis of a unit root and conclude that all 3 series are stationary. In other words, there is no stochastic trend to remove, so no differencing is required before modeling (a quick cross-check of that conclusion follows below). Now we can start building our models. First, let's find the lambda values we will use to Box-Cox transform our data. Once we have the lambdas, I'll split the data, fit a model on the training set, and then evaluate it on the test set.
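As the promised cross-check, forecast's ndiffs and nsdiffs estimate how many ordinary and seasonal differences a series needs; a sketch (values of 0 would agree with the ADF conclusion):
ndiffs(atm1.data)   # estimated number of first differences needed
nsdiffs(atm1.data)  # estimated number of seasonal differences needed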
lambda1 <- BoxCox.lambda(atm1.data)
lambda1 # see the optimal transformation (lambda) chosen
## [1] 0.2688212
atm1.tran <- BoxCox(atm1.data,lambda1)
lambda2 <- BoxCox.lambda(atm2.data)
lambda2 # see the optimal transformation (lambda) chosen
## [1] 0.7238199
atm2.tran <- BoxCox(atm2.data,lambda2)
lambda4 <- BoxCox.lambda(atm4.data)
lambda4 # see the optimal transformation (lambda) chosen
## [1] 0.4498458
atm4.tran <- BoxCox(atm4.data,lambda4)
library(CombMSC)
# each ATM has about a year of daily observations; hold out the last 58 days (~2 months) for testing
atm1.split <- splitTrainTest(atm1.tran, numTrain = length(atm1.tran) - 58)
atm2.split <- splitTrainTest(atm2.tran, numTrain = length(atm2.tran) - 58)
atm4.split <- splitTrainTest(atm4.tran, numTrain = length(atm4.tran) - 58)
atm1.test <- atm1.split$test
atm2.test <- atm2.split$test
atm4.test <- atm4.split$test
atm1.train <- atm1.split$train
atm2.train <- atm2.split$train
atm4.train <- atm4.split$train
Now that we have split our data, with about 85% in the training set and 15% in the test set, we will build our models.
(atm1.fit <- auto.arima(atm1.train, stepwise = F, approximation = F, d = 0))
## Series: atm1.train
## ARIMA(0,0,2)(1,1,1)[7]
##
## Coefficients:
## ma1 ma2 sar1 sma1
## 0.1115 -0.1128 0.1953 -0.8785
## s.e. 0.0578 0.0582 0.0787 0.0587
##
## sigma^2 estimated as 2.041: log likelihood=-531.16
## AIC=1072.31 AICc=1072.52 BIC=1090.8
\[ARIMA(0,0,2)(1,1,1)_7\] is the suggested model for ATM 1.
(atm2.fit <- auto.arima(atm2.train, stepwise = F, approximation = F, d = 0))
## Series: atm2.train
## ARIMA(1,0,3)(0,1,1)[7] with drift
##
## Coefficients:
## ar1 ma1 ma2 ma3 sma1 drift
## -0.6413 0.7005 -0.1109 -0.2039 -0.7705 -0.0254
## s.e. 0.2112 0.2117 0.0775 0.0619 0.0684 0.0149
##
## sigma^2 estimated as 73.35: log likelihood=-1063.15
## AIC=2140.3 AICc=2140.69 BIC=2166.18
\[ARIMA(1,0,3)(0,1,1)_7\] with drift of -0.0254 is the suggested model for ATM 2.
(atm4.fit <- auto.arima(atm4.train, stepwise = F, approximation = F, d = 0))
## Series: atm4.train
## ARIMA(0,0,0)(2,0,0)[7] with non-zero mean
##
## Coefficients:
## sar1 sar2 mean
## 0.2211 0.1318 28.9678
## s.e. 0.0575 0.0581 1.1847
##
## sigma^2 estimated as 187.2: log likelihood=-1229.52
## AIC=2467.04 AICc=2467.17 BIC=2481.92
\[ARIMA(0,0,0)(2,0,0)_7\] with non-zero mean is suggested for ATM 4.
Next, let's take a look at the residual ACF and PACF plots.
acf(residuals(atm1.fit), ylab = "ACF ATM 1")
pacf(residuals(atm1.fit), ylab = "PACF ATM 1")
acf(residuals(atm2.fit), ylab = "ACF ATM 2")
pacf(residuals(atm2.fit), ylab = "PACF ATM 2")
acf(residuals(atm4.fit), ylab = "ACF ATM 4")
pacf(residuals(atm4.fit), ylab = "PACF ATM 4")
The ACF and PACF plots look like white noise for all the ATMs, which I will confirm with the Ljung-Box test.
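As an aside, forecast's checkresiduals() bundles the residual time plot, ACF, histogram, and a Ljung-Box test into a single call; a sketch for ATM 1:
checkresiduals(atm1.fit)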
Box.test(residuals(atm1.fit), lag = 7, fitdf = sum(atm1.fit$arma[1:2]), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(atm1.fit)
## X-squared = 3.8097, df = 5, p-value = 0.5771
Box.test(residuals(atm2.fit), lag = 7, fitdf = sum(atm2.fit$arma[1:2]), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(atm2.fit)
## X-squared = 11.51, df = 3, p-value = 0.009265
Box.test(residuals(atm4.fit), lag = 7, fitdf = sum(atm4.fit$arma[1:2]), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(atm4.fit)
## X-squared = 6.1016, df = 7, p-value = 0.5279
The residuals for ATM 1 and ATM 4 look like white noise (p-values well above .05). ATM 2's Ljung-Box p-value of .009 suggests some autocorrelation remains, so its prediction intervals should be read with caution, but these models are good enough to forecast with.
forecast1 <- forecast(atm1.fit, h =30)
forecast2 <- forecast(atm2.fit, h =30)
forecast4 <- forecast(atm4.fit, h =30)
# overlay the held-out (transformed) test values in red; stats::lag shifts them to align with the forecast window
plot(forecast1, ylab = "ATM 1 CASH TAKEN OUT")
lines(stats::lag(atm1.test, -length(atm1.train)), col = "red")
plot(forecast2, ylab = "ATM 2 CASH TAKEN OUT")
lines(stats::lag(atm2.test, -length(atm2.train)), col = "red")
plot(forecast4, ylab = "ATM 4 CASH TAKEN OUT")
lines(stats::lag(atm4.test, -length(atm4.train)), col = "red")
The plots show the forecasts with the held-out test values overlaid in red. Lastly, let's check the accuracy of the models and export the forecasts into an Excel file.
round(accuracy(forecast1, atm1.test), 3) # second argument is the held-out test data
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.016 1.403 0.763 -Inf Inf 0.373 -0.001
## Test set 50.733 50.733 50.733 87.471 87.471 24.816 NA
round(accuracy(forecast2, atm2.test), 3)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.131 8.380 5.804 -Inf Inf 0.40 -0.007
## Test set 38.879 38.879 38.879 67.033 67.033 2.68 NA
round(accuracy(forecast4, atm4.test), 3)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.001 13.614 11.467 -75.669 102.228 0.719 0.08
## Test set 21.789 21.789 21.789 37.568 37.568 1.367 NA
Judging by the training errors, the ATM 1 model fits most tightly, though each series sits on its own Box-Cox scale, so the error magnitudes aren't directly comparable across ATMs.
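Since each series was transformed with its own lambda, a fairer comparison would back-transform everything to the original scale (hundreds of dollars). A sketch using InvBoxCox, aligning the 30 forecasts with the first 30 test observations (accuracy() accepts a numeric vector of forecasts):
h <- length(forecast1$mean)
accuracy(as.numeric(InvBoxCox(forecast1$mean, lambda1)),
         InvBoxCox(as.numeric(atm1.test)[1:h], lambda1))
accuracy(as.numeric(InvBoxCox(forecast2$mean, lambda2)),
         InvBoxCox(as.numeric(atm2.test)[1:h], lambda2))
accuracy(as.numeric(InvBoxCox(forecast4$mean, lambda4)),
         InvBoxCox(as.numeric(atm4.test)[1:h], lambda4))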
Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre1.8.0_201')
library(rJava)
library(xlsx)
# I got an error that my forecasts were not data frames, so I refit on the full series and
# created new forecast objects for the export. Note: Arima() with model = reuses the
# coefficients estimated on the transformed training data, here applied to the untransformed
# full series, so the exported point forecasts are approximate.
export.forecast1 <- forecast::forecast(forecast::Arima(atm1.data, model = atm1.fit), h = 30)
export.forecast2 <- forecast::forecast(forecast::Arima(atm2.data, model = atm2.fit), h = 30)
export.forecast4 <- forecast::forecast(forecast::Arima(atm4.data, model = atm4.fit), h = 30)
xlsx::write.xlsx(export.forecast1, file = "Predictive Analytics Project 1.xlsx", sheetName = "ATM 1", col.names = TRUE, row.names = TRUE, append = FALSE)
xlsx::write.xlsx(export.forecast2, file = "Predictive Analytics Project 1.xlsx", sheetName = "ATM 2", col.names = TRUE, row.names = TRUE, append = TRUE)
xlsx::write.xlsx(export.forecast4, file = "Predictive Analytics Project 1.xlsx", sheetName = "ATM 4", col.names = TRUE, row.names = TRUE, append = TRUE)
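For what it's worth, if the JAVA_HOME/rJava setup ever becomes a headache, openxlsx writes xlsx files without Java; a sketch under the assumption that openxlsx is installed:
# one sheet per named list element, no Java dependency
sheets <- list("ATM 1" = as.data.frame(export.forecast1),
               "ATM 2" = as.data.frame(export.forecast2),
               "ATM 4" = as.data.frame(export.forecast4))
openxlsx::write.xlsx(sheets, file = "Predictive Analytics Project 1.xlsx")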
Part B
Part B consists of a simple data set of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file; the variable 'KWH' is power consumption in kilowatt hours, and the rest is straightforward. Add this to your existing files above.
res.power <- readxl::read_excel("ResidentialCustomerForecastLoad.xlsx")
head(res.power)
## # A tibble: 6 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
I notice that the date is stored as a character value and needs to be converted.
res.power$`YYYY-MMM` <- as.Date(paste0(res.power$`YYYY-MMM`, "-01"), format = "%Y-%b-%d")
head(res.power)
## # A tibble: 6 x 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <date> <dbl>
## 1 733 1998-01-01 6862583
## 2 734 1998-02-01 5838198
## 3 735 1998-03-01 5420658
## 4 736 1998-04-01 5010364
## 5 737 1998-05-01 4665377
## 6 738 1998-06-01 6467147
Let’s do some summary statistics.
summary(res.power)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Min. :1998-01-01 Min. : 770523
## 1st Qu.:780.8 1st Qu.:2001-12-24 1st Qu.: 5429912
## Median :828.5 Median :2005-12-16 Median : 6283324
## Mean :828.5 Mean :2005-12-15 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.:2009-12-08 3rd Qu.: 7620524
## Max. :924.0 Max. :2013-12-01 Max. :10655730
## NA's :1
I notice that there is 1 NA value. Our date range is January 1, 1998 through December 1, 2013. Next, let's convert our data into a time series, use the tsclean function to handle the outlier and NA values, then complete some plots.
ts.res.power <- ts(tsclean(res.power$KWH, replace.missing = T), frequency = 12, start = c(1998, 1))
#manually replace outlier with mean
ts.res.power[ts.res.power == min(ts.res.power)] <- mean(ts.res.power[ts.res.power != min(ts.res.power)])
summary(ts.res.power)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4313019 5443502 6351262 6532836 7608792 10655730
autoplot(ts.res.power) + labs(title = "Residential Power Usage")
ggseasonplot(ts.res.power)
ggseasonplot(ts.res.power, polar = TRUE)
tsdisplay(ts.res.power)
plot(decompose(ts.res.power))
tsclean did not replace our outlier as I thought it would, so I went back and manually changed it to the average (the replacement is in the code above). The decomposition plot shows an upward trend. Once again, I'll use the Dickey-Fuller test to check for a deterministic or stochastic trend.
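For reference, forecast's tsoutliers() can locate the offending observation directly; a sketch run against the raw KWH series, with na.interp filling the single NA first (raw.power is a hypothetical helper object):
raw.power <- ts(res.power$KWH, frequency = 12, start = c(1998, 1))
tsoutliers(na.interp(raw.power)) # returns the $index of suspected outliers and suggested $replacements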
adf.test(ts.res.power)
##
## Augmented Dickey-Fuller Test
##
## data: ts.res.power
## Dickey-Fuller = -4.5454, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
Results are similar to our ATM results: the p-value is below .05, so the series has no unit root. I'm essentially going to follow the same steps from here.
lambda.respower <- BoxCox.lambda(ts.res.power)
lambda.respower
## [1] -0.2103898
ts.power.trans <- BoxCox(ts.res.power, lambda.respower)
ts.power.split <- splitTrainTest(ts.power.trans, numTrain = length(ts.power.trans) - 30)
ts.power.test <- ts.power.split$test
ts.power.train <- ts.power.split$train
(res.power.fit <- auto.arima(ts.power.train, stepwise = F, approximation = F, d = 0))
## Series: ts.power.train
## ARIMA(0,0,3)(2,1,0)[12] with drift
##
## Coefficients:
## ma1 ma2 ma3 sar1 sar2 drift
## 0.1894 0.0449 0.2198 -0.7747 -0.4487 0e+00
## s.e. 0.0812 0.0875 0.0806 0.0818 0.0823 1e-04
##
## sigma^2 estimated as 1.215e-05: log likelihood=645.91
## AIC=-1277.82 AICc=-1277.03 BIC=-1256.75
\[ARIMA(0,0,3)(2,1,0)_{12}\] with drift (estimated at essentially zero) is the suggested model.
acf(residuals(res.power.fit), ylab = "ACF Residential Power")
pacf(residuals(res.power.fit), ylab = "PACF Residential Power")
Looks like we have white noise. We’ll perform the Box-Ljung test to verify.
Box.test(residuals(res.power.fit), lag = 7, fitdf = sum(res.power.fit$arma[1:2]), type = "Ljung-Box")
##
## Box-Ljung test
##
## data: residuals(res.power.fit)
## X-squared = 9.7239, df = 4, p-value = 0.04534
The p-value here (.045) is just under .05, so some residual autocorrelation may remain; I'll proceed with the forecast but note that the prediction intervals may be slightly optimistic.
forecast.respower <- forecast(res.power.fit, h = 12)
plot(forecast.respower, ylab = "Residential Power")
lines(stats::lag(ts.power.test, -length(ts.power.train)), col = "red") # held-out test values in red
round(accuracy(forecast.respower, ts.power.test), 3) # second argument is the held-out test data
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.000 0.003 0.003 0.006 0.058 0.413 0.13
## Test set 25.415 25.415 25.415 84.717 84.717 3939.391 NA
export.forecastpower <- forecast::forecast(forecast::Arima(ts.res.power, model = res.power.fit), h = 12)
xlsx::write.xlsx(export.forecastpower, file = "Predictive Analytics Project 1.xlsx", sheetName = "Power", col.names = TRUE, row.names = TRUE, append = TRUE)
I'm currently out of steam, otherwise I would try my hand at the extra credit. I think once the differing time intervals were reconciled (a sketch of that step follows below), I'd be able to follow the same steps I took in the two previous parts.
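That reconciliation step might look something like this sketch, assuming a hypothetical data frame waterflow.data with Date.Time and WaterFlow columns, plus the lubridate package:
library(lubridate)
library(dplyr)
# collapse each pipe's irregular readings onto a common hourly grain
# (waterflow.data, Date.Time, and WaterFlow are hypothetical names)
hourly.flow <- waterflow.data %>%
  mutate(hour = floor_date(Date.Time, unit = "hour")) %>%
  group_by(hour) %>%
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE))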
References
+ https://stackoverflow.com/questions/33128865/starting-a-daily-time-series-in-r (getting a proper daily time series)
+ https://www.rdocumentation.org/packages/aTSA/versions/3.1.2/topics/adf.test (Dickey-Fuller test)
+ http://www.sthda.com/english/wiki/writing-data-from-r-to-excel-files-xls-xlsx