library(fpp3)
## -- Attaching packages -------------------------------------------- fpp3 0.4.0 --
## v tibble 3.1.4 v tsibble 1.0.1
## v dplyr 1.0.7 v tsibbledata 0.3.0
## v tidyr 1.1.3 v feasts 0.2.2
## v lubridate 1.7.10 v fable 0.3.1
## v ggplot2 3.3.5
## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval() masks lubridate::interval()
## x dplyr::lag() masks stats::lag()
## x tsibble::setdiff() masks base::setdiff()
## x tsibble::union() masks base::union()
library(dplyr)
library(latex2exp)
library(lubridate)
library(ggplot2)
library(fabletools)
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
rm(list=ls())
ATM_loc<-"C:/Users/Lisa/OneDrive/CUNY/624/Project/ATM624Data.csv"
ATM<-read.table(file=ATM_loc,header=TRUE, sep=",")
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
str(ATM)
## 'data.frame': 1474 obs. of 3 variables:
## $ DATE: chr "05/01/09" "05/01/09" "05/02/09" "05/02/09" ...
## $ ATM : chr "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: int 96 107 82 89 85 90 90 55 99 79 ...
Convert the date to date format
ATM$DATE <- as.Date(ATM$DATE,"%m/%d/%y")
str(ATM)
## 'data.frame': 1474 obs. of 3 variables:
## $ DATE: Date, format: "2009-05-01" "2009-05-01" ...
## $ ATM : chr "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: int 96 107 82 89 85 90 90 55 99 79 ...
Let’s take a quick look at the four ATMs:
ATM %>%
ggplot(aes(x=DATE, y=Cash, colour=ATM))+
geom_line() +
facet_grid(ATM~., scales="free_y")
## Warning: Removed 14 row(s) containing missing values (geom_path).
A visual inspection reveals some missing values that should be dealt with prior to further analysis.
The ATMs should be analyzed as 4 separate time series, because withdrawal behavior differs from machine to machine.
The location and convenience of each ATM will differ. From the plot, there are a couple of striking issues that need to be addressed.
ATM1 and ATM2 would be good candidates to model using ARIMA or SNAIVE.
ATM3 shows activity only on the last few dates, suggesting it was just put into service. If possible, I would confirm this with the data source.
ATM4 has one huge day of withdrawals on 2/9/10, on the order of $1,000,000. This suggests an error; it is unlikely that the ATM could hold, let alone dispense, that much cash. If possible, I would confirm this with the data source; failing that, an educated guess would be to impute the erroneous value with the mean or a similar method.
Create four separate time series datasets for ATM1, ATM2, ATM3, and ATM4.
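A sketch of an alternative (not the approach taken below): keeping all four machines in a single tsibble keyed by ATM lets them be plotted and modelled together. This assumes at most one row per ATM per date.
ATM_key <- ATM %>%
  filter(ATM %in% c("ATM1", "ATM2", "ATM3", "ATM4")) %>%   #drop any rows without an ATM label
  as_tsibble(index = DATE, key = ATM)
ATM_key %>%
  autoplot(Cash) +
  facet_grid(ATM ~ ., scales = "free_y")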
Let’s begin with ATM1
ATM1<-ATM %>%
filter (ATM=="ATM1") %>%
as_tsibble(index=DATE)
ATM1 %>%
autoplot()+labs(title="ATM1", subtitle="Cash withdrawals by day", y="$ 00's")
## Plot variable not specified, automatically selected `.vars = Cash`
sum(is.na(ATM1$Cash))
## [1] 3
md.pattern(ATM1)
## DATE ATM Cash
## 362 1 1 1 0
## 3 1 1 0 1
## 0 0 3 3
gg_season(ATM1, y=Cash, period=7)
## Warning: Removed 1 row(s) containing missing values (geom_path).
There are 3 missing values. The plot above suggests 7-day seasonality with pattern changes.
Notice that the green weeks show no activity on a Tuesday, while later weeks show no activity on a Thursday. This suggests the ATM is probably taken out of service on certain days, and that the service schedule changed over time.
Because there are only 3 missing values, let’s look at each missing date and check the values one week before and one week after. If the two values are similar, that reassures us that the change in the service schedule did not occur at that time.
Then we can average the two values and impute the result into the missing date.
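As a sketch of the rule just described (a hypothetical helper, not used in the manual steps below): impute a missing Cash value with the average of the observations 7 days before and 7 days after.
#Hypothetical helper: average of the values one week before and one week after a missing date
impute_week_avg <- function(ts, missing_date) {
  before <- ts$Cash[ts$DATE == missing_date - 7]
  after  <- ts$Cash[ts$DATE == missing_date + 7]
  mean(c(before, after), na.rm = TRUE)
}
#e.g. impute_week_avg(ATM1, as.Date("2009-06-13")) should return (96 + 110)/2 = 103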
#Missing Values
M1<-ATM1 %>%
filter(is.na(Cash))
M1
## # A tsibble: 3 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
#Let's deal with the 1st missing value at 6/13/09
#Go back 1 week and forward 1 week
M2<-ATM1 %>%
filter (DATE==as.Date("2009-06-06") | DATE==as.Date("2009-06-20"))
M2
## # A tsibble: 2 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-06 ATM1 96
## 2 2009-06-20 ATM1 110
#Let's average these two and impute into 6/13/09
(110+96)/2
## [1] 103
#103.
#get row index
which ((ATM1$DATE==as.Date("2009-06-13")))
## [1] 44
#This says row 44 so impute 103 into Cash
ATM1[44,]
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-13 ATM1 NA
ATM1[44,3]<-103
ATM1[44,]
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-13 ATM1 103
#####################################################
###################################################
# Moving on to the second missing value
###############################################
M1<-ATM1 %>%
filter(is.na(Cash))
M1
## # A tsibble: 2 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-16 ATM1 NA
## 2 2009-06-22 ATM1 NA
# Let's calculate a value for 6/16/09 looking back 1 week and forward 1 week.
M2<-ATM1 %>%
filter (DATE==as.Date("2009-06-09") | DATE==as.Date("2009-06-23"))
(145+108)/2
## [1] 126.5
#Use 127 as imputed value for 6/16/09
#get row index
which ((ATM1$DATE==as.Date("2009-06-16")))
## [1] 47
#This says row 47 so impute 127 into Cash
ATM1[47,]
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-16 ATM1 NA
ATM1[47,3]<-127
ATM1[47,]
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-16 ATM1 127
##################################
# Moving on to the third missing value
####################################
M1<-ATM1 %>%
filter(is.na(Cash))
M1
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-22 ATM1 NA
# Let's calculate a value for 6/22/09 looking back 1 week and forward 1 week.
M2<-ATM1 %>%
filter (DATE==as.Date("2009-06-15") | DATE==as.Date("2009-06-29"))
M2
## # A tsibble: 2 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-15 ATM1 106
## 2 2009-06-29 ATM1 98
(106+98)/2
## [1] 102
#Use 102 as imputed value for 6/22/09
#get row index
which ((ATM1$DATE==as.Date("2009-06-22")))
## [1] 53
#This says row 53 so impute 102 into Cash
ATM1[53,]
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-22 ATM1 NA
ATM1[53,3]<-102
ATM1[53,]
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-22 ATM1 102
#double check all missing gone
sum(is.na(ATM1$Cash))
## [1] 0
Let’s take a closer look at ATM1:
ATM1%>%
autoplot(Cash)
Transform the data to reduce the variability
lambda1<-ATM1%>%
features(Cash,features=guerrero)%>%
pull(lambda_guerrero)
lambda1
## [1] 0.3551715
ATM1<-ATM1%>% mutate(newCash=(((Cash^lambda1)-1)/lambda1))
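#Sketch of a sanity check: for positive values and lambda != 0, fabletools' built-in
#box_cox() computes the same (x^lambda - 1)/lambda formula used manually above.
all.equal(ATM1$newCash, box_cox(ATM1$Cash, lambda1))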
ATM1%>%
autoplot(box_cox(Cash, lambda1))+
labs(y="",
title=latex2exp::TeX(paste0(
"Transformed Cash with $$\\lambda1$ =",
round(lambda1,2))))
lambda1
## [1] 0.3551715
ATM1%>%autoplot(newCash)
Let’s do a decomposition:
ATM1%>%
model(classical_decomposition(newCash, type="additive")) %>%
components () %>%
autoplot()+ labs(title="Classical additive decomposition of ATM1")
## Warning: Removed 3 row(s) containing missing values (geom_path).
We see a strong seasonality component.
Take a closer look:
#test if uncorrelated with itself.
ATM1 %>%
features(newCash,ljung_box)
## # A tibble: 1 x 2
## lb_stat lb_pvalue
## <dbl> <dbl>
## 1 0.217 0.642
#do not need differencing
ATM1 %>% features(newCash, unitroot_ndiffs)
## # A tibble: 1 x 1
## ndiffs
## <int>
## 1 0
#this confirms no differencing needed
#lets check seasonal differencing
ATM1 %>% features(newCash, unitroot_nsdiffs)
## # A tibble: 1 x 1
## nsdiffs
## <int>
## 1 1
#we need to take 1st seasonal difference
ATM1<-ATM1 %>%
mutate(diff_Cash = (difference(newCash,7)))
ATM1%>% autoplot(diff_Cash)
## Warning: Removed 7 row(s) containing missing values (geom_path).
ATM1%>% ACF((diff_Cash))%>% autoplot()
ATM1%>% PACF((diff_Cash))%>% autoplot()
Our series has seasonality. We took the first seasonal difference.
Let’s fit some models.
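A sketch of an alternative workflow (not the one used below): all three candidates can be fit in a single model() call, which makes side-by-side comparison with glance() a one-liner. Note that ARIMA(newCash) would let fable choose its own differencing, unlike the pre-differenced diff_Cash series used further down.
#Sketch: fit the three candidate models together and compare them in one table
fitATM1_all <- ATM1 %>%
  model(
    snaive = SNAIVE(newCash ~ lag(7)),
    ets    = ETS(newCash),
    arima  = ARIMA(newCash)
  )
glance(fitATM1_all)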
###########################################
#Fit SNAIVE model
##########################################
fitATM1_2<-ATM1 %>%
model(SNAIVE(newCash~lag(7)))
report(fitATM1_2)
## Series: newCash
## Model: SNAIVE
##
## sigma^2: 4.6893
##SNAIVE Residual SD=sqrt(4.6893)=2.1655
###########################################
#Fit ETS model
##########################################
fitATM1_3<-ATM1 %>%
model(ETS(newCash))
report(fitATM1_3)
## Series: newCash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.0001000197
## gamma = 0.3586853
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 10.06064 -6.329706 0.7406964 1.802599 0.7464719 1.001276 0.6964825 1.34218
##
## sigma^2: 3.4407
##
## AIC AICc BIC
## 2615.372 2615.993 2654.371
#ETS(A,N,A) Residual SD=sqrt(3.4407)=1.854912
#AICc=2616
###########################################
#Fit ARIMA model
##########################################
fitATM1_1<-ATM1 %>%
model(ARIMA(diff_Cash))
report(fitATM1_1)
## Series: diff_Cash
## Model: ARIMA(2,0,0)(2,0,0)[7]
##
## Coefficients:
## ar1 ar2 sar1 sar2
## 0.1062 -0.0935 -0.5162 -0.2291
## s.e. 0.0528 0.0525 0.0512 0.0509
##
## sigma^2 estimated as 3.516: log likelihood=-735.59
## AIC=1481.19 AICc=1481.35 BIC=1500.69
#ARIMA(2,0,0)(2,0,0) Residual SD=sqrt(3.516)=1.8751
# AICc=1481
glance(fitATM1_2)
## # A tibble: 1 x 2
## .model sigma2
## <chr> <dbl>
## 1 SNAIVE(newCash ~ lag(7)) 4.69
glance(fitATM1_1)
## # A tibble: 1 x 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ARIMA(diff_Cash) 3.52 -736. 1481. 1481. 1501. <cpl [16]> <cpl [0]>
glance(fitATM1_3)
## # A tibble: 1 x 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(newCash) 3.44 -1298. 2615. 2616. 2654. 3.36 3.37 1.03
Comparing the sigma^2 and AICc, the ARIMA model is superior.
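One caveat on this comparison: the ARIMA model is fit to the pre-differenced diff_Cash series while ETS is fit to newCash, so their AICc values are not strictly comparable, and the ARIMA forecasts are produced on the differenced Box-Cox scale rather than in hundreds of dollars. As a sketch only (not the approach used in this report), placing the transformation inside the model formula lets fable choose the differencing and back-transform the forecasts to the original Cash scale automatically:
#Sketch: fable inverts box_cox() in the forecasts because it appears in the model formula
fitATM1_bc <- ATM1 %>%
  model(ARIMA(box_cox(Cash, lambda1)))
fitATM1_bc %>%
  forecast(h = 31) %>%
  autoplot(ATM1)   #forecasts shown in the original $00's units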
Take a look at the residuals and ACF.
fitATM1_1 %>%
gg_tsresiduals()
## Warning: Removed 7 row(s) containing missing values (geom_path).
## Warning: Removed 7 rows containing missing values (geom_point).
## Warning: Removed 7 rows containing non-finite values (stat_bin).
The residual time plot, ACF, and histogram look satisfactory.
Now forecast for ATM1
forecast(fitATM1_1, h=31) %>%
autoplot(ATM1)
## Warning: Removed 7 row(s) containing missing values (geom_path).
fitATM1_1 %>%
forecast(h=31) %>%
autoplot(ATM1)
## Warning: Removed 7 row(s) containing missing values (geom_path).
Moving on to ATM2.
ATM2<-ATM %>%
filter (ATM=="ATM2") %>%
as_tsibble(index=DATE)
ATM2 %>%
autoplot()+labs(title="ATM2", subtitle="Cash withdrawals by day", y="$ 00's")
## Plot variable not specified, automatically selected `.vars = Cash`
sum(is.na(ATM2$Cash))
## [1] 2
md.pattern(ATM2)
## DATE ATM Cash
## 363 1 1 1 0
## 2 1 1 0 1
## 0 0 2 2
gg_season(ATM2, y=Cash, period=7)
## ATM2 Preliminary analysis
There are 2 missing values. The plot above suggests 7-day seasonality with pattern changes.
Like ATM1, the green weeks show no activity on a Tuesday, while later weeks show no activity on a Thursday. This suggests the ATM is probably taken out of service on certain days, and that the service schedule changed over time.
Because there are only 2 missing values, let’s look at each missing date and check the values one week before and one week after. If the two values are similar, that reassures us that the change in the service schedule did not occur at that time.
Then we can average the two values and impute the result into the missing date.
#Missing Values
m1<-ATM2 %>%
filter(is.na(Cash))
m1
## # A tsibble: 2 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-18 ATM2 NA
## 2 2009-06-24 ATM2 NA
#Let's deal with the 1st missing value at 6/18/09
#Go back 1 week and forward 1 week
m2<-ATM2 %>%
filter (DATE==as.Date("2009-06-11") | DATE==as.Date("2009-06-25"))
m2
## # A tsibble: 2 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-11 ATM2 7
## 2 2009-06-25 ATM2 3
#Let's average these two and impute into 6/18/09
(3+7)/2
## [1] 5
#5.
#get row index
which ((ATM2$DATE==as.Date("2009-06-18")))
## [1] 49
#This says row 49 so impute 5 into Cash
ATM2[49,]
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-18 ATM2 NA
ATM2[49,3]<-5
ATM2[49,]
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-18 ATM2 5
#####################################################
###################################################
# Moving on to the second missing value
m1<-ATM2 %>%
filter(is.na(Cash))
m1
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-24 ATM2 NA
# Let's calculate a value for 6/24/09 looking back 1 week and forward 1 week.
m2<-ATM2 %>%
filter (DATE==as.Date("2009-06-17") | DATE==as.Date("2009-07-01"))
m2
## # A tsibble: 2 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-17 ATM2 24
## 2 2009-07-01 ATM2 36
#Let's average these two and impute into 6/24/09
(24+36)/2
## [1] 30
#Use 30 as imputed value for 6/24/09
#get row index
which ((ATM2$DATE==as.Date("2009-06-24")))
## [1] 55
#This says row 55 so impute 30 into Cash
ATM2[55,]
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-24 ATM2 NA
ATM2[55,3]<-30
ATM2[55,]
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-24 ATM2 30
#double check all missing gone
sum(is.na(ATM2$Cash))
## [1] 0
Let’s take a closer look at ATM2:
ATM2%>%
autoplot(Cash)
Transform the data to reduce the variability
lambda2<-ATM2%>%
features(Cash,features=guerrero)%>%
pull(lambda_guerrero)
lambda2
## [1] 0.6343478
ATM2<-ATM2%>% mutate(newCash=(((Cash^lambda2)-1)/lambda2))
ATM2%>%
autoplot(box_cox(Cash, lambda2))+
labs(y="",
title=latex2exp::TeX(paste0(
"ATM2: Transformed Cash with $$\\lambda2$ =",
round(lambda2,2))))
lambda2
## [1] 0.6343478
ATM2%>%autoplot(newCash)
Let’s do a decomposition:
ATM2%>%
model(classical_decomposition(newCash, type="additive")) %>%
components () %>%
autoplot()+ labs(title="Classical additive decomposition of ATM2")
## Warning: Removed 3 row(s) containing missing values (geom_path).
## ATM2 Analysis continued
We see a strong seasonality component.
Take a closer look:
#test if uncorrelated with itself.
ATM2 %>%
features(newCash,ljung_box)
## # A tibble: 1 x 2
## lb_stat lb_pvalue
## <dbl> <dbl>
## 1 0.919 0.338
#check whether a first difference is needed
ATM2 %>% features(newCash, unitroot_ndiffs)
## # A tibble: 1 x 1
## ndiffs
## <int>
## 1 1
#ndiffs = 1 suggests a first difference; check seasonal differencing first, since a seasonal difference may remove the need for it
#lets check seasonal differencing
ATM2 %>% features(newCash, unitroot_nsdiffs)
## # A tibble: 1 x 1
## nsdiffs
## <int>
## 1 1
#we need to take 1st seasonal difference
ATM2<-ATM2 %>%
mutate(diff_Cash = (difference(newCash,7)))
ATM2%>% autoplot(diff_Cash)
## Warning: Removed 7 row(s) containing missing values (geom_path).
#Let's look at ACF, PACF
ATM2%>% ACF((diff_Cash))%>% autoplot()
ATM2%>% PACF((diff_Cash))%>% autoplot()
The ACF and PACF show a few significant spikes, but nothing severe. Let’s continue.
Our series has seasonality. We took the first seasonal difference.
Let’s fit some models, beginning with SNAIVE, then ETS (exponential smoothing), then ARIMA.
###########################################
#Fit SNAIVE model
##########################################
fitATM2_2<-ATM2 %>%
model(SNAIVE(newCash~lag(7)))
report(fitATM2_2)
## Series: newCash
## Model: SNAIVE
##
## sigma^2: 47.905
##SNAIVE Residual SD=sqrt(47.905)=6.921344
###########################################
#Fit ETS model
##########################################
fitATM2_3<-ATM2 %>%
model(ETS(newCash))
report(fitATM2_3)
## Series: newCash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.000100013
## gamma = 0.4058325
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 20.61538 -12.32673 -5.992331 2.521689 -2.685606 2.841684 8.392067 7.249227
##
## sigma^2: 35.2982
##
## AIC AICc BIC
## 3465.148 3465.770 3504.147
#ETS(A,N,A) Residual SD=sqrt(35.2982)=5.941229
#AICc=3466
###########################################
#Fit ARIMA model
##########################################
fitATM2_1<-ATM2 %>%
model(ARIMA(diff_Cash))
report(fitATM2_1)
## Series: diff_Cash
## Model: ARIMA(0,0,0)(2,0,0)[7]
##
## Coefficients:
## sar1 sar2
## -0.5518 -0.1692
## s.e. 0.0520 0.0516
##
## sigma^2 estimated as 35.39: log likelihood=-1149.91
## AIC=2305.83 AICc=2305.9 BIC=2317.53
#ARIMA(0,0,0)(2,0,0) Residual SD=sqrt(35.39)=5.948949
#AICc=2306
glance(fitATM2_2)
## # A tibble: 1 x 2
## .model sigma2
## <chr> <dbl>
## 1 SNAIVE(newCash ~ lag(7)) 47.9
glance(fitATM2_1)
## # A tibble: 1 x 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ARIMA(diff_Cash) 35.4 -1150. 2306. 2306. 2318. <cpl [14]> <cpl [0]>
glance(fitATM2_3)
## # A tibble: 1 x 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(newCash) 35.3 -1723. 3465. 3466. 3504. 34.4 34.5 4.07
Comparing the sigma^2 and AICc, the ARIMA model is superior.
Take a look at the residuals and ACF.
fitATM2_1 %>%
gg_tsresiduals()
## Warning: Removed 7 row(s) containing missing values (geom_path).
## Warning: Removed 7 rows containing missing values (geom_point).
## Warning: Removed 7 rows containing non-finite values (stat_bin).
Residuals look satisfactory.
forecast(fitATM2_1, h=31) %>%
autoplot () + labs(title="ATM2 ARIMA", subtitle="Cash withdrawals by day", y="$ 00's")
fitATM2_1 %>%
forecast(h=31) %>%
autoplot(ATM2)
## Warning: Removed 7 row(s) containing missing values (geom_path).
Let’s move on to ATM3
ATM3<-ATM %>%
filter (ATM=="ATM3") %>%
as_tsibble(index=DATE)
ATM3 %>%
autoplot()+labs(title="ATM3", subtitle="Cash withdrawals by day", y="$ 00's")
## Plot variable not specified, automatically selected `.vars = Cash`
sum(is.na(ATM3$Cash))
## [1] 0
From the plot above, ATM3 dispenses no money until the last three days, with values of 96, 82, and 85.
If possible, I would confirm this with the data source.
With only three data points, I would approach this in two ways:
1) a naive model, or 2) a mean model fit only to the last three points.
Let’s begin with the naive model, followed by the mean; a quick side-by-side sketch appears below.
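A quick side-by-side sketch of the two candidates (the MEAN model here is fit only to the three dispensing days, matching the filter used in the next chunk):
#Sketch: residual variances of the two candidate models in one table
bind_rows(
  glance(ATM3 %>% model(NAIVE(Cash))),
  glance(ATM3 %>% filter(DATE >= as.Date("2010-04-28")) %>% model(MEAN(Cash)))
)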
####################################
#NAIVE
###################################
fitATM3_1<-ATM3 %>%
model(NAIVE(Cash))
report(fitATM3_1)
## Series: Cash
## Model: NAIVE
##
## sigma^2: 25.8985
##NAIVE Residual SD=sqrt(25.8985)=5.089057
###########################################
#Fit model with 3 values MEAN
#Start the series at the 1st dispensing
#Series only will contain 3 values
##########################################
ATM3a<-ATM3 %>%
filter (DATE >= as.Date("2010-04-28"))
fitATM3_2<-ATM3a%>%
model(MEAN(Cash))
report(fitATM3_2)
## Series: Cash
## Model: MEAN
##
## Mean: 87.6667
## sigma^2: 54.3333
##MEAN Residual SD=sqrt(54.3333)=7.371113
glance(fitATM3_1)
## # A tibble: 1 x 2
## .model sigma2
## <chr> <dbl>
## 1 NAIVE(Cash) 25.9
glance(fitATM3_2)
## # A tibble: 1 x 2
## .model sigma2
## <chr> <dbl>
## 1 MEAN(Cash) 54.3
Comparing the sigma^2, the NAIVE model is superior (note that its sigma^2 is computed over the long run of zero-withdrawal days, so this comparison is rough).
Take a look at the residuals and ACF.
fitATM3_1 %>%
gg_tsresiduals()
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
## ATM3 model
The naive model uses the last observed value as the forecast. Given only three non-zero values, this is a sound approach. The residual and ACF plots are not very informative here.
#ATM3 Forecast
Now forecast for ATM3
forecast(fitATM3_1, h=31) %>%
autoplot() + labs(title="ATM3", subtitle="Cash withdrawals by day", y="$ 00's")
fitATM3_1 %>%
forecast(h=31) %>%
autoplot(ATM3)
# ATM 4 Analysis
Moving on to the last machine, ATM4.
ATM4<-ATM %>%
filter (ATM=="ATM4") %>%
as_tsibble(index=DATE)
ATM4 %>%
autoplot()+labs(title="ATM4", subtitle="Cash withdrawals by day", y="$ 00's")
## Plot variable not specified, automatically selected `.vars = Cash`
sum(is.na(ATM4$Cash))
## [1] 0
We have no missing values, but there is one outlier cash withdrawal on 2/9/2010: a value of 10,920, which in hundreds of dollars is roughly $1.09 million.
This suggests an error; it is unlikely that the ATM could hold, let alone dispense, that much cash. If possible, I would confirm this with the data source; failing that, an educated guess would be to impute the erroneous value with the mean or a similar method.
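As a sketch, the suspect observation can also be located programmatically rather than by a hard-coded row number (the manual indexing below is what is actually used):
#Sketch: find the largest withdrawal and its row index
ATM4 %>% filter(Cash == max(Cash, na.rm = TRUE))
which.max(ATM4$Cash)   #returns 285, the row used below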
#here is the erroneous withdrawal
ATM4[285,]
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2010-02-09 ATM4 10920
#Let's see 7 days prior and 7 days after
ATM4[278,]
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2010-02-02 ATM4 153
ATM4[292,]
## # A tsibble: 1 x 3 [1D]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2010-02-16 ATM4 989
#These values are nowhere near the order of magnitude of the 10,920 outlier.
#Let's use the overall mean for this value (note: the mean is computed before the outlier is replaced, so it is slightly inflated by it)
mean(ATM4$Cash)
## [1] 474.0137
ATM4[285,3]<-round(mean(ATM4$Cash),0)
#Let's look at the timeseries after we imputed
ATM4 %>%
autoplot() + labs(title="ATM4")
## Plot variable not specified, automatically selected `.vars = Cash`
gg_season(ATM4, y=Cash, period=7)
#Seasonality is less pronounced here; tests follow.
Transform the data to reduce the variability
lambda4<-ATM4%>%
features(Cash,features=guerrero)%>%
pull(lambda_guerrero)
lambda4
## [1] 0.4039733
ATM4<-ATM4%>% mutate(newCash=(((Cash^lambda4)-1)/lambda4))
ATM4%>%
autoplot(box_cox(Cash, lambda4))+
labs(y="",
title=latex2exp::TeX(paste0(
"Transformed Cash with $$\\lambda4$ =",
round(lambda4,2))))
lambda4
## [1] 0.4039733
ATM4%>%autoplot(newCash)
Let’s do a decomposition:
ATM4%>%
model(classical_decomposition(newCash, type="additive")) %>%
components () %>%
autoplot()+ labs(title="Classical additive decomposition of ATM4")
## Warning: Removed 3 row(s) containing missing values (geom_path).
## ATM4 analysis continued
We see a strong seasonality component.
Take a closer look:
#test if uncorrelated with itself.
ATM4 %>%
features(newCash,ljung_box)
## # A tibble: 1 x 2
## lb_stat lb_pvalue
## <dbl> <dbl>
## 1 0.907 0.341
#do not need differencing
ATM4 %>% features(newCash, unitroot_ndiffs)
## # A tibble: 1 x 1
## ndiffs
## <int>
## 1 0
#this confirms no differencing needed
#lets check seasonal differencing
ATM4 %>% features(newCash, unitroot_nsdiffs)
## # A tibble: 1 x 1
## nsdiffs
## <int>
## 1 0
#nsdiffs = 0 says a seasonal difference is not strictly required; we take a 7-day seasonal difference anyway, for consistency with ATM1 and ATM2
ATM4<-ATM4 %>%
mutate(diff_Cash = (difference(newCash,7)))
ATM4%>% autoplot(diff_Cash)
## Warning: Removed 7 row(s) containing missing values (geom_path).
ATM4%>% ACF((diff_Cash))%>% autoplot()
ATM4%>% PACF((diff_Cash))%>% autoplot()
The ACF and PACF show a few significant spikes, but look acceptable. Let’s continue.
Our series has seasonality. We took the first seasonal difference.
Let’s fit some models.
###########################################
#Fit SNAIVE model
##########################################
fitATM4_1<-ATM4 %>%
model(SNAIVE(newCash~lag(7)))
report(fitATM4_1)
## Series: newCash
## Model: SNAIVE
##
## sigma^2: 172.4506
##SNAIVE Residual SD=sqrt(172.4506 )=13.13204
###########################################
#Fit ETS model
##########################################
fitATM4_2<-ATM4 %>%
model(ETS(newCash))
report(fitATM4_2)
## Series: newCash
## Model: ETS(M,N,A)
## Smoothing parameters:
## alpha = 0.01525782
## gamma = 0.1700593
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 21.65918 -14.61678 -4.779729 3.479644 2.528753 4.726212 3.294735 5.367169
##
## sigma^2: 0.1946
##
## AIC AICc BIC
## 3836.200 3836.821 3875.199
#ETS(M,N,A) Residual SD=sqrt(0.1946)=0.4411349 (multiplicative errors, so this is on a relative scale)
#AICc=3837
###########################################
#Fit ARIMA model
##########################################
fitATM4_3<-ATM4 %>%
model(ARIMA(diff_Cash))
report(fitATM4_3)
## Series: diff_Cash
## Model: ARIMA(1,0,0)(2,0,0)[7]
##
## Coefficients:
## ar1 sar1 sar2
## 0.1011 -0.6366 -0.2833
## s.e. 0.0530 0.0514 0.0513
##
## sigma^2 estimated as 117.9: log likelihood=-1365.33
## AIC=2738.67 AICc=2738.78 BIC=2754.27
#ARIMA(1,0,0)(2,0,0) Residual SD=sqrt(117.9)=10.85818
#AICc=2739
glance(fitATM4_1)
## # A tibble: 1 x 2
## .model sigma2
## <chr> <dbl>
## 1 SNAIVE(newCash ~ lag(7)) 172.
glance(fitATM4_2)
## # A tibble: 1 x 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(newCash) 0.195 -1908. 3836. 3837. 3875. 99.2 99.6 0.346
glance(fitATM4_3)
## # A tibble: 1 x 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ARIMA(diff_Cash) 118. -1365. 2739. 2739. 2754. <cpl [15]> <cpl [0]>
Comparing the sigma^2 and AICc, the ARIMA model is superior.
Take a look at the residuals and ACF.
fitATM4_3 %>%
gg_tsresiduals()
## Warning: Removed 7 row(s) containing missing values (geom_path).
## Warning: Removed 7 rows containing missing values (geom_point).
## Warning: Removed 7 rows containing non-finite values (stat_bin).
The residual time plot, ACF, and histogram look satisfactory.
Now forecast for ATM4
forecast(fitATM4_3, h=31) %>%
autoplot () + labs(title="ATM4 ARIMA", subtitle="Cash withdrawals by day", y="$ 00's")
fitATM4_3 %>%
forecast(h=31) %>%
autoplot(ATM4)+ labs(title="ATM4 ARIMA", subtitle="Cash withdrawals by day", y="$ 00's")
## Warning: Removed 7 row(s) containing missing values (geom_path).
## Output to EXCEL
Write the forecasts for all four ATMs to Excel-readable CSV files.
forecastATM1_1 <- forecast(fitATM1_1, h=31)
forecastATM2_1 <- forecast(fitATM2_1, h=31)
forecastATM3_1 <- forecast(fitATM3_1, h=31)
forecastATM4_3 <- forecast(fitATM4_3, h=31)
#write.csv(forecastATM1_1,"C:/Users/Lisa/OneDrive/CUNY/624/Project/forecastATM1_1.csv", row.names = TRUE)
#write.csv(forecastATM2_1,"C:/Users/Lisa/OneDrive/CUNY/624/Project/forecastATM2_1.csv", row.names = TRUE)
#write.csv(forecastATM3_1,"C:/Users/Lisa/OneDrive/CUNY/624/Project/forecastATM3_1.csv", row.names = TRUE)
#write.csv(forecastATM4_3,"C:/Users/Lisa/OneDrive/CUNY/624/Project/forecastATM4_3.csv", row.names = TRUE)
#write.csv(ATM1,"C:/Users/Lisa/OneDrive/CUNY/624/Project/ATM1.csv", row.names = TRUE)
#write.csv(ATM2,"C:/Users/Lisa/OneDrive/CUNY/624/Project/ATM2.csv", row.names = TRUE)
#write.csv(ATM3,"C:/Users/Lisa/OneDrive/CUNY/624/Project/ATM3.csv", row.names = TRUE)
#write.csv(ATM4,"C:/Users/Lisa/OneDrive/CUNY/624/Project/ATM4.csv", row.names = TRUE)
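A note in case the commented-out write.csv() calls are re-enabled: a fable stores the forecast distribution in a column that does not export cleanly to CSV, so it is safer to keep only the date and the point forecast first. A sketch, with a hypothetical file name:
#Sketch: export only the date and point forecast (.mean) to an Excel-readable CSV
forecastATM1_1 %>%
  as_tibble() %>%
  select(DATE, .mean) %>%
  write.csv("forecastATM1_point.csv", row.names = FALSE)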
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Let’s explore the file.
Power_csv<-"C:/Users/Lisa/OneDrive/CUNY/624/Project/ResidentialCustomerForecastLoad-624.csv"
Power<-read.table(file=Power_csv,header=TRUE, sep=",")
str(Power)
## 'data.frame': 192 obs. of 3 variables:
## $ CaseSequence: int 733 734 735 736 737 738 739 740 741 742 ...
## $ YYYY.MMM : chr "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
## $ KWH : int 6862583 5838198 5420658 5010364 4665377 6467147 8914755 8607428 6989888 6345620 ...
A plot of the power time series follows.
Put the data into a time series (tsibble) structure:
PowerUse<-Power %>%
mutate (Month=yearmonth(YYYY.MMM)) %>%
as_tsibble(index=Month)
PowerUse1<-PowerUse %>%
select(Month,KWH)
PowerUse1 %>%
autoplot() + labs (title="Power by month", subtitle="Residential Power usage", y="Kwh")
## Plot variable not specified, automatically selected `.vars = KWH`
Check for missing values
sum(is.na(PowerUse1$KWH))
## [1] 1
#There is a missing value
gg_season(PowerUse1, y=KWH, period=12)
There is clear seasonality, one missing value, and one outlier in July, which is most probably an error.
Impute the missing value by averaging the same month from the prior year and the subsequent year.
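As a sketch of that rule (a hypothetical helper, not used in the manual steps below): average the same month one year earlier and one year later.
#Hypothetical helper: average of the same month in the prior and following year
impute_year_avg <- function(ts, missing_month) {
  mean(c(ts$KWH[ts$Month == missing_month - 12],
         ts$KWH[ts$Month == missing_month + 12]), na.rm = TRUE)
}
#e.g. impute_year_avg(PowerUse1, yearmonth("2008-Sep")) should give (7666970 + 7583146)/2 = 7625058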
K1<-PowerUse1 %>%
filter(is.na(KWH))
K1
## # A tsibble: 1 x 2 [1M]
## Month KWH
## <mth> <int>
## 1 2008 Sep NA
#Let's deal with the 1st missing value at Sep 2008
PowerUse[129,]
## # A tsibble: 1 x 4 [1M]
## CaseSequence YYYY.MMM KWH Month
## <int> <chr> <int> <mth>
## 1 861 2008-Sep NA 2008 Sep
#Go back 1 year and forward 1 year by subtracting/adding 12 to the row index
PowerUse1[141,]
## # A tsibble: 1 x 2 [1M]
## Month KWH
## <mth> <int>
## 1 2009 Sep 7583146
PowerUse[117,]
## # A tsibble: 1 x 4 [1M]
## CaseSequence YYYY.MMM KWH Month
## <int> <chr> <int> <mth>
## 1 849 2007-Sep 7666970 2007 Sep
#Let's average these two and impute into Sep 2008
(PowerUse1[141,2]+PowerUse1[117,2])/2
## KWH
## 1 7625058
#7625058.
#get row index
#row 129
#This row 129 so impute 7625058 into KWH
PowerUse1[129,]
## # A tsibble: 1 x 2 [1M]
## Month KWH
## <mth> <int>
## 1 2008 Sep NA
PowerUse1[129,2]<-7625058
PowerUse1[129,]
## # A tsibble: 1 x 2 [1M]
## Month KWH
## <mth> <int>
## 1 2008 Sep 7625058
# confirm the imputed value
PowerUse1[129,]
## # A tsibble: 1 x 2 [1M]
## Month KWH
## <mth> <int>
## 1 2008 Sep 7625058
As seen from the following plot, there is an outlier in July.
PowerUse1 %>%
autoplot() + labs (title="Power by month", subtitle="Residential Power usage", y="Kwh")
## Plot variable not specified, automatically selected `.vars = KWH`
## Probable data error in July 2010
This is probably an error; let’s take a closer look at the July values.
ErrJuly<-PowerUse %>%
filter(PowerUse$CaseSequence==883)
ErrJuly
## # A tsibble: 1 x 4 [1M]
## CaseSequence YYYY.MMM KWH Month
## <int> <chr> <int> <mth>
## 1 883 2010-Jul 770523 2010 Jul
# Let's look at all other July Values
PowerJuly<-PowerUse %>%
  filter(grepl("Jul", YYYY.MMM) & CaseSequence != 883)  # all Julys except the suspect July 2010 value
PowerJuly
## # A tsibble: 15 x 4 [1M]
## CaseSequence YYYY.MMM KWH Month
## <int> <chr> <int> <mth>
## 1 739 1998-Jul 8914755 1998 Jul
## 2 751 1999-Jul 7444416 1999 Jul
## 3 763 2000-Jul 6862662 2000 Jul
## 4 775 2001-Jul 7855129 2001 Jul
## 5 787 2002-Jul 7039702 2002 Jul
## 6 799 2003-Jul 6900676 2003 Jul
## 7 811 2004-Jul 7307931 2004 Jul
## 8 823 2005-Jul 8337998 2005 Jul
## 9 835 2006-Jul 7945564 2006 Jul
## 10 847 2007-Jul 6426220 2007 Jul
## 11 859 2008-Jul 7643987 2008 Jul
## 12 871 2009-Jul 7713260 2009 Jul
## 13 895 2011-Jul 10093343 2011 Jul
## 14 907 2012-Jul 8886851 2012 Jul
## 15 919 2013-Jul 8415321 2013 Jul
The outlier value in July 2010 is 770,523, while every other July value lies between roughly 6.4 million and 10.1 million.
The error is most likely a dropped digit.
Let’s impute the median of all other July KWH values for July 2010.
median(PowerJuly$KWH)
## [1] 7713260
#This value of 7,713,260 will be imputed into the data set
PowerUse1[151,]
## # A tsibble: 1 x 2 [1M]
## Month KWH
## <mth> <int>
## 1 2010 Jul 770523
PowerUse1[151,2]<-7713260
PowerUse1[151,]
## # A tsibble: 1 x 2 [1M]
## Month KWH
## <mth> <int>
## 1 2010 Jul 7713260
Let’s look at the time series now that the data set has been cleaned.
PowerUse1 %>%
autoplot() + labs (title="Power by month", subtitle="Residential Power usage", y="Kwh")
## Plot variable not specified, automatically selected `.vars = KWH`
Transform the data to reduce the variability
lambdaP<-PowerUse1%>%
features(KWH,features=guerrero)%>%
pull(lambda_guerrero)
lambdaP
## [1] -0.2045992
PowerUse1<-PowerUse1%>% mutate(newKWH=(((KWH^lambdaP)-1)/lambdaP))
PowerUse1%>%
autoplot(box_cox(KWH, lambdaP))+
labs(y="",
title=latex2exp::TeX(paste0(
"Transformed KWH with $$\\lambdaP$ =",
round(lambdaP,2))))
lambdaP
## [1] -0.2045992
PowerUse1%>%autoplot(newKWH)
Let’s do a decomposition:
PowerUse1%>%
model(classical_decomposition(newKWH, type="additive")) %>%
components () %>%
autoplot()+ labs(title="Classical additive decomposition of PowerUse1")
## Warning: Removed 6 row(s) containing missing values (geom_path).
Strong seasonality and some trend are revealed.
Take a closer look:
#test if uncorrelated with itself.
PowerUse1 %>%
features(newKWH,ljung_box)
## # A tibble: 1 x 2
## lb_stat lb_pvalue
## <dbl> <dbl>
## 1 44.5 2.52e-11
#do need differencing
PowerUse1 %>% features(newKWH, unitroot_ndiffs)
## # A tibble: 1 x 1
## ndiffs
## <int>
## 1 1
#this confirms a first difference is needed
#lets check seasonal differencing
PowerUse1 %>% features(newKWH, unitroot_nsdiffs)
## # A tibble: 1 x 1
## nsdiffs
## <int>
## 1 1
#take the first seasonal difference and then a first difference
PowerUse1<-PowerUse1 %>%
mutate(diff_KWH = difference(difference(newKWH,12),1))
PowerUse1%>% autoplot(diff_KWH)
## Warning: Removed 13 row(s) containing missing values (geom_path).
PowerUse1%>% ACF((diff_KWH))%>% autoplot()
PowerUse1%>% PACF((diff_KWH))%>% autoplot()
Our series has both seasonality and trend. We took the first seasonal difference and a first difference.
Let’s fit some models.
###########################################
#Fit SNAIVE model
##########################################
fitKWH1<-PowerUse1 %>%
model(SNAIVE(KWH~lag(12)))
report(fitKWH1)
## Series: KWH
## Model: SNAIVE
##
## sigma^2: 651107743136.935
##SNAIVE Residual SD=sqrt(651107743136.935)=806912.5
#This model is fit on the raw KWH scale, so its sigma^2 is not directly comparable with the transformed-data models below; it does not look like a good fit.
###########################################
#Fit ETS model using transformed data
##########################################
fitKWH2<-PowerUse1 %>%
model(ETS(newKWH))
report(fitKWH2)
## Series: newKWH
## Model: ETS(A,A,A)
## Smoothing parameters:
## alpha = 0.1080076
## beta = 0.0001036622
## gamma = 0.0001019441
##
## Initial states:
## l[0] b[0] s[0] s[-1] s[-2] s[-3]
## 4.686777 5.207743e-05 -0.001669313 -0.0111128 -0.005659824 0.006889337
## s[-4] s[-5] s[-6] s[-7] s[-8] s[-9]
## 0.01020079 0.008017251 0.0009451794 -0.01036705 -0.007704751 -0.002665824
## s[-10] s[-11]
## 0.003441106 0.009685905
##
## sigma^2: 0
##
## AIC AICc BIC
## -1118.018 -1114.501 -1062.641
#ETS(A,A,A): the printed sigma^2 rounds to 0; glance() below reports sigma^2 = 1.41e-05, so residual SD = sqrt(1.41e-05) ≈ 0.0038
#AICc=-1114.501
###########################################
#Fit ARIMA model
##########################################
fitKWH3<-PowerUse1 %>%
model(ARIMA(diff_KWH))
report(fitKWH3)
## Series: diff_KWH
## Model: ARIMA(4,0,0)(2,0,0)[12]
##
## Coefficients:
## ar1 ar2 ar3 ar4 sar1 sar2
## -0.5768 -0.5113 -0.1397 -0.2131 -0.7424 -0.4093
## s.e. 0.0766 0.0881 0.0877 0.0740 0.0747 0.0746
##
## sigma^2 estimated as 1.369e-05: log likelihood=740.31
## AIC=-1466.61 AICc=-1466 BIC=-1443.81
#ARIMA(4,0,0)(2,0,0)[12] Residual SD=sqrt(1.369e-05)=0.0037
# AICc=-1466
glance(fitKWH1)
## # A tibble: 1 x 2
## .model sigma2
## <chr> <dbl>
## 1 SNAIVE(KWH ~ lag(12)) 651107743137.
glance(fitKWH2)
## # A tibble: 1 x 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(newKWH) 0.0000141 576. -1118. -1115. -1063. 0.0000129 0.0000135 0.00278
glance(fitKWH3)
## # A tibble: 1 x 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ARIMA(diff_KWH) 0.0000137 740. -1467. -1466. -1444. <cpl [28]> <cpl [0]>
Comparing the sigma^2 and AICc, we see negative AICc values.
"One question students often have about AIC is: How do I interpret negative AIC values? The simple answer: The lower the value for AIC, the better the fit of the model. The absolute value of the AIC value is not important. It can be positive or negative." (https://www.statology.org/negative-aic/)
The ARIMA model is superior.
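As with the ATM series, note that this ARIMA model is fit to the doubly differenced, Box-Cox-transformed diff_KWH series, so its forecasts are not in KWH and its AICc is not strictly comparable with the ETS model fit to newKWH. A sketch of an alternative that keeps everything on the KWH scale, with fable choosing the differencing and back-transforming the forecasts automatically:
#Sketch: transformation inside the formula, so the 2014 forecasts come back in KWH
fitKWH_bc <- PowerUse1 %>%
  model(ARIMA(box_cox(KWH, lambdaP)))
fitKWH_bc %>%
  forecast(h = 12) %>%
  autoplot(PowerUse1)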
Take a look at the residuals and ACF.
fitKWH3 %>%
gg_tsresiduals()
## Warning: Removed 13 row(s) containing missing values (geom_path).
## Warning: Removed 13 rows containing missing values (geom_point).
## Warning: Removed 13 rows containing non-finite values (stat_bin).
Below, we forecast the power usage with the ARIMA model.
forecast(fitKWH3, h=12) %>%
autoplot () + labs(title="Power KWH ARIMA", subtitle="KWH by month", y="KWH")
fitKWH3 %>%
forecast(h=12) %>%
autoplot(PowerUse1)
## Warning: Removed 13 row(s) containing missing values (geom_path).
## Export to Excel
Write the 2014 KWH forecast to an Excel-readable CSV file.
forecastKWH3 <- forecast(fitKWH3, h=12)
#write.csv(forecastKWH3,"C:/Users/Lisa/OneDrive/CUNY/624/Project/forecastKWH3.csv", row.names = TRUE)