The task for Part 1 is to take transaction data for four different ATMs and forecast how much cash will be withdrawn from each machine in May 2010.
library(tidyverse)
library(fpp3)
library(readxl)
library(knitr)
library(ggplot2)
library(writexl)
Import Data:
suppressWarnings({
df <- read_excel("ATM624Data.xlsx", col_types = c("date","text","numeric"))
})
df <- df %>% mutate(DATE = as_date(DATE))
## Date for time series needs to be in tsibble formatting
df <- df %>% as_tsibble(index = DATE, key = ATM)
df
## # A tsibble: 1,474 x 3 [1D]
## # Key: ATM [5]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
## 7 2009-05-07 ATM1 8
## 8 2009-05-08 ATM1 104
## 9 2009-05-09 ATM1 87
## 10 2009-05-10 ATM1 93
## # ℹ 1,464 more rows
Note: The cash is provided in hundreds of dollars, so the real transaction amount is the value * 100. The values are easier to read as given, so I will leave them in hundreds.
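If actual dollar amounts were ever needed, a simple mutate would recover them (a sketch; Cash_usd is a hypothetical column name, and the conversion is not applied anywhere below):
## Convert the hundreds-of-dollars scale to actual dollars
df_dollars <- df %>% mutate(Cash_usd = Cash * 100)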
EDA:
paste("There are:", sum(is.na(df)), "total missing values")
## [1] "There are: 33 total missing values"
print(colSums(is.na(df)))
## DATE ATM Cash
## 0 14 19
The missing values in the ATM and Cash columns will need to be imputed or removed.
## Look at rows in data that contain missing values
df %>% filter(is.na(ATM) | is.na(Cash))
## # A tsibble: 19 x 3 [1D]
## # Key: ATM [3]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
## 6 2010-05-01 <NA> NA
## 7 2010-05-02 <NA> NA
## 8 2010-05-03 <NA> NA
## 9 2010-05-04 <NA> NA
## 10 2010-05-05 <NA> NA
## 11 2010-05-06 <NA> NA
## 12 2010-05-07 <NA> NA
## 13 2010-05-08 <NA> NA
## 14 2010-05-09 <NA> NA
## 15 2010-05-10 <NA> NA
## 16 2010-05-11 <NA> NA
## 17 2010-05-12 <NA> NA
## 18 2010-05-13 <NA> NA
## 19 2010-05-14 <NA> NA
Some rows are missing both the ATM label and the Cash value, while others are missing only Cash. Rows missing both cannot be imputed; their dates all fall in May 2010, so they appear to be placeholders for the forecast period. Imputing the rows that are missing only Cash is the better approach.
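One way to drop the unlabeled rows explicitly (a sketch; df_labeled is a hypothetical name, and the per-ATM filters used later achieve the same effect):
## Drop the May 2010 placeholder rows that carry no ATM label; the
## Cash-only gaps are kept so they can be imputed later
df_labeled <- df %>% filter(!is.na(ATM))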
Further exploration on the data:
## Generate a dataframe with useful summarizing statistics for each ATM
as.data.frame(df) %>% filter(!is.na(Cash)) %>%
group_by(ATM) %>%
summarise(min = min(Cash, na.rm = TRUE),
max = max(Cash, na.rm = TRUE),
mean = mean(Cash, na.rm = TRUE),
median = median(Cash, na.rm = TRUE),
std = sd(Cash, na.rm = TRUE),
var = var(Cash, na.rm = TRUE))
## # A tibble: 4 × 7
## ATM min max mean median std var
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 1 180 83.9 91 36.7 1344.
## 2 ATM2 0 147 62.6 67 38.9 1513.
## 3 ATM3 0 96 0.721 0 7.94 63.1
## 4 ATM4 1.56 10920. 474. 404. 651. 423718.
ATM3 and ATM4 have summary statistics that call for further inspection. ATM3 has a mean and median showing that nearly all of its recorded values are at or near 0. ATM4 has a maximum cash value that far exceeds the other ATMs, along with a much larger standard deviation and variance. ATM1 and ATM2 appear similar to each other.
Now the data must be visualized:
ggplot(df, aes(x = DATE, y= Cash, color = ATM)) +
geom_line() + labs(title= "Transaction History of All ATMs Across the Dataset")
## Warning: Removed 14 rows containing missing values (`geom_line()`).
Overlaying all four ATMs on one plot shows how much both the scale and the variance of transactions differ across the machines.
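An alternative view that sidesteps the scale differences is to facet the plot with free y-axes (a minimal sketch):
## One panel per ATM, each with its own y-scale; unlabeled rows dropped
ggplot(filter(df, !is.na(ATM)), aes(x = DATE, y = Cash)) +
  geom_line() +
  facet_wrap(~ATM, scales = "free_y") +
  labs(title = "Transaction History by ATM, Free Scales")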
filtered <- df %>% filter((ATM == "ATM1" | (ATM == "ATM2")))
filtered
## # A tsibble: 730 x 3 [1D]
## # Key: ATM [2]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
## 7 2009-05-07 ATM1 8
## 8 2009-05-08 ATM1 104
## 9 2009-05-09 ATM1 87
## 10 2009-05-10 ATM1 93
## # ℹ 720 more rows
ggplot(filtered, aes(x = DATE, y= Cash, color = ATM)) +
geom_line() + labs(title ="ATM1 and ATM2 Transaction History")
Models for ATM1 and ATM2 can probably be built through similar processes. ATM3 and ATM4 will need more specifically tailored handling.
ggplot(filter(df, ATM == "ATM4"), aes(x= DATE, y = Cash)) + geom_line() + labs(title= "ATM4 Transaction History")
It appears that a single isolated incident has skewed the transaction values for ATM4. Multiple models might be developed that either exclude or incorporate the outlier.
ggplot(filter(df, ATM == "ATM3"), aes(x= DATE, y = Cash)) + geom_line() + labs(title = "ATM3 Transaction History")
This transaction data also has a spike, but it occurs near the end of the time series. That might give the spike greater weight and alter how we handle it compared to the spike in ATM4. All data prior to the spike is 0, so the spike might mark when the ATM went online.
df %>% filter(ATM == "ATM3")
## # A tsibble: 365 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM3 0
## 2 2009-05-02 ATM3 0
## 3 2009-05-03 ATM3 0
## 4 2009-05-04 ATM3 0
## 5 2009-05-05 ATM3 0
## 6 2009-05-06 ATM3 0
## 7 2009-05-07 ATM3 0
## 8 2009-05-08 ATM3 0
## 9 2009-05-09 ATM3 0
## 10 2009-05-10 ATM3 0
## # ℹ 355 more rows
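A quick filter pins down exactly when the non-zero activity begins (a sketch):
## First dates with non-zero withdrawals for ATM3
df %>%
  as_tibble() %>%
  filter(ATM == "ATM3", Cash > 0) %>%
  head()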
For easier modeling, the four ATMs will be split into their own dataframes. This also removes the unlabeled rows, which contain only dates and cannot be imputed:
atm1 <- df %>% filter(ATM == "ATM1")
atm2 <- df %>% filter(ATM == "ATM2")
atm3 <- df %>% filter(ATM == "ATM3")
atm4 <- df %>% filter(ATM == "ATM4")
atm1
## # A tsibble: 365 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
## 7 2009-05-07 ATM1 8
## 8 2009-05-08 ATM1 104
## 9 2009-05-09 ATM1 87
## 10 2009-05-10 ATM1 93
## # ℹ 355 more rows
The ARIMA model can be used to generate imputed values for the rows that have missing Cash values.
ATM 1 impute:
atm1_impute <- atm1 %>% model(ARIMA(Cash)) %>%
interpolate(atm1)
atm1_impute
## # A tsibble: 365 x 3 [1D]
## # Key: ATM [1]
## ATM DATE Cash
## <chr> <date> <dbl>
## 1 ATM1 2009-05-01 96
## 2 ATM1 2009-05-02 82
## 3 ATM1 2009-05-03 85
## 4 ATM1 2009-05-04 90
## 5 ATM1 2009-05-05 99
## 6 ATM1 2009-05-06 88
## 7 ATM1 2009-05-07 8
## 8 ATM1 2009-05-08 104
## 9 ATM1 2009-05-09 87
## 10 ATM1 2009-05-10 93
## # ℹ 355 more rows
ATM 2 impute:
atm2_impute <- atm2 %>% model(ARIMA(Cash)) %>%
interpolate(atm2)
atm2_impute
## # A tsibble: 365 x 3 [1D]
## # Key: ATM [1]
## ATM DATE Cash
## <chr> <date> <dbl>
## 1 ATM2 2009-05-01 107
## 2 ATM2 2009-05-02 89
## 3 ATM2 2009-05-03 90
## 4 ATM2 2009-05-04 55
## 5 ATM2 2009-05-05 79
## 6 ATM2 2009-05-06 19
## 7 ATM2 2009-05-07 2
## 8 ATM2 2009-05-08 103
## 9 ATM2 2009-05-09 107
## 10 ATM2 2009-05-10 118
## # ℹ 355 more rows
ATM 4 impute:
ATM4 contains a spike in its values. I would like to replace the outlier and then build models for the data with and without it.
atm4_decomp <- atm4 %>%
model(stl = STL(Cash ~ season(period = 1), robust = TRUE)) %>%
components()
atm4_decomp %>% autoplot()
The outlier is clearly visible in the remainder component.
atm4_outliers <- atm4_decomp %>% filter(remainder < quantile(remainder, 0.25) - 3 * IQR(remainder) |
remainder > quantile(remainder, 0.75) + 3 * IQR(remainder))
atm4_outliers
## # A dable: 1 x 7 [1D]
## # Key: ATM, .model [1]
## # : Cash = trend + remainder
## ATM .model DATE Cash trend remainder season_adjust
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 stl 2010-02-09 10920. 491. 10428. 10920.
The outlier has now been isolated
atm4_impute <- atm4 %>% anti_join(atm4_outliers) %>% fill_gaps()
## Joining with `by = join_by(DATE, ATM, Cash)`
atm4_impute <- atm4_impute %>% model(ARIMA(Cash)) %>% interpolate(atm4_impute)
After replacing the outlier using the ARIMA model, we can plot the data again:
atm4_impute %>% autoplot(Cash) + labs (title = "Imputed Data for ATM 4")
We will need to determine core components of the time series, such as seasonality and trend. We can do this with an STL decomposition:
atm1_impute %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() + labs(title = "ATM1 STL Decomposition")
ATM1 has an inconsistent trend component and a weekly seasonal element, in which cash drops and then returns to baseline.
We can apply a Box-Cox transformation to the data for modeling:
lambda <- atm1_impute %>% features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
paste("Lambda Value:", round(lambda, 4))
## [1] "Lambda Value: 0.2623"
A Box-Cox transformation is a good starting point for stabilizing the variance of the series.
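For reference, box_cox() applies the transformation below; a hand-rolled version (a minimal sketch of the standard one-parameter form for positive data):
## Box-Cox: log(y) when lambda is 0, otherwise (y^lambda - 1) / lambda
manual_box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}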
Now we can start running models:
atm1_model <- atm1_impute %>%
model(
seasonal_naive = SNAIVE(Cash),
seasonal_naive_trans = SNAIVE(box_cox(Cash,lambda)),
ANN = ETS(Cash ~ error("A") + trend("N") + season("N")),
AAN = ETS(Cash ~ error("A") + trend("A") + season("N")),
AAA = ETS(Cash ~ error("A") + trend("A") + season("A")),
ANA = ETS(Cash ~ error("A") + trend("N") + season("A")),
MNM = ETS(Cash ~ error("M") + trend("N") + season("M")),
MAM = ETS(Cash ~ error("M") + trend("A") + season("M")),
MAM_trans = ETS(box_cox(Cash, lambda) ~ error("M") + trend("A") + season("M")),
ANA_trans = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A")),
MNM_trans = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M")),
ARIMA = ARIMA(box_cox(Cash, lambda))
)
Look at the model fits:
glance(atm1_model) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 12 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 1.77 -611. 1229. 1229. 1245.
## 2 ANA_trans 1.81 -1180. 2381. 2381. 2420.
## 3 MNM_trans 0.0383 -1220. 2459. 2460. 2498.
## 4 MAM_trans 0.0383 -1221. 2465. 2466. 2512.
## 5 ANA 577. -2232. 4485. 4486. 4524.
## 6 AAA 582. -2233. 4490. 4491. 4537.
## 7 MNM 0.134 -2273. 4566. 4566. 4605.
## 8 MAM 0.133 -2273. 4571. 4572. 4618.
## 9 ANN 1360. -2391. 4793. 4793. 4812.
## 10 AAN 1360. -2391. 4793. 4793. 4812.
## 11 seasonal_naive 765. NA NA NA NA
## 12 seasonal_naive_trans 2.49 NA NA NA NA
accuracy(atm1_model) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 12 × 2
## .model RMSE
## <chr> <dbl>
## 1 MNM_trans 23.3
## 2 MAM_trans 23.6
## 3 ANA 23.7
## 4 AAA 23.8
## 5 ANA_trans 24.9
## 6 ARIMA 24.9
## 7 MNM 26.2
## 8 MAM 26.3
## 9 seasonal_naive 27.6
## 10 seasonal_naive_trans 27.6
## 11 ANN 36.7
## 12 AAN 36.7
Looking at the fit and accuracy metrics together, the transformed multiplicative model and the ARIMA model are the most appropriate. The transformed multiplicative model has the lowest RMSE, but the ARIMA model has the lowest AIC, AICc, and BIC. With this in mind, the ARIMA model is the one chosen.
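A time-series cross-validation would give an out-of-sample check on this choice (a sketch, not run here; the .init and .step values are arbitrary):
## Expanding training windows, forecasting 7 days past each one
atm1_cv <- atm1_impute %>% stretch_tsibble(.init = 180, .step = 30)
atm1_cv %>%
  model(ARIMA = ARIMA(box_cox(Cash, lambda))) %>%
  forecast(h = 7) %>%
  accuracy(atm1_impute)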
atm1_impute %>% model(ARIMA(box_cox(Cash, lambda))) %>% report()
## Series: Cash
## Model: ARIMA(0,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ma1 ma2 sma1
## 0.1105 -0.1088 -0.6419
## s.e. 0.0524 0.0521 0.0431
##
## sigma^2 estimated as 1.771: log likelihood=-610.69
## AIC=1229.38 AICc=1229.49 BIC=1244.9
The optimal model is an ARIMA(0,0,2)(0,1,1)[7].
atm1_model %>% select(ARIMA) %>% gg_tsresiduals() + labs(title = "ARIMA model residuals")
The residuals for this model don't appear to show any autocorrelation, and they appear random with a fairly normal distribution.
atm1_fc <- atm1_model %>% forecast(h = 31) %>% filter(.model == "ARIMA")
atm1_model %>% forecast(h = 31) %>% filter(.model == "ARIMA") %>%
autoplot(atm1_impute) + labs(title = "ATM1 forecast for May")
We will apply a Box-Cox transformation to ATM2 as well:
lambda <- atm2_impute %>% features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
paste("Lambda Value:", round(lambda, 4))
## [1] "Lambda Value: 0.6747"
atm2_impute %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() + labs(title = "ATM1 STL Decomposition")
As we predicted earlier, ATM2 is similar to ATM1: there is a weekly seasonal component with no real visible trend throughout the data.
The same models tested for ATM1 can be fit to ATM2:
atm2_model <- atm2_impute %>%
model(
seasonal_naive = SNAIVE(Cash),
seasonal_naive_trans = SNAIVE(box_cox(Cash,lambda)),
ANN = ETS(Cash ~ error("A") + trend("N") + season("N")),
AAN = ETS(Cash ~ error("A") + trend("A") + season("N")),
AAA = ETS(Cash ~ error("A") + trend("A") + season("A")),
ANA = ETS(Cash ~ error("A") + trend("N") + season("A")),
MNM = ETS(Cash ~ error("M") + trend("N") + season("M")),
MAM = ETS(Cash ~ error("M") + trend("A") + season("M")),
MAM_trans = ETS(box_cox(Cash, lambda) ~ error("M") + trend("A") + season("M")),
ANA_trans = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A")),
MNM_trans = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M")),
ARIMA = ARIMA(box_cox(Cash, lambda))
)
Look at the fits and accuracy:
glance(atm2_model) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 12 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 45.0 -1189. 2390. 2390. 2413.
## 2 ANA_trans 47.5 -1777. 3573. 3574. 3612.
## 3 MNM_trans 0.245 -1932. 3885. 3886. 3924.
## 4 MAM_trans 0.273 -1932. 3888. 3888. 3934.
## 5 ANA 622. -2246. 4512. 4513. 4551.
## 6 AAA 634. -2249. 4521. 4522. 4568.
## 7 MAM 0.384 -2377. 4779. 4779. 4825.
## 8 ANN 1482. -2407. 4824. 4825. 4844.
## 9 AAN 1482. -2407. 4824. 4825. 4844.
## 10 MNM 0.738 -2438. 4897. 4898. 4936.
## 11 seasonal_naive 881. NA NA NA NA
## 12 seasonal_naive_trans 65.2 NA NA NA NA
accuracy(atm2_model) %>% select(.model, RMSE) %>% arrange(RMSE)
Once again, ARIMA performs best on the information criteria, so we will use the ARIMA model for ATM2 as well.
atm2_impute %>% model(ARIMA(box_cox(Cash, lambda))) %>% report()
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4180 -0.9012 0.4579 0.7889 -0.7129
## s.e. 0.0592 0.0475 0.0888 0.0619 0.0419
##
## sigma^2 estimated as 44.97: log likelihood=-1188.76
## AIC=2389.51 AICc=2389.75 BIC=2412.8
The optimal ARIMA model is an ARIMA(2,0,2)(0,1,1)[7].
atm2_model %>% select(ARIMA) %>% gg_tsresiduals() + labs(title = "ARIMA model residuals")
For ATM2, it appears that there is some autocorrelation, but the residuals are approximately normally distributed and random.
atm2_fc <- atm2_model %>% forecast(h = 31) %>% filter(.model == "ARIMA")
atm2_model %>% forecast(h = 31) %>% filter(.model == "ARIMA") %>%
autoplot(atm2_impute) + labs(title = "ATM2 forecast for May")
ATM3 modeling will need to be more subjective. There is almost no information to model from, since all but three values are 0. With this in mind, a naive-style model is the most likely choice:
atm3 %>% autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`
A Box-Cox transformation would do nothing for this data and is unneeded. The only models that can realistically represent it are the mean, naive, and drift:
atm3_model <- atm3 %>%
model(naive = NAIVE(Cash),
mean = MEAN(Cash),
drift = NAIVE(Cash ~ drift()))
accuracy(atm3_model) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 3 × 2
## .model RMSE
## <chr> <dbl>
## 1 drift 5.08
## 2 naive 5.09
## 3 mean 7.93
Although RMSE is less informative for a series like this, it tells us that the lowest-error models are the drift and naive models. The drift model is what we will choose based on this fit.
atm3_model %>% forecast(h = 31) %>% autoplot(atm3, level = NULL) + labs(title = "ATM3 forecast for May using 3 models")
atm3_fc <- atm3_model %>% forecast(h = 31) %>% filter(.model == "drift")
atm3_model %>% forecast(h = 31) %>% filter(.model == "drift") %>% autoplot(atm3) + labs(title ="ATM 3 forecast using Drift model")
Our ATM4 data has two versions: one with the outlier and one with the outlier replaced using the ARIMA model. We can create models for both:
atm4 %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() + labs(title = "ATM4 STL Decomposition")
atm4_impute %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() + labs(title = "ATM4 Imputed Data STL Decomposition")
The weekly seasonal component is the same for both versions of the data; the only noticeable difference is the spike in the remainder of the original data, where the outlier sits. Next, we can apply Box-Cox transformations to both datasets:
lambda1 <- atm4 %>% features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
paste("Lambda Value For original dataset:", round(lambda1, 4))
## [1] "Lambda Value For original dataset: -0.0737"
lambda2 <- atm4_impute %>% features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
paste("Lambda Value For Imputed dataset:", round(lambda2, 4))
## [1] "Lambda Value For Imputed dataset: 0.4496"
The lambda values for the two datasets are noticeably different. Now we can fit our models:
atm4_model <- atm4 %>%
model(
seasonal_naive = SNAIVE(Cash),
seasonal_naive_trans = SNAIVE(box_cox(Cash,lambda1)),
ANN = ETS(Cash ~ error("A") + trend("N") + season("N")),
AAN = ETS(Cash ~ error("A") + trend("A") + season("N")),
AAA = ETS(Cash ~ error("A") + trend("A") + season("A")),
ANA = ETS(Cash ~ error("A") + trend("N") + season("A")),
MNM = ETS(Cash ~ error("M") + trend("N") + season("M")),
MAM = ETS(Cash ~ error("M") + trend("A") + season("M")),
MAM_trans = ETS(box_cox(Cash, lambda1) ~ error("M") + trend("A") + season("M")),
ANA_trans = ETS(box_cox(Cash, lambda1) ~ error("A") + trend("N") + season("A")),
MNM_trans = ETS(box_cox(Cash, lambda1) ~ error("M") + trend("N") + season("M")),
ARIMA = ARIMA(box_cox(Cash, lambda1))
)
glance(atm4_model) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 12 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 0.842 -486. 979. 979. 995.
## 2 ANA_trans 0.807 -1033. 2086. 2087. 2125.
## 3 MNM_trans 0.0409 -1034. 2089. 2089. 2128.
## 4 MAM_trans 0.0411 -1037. 2098. 2099. 2145.
## 5 MAM 1.84 -3385. 6795. 6796. 6841.
## 6 MNM 1.85 -3388. 6795. 6796. 6834.
## 7 ANA 412779. -3432. 6884. 6885. 6923.
## 8 AAA 423177. -3436. 6895. 6896. 6942.
## 9 ANN 434658. -3444. 6898. 6898. 6917.
## 10 AAN 434658. -3444. 6898. 6898. 6917.
## 11 seasonal_naive 805110. NA NA NA NA
## 12 seasonal_naive_trans 1.32 NA NA NA NA
accuracy(atm4_model) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 12 × 2
## .model RMSE
## <chr> <dbl>
## 1 ANA 635.
## 2 AAA 641.
## 3 ANN 656.
## 4 AAN 656.
## 5 MAM_trans 664.
## 6 ANA_trans 665.
## 7 MNM_trans 666.
## 8 ARIMA 679.
## 9 MNM 851.
## 10 seasonal_naive_trans 896.
## 11 seasonal_naive 896.
## 12 MAM 1017.
For the dataset with the outlier, the ARIMA model fits best, as it has the lowest AIC, AICc, and BIC. The additive ETS models have the best RMSE but fall behind on the other metrics.
atm4_model %>% select(ARIMA) %>% gg_tsresiduals() + labs(title = "ATM 4 ARIMA model residuals")
The ARIMA model for the original ATM4 data has only one significant autocorrelation spike, and the residuals appear randomly distributed.
atm4_model %>% select(ARIMA) %>% report()
## Series: Cash
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean
## Transformation: box_cox(Cash, lambda1)
##
## Coefficients:
## sar1 sar2 constant
## 0.2487 0.1947 2.4972
## s.e. 0.0521 0.0525 0.0468
##
## sigma^2 estimated as 0.8418: log likelihood=-485.59
## AIC=979.18 AICc=979.29 BIC=994.78
The selected model is an ARIMA(0,0,0)(2,0,0)[7] with a mean.
Now we can model the imputed dataset:
atm4_impute_model <- atm4_impute %>%
model(
seasonal_naive = SNAIVE(Cash),
seasonal_naive_trans = SNAIVE(box_cox(Cash,lambda2)),
ANN = ETS(Cash ~ error("A") + trend("N") + season("N")),
AAN = ETS(Cash ~ error("A") + trend("A") + season("N")),
AAA = ETS(Cash ~ error("A") + trend("A") + season("A")),
ANA = ETS(Cash ~ error("A") + trend("N") + season("A")),
MNM = ETS(Cash ~ error("M") + trend("N") + season("M")),
MAM = ETS(Cash ~ error("M") + trend("A") + season("M")),
MAM_trans = ETS(box_cox(Cash, lambda2) ~ error("M") + trend("A") + season("M")),
ANA_trans = ETS(box_cox(Cash, lambda2) ~ error("A") + trend("N") + season("A")),
MNM_trans = ETS(box_cox(Cash, lambda2) ~ error("M") + trend("N") + season("M")),
ARIMA = ARIMA(box_cox(Cash, lambda2))
)
glance(atm4_impute_model) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 12 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 176. -1460. 2930. 2930. 2950.
## 2 MNM_trans 0.220 -2003. 4025. 4026. 4064.
## 3 ANA_trans 167. -2006. 4032. 4032. 4071.
## 4 MAM_trans 0.209 -2007. 4038. 4039. 4085.
## 5 ANA 110883. -3192. 6404. 6405. 6443.
## 6 MNM 0.728 -3192. 6404. 6405. 6443.
## 7 MAM 0.646 -3195. 6414. 6415. 6461.
## 8 AAA 115261. -3198. 6420. 6421. 6467.
## 9 ANN 127144. -3220. 6449. 6449. 6469.
## 10 AAN 127144. -3220. 6449. 6449. 6469.
## 11 seasonal_naive 205867. NA NA NA NA
## 12 seasonal_naive_trans 289. NA NA NA NA
accuracy(atm4_impute_model) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 12 × 2
## .model RMSE
## <chr> <dbl>
## 1 ANA 329.
## 2 AAA 334.
## 3 ANA_trans 338.
## 4 MNM_trans 345.
## 5 MAM_trans 350.
## 6 ARIMA 352.
## 7 MNM 354.
## 8 ANN 355.
## 9 AAN 355.
## 10 MAM 367.
## 11 seasonal_naive 453.
## 12 seasonal_naive_trans 453.
atm4_impute_model %>% select(ARIMA) %>% gg_tsresiduals() + labs(title = "ARIMA model residuals")
The residuals for the imputed data are similar to those for the original data, with one autocorrelation spike and randomly distributed residuals.
atm4_impute_model %>% select(ARIMA) %>% report()
The report gives the ARIMA specification selected for the imputed dataset.
Overall, the model fit on the original data has better ARIMA fit statistics than the imputed one. I will still keep the imputed ARIMA model for forecasting, as it should better represent typical future behavior.
atm4_fc <- atm4_impute_model %>% forecast(h = 31) %>% filter(.model == "ARIMA")
atm4_impute_model %>% forecast(h = 31) %>% filter(.model == "ARIMA") %>%
autoplot(atm4_impute) + labs(title = "ATM4 Imputed Data forecast for May")
atm4_model %>% forecast(h = 31) %>% filter(.model == "ARIMA") %>%
autoplot(atm4) + labs(title = "ATM4 Data forecast for May")
Now we can export our forecasts to a CSV file:
## Add back in the ATM specification
atm1_fc$atm = "ATM1"
atm2_fc$atm = "ATM2"
atm3_fc$atm = "ATM3"
atm4_fc$atm = "ATM4"
atm_fc_all <- bind_rows(atm1_fc, atm2_fc,atm3_fc,atm4_fc) %>%
as.data.frame() %>%
select(DATE, atm, .mean) %>%
rename(Cash = .mean)
atm_fc_all %>% write.csv("Project1_ATM_forecasts_Steve_Phillips.csv")
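If an Excel workbook is preferred instead, writexl (already loaded) could produce one (a sketch; the filename is arbitrary):
## Alternative: write the combined forecasts to a single xlsx sheet
write_xlsx(atm_fc_all, "Project1_ATM_forecasts.xlsx")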
The task for Part 2 is to create a monthly forecast of residential power usage for 2014.
suppressWarnings({
df <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
})
colnames(df) <- c("case_sequence","date", "kwh")
df <- df %>% mutate(date = yearmonth(date))
df <- df %>% as_tsibble(index = date)
Now that the data is a tsibble, it can be modeled as a time series:
df %>% autoplot(kwh) + labs(title = "Plot of KWH as a function of time")
paste("There are:", sum(is.na(df)), "total missing values")
## [1] "There are: 1 total missing values"
print(colSums(is.na(df)))
## case_sequence date kwh
## 0 0 1
We have 1 missing value that needs to be evaluated:
df %>% filter(is.na(kwh))
## # A tsibble: 1 x 3 [1M]
## case_sequence date kwh
## <dbl> <mth> <dbl>
## 1 861 2008 Sep NA
The value for September 2008 will need to be imputed
as.data.frame(df) %>% filter(!is.na(kwh)) %>%
summarise(min = min(kwh, na.rm = TRUE),
max = max(kwh, na.rm = TRUE),
mean = mean(kwh, na.rm = TRUE),
median = median(kwh, na.rm = TRUE),
std = sd(kwh, na.rm = TRUE),
var = var(kwh, na.rm = TRUE))
## min max mean median std var
## 1 770523 10655730 6502475 6283324 1447571 2.095461e+12
Impute the missing value:
kwh_impute <- df %>% model(ARIMA(kwh)) %>%
interpolate(df)
kwh_decomp <- kwh_impute %>%
model(stl = STL(kwh ~ season(period = 1), robust = TRUE)) %>%
components()
kwh_decomp %>% autoplot()
There does appear to be an outlier in the data, around July 2010. We can replace this as well:
kwh_outliers <- kwh_decomp %>% filter(remainder < quantile(remainder, 0.25) - 2 * IQR(remainder) |
remainder > quantile(remainder, 0.75) + 2 * IQR(remainder))
kwh_outliers
## # A dable: 1 x 6 [1M]
## # Key: .model [1]
## # : kwh = trend + remainder
## .model date kwh trend remainder season_adjust
## <chr> <mth> <dbl> <dbl> <dbl> <dbl>
## 1 stl 2010 Jul 770523 6855641. -6085118. 770523
We can use the ARIMA model to replace this outlier
kwh_impute <- kwh_impute %>% anti_join(kwh_outliers) %>% fill_gaps()
## Joining with `by = join_by(date, kwh)`
kwh_impute <- kwh_impute %>% model(ARIMA(kwh)) %>% interpolate(kwh_impute)
Now a Box-Cox transformation can be applied:
lambda <- kwh_impute %>% features(kwh, features = guerrero) %>%
pull(lambda_guerrero)
paste("Lambda Value:", round(lambda, 4))
## [1] "Lambda Value: -0.2049"
kwh_decomp <- kwh_impute %>% model(STL(kwh ~ season(window = "periodic"), robust = TRUE)) %>%
components()
kwh_decomp %>% autoplot() + labs(title = "kwh STL Decomposition")
It appears that there is a trend as well as a yearly seasonal component in the kwh data. The outlier is no longer present. We can identify the months in which kwh peaks:
kwh_impute %>%
gg_subseries(kwh) + labs(title = "Seasonal Subseries KWH")
kwh peaks around July, August, and September, when it is hotter, as well as December and January, when it is colder.
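A quick numeric check of that seasonal pattern is to average usage by calendar month (a sketch using the imputed series):
## Mean kwh per calendar month across the series
kwh_impute %>%
  as_tibble() %>%
  mutate(month = month(as.Date(date), label = TRUE)) %>%
  group_by(month) %>%
  summarise(mean_kwh = mean(kwh))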
We can now try to fit our models and evaluate them:
kwh_model <- kwh_impute %>%
model(
seasonal_naive = SNAIVE(kwh),
seasonal_naive_trans = SNAIVE(box_cox(kwh,lambda)),
ANN = ETS(kwh ~ error("A") + trend("N") + season("N")),
AAN = ETS(kwh ~ error("A") + trend("A") + season("N")),
AAA = ETS(kwh ~ error("A") + trend("A") + season("A")),
ANA = ETS(kwh ~ error("A") + trend("N") + season("A")),
MNM = ETS(kwh ~ error("M") + trend("N") + season("M")),
MAM = ETS(kwh ~ error("M") + trend("A") + season("M")),
MAM_trans = ETS(box_cox(kwh, lambda) ~ error("M") + trend("A") + season("M")),
ANA_trans = ETS(box_cox(kwh, lambda) ~ error("A") + trend("N") + season("A")),
MNM_trans = ETS(box_cox(kwh, lambda) ~ error("M") + trend("N") + season("M")),
ARIMA = ARIMA(box_cox(kwh, lambda))
)
glance(kwh_model) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 12 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 1.51e- 5 752. -1493. -1493. -1478.
## 2 MNM_trans 6.46e- 7 574. -1119. -1116. -1070.
## 3 ANA_trans 1.42e- 5 574. -1118. -1115. -1069.
## 4 MAM_trans 6.51e- 7 575. -1115. -1112. -1060.
## 5 MAM 9.00e- 3 -3053. 6140. 6144. 6196.
## 6 MNM 9.27e- 3 -3056. 6142. 6145. 6191.
## 7 ANA 4.20e+11 -3067. 6163. 6166. 6212.
## 8 AAA 4.19e+11 -3065. 6165. 6168. 6220.
## 9 ANN 2.06e+12 -3225. 6459. 6460. 6476.
## 10 AAN 2.06e+12 -3225. 6459. 6460. 6476.
## 11 seasonal_naive 6.51e+11 NA NA NA NA
## 12 seasonal_naive_trans 2.25e- 5 NA NA NA NA
accuracy(kwh_model) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 12 × 2
## .model RMSE
## <chr> <dbl>
## 1 MAM 613581.
## 2 AAA 619703.
## 3 ANA 623756.
## 4 MNM 623954.
## 5 MNM_trans 624993.
## 6 ANA_trans 625341.
## 7 MAM_trans 625349.
## 8 ARIMA 627673.
## 9 seasonal_naive 809926.
## 10 seasonal_naive_trans 809926.
## 11 ANN 1420764.
## 12 AAN 1420764.
The best model based on the AIC, AICc, and BIC is the ARIMA model. The untransformed MAM model (multiplicative error and seasonality with an additive trend) is the best when considering RMSE.
kwh_model %>% select(ARIMA) %>% gg_tsresiduals() + labs(title = "ARIMA model residuals")
It appears that there is some autocorrelation in the residuals. The residual plot also looks somewhat less random at the start of the series, but after about the year 2000 the residuals appear random.
kwh_forecast <- kwh_model %>% forecast(h = 12) %>% filter(.model == "ARIMA")
kwh_model %>% forecast(h = 12) %>% filter(.model == "ARIMA") %>%
autoplot(kwh_impute) + labs(title = "KWH forecast for 2014")
Export it to a CSV:
kwh_forecast %>% write.csv("Project1_KWH_forecast_Steve_Phillips.csv")
pipe1 <- read_excel("Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
pipe2 <- read_excel("Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
pipe1
## # A tibble: 1,000 × 2
## `Date Time` WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 00:24:06 23.4
## 2 2015-10-23 00:40:02 28.0
## 3 2015-10-23 00:53:51 23.1
## 4 2015-10-23 00:55:40 30.0
## 5 2015-10-23 01:19:17 6.00
## 6 2015-10-23 01:23:58 15.9
## 7 2015-10-23 01:50:05 26.6
## 8 2015-10-23 01:55:33 33.3
## 9 2015-10-23 01:59:15 12.4
## 10 2015-10-23 02:51:51 21.8
## # ℹ 990 more rows
pipe2
## # A tibble: 1,000 × 2
## `Date Time` WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 18.8
## 2 2015-10-23 02:00:00 43.1
## 3 2015-10-23 03:00:00 38.0
## 4 2015-10-23 04:00:00 36.1
## 5 2015-10-23 05:00:00 31.9
## 6 2015-10-23 06:00:00 28.2
## 7 2015-10-23 07:00:00 9.86
## 8 2015-10-23 08:00:00 26.7
## 9 2015-10-23 09:00:00 55.8
## 10 2015-10-23 10:00:00 54.2
## # ℹ 990 more rows
## Rename the columns for both pipes
colnames(pipe1) <- c("time", "flow")
colnames(pipe2) <- c("time", "flow")
The timestamps will need to be adjusted, and the values grouped by hour and then averaged:
pipe1_hourly <- pipe1 %>% group_by(time = floor_date(time, "1 hour")) %>% summarize(avg_hourly_flow = mean(flow))
pipe2_hourly <- pipe2 %>% group_by(time = floor_date(time, "1 hour")) %>% summarize(avg_hourly_flow = mean(flow))
By using the floor_date function, each timestamp is rounded down to the hour it belongs in and grouped; the summarize function then gives us the mean flow for each hour.
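As a quick illustration of the rounding (a minimal sketch):
## floor_date rounds a timestamp down to the start of its hour
floor_date(ymd_hms("2015-10-23 00:24:06"), "1 hour")
## [1] "2015-10-23 00:00:00 UTC"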
pipe1_hourly
## # A tibble: 236 × 2
## time avg_hourly_flow
## <dttm> <dbl>
## 1 2015-10-23 00:00:00 26.1
## 2 2015-10-23 01:00:00 18.9
## 3 2015-10-23 02:00:00 15.2
## 4 2015-10-23 03:00:00 23.1
## 5 2015-10-23 04:00:00 15.5
## 6 2015-10-23 05:00:00 22.7
## 7 2015-10-23 06:00:00 20.6
## 8 2015-10-23 07:00:00 18.4
## 9 2015-10-23 08:00:00 20.8
## 10 2015-10-23 09:00:00 21.2
## # ℹ 226 more rows
pipe1_hourly <- pipe1_hourly %>% as_tsibble()
## Using `time` as index variable.
pipe2_hourly <- pipe2_hourly %>% as_tsibble()
## Using `time` as index variable.
In tsibble format we can plot our data:
pipe1_hourly %>% autoplot(avg_hourly_flow) + labs(title = "Pipe 1 Time series")
paste("There are:", sum(is.na(fill_gaps(pipe1_hourly))), "total missing values")
## [1] "There are: 4 total missing values"
When the fill_gaps function is applied to the dataset, there are 4 missing values, and we can impute them. I initially tried imputing with the ARIMA method, but the imputed values came out negative, so I decided on a different approach: a simple mean replacement of the missing values.
## Fill gaps with NA values
pipe1_hourly <- pipe1_hourly %>% fill_gaps()
## Replace NA values with mean values
pipe1_hourly$avg_hourly_flow <- replace(pipe1_hourly$avg_hourly_flow, is.na(pipe1_hourly$avg_hourly_flow), mean(pipe1_hourly$avg_hourly_flow, na.rm =TRUE))
summary(pipe1_hourly)
## time avg_hourly_flow
## Min. :2015-10-23 00:00:00 Min. : 8.923
## 1st Qu.:2015-10-25 11:45:00 1st Qu.:17.133
## Median :2015-10-27 23:30:00 Median :19.847
## Mean :2015-10-27 23:30:00 Mean :19.893
## 3rd Qu.:2015-10-30 11:15:00 3rd Qu.:22.737
## Max. :2015-11-01 23:00:00 Max. :31.730
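Had the mean fill looked distorting, linear interpolation would be another option that would replace the mean-fill step above (a sketch, assuming the zoo package is installed):
## Linearly interpolate across the four gaps instead of using the mean
pipe1_hourly_interp <- pipe1_hourly %>%
  mutate(avg_hourly_flow = zoo::na.approx(avg_hourly_flow))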
We can look at the ACF plot now that values have been imputed:
pipe1_hourly %>% gg_tsdisplay(avg_hourly_flow, plot_type = "partial")
The ACF plot for pipe 1 suggests the series is stationary, so I will not take any differences, and we can move on to modeling. As a double check, we can use unitroot_ndiffs to see whether any differencing is recommended:
pipe1_hourly %>% features(avg_hourly_flow, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
pipe1_hourly %>% features(avg_hourly_flow, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
Both confirm that no differencing needs to be taken.
pipe1_hourly %>% model(STL(avg_hourly_flow ~ season(window = "periodic"), robust = TRUE))%>%
components() %>%
autoplot() + labs(title = "Pipe 1 STL Decomposition")
Although no differencing is recommended, there may be some daily seasonality. Now we can apply the Box-Cox transformation and then model the data. As this data is stationary, ARIMA will be used:
lambda <- pipe1_hourly %>%
features(avg_hourly_flow, features = guerrero) %>%
pull(lambda_guerrero)
pipe1_fit <- pipe1_hourly %>%
model(mean = MEAN(avg_hourly_flow),
naive = NAIVE(avg_hourly_flow),
naive_box = NAIVE(box_cox(avg_hourly_flow, lambda)),
ARIMA = ARIMA(box_cox(avg_hourly_flow, lambda)),
mean_box = MEAN(box_cox(avg_hourly_flow, lambda)))
glance(pipe1_fit) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 5 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 20.5 -703. 1409. 1409. 1416.
## 2 mean 17.7 NA NA NA NA
## 3 naive 34.6 NA NA NA NA
## 4 naive_box 40.2 NA NA NA NA
## 5 mean_box 20.5 NA NA NA NA
accuracy(pipe1_fit) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 5 × 2
## .model RMSE
## <chr> <dbl>
## 1 mean 4.20
## 2 ARIMA 4.20
## 3 mean_box 4.20
## 4 naive 5.87
## 5 naive_box 5.87
The mean and ARIMA models have the same RMSE, and they perform only slightly better than the naive model. This suggests that our data can't be reliably forecast.
pipe1_hourly %>% model(ARIMA(box_cox(avg_hourly_flow, lambda))) %>% report()
## Series: avg_hourly_flow
## Model: ARIMA(0,0,0) w/ mean
## Transformation: box_cox(avg_hourly_flow, lambda)
##
## Coefficients:
## constant
## 19.9493
## s.e. 0.2918
##
## sigma^2 estimated as 20.52: log likelihood=-702.6
## AIC=1409.21 AICc=1409.26 BIC=1416.17
The optimal ARIMA model chosen is ARIMA(0,0,0) with a mean, i.e., white noise around a constant.
pipe1_fc <- pipe1_fit %>% forecast(h = "1 week") %>% filter(.model == "ARIMA")
pipe1_fc %>% autoplot(pipe1_hourly) + labs(title = "Pipe 1 Forecast")
Although we are able to forecast the data, the ARIMA model simply adopts the mean of the data, which suggests that trying to forecast this series is not all that useful.
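A Ljung-Box test gives a more formal check on that reading (a sketch; lag 24 covers one day of hourly data):
## A large p-value would indicate the flows are indistinguishable
## from white noise at daily lags
pipe1_hourly %>% features(avg_hourly_flow, ljung_box, lag = 24)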
pipe2_hourly %>% autoplot(avg_hourly_flow) + labs(title = "Pipe 2 Time Series")
paste("There are:", sum(is.na(fill_gaps(pipe2_hourly))), "total missing values")
## [1] "There are: 0 total missing values"
Our second pipe dataset doesn’t have any missing values, so nothing will need to be imputed
pipe2_hourly %>% gg_tsdisplay(avg_hourly_flow, plot_type = "partial")
There does appear to be some autocorrelation in pipe 2. We can check whether differencing should be applied using the unit root checks:
pipe2_hourly %>% features(avg_hourly_flow, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
pipe2_hourly %>% features(avg_hourly_flow, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
No differencing is recommended.
pipe2_hourly %>% model(STL(avg_hourly_flow ~ season(window = "periodic"), robust = TRUE))%>%
components() %>%
autoplot() + labs(title = "Pipe 2 STL Decomposition")
After decomposing the data, it looks like there may be daily and weekly seasonal components. Now we can apply a Box-Cox transformation to the data:
lambda <- pipe2_hourly %>%
features(avg_hourly_flow, features = guerrero) %>%
pull(lambda_guerrero)
pipe2_fit <- pipe2_hourly %>%
model(mean = MEAN(avg_hourly_flow),
naive = NAIVE(avg_hourly_flow),
naive_box = NAIVE(box_cox(avg_hourly_flow, lambda)),
ARIMA = ARIMA(box_cox(avg_hourly_flow, lambda)),
mean_box = MEAN(box_cox(avg_hourly_flow, lambda)))
glance(pipe2_fit) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 5 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 172. -3992. 7990. 7990. 8005.
## 2 mean 258. NA NA NA NA
## 3 naive 528. NA NA NA NA
## 4 naive_box 355. NA NA NA NA
## 5 mean_box 173. NA NA NA NA
accuracy(pipe2_fit) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 5 × 2
## .model RMSE
## <chr> <dbl>
## 1 ARIMA 16.0
## 2 mean 16.0
## 3 mean_box 16.0
## 4 naive 23.0
## 5 naive_box 23.0
pipe2_hourly %>% model(ARIMA(box_cox(avg_hourly_flow, lambda))) %>% report()
## Series: avg_hourly_flow
## Model: ARIMA(0,0,0)(0,0,1)[24] w/ mean
## Transformation: box_cox(avg_hourly_flow, lambda)
##
## Coefficients:
## sma1 constant
## 0.0844 32.9694
## s.e. 0.0314 0.4486
##
## sigma^2 estimated as 172.1: log likelihood=-3992.1
## AIC=7990.2 AICc=7990.23 BIC=8004.93
Once again, the ARIMA model is about the same as simply taking the mean of the untransformed data, meaning the forecasts for this data are likely unreliable. The chosen model is an ARIMA(0,0,0)(0,0,1)[24] with a mean.
pipe2_fc <- pipe2_fit %>% forecast(h = "1 week") %>% filter(.model == "ARIMA")
pipe2_fc %>% autoplot(pipe2_hourly) + labs(title = "Pipe 2 Forecast")
Export the data:
pipe1_export<- pipe1_fc %>%
as.data.frame() %>%
select(time, .mean) %>%
rename(avg_hourly_flow = .mean)
pipe2_export<- pipe2_fc %>%
as.data.frame() %>%
select(time, .mean) %>%
rename(avg_hourly_flow = .mean)
## Write to xlsx
write_xlsx(list("Pipe1" = pipe1_export, "Pipe2" = pipe2_export), path = "project1_pipes_steve_phillips.xlsx")