The task for Part 1 is to take transaction data for four different ATMs and forecast how much cash will be withdrawn from each machine in May 2010.
library(tidyverse)
library(fpp3)
library(readxl)
library(knitr)
library(ggplot2)
library(writexl)
Import Data:
suppressWarnings({
df <- read_excel("ATM624Data.xlsx", col_types = c("date","text","numeric"))
})
df <- df %>% mutate(DATE = as_date(DATE))
## Date for time series needs to be in tsibble formatting
df <- df %>% as_tsibble(index = DATE, key = ATM)
df
## # A tsibble: 1,474 x 3 [1D]
## # Key: ATM [5]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
## 7 2009-05-07 ATM1 8
## 8 2009-05-08 ATM1 104
## 9 2009-05-09 ATM1 87
## 10 2009-05-10 ATM1 93
## # ℹ 1,464 more rows
Note: The cash is provided in hundreds of dollars, so the real transaction amount is the value * 100. The values are easier to read as given, so I will leave them in hundreds.
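If actual dollar amounts were ever needed, a simple mutate would recover them (a sketch; Cash_usd is a hypothetical column name, and the conversion is not applied anywhere below):
## Convert the hundreds-of-dollars scale to actual dollars
df_dollars <- df %>% mutate(Cash_usd = Cash * 100)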
EDA:
paste("There are:", sum(is.na(df)), "total missing values")
## [1] "There are: 33 total missing values"
print(colSums(is.na(df)))
## DATE ATM Cash
## 0 14 19
The missing values in the ATM and Cash columns will need to be imputed or removed.
## Look at rows in data that contain missing values
df %>% filter(is.na(ATM) | is.na(Cash))
## # A tsibble: 19 x 3 [1D]
## # Key: ATM [3]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
## 6 2010-05-01 <NA> NA
## 7 2010-05-02 <NA> NA
## 8 2010-05-03 <NA> NA
## 9 2010-05-04 <NA> NA
## 10 2010-05-05 <NA> NA
## 11 2010-05-06 <NA> NA
## 12 2010-05-07 <NA> NA
## 13 2010-05-08 <NA> NA
## 14 2010-05-09 <NA> NA
## 15 2010-05-10 <NA> NA
## 16 2010-05-11 <NA> NA
## 17 2010-05-12 <NA> NA
## 18 2010-05-13 <NA> NA
## 19 2010-05-14 <NA> NA
Some rows are missing both the ATM label and the Cash value, while others are missing only Cash. Rows missing both cannot be imputed; their dates all fall in May 2010, so they appear to be placeholders for the forecast period. Imputing the rows that are missing only Cash is the better approach.
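One way to drop the unlabeled rows explicitly (a sketch; df_labeled is a hypothetical name, and the per-ATM filters used later achieve the same effect):
## Drop the May 2010 placeholder rows that carry no ATM label; the
## Cash-only gaps are kept so they can be imputed later
df_labeled <- df %>% filter(!is.na(ATM))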
Further exploration on the data:
## Generate a dataframe with useful summarizing statistics for each ATM
as.data.frame(df) %>% filter(!is.na(Cash)) %>%
group_by(ATM) %>%
summarise(min = min(Cash, na.rm = TRUE),
max = max(Cash, na.rm = TRUE),
mean = mean(Cash, na.rm = TRUE),
median = median(Cash, na.rm = TRUE),
std = sd(Cash, na.rm = TRUE),
var = var(Cash, na.rm = TRUE))
## # A tibble: 4 × 7
## ATM min max mean median std var
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM1 1 180 83.9 91 36.7 1344.
## 2 ATM2 0 147 62.6 67 38.9 1513.
## 3 ATM3 0 96 0.721 0 7.94 63.1
## 4 ATM4 1.56 10920. 474. 404. 651. 423718.
ATM3 and ATM4 have summary statistics that call for further inspection. ATM3 has a mean and median showing that nearly all of its recorded values are at or near 0. ATM4 has a maximum cash value that far exceeds the other ATMs, along with a much larger standard deviation and variance. ATM1 and ATM2 appear similar to each other.
Now the data must be visualized:
ggplot(df, aes(x = DATE, y= Cash, color = ATM)) +
geom_line() + labs(title= "Transaction History of All ATMs Across the Dataset")
## Warning: Removed 14 rows containing missing values (`geom_line()`).
Overlaying all four ATMs on one plot shows how much both the scale and the variance of transactions differ across the machines.
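An alternative view that sidesteps the scale differences is to facet the plot with free y-axes (a minimal sketch):
## One panel per ATM, each with its own y-scale; unlabeled rows dropped
ggplot(filter(df, !is.na(ATM)), aes(x = DATE, y = Cash)) +
  geom_line() +
  facet_wrap(~ATM, scales = "free_y") +
  labs(title = "Transaction History by ATM, Free Scales")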
filtered <- df %>% filter((ATM == "ATM1" | (ATM == "ATM2")))
filtered
## # A tsibble: 730 x 3 [1D]
## # Key: ATM [2]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
## 7 2009-05-07 ATM1 8
## 8 2009-05-08 ATM1 104
## 9 2009-05-09 ATM1 87
## 10 2009-05-10 ATM1 93
## # ℹ 720 more rows
ggplot(filtered, aes(x = DATE, y= Cash, color = ATM)) +
geom_line() + labs(title ="ATM1 and ATM2 Transaction History")
Models for ATM1 and ATM2 can probably be built through similar processes. ATM3 and ATM4 will need more specifically tailored handling.
ggplot(filter(df, ATM == "ATM4"), aes(x= DATE, y = Cash)) + geom_line() + labs(title= "ATM4 Transaction History")
It appears that a single isolated incident has skewed the transaction values for ATM4. Multiple models might be developed that either exclude or incorporate the outlier.
ggplot(filter(df, ATM == "ATM3"), aes(x= DATE, y = Cash)) + geom_line() + labs(title = "ATM3 Transaction History")
This transaction data also has a spike, but it occurs near the end of the time series. That might give the spike greater weight and alter how we handle it compared to the spike in ATM4. All data prior to the spike is 0, so the spike might mark when the ATM went online.
df %>% filter(ATM == "ATM3")
## # A tsibble: 365 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM3 0
## 2 2009-05-02 ATM3 0
## 3 2009-05-03 ATM3 0
## 4 2009-05-04 ATM3 0
## 5 2009-05-05 ATM3 0
## 6 2009-05-06 ATM3 0
## 7 2009-05-07 ATM3 0
## 8 2009-05-08 ATM3 0
## 9 2009-05-09 ATM3 0
## 10 2009-05-10 ATM3 0
## # ℹ 355 more rows
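A quick filter pins down exactly when the non-zero activity begins (a sketch):
## First dates with non-zero withdrawals for ATM3
df %>%
  as_tibble() %>%
  filter(ATM == "ATM3", Cash > 0) %>%
  head()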
For easier modeling, the four ATMs will be split into their own dataframes. This also removes the unlabeled rows, which contain only dates and cannot be imputed:
atm1 <- df %>% filter(ATM == "ATM1")
atm2 <- df %>% filter(ATM == "ATM2")
atm3 <- df %>% filter(ATM == "ATM3")
atm4 <- df %>% filter(ATM == "ATM4")
atm1
## # A tsibble: 365 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <dbl>
## 1 2009-05-01 ATM1 96
## 2 2009-05-02 ATM1 82
## 3 2009-05-03 ATM1 85
## 4 2009-05-04 ATM1 90
## 5 2009-05-05 ATM1 99
## 6 2009-05-06 ATM1 88
## 7 2009-05-07 ATM1 8
## 8 2009-05-08 ATM1 104
## 9 2009-05-09 ATM1 87
## 10 2009-05-10 ATM1 93
## # ℹ 355 more rows
The ARIMA model can be used to generate imputed values for the rows that have missing Cash values.
ATM 1 impute:
atm1_impute <- atm1 %>% model(ARIMA(Cash)) %>%
interpolate(atm1)
atm1_impute
## # A tsibble: 365 x 3 [1D]
## # Key: ATM [1]
## ATM DATE Cash
## <chr> <date> <dbl>
## 1 ATM1 2009-05-01 96
## 2 ATM1 2009-05-02 82
## 3 ATM1 2009-05-03 85
## 4 ATM1 2009-05-04 90
## 5 ATM1 2009-05-05 99
## 6 ATM1 2009-05-06 88
## 7 ATM1 2009-05-07 8
## 8 ATM1 2009-05-08 104
## 9 ATM1 2009-05-09 87
## 10 ATM1 2009-05-10 93
## # ℹ 355 more rows
ATM 2 impute:
atm2_impute <- atm2 %>% model(ARIMA(Cash)) %>%
interpolate(atm2)
atm2_impute
## # A tsibble: 365 x 3 [1D]
## # Key: ATM [1]
## ATM DATE Cash
## <chr> <date> <dbl>
## 1 ATM2 2009-05-01 107
## 2 ATM2 2009-05-02 89
## 3 ATM2 2009-05-03 90
## 4 ATM2 2009-05-04 55
## 5 ATM2 2009-05-05 79
## 6 ATM2 2009-05-06 19
## 7 ATM2 2009-05-07 2
## 8 ATM2 2009-05-08 103
## 9 ATM2 2009-05-09 107
## 10 ATM2 2009-05-10 118
## # ℹ 355 more rows
ATM 4 impute:
ATM4 contains a spike in its values. I would like to replace the outlier and then build models for the data with and without it.
atm4_decomp <- atm4 %>%
model(stl = STL(Cash ~ season(period = 1), robust = TRUE)) %>%
components()
atm4_decomp %>% autoplot()
The outlier is clearly visible in the remainder component.
atm4_outliers <- atm4_decomp %>% filter(remainder < quantile(remainder, 0.25) - 3 * IQR(remainder) |
remainder > quantile(remainder, 0.75) + 3 * IQR(remainder))
atm4_outliers
## # A dable: 1 x 7 [1D]
## # Key: ATM, .model [1]
## # : Cash = trend + remainder
## ATM .model DATE Cash trend remainder season_adjust
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl>
## 1 ATM4 stl 2010-02-09 10920. 491. 10428. 10920.
The outlier has now been isolated
atm4_impute <- atm4 %>% anti_join(atm4_outliers) %>% fill_gaps()
## Joining with `by = join_by(DATE, ATM, Cash)`
atm4_impute <- atm4_impute %>% model(ARIMA(Cash)) %>% interpolate(atm4_impute)
After replacing the outlier using the ARIMA model, we can plot the data again:
atm4_impute %>% autoplot(Cash) + labs (title = "Imputed Data for ATM 4")
We will need to determine core components of the time series, such as seasonality and trend. We can do this with an STL decomposition:
atm1_impute %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() + labs(title = "ATM1 STL Decomposition")
ATM1 has an inconsistent trend component and a weekly seasonal element, in which cash drops and then returns to baseline.
We can apply a Box-Cox transformation to the data for modeling:
lambda <- atm1_impute %>% features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
paste("Lambda Value:", round(lambda, 4))
## [1] "Lambda Value: 0.2623"
A Box-Cox transformation is a good starting point for stabilizing the variance of the series.
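For reference, box_cox() applies the transformation below; a hand-rolled version (a minimal sketch of the standard one-parameter form for positive data):
## Box-Cox: log(y) when lambda is 0, otherwise (y^lambda - 1) / lambda
manual_box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}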
Now we can start running models:
atm1_model <- atm1_impute %>%
model(
seasonal_naive = SNAIVE(Cash),
seasonal_naive_trans = SNAIVE(box_cox(Cash,lambda)),
ANN = ETS(Cash ~ error("A") + trend("N") + season("N")),
AAN = ETS(Cash ~ error("A") + trend("A") + season("N")),
AAA = ETS(Cash ~ error("A") + trend("A") + season("A")),
ANA = ETS(Cash ~ error("A") + trend("N") + season("A")),
MNM = ETS(Cash ~ error("M") + trend("N") + season("M")),
MAM = ETS(Cash ~ error("M") + trend("A") + season("M")),
MAM_trans = ETS(box_cox(Cash, lambda) ~ error("M") + trend("A") + season("M")),
ANA_trans = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A")),
MNM_trans = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M")),
ARIMA = ARIMA(box_cox(Cash, lambda))
)
Look at the model fits:
glance(atm1_model) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 12 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 1.77 -611. 1229. 1229. 1245.
## 2 ANA_trans 1.81 -1180. 2381. 2381. 2420.
## 3 MNM_trans 0.0383 -1220. 2459. 2460. 2498.
## 4 MAM_trans 0.0383 -1221. 2465. 2466. 2512.
## 5 ANA 577. -2232. 4485. 4486. 4524.
## 6 AAA 582. -2233. 4490. 4491. 4537.
## 7 MNM 0.134 -2273. 4566. 4566. 4605.
## 8 MAM 0.133 -2273. 4571. 4572. 4618.
## 9 ANN 1360. -2391. 4793. 4793. 4812.
## 10 AAN 1360. -2391. 4793. 4793. 4812.
## 11 seasonal_naive 765. NA NA NA NA
## 12 seasonal_naive_trans 2.49 NA NA NA NA
accuracy(atm1_model) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 12 × 2
## .model RMSE
## <chr> <dbl>
## 1 MNM_trans 23.3
## 2 MAM_trans 23.6
## 3 ANA 23.7
## 4 AAA 23.8
## 5 ANA_trans 24.9
## 6 ARIMA 24.9
## 7 MNM 26.2
## 8 MAM 26.3
## 9 seasonal_naive 27.6
## 10 seasonal_naive_trans 27.6
## 11 ANN 36.7
## 12 AAN 36.7
Looking at the fit and accuracy metrics together, the transformed multiplicative model and the ARIMA model are the most appropriate. The transformed multiplicative model has the lowest RMSE, but the ARIMA model has the lowest AIC, AICc, and BIC. With this in mind, the ARIMA model is the one chosen.
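A time-series cross-validation would give an out-of-sample check on this choice (a sketch, not run here; the .init and .step values are arbitrary):
## Expanding training windows, forecasting 7 days past each one
atm1_cv <- atm1_impute %>% stretch_tsibble(.init = 180, .step = 30)
atm1_cv %>%
  model(ARIMA = ARIMA(box_cox(Cash, lambda))) %>%
  forecast(h = 7) %>%
  accuracy(atm1_impute)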
atm1_impute %>% model(ARIMA(box_cox(Cash, lambda))) %>% report()
## Series: Cash
## Model: ARIMA(0,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ma1 ma2 sma1
## 0.1105 -0.1088 -0.6419
## s.e. 0.0524 0.0521 0.0431
##
## sigma^2 estimated as 1.771: log likelihood=-610.69
## AIC=1229.38 AICc=1229.49 BIC=1244.9
The optimal model is an ARIMA(0,0,2)(0,1,1)[7].
atm1_model %>% select(ARIMA) %>% gg_tsresiduals() + labs(title = "ARIMA model residuals")
The residuals for this model don't appear to show any autocorrelation, and they appear random with a fairly normal distribution.
atm1_fc <- atm1_model %>% forecast(h = 31) %>% filter(.model == "ARIMA")
atm1_model %>% forecast(h = 31) %>% filter(.model == "ARIMA") %>%
autoplot(atm1_impute) + labs(title = "ATM1 forecast for May")
We will apply a Box-Cox transformation to ATM2 as well:
lambda <- atm2_impute %>% features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
paste("Lambda Value:", round(lambda, 4))
## [1] "Lambda Value: 0.6747"
atm2_impute %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() + labs(title = "ATM1 STL Decomposition")
As we predicted earlier, ATM2 is similar to ATM1: there is a weekly seasonal component with no real visible trend throughout the data.
The same models tested for ATM1 can be fit to ATM2:
atm2_model <- atm2_impute %>%
model(
seasonal_naive = SNAIVE(Cash),
seasonal_naive_trans = SNAIVE(box_cox(Cash,lambda)),
ANN = ETS(Cash ~ error("A") + trend("N") + season("N")),
AAN = ETS(Cash ~ error("A") + trend("A") + season("N")),
AAA = ETS(Cash ~ error("A") + trend("A") + season("A")),
ANA = ETS(Cash ~ error("A") + trend("N") + season("A")),
MNM = ETS(Cash ~ error("M") + trend("N") + season("M")),
MAM = ETS(Cash ~ error("M") + trend("A") + season("M")),
MAM_trans = ETS(box_cox(Cash, lambda) ~ error("M") + trend("A") + season("M")),
ANA_trans = ETS(box_cox(Cash, lambda) ~ error("A") + trend("N") + season("A")),
MNM_trans = ETS(box_cox(Cash, lambda) ~ error("M") + trend("N") + season("M")),
ARIMA = ARIMA(box_cox(Cash, lambda))
)
Look at the fits and accuracy:
glance(atm2_model) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 12 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 45.0 -1189. 2390. 2390. 2413.
## 2 ANA_trans 47.5 -1777. 3573. 3574. 3612.
## 3 MNM_trans 0.245 -1932. 3885. 3886. 3924.
## 4 MAM_trans 0.273 -1932. 3888. 3888. 3934.
## 5 ANA 622. -2246. 4512. 4513. 4551.
## 6 AAA 634. -2249. 4521. 4522. 4568.
## 7 MAM 0.384 -2377. 4779. 4779. 4825.
## 8 ANN 1482. -2407. 4824. 4825. 4844.
## 9 AAN 1482. -2407. 4824. 4825. 4844.
## 10 MNM 0.738 -2438. 4897. 4898. 4936.
## 11 seasonal_naive 881. NA NA NA NA
## 12 seasonal_naive_trans 65.2 NA NA NA NA
accuracy(atm2_model) %>% select(.model, RMSE) %>% arrange(RMSE)
Once again, ARIMA performs best on the information criteria, so we will use the ARIMA model for ATM2 as well.
atm2_impute %>% model(ARIMA(box_cox(Cash, lambda))) %>% report()
## Series: Cash
## Model: ARIMA(2,0,2)(0,1,1)[7]
## Transformation: box_cox(Cash, lambda)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.4180 -0.9012 0.4579 0.7889 -0.7129
## s.e. 0.0592 0.0475 0.0888 0.0619 0.0419
##
## sigma^2 estimated as 44.97: log likelihood=-1188.76
## AIC=2389.51 AICc=2389.75 BIC=2412.8
The optimal ARIMA model is an ARIMA(2,0,2)(0,1,1)[7].
atm2_model %>% select(ARIMA) %>% gg_tsresiduals() + labs(title = "ARIMA model residuals")
For ATM2, it appears that there is some autocorrelation, but the residuals are approximately normally distributed and random.
atm2_fc <- atm2_model %>% forecast(h = 31) %>% filter(.model == "ARIMA")
atm2_model %>% forecast(h = 31) %>% filter(.model == "ARIMA") %>%
autoplot(atm2_impute) + labs(title = "ATM2 forecast for May")
ATM3 modeling will need to be more subjective. There is almost no information to model from, since all but three values are 0. With this in mind, a naive-style model is the most likely choice:
atm3 %>% autoplot()
## Plot variable not specified, automatically selected `.vars = Cash`
A Box-Cox transformation would do nothing for this data and is unneeded. The only models that can realistically represent it are the mean, naive, and drift:
atm3_model <- atm3 %>%
model(naive = NAIVE(Cash),
mean = MEAN(Cash),
drift = NAIVE(Cash ~ drift()))
accuracy(atm3_model) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 3 × 2
## .model RMSE
## <chr> <dbl>
## 1 drift 5.08
## 2 naive 5.09
## 3 mean 7.93
Although RMSE is less informative for a series like this, it tells us that the lowest-error models are the drift and naive models. The drift model is what we will choose based on this fit.
atm3_model %>% forecast(h = 31) %>% autoplot(atm3, level = NULL) + labs(title = "ATM3 forecast for May using 3 models")
atm3_fc <- atm3_model %>% forecast(h = 31) %>% filter(.model == "drift")
atm3_model %>% forecast(h = 31) %>% filter(.model == "drift") %>% autoplot(atm3) + labs(title ="ATM 3 forecast using Drift model")
Our ATM4 data has two versions: one with the outlier and one with the outlier replaced using the ARIMA model. We can create models for both:
atm4 %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() + labs(title = "ATM4 STL Decomposition")
atm4_impute %>% model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
components() %>%
autoplot() + labs(title = "ATM4 Imputed Data STL Decomposition")
The weekly seasonal component is the same for both versions of the data; the only noticeable difference is the spike in the remainder of the original data, where the outlier sits. Next, we can apply Box-Cox transformations to both datasets:
lambda1 <- atm4 %>% features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
paste("Lambda Value For original dataset:", round(lambda1, 4))
## [1] "Lambda Value For original dataset: -0.0737"
lambda2 <- atm4_impute %>% features(Cash, features = guerrero) %>%
pull(lambda_guerrero)
paste("Lambda Value For Imputed dataset:", round(lambda2, 4))
## [1] "Lambda Value For Imputed dataset: 0.4496"
The lambda values for the two datasets are noticeably different. Now we can fit our models:
atm4_model <- atm4 %>%
model(
seasonal_naive = SNAIVE(Cash),
seasonal_naive_trans = SNAIVE(box_cox(Cash,lambda1)),
ANN = ETS(Cash ~ error("A") + trend("N") + season("N")),
AAN = ETS(Cash ~ error("A") + trend("A") + season("N")),
AAA = ETS(Cash ~ error("A") + trend("A") + season("A")),
ANA = ETS(Cash ~ error("A") + trend("N") + season("A")),
MNM = ETS(Cash ~ error("M") + trend("N") + season("M")),
MAM = ETS(Cash ~ error("M") + trend("A") + season("M")),
MAM_trans = ETS(box_cox(Cash, lambda1) ~ error("M") + trend("A") + season("M")),
ANA_trans = ETS(box_cox(Cash, lambda1) ~ error("A") + trend("N") + season("A")),
MNM_trans = ETS(box_cox(Cash, lambda1) ~ error("M") + trend("N") + season("M")),
ARIMA = ARIMA(box_cox(Cash, lambda1))
)
glance(atm4_model) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 12 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 0.842 -486. 979. 979. 995.
## 2 ANA_trans 0.807 -1033. 2086. 2087. 2125.
## 3 MNM_trans 0.0409 -1034. 2089. 2089. 2128.
## 4 MAM_trans 0.0411 -1037. 2098. 2099. 2145.
## 5 MAM 1.84 -3385. 6795. 6796. 6841.
## 6 MNM 1.85 -3388. 6795. 6796. 6834.
## 7 ANA 412779. -3432. 6884. 6885. 6923.
## 8 AAA 423177. -3436. 6895. 6896. 6942.
## 9 ANN 434658. -3444. 6898. 6898. 6917.
## 10 AAN 434658. -3444. 6898. 6898. 6917.
## 11 seasonal_naive 805110. NA NA NA NA
## 12 seasonal_naive_trans 1.32 NA NA NA NA
accuracy(atm4_model) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 12 × 2
## .model RMSE
## <chr> <dbl>
## 1 ANA 635.
## 2 AAA 641.
## 3 ANN 656.
## 4 AAN 656.
## 5 MAM_trans 664.
## 6 ANA_trans 665.
## 7 MNM_trans 666.
## 8 ARIMA 679.
## 9 MNM 851.
## 10 seasonal_naive_trans 896.
## 11 seasonal_naive 896.
## 12 MAM 1017.
For the dataset with the outlier, the ARIMA model fits best, as it has the lowest AIC, AICc, and BIC. The additive ETS models have the best RMSE but fall behind on the other metrics.
atm4_model %>% select(ARIMA) %>% gg_tsresiduals() + labs(title = "ATM 4 ARIMA model residuals")
The ARIMA model for the original ATM4 data has only one significant autocorrelation spike, and the residuals appear randomly distributed.
atm4_model %>% select(ARIMA) %>% report()
## Series: Cash
## Model: ARIMA(0,0,0)(2,0,0)[7] w/ mean
## Transformation: box_cox(Cash, lambda1)
##
## Coefficients:
## sar1 sar2 constant
## 0.2487 0.1947 2.4972
## s.e. 0.0521 0.0525 0.0468
##
## sigma^2 estimated as 0.8418: log likelihood=-485.59
## AIC=979.18 AICc=979.29 BIC=994.78
The selected model is an ARIMA(0,0,0)(2,0,0)[7] with a mean.
Now we can model the imputed dataset:
atm4_impute_model <- atm4_impute %>%
model(
seasonal_naive = SNAIVE(Cash),
seasonal_naive_trans = SNAIVE(box_cox(Cash,lambda2)),
ANN = ETS(Cash ~ error("A") + trend("N") + season("N")),
AAN = ETS(Cash ~ error("A") + trend("A") + season("N")),
AAA = ETS(Cash ~ error("A") + trend("A") + season("A")),
ANA = ETS(Cash ~ error("A") + trend("N") + season("A")),
MNM = ETS(Cash ~ error("M") + trend("N") + season("M")),
MAM = ETS(Cash ~ error("M") + trend("A") + season("M")),
MAM_trans = ETS(box_cox(Cash, lambda2) ~ error("M") + trend("A") + season("M")),
ANA_trans = ETS(box_cox(Cash, lambda2) ~ error("A") + trend("N") + season("A")),
MNM_trans = ETS(box_cox(Cash, lambda2) ~ error("M") + trend("N") + season("M")),
ARIMA = ARIMA(box_cox(Cash, lambda2))
)
glance(atm4_impute_model) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 12 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 176. -1460. 2930. 2930. 2950.
## 2 MNM_trans 0.220 -2003. 4025. 4026. 4064.
## 3 ANA_trans 167. -2006. 4032. 4032. 4071.
## 4 MAM_trans 0.209 -2007. 4038. 4039. 4085.
## 5 ANA 110883. -3192. 6404. 6405. 6443.
## 6 MNM 0.728 -3192. 6404. 6405. 6443.
## 7 MAM 0.646 -3195. 6414. 6415. 6461.
## 8 AAA 115261. -3198. 6420. 6421. 6467.
## 9 ANN 127144. -3220. 6449. 6449. 6469.
## 10 AAN 127144. -3220. 6449. 6449. 6469.
## 11 seasonal_naive 205867. NA NA NA NA
## 12 seasonal_naive_trans 289. NA NA NA NA
accuracy(atm4_impute_model) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 12 × 2
## .model RMSE
## <chr> <dbl>
## 1 ANA 329.
## 2 AAA 334.
## 3 ANA_trans 338.
## 4 MNM_trans 345.
## 5 MAM_trans 350.
## 6 ARIMA 352.
## 7 MNM 354.
## 8 ANN 355.
## 9 AAN 355.
## 10 MAM 367.
## 11 seasonal_naive 453.
## 12 seasonal_naive_trans 453.
atm4_impute_model %>% select(ARIMA) %>% gg_tsresiduals() + labs(title = "ARIMA model residuals")
The residuals for the imputed data are similar to those for the original data, with one autocorrelation spike and randomly distributed residuals.
atm4_impute_model %>% select(ARIMA) %>% report()
The report gives the ARIMA specification selected for the imputed dataset.
Overall, the model fit on the original data has better ARIMA fit statistics than the imputed one. I will still keep the imputed ARIMA model for forecasting, as it should better represent typical future behavior.
atm4_fc <- atm4_impute_model %>% forecast(h = 31) %>% filter(.model == "ARIMA")
atm4_impute_model %>% forecast(h = 31) %>% filter(.model == "ARIMA") %>%
autoplot(atm4_impute) + labs(title = "ATM4 Imputed Data forecast for May")
atm4_model %>% forecast(h = 31) %>% filter(.model == "ARIMA") %>%
autoplot(atm4) + labs(title = "ATM4 Data forecast for May")
Now we can export our forecasts to a CSV file:
## Add back in the ATM specification
atm1_fc$atm = "ATM1"
atm2_fc$atm = "ATM2"
atm3_fc$atm = "ATM3"
atm4_fc$atm = "ATM4"
atm_fc_all <- bind_rows(atm1_fc, atm2_fc,atm3_fc,atm4_fc) %>%
as.data.frame() %>%
select(DATE, atm, .mean) %>%
rename(Cash = .mean)
atm_fc_all %>% write.csv("Project1_ATM_forecasts_Steve_Phillips.csv")
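If an Excel workbook is preferred instead, writexl (already loaded) could produce one (a sketch; the filename is arbitrary):
## Alternative: write the combined forecasts to a single xlsx sheet
write_xlsx(atm_fc_all, "Project1_ATM_forecasts.xlsx")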
The task for Part 2 is to create a monthly forecast of residential power usage for 2014.
suppressWarnings({
df <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
})
colnames(df) <- c("case_sequence","date", "kwh")
df <- df %>% mutate(date = yearmonth(date))
df <- df %>% as_tsibble(index = date)
Now that the data is a tsibble, it can be modeled as a time series:
df %>% autoplot(kwh) + labs(title = "Plot of KWH as a function of time")
paste("There are:", sum(is.na(df)), "total missing values")
## [1] "There are: 1 total missing values"
print(colSums(is.na(df)))
## case_sequence date kwh
## 0 0 1
We have 1 missing value that needs to be evaluated:
df %>% filter(is.na(kwh))
## # A tsibble: 1 x 3 [1M]
## case_sequence date kwh
## <dbl> <mth> <dbl>
## 1 861 2008 Sep NA
The value for September 2008 will need to be imputed
as.data.frame(df) %>% filter(!is.na(kwh)) %>%
summarise(min = min(kwh, na.rm = TRUE),
max = max(kwh, na.rm = TRUE),
mean = mean(kwh, na.rm = TRUE),
median = median(kwh, na.rm = TRUE),
std = sd(kwh, na.rm = TRUE),
var = var(kwh, na.rm = TRUE))
## min max mean median std var
## 1 770523 10655730 6502475 6283324 1447571 2.095461e+12
Impute the missing value:
kwh_impute <- df %>% model(ARIMA(kwh)) %>%
interpolate(df)
kwh_decomp <- kwh_impute %>%
model(stl = STL(kwh ~ season(period = 1), robust = TRUE)) %>%
components()
kwh_decomp %>% autoplot()
There does appear to be an outlier in the data, around July 2010. We can replace this as well:
kwh_outliers <- kwh_decomp %>% filter(remainder < quantile(remainder, 0.25) - 2 * IQR(remainder) |
remainder > quantile(remainder, 0.75) + 2 * IQR(remainder))
kwh_outliers
## # A dable: 1 x 6 [1M]
## # Key: .model [1]
## # : kwh = trend + remainder
## .model date kwh trend remainder season_adjust
## <chr> <mth> <dbl> <dbl> <dbl> <dbl>
## 1 stl 2010 Jul 770523 6855641. -6085118. 770523
We can use the ARIMA model to replace this outlier
kwh_impute <- kwh_impute %>% anti_join(kwh_outliers) %>% fill_gaps()
## Joining with `by = join_by(date, kwh)`
kwh_impute <- kwh_impute %>% model(ARIMA(kwh)) %>% interpolate(kwh_impute)
Now a Box-Cox transformation can be applied:
lambda <- kwh_impute %>% features(kwh, features = guerrero) %>%
pull(lambda_guerrero)
paste("Lambda Value:", round(lambda, 4))
## [1] "Lambda Value: -0.2049"
kwh_decomp <- kwh_impute %>% model(STL(kwh ~ season(window = "periodic"), robust = TRUE)) %>%
components()
kwh_decomp %>% autoplot() + labs(title = "kwh STL Decomposition")
It appears that there is a trend as well as a yearly seasonal component in the kwh data. The outlier is no longer present. We can identify the months in which kwh peaks:
kwh_impute %>%
gg_subseries(kwh) + labs(title = "Seasonal Subseries KWH")
kwh peaks around July, August, and September, when it is hotter, as well as December and January, when it is colder.
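A quick numeric check of that seasonal pattern is to average usage by calendar month (a sketch using the imputed series):
## Mean kwh per calendar month across the series
kwh_impute %>%
  as_tibble() %>%
  mutate(month = month(as.Date(date), label = TRUE)) %>%
  group_by(month) %>%
  summarise(mean_kwh = mean(kwh))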
We can now try to fit our models and evaluate them:
kwh_model <- kwh_impute %>%
model(
seasonal_naive = SNAIVE(kwh),
seasonal_naive_trans = SNAIVE(box_cox(kwh,lambda)),
ANN = ETS(kwh ~ error("A") + trend("N") + season("N")),
AAN = ETS(kwh ~ error("A") + trend("A") + season("N")),
AAA = ETS(kwh ~ error("A") + trend("A") + season("A")),
ANA = ETS(kwh ~ error("A") + trend("N") + season("A")),
MNM = ETS(kwh ~ error("M") + trend("N") + season("M")),
MAM = ETS(kwh ~ error("M") + trend("A") + season("M")),
MAM_trans = ETS(box_cox(kwh, lambda) ~ error("M") + trend("A") + season("M")),
ANA_trans = ETS(box_cox(kwh, lambda) ~ error("A") + trend("N") + season("A")),
MNM_trans = ETS(box_cox(kwh, lambda) ~ error("M") + trend("N") + season("M")),
ARIMA = ARIMA(box_cox(kwh, lambda))
)
glance(kwh_model) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 12 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 1.51e- 5 752. -1493. -1493. -1478.
## 2 MNM_trans 6.46e- 7 574. -1119. -1116. -1070.
## 3 ANA_trans 1.42e- 5 574. -1118. -1115. -1069.
## 4 MAM_trans 6.51e- 7 575. -1115. -1112. -1060.
## 5 MAM 9.00e- 3 -3053. 6140. 6144. 6196.
## 6 MNM 9.27e- 3 -3056. 6142. 6145. 6191.
## 7 ANA 4.20e+11 -3067. 6163. 6166. 6212.
## 8 AAA 4.19e+11 -3065. 6165. 6168. 6220.
## 9 ANN 2.06e+12 -3225. 6459. 6460. 6476.
## 10 AAN 2.06e+12 -3225. 6459. 6460. 6476.
## 11 seasonal_naive 6.51e+11 NA NA NA NA
## 12 seasonal_naive_trans 2.25e- 5 NA NA NA NA
accuracy(kwh_model) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 12 × 2
## .model RMSE
## <chr> <dbl>
## 1 MAM 613581.
## 2 AAA 619703.
## 3 ANA 623756.
## 4 MNM 623954.
## 5 MNM_trans 624993.
## 6 ANA_trans 625341.
## 7 MAM_trans 625349.
## 8 ARIMA 627673.
## 9 seasonal_naive 809926.
## 10 seasonal_naive_trans 809926.
## 11 ANN 1420764.
## 12 AAN 1420764.
The best model based on the AIC, AICc, and BIC is the ARIMA model. The untransformed MAM model (multiplicative error and seasonality with an additive trend) is the best when considering RMSE.
kwh_model %>% select(ARIMA) %>% gg_tsresiduals() + labs(title = "ARIMA model residuals")
It appears that there is some autocorrelation in the residuals. The residual plot also looks somewhat less random at the start of the series, but after about the year 2000 the residuals appear random.
kwh_forecast <- kwh_model %>% forecast(h = 12) %>% filter(.model == "ARIMA")
kwh_model %>% forecast(h = 12) %>% filter(.model == "ARIMA") %>%
autoplot(kwh_impute) + labs(title = "KWH forecast for 2014")
Export it to a CSV:
kwh_forecast %>% write.csv("Project1_KWH_forecast_Steve_Phillips.csv")
pipe1 <- read_excel("Waterflow_Pipe1.xlsx", col_types = c("date", "numeric"))
pipe2 <- read_excel("Waterflow_Pipe2.xlsx", col_types = c("date", "numeric"))
pipe1
## # A tibble: 1,000 × 2
## `Date Time` WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 00:24:06 23.4
## 2 2015-10-23 00:40:02 28.0
## 3 2015-10-23 00:53:51 23.1
## 4 2015-10-23 00:55:40 30.0
## 5 2015-10-23 01:19:17 6.00
## 6 2015-10-23 01:23:58 15.9
## 7 2015-10-23 01:50:05 26.6
## 8 2015-10-23 01:55:33 33.3
## 9 2015-10-23 01:59:15 12.4
## 10 2015-10-23 02:51:51 21.8
## # ℹ 990 more rows
pipe2
## # A tibble: 1,000 × 2
## `Date Time` WaterFlow
## <dttm> <dbl>
## 1 2015-10-23 01:00:00 18.8
## 2 2015-10-23 02:00:00 43.1
## 3 2015-10-23 03:00:00 38.0
## 4 2015-10-23 04:00:00 36.1
## 5 2015-10-23 05:00:00 31.9
## 6 2015-10-23 06:00:00 28.2
## 7 2015-10-23 07:00:00 9.86
## 8 2015-10-23 08:00:00 26.7
## 9 2015-10-23 09:00:00 55.8
## 10 2015-10-23 10:00:00 54.2
## # ℹ 990 more rows
## Rename the columns for both pipes
colnames(pipe1) <- c("time", "flow")
colnames(pipe2) <- c("time", "flow")
The timestamps will need to be adjusted, and the values grouped by hour and then averaged:
pipe1_hourly <- pipe1 %>% group_by(time = floor_date(time, "1 hour")) %>% summarize(avg_hourly_flow = mean(flow))
pipe2_hourly <- pipe2 %>% group_by(time = floor_date(time, "1 hour")) %>% summarize(avg_hourly_flow = mean(flow))
By using the floor_date function, each timestamp is rounded down to the hour it belongs in and grouped; the summarize function then gives us the mean flow for each hour.
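As a quick illustration of the rounding (a minimal sketch):
## floor_date rounds a timestamp down to the start of its hour
floor_date(ymd_hms("2015-10-23 00:24:06"), "1 hour")
## [1] "2015-10-23 00:00:00 UTC"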
pipe1_hourly
## # A tibble: 236 × 2
## time avg_hourly_flow
## <dttm> <dbl>
## 1 2015-10-23 00:00:00 26.1
## 2 2015-10-23 01:00:00 18.9
## 3 2015-10-23 02:00:00 15.2
## 4 2015-10-23 03:00:00 23.1
## 5 2015-10-23 04:00:00 15.5
## 6 2015-10-23 05:00:00 22.7
## 7 2015-10-23 06:00:00 20.6
## 8 2015-10-23 07:00:00 18.4
## 9 2015-10-23 08:00:00 20.8
## 10 2015-10-23 09:00:00 21.2
## # ℹ 226 more rows
pipe1_hourly <- pipe1_hourly %>% as_tsibble()
## Using `time` as index variable.
pipe2_hourly <- pipe2_hourly %>% as_tsibble()
## Using `time` as index variable.
In tsibble format we can plot our data:
pipe1_hourly %>% autoplot(avg_hourly_flow) + labs(title = "Pipe 1 Time series")
paste("There are:", sum(is.na(fill_gaps(pipe1_hourly))), "total missing values")
## [1] "There are: 4 total missing values"
When the fill_gaps function is applied to the dataset, there are 4 missing values, and we can impute them. I initially tried imputing with the ARIMA method, but the imputed values came out negative, so I decided on a different approach: a simple mean replacement of the missing values.
## Fill gaps with NA values
pipe1_hourly <- pipe1_hourly %>% fill_gaps()
## Replace NA values with mean values
pipe1_hourly$avg_hourly_flow <- replace(pipe1_hourly$avg_hourly_flow, is.na(pipe1_hourly$avg_hourly_flow), mean(pipe1_hourly$avg_hourly_flow, na.rm =TRUE))
summary(pipe1_hourly)
## time avg_hourly_flow
## Min. :2015-10-23 00:00:00 Min. : 8.923
## 1st Qu.:2015-10-25 11:45:00 1st Qu.:17.133
## Median :2015-10-27 23:30:00 Median :19.847
## Mean :2015-10-27 23:30:00 Mean :19.893
## 3rd Qu.:2015-10-30 11:15:00 3rd Qu.:22.737
## Max. :2015-11-01 23:00:00 Max. :31.730
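Had the mean fill looked distorting, linear interpolation would be another option that would replace the mean-fill step above (a sketch, assuming the zoo package is installed):
## Linearly interpolate across the four gaps instead of using the mean
pipe1_hourly_interp <- pipe1_hourly %>%
  mutate(avg_hourly_flow = zoo::na.approx(avg_hourly_flow))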
We can look at the ACF plot now that values have been imputed:
pipe1_hourly %>% gg_tsdisplay(avg_hourly_flow, plot_type = "partial")
The ACF plot for pipe 1 suggests the series is stationary, so I will not take any differences, and we can move on to modeling. As a double check, we can use unitroot_ndiffs to see whether any differencing is recommended:
pipe1_hourly %>% features(avg_hourly_flow, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
pipe1_hourly %>% features(avg_hourly_flow, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
Both confirm that no differencing needs to be taken.
pipe1_hourly %>% model(STL(avg_hourly_flow ~ season(window = "periodic"), robust = TRUE))%>%
components() %>%
autoplot() + labs(title = "Pipe 1 STL Decomposition")
Although no differencing is recommended, there may be some daily seasonality. Now we can apply the Box-Cox transformation and then model the data. As this data is stationary, ARIMA will be used:
lambda <- pipe1_hourly %>%
features(avg_hourly_flow, features = guerrero) %>%
pull(lambda_guerrero)
pipe1_fit <- pipe1_hourly %>%
model(mean = MEAN(avg_hourly_flow),
naive = NAIVE(avg_hourly_flow),
naive_box = NAIVE(box_cox(avg_hourly_flow, lambda)),
ARIMA = ARIMA(box_cox(avg_hourly_flow, lambda)),
mean_box = MEAN(box_cox(avg_hourly_flow, lambda)))
glance(pipe1_fit) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 5 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 20.5 -703. 1409. 1409. 1416.
## 2 mean 17.7 NA NA NA NA
## 3 naive 34.6 NA NA NA NA
## 4 naive_box 40.2 NA NA NA NA
## 5 mean_box 20.5 NA NA NA NA
accuracy(pipe1_fit) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 5 × 2
## .model RMSE
## <chr> <dbl>
## 1 mean 4.20
## 2 ARIMA 4.20
## 3 mean_box 4.20
## 4 naive 5.87
## 5 naive_box 5.87
The mean and ARIMA models have the same RMSE, and they perform only slightly better than the naive model. This suggests that our data can't be reliably forecast.
pipe1_hourly %>% model(ARIMA(box_cox(avg_hourly_flow, lambda))) %>% report()
## Series: avg_hourly_flow
## Model: ARIMA(0,0,0) w/ mean
## Transformation: box_cox(avg_hourly_flow, lambda)
##
## Coefficients:
## constant
## 19.9493
## s.e. 0.2918
##
## sigma^2 estimated as 20.52: log likelihood=-702.6
## AIC=1409.21 AICc=1409.26 BIC=1416.17
The optimal ARIMA model chosen is ARIMA(0,0,0) with a mean, i.e., white noise around a constant.
pipe1_fc <- pipe1_fit %>% forecast(h = "1 week") %>% filter(.model == "ARIMA")
pipe1_fc %>% autoplot(pipe1_hourly) + labs(title = "Pipe 1 Forecast")
Although we are able to forecast the data, the ARIMA model simply adopts the mean of the data, which suggests that trying to forecast this series is not all that useful.
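A Ljung-Box test gives a more formal check on that reading (a sketch; lag 24 covers one day of hourly data):
## A large p-value would indicate the flows are indistinguishable
## from white noise at daily lags
pipe1_hourly %>% features(avg_hourly_flow, ljung_box, lag = 24)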
pipe2_hourly %>% autoplot(avg_hourly_flow) + labs(title = "Pipe 2 Time Series")
paste("There are:", sum(is.na(fill_gaps(pipe2_hourly))), "total missing values")
## [1] "There are: 0 total missing values"
Our second pipe dataset doesn’t have any missing values, so nothing will need to be imputed
pipe2_hourly %>% gg_tsdisplay(avg_hourly_flow, plot_type = "partial")
There does appear to be some autocorrelation in pipe 2. We can check whether differencing should be applied using the unit root checks:
pipe2_hourly %>% features(avg_hourly_flow, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
pipe2_hourly %>% features(avg_hourly_flow, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
No differencing is recommended.
pipe2_hourly %>% model(STL(avg_hourly_flow ~ season(window = "periodic"), robust = TRUE))%>%
components() %>%
autoplot() + labs(title = "Pipe 2 STL Decomposition")
After decomposing the data, it looks like there may be daily and weekly seasonal components. Now we can apply a Box-Cox transformation to the data:
lambda <- pipe2_hourly %>%
features(avg_hourly_flow, features = guerrero) %>%
pull(lambda_guerrero)
pipe2_fit <- pipe2_hourly %>%
model(mean = MEAN(avg_hourly_flow),
naive = NAIVE(avg_hourly_flow),
naive_box = NAIVE(box_cox(avg_hourly_flow, lambda)),
ARIMA = ARIMA(box_cox(avg_hourly_flow, lambda)),
mean_box = MEAN(box_cox(avg_hourly_flow, lambda)))
glance(pipe2_fit) %>% arrange(AICc) %>% select(.model:BIC)
## # A tibble: 5 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 172. -3992. 7990. 7990. 8005.
## 2 mean 258. NA NA NA NA
## 3 naive 528. NA NA NA NA
## 4 naive_box 355. NA NA NA NA
## 5 mean_box 173. NA NA NA NA
accuracy(pipe2_fit) %>% select(.model, RMSE) %>% arrange(RMSE)
## # A tibble: 5 × 2
## .model RMSE
## <chr> <dbl>
## 1 ARIMA 16.0
## 2 mean 16.0
## 3 mean_box 16.0
## 4 naive 23.0
## 5 naive_box 23.0
pipe2_hourly %>% model(ARIMA(box_cox(avg_hourly_flow, lambda))) %>% report()
## Series: avg_hourly_flow
## Model: ARIMA(0,0,0)(0,0,1)[24] w/ mean
## Transformation: box_cox(avg_hourly_flow, lambda)
##
## Coefficients:
## sma1 constant
## 0.0844 32.9694
## s.e. 0.0314 0.4486
##
## sigma^2 estimated as 172.1: log likelihood=-3992.1
## AIC=7990.2 AICc=7990.23 BIC=8004.93
Once again, the ARIMA model is about the same as simply taking the mean of the untransformed data, meaning the forecasts for this data are likely unreliable. The chosen model is an ARIMA(0,0,0)(0,0,1)[24] with a mean.
pipe2_fc <- pipe2_fit %>% forecast(h = "1 week") %>% filter(.model == "ARIMA")
pipe2_fc %>% autoplot(pipe2_hourly) + labs(title = "Pipe 2 Forecast")
Export the data:
pipe1_export<- pipe1_fc %>%
as.data.frame() %>%
select(time, .mean) %>%
rename(avg_hourly_flow = .mean)
pipe2_export<- pipe2_fc %>%
as.data.frame() %>%
select(time, .mean) %>%
rename(avg_hourly_flow = .mean)
## Write to xlsx
write_xlsx(list("Pipe1" = pipe1_export, "Pipe2" = pipe2_export), path = "project1_pipes_steve_phillips.xlsx")