Project 1

Part A

Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.

  • Data ends on 4/30/2010, and May 2010 has 31 days, so the forecast horizon should be 31

  • Because it asks to predict how much is taken out of 4 different ATMs, I plan to make a model for each. As I continue coding, we will likely see different patterns for each ATM.

# load libraries 
library(readr)
library(readxl)
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble      3.2.1     ✔ tsibble     1.1.5
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.1
## ✔ lubridate   1.9.3     ✔ fable       0.4.0
## ✔ ggplot2     3.5.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
library(ggplot2)
library(zoo)
## 
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
## 
##     index
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tsoutliers)
## 
## Attaching package: 'tsoutliers'
## The following object is masked from 'package:fabletools':
## 
##     outliers
library(patchwork)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(urca)
library(rsample)

Data Exploration and Cleaning

First, I need to verify the structure is appropriately saved before any forecasting work. Next I will check for missing data and address it, if needed. Then I will look at the distribution of the data.

atm <- read_xlsx("ATM624Data.xlsx")

atm$DATE <- as.Date(atm$DATE, origin = "1899-12-30")
str(atm)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
##  $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
##  $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
summary(atm)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1474        Min.   :    0.0  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
##  Median :2009-11-01   Mode  :character   Median :   73.0  
##  Mean   :2009-10-31                      Mean   :  155.6  
##  3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
##  Max.   :2010-05-14                      Max.   :10919.8  
##                                          NA's   :19

The spread() function is used to reshape the data so each ATM has its own column. I used box plots to confirm the ATMs have different patterns and should be modeled separately. This is also suggested by the different scales, and we can see ATM4 has a large outlier, probably the max value from above, 10919.8. The box plots also show that ATM3 appears to have very little cash taken out, so I will look deeper at this. Note: atm_wide is only used for these box plots.

atm_wide <- atm %>% spread(ATM, Cash)
# Box plots, one per ATM column

par(mfrow=c(1,4))
for (i in 2:5){
    boxplot(atm_wide[i], 
            main = sprintf("%s", names(atm_wide)[i]))
}

With 19 values missing, I want to check which variables they fall under. Some of the NAs are in the ATM column, on dates in May 2010 (the forecast period), so it makes sense to remove those rows. Although the proportion of remaining NAs, now in the Cash values, is small in a dataset of 1,474 rows, these points matter for continuity in a time series. Interpolation is designed for ordered data, so I will use it to fill the missing Cash values for ATM1 (3 values) and ATM2 (2 values).

missing_rows <- atm[rowSums(is.na(atm)) > 0, ]
print(missing_rows)
## # A tibble: 19 × 3
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2009-06-13 ATM1     NA
##  2 2009-06-16 ATM1     NA
##  3 2009-06-18 ATM2     NA
##  4 2009-06-22 ATM1     NA
##  5 2009-06-24 ATM2     NA
##  6 2010-05-01 <NA>     NA
##  7 2010-05-02 <NA>     NA
##  8 2010-05-03 <NA>     NA
##  9 2010-05-04 <NA>     NA
## 10 2010-05-05 <NA>     NA
## 11 2010-05-06 <NA>     NA
## 12 2010-05-07 <NA>     NA
## 13 2010-05-08 <NA>     NA
## 14 2010-05-09 <NA>     NA
## 15 2010-05-10 <NA>     NA
## 16 2010-05-11 <NA>     NA
## 17 2010-05-12 <NA>     NA
## 18 2010-05-13 <NA>     NA
## 19 2010-05-14 <NA>     NA
atm <- atm %>% filter(!is.na(ATM)) #remove NA in ATM

# Check again only NA remaining are in Cash column
missing_rows_after_filter <- atm[rowSums(is.na(atm)) > 0, ]
print(missing_rows_after_filter)
## # A tibble: 5 × 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-06-13 ATM1     NA
## 2 2009-06-16 ATM1     NA
## 3 2009-06-18 ATM2     NA
## 4 2009-06-22 ATM1     NA
## 5 2009-06-24 ATM2     NA
#Interpolation
atm1_data <- atm %>% filter(ATM == "ATM1")
atm1_data$Cash <- zoo::na.approx(atm1_data$Cash, na.rm = FALSE) #na is retained

atm2_data <- atm %>% filter(ATM == "ATM2")
atm2_data$Cash <- zoo::na.approx(atm2_data$Cash, na.rm = FALSE)

imputed_data <- bind_rows(atm1_data, atm2_data)

other_atm_data <- atm %>% filter(!ATM %in% c("ATM1", "ATM2"))
atm <- bind_rows(other_atm_data, imputed_data) %>%
  arrange(DATE, ATM) 

atm_tsibble <- atm %>%
  as_tsibble(index = DATE, key = ATM)

# Check
sum(is.na(atm_tsibble$Cash))
## [1] 0

I then convert the data to a tsibble so I can visualize each series over time. The autoplot confirms that ATM4’s outlier looks like a spike, and that ATM3 only has data for a few recent time points.

atm_tsibble %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Before Outlier Removal", x = "Date", y = "Cash (in hundreds)")

I have to consider whether the outlier in ATM4 is a real-world event or an error. It is possible some event occurred during this time. However, the value of 10919.8 is in hundreds of dollars: was it really possible for $1,091,980 to be withdrawn from one ATM on one day? It is best to address this outlier.

atm4_data <- atm %>% filter(ATM == "ATM4")
atm4_tsib <- ts(atm4_data$Cash, frequency = 365)  

# Identify outliers using tsoutliers 
outlier_results <- tso(atm4_tsib )
## Warning in locate.outliers.iloop(resid = resid, pars = pars, cval = cval, :
## stopped when 'maxit.iloop' was reached
## Warning in locate.outliers.oloop(y = y, fit = fit, types = types, cval = cval,
## : stopped when 'maxit.oloop = 4' was reached
print(outlier_results)
## Series: atm4_tsib 
## Regression with ARIMA(0,0,0) errors 
## 
## Coefficients:
##       intercept       AO285
##        445.3463  10474.4153
## s.e.    18.3671    350.9031
## 
## sigma^2 = 123472:  log likelihood = -2656.5
## AIC=5319   AICc=5319.06   BIC=5330.7
## 
## Outliers:
##   type ind  time coefhat tstat
## 1   AO 285 1:285   10474 29.85
outlier_index <- 285
original_index <- which(atm_tsibble$ATM == "ATM4")[outlier_index] 

replacement_value <- median(atm4_data$Cash[-outlier_index], na.rm = TRUE)
atm4_data$Cash[outlier_index] <- replacement_value

# Replace the value in the original dataset
atm_tsibble$Cash[original_index] <- replacement_value

print(atm_tsibble[original_index, ])
## # A tsibble: 1 x 3 [1D]
## # Key:       ATM [1]
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2010-02-09 ATM4   403.
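As an aside, tso() also returns an outlier-adjusted series in its yadj component; a sketch of that alternative (not what I ran above):

# Alternative sketch: use tsoutliers' own adjustment instead of a manual
# median replacement
# atm4_adjusted <- outlier_results$yadj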

Next, I want to look at ATM3 in more depth. There were only three days where the withdrawal was non-zero, and on those same dates ATM1 has the same withdrawal amounts: 96 on 2010-04-28, 82 on 2010-04-29, and 85 on 2010-04-30. Since ATM3 has very limited data, I will remove it and use ATM1 to guide the forecast amount needed for ATM3 later on.

atm3_dates <- atm_tsibble %>%
  filter(ATM == "ATM3" & Cash > 0) %>%  # where cash values greater than 0 to filter out other rows
  select(DATE) %>%
  distinct()

# Get all ATM values for the same dates
all_atm_values <- atm_tsibble %>%
  filter(DATE %in% atm3_dates$DATE) 
print(all_atm_values)
## # A tsibble: 12 x 3 [1D]
## # Key:       ATM [4]
##    DATE       ATM    Cash
##    <date>     <chr> <dbl>
##  1 2010-04-28 ATM1   96  
##  2 2010-04-29 ATM1   82  
##  3 2010-04-30 ATM1   85  
##  4 2010-04-28 ATM2  107  
##  5 2010-04-29 ATM2   89  
##  6 2010-04-30 ATM2   90  
##  7 2010-04-28 ATM3   96  
##  8 2010-04-29 ATM3   82  
##  9 2010-04-30 ATM3   85  
## 10 2010-04-28 ATM4  348. 
## 11 2010-04-29 ATM4   44.2
## 12 2010-04-30 ATM4  482.
# Remove ATM3
atm_tsibble <- atm_tsibble %>%
  filter(ATM != "ATM3")

Finally, before I can move into modelling, I want to check whether any dates are missing from the data. Great, this is all set.

full_dates <- seq(min(atm_tsibble$DATE), max(atm_tsibble$DATE), by = "day")
missing_dates <- setdiff(full_dates, atm_tsibble$DATE)
if (length(missing_dates) == 0) {
  print("All days are present in the data.")
} else {
  print("Missing dates:")
  print(missing_dates)
}
## [1] "All days are present in the data."

The autoplots look a lot better than initially, now that the outlier is removed and the missing values are accounted for. Next I will decompose each ATM’s series to better understand whether further transformation is needed and what type of modelling might be best.

atm_tsibble %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Before Outlier Removal", x = "Date", y = "Cash (in hundreds)")

The components of all three show a weekly seasonal pattern. They all show a changing trend, indicating the series are not stationary. The remainders for ATM1 and ATM2 seem random up to a point, then there is more variation; the ATM4 remainder has a lot of variation.

sum(is.na(atm_tsibble$Cash))
## [1] 0

Transformation

atm1_ts <- as_tsibble(atm_tsibble %>% filter(ATM == "ATM1") , index = DATE)
atm1_stl <- atm1_ts %>%  model(STL(Cash ~ season(window = "periodic")))
atm1_components <- atm1_stl %>%  components() 
autoplot(atm1_components)

atm2_ts <- as_tsibble(atm_tsibble %>% filter(ATM == "ATM2") , index = DATE)
atm2_stl <- atm2_ts |>  model(STL(Cash ~ season(window = "periodic")))
atm2_components <- atm2_stl %>%  components() 
autoplot(atm2_components)

atm4_ts <- as_tsibble(atm_tsibble %>% filter(ATM == "ATM4") , index = DATE)
atm4_stl <- atm4_ts |>  model(STL(Cash ~ season(window = "periodic")))
atm4_components <- atm4_stl %>%  components() 
autoplot(atm4_components)

I transform the series to stabilize the variance and bring them closer to stationarity; this is not done automatically when modelling. Next, I use the gg_tsdisplay() function to view the transformed series with their ACF and PACF graphs. ATM1, ATM2 and ATM4 look like they fluctuate around a constant mean, with no trend or volatility (there is a bit more fluctuation for ATM2 and ATM4). The ACF for ATM1, ATM2 and ATM4 shows similar results: autocorrelations drop off after a few lags, with some significant spikes around lags 7 and 21. The PACF shows a significant spike around lag 7 for all three, which points to weekly seasonality; the constant mean with no trend is what suggests the series are stationary.

lambda1 <- atm_tsibble %>%
  filter(ATM == "ATM1") %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm1_tsibble <- atm_tsibble %>%
  filter(ATM == "ATM1") %>%
  mutate(Cash = box_cox(Cash + 2, lambda1))  

gg_display <- atm1_tsibble %>%
  gg_tsdisplay(Cash, plot_type = "partial") +  
  labs(title = "ATM1: Transformed Cash ", y = "Transformed Cash")
print(gg_display)  

lambda2 <- atm_tsibble %>%
  filter(ATM == "ATM2") %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm2_tsibble <- atm_tsibble %>%
  filter(ATM == "ATM2") %>%
  mutate(Cash = box_cox(Cash + 2, lambda2))

gg_display_atm2 <- atm2_tsibble %>%
  gg_tsdisplay(Cash, plot_type = "partial") +
  labs(title = "ATM2: Transformed Cash", y = "Transformed Cash")

print(gg_display_atm2)

lambda4 <- atm_tsibble %>%
  filter(ATM == "ATM4") %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm4_tsibble <- atm_tsibble %>%
  filter(ATM == "ATM4") %>%
  mutate(Cash = box_cox(Cash + 2, lambda4))

gg_display_atm4 <- atm4_tsibble %>%
  gg_tsdisplay(Cash, plot_type = "partial") +
  labs(title = "ATM4: Transformed Cash", y = "Transformed Cash")
print(gg_display_atm4)
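Since the same lambda/box_cox/gg_tsdisplay block is repeated for each ATM, a small helper could reduce the duplication. A sketch (transform_atm is my own hypothetical name; the +2 offset matches what I used above):

transform_atm <- function(tsib, atm_id, offset = 2) {
  sub <- tsib %>% filter(ATM == atm_id)
  lambda <- sub %>%
    features(Cash, features = guerrero) %>%
    pull(lambda_guerrero)
  # return the lambda (needed later to invert) and the transformed data
  list(lambda = lambda,
       data = sub %>% mutate(Cash = box_cox(Cash + offset, lambda)))
}
# e.g. atm1 <- transform_atm(atm_tsibble, "ATM1")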

Modelling

Now I finally model the data. I selected ETS, which captures error, trend, and seasonality; ARIMA with a seasonal term, since it is versatile and handles seasonality; and SNAIVE, which is also a good baseline for seasonal data. Remember the spike at lag 7 indicates weekly seasonality, so I want to make sure I account for it.

fit_compare <- function(atm_tsibble) {
  ets_model <- atm_tsibble %>% model(ETS(Cash))
  arima_model <- atm_tsibble  %>% model(ARIMA(Cash ~ season(), stepwise = FALSE)) # season() adds weekly seasonal dummies as regressors
  snaive_model <- atm_tsibble %>% model(SNAIVE(Cash))
  models <- list(ets = ets_model, arima = arima_model, snaive = snaive_model)
  accuracy_table <- purrr::map_df(models, accuracy, .id = "model")
  return(accuracy_table)
}
combined_results <- rbind(
  mutate(fit_compare(atm1_tsibble), ATM = "ATM1"),
  mutate(fit_compare(atm2_tsibble), ATM = "ATM2"),
  mutate(fit_compare(atm4_tsibble), ATM = "ATM4") 
)
print(combined_results |> select(-.type, -.model, -ME))
## # A tibble: 9 × 9
##   model  ATM    RMSE    MAE    MPE  MAPE  MASE RMSSE      ACF1
##   <chr>  <chr> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>     <dbl>
## 1 ets    ATM1   1.22  0.689  -7.04  15.1 0.884 0.849  0.102   
## 2 arima  ATM1   1.22  0.717  -7.50  15.7 0.919 0.848 -0.000428
## 3 snaive ATM1   1.44  0.779  -6.03  15.7 1     1      0.129   
## 4 ets    ATM2   8.27  5.79  -38.2   55.6 0.858 0.839  0.0296  
## 5 arima  ATM2   8.56  6.46  -51.1   68.7 0.958 0.869 -0.0265  
## 6 snaive ATM2   9.85  6.74  -29.6   53.9 1     1      0.0553  
## 7 ets    ATM4  13.0  10.3   -47.9   73.6 0.761 0.758  0.0824  
## 8 arima  ATM4  12.7  10.3   -52.0   76.0 0.762 0.743 -0.00227 
## 9 snaive ATM4  17.1  13.5   -45.3   82.5 1     1      0.0439

The RMSE values are larger for SNAIVE, so I will not select it. The ARIMA and ETS models have comparable RMSE values; ARIMA’s is slightly higher for ATM1 and ATM2. Considering that ETS is simpler and easier to explain than ARIMA, plus the fact that it has the lower RMSE for two of the three ATMs, I will move forward with forecasting using ETS. Note: the MPE and MAPE had ‘Inf’ values for ATM2 due to near-zero values; I initially tried to avoid this by adding +1 (later changed to +2) before the Box-Cox transformation, and that corrected the values.
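These accuracy measures are in-sample; a rolling-origin cross-validation would be a stronger comparison. A sketch using tsibble's stretch_tsibble() (the .init and .step values are my own assumptions; I did not run this here):

# Rolling-origin CV sketch: grow the training window a week at a time and
# score 7-day-ahead forecasts against the held-out observations
cv_results <- atm1_tsibble %>%
  stretch_tsibble(.init = 300, .step = 7) %>%
  model(ets = ETS(Cash), arima = ARIMA(Cash)) %>%
  forecast(h = 7) %>%
  accuracy(atm1_tsibble)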

Next I applied the ETS model manually to each ATM tsibble to better understand what I will be using to forecast. The selected model for ATM1 and ATM2 is ETS(A,N,A): additive error, no trend, additive seasonality. The selected model for ATM4 is ETS(M,N,A): multiplicative error, no trend, additive seasonality. I reviewed the residuals, which all appear to be white noise, and the histograms look roughly normal.

ets1_model <- atm1_tsibble %>%
  model(ETS(Cash)) 
ets2_model <- atm2_tsibble %>%
  model(ETS(Cash)) 
ets4_model <- atm4_tsibble %>%
  model(ETS(Cash))

create_combined_ets_diagnostics <- function(atm_data_list) {
  par(mfrow = c(3, 4))  # one row of four diagnostic plots per ATM

  for (atm_info in atm_data_list) {
    atm_tsibble <- atm_info$tsibble
    model_name <- atm_info$name

    ets_model <- atm_tsibble %>% model(ETS(Cash))
    print(report(ets_model))
    
    residuals <- ets_model %>% augment() %>% pull(.resid)
    
    plot(residuals, main = paste("Residuals of", model_name), ylab = "Residuals", xlab = "Index", type = "l")
    abline(h = 0, col = "red") 
    
    
    hist(residuals, main = paste("Histogram of Residuals for", model_name), xlab = "Residuals", breaks = 20)
    
    
    Acf(residuals, main = paste("ACF of Residuals for", model_name))
    Pacf(residuals, main = paste("PACF of Residuals for", model_name))
  }
  
  par(mfrow = c(1, 1)) 
}

atm_data_list <- list(
  list(tsibble = atm1_tsibble, name = "ATM1"),
  list(tsibble = atm2_tsibble, name = "ATM2"),
  list(tsibble = atm4_tsibble, name = "ATM4")
)

create_combined_ets_diagnostics(atm_data_list)
## Series: Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0001000065 
##     gamma = 0.3662404 
## 
##   Initial states:
##     l[0]      s[0]     s[-1]     s[-2]     s[-3]     s[-4]    s[-5]     s[-6]
##  7.90113 -4.179114 0.5827181 0.4285257 0.5334693 0.9818783 1.062594 0.5899285
## 
##   sigma^2:  1.535
## 
##      AIC     AICc      BIC 
## 2320.775 2321.396 2359.774 
## # A mable: 1 x 2
## # Key:     ATM [1]
##   ATM    `ETS(Cash)`
##   <chr>      <model>
## 1 ATM1  <ETS(A,N,A)>
## Series: Cash 
## Model: ETS(A,N,A) 
##   Smoothing parameters:
##     alpha = 0.0001000123 
##     gamma = 0.3821918 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  30.06249 -20.56225 -9.114558 4.425845 1.801073 6.680938 10.08389 6.685066
## 
##   sigma^2:  70.0432
## 
##      AIC     AICc      BIC 
## 3715.276 3715.897 3754.275 
## # A mable: 1 x 2
## # Key:     ATM [1]
##   ATM    `ETS(Cash)`
##   <chr>      <model>
## 1 ATM2  <ETS(A,N,A)>
## Series: Cash 
## Model: ETS(M,N,A) 
##   Smoothing parameters:
##     alpha = 0.01543598 
##     gamma = 0.1731552 
## 
##   Initial states:
##      l[0]      s[0]     s[-1]    s[-2]    s[-3]    s[-4]    s[-5]    s[-6]
##  26.87038 -19.01183 -6.330076 4.945182 3.246947 7.593668 4.364441 5.191662
## 
##   sigma^2:  0.2148
## 
##      AIC     AICc      BIC 
## 4024.047 4024.669 4063.046 
## # A mable: 1 x 2
## # Key:     ATM [1]
##   ATM    `ETS(Cash)`
##   <chr>      <model>
## 1 ATM4  <ETS(M,N,A)>

# Note: the simpler route, atm4_tsibble %>% model(ETS(Cash)) %>% gg_tsresiduals(), produced blank space for me, so I built the diagnostics manually above

Forecasting

Now I use the models to make forecasts (the plot titles include a reminder that these are still on the transformed scale). May has 31 days, so I make sure my h value is set accordingly. The forecast plots start in March so I can see two months of historical data next to the forecasted values.

# forecasts for May 2010 (31 days)

forecast1 <- ets1_model %>% forecast(h = 31)
forecast2 <- ets2_model %>% forecast(h = 31)
forecast4 <- ets4_model %>% forecast(h = 31)


p1 <- autoplot(atm1_tsibble, Cash) +
  autolayer(forecast1, level = NULL) +
  labs(title = "Cash Withdrawals for ATM1 (transformed)",
       x = "Date",
       y = "Cash Withdrawals") +
  xlim(as.Date("2010-03-01"), as.Date("2010-05-31")) +
  theme_minimal()
p2 <- autoplot(atm2_tsibble, Cash) +
  autolayer(forecast2, level = NULL) +
  labs(title = "Cash Withdrawals for ATM2 (transformed)",
       x = "Date",
       y = "Cash Withdrawals") +
  xlim(as.Date("2010-03-01"), as.Date("2010-05-31")) +
  theme_minimal()
p3 <- autoplot(atm4_tsibble, Cash) +
  autolayer(forecast4, level = NULL) +
  labs(title = "Cash Withdrawals for ATM4 (transformed)",
       x = "Date",
       y = "Cash Withdrawals") +
  xlim(as.Date("2010-03-01"), as.Date("2010-05-31")) +
  theme_minimal()
print(p1)
## Warning: Removed 304 rows containing missing values or values outside the scale range
## (`geom_line()`).

print(p2)
## Warning: Removed 304 rows containing missing values or values outside the scale range
## (`geom_line()`).

print(p3)
## Warning: Removed 304 rows containing missing values or values outside the scale range
## (`geom_line()`).

Next, I want the forecasts to be directly interpretable in the same units as the original data, which means I need to invert the transformation; I do this on the forecasted mean values.

# Reference implementation of the inverse Box-Cox; fabletools::inv_box_cox()
# does the same thing and is what I call below
inverse_box_cox <- function(y_prime, lambda) {
  if (lambda == 0) {
    return(exp(y_prime))
  } else {
    return((lambda * y_prime + 1)^(1/lambda))
  }
}
forecast1$.mean <- inv_box_cox(forecast1$.mean, lambda = lambda1)
forecast2$.mean <- inv_box_cox(forecast2$.mean, lambda = lambda2)
forecast4$.mean <- inv_box_cox(forecast4$.mean, lambda = lambda4)
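One caveat: a +2 offset was added before the Box-Cox transform, so a fully exact back-transform would subtract it again, which over 31 days would lower each total by roughly 62 (in hundreds). A sketch of the stricter version (not what the totals below use):

# Stricter back-transform (sketch): undo the +2 offset applied pre-transform
# forecast1$.mean <- inv_box_cox(forecast1$.mean, lambda = lambda1) - 2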

To find the forecasted May totals, I sum the 31 daily forecasted values, now on the original scale, for each ATM.

sum_forecast1 <- sum(forecast1$.mean, na.rm = TRUE)
sum_forecast2 <- sum(forecast2$.mean, na.rm = TRUE)
sum_forecast4 <- sum(forecast4$.mean, na.rm = TRUE)

print(paste("Total forecasted cash withdrawals for ATM1 in May 2010:", sum_forecast1))
## [1] "Total forecasted cash withdrawals for ATM1 in May 2010: 2446.96678918692"
print(paste("Total forecasted cash withdrawals for ATM2 in May 2010:", sum_forecast2))
## [1] "Total forecasted cash withdrawals for ATM2 in May 2010: 1882.04906194075"
print(paste("Total forecasted cash withdrawals for ATM4 in May 2010:", sum_forecast4))
## [1] "Total forecasted cash withdrawals for ATM4 in May 2010: 10226.8861647894"
#Sanity check to see if values are similar to previous month
april_sums <- atm %>%
  filter(DATE >= as.Date("2010-04-01") & DATE <= as.Date("2010-04-30")) %>%
  group_by(ATM) %>%
  summarise(Total_Cash = sum(Cash, na.rm = TRUE))
print(april_sums)
## # A tibble: 4 × 2
##   ATM   Total_Cash
##   <chr>      <dbl>
## 1 ATM1       2300 
## 2 ATM2       1842 
## 3 ATM3        263 
## 4 ATM4      11858.

Conclusion

Finally, we have the forecasted values for each ATM for May 2010; these sums are in **hundreds of dollars**. For the month of May 2010, the amount needed in ATM1 is about $244,697, in ATM2 about $188,205 (rounding up to make sure there is enough in the ATM), and in ATM4 about $1,022,689 (again rounding up). ATM3 might follow the amounts of ATM1, based on the limited data available, where the amounts withdrawn matched ATM1 exactly. These totals are comparable to the April 2010 withdrawals at each ATM.

library(writexl)

# Convert forecasts to data frames
forecast1_df <- as.data.frame(forecast1)
forecast2_df <- as.data.frame(forecast2)
forecast4_df <- as.data.frame(forecast4)

write_xlsx(
  list(
    "Forecast_ATM1" = forecast1_df,
    "Forecast_ATM2" = forecast2_df,
    "Forecast_ATM4" = forecast4_df
  ),
  path = "ATM_Forecasts_May2010.xlsx"
)

Part B

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

# load libraries 
library(readr)
library(readxl)
library(fpp3)
library(ggplot2)
library(zoo)
library(forecast)
library(tsoutliers)
library(patchwork)
library(gridExtra)

Loading and Cleaning Data

After reading in the data, I replaced the NA with the median value and removed CaseSequence, as it is not needed.

rcf <- read_xlsx("ResidentialCustomerForecastLoad-624.xlsx")

summary(rcf)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
rcf$KWH[is.na(rcf$KWH)] = median(rcf$KWH, na.rm = TRUE)
any(is.na(rcf))
## [1] FALSE
rcf <- rcf %>%
  select(-CaseSequence)
tail(rcf)
## # A tibble: 6 × 2
##   `YYYY-MMM`     KWH
##   <chr>        <dbl>
## 1 2013-Jul   8415321
## 2 2013-Aug   9080226
## 3 2013-Sep   7968220
## 4 2013-Oct   5759367
## 5 2013-Nov   5769083
## 6 2013-Dec   9606304

I create the rcf tsibble, making sure to convert the dates to yearmonth format so the tsibble reflects monthly data.

rcf_tsibble <- rcf %>%
  rename(Date = `YYYY-MMM`) %>%
  mutate(Date = yearmonth(Date)) %>% 
  as_tsibble(index = Date)

head(rcf_tsibble)
## # A tsibble: 6 x 2 [1M]
##       Date     KWH
##      <mth>   <dbl>
## 1 1998 Jan 6862583
## 2 1998 Feb 5838198
## 3 1998 Mar 5420658
## 4 1998 Apr 5010364
## 5 1998 May 4665377
## 6 1998 Jun 6467147

I decompose the series and extract the components to better understand them. The autoplot shows some fluctuations and one substantial dip (an outlier); there seems to be an upward trend, and there is a clear seasonal pattern with peaks in several months, one of which seems to be January.

rcf_stl <- rcf_tsibble %>%
  model(STL(KWH ~ season(window = "periodic")))

rcf_components <- rcf_stl %>% components()

autoplot(rcf_components)

I use a rolling mean to dampen the outlier and visualize the autoplot again. Note that this centered 5-point rolling mean smooths every observation, not just the outlier, and it leaves NAs at the ends of the series, which I fill with the median below.

rcf_tsibble$KWH <- rollmean(rcf_tsibble$KWH, k = 5, fill = NA, align = "center")

autoplot(rcf_tsibble, KWH) +
  labs(title = "After Addressing Outliers", x = "Time", y = "KWH") +
  theme_minimal()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

na_rows <- rcf_tsibble[is.na(rcf_tsibble$KWH), ]
print(na_rows)
## # A tsibble: 4 x 2 [1M]
##       Date   KWH
##      <mth> <dbl>
## 1 1998 Jan    NA
## 2 1998 Feb    NA
## 3 2013 Nov    NA
## 4 2013 Dec    NA
rcf_tsibble$KWH[is.na(rcf_tsibble$KWH)] <- median(rcf_tsibble$KWH, na.rm = TRUE)
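A more surgical alternative (a sketch, not what I ran above) would replace only the anomalous dip and leave every other observation untouched:

# Sketch: replace just the minimum (the 770,523 dip) instead of smoothing
# the whole series with a rolling mean
# idx <- which.min(rcf_tsibble$KWH)
# rcf_tsibble$KWH[idx] <- median(rcf_tsibble$KWH[-idx], na.rm = TRUE)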

The seasonal pattern is clear, so I use gg_season to look at it more in depth, as it may be helpful later for modelling. The plot shows a peak during the summer and a trough during the spring, and this pattern is seen across most of the years shown below.

gg_season(rcf_tsibble, KWH) +
  labs(title = "Seaonsal Plot over 12 Months", x = "Time", y = "KWH") +
  theme_minimal()

I find the optimal lambda value for the Box-Cox transformation using the Guerrero method. I apply the Box-Cox transformation to the KWH variable in rcf_tsibble and save the result as trcf_tsibble, the transformed tsibble. The gg_tsdisplay function is applied to produce the time series plot, ACF plot, and PACF plot to confirm the variance is stabilized.

lambda5 <- rcf_tsibble %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

trcf_tsibble <- rcf_tsibble %>%
  mutate(KWH = box_cox(KWH, lambda5))

gg_display_tt <- trcf_tsibble %>%
  gg_tsdisplay(KWH, plot_type = "partial")
print(gg_display_tt)

Now I will start looking for the best-fitting model; trcf_tsibble is the transformed tsibble.

ets_model <- trcf_tsibble %>%
  model(
    ets = ETS(KWH)
  )
future_forecast <- ets_model %>%
  forecast(h = "12 months")
accuracy(ets_model)
## # A tibble: 1 × 10
##   .model .type       ME    RMSE    MAE     MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>    <dbl>   <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ets    Training  972. 147136. 94839. -0.0465  2.45 0.372 0.380 0.169
autoplot(future_forecast, trcf_tsibble)

trcf_tsibble %>%
    features(KWH, unitroot_ndiffs)
## # A tibble: 1 × 1
##   ndiffs
##    <int>
## 1      1
trcf_tsibble %>%
    features(KWH, unitroot_nsdiffs)
## # A tibble: 1 × 1
##   nsdiffs
##     <int>
## 1       1

After applying one seasonal and one regular order of differencing, the plot shows the data oscillating around a mean of zero. It looks like it could be stationary, but the ACF and PACF don’t drop off as quickly as they typically do for a stationary series, so I will run formal tests to confirm.

seasonal_diff_tsibble <- trcf_tsibble %>%
  mutate(KWH_seasonal_diff = difference(KWH, lag = 12)) #seasonal

combined_diff_tsibble <- seasonal_diff_tsibble %>%
  mutate(KWH_combined_diff = difference(KWH_seasonal_diff)) #additional order of differencing

combined_ts <- ts(combined_diff_tsibble$KWH_combined_diff, frequency = 12, start = c(1998, 1))

ggtsdisplay(combined_ts)
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).

The ADF test gives a p-value below 0.01, which is less than 0.05, rejecting a unit root; the KPSS p-value of 0.1 is greater than 0.05, failing to reject stationarity. Together these indicate the differenced series is stationary.

library(tseries)

trc_clean <- na.omit(combined_ts)

check_stationarity <- function(ts) {
  adf_result <- adf.test(ts)
  kpss_result <- kpss.test(ts)

  list(
    ADF = list(statistic = adf_result$statistic, p.value = adf_result$p.value),
    KPSS = list(statistic = kpss_result$statistic, p.value = kpss_result$p.value)
  )
}
pipe_2_stationarity <- check_stationarity(trc_clean)
## Warning in adf.test(ts): p-value smaller than printed p-value
## Warning in kpss.test(ts): p-value greater than printed p-value
print(pipe_2_stationarity)
## $ADF
## $ADF$statistic
## Dickey-Fuller 
##     -6.245781 
## 
## $ADF$p.value
## [1] 0.01
## 
## 
## $KPSS
## $KPSS$statistic
## KPSS Level 
## 0.04745011 
## 
## $KPSS$p.value
## [1] 0.1

So I now know the ideal model will include one seasonal difference and one regular difference. I will go back to the transformed tsibble, trcf_tsibble (rather than the differenced values), and check the auto ARIMA model. I create the auto ARIMA model first and notice it selects a non-seasonal ARIMA(0,1,5): passing the bare KWH vector drops the monthly index (frequency becomes 1), so auto.arima cannot consider seasonal terms. I therefore also fit several seasonal variants by hand.

# passing the bare vector means auto.arima sees frequency 1 and cannot
# fit seasonal terms, hence the non-seasonal result below
model_auto <- auto.arima(trcf_tsibble$KWH, seasonal = TRUE, stepwise = FALSE)
summary(model_auto)
## Series: trcf_tsibble$KWH 
## ARIMA(0,1,5) 
## 
## Coefficients:
##          ma1      ma2      ma3      ma4      ma5
##       0.3480  -0.0396  -0.2668  -0.1393  -0.7040
## s.e.  0.1026   0.1161   0.1755   0.0500   0.1917
## 
## sigma^2 = 2.633e+10:  log likelihood = -2563.1
## AIC=5138.2   AICc=5138.66   BIC=5157.71
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE    MAPE      MASE      ACF1
## Training set 13144.51 159703.8 118652.1 0.2180583 3.01978 0.7291728 0.0397053
model_b <- Arima(trcf_tsibble$KWH, order = c(0, 1, 5), seasonal = list(order = c(1, 1, 1), period = 12))
model_c <- Arima(trcf_tsibble$KWH, order = c(0, 1, 5), seasonal = list(order = c(1, 1, 2), period = 12))
model_d <- Arima(trcf_tsibble$KWH, order = c(0, 1, 5), seasonal = list(order = c(0, 1, 1), period = 12))
model_e <- Arima(trcf_tsibble$KWH, order = c(0, 1, 5), seasonal = list(order = c(0, 1, 2), period = 12))
models2v <- list(model_auto = model_auto, model_b = model_b, model_c = model_c, model_d = model_d, model_e = model_e)

aic_values <- sapply(models2v, AIC)
print(aic_values)
## model_auto    model_b    model_c    model_d    model_e 
##   5138.201   4739.690   4741.362   4741.621   4739.408
summary(model_e)
## Series: trcf_tsibble$KWH 
## ARIMA(0,1,5)(0,1,2)[12] 
## 
## Coefficients:
##          ma1     ma2     ma3      ma4      ma5     sma1    sma2
##       0.0887  0.0102  0.1268  -0.0197  -0.9114  -0.9973  0.2060
## s.e.  0.0569  0.0675  0.0598   0.0552   0.0633   0.0929  0.0934
## 
## sigma^2 = 1.512e+10:  log likelihood = -2361.7
## AIC=4739.41   AICc=4740.26   BIC=4764.91
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 13347.66 116379.2 71519.77 0.2930489 1.825928 0.4395227 0.04993144

I attempted to cross-validate the models, specifically model_auto (auto.arima’s selection), model_e (the best fit based on having the lowest AIC), and ets_model (R’s automatic ETS selection). However, there were several errors, and model_e could not be validated: fable’s ARIMA() takes pdq()/PDQ() specials rather than the order/seasonal arguments used by forecast::Arima(), which is why the commented-out specification below fails.

splits <- initial_time_split(trcf_tsibble, prop = 0.8)  # 80% train, 20% test

evaluate_models <- function(split) {
  train_data <- training(split)
  test_data <- testing(split)
  
  ets_model <- train_data %>%
    model(ETS = ETS(KWH))
  auto_model <- train_data %>%
    model(Auto_ARIMA = ARIMA(KWH))
 
  # model_e <- train_data %>%
    #model(Model_E = ARIMA(KWH, order = c(0, 1, 5), seasonal = list(order = c(0, 1, 2), period = 12)))

  ets_forecast <- ets_model %>%
    forecast(h = nrow(test_data))
  
  auto_forecast <- auto_model %>%
    forecast(h = nrow(test_data))

  
  ets_accuracy <- accuracy(ets_forecast, test_data)
  auto_accuracy <- accuracy(auto_forecast, test_data)
  
  return(list(ets_accuracy = ets_accuracy, auto_accuracy = auto_accuracy))
}

results <- evaluate_models(splits)

print(results$ets_accuracy)
## # A tibble: 1 × 10
##   .model .type       ME     RMSE      MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>    <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS    Test  1335746. 1353881. 1335746.  29.7  29.7   NaN   NaN 0.634
print(results$auto_accuracy)
## # A tibble: 1 × 10
##   .model     .type      ME    RMSE     MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>      <chr>   <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Auto_ARIMA Test  638614. 727569. 641524.  14.0  14.0   NaN   NaN 0.713

The best model is model_e, ARIMA(0,1,5)(0,1,2)[12], with an AIC of 4739.41, the lowest among the candidate ARIMA models and well below the auto.arima model’s 5138.20; the train/test comparison above also favored ARIMA over ETS on RMSE. I will continue with this model, check its residuals, and forecast with it.

The Ljung-Box test shows a p-value of 0.196, which indicates no significant autocorrelation in the residuals; they behave like white noise. gg_tsresiduals errors on this output as well, but based on the residual plot and the histogram, white-noise residuals are confirmed.

model_e <- Arima(trcf_tsibble$KWH, order = c(0, 1, 5), seasonal = list(order = c(0, 1, 2), period = 12))

checkresiduals(model_e)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,5)(0,1,2)[12]
## Q* = 4.6908, df = 3, p-value = 0.1959
## 
## Model df: 7.   Total lags used: 10
# gg_tsresiduals(model_e) errors because it expects a fable mable rather than a forecast::Arima object, so I removed it; checkresiduals() above already shows white noise

The blue forecast comes up blank because the model fails to fit: PDQ(0, 1, 2)[12] is not valid fable syntax, so Model_E is a null model and its forecasts and accuracy come out NaN (see the corrected sketch after this chunk).

train_size <- floor(0.7 * nrow(trcf_tsibble))
train_data <- trcf_tsibble %>% slice(1:train_size)
test_data <- trcf_tsibble %>% slice((train_size + 1):nrow(trcf_tsibble))

model_e <- train_data %>%
  model(
    Model_E = ARIMA(KWH ~ pdq(0, 1, 5) + PDQ(0, 1, 2)[12])
  )
## Warning: 1 error encountered for Model_E
## [1] invalid type (list) for variable 'PDQ(0, 1, 2)[12]'
model_e_forecast <- model_e %>%
  forecast(h = nrow(test_data)) 

model_e_accuracy <- accuracy(model_e_forecast, test_data)

print(model_e_accuracy)
## # A tibble: 1 × 10
##   .model  .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Model_E Test    NaN   NaN   NaN   NaN   NaN   NaN   NaN    NA
autoplot(model_e_forecast, color = "blue") +
  autolayer(test_data, KWH, color = "red", series = "Actual") +
  labs(title = "Model E Forecast vs Actual", y = "KWH") +
  theme_minimal()
## Warning in geom_line(eval_tidy(expr(aes(!!!aes_spec))), data = object, ..., :
## Ignoring unknown parameters: `series`
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning: Removed 58 rows containing missing values or values outside the scale range
## (`geom_line()`).
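For reference, fable expects the seasonal period as an argument of PDQ() rather than in brackets after it; a sketch of the corrected specification (I have not re-run the comparison with it):

# Corrected fable syntax (sketch): the period belongs inside PDQ()
# model_e <- train_data %>%
#   model(Model_E = ARIMA(KWH ~ pdq(0, 1, 5) + PDQ(0, 1, 2, period = 12)))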

**This code chunk causes an error and model_e becomes null: the order and seasonal arguments belong to forecast::Arima(), not fable's ARIMA(), so I will move on.**

model_e <- train_data %>% model(model_e = ARIMA(KWH, order = c(0, 1, 5), seasonal = list(order = c(0, 1, 2), period = 12)))
model_e_forecast <- model_e %>% forecast(h = "12 months")
model_e_accuracy <- accuracy(model_e_forecast, test_data)
autoplot(model_e_forecast) + autolayer(trcf_tsibble, KWH, series = "Actual") + labs(title = "Model E Forecast", y = "KWH") + theme_minimal()

Something still seems wrong with the forecasts, so I plot them to inspect…

evaluate_models <- function(split) {
  train_data <- training(split)
  
  auto_model <- train_data %>%
    model(Auto_ARIMA = ARIMA(KWH))
  
  ets_model <- train_data %>%
    model(ETS = ETS(KWH))
  
  auto_forecast <- auto_model %>%
    forecast(h = 12) %>%
    mutate(model = "Auto ARIMA")  

  ets_forecast <- ets_model %>%
    forecast(h = 12) %>%
    mutate(model = "ETS")  
  
  combined_forecast <- bind_rows(auto_forecast, ets_forecast)

  return(list(auto_forecast = auto_forecast, ets_forecast = ets_forecast, combined_forecast = combined_forecast))
}

results <- evaluate_models(splits)

autoplot(rcf_tsibble, KWH) +
  autolayer(results$auto_forecast, KWH, color = "blue", series = "Auto ARIMA") +
  autolayer(results$ets_forecast, KWH, color = "purple", series = "ETS") +
  labs(title = "Forecasts from Auto ARIMA and ETS Models",
       y = "KWH",
       x = "Date") +
  theme_minimal() +
  facet_wrap(~model, scales = "free_y")
## Warning in ggdist::geom_lineribbon(without(intvl_mapping, "colour_ramp"), :
## Ignoring unknown parameters: `series`
## Warning in geom_line(mapping = without(mapping, "shape"), data =
## unpack_data(object[single_row[["FALSE"]], : Ignoring unknown parameters:
## `series`
## Warning in ggdist::geom_lineribbon(without(intvl_mapping, "colour_ramp"), :
## Ignoring unknown parameters: `series`
## Warning in geom_line(mapping = without(mapping, "shape"), data =
## unpack_data(object[single_row[["FALSE"]], : Ignoring unknown parameters:
## `series`
## Scale for fill_ramp is already present.
## Adding another scale for fill_ramp, which will replace the existing scale.

evaluate_models <- function(data, new_data) {
  auto_model <- data %>%
    model(Auto_ARIMA = ARIMA(KWH))
  
  ets_model <- data %>%
    model(ETS = ETS(KWH))
  
  auto_forecast <- auto_model %>%
    forecast(new_data = new_data) %>%
    mutate(model = "Auto ARIMA")
  
  ets_forecast <- ets_model %>%
    forecast(new_data = new_data) %>%
    mutate(model = "ETS")
  
  combined_forecast <- bind_rows(auto_forecast, ets_forecast)
  
  return(combined_forecast)
}

forecast_period <- new_data(rcf_tsibble, n = 24) %>%
  mutate(index = yearmonth("2014 Jan") + 0:23) 

results <- evaluate_models(rcf_tsibble, forecast_period)

results %>%
  autoplot(rcf_tsibble, level = NULL) +
  labs(title = "Forecasts from Auto ARIMA and ETS Models",
       y = "KWH",
       x = "Date") +
  theme_minimal() +
  facet_wrap(~ .model, scales = "free_y") 

The auto ARIMA model had the lower RMSE value (I admit neither of these models is the ‘best’), and in comparison the visual also suggests picking ARIMA. I try again and create a forecast for 2014 (12 months): I fit the ARIMA model and forecast, then try to apply the inverse Box-Cox transformation to the forecasted values using lambda5, but receive an error because the output becomes null. However, the environment shows there are values for this forecast, a photo of which I imported below. The values in the original dataset are in the millions, so maybe this isn’t completely off (although I suspect it is not correct either).

forecast_period <- new_data(trcf_tsibble, n = 12) %>%
  mutate(index = yearmonth("2014 Jan") + 0:11)

evaluate_arima <- function(data, new_data) {
  auto_model <- data %>%
    model(Auto_ARIMA = ARIMA(KWH))
  auto_forecast <- auto_model %>%
    forecast(new_data = new_data) %>%
    mutate(model = "Auto ARIMA")
  
  return(auto_forecast)
}

results <- evaluate_arima(trcf_tsibble, forecast_period)

results <- results %>%
  mutate(predicted_KWH = if (lambda5 == 0) {
                             exp(.mean)          
                           } else {
                             inv_box_cox(.mean, lambda5)  
                           })
# ---- error when trying to display the forecasted KWH values for each month in 2014
knitr::include_graphics("arima results.png")
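For completeness, a sketch of how the monthly 2014 values could be listed once the mutate above succeeds (assuming the index column is still named Date, as earlier in this report):

# Display sketch: drop the distribution column and keep the back-transformed
# point forecasts
# results %>% as_tibble() %>% select(Date, predicted_KWH)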

Part C

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two datasets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and .rmd, and the forecast in an Excel-readable file.

Load the data and convert the Excel serial date numbers to POSIXct date-times (days scaled to seconds).

pipe_1 <- read_xlsx("Waterflow_Pipe1.xlsx") %>% mutate(`Date Time` = as.POSIXct(`Date Time` * (60 * 60 * 24), origin = "1899-12-30", tz = "GMT"))

pipe_2 <- read_xlsx("Waterflow_Pipe2.xlsx")  %>% 
  mutate(`Date Time` = as.POSIXct(`Date Time` * (60 * 60 * 24), origin = "1899-12-30", tz = "GMT"))

Visualize pipe 1

ggplot(pipe_1, aes(`Date Time`, WaterFlow)) +
  geom_line(color = "blue") +
  ggtitle("Water Flow for Pipe 1") +
  xlab("Date Time") + 
  ylab("Water Flow")

Data cleaning on pipe 1, aggregating multiple entries within the same hour to their mean.

pipe_1_clean <- pipe_1 %>% 
  mutate(Date = as.Date(`Date Time`), 
         Time = format(`Date Time`, "%H:00:00")) %>% 
  group_by(Date, Time) %>% 
  summarise(WaterFlow = mean(WaterFlow), .groups = "drop") %>% 
  ungroup() %>% 
  mutate(`Date Time` = as.POSIXct(paste(Date, Time))) %>% 
  select(`Date Time`, WaterFlow)
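An equivalent, slightly more direct aggregation (a sketch) would use lubridate's floor_date() to truncate each timestamp to the hour:

# Sketch: floor each timestamp to the hour, then average within the hour
# pipe_1_clean <- pipe_1 %>%
#   mutate(`Date Time` = floor_date(`Date Time`, "hour")) %>%
#   group_by(`Date Time`) %>%
#   summarise(WaterFlow = mean(WaterFlow), .groups = "drop")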

Create a time series and view the STL decomposition

pipe_1_ts <- ts(pipe_1_clean$WaterFlow, start = c(2015, 10, 23), frequency = 24)
pipe_1_decomp <- stl(pipe_1_ts, s.window = "periodic")

plot(pipe_1_decomp)

It looks stationary as is, but I double-check using the ADF and KPSS tests. The ADF p-value is less than .05, indicating stationarity; the KPSS p-value is more than .05, which also indicates stationarity.

check_stationarity <- function(ts) {
  adf_result <- adf.test(ts)
  kpss_result <- kpss.test(ts)

  list(
    ADF = list(statistic = adf_result$statistic, p.value = adf_result$p.value),
    KPSS = list(statistic = kpss_result$statistic, p.value = kpss_result$p.value)
  )
}
pipe_1_stationarity <- check_stationarity(pipe_1_ts)
## Warning in adf.test(ts): p-value smaller than printed p-value
## Warning in kpss.test(ts): p-value greater than printed p-value
print(pipe_1_stationarity)
## $ADF
## $ADF$statistic
## Dickey-Fuller 
##     -6.264898 
## 
## $ADF$p.value
## [1] 0.01
## 
## 
## $KPSS
## $KPSS$statistic
## KPSS Level 
##  0.2465587 
## 
## $KPSS$p.value
## [1] 0.1
acf_plot <- ggAcf(pipe_1_ts) + ggtitle("ACF") + theme_minimal()
pacf_plot <- ggPacf(pipe_1_ts) + ggtitle("PACF") + theme_minimal()
combined_plot <- acf_plot / pacf_plot 
print(combined_plot)

Plot the aggregated hourly series for pipe 1.

autoplot(pipe_1_ts)

Visualize pipe 2

ggplot(pipe_2, aes(`Date Time`, WaterFlow)) +
  geom_line(color = "blue") +
  ggtitle("Water Flow for Pipe 2") +
  xlab("Date Time") + 
  ylab("Water Flow")

Data cleaning on pipe 2

pipe_2_clean <- pipe_2 %>% 
  mutate(Date = as.Date(`Date Time`), 
         Time = format(`Date Time`, "%H:00:00")) %>% 
  group_by(Date, Time) %>% 
  summarise(WaterFlow = mean(WaterFlow), .groups = "drop") %>% 
  ungroup() %>% 
  mutate(`Date Time` = as.POSIXct(paste(Date, Time))) %>% 
  select(`Date Time`, WaterFlow)

Create a time series and view the STL decomposition

pipe_2_ts <- ts(pipe_2_clean$WaterFlow, start = c(2015, 10, 23), frequency = 24)

pipe_2_decomp <- stl(pipe_2_ts, s.window = "periodic")

# Plot the decomposition
plot(pipe_2_decomp)

It looks stationary as is, but I double-check using the ADF and KPSS tests. The ADF p-value is less than .05, indicating stationarity; the KPSS p-value (0.068) is more than .05, which also indicates stationarity.

# reuse check_stationarity() defined for pipe 1 above
pipe_2_stationarity <- check_stationarity(pipe_2_ts)
## Warning in adf.test(ts): p-value smaller than printed p-value
print(pipe_2_stationarity)
## $ADF
## $ADF$statistic
## Dickey-Fuller 
##     -8.658365 
## 
## $ADF$p.value
## [1] 0.01
## 
## 
## $KPSS
## $KPSS$statistic
## KPSS Level 
##  0.4204869 
## 
## $KPSS$p.value
## [1] 0.06832461
acf_plot <- ggAcf(pipe_2_ts) + ggtitle("ACF") + theme_minimal()
pacf_plot <- ggPacf(pipe_2_ts) + ggtitle("PACF") + theme_minimal()
combined_plot <- acf_plot / pacf_plot 
print(combined_plot)

Plot the aggregated hourly series for pipe 2.

autoplot(pipe_2_ts)

Conclusion

Both pipe 1 and pipe 2 are stationary. However, for the task of forecasting the next 7 days (168 hours), I was unable to complete the forecasting, as the plots did not show the correct data.

The code below produces this error for both ARIMA(WaterFlow) and ETS(WaterFlow): “.data contains implicit gaps in time. You should check your data and convert implicit gaps into explicit missing values using tsibble::fill_gaps() if required.” It also warns repeatedly: “Ignoring unknown parameters: series and PI”.

I don’t have time to attempt to correct it, but a sketch of a likely fix follows the code.

arima_model <- pipe_1_tsibble %>% model(ARIMA(WaterFlow))

ets_model <- pipe_1_tsibble %>% model(ETS(WaterFlow))

arima_forecast <- arima_model %>% forecast(h = "168 hours") # 7 days

ets_forecast <- ets_model %>% forecast(h = "168 hours")

autoplot(pipe_1_tsibble, WaterFlow) + autolayer(arima_forecast, series = "ARIMA Forecast", PI = TRUE, color = "blue") + ggtitle("Auto ARIMA Forecast for Water Flow") + xlab("Date Time") + ylab("Water Flow")

autoplot(pipe_1_tsibble, WaterFlow) + autolayer(ets_forecast, series = "ETS Forecast", PI = TRUE, color = "red") + ggtitle("Auto ETS Forecast for Water Flow") + xlab("Date Time") + ylab("Water Flow")
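A sketch of that likely fix, following the fill_gaps() hint in the error (assuming linear interpolation of the filled hours is acceptable; I have not verified the resulting forecasts):

# Build the tsibble, make implicit hourly gaps explicit, interpolate them,
# then fit and forecast 168 hours (7 days) ahead
pipe_1_tsibble <- pipe_1_clean %>%
  as_tsibble(index = `Date Time`) %>%
  fill_gaps() %>%
  mutate(WaterFlow = zoo::na.approx(WaterFlow, na.rm = FALSE))

arima_model <- pipe_1_tsibble %>% model(ARIMA(WaterFlow))
arima_forecast <- arima_model %>% forecast(h = "168 hours")

autoplot(arima_forecast, pipe_1_tsibble) +
  ggtitle("Auto ARIMA Forecast for Water Flow, Pipe 1") +
  xlab("Date Time") + ylab("Water Flow")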