Part A – ATM Forecast, ATM624Data.xlsx

In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.

# Load libraries
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.4.3

## Warning: package 'ggplot2' was built under R version 4.4.2

## Warning: package 'dplyr' was built under R version 4.4.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(fpp3)

## Warning: package 'fpp3' was built under R version 4.4.3

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──
## ✔ tsibble     1.1.6     ✔ feasts      0.4.2
## ✔ tsibbledata 0.4.1     ✔ fable       0.5.0

## Warning: package 'tsibble' was built under R version 4.4.3

## Warning: package 'tsibbledata' was built under R version 4.4.3

## Warning: package 'feasts' was built under R version 4.4.3

## Warning: package 'fabletools' was built under R version 4.4.3

## Warning: package 'fable' was built under R version 4.4.3

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

library(janitor)

## Warning: package 'janitor' was built under R version 4.4.2

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(openxlsx)

## Warning: package 'openxlsx' was built under R version 4.4.2

library(readxl)

## Warning: package 'readxl' was built under R version 4.4.3

# Load dataset
atm <- read_excel("ATM624Data.xlsx")

# Review data
glimpse(atm)

## Rows: 1,474
## Columns: 3
## $ DATE <dbl> 39934, 39934, 39935, 39935, 39936, 39936, 39937, 39937, 39938, 39…
## $ ATM  <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
## $ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …

class(atm)

## [1] "tbl_df"     "tbl"        "data.frame"

The data looks relatively clean but the date format should be converted and data frame should be converted to tsibble

# Clean data 

atm <- atm |>
  mutate(DATE = excel_numeric_to_date(DATE)) |>
  as_tsibble(key = "ATM", index = "DATE")

ATM 1

Exploratory data anlaysis

# Plot data 
atm1 <- atm |>
  filter(ATM == "ATM1") 
atm1 |> autoplot(Cash)

The data seems to have seasonal pattern with a relatively consistent trend.

# ACF plot
atm1 |> 
  ACF(Cash) |>
  autoplot()

The ACF plot shows peaks going outside of the blue dashed lines with especially prominent peaks every 7 days at 7, 14, and 21 which confirms the weekly seasonality of the data.

The missing data in the dataset should be replaced with median value of the data.

# Replace missing data with median 
atm1$Cash[is.na(atm1$Cash)] <- median(atm1$Cash, na.rm=TRUE)

Box-cox transformation

A box-cox transformation was attempted to try to smooth the data due to the variation in data spikes in the plot.

# Box cox transformation
# lambda
lambda_atm1 <- atm1 |> 
  features(Cash, features=guerrero) |>
  pull(lambda_guerrero)

atm1_bc <- atm1 |>
  mutate(bc=box_cox(Cash, lambda_atm1)) |>
  select(DATE, bc)

atm1_bc |> 
  autoplot(bc)

The box-cox transformation does not appear to have made the data more consistent and actually seems to have increased outliers so it will not be utilized for the forecast.

Model selection

Training and test dataset was created using 80% of the data for training and 20% for testing. Due to the qualities of the data described above, 3 different models were evaluated: seasonal naive bayes, ETS, and ARIMA.

# Split test/training data 

atm1_train <- atm1[1:round(.8*nrow(atm1)),]
atm1_test <- anti_join(atm1, atm1_train, by = "DATE")

# Forecast test data
atm1_model <- atm1_train |>
  model("snaive" = SNAIVE(Cash), 
        "ets" = ETS(Cash), 
        "arima" = ARIMA(Cash)) |>
  forecast(h=nrow(atm1_test))

# Model accuracy 
accuracy(atm1_model, atm1) |>
  select(.model, RMSE)

The ETS model has the lowest RMSE in this case so I will use it for the final forecast. The ETS residuals show some white noise but no pattern or significant outliers. There are a few spikes outside the blue dotted line in the acf plot but the plots are mostly within the lines. The residuals also show a normal distribution.

# Residuals 
atm1 |> 
  model(ETS(Cash)) |> 
  gg_tsresiduals()

## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Forecast

Using the ETS model, a 30 day forecast was created.

# Forecast 
atm1_forecast <- atm1 |>
  model(ETS(Cash)) |>
  forecast(h=30)

atm1_forecast |>
  autoplot(atm1, level =NULL)

# Final output 
atm1_output <- atm1_forecast[,-c(2,4)] |>
  rename(Cash = .mean)
head(atm1_output)

ATM2

Exploratory data analysis

# Plot data
atm2 <- atm |>
  filter(ATM == "ATM2") 
atm2 |> autoplot(Cash)

This dataset seems similar to the last ATM1 dataset.

# ACF plot
atm2 |> 
  ACF(Cash) |>
  autoplot()

The ACF plot confirms weekly seasonality at 7,14, and 21 days.

Like the previous, missing data will be replaced with median value of the data.

# Replace NA data with median
atm2$Cash[is.na(atm2$Cash)] <- median(atm2$Cash, na.rm=TRUE)

Box-cox transformation

A box-cox transformation was applied to standardize the dataset but the shape of the data does not appear to have changed significantly so it will not be used in the forecast.

# Box cox 
# lambda
lambda_atm2 <- atm2 |> 
  features(Cash, features=guerrero) |>
  pull(lambda_guerrero)

atm2_bc <- atm2 |>
  mutate(bc=box_cox(Cash, lambda_atm2)) |>
  select(DATE, bc)

atm2_bc |> 
  autoplot(bc)

Model selection

Like the previous ATM1 dataset, a training and test data split was used with 80% of datapoint for training and 20% for testing. Since the data is also seasonal a seasonal Naive Bayes, ETS, and ARIMA model evaluated for this dataset.

# Split test/training data 

atm2_train <- atm2[1:round(.8*nrow(atm2)),]
atm2_test <- anti_join(atm2, atm2_train, by = "DATE")

# forecast test data
atm2_model <- atm2_train |>
  model("snaive" = SNAIVE(Cash), 
        "ets" = ETS(Cash), 
        "arima" = ARIMA(Cash)) |>
  forecast(h=nrow(atm2_test))

# model accuracy 
accuracy(atm2_model, atm2) |>
  select(.model, RMSE)

Seasonal naive bayes model had the lowest RMSE value so it will be utilized for the final forecast.

# Residuals 
atm2 |> 
  model(SNAIVE(Cash)) |> 
  gg_tsresiduals()

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_bin()`).

The residuals plot shows consistency in spikes but the acf plot has a significant outlier at lag 7. This may be because the cyclical seasonality every 7 days is extremely consistent. I also checked the residuals of the ETS and ARIMA models as well to see if there was a similar pattern in these models.

# Residuals for ETS
atm2 |> 
  model(ETS(Cash)) |> 
  gg_tsresiduals()

# Residuals for ARIMA 
atm2 |> 
  model(ARIMA(Cash)) |> 
  gg_tsresiduals()

Interestingly, the acf model for ARIMA seem closer within the confines of the blue line than ECF and SNAIVE. However, there is a still a spike at lag 7 although much closer to the blue lines. Based on these residuals, I think ARIMA will provide a more accurate forecast but I will forecast both SNAIVE and ARIMA to see if there is a major difference.

Forecast

# ARIMA forecast 
atm2_forecast_arima <- atm2 |>
  model(ARIMA(Cash)) |>
  forecast(h=30)

atm2_forecast_arima |>
  autoplot(atm2, level=NULL)

# SNAIVE forecast 
atm2_forecast_snaive <- atm2 |>
  model(SNAIVE(Cash)) |>
  forecast(h=30)

atm2_forecast_snaive |>
  autoplot(atm2, level=NULL)

From the plots we can see that both forecasts are very similar but SNAIVE has a wider range of peaks and lows compared to the ARIMA forecast. Visually, based on the shape of the historical data, I would say that the SNAIVE model actually looks more accurate. I would recommend utilizing the SNAIVE model for further forecasts in this dataset.

# Final output
atm2_output <- atm2_forecast_snaive[,-c(2,4)] |>
  rename(Cash = .mean)

head(atm2_output)

ATM3

Exploratory data analysis

# Plot data 
atm3 <- atm |>
  filter(ATM == "ATM3") 
atm3 |> autoplot(Cash)

The plot for this data is very interesting showing cash taken out towards the end of the dataset.

tail(atm3, 10)

We can see that cash withdrawals happened starting from 4/28/2010 and there are only 3 nonzero data points which is insufficient for a reliable predictive model.

Forecast

Since it will not be feasible to use a predictive model, I tried to create a model using the mean. However, the minimal data in this dataset is an extreme limitation to the accuracy of this forecast.

# Forecast data using nonzero values
atm3_nonzero <- atm3 |> 
  filter(Cash != 0)

atm3_forecast <- atm3_nonzero |>
  model(MEAN(Cash)) |>
  forecast(h=30)

atm3_forecast |>
  autoplot(atm3, level =NULL)

atm3_output <- atm3_forecast[, -c(2,4)] |>
  rename(Cash = .mean)
head(atm3_output)

ATM4

Exploratory data analysis

# Plot data 
atm4 <- atm |>
  filter(ATM == "ATM4") 
atm4 |> autoplot(Cash)

There is a clear outlier between February and March. To avoid this outlier impacting the forecast, it was replaced with the median value of the data set. Missing data points were also replaced with the median.

# Clean data to remove outlier 
atm4$Cash[atm4$Cash == max(atm4$Cash, na.rm=TRUE)] <- median(atm4$Cash, na.rm=TRUE)
atm4$Cash[is.na(atm4$Cash)] <- median(atm4$Cash, na.rm=TRUE)

atm4 |> autoplot(Cash)

Now it’s easier to see the actual shape of the other data points.

# ACF plot 
atm4 |>
  ACF(Cash) |>
  autoplot()

The ACF plot shows that this data also has weekly seasonality like the previous plots.

Box-cox transformation

A box-cox transformation was applied to limit variation of data but it did not appear to be effective so it will not be utilized in the forecast.

# Find lambda
lambda_atm4 <- atm4 |> 
  features(Cash, features=guerrero) |>
  pull(lambda_guerrero) 

# Box cox transformation 
atm4_bc <- atm4 |>
  mutate(bc=box_cox(Cash, lambda_atm4)) |> 
  select(DATE, bc)

atm4_bc |>
  autoplot(bc)

Model selection

Like ATM1 and ATM2, 80% of the data will be used for training and 20% for testing. SNAIVE, ETS, and ARIMA models were utilized for this data.

# Test/train split 
atm4_train <- atm4[1:round(.8*nrow(atm4)),]
atm4_test <- anti_join(atm4, atm4_train, by = "DATE")

# forecast test data
atm4_model <- atm4_train |>
  model("snaive" = SNAIVE(Cash), 
        "ets" = ETS(Cash), 
        "arima" = ARIMA(Cash)) |>
  forecast(h=nrow(atm4_test))

# model accuracy 
accuracy(atm4_model, atm4) |>
  select(.model, RMSE)

The ARIMA model had the lowest RMSE so I will be using this model for the final forecast.

# Residuals 
atm4 |> 
  model(ARIMA(Cash)) |> 
  gg_tsresiduals()

The residuals show a plot skewed to the left and a spike outside the blue lines. This may indicate that ARIMA is not the ideal plot but I will try it to see what the plot visually looks like.

Forecast

# Forecast 
atm4_forecast <- atm4 |>
  model(ARIMA(Cash)) |>
  forecast(h=30)

atm4_forecast |>
  autoplot(atm4, level =NULL)

The forecast does not seem accurate based on the plot as the range of data points seems too narrow. I tried plotting ETS and seasonal naive bayes model as well to see if either of these seemed more accurate.

# ETS Forecast 
atm4_forecast_ETS <- atm4 |>
  model(ETS(Cash)) |>
  forecast(h=30)

atm4_forecast_ETS |>
  autoplot(atm4, level =NULL)

# SNAIVE Forecast 
atm4_forecast_snaive <- atm4 |>
  model(SNAIVE(Cash)) |>
  forecast(h=30)

atm4_forecast_snaive |>
  autoplot(atm4, level =NULL)

The SNAIVE model forecast actually seems most accurate visually. This will be used for the final output.

# Final output 
atm4_output <- atm4_forecast_snaive[,-c(2,4)] |>
  rename(Cash = .mean)

head(atm4_output)

Report

# Combine all predictions  
full_atm_output <- rbind(as_tibble(atm1_output), as_tibble(atm2_output), as_tibble(atm3_output), as_tibble(atm4_output)) |> 
  arrange(DATE)

# Create wide-format
atm_wide <- full_atm_output |> 
  pivot_wider(names_from = ATM, values_from = Cash)

# Write to CSV files
write.csv(full_atm_output, 'atm_output_Song_J.csv')
write.csv(atm_wide, 'atm_output_wide_Song_J.csv')

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.

# Load data 
power <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")

clean data

# view data
head(power)

# clean data
power$"YYYY-MMM" <- yearmonth(power$"YYYY-MMM")
power <- power |>
  rename(Date = "YYYY-MMM") |>
  select(Date, KWH) |>
  as_tsibble(index = "Date")

power |> 
  autoplot(KWH)

The plot seems to have a seasonal trend and the ACF plot confirms this with peaks at 6, 12, and 18 months and lows at 3, 9, 15, and 21 months. There is also a significant outlier in the plot.

power |>
  ACF(KWH) |> 
  autoplot()

Missing data was replaced with the median of the dataset as well as the outlier.

# Replace outlier with median 
power$KWH[power$KWH == min(power$KWH, na.rm = TRUE)] <- median(power$KWH, na.rm = TRUE)

# Replace missing values with median
power$KWH[is.na(power$KWH)] <- median(power$KWH, na.rm=TRUE)
power |>
  autoplot(KWH)

The plot is now easier to read once adjusting the outlier.

Box-cox transformation

A box-cox transformation was applied to see if the variances would be more consistent although the data looks relatively consistent already. The resulting plot does not appear to have made a difference so it will not be utilized in the forecast.

# lambda 
lambda_power <- power |>
  features(KWH, features = guerrero) |>
  pull(lambda_guerrero)

# Box-cox transformation
power_bc <- power |>
  mutate(bc = box_cox(KWH, lambda_power)) |>
  select(Date, bc)

power_bc |>
  autoplot(bc)

Model selection

A test and training split was created using 80% of data for training and 20% for testing. Like in part A, seasonal naive bayes, ETS, and ARIMA models were evaluated.

# Train and test data 
power_train <- power[1:round(.8*nrow(power)),]
power_test <- anti_join(power, power_train, by = "Date") 

# Predict test data 
power_model <- power_train |>
  model("snaive" = SNAIVE(KWH), 
        "ets" = ETS(KWH), 
        "arima" = ARIMA(KWH)) |>
  forecast(h=nrow(power_test))

accuracy(power_model, power) |>
  select(.model, RMSE)

The ARIMA model has the lowest RMSE would it will be used for the forecast.

# Residuals 
power |>
  model(ARIMA(KWH)) |>
  gg_tsresiduals()

The residuals seem to white noise. There is 1 spike outside the blue line in the acf plot but the rest are within the confines.

Forecast

# 12 month forecast 
power_forecast <- power |>
  model(ARIMA(KWH)) |>
  forecast(h=12)

power_forecast |>
  autoplot(power, level=NULL)

# Final output 
power_output <- power_forecast[,-c(1,3)] |>
  rename(KWH = .mean)

print(power_output)

## # A tsibble: 12 x 2 [1M]
##        Date       KWH
##       <mth>     <dbl>
##  1 2014 Jan 10138894.
##  2 2014 Feb  8757344.
##  3 2014 Mar  6702667.
##  4 2014 Apr  6008940.
##  5 2014 May  5943702.
##  6 2014 Jun  8207930.
##  7 2014 Jul  9520071.
##  8 2014 Aug 10010697.
##  9 2014 Sep  8488807.
## 10 2014 Oct  5867038.
## 11 2014 Nov  6155281.
## 12 2014 Dec  8307433.

Report

write.csv(power_output, "power_ouput_Song_J.csv")

Part C – BONUS, optional (part or all)

Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.

# Load data 
pipe1 <- read_xlsx('Waterflow_Pipe1.xlsx')
pipe2 <- read_xlsx('Waterflow_Pipe2.xlsx')

head(pipe1)

head(pipe2)

Clean data

# Combine datasets
pipe <- rbind(pipe1 , pipe2) |> 
  rename(date_time = 'Date Time')

# Convert date time format 
pipe <- pipe |>
  mutate(date_time = convertToDateTime(date_time),
         date = as.Date(date_time),
         hour = format(date_time,"%H")) |>
  group_by(date, hour) |>
  mutate(hour_avg = mean(WaterFlow)) |>
  ungroup() |>
  select(date, hour, hour_avg) |>
  distinct() |>
  mutate(date_time = paste0(date, " ", hour, ':00:00'))

pipe$date_time <- as.POSIXct(pipe$date_time)
pipe <- pipe |> select(date_time, hour_avg) |> as_tsibble(index = 'date_time')
print(head(pipe))

## # A tsibble: 6 x 2 [1h] <?>
##   date_time           hour_avg
##   <dttm>                 <dbl>
## 1 2015-10-23 00:00:00     26.1
## 2 2015-10-23 01:00:00     18.8
## 3 2015-10-23 02:00:00     24.5
## 4 2015-10-23 03:00:00     25.6
## 5 2015-10-23 04:00:00     22.4
## 6 2015-10-23 05:00:00     23.6

EDA

pipe |> autoplot(hour_avg)

The variance in data is lower in the beginning and higher towards the middle and end of the dataset.

Box cox transformation

A box-cox transformation was applied due to the variance in this dataset.

# Find lambda
lambda_pipe <- pipe |>
  features(hour_avg, features = guerrero) |>
  pull(lambda_guerrero)

# Apply Box-Cox transformation
pipe_bc <- pipe |>
  mutate(bc = box_cox(hour_avg, lambda_pipe)) |>
  select(date_time, bc)

pipe_bc |> 
  autoplot(bc)

No seasonal differencing and one order of differencing were found to be necessary to achieve stationarity

# Seasonal differencing
pipe_bc |> 
  features(bc, unitroot_nsdiffs)

# One order of differencing 
pipe_bc |>
  features(bc, unitroot_ndiffs)

# Apply one order of differencing 

pipe_bc |>
  mutate(diff_bc = difference(bc)) |>
  autoplot(diff_bc)

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

The data now does not appear to show an overall trend or seasonality.

Forecast

Due to the shape of the data, an ARIMA model is likely the best model to use in this case.

# Fill gaps in data
pipe_bc <- fill_gaps(pipe_bc)

# Generate forecast (forecast was made out to 5 days (120 hrs)
pipe_forecast <- pipe_bc |>
  model(ARIMA(bc)) |>
  forecast(h = 120)

pipe_forecast |> 
  autoplot(pipe_bc, level = NULL)

# Final output
pipe_output <- pipe_forecast[,-c(1,3)] |>
  rename(bc = .mean) |>
  mutate(hour_avg = inv_box_cox(bc, lambda_pipe)) |>
  select(date_time, hour_avg)

head(pipe_output)

Report

write.csv(pipe_output, 'pipe_output_Song_J.csv')

624 Project1

Jane Song

2026-03-14

Part A – ATM Forecast, ATM624Data.xlsx

ATM 1

Exploratory data anlaysis

Box-cox transformation

Model selection

Forecast

ATM2

Exploratory data analysis

Box-cox transformation

Model selection

Forecast

ATM3

Exploratory data analysis

Forecast

ATM4

Exploratory data analysis

Box-cox transformation

Model selection

Forecast

Report

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

clean data

Box-cox transformation

Model selection

Forecast

Report

Part C – BONUS, optional (part or all)

Clean data

EDA

Box cox transformation

Forecast

Report