── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(forecast)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
library(lubridate)library(readxl)
Warning: package 'readxl' was built under R version 4.5.3
library(writexl)
Warning: package 'writexl' was built under R version 4.5.3
Part A – ATM Forecast, ATM624Data.xlsx
First we convert the date to standard date format year, month, day format to make it easy to read.
#Convert date column and remove empty money columnsdf <-read_excel("ATM624Data.xlsx") %>%mutate(Date =as.Date(DATE, origin ="1899-12-30")) %>%filter(!is.na(Cash))
When we analyized the data I noticed that ATM3 is missing almost the entire years worth of data. In a business context I would assume that ATM3 was out of service or the data is missing so it was removed from the final model.
Here I train the model our model. Converted zero to NA and used to TSclean function to remove them and smooth any extreme outliers, to make sure the model training wan’t corrupted. When it comes to picking a model Auto-ARIMA was choosen and the Seasonality was set to weekly. Linear interpolation was used for missing values to maintain the weekly pattern. SMA wasn’t used since it lags behind trends and fails to capture specific day of the week seasonality and Linear Regression assume the data points are independent which isn’t true in the case of an ATM
train_df <- df %>%filter(Date <as.Date("2010-05-01"))#Creating Forecasting Functionget_atm_forecast <-function(atm_id){#Adding in cleaning step due to atm3 and atm4 temp_data <- train_df %>%filter(ATM == atm_id) %>%#Converting 0 as NA for tscleanmutate(Cash =ifelse(Cash <=0, NA, Cash))#Creating a time series and cleaning it atm_ts <-ts(temp_data$Cash, frequency =7) %>%tsclean()# Fit Model Auto-ARIMA fit <-auto.arima(atm_ts, stepwise =FALSE, approximation =FALSE)# Month Forecast fc <-forecast(fit, h =31)return(as.vector(fc$mean))}
The final forecast shows that ATM 1 has a clear weekly pattern with projections staying mostly within 75 and 100. ATM2 has extremely high volatility with high peaks and low days. After the removing the outliers ATM4 has exetremely high sustained volume, hovering around $450.
Warning: package 'tsoutliers' was built under R version 4.5.3
There’s a large drop off around 2010. We’ll need to account for that outlier when we’re making the model.
data <-read_excel("ResidentialCustomerForecastLoad-624.xlsx")#Creating time series objectkwh_ts <-ts(data$KWH, start =c(1998, 1), frequency =12)autoplot(kwh_ts) +labs(title="Residential Power Consumption", y="KWH", x="Year")
Cleaning the data and removing the outlier
#TSclean Identify and replaces them outliers with interpolated values)kwh_ts_clean <-tsclean(kwh_ts)autoplot(kwh_ts, series="Original") +autolayer(kwh_ts_clean, series="Cleaned") +labs(title="Original vs. Outlier-Cleaned Data", y="KWH") +scale_color_manual(values=c("Original"="red", "Cleaned"="black"))
I used seasonal ARIMA similar the previous problem the only main difference is I used didn’t use stepwise = FALSE to force the model to work harder to find the best AIC score
process_waterflow_data <-function(file_path) { df <-read_excel(file_path)#Using .POSIXct with origin="1899-12-30 to convert to date time" df$DateTimeConverted <-as.POSIXct(df[[1]] *86400, origin ="1899-12-30", tz ="UTC")# Aggregate by floor hour and take the mean of WaterFlow# Floor hour turns "2023-10-27 15:34:21" into "2023-10-27 15:00:00" hourly_data <- df %>%mutate(DateTimeConverted =as.POSIXct(.[[1]] *86400, origin ="1899-12-30", tz ="UTC"),Hour =floor_date(DateTimeConverted, unit ="hour") ) %>%group_by(Hour) %>%# Use the actual column name here:summarize(Mean_WaterFlow =mean(WaterFlow, na.rm =TRUE)) %>%ungroup()return(hourly_data)}
pipe1_hourly <-process_waterflow_data("Waterflow_Pipe1.xlsx")start_time <-min(pipe1_hourly$Hour)end_time <-max(pipe1_hourly$Hour)waterflow_ts <-ts(pipe1_hourly$Mean_WaterFlow, frequency =24)
Test for Stationarity
adf_test <-adf.test(waterflow_ts)
Warning in adf.test(waterflow_ts): p-value smaller than printed p-value
print(adf_test)
Augmented Dickey-Fuller Test
data: waterflow_ts
Dickey-Fuller = -6.2649, Lag order = 6, p-value = 0.01
alternative hypothesis: stationary
is_stationary <- adf_test$p.value <0.05
The data is stationary so we can start forecasting
fit <-auto.arima(waterflow_ts, seasonal =TRUE)pipe1_forecast <-forecast(fit, h =168)plot(pipe1_forecast, main="Weekly Water Flow Forecast - Pipe 1", xlab="Time (Hours)", ylab="Flow Rate", col="blue")