The goal of this analysis is to forecast how much cash is withdrawn from four ATMs during May 2010. The data is provided in a single file, and the variable 'Cash' is reported in hundreds of dollars.
Prepping the Data
# loading data and libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(tsibble)
library(zoo)
library(fable)   # needed for the model()/ARIMA()/forecast() calls in later chunks
library(feasts)  # needed for STL(), gg_tsresiduals(), and the guerrero feature used later
Registered S3 method overwritten by 'tsibble':
method from
as_tibble.grouped_df dplyr
Attaching package: 'tsibble'
The following object is masked from 'package:lubridate':
interval
The following objects are masked from 'package:base':
intersect, setdiff, union
Attaching package: 'zoo'
The following object is masked from 'package:tsibble':
index
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
# I uploaded the provided file into GitHub for reproducibility
atm_data <- read_csv("https://raw.githubusercontent.com/Lfirenzeg/msds624labs/refs/heads/main/Project%201/ATM624Data.csv")
Rows: 1474 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): DATE, ATM
dbl (1): Cash
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now we explore the data and look for missing values.
# check structure and first few rows
head(atm_data)
# A tibble: 6 × 3
DATE ATM Cash
<chr> <chr> <dbl>
1 5/1/2009 12:00:00 AM ATM1 96
2 5/1/2009 12:00:00 AM ATM2 107
3 5/2/2009 12:00:00 AM ATM1 82
4 5/2/2009 12:00:00 AM ATM2 89
5 5/3/2009 12:00:00 AM ATM1 85
6 5/3/2009 12:00:00 AM ATM2 90
# Check for missing values in each column
colSums(is.na(atm_data))
DATE ATM Cash
0 14 19
# Count the number of unique ATM identifiers
unique(atm_data$ATM)
[1] "ATM1" "ATM2" NA "ATM3" "ATM4"
The ATM and Cash columns have missing values (14 and 19 NAs, respectively). Each ATM is expected to have consistent daily records, so we can drop the rows with missing ATM or Cash, since the proportion is small relative to the almost 1,500 entries.
# dropping rows with missing ATM or Cash values
atm_clean <- atm_data %>%
  filter(!is.na(ATM), !is.na(Cash))
Next, we convert the DATE column to a proper date-time format.
# DATE column using lubridate
atm_clean <- atm_clean %>%
  mutate(DATE = mdy_hms(DATE))

summary(atm_clean$DATE)
Min. 1st Qu.
"2009-05-01 00:00:00.0000" "2009-08-01 00:00:00.0000"
Median Mean
"2009-10-31 00:00:00.0000" "2009-10-30 11:00:07.4226"
3rd Qu. Max.
"2010-01-29 12:00:00.0000" "2010-04-30 00:00:00.0000"
colSums(is.na(atm_clean))
DATE ATM Cash
0 0 0
At this point there are no missing values.
First Data Exploration
At this point it is beneficial to visually explore the data.
ggplot(atm_clean, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "Cash Withdrawals per ATM (Raw data)", x = "Date", y = "Cash (hundreds of $)") +
  theme_minimal()
We can see that ATM1 and ATM2 have consistent daily activity, with withdrawals in a small, predictable range. However, ATM4 behaves quite erratically and has a major outlier around January/February 2010, which could be a data error or some unusual event. ATM3 could be considered either new or out of service for a while, since it shows 0 withdrawals for most of the entries.
In any case, the first three ATMs look like good candidates for either ARIMA or ETS forecasting, while ATM4 may require removing the outlier, or perhaps an STL decomposition that is robust to anomalies. At minimum, the forecast intervals will reflect the higher uncertainty.
Reshaping the Data
We will group the data so that each date-ATM pair forms a group (in case there are multiple rows per ATM per day) and sum the cash withdrawals so that we have only one cash value per ATM per day. We then check for NAs again.
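The reshaping code is not shown above; the following is a minimal sketch, assuming the wide table is named atm_wide (an assumed name) and that DATE is converted to a plain date at this point, which matches the date-only summaries shown below.

# Sum any duplicate ATM-day rows, then spread the ATMs into columns
atm_wide <- atm_clean %>%
  mutate(DATE = as_date(DATE)) %>%
  group_by(DATE, ATM) %>%
  summarise(Cash = sum(Cash), .groups = "drop") %>%
  pivot_wider(names_from = ATM, values_from = Cash)

summary(atm_wide)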
DATE ATM1 ATM2 ATM3
Min. :2009-05-01 Min. : 1.00 Min. : 0.00 Min. : 0.0000
1st Qu.:2009-07-31 1st Qu.: 73.00 1st Qu.: 25.50 1st Qu.: 0.0000
Median :2009-10-30 Median : 91.00 Median : 67.00 Median : 0.0000
Mean :2009-10-30 Mean : 83.89 Mean : 62.58 Mean : 0.7206
3rd Qu.:2010-01-29 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000
Max. :2010-04-30 Max. :180.00 Max. :147.00 Max. :96.0000
NA's :3 NA's :2
ATM4
Min. : 2
1st Qu.: 124
Median : 404
Mean : 474
3rd Qu.: 705
Max. :10920
We now see 3 NAs for ATM1 and 2 for ATM2. ATM1 and ATM2 were missing rows for a few specific dates in the long format, so pivot_wider() inserted NA for those ATMs on those dates.
Since we now have full date coverage, we can fill these missing values using a rolling median, which is safer than a single overall median or a forward fill.
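The imputation code is not shown; here is a sketch of how it could be done with zoo::rollapply, assuming a centred 7-day window (an illustrative choice) and storing the result as atm_timeseries, the object plotted below.

# Replace each NA with the median of a centred rolling window
impute_roll_median <- function(x, k = 7) {
  roll_med <- zoo::rollapply(x, width = k, FUN = median, na.rm = TRUE,
                             fill = NA, align = "center")
  ifelse(is.na(x), roll_med, x)
}

atm_timeseries <- atm_wide %>%
  mutate(across(starts_with("ATM"), impute_roll_median))

summary(atm_timeseries)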
DATE ATM1 ATM2 ATM3
Min. :2009-05-01 Min. : 1.00 Min. : 0.00 Min. : 0.0000
1st Qu.:2009-07-31 1st Qu.: 73.00 1st Qu.: 25.00 1st Qu.: 0.0000
Median :2009-10-30 Median : 91.00 Median : 67.00 Median : 0.0000
Mean :2009-10-30 Mean : 84.19 Mean : 62.57 Mean : 0.7206
3rd Qu.:2010-01-29 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000
Max. :2010-04-30 Max. :180.00 Max. :147.00 Max. :96.0000
ATM4
Min. : 2
1st Qu.: 124
Median : 404
Mean : 474
3rd Qu.: 705
Max. :10920
We have successfully eliminated the NAs, and the summary statistics before and after imputation are almost identical.
# Convert to long format for plotting
atm_long <- atm_timeseries %>%
  pivot_longer(cols = -DATE, names_to = "ATM", values_to = "Cash")

# Plot cleaned data
ggplot(atm_long, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "Cleaned Daily Withdrawals per ATM", x = "Date", y = "Cash (hundreds of $)") +
  theme_minimal()
Decomposing ATM1 Data
We can now use a tidy model approach and decompose the data for each ATM.
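The decomposition code is not shown; here is a sketch using the feasts STL tooling, assuming the long data is first converted to a daily tsibble keyed by ATM and named ATM_tsibble (the object used in later chunks).

# Daily tsibble keyed by ATM
ATM_tsibble <- atm_long %>%
  as_tsibble(index = DATE, key = ATM)

# STL decomposition for ATM1
ATM_tsibble %>%
  filter(ATM == "ATM1") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM1")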
We can see strong seasonality and some trend variation for ATM1. A transformation may not be needed here, but this can be confirmed with the Box-Cox lambda used below.
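The model comparison referenced next is not shown; the sketch below builds the atm1_data, lambda1, and atm1_fit objects used in later chunks, assuming the same candidate models that appear in the ATM3 comparison table further down (additive and multiplicative ETS, ARIMA, MEAN, and seasonal naive); the exact ETS specifications are assumptions.

# Box-Cox lambda for ATM1 via the Guerrero method
atm1_data <- ATM_tsibble %>% filter(ATM == "ATM1")

lambda1 <- atm1_data %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

# Fit candidate models and compare fit statistics alongside training RMSE
atm1_fit <- atm1_data %>%
  model(
    additive          = ETS(box_cox(Cash, lambda1) ~ error("A") + trend("A") + season("A")),
    multiplicative_ts = ETS(Cash ~ error("M") + trend("A") + season("M")),
    ARIMA             = ARIMA(box_cox(Cash, lambda1)),
    `MEAN(Cash)`      = MEAN(Cash),
    snaive            = SNAIVE(Cash)
  )

atm1_fit %>%
  glance() %>%
  left_join(accuracy(atm1_fit) %>% select(.model, RMSE), by = ".model") %>%
  select(.model, sigma2, log_lik, AIC, AICc, BIC, RMSE)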
Although the multiplicative_ts model had the lowest RMSE, indicating the best training accuracy, the ARIMA model had by far the best AIC and a relatively good RMSE, making it the simplest model with high accuracy.
But which ARIMA specification was selected?
# Finding the best ARIMA model with report()
atm1_fit %>%
  select(.model = "ARIMA") %>%
  report()
So we find the selected model to be ARIMA(0,0,2)(0,1,1)[7].
Forecasting ATM1
We will fit the proposed ARIMA model, extract the residuals, and run a Box-Pierce test with a lag of 14 (2 × 7), which is appropriate for weekly seasonal data, and 0 degrees of freedom, since the seasonal structure is already handled by the seasonal ARIMA.
# Forecasting the model
forecast_atm1 <- atm1_fit %>%
  forecast(h = 31) %>%
  filter(.model == 'ARIMA')

# Plotting the forecast for ATM1
forecast_atm1 %>%
  autoplot(atm1_data) +
  ggtitle(paste0("ATM 1 Forecasted with ARIMA(0,0,2)(0,1,1)_7 and lambda = ", round(lambda1, 2)))
# Plotting residuals
atm1_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals() +
  ggtitle(paste0("Residuals for ARIMA(0,0,2)(0,1,1)_7 with lambda = ", round(lambda1, 2)))
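The Box-Pierce test itself is not shown; a short sketch using the innovation residuals of the ARIMA model, with the lag of 14 and 0 degrees of freedom described above:

# Box-Pierce test on the ARIMA innovation residuals
atm1_fit %>%
  select(ARIMA) %>%
  augment() %>%
  features(.innov, box_pierce, lag = 14, dof = 0)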
The model captures the seasonality well, and the forecast range widens over time while maintaining the previously seen weekly cycles.
The residuals show no apparent autocorrelation and no major skewness. This is reinforced by the Box-Pierce test: with a p-value of 0.8, we fail to reject the null hypothesis of independent residuals, which supports the model fit.
Similar to ATM1, we see a clear weekly seasonality, with maybe some larger peaks at the beginning of each year. We can anticipate that ARIMA or ETS would be good models as well.
With a lambda of 0.7, the selected model has AR order p = 2, MA order q = 2, and seasonal MA order Q = 1 with seasonal differencing D = 1, i.e. ARIMA(2,0,2)(0,1,1)[7].
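The ATM2 fitting code is not shown; here is a sketch that mirrors the ATM1 approach and produces the atm2_data, lambda2, and atm2_fit objects used in the next section (only the ARIMA model is included here for brevity).

# Box-Cox lambda and ARIMA fit for ATM2, mirroring the ATM1 workflow
atm2_data <- ATM_tsibble %>% filter(ATM == "ATM2")

lambda2 <- atm2_data %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm2_fit <- atm2_data %>%
  model(ARIMA = ARIMA(box_cox(Cash, lambda2)))

atm2_fit %>% report()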
Forecasting ATM2
# Forecasting the model for ATM2
forecast_atm2 <- atm2_fit %>%
  forecast(h = 31) %>%
  filter(.model == 'ARIMA')

# Plotting the forecast for ATM2
forecast_atm2 %>%
  autoplot(atm2_data) +
  ggtitle(paste0("ATM 2 Forecasted with ARIMA(2,0,2)(0,1,1)_7 and lambda = ", round(lambda2, 2)))
# Plotting residuals
atm2_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals() +
  ggtitle(paste0("Residuals for ARIMA(2,0,2)(0,1,1)_7 with lambda = ", round(lambda2, 2)))
At a first glance we see the forecast line continue the weekly pattern, and the confidence interval behaving as expected (uncertainty growing over time).
The residual diagnostics are centered around 0, with most ACF spikes falling inside the bands, suggesting no significant autocorrelation. The histogram has a slight right skew, but it does not appear significant and the distribution looks fairly normal. Again, the p-value is greater than 0.05 (0.77), so we fail to reject the null and pass the independence test.
As mentioned previously, ATM3 shows 0 withdrawals for most of the reporting period, with only the final weeks of April 2010 showing any activity. This could be because the ATM was not yet installed or active, or was recorded as operational but simply not used. In any case, there is not enough historical data to detect seasonality, trends, or cyclical patterns.
Comparing Models ATM3
ATM3 is interesting in that it offers very little data, so comparing models would be done merely as an academic exercise.
lambda3 <- ATM_tsibble %>%
  filter(ATM == "ATM3") %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)
Warning in optimise(lambda_coef_var, c(lower, upper), x = x, .period =
max(.period, : NA/Inf replaced by maximum positive value
Warning in optimise(lambda_coef_var, c(lower, upper), x = x, .period =
max(.period, : NA/Inf replaced by maximum positive value
# A tibble: 5 × 7
.model sigma2 log_lik AIC AICc BIC RMSE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 additive 25.6 -1664. 3348. 3348. 3387. 4.99
2 ARIMA 382. -1603. 3213. 3213. 3224. 5.02
3 MEAN(Cash) 63.1 NA NA NA NA 7.93
4 snaive 64.3 NA NA NA NA 8.04
5 multiplicative_ts 8.49 -1618. 3256. 3257. 3295. 8.71
Even with limited data, the model comparison shows that ARIMA has the lowest AICc. However, due to the limited period of ATM3 activity, forecasts are inherently uncertain. While ARIMA(0,0,2) provided the best fit among the tested models, a simple mean forecast may offer comparable practical value for short-term planning.
Forecasting ATM3
# (assumed) ATM3 slice of the keyed tsibble, used as plotting context below
atm3_data <- ATM_tsibble %>% filter(ATM == "ATM3")

# Fitting the MEAN model
atm3_fit <- ATM_tsibble %>%
  filter(ATM == "ATM3", Cash != 0) %>%
  model(MEAN(Cash))

# Forecasting the model for ATM3
forecast_atm3 <- atm3_fit %>%
  forecast(h = 31)

# Plotting the forecast for ATM3
forecast_atm3 %>%
  autoplot(atm3_data) +
  ggtitle("ATM 3 Forecasted with MEAN() model")
ATM3 was inactive for most of the year and only started showing consistent withdrawals in April 2010. Given the limited historical data, a MEAN() model was applied. This forecast assumes the daily average seen in late April will continue through May, with moderate prediction intervals reflecting the short active window.
ATM4 shows stable daily activity with strong weekly seasonality, but includes at least one extreme outlier in early 2010. Given the impact this spike has on modeling, we applied STL decomposition with robust fitting to isolate the anomaly and forecast based on the smoothed components. This allowed us to preserve the underlying structure without distortion from the spike.
Detect and Remove Outliers
We will first identify the outliers.
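The atm4_tsibble object used below is not created in any of the chunks shown; presumably it is just the ATM4 slice of the keyed tsibble, as in this one-line sketch.

# (assumed) ATM4 slice of the keyed tsibble
atm4_tsibble <- ATM_tsibble %>% filter(ATM == "ATM4")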
# Use Interquartile Range (IQR) to flag outliers
atm4_outliers <- atm4_tsibble %>%
  mutate(
    is_outlier = Cash > quantile(Cash, 0.75) + 1.5 * IQR(Cash) |
                 Cash < quantile(Cash, 0.25) - 1.5 * IQR(Cash)
  )

# View how many outliers we have
atm4_outliers %>%
  filter(is_outlier)
# A tsibble: 2 x 4 [1D]
# Key: ATM [1]
DATE ATM Cash is_outlier
<date> <chr> <dbl> <lgl>
1 2009-09-22 ATM4 1712 TRUE
2 2010-02-09 ATM4 10920 TRUE
So if we define outliers as values more than 1.5 IQRs above the third quartile or below the first quartile, we can identify September 22, 2009 and February 9, 2010 as the only two outliers. Now we can proceed to replace them.
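The replacement step is not shown; here is a sketch that sets the two flagged dates to NA and interpolates them from an ARIMA fit with fabletools::interpolate(), producing the atm4_interpolated object used below (atm4_na is an assumed intermediate name).

# Set the flagged outliers to NA, then interpolate them from an ARIMA fit
atm4_na <- atm4_outliers %>%
  mutate(Cash = if_else(is_outlier, NA_real_, Cash)) %>%
  select(-is_outlier)

atm4_interpolated <- atm4_na %>%
  model(ARIMA(Cash)) %>%
  interpolate(atm4_na)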
Now we can explore the data again to see what we have.
atm4_interpolated %>%
  autoplot(Cash) +
  ggtitle("ATM4 without Outliers")
atm4_interpolated %>%
  filter(ATM == "ATM4") %>%
  model(STL(Cash ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition for ATM4 with no outliers")
Comparing Models for ATM4
ATM4 exhibited significant volatility in cash withdrawals, including major outliers—particularly on September 22, 2009, and February 9, 2010. These extreme values heavily skewed the data and could bias forecasting models. To address this, outliers were systematically identified using the Interquartile Range (IQR) method and excluded from the dataset before fitting any model.
After removing outliers, we performed STL decomposition to understand the underlying components of the time series. Clear weekly seasonality and a slightly declining trend were observed, with reduced residual noise compared to the original data.
A Box-Cox transformation was applied to stabilize variance, and multiple time series models were tested. The ARIMA(0,0,1)(2,0,0)[7] model with a Box-Cox lambda of 0.48 emerged as one of the most appropriate based on AIC, BIC, and residual analysis.
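The fitting code is not shown; here is a sketch consistent with the description above (a Guerrero-selected lambda and an automatically selected ARIMA), producing the lambda4 and atm4_fit objects referenced below.

# Box-Cox lambda and ARIMA fit for the outlier-adjusted ATM4 series
lambda4 <- atm4_interpolated %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

atm4_fit <- atm4_interpolated %>%
  model(ARIMA = ARIMA(box_cox(Cash, lambda4)))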
# report of the ARIMA model
atm4_fit %>%
  select(.model = "ARIMA") %>%
  report()
So we can see an MA(1) component in the non-seasonal part and a seasonal AR component of order 2 with a weekly period (seasonality of 7 days), with the model assuming a non-zero mean.
Forecasting ATM4
# Forecasting the model for ATM4
forecast_atm4 <- atm4_fit %>%
  forecast(h = 31) %>%
  filter(.model == 'ARIMA')

# Plotting the forecast for ATM4
forecast_atm4 %>%
  autoplot(atm4_interpolated) +
  ggtitle(paste0("ATM 4 Forecasted with ARIMA(0,0,1)(2,0,0)_7 and lambda = ", round(lambda4, 2)))
# Plotting residuals
atm4_fit %>%
  select(ARIMA) %>%
  gg_tsresiduals() +
  ggtitle(paste0("Residuals for ARIMA(0,0,1)(2,0,0)_7 with lambda = ", round(lambda4, 2)))
The results indicate a continuation of the weekly seasonal pattern observed historically. While volatility remains, the forecast no longer includes the erratic spikes seen in the raw data.
After cleaning the data, the ARIMA(0,0,1)(2,0,0)[7] model with a Box-Cox transformation (λ = 0.48) provided a robust forecast of daily withdrawals. While ATM4 still exhibits more variability than other machines, the outlier-adjusted model is statistically sound and should provide reliable estimates for cash flow planning during May 2010.
Exporting the data
# We can now combine all 4 forecasts to export
atms_forecast <- bind_rows(forecast_atm1, forecast_atm2, forecast_atm3, forecast_atm4) %>%
  as.data.frame() %>%
  select(DATE, ATM, .mean) %>%
  rename(Cash = .mean)

# export file
atms_forecast %>%
  write.csv("ATM_Forecasts.csv")
Conclusions
atms_forecast %>%
  as_tsibble(index = DATE, key = ATM) %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "Forecast for ATM Withdrawals in May 2010")
# Combined forecast plot, with historical data as black line
atms_forecast %>%
  as_tsibble(index = DATE, key = ATM) %>%
  autoplot(Cash) +
  autolayer(ATM_tsibble, Cash, colour = "black") +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Withdrawals and Forecasts", y = "Cash (hundreds of $)", x = "Date") +
  theme_minimal()
This analysis presents a comprehensive time series forecast of daily cash withdrawals for four ATMs (ATM1–ATM4) for the month of May 2010. After cleaning and preparing the dataset, we applied tailored modeling techniques to each ATM based on its individual usage pattern, seasonality, and data quality.
ATM1 and ATM2 demonstrated highly consistent, seasonally-driven withdrawal patterns. Both were successfully modeled using seasonal ARIMA approaches that captured weekly trends with high accuracy. Model diagnostics confirmed well-behaved residuals and strong statistical fit.
Business implication: Cash needs for these ATMs can be reliably anticipated on a weekly cycle, enabling efficient scheduling of replenishments and minimizing idle cash.
ATM3 had very limited historical data, as it was inactive or recorded zero withdrawals for most of the observed period. Given this, we opted for a MEAN model, which assumes the recent average withdrawals will continue into May.
Business implication: Forecasts for ATM3 should be treated with caution due to data limitations. Continued monitoring is recommended to reassess the model as more usage data becomes available.
ATM4 presented extreme outliers likely due to unusual events or recording errors. These anomalies were identified and removed using IQR-based filtering, followed by interpolation. After cleaning, we applied a Box-Cox transformed ARIMA model that accounted for volatility while preserving the core weekly patterns.
Business implication: With outliers removed, ATM4 displays more stable behavior. Still, its variability remains higher than the others, suggesting a more flexible cash reserve strategy may be prudent.
Power Forecast
The goal of this analysis is to model residential power usage from January 1998 through December 2013 and forecast 2014. The data is given in a single file. The variable 'KWH' is power consumption in kilowatt hours.
library(forecast)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo
library(imputeTS)
Attaching package: 'imputeTS'
The following object is masked from 'package:zoo':
na.locf
library(tseries)
Attaching package: 'tseries'
The following object is masked from 'package:imputeTS':
na.remove
library(ggplot2)
library(dplyr)

# Loading the data
power_data <- read_csv("https://raw.githubusercontent.com/Lfirenzeg/msds624labs/refs/heads/main/Project%201/ResidentialCustomerForecastLoad-624.csv")
Rows: 192 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): YYYY-MMM
dbl (2): CaseSequence, KWH
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# rename columns for easier handling
power_data <- power_data %>%
  rename(Month = `YYYY-MMM`)
First Data Exploration
# Convert the 'Month' column to Date format
power_data <- power_data %>%
  mutate(Month = ym(Month))

# Check structure again
summary(power_data)
CaseSequence Month KWH
Min. :733.0 Min. :1998-01-01 Min. : 770523
1st Qu.:780.8 1st Qu.:2001-12-24 1st Qu.: 5429912
Median :828.5 Median :2005-12-16 Median : 6283324
Mean :828.5 Mean :2005-12-15 Mean : 6502475
3rd Qu.:876.2 3rd Qu.:2009-12-08 3rd Qu.: 7620524
Max. :924.0 Max. :2013-12-01 Max. :10655730
NA's :1
# Plot a time series for early exploration
ggplot(power_data, aes(x = Month, y = KWH)) +
  geom_line(color = "blue") +
  labs(title = "Residential Power Usage (KWH) Over Time", x = "Date", y = "KWH")
There is a clear pattern repeating every year, indicating strong seasonal behavior, as expected for energy usage tied to the weather.
There also seems to be a gradual upward trend, more evident after 2005, perhaps related to an increase in population.
However, it is worth noting that there is a significant dip around 2010, and there is also an NA in the KWH column that needs to be addressed.
Outliers and Missing Data
We can start with the outlier. It corresponds to July 2010 and looks like it might be missing a digit when compared to every other entry in the data (770,523, versus typical values in the millions). So we will remove it and interpolate a replacement.
# First we can calculate IQR bounds, and then replace any outliers exceeding those bounds with NA
Q1 <- quantile(power_data$KWH, 0.25, na.rm = TRUE)
Q3 <- quantile(power_data$KWH, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

power_data <- power_data %>%
  mutate(KWH = ifelse(KWH < lower_bound | KWH > upper_bound, NA, KWH))
At this point there should be no remaining outliers, so we can proceed to create a time series and fit an ARIMA model with the current data to later interpolate missing values.
# We need to create a monthly time series starting from Jan 1998 in order to
# address the missing values.
kwh_ts <- ts(power_data$KWH, start = c(1998, 1), frequency = 12)

# Now fit ARIMA model (which handles missing values automatically)
arima_model <- auto.arima(kwh_ts, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)

# Summary of the model
summary(arima_model)
Series: kwh_ts
ARIMA(0,0,3)(2,1,0)[12] with drift
Coefficients:
ma1 ma2 ma3 sar1 sar2 drift
0.3411 0.0308 0.2307 -0.6985 -0.4150 8998.942
s.e. 0.0792 0.0897 0.0749 0.0805 0.0795 3006.595
sigma^2 = 3.832e+11: log likelihood = -2627.93
AIC=5269.87 AICc=5270.52 BIC=5292.22
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -8119.868 589004.5 428520.6 -0.7969903 6.535785 0.7046532
ACF1
Training set -0.01306416
# Fill in missing values using Kalman smoothing
kwh_filled <- na_kalman(kwh_ts, model = "auto.arima")

# Compare original vs filled (optional)
cbind(original = kwh_ts, filled = kwh_filled)[which(is.na(kwh_ts)), ]
original filled
[1,] NA 7549357
[2,] NA 7684639
We can see that 2 NAs were replaced with values expected at that point of the time series.
# Plot a time series for new exploration
autoplot(kwh_filled) +
  labs(title = "Power Usage with Outliers and Missing Values Filled via ARIMA Kalman",
       x = "Date", y = "KWH") +
  theme_minimal()
We see that the extreme dip in 2010 was replaced with a more plausible value that aligns with the seasonal patterns. Additionally, the series is now continuous, with no remaining NAs.
STL Decomposition
We can now move to STL decomposition, to visualize more clearly each component.
power_ts <- tibble(
  Month = yearmonth(seq(as.Date("1998-01-01"), by = "month", length.out = length(kwh_filled))),
  KWH = as.numeric(kwh_filled)
) %>%
  as_tsibble(index = Month)

# Perform STL decomposition
power_ts %>%
  model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition of Residential Power Usage")
The plot makes the steady upward trend more evident, with a slight decrease after 2012. We also confirm a very strong and regular seasonal pattern, with peaks corresponding to winter and summer months. Additionally, the seasonality is stable over the years.
power_ts %>%
  gg_subseries(KWH) +
  labs(title = "Residential Power Usage by Month")
The Residential Power Usage by Month plot confirms that the months with the highest power usage are August, January, July, and September.
At this point we can move ahead and compare different models for the series.
We see that some models obtained a negative AICc, which happens when the log-likelihood term is large enough to outweigh the penalty terms.
Overall, ARIMA comes out as the model with the best fit, evidenced by the lowest AICc. The transformed additive model also had a low RMSE, but we will move forward with the ARIMA model.
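The comparison itself is not shown; the sketch below builds the power_fit object used in the forecast chunk, assuming a Guerrero-selected lambda (lambda_kwh is an assumed name) and candidate models analogous to those used for the ATMs; the exact ETS specifications are assumptions.

# Box-Cox lambda for the power series
lambda_kwh <- power_ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

# Fit candidate models and compare fit statistics
power_fit <- power_ts %>%
  model(
    additive          = ETS(box_cox(KWH, lambda_kwh) ~ error("A") + trend("A") + season("A")),
    multiplicative_ts = ETS(KWH ~ error("M") + trend("A") + season("M")),
    ARIMA             = ARIMA(box_cox(KWH, lambda_kwh)),
    snaive            = SNAIVE(KWH)
  )

power_fit %>% glance() %>% arrange(AICc)

# Report of the selected ARIMA
power_fit %>% select(ARIMA) %>% report()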
We find a log-likelihood of around 750 for the selected ARIMA, consistent with it being the best-fitting candidate, so we can continue with the forecast.
Forecasting for Power Usage Data
# Forecast for all 12 months of 2014
power_forecast <- power_fit %>%
  select(ARIMA) %>%
  forecast(h = "12 months")

# Plotting the forecast
autoplot(power_forecast, power_ts) +
  labs(title = "Forecast of Residential Power Usage (2014)", y = "KWH", x = "Month") +
  theme_minimal()
The residuals hover around zero with no visible patterns, the ACF shows most values falling within the confidence bounds, and the histogram shows a roughly normal distribution.
The forecast follows a clear seasonal pattern, consistent with previous years, indicating that power usage is driven by recurring seasonal behaviors. The relatively narrow width of the confidence intervals suggests high confidence in the model’s predictions. Notably, no unusual spikes or drops are predicted, supporting operational stability for electricity demand in 2014.
Forecasted demand aligns well with seasonal trends from previous years, which would enable energy providers to optimize generation schedules, maintenance windows, and staffing. Additionally these projections can be used to estimate power procurement costs and revenue expectations.
Exporting the data
# Creating a data frame with the desired columns and adding the Case Sequence numbers
power_forecast_exp <- power_forecast %>%
  as.data.frame() %>%
  select(Month, .mean) %>%
  rename(KWH = .mean) %>%
  mutate(CaseSequence = 925:936) %>%
  relocate(CaseSequence)

# Export file to CSV
power_forecast_exp %>%
  write.csv("Power_forecast.csv")
Water Pipes
The final dataset consists of two files, each containing a Date Time column and a WaterFlow column; each file corresponds to one pipe. The goal of this final analysis is to time-sequence the data and aggregate it by hour, taking the mean when there are multiple recordings within an hour. We then determine whether the data is stationary and whether it can be forecast; if so, we provide a week-ahead forecast.
Rows: 1000 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Date Time
dbl (1): WaterFlow
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1000 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Date Time
dbl (1): WaterFlow
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(pipe1)
# A tibble: 6 × 2
`Date Time` WaterFlow
<chr> <dbl>
1 10/23/15 12:24 AM 23.4
2 10/23/15 12:40 AM 28.0
3 10/23/15 12:53 AM 23.1
4 10/23/15 12:55 AM 30.0
5 10/23/15 1:19 AM 6.00
6 10/23/15 1:23 AM 15.9
head(pipe2)
# A tibble: 6 × 2
`Date Time` WaterFlow
<chr> <dbl>
1 10/23/15 1:00 AM 18.8
2 10/23/15 2:00 AM 43.1
3 10/23/15 3:00 AM 38.0
4 10/23/15 4:00 AM 36.1
5 10/23/15 5:00 AM 31.9
6 10/23/15 6:00 AM 28.2
Prepping the Data
When we loaded the data we saw that Pipe 1 had several entries per hour, and the instructions tell us to consolidate them into one entry per hour using the mean. Pipe 2 is already close to the correct hourly format and has unique hourly readings (verified with code that is not shown here to avoid clutter).
# First we will parse the date and aggregate using the mean per hour
pipe1_clean <- pipe1 %>%
  # Replace the DateTime with rounded hour units
  mutate(DateTime = mdy_hm(`Date Time`)) %>%
  mutate(DateTime = floor_date(DateTime, unit = "hour")) %>%
  group_by(DateTime) %>%
  # Aggregate by mean per hour
  summarise(WaterFlow = mean(WaterFlow, na.rm = TRUE)) %>%
  ungroup()
Next, we change the format of Pipe 2 so both datasets are consistent, and label each one by pipe.
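That step is not shown; here is a sketch, assuming Pipe 2's timestamps only need parsing (pipe2_clean is an assumed name) and that each dataset gets a Pipe label before merging, matching the pipe1_labeled and pipe2_labeled objects used below.

# Parse Pipe 2 timestamps (already one reading per hour)
pipe2_clean <- pipe2 %>%
  mutate(DateTime = mdy_hm(`Date Time`)) %>%
  select(DateTime, WaterFlow)

# Label each dataset before combining
pipe1_labeled <- pipe1_clean %>% mutate(Pipe = "Pipe1")
pipe2_labeled <- pipe2_clean %>% mutate(Pipe = "Pipe2")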
Now we can merge the data for both pipes into one tidy structure.
# Combining the two datasets into one
combined_pipes <- bind_rows(pipe1_labeled, pipe2_labeled)

# Arranging by time
combined_pipes <- combined_pipes %>%
  arrange(DateTime)

head(combined_pipes)
ggplot(combined_pipes, aes(x = DateTime, y = WaterFlow)) +
  geom_line(color = "steelblue") +
  facet_wrap(~ Pipe, scales = "free_y", ncol = 1) +
  labs(title = "Hourly Water Flow by Pipe",
       x = "Date and Time",
       y = "Water Flow (units)") +
  theme_minimal()
summary(combined_pipes)
DateTime WaterFlow Pipe
Min. :2015-10-23 00:00:00.0 Min. : 1.885 Length:1236
1st Qu.:2015-10-29 11:45:00.0 1st Qu.:21.577 Class :character
Median :2015-11-07 22:30:00.0 Median :34.353 Mode :character
Mean :2015-11-09 19:35:23.2 Mean :35.801
3rd Qu.:2015-11-20 19:15:00.0 3rd Qu.:47.806
Max. :2015-12-03 16:00:00.0 Max. :78.303
No NAs are evident at this point, but that can be quickly confirmed:
# Checking for NAs in the combined dataset
combined_pipes %>%
  summarise(
    NA_DateTime = sum(is.na(DateTime)),
    NA_WaterFlow = sum(is.na(WaterFlow)),
    NA_Pipe = sum(is.na(Pipe))
  )
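The outlier check referenced next is not shown; a sketch applying the same 1.5 × IQR rule per pipe:

# Count IQR-rule outliers per pipe
combined_pipes %>%
  group_by(Pipe) %>%
  summarise(
    n_outliers = sum(WaterFlow > quantile(WaterFlow, 0.75) + 1.5 * IQR(WaterFlow) |
                     WaterFlow < quantile(WaterFlow, 0.25) - 1.5 * IQR(WaterFlow))
  )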
Using the 1.5 × IQR rule, we find only 1 outlier in Pipe 1, while Pipe 2 has 0 outliers.
ggplot(combined_pipes, aes(x = Pipe, y = WaterFlow)) +
  geom_boxplot(fill = "skyblue") +
  labs(title = "Boxplot of Water Flow per Pipe", y = "Water Flow", x = "Pipe") +
  theme_minimal()
The boxplot illustrates that Pipe1 experiences stable, narrow-range flow, whereas Pipe2 has a much broader and more variable distribution. A single statistical outlier was detected in Pipe1, but Pipe2, despite its wide range, shows no extreme outliers by the IQR rule. These patterns suggest different operational behaviors or environmental conditions for each pipe.
We also see that Pipe 1 now has 4 NAs, most likely introduced when the series was regularized to a full hourly index. This is a good chance to handle those and the outlier at the same time. We will start by replacing the outlier with NA.
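Those steps are not shown; here is a sketch that builds a regular hourly tsibble for Pipe 1 (fill_gaps() would introduce the NA rows mentioned above) and flags the single high outlier with the same IQR rule, producing the pipe1_ts_cleaned object that is imputed below.

# Regular hourly tsibble for Pipe 1; fill_gaps() adds NA rows for missing hours
pipe1_ts_cleaned <- pipe1_clean %>%
  as_tsibble(index = DateTime) %>%
  fill_gaps() %>%
  # replace the single high IQR outlier with NA so it is imputed along with the gaps
  mutate(WaterFlow = if_else(
    WaterFlow > quantile(WaterFlow, 0.75, na.rm = TRUE) + 1.5 * IQR(WaterFlow, na.rm = TRUE),
    NA_real_, WaterFlow
  ))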
And now we can impute all missing values using Kalman smoothing.
# Impute values
pipe1_ts_cleaned <- pipe1_ts_cleaned %>%
  mutate(WaterFlow = na_kalman(WaterFlow, model = "StructTS"))

sum(is.na(pipe1_ts_cleaned$WaterFlow))  # If no NAs remain, this will show 0
[1] 0
Stationarity Testing
library(urca)  # provides ur.df()

# ADF test for Pipe1
adf_pipe1 <- ur.df(pipe1_ts_cleaned$WaterFlow, type = "drift", lags = 24)
summary(adf_pipe1)
For Pipe 1 the ADF test statistic (tau) of -2.6912 is greater than (less negative than) the 5% critical value of -2.88, so we fail to reject the null hypothesis of a unit root. Pipe 1 is therefore not stationary, and differencing is required before modeling.
For Pipe 2 the ADF test statistic (tau) of -5.2979 is much lower than the 1% critical value of -3.43, so we can reject the null hypothesis. Pipe 2 is stationary and can be modeled as is.
# First-order differencing
pipe1_diff <- pipe1_ts_cleaned %>%
  mutate(WaterFlow = difference(WaterFlow, lag = 1)) %>%
  filter(!is.na(WaterFlow))
# ADF test after differencing
adf_pipe1_diff <- ur.df(pipe1_diff$WaterFlow, type = "drift", lags = 24)
summary(adf_pipe1_diff)
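The model fit for Pipe 1 is not shown; here is a sketch that fits an automatically selected ARIMA to the cleaned, undifferenced series (ARIMA() chooses the differencing order itself) and forecasts one week ahead, producing the pipe1_model and pipe1_forecast objects used below.

# Fit an ARIMA to Pipe 1 and forecast one week (168 hourly steps) ahead
pipe1_model <- pipe1_ts_cleaned %>%
  model(ARIMA(WaterFlow))

pipe1_model %>% report()

pipe1_forecast <- pipe1_model %>%
  forecast(h = "7 days")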
pipe1_forecast %>%
  autoplot(pipe1_ts_cleaned) +
  labs(title = "Forecast of Water Flow (Pipe 1)",
       y = "Hourly Water Flow",
       x = "Time") +
  theme_minimal()
pipe1_model %>%
  gg_tsresiduals()
The ARIMA model projects a stable average water flow for Pipe 1 over the next week, with moderate uncertainty reflected in the forecast intervals.
The model suggests that no strong upward or downward trend is expected. Flow will likely remain within the 15 to 30 units/hour range, with a central forecast around 20 units/hour.
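The corresponding fit for Pipe 2 is also not shown; here is a sketch producing the pipe2_ts, pipe2_model, and pipe2_forecast objects used below, assuming pipe2_ts is simply the hourly tsibble built from the parsed Pipe 2 data.

# Hourly tsibble and one-week ARIMA forecast for Pipe 2
pipe2_ts <- pipe2_clean %>%
  as_tsibble(index = DateTime)

pipe2_model <- pipe2_ts %>%
  model(ARIMA(WaterFlow))

pipe2_forecast <- pipe2_model %>%
  forecast(h = "7 days")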
pipe2_forecast %>%
  autoplot(pipe2_ts) +
  labs(title = "Forecast of Water Flow (Pipe 2)",
       y = "Hourly Water Flow",
       x = "Time") +
  theme_minimal()
pipe2_model %>%
  gg_tsresiduals()
The ARIMA model projects that hourly water flow in Pipe 2 will continue to fluctuate around its long-term average of approximately 40 units/hour.
Forecast intervals are relatively tight and stable, reflecting the model’s confidence in its predictions despite the system’s natural variability.
Exporting the Data
# Add label to each forecast
pipe1_forecast <- pipe1_forecast %>%
  mutate(Pipe = "Pipe1")

pipe2_forecast <- pipe2_forecast %>%
  mutate(Pipe = "Pipe2")
# We can now combine both forecasts to export
pipes_forecast <- bind_rows(pipe1_forecast, pipe2_forecast) %>%
  as.data.frame() %>%
  select(DateTime, Pipe, .mean) %>%
  rename(Waterflow = .mean)

# export file
pipes_forecast %>%
  write.csv("Pipes_Forecasts.csv", row.names = FALSE)