Objective

This project comprises time series forecasting Techniques for prediction of future values in three different domains:

ATM Cash Withdrawals:cash withdrawals will be predicted from ATM machines for May 2010 Household Power Consumption: For the year of 2014 will Forcast the power of consumption in KWh Water Flow Data(bonus)- will use time series analysis to Aggregate and forecast water flow data.

The purpose is to use acceptable forecasting methodologies, assess the outcomes, and provide the conclusions in a structured, reproducible format.

The goal is to Load appropriate datasets. Handle the missig data coveniently. Apply time series forecasing models such as ARIMA/ETS. Illustrade trends and forecast providing visualizations. lastly, will generate Excel-readable forecast outputs.

Deliverable

Deliverables Written Report (RPubs link to share findings, analysis and visuals)

R Markdown File (.rmd to display the code and analysis)

Excel-readable forecast outputs- The final forecasts for ATM withdrawals, residential power consumption, and water flow in CSV format.

Load Packages

library(readxl)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tseries)
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(quantmod)
## Loading required package: TTR
library(readr)

Loading and Inspect Data

  1. ATM Cash Withdrawals(ATM624Data.xlsx) The following datasets contains the information for cash withdrawals amounts from different ATM machines. Will forecast the amount of cash needed to be withdrawn in May 2010.
ATM624Data <- readxl::read_excel("ATM624Data.xlsx")

# checking data structure and format
str(ATM624Data)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
##  $ DATE: num [1:1474] 39934 39934 39935 39935 39936 ...
##  $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
##  $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
head(ATM624Data)
## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

The datasets display: DATE, cash and ATM_ID

Visualization of Cash Withdrawals

to understand the trend will create a time series plot to inspect visually the patterns over time. Will also handle missing value wich is a common issue in time series forcasting.

#Identifying missing value
sum(is.na(ATM624Data))
## [1] 33
colSums(is.na(ATM624Data))
## DATE  ATM Cash 
##    0   14   19
# Input Missing Values
ATM624Data_clean <- na.omit(ATM624Data)

# filling missing value to remain by using inputation methods:

library(zoo)

ATM624Data$Cash <- na.locf(ATM624Data$Cash, na.rm = FALSE)
# Ensuring that the 'DATE' column is of Date type
ATM624Data_clean$DATE <- as.Date(ATM624Data_clean$DATE, format="%Y-%m-%d")


# Generate the plot for ATM cash withdrawals over time
ggplot(ATM624Data_clean, aes(x = DATE, y = Cash)) +
  geom_line(color = "blue", size = 1) +  # Adding a line with color and size
  labs(title = "ATM Cash Withdrawals", 
       x = "Date", 
       y = "Cash Withdrawn (Hundreds of Dollars)") +  # Labels for the x and y axes
  theme_minimal() +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Time Series Forecasting: Atm Withdrawals (ARIMA Model)

Now that the data is clean.It’s time to use a forecasting model to estimate ATM withdrawals in May 2010 We will predict using the ARIMA (AutoRegressive Integrated Moving Average) model because the data is continuous and probably seasonal.

# convert data to time series object
atm_ts <- ts(ATM624Data$Cash, frequency = 12)  # Assuming monthly data
atm_model <- auto.arima(atm_ts)

# forecast 1 month ahead for may 2010
atm_forecast <- forecast(atm_model, h = 1)

#Autoplot the forecast
autoplot(atm_forecast) + labs(title = "Atm Cash Withdrawal Forecast for May 2010")

# now will save the forecast result to a CSV file

write.csv(atm_forecast,"ATM_forecast_May_2010.csv")

This ARIMA model generate the forecast,wich will assist on exporting a readable escel file later.

(B) Residential Power Consumption Forecasting

Load and Inspect the dataset

We use statistics on home electricity consumption in this section. With this information, we will project 2014’s electricity usage. A straightforward Excel file with monthly electricity consumption in kilowatt-hours (KWH) contains the dat

ResidentialCustomerForecastLoad <- readxl::read_excel("ResidentialCustomerForecastLoad-624.xlsx")

head(ResidentialCustomerForecastLoad)
## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147
# checking data structure and format 
str(ResidentialCustomerForecastLoad)
## tibble [192 × 3] (S3: tbl_df/tbl/data.frame)
##  $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
##  $ YYYY-MMM    : chr [1:192] "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
##  $ KWH         : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...

Processing and Converting the Data

The format used for the date column, YYYY-MMM, is not accepted by R as a standard date format. To make manipulation simpler, we’ll convert this to the appropriate Date format.

# Convert 'YYYY-MMM' to a proper date format

ResidentialCustomerForecastLoad$Date <- as.Date(paste0(ResidentialCustomerForecastLoad$`YYYY-MMM`, "-01"), format="%Y-%b-%d")

#check update format 

str(ResidentialCustomerForecastLoad)
## tibble [192 × 4] (S3: tbl_df/tbl/data.frame)
##  $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
##  $ YYYY-MMM    : chr [1:192] "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
##  $ KWH         : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
##  $ Date        : Date[1:192], format: "1998-01-01" "1998-02-01" ...

Visualize The power consumption trend:

Plotting the power consumption trends over time allows us to examine the data and find any trends or seasonal patterns.

# Residential Power Consumption Line Plot
ggplot(ResidentialCustomerForecastLoad, aes(x = Date, y = KWH)) +
  geom_line(color = "darkblue", size = 1) +  
  labs(title = "Residential Power Consumption", 
       x = "Date", 
       y = "Power Consumption (KWH)") +  
  theme_minimal() +  # Clean and minimal theme
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 year") +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  

Time Series Forecasting for Residential Power Consumption

We are now forecasting power use for 2014 using the ARIMA model. This will help us anticipate power usage for the future year.

# convert to time series object
power_ts <- ts(ResidentialCustomerForecastLoad$KWH, frequency = 12, start = c(1998, 1))

# ARIMA model
power_model <- auto.arima(power_ts)

#forecast the year 2014 (12 months up)

power_forecast <- forecast(power_model, h = 12)

# Autoplot forecast

autoplot(power_forecast) + labs(title = "Residential Power Consumption Forecast for 2014")

Saving the forecast results into a CSV file for Excel-readable

write.csv(power_forecast, "ResidentialPowerForecast_2014.csv")

Water Flow Data (C-bonus)

#Load and Inspect Data

The water flow data is divided into two datasets: one for each water pipe. The idea is to aggregate these datasets by hour and then use forecasting. We begin by importing the data and analyzing its structure.

waterflow_data1 <- read_excel("Waterflow_Pipe1.xlsx")

waterflow_data2 <- read_excel("Waterflow_Pipe2.xlsx")

head(waterflow_data1)
## # A tibble: 6 × 2
##   `Date Time` WaterFlow
##         <dbl>     <dbl>
## 1      42300.     23.4 
## 2      42300.     28.0 
## 3      42300.     23.1 
## 4      42300.     30.0 
## 5      42300.      6.00
## 6      42300.     15.9
head(waterflow_data2)
## # A tibble: 6 × 2
##   `Date Time` WaterFlow
##         <dbl>     <dbl>
## 1      42300.      18.8
## 2      42300.      43.1
## 3      42300.      38.0
## 4      42300.      36.1
## 5      42300.      31.9
## 6      42300.      28.2

Checking the both Datasets structure

str(waterflow_data1)
## tibble [1,000 × 2] (S3: tbl_df/tbl/data.frame)
##  $ Date Time: num [1:1000] 42300 42300 42300 42300 42300 ...
##  $ WaterFlow: num [1:1000] 23.4 28 23.1 30 6 ...
str(waterflow_data2)
## tibble [1,000 × 2] (S3: tbl_df/tbl/data.frame)
##  $ Date Time: num [1:1000] 42300 42300 42300 42300 42300 ...
##  $ WaterFlow: num [1:1000] 18.8 43.1 38 36.1 31.9 ...

Merging and Aggreagating Data

Because the datasets contain time stamps, we need to combine them and aggregate the data by hour. We will calculate the average flow for each hour as needed.

# Convert Excel numeric date to proper date format
waterflow_data1$Timestamp <- as.POSIXct(waterflow_data1$`Date Time` * 86400, origin = "1899-12-30", tz = "UTC")

waterflow_data2$Timestamp <- as.POSIXct(waterflow_data2$`Date Time` * 86400, origin = "1899-12-30", tz = "UTC")

# Now merge both datasets by Timestamp
merged_waterflow <- full_join(
  waterflow_data1 %>% select(Timestamp, WaterFlow),
  waterflow_data2 %>% select(Timestamp, WaterFlow),
  by = "Timestamp"
)

# Aggregate data by hour
hourly_waterflow<- merged_waterflow %>%
  mutate(hour = floor_date(Timestamp, "hour")) %>%
  group_by(hour) %>%
  summarise(mean_flow1 = mean(WaterFlow.x, na.rm = TRUE),
            mean_flow2 = mean(WaterFlow.y, na.rm = TRUE))

# confirming Aggreagated data

head(hourly_waterflow)
## # A tibble: 6 × 3
##   hour                mean_flow1 mean_flow2
##   <dttm>                   <dbl>      <dbl>
## 1 2015-10-23 00:00:00       26.1      NaN  
## 2 2015-10-23 01:00:00       18.9       30.9
## 3 2015-10-23 02:00:00       15.2      NaN  
## 4 2015-10-23 03:00:00       23.1       38.0
## 5 2015-10-23 04:00:00       15.5       34.0
## 6 2015-10-23 05:00:00       22.7      NaN
# Check for missing values in both columns
sum(is.na(hourly_waterflow$mean_flow1))
## [1] 510
sum(is.na(hourly_waterflow$mean_flow2))
## [1] 79
summary(hourly_waterflow$mean_flow1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   8.923  17.033  19.784  19.893  22.790  31.730     510
summary(hourly_waterflow$mean_flow2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.885  30.005  39.814  39.779  49.321  78.303      79
# Remove rows with NA values in either flow columns
hourly_waterflow_clean <- hourly_waterflow %>%
  filter(!is.na(mean_flow1) & !is.na(mean_flow2))

# Create ggplot for cleaned hourly water flow data
ggplot(hourly_waterflow_clean, aes(x = hour)) +
  geom_line(aes(y = mean_flow1, color = "Flow 1"), size = 1) +
  geom_line(aes(y = mean_flow2, color = "Flow 2"), size = 1) +
  labs(title = "Hourly Water Flow Trend (Cleaned Data)",
       x = "Hour",
       y = "Mean Water Flow",
       color = "Legend") +
  theme_minimal() +
  scale_color_manual(values = c("Flow 1" = "pink", "Flow 2" = "green")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  

Stationarity Check and Forecasting

For forecasting, lets check if the time series is stationary by using the Augmented Dickey-Fuller (ADF) test to check for stationary.

# ADF test


adf.test(na.omit(hourly_waterflow$mean_flow1), alternative = "stationary")
## Warning in adf.test(na.omit(hourly_waterflow$mean_flow1), alternative =
## "stationary"): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  na.omit(hourly_waterflow$mean_flow1)
## Dickey-Fuller = -6.2649, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
adf.test(na.omit(hourly_waterflow$mean_flow2), alternative = "stationary")
## Warning in adf.test(na.omit(hourly_waterflow$mean_flow2), alternative =
## "stationary"): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  na.omit(hourly_waterflow$mean_flow2)
## Dickey-Fuller = -8.6584, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

we can see that the result fro the ADF test show that both mean_flow1 and mean_flow2 are stationary becasue Dickey-Fuller Statistic for both is negative: mean_flow1=-6.2649 mean_flow2=–8.6584 Therefore, we gather that p-value of both test is 0.01 much smaller than typical significance of level 0.05

Interpretation

The ADF test rejects the null hypothesis that a time series is non-stationary, revealing that both mean_flow1 and mean_flow2 are stationary, implying that they are forecastable.

Base on the interpretation we gather the result to be starionarry , will proceed on applying ARIMA/ETS model

Forecasting using ARIMA model

will apply ARIMAto each of the stationary time series for both mean_flow1 and mean_flow2 Then, will generate a forecast using specified period for the next month. Moreover, will evaluate the model with diagnostics like residual plots and ACF/PACF plots.

Applying ARIMA model to dataset

# Aggregating data by month (for example, using lubridate for month extraction)


monthly_waterflow <- hourly_waterflow_clean %>%
  mutate(month = floor_date(hour, "month")) %>%
  group_by(month) %>%
  summarise(mean_flow1 = mean(mean_flow1, na.rm = TRUE),
            mean_flow2 = mean(mean_flow2, na.rm = TRUE))

# Check the result
head(monthly_waterflow)
## # A tibble: 2 × 3
##   month               mean_flow1 mean_flow2
##   <dttm>                   <dbl>      <dbl>
## 1 2015-10-01 00:00:00       19.9       41.4
## 2 2015-11-01 00:00:00       19.3       37.2
arima_model_flow1 <- auto.arima(monthly_waterflow$mean_flow1, seasonal = TRUE)

summary(arima_model_flow1)
## Series: monthly_waterflow$mean_flow1 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##          mean
##       19.5837
## s.e.   0.1941
## 
## sigma^2 = 0.1507:  log likelihood = -0.25
## AIC=4.5   AICc=-7.5   BIC=1.89
## 
## Training set error measures:
##                        ME      RMSE       MAE         MPE     MAPE MASE ACF1
## Training set 6.003731e-11 0.2744771 0.2744771 -0.01964758 1.401836  0.5 -0.5
# Example of aggregating data to monthly level
monthly_waterflow <- hourly_waterflow_clean %>%
  mutate(month = floor_date(hour, "month")) %>%
  group_by(month) %>%
  summarise(mean_flow1 = mean(mean_flow1, na.rm = TRUE), 
            mean_flow2 = mean(mean_flow2, na.rm = TRUE))

# Create monthly aggregated data
monthly_waterflow <- hourly_waterflow_clean %>%
  mutate(month = floor_date(hour, "month")) %>%
  group_by(month) %>%
  summarise(mean_flow1 = mean(mean_flow1, na.rm = TRUE),
            mean_flow2 = mean(mean_flow2, na.rm = TRUE))

# Apply ARIMA model on the aggregated data
arima_model_flow1 <- auto.arima(monthly_waterflow$mean_flow1, seasonal = TRUE)

# Check the model summary
summary(arima_model_flow1)
## Series: monthly_waterflow$mean_flow1 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##          mean
##       19.5837
## s.e.   0.1941
## 
## sigma^2 = 0.1507:  log likelihood = -0.25
## AIC=4.5   AICc=-7.5   BIC=1.89
## 
## Training set error measures:
##                        ME      RMSE       MAE         MPE     MAPE MASE ACF1
## Training set 6.003731e-11 0.2744771 0.2744771 -0.01964758 1.401836  0.5 -0.5
arima_model_flow2 <- auto.arima(monthly_waterflow$mean_flow2, seasonal = TRUE)

summary(arima_model_flow2)
## Series: monthly_waterflow$mean_flow2 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##          mean
##       39.2616
## s.e.   1.4784
## 
## sigma^2 = 8.742:  log likelihood = -4.31
## AIC=12.63   AICc=0.63   BIC=10.01
## 
## Training set error measures:
##                        ME     RMSE      MAE       MPE    MAPE MASE ACF1
## Training set 3.552714e-15 2.090661 2.090661 -0.284357 5.34009  0.5 -0.5

Generate Forecasts

Once we have the ARIMA models, we may create forecasts for a certain time period. Let us anticipate the next 12 months for both water flow datasets.

# Next 12months forecast mean_flow1

forecast_flow1 <- forecast(arima_model_flow1, h = 12)

autoplot(forecast_flow1) + 
  labs(title = "Forecast for Water Flow 1", x = "Time", y = "Water Flow 1 (Mean)") +
  theme_minimal()

Next 12months forecast mean_flow2

forecast_flow2 <- forecast(arima_model_flow2, h = 12)

autoplot(forecast_flow2) + 
  labs(title = "Forecast for Water Flow 2", x = "Time", y = "Water Flow 2 (Mean)") +
  theme_minimal()

# CSV file

write.csv(forecast_flow1, "WaterFlow_Forecast_Flow1.csv")
write.csv(forecast_flow2, "WaterFlow_Forecast_Flow2.csv")

Conclusion

This study focuses on time series forecasting across three domains: ATM cash withdrawals, home electricity use, and water flow data. The purpose is to use appropriate forecasting methodologies, evaluate the data, and deliver the conclusions in a repeatable fashion.

The deliverables include a written report (posted on RPubs), a R Markdown file (.rmd), and Excel-readable forecast outputs in CSV format.

The structure contains:

Introduction: Using datasets from ATM withdrawals, home power usage, and water flow, this paper emphasizes the necessity of time series forecasting in banking and energy.

Data exploration and pre-processing include loading data, addressing missing values, and showing trends with ggplot2 and ts.plot().

Forecasting Methodology:

ATM withdrawals: Testing ARIMA, ETS, and Naïve models.

Power Consumption: Time series segmentation and 12-month forecasts.

Water Flow: Aggregating data by the hour, checking stationarity, and using appropriate models.

Results and Discussion: Comparison of anticipated and actual data, model selection explanation, and business consequences.

Conclusion and Limitations: Summary of findings, proposed improvements, and future studies.

Resources include “Forecasting: Principles and Practice” and R packages like forecast, tseries, ggplot2, and dplyr.

Resources

Textbook Reference: Forecasting: Principles and Practice by Hyndman & Athanasopoulos.

R Packages:

forecast, tseries, ggplot2, dplyr, xts, readxl

Documentation & Examples:

https://otexts.com/fpp3/

https://rpubs.com/ for publishing.