This project comprises time series forecasting Techniques for prediction of future values in three different domains:
ATM Cash Withdrawals:cash withdrawals will be predicted from ATM machines for May 2010 Household Power Consumption: For the year of 2014 will Forcast the power of consumption in KWh Water Flow Data(bonus)- will use time series analysis to Aggregate and forecast water flow data.
The purpose is to use acceptable forecasting methodologies, assess the outcomes, and provide the conclusions in a structured, reproducible format.
The goal is to Load appropriate datasets. Handle the missig data coveniently. Apply time series forecasing models such as ARIMA/ETS. Illustrade trends and forecast providing visualizations. lastly, will generate Excel-readable forecast outputs.
Deliverables Written Report (RPubs link to share findings, analysis and visuals)
R Markdown File (.rmd to display the code and analysis)
Excel-readable forecast outputs- The final forecasts for ATM withdrawals, residential power consumption, and water flow in CSV format.
library(readxl)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tseries)
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(quantmod)
## Loading required package: TTR
library(readr)
ATM624Data <- readxl::read_excel("ATM624Data.xlsx")
# checking data structure and format
str(ATM624Data)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
## $ DATE: num [1:1474] 39934 39934 39935 39935 39936 ...
## $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
head(ATM624Data)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
The datasets display: DATE, cash and ATM_ID
to understand the trend will create a time series plot to inspect visually the patterns over time. Will also handle missing value wich is a common issue in time series forcasting.
#Identifying missing value
sum(is.na(ATM624Data))
## [1] 33
colSums(is.na(ATM624Data))
## DATE ATM Cash
## 0 14 19
# Input Missing Values
ATM624Data_clean <- na.omit(ATM624Data)
# filling missing value to remain by using inputation methods:
library(zoo)
ATM624Data$Cash <- na.locf(ATM624Data$Cash, na.rm = FALSE)
# Ensuring that the 'DATE' column is of Date type
ATM624Data_clean$DATE <- as.Date(ATM624Data_clean$DATE, format="%Y-%m-%d")
# Generate the plot for ATM cash withdrawals over time
ggplot(ATM624Data_clean, aes(x = DATE, y = Cash)) +
geom_line(color = "blue", size = 1) + # Adding a line with color and size
labs(title = "ATM Cash Withdrawals",
x = "Date",
y = "Cash Withdrawn (Hundreds of Dollars)") + # Labels for the x and y axes
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Now that the data is clean.It’s time to use a forecasting model to estimate ATM withdrawals in May 2010 We will predict using the ARIMA (AutoRegressive Integrated Moving Average) model because the data is continuous and probably seasonal.
# convert data to time series object
atm_ts <- ts(ATM624Data$Cash, frequency = 12) # Assuming monthly data
atm_model <- auto.arima(atm_ts)
# forecast 1 month ahead for may 2010
atm_forecast <- forecast(atm_model, h = 1)
#Autoplot the forecast
autoplot(atm_forecast) + labs(title = "Atm Cash Withdrawal Forecast for May 2010")
# now will save the forecast result to a CSV file
write.csv(atm_forecast,"ATM_forecast_May_2010.csv")
This ARIMA model generate the forecast,wich will assist on exporting a readable escel file later.
Load and Inspect the dataset
We use statistics on home electricity consumption in this section. With this information, we will project 2014’s electricity usage. A straightforward Excel file with monthly electricity consumption in kilowatt-hours (KWH) contains the dat
ResidentialCustomerForecastLoad <- readxl::read_excel("ResidentialCustomerForecastLoad-624.xlsx")
head(ResidentialCustomerForecastLoad)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
# checking data structure and format
str(ResidentialCustomerForecastLoad)
## tibble [192 × 3] (S3: tbl_df/tbl/data.frame)
## $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
## $ YYYY-MMM : chr [1:192] "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
## $ KWH : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
The format used for the date column, YYYY-MMM, is not accepted by R as a standard date format. To make manipulation simpler, we’ll convert this to the appropriate Date format.
# Convert 'YYYY-MMM' to a proper date format
ResidentialCustomerForecastLoad$Date <- as.Date(paste0(ResidentialCustomerForecastLoad$`YYYY-MMM`, "-01"), format="%Y-%b-%d")
#check update format
str(ResidentialCustomerForecastLoad)
## tibble [192 × 4] (S3: tbl_df/tbl/data.frame)
## $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
## $ YYYY-MMM : chr [1:192] "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
## $ KWH : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
## $ Date : Date[1:192], format: "1998-01-01" "1998-02-01" ...
Plotting the power consumption trends over time allows us to examine the data and find any trends or seasonal patterns.
# Residential Power Consumption Line Plot
ggplot(ResidentialCustomerForecastLoad, aes(x = Date, y = KWH)) +
geom_line(color = "darkblue", size = 1) +
labs(title = "Residential Power Consumption",
x = "Date",
y = "Power Consumption (KWH)") +
theme_minimal() + # Clean and minimal theme
scale_x_date(date_labels = "%b %Y", date_breaks = "1 year") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We are now forecasting power use for 2014 using the ARIMA model. This will help us anticipate power usage for the future year.
# convert to time series object
power_ts <- ts(ResidentialCustomerForecastLoad$KWH, frequency = 12, start = c(1998, 1))
# ARIMA model
power_model <- auto.arima(power_ts)
#forecast the year 2014 (12 months up)
power_forecast <- forecast(power_model, h = 12)
# Autoplot forecast
autoplot(power_forecast) + labs(title = "Residential Power Consumption Forecast for 2014")
Saving the forecast results into a CSV file for Excel-readable
write.csv(power_forecast, "ResidentialPowerForecast_2014.csv")
#Load and Inspect Data
The water flow data is divided into two datasets: one for each water pipe. The idea is to aggregate these datasets by hour and then use forecasting. We begin by importing the data and analyzing its structure.
waterflow_data1 <- read_excel("Waterflow_Pipe1.xlsx")
waterflow_data2 <- read_excel("Waterflow_Pipe2.xlsx")
head(waterflow_data1)
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dbl> <dbl>
## 1 42300. 23.4
## 2 42300. 28.0
## 3 42300. 23.1
## 4 42300. 30.0
## 5 42300. 6.00
## 6 42300. 15.9
head(waterflow_data2)
## # A tibble: 6 × 2
## `Date Time` WaterFlow
## <dbl> <dbl>
## 1 42300. 18.8
## 2 42300. 43.1
## 3 42300. 38.0
## 4 42300. 36.1
## 5 42300. 31.9
## 6 42300. 28.2
str(waterflow_data1)
## tibble [1,000 × 2] (S3: tbl_df/tbl/data.frame)
## $ Date Time: num [1:1000] 42300 42300 42300 42300 42300 ...
## $ WaterFlow: num [1:1000] 23.4 28 23.1 30 6 ...
str(waterflow_data2)
## tibble [1,000 × 2] (S3: tbl_df/tbl/data.frame)
## $ Date Time: num [1:1000] 42300 42300 42300 42300 42300 ...
## $ WaterFlow: num [1:1000] 18.8 43.1 38 36.1 31.9 ...
Merging and Aggreagating Data
Because the datasets contain time stamps, we need to combine them and aggregate the data by hour. We will calculate the average flow for each hour as needed.
# Convert Excel numeric date to proper date format
waterflow_data1$Timestamp <- as.POSIXct(waterflow_data1$`Date Time` * 86400, origin = "1899-12-30", tz = "UTC")
waterflow_data2$Timestamp <- as.POSIXct(waterflow_data2$`Date Time` * 86400, origin = "1899-12-30", tz = "UTC")
# Now merge both datasets by Timestamp
merged_waterflow <- full_join(
waterflow_data1 %>% select(Timestamp, WaterFlow),
waterflow_data2 %>% select(Timestamp, WaterFlow),
by = "Timestamp"
)
# Aggregate data by hour
hourly_waterflow<- merged_waterflow %>%
mutate(hour = floor_date(Timestamp, "hour")) %>%
group_by(hour) %>%
summarise(mean_flow1 = mean(WaterFlow.x, na.rm = TRUE),
mean_flow2 = mean(WaterFlow.y, na.rm = TRUE))
# confirming Aggreagated data
head(hourly_waterflow)
## # A tibble: 6 × 3
## hour mean_flow1 mean_flow2
## <dttm> <dbl> <dbl>
## 1 2015-10-23 00:00:00 26.1 NaN
## 2 2015-10-23 01:00:00 18.9 30.9
## 3 2015-10-23 02:00:00 15.2 NaN
## 4 2015-10-23 03:00:00 23.1 38.0
## 5 2015-10-23 04:00:00 15.5 34.0
## 6 2015-10-23 05:00:00 22.7 NaN
# Check for missing values in both columns
sum(is.na(hourly_waterflow$mean_flow1))
## [1] 510
sum(is.na(hourly_waterflow$mean_flow2))
## [1] 79
summary(hourly_waterflow$mean_flow1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 8.923 17.033 19.784 19.893 22.790 31.730 510
summary(hourly_waterflow$mean_flow2)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.885 30.005 39.814 39.779 49.321 78.303 79
# Remove rows with NA values in either flow columns
hourly_waterflow_clean <- hourly_waterflow %>%
filter(!is.na(mean_flow1) & !is.na(mean_flow2))
# Create ggplot for cleaned hourly water flow data
ggplot(hourly_waterflow_clean, aes(x = hour)) +
geom_line(aes(y = mean_flow1, color = "Flow 1"), size = 1) +
geom_line(aes(y = mean_flow2, color = "Flow 2"), size = 1) +
labs(title = "Hourly Water Flow Trend (Cleaned Data)",
x = "Hour",
y = "Mean Water Flow",
color = "Legend") +
theme_minimal() +
scale_color_manual(values = c("Flow 1" = "pink", "Flow 2" = "green")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
For forecasting, lets check if the time series is stationary by using the Augmented Dickey-Fuller (ADF) test to check for stationary.
# ADF test
adf.test(na.omit(hourly_waterflow$mean_flow1), alternative = "stationary")
## Warning in adf.test(na.omit(hourly_waterflow$mean_flow1), alternative =
## "stationary"): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: na.omit(hourly_waterflow$mean_flow1)
## Dickey-Fuller = -6.2649, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
adf.test(na.omit(hourly_waterflow$mean_flow2), alternative = "stationary")
## Warning in adf.test(na.omit(hourly_waterflow$mean_flow2), alternative =
## "stationary"): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: na.omit(hourly_waterflow$mean_flow2)
## Dickey-Fuller = -8.6584, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
we can see that the result fro the ADF test show that both mean_flow1 and mean_flow2 are stationary becasue Dickey-Fuller Statistic for both is negative: mean_flow1=-6.2649 mean_flow2=–8.6584 Therefore, we gather that p-value of both test is 0.01 much smaller than typical significance of level 0.05
The ADF test rejects the null hypothesis that a time series is non-stationary, revealing that both mean_flow1 and mean_flow2 are stationary, implying that they are forecastable.
Base on the interpretation we gather the result to be starionarry , will proceed on applying ARIMA/ETS model
will apply ARIMAto each of the stationary time series for both mean_flow1 and mean_flow2 Then, will generate a forecast using specified period for the next month. Moreover, will evaluate the model with diagnostics like residual plots and ACF/PACF plots.
# Aggregating data by month (for example, using lubridate for month extraction)
monthly_waterflow <- hourly_waterflow_clean %>%
mutate(month = floor_date(hour, "month")) %>%
group_by(month) %>%
summarise(mean_flow1 = mean(mean_flow1, na.rm = TRUE),
mean_flow2 = mean(mean_flow2, na.rm = TRUE))
# Check the result
head(monthly_waterflow)
## # A tibble: 2 × 3
## month mean_flow1 mean_flow2
## <dttm> <dbl> <dbl>
## 1 2015-10-01 00:00:00 19.9 41.4
## 2 2015-11-01 00:00:00 19.3 37.2
arima_model_flow1 <- auto.arima(monthly_waterflow$mean_flow1, seasonal = TRUE)
summary(arima_model_flow1)
## Series: monthly_waterflow$mean_flow1
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 19.5837
## s.e. 0.1941
##
## sigma^2 = 0.1507: log likelihood = -0.25
## AIC=4.5 AICc=-7.5 BIC=1.89
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 6.003731e-11 0.2744771 0.2744771 -0.01964758 1.401836 0.5 -0.5
# Example of aggregating data to monthly level
monthly_waterflow <- hourly_waterflow_clean %>%
mutate(month = floor_date(hour, "month")) %>%
group_by(month) %>%
summarise(mean_flow1 = mean(mean_flow1, na.rm = TRUE),
mean_flow2 = mean(mean_flow2, na.rm = TRUE))
# Create monthly aggregated data
monthly_waterflow <- hourly_waterflow_clean %>%
mutate(month = floor_date(hour, "month")) %>%
group_by(month) %>%
summarise(mean_flow1 = mean(mean_flow1, na.rm = TRUE),
mean_flow2 = mean(mean_flow2, na.rm = TRUE))
# Apply ARIMA model on the aggregated data
arima_model_flow1 <- auto.arima(monthly_waterflow$mean_flow1, seasonal = TRUE)
# Check the model summary
summary(arima_model_flow1)
## Series: monthly_waterflow$mean_flow1
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 19.5837
## s.e. 0.1941
##
## sigma^2 = 0.1507: log likelihood = -0.25
## AIC=4.5 AICc=-7.5 BIC=1.89
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 6.003731e-11 0.2744771 0.2744771 -0.01964758 1.401836 0.5 -0.5
arima_model_flow2 <- auto.arima(monthly_waterflow$mean_flow2, seasonal = TRUE)
summary(arima_model_flow2)
## Series: monthly_waterflow$mean_flow2
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 39.2616
## s.e. 1.4784
##
## sigma^2 = 8.742: log likelihood = -4.31
## AIC=12.63 AICc=0.63 BIC=10.01
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 3.552714e-15 2.090661 2.090661 -0.284357 5.34009 0.5 -0.5
Once we have the ARIMA models, we may create forecasts for a certain time period. Let us anticipate the next 12 months for both water flow datasets.
# Next 12months forecast mean_flow1
forecast_flow1 <- forecast(arima_model_flow1, h = 12)
autoplot(forecast_flow1) +
labs(title = "Forecast for Water Flow 1", x = "Time", y = "Water Flow 1 (Mean)") +
theme_minimal()
forecast_flow2 <- forecast(arima_model_flow2, h = 12)
autoplot(forecast_flow2) +
labs(title = "Forecast for Water Flow 2", x = "Time", y = "Water Flow 2 (Mean)") +
theme_minimal()
# CSV file
write.csv(forecast_flow1, "WaterFlow_Forecast_Flow1.csv")
write.csv(forecast_flow2, "WaterFlow_Forecast_Flow2.csv")
Conclusion
This study focuses on time series forecasting across three domains: ATM cash withdrawals, home electricity use, and water flow data. The purpose is to use appropriate forecasting methodologies, evaluate the data, and deliver the conclusions in a repeatable fashion.
The deliverables include a written report (posted on RPubs), a R Markdown file (.rmd), and Excel-readable forecast outputs in CSV format.
The structure contains:
Introduction: Using datasets from ATM withdrawals, home power usage, and water flow, this paper emphasizes the necessity of time series forecasting in banking and energy.
Data exploration and pre-processing include loading data, addressing missing values, and showing trends with ggplot2 and ts.plot().
Forecasting Methodology:
ATM withdrawals: Testing ARIMA, ETS, and Naïve models.
Power Consumption: Time series segmentation and 12-month forecasts.
Water Flow: Aggregating data by the hour, checking stationarity, and using appropriate models.
Results and Discussion: Comparison of anticipated and actual data, model selection explanation, and business consequences.
Conclusion and Limitations: Summary of findings, proposed improvements, and future studies.
Resources include “Forecasting: Principles and Practice” and R packages like forecast, tseries, ggplot2, and dplyr.
Textbook Reference: Forecasting: Principles and Practice by Hyndman & Athanasopoulos.
R Packages:
forecast, tseries, ggplot2, dplyr, xts, readxl
Documentation & Examples:
https://rpubs.com/ for publishing.