In Part A, the task is to forecast how much cash is withdrawn from four different ATMs in May 2010. The data are given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that, it is straightforward. The instructions are deliberately somewhat ambiguous to give the exercise a slightly more professional tone. Explain and demonstrate your process, the techniques used and not used, and your actual forecast. The data are given via an Excel file. Please provide your written report (findings, visuals, discussion, and R code) via an RPubs link along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage from January 1998 to December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
Part C consists of two datasets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate it by hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data are stationary and can be forecast. If so, provide a one-week-ahead forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel-readable file.
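Although Part C is not implemented below, the hourly aggregation it describes would look roughly like the following sketch; the file paths and the column names DateTime and WaterFlow are assumptions.

library(readxl)
library(dplyr)
library(lubridate)
pipe1 <- read_excel("Waterflow_Pipe1.xlsx")
pipe2 <- read_excel("Waterflow_Pipe2.xlsx")
# Stack the two pipes, bucket each reading into its hour, and average within the hour
hourly <- bind_rows(pipe1, pipe2) %>%
  mutate(Hour = floor_date(DateTime, unit = "hour")) %>%
  group_by(Hour) %>%
  summarise(WaterFlow = mean(WaterFlow), .groups = "drop")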
Load the required packages. The usual startup and masking messages (e.g., dplyr::filter() masking stats::filter(), tsibble::interval() masking lubridate::interval()) are omitted here for readability.

library(tidyverse)
library(tsibble)
library(gridExtra)
library(tseries)
library(urca)
library(scales)
library(readxl)
library(psych)
library(tidyr)
library(forecast)
library(openxlsx)
library(imputeTS)
library(zoo)
# fable/fabletools are assumed to be loaded as well, for model(), forecast(), STL(), ETS(), ARIMA(), etc.
Part A – ATM Forecasts from ATM624Data.xlsx
1. Data Preparation: Import and read the data into R
#| label: Import the data from my GitHub repository for reproducibility and view the first three rows
data <- read.csv("https://raw.githubusercontent.com/Heleinef/Data-Science-Master_Heleine/refs/heads/main/ATM624Data.csv")
head(data, n = 3)
DATE ATM Cash
1 5/1/2009 12:00:00 AM ATM1 96
2 5/1/2009 12:00:00 AM ATM2 107
3 5/2/2009 12:00:00 AM ATM1 82
library(readxl)
file_path <- "/Users/heleinefouda/Library/Mobile Documents/com~apple~CloudDocs/Machine Learning- Data 622/ATM624Data-11.xlsx"
atm_data <- read_excel(file_path)
atm_data
#| label: Convert the DATE column using Excel's origin date
atm_data$DATE <- as.Date(atm_data$DATE, origin = "1899-12-30")
# Ensure other columns are of the correct type
atm_data$ATM  <- as.character(atm_data$ATM)
atm_data$Cash <- as.numeric(atm_data$Cash)
The str() function reveals that we have a data frame. A conversion to a tsibble is necessary to make analysis and forecasting possible.
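A minimal sketch of that conversion, assuming DATE as the index and ATM as the key, and dropping the unlabeled placeholder rows:

atm_ts <- atm_data %>%
  filter(!is.na(ATM)) %>%             # drop the unlabeled May 2010 placeholder rows
  as_tsibble(index = DATE, key = ATM)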
1.3. Run summary statistics on the data frame
#| label: Summary statistics
summary(atm_data)
DATE ATM Cash
Min. :2009-05-01 Length:1474 Min. : 0.0
1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
Median :2009-11-01 Mode :character Median : 73.0
Mean :2009-10-31 Mean : 155.6
3rd Qu.:2010-02-01 3rd Qu.: 114.0
Max. :2010-05-14 Max. :10919.8
NA's :19
The summary statistics reveal 19 NA values in Cash that need to be dealt with before we make any forecast. As the listing below shows, 14 of them (dated 1–14 May 2010, with no ATM label) are simply placeholders for the forecast period; the remaining five are genuine gaps in ATM1 and ATM2.
sapply(atm_data, function(x) sum(is.na(x)))
DATE ATM Cash
0 14 19
#| label: Find the rows that have NA's
atm_data[which(is.na(atm_data$Cash)), ]
# A tibble: 19 × 3
DATE ATM Cash
<date> <chr> <dbl>
1 2009-06-13 ATM1 NA
2 2009-06-16 ATM1 NA
3 2009-06-18 ATM2 NA
4 2009-06-22 ATM1 NA
5 2009-06-24 ATM2 NA
6 2010-05-01 <NA> NA
7 2010-05-02 <NA> NA
8 2010-05-03 <NA> NA
9 2010-05-04 <NA> NA
10 2010-05-05 <NA> NA
11 2010-05-06 <NA> NA
12 2010-05-07 <NA> NA
13 2010-05-08 <NA> NA
14 2010-05-09 <NA> NA
15 2010-05-10 <NA> NA
16 2010-05-11 <NA> NA
17 2010-05-12 <NA> NA
18 2010-05-13 <NA> NA
19 2010-05-14 <NA> NA
1.4 Get a visual sense of the cash distribution per ATM
#| label: Filter the dataset to data up to April 30, 2010, where ATM values are non-missing
atm_data %>%
  filter(DATE < "2010-05-01", !is.na(ATM)) %>%
  ggplot(aes(x = Cash)) +
  geom_histogram(bins = 30, color = "lightblue") +
  facet_wrap(~ ATM, ncol = 2, scales = "free")
1.5. Take a closer look at each individual ATM
#| label: Take a closer look at the complete cases (the output below is a summary of the NA-free rows)
summary(atm_data %>% drop_na())
DATE ATM Cash
Min. :2009-05-01 Length:1455 Min. : 0.0
1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
Median :2009-10-31 Mode :character Median : 73.0
Mean :2009-10-30 Mean : 155.6
3rd Qu.:2010-01-29 3rd Qu.: 114.0
Max. :2010-04-30 Max. :10919.8
The summary statistics tell us that, across the 1,455 complete observations, the average cash value is about 155.6 (in hundreds of dollars), the minimum is 0.0, and the maximum is 10,919.8, an extreme driven by ATM4.
2.3. Pivot wider to get a parallel and broader view of all four ATMs
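The reshaping step that produces the summary below is sketched here; the filter to pre-May 2010 dates is an assumption based on the date range shown.

atm_wider <- atm_data %>%
  filter(DATE < "2010-05-01") %>%
  pivot_wider(names_from = ATM, values_from = Cash)
summary(atm_wider)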
DATE ATM1 ATM2 ATM3
Min. :2009-05-01 Min. : 1.00 Min. : 0.00 Min. : 0.0000
1st Qu.:2009-07-31 1st Qu.: 73.00 1st Qu.: 25.50 1st Qu.: 0.0000
Median :2009-10-30 Median : 91.00 Median : 67.00 Median : 0.0000
Mean :2009-10-30 Mean : 83.89 Mean : 62.58 Mean : 0.7206
3rd Qu.:2010-01-29 3rd Qu.:108.00 3rd Qu.: 93.00 3rd Qu.: 0.0000
Max. :2010-04-30 Max. :180.00 Max. :147.00 Max. :96.0000
NA's :3 NA's :2
ATM4
Min. : 1.563
1st Qu.: 124.334
Median : 403.839
Mean : 474.043
3rd Qu.: 704.507
Max. :10919.762
3. Exploratory data analysis (EDA)
3.1 Pivot longer
#| label: Pivot the data from wide to long format
atm_longer <- atm_wider %>%
  pivot_longer(cols = c("ATM1", "ATM2", "ATM3", "ATM4"),
               names_to = "ATM", values_to = "Cash")
head(atm_longer)
3.2: Grouped and combined visualization of the 4 ATM series
#| label: Time series graph by ATM
ggplot(atm_ts, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  labs(title = "ATM Cash Withdrawals", x = "Date", y = "Cash in ($US)") +
  theme_minimal()
#| label: Combined seasonal graphs using the facet_wrap function and standard ggplot
ggplot(atm_ts, aes(x = DATE, y = Cash, col = ATM)) +
  geom_line() +
  facet_wrap(~ATM, ncol = 2, scales = "free_y") +
  labs(title = "Cash withdrawal per ATM: Seasonality Plot", x = "Date", y = "Cash in ($US)") +
  theme_minimal()
#| label: Combined graphs using the time series autoplot function
atm_ts %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Cash Withdrawals")
4. Modeling and Forecasting each ATM as a separate time series
Our analysis of the four ATM series will lean heavily on seasonal and trend decomposition using Loess (STL), a method recommended as best practice for similar series in the third edition of Forecasting: Principles and Practice by Rob J. Hyndman & George Athanasopoulos. The authors present STL as a versatile and robust method for decomposing time series to assess trend and seasonality strength. They specifically highlight STL’s ability to handle any seasonal pattern, including those beyond monthly or quarterly data. They add that STL allows the seasonal component to evolve over time, with the rate of change controlled by the user, and that it can perform a robust decomposition that minimizes the influence of outliers on the trend-cycle and seasonal component estimates.
4.1 ATM1
plot(atm1_ts, main ="ATM1 Time Series", xlab ="Date", ylab ="Cash", type ="l", col ="pink", lwd =2)
We have opted to use imputation through interpolation, instead of using the mean to fill in missing values. Interpolation is generally recommended over mean imputation for most time series analyses, as it produces a smooth, trend-sensitive series, which aligns better with the STL’s need to detect trends and seasonality accurately.
4.1.1. Apply STL on ATM1
#| label: Apply STL decomposition on the ATM1 data
library(imputeTS)  # for na_interpolation()
stl_atm1 <- atm_ts %>%
  filter(ATM == "ATM1") %>%
  fill_gaps() %>%                            # fill implicit time gaps
  mutate(Cash = na_interpolation(Cash)) %>%  # impute missing values in the Cash column
  model(stl = STL(Cash ~ trend(window = 7) + season(window = "periodic"),
                  robust = TRUE))
# Extract components using fabletools and plot them
stl_components <- fabletools::components(stl_atm1)
autoplot(stl_components)
4.1.2. Check residuals
#| label: Build the ATM1 tsibble for ggtsdisplay
atm1_ts <- atm_ts %>%
  dplyr::filter(ATM == "ATM1") %>%
  dplyr::select(DATE, Cash) %>%
  as_tsibble(index = DATE)
# Check its structure
str(atm1_ts)
#| label: extract the 'Cash' column for plotting
ggtsdisplay(atm1_ts$Cash, points = FALSE, plot.type = "histogram")
#| label: Check residuals using the standard ggtsdisplay function
ggtsdisplay(atm1_ts$Cash, plot.type = "partial")
The ACF plot shows spikes outside the critical values at lags 5, 10, 11, 12, 13, 14, and 21. The non-stationary behavior revealed in the ACF plot of the remainder is evidence of some form of seasonality. Most importantly, the shape of the histogram violates the assumption of a normal distribution: the data appear right-skewed, indicating the need for a transformation before moving forward.
4.1.3. Apply a Box-Cox transformation & re-check residuals
#| label: Load the required library
library(MASS)  # for the boxcox() function
(Note that MASS masks dplyr::select(), which is why later calls use the explicit dplyr::select().)
#| label: Box-Cox transformation — profile the likelihood over a grid of lambdas
lambda <- boxcox(Cash ~ 1, data = atm1_ts, lambda = seq(-2, 2, by = 0.1))
#| label: Use the best lambda found
best_lambda <- lambda$x[which.max(lambda$y)]
# Perform the Box-Cox transformation
atm1_ts_log <- atm1_ts %>%
  mutate(Cash_BoxCox = (Cash^best_lambda - 1) / best_lambda)
# Extract the transformed 'Cash' column for plotting
ggtsdisplay(atm1_ts_log$Cash_BoxCox, points = FALSE, plot.type = "histogram")
Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_bin()`).
4.1.4. Run the Ljung-Box test
# Perform the Ljung-Box test on the Cash series
Box.test(atm1_ts$Cash, lag = 24, fitdf = 0, type = "Ljung-Box")
The results from the Ljung-Box test indicate that our Cash time series exhibits significant autocorrelation, a critical consideration for effective time series modeling and forecasting. Given that the p-value is far below the usual 0.05 significance level, we reject the null hypothesis: there is significant autocorrelation in the Cash series at the lags tested (up to lag 24).
4.1.5 Fit models and run accuracy metrics
# Fit an ETS model
ets_model <- atm1_ts %>% model(ETS(Cash))
Warning: 1 error encountered for ETS(Cash)
[1] ETS does not support missing values.
naive_model  <- atm1_ts %>% model(NAIVE(Cash))
snaive_model <- atm1_ts %>% model(SNAIVE(Cash))
# Combine the models into a list
models <- list(ets = ets_model, naive = naive_model, snaive = snaive_model)
# Generate forecasts for each model with a 30-day horizon
forecasts <- map(models, ~ forecast(.x, h = 30))
# Print the forecasts
print(forecasts)
$ets
# A fable: 30 x 4 [1D]
# Key: .model [1]
.model DATE Cash .mean
<chr> <date> <dist> <dbl>
1 ETS(Cash) 2010-05-01 NA NA
2 ETS(Cash) 2010-05-02 NA NA
3 ETS(Cash) 2010-05-03 NA NA
4 ETS(Cash) 2010-05-04 NA NA
5 ETS(Cash) 2010-05-05 NA NA
6 ETS(Cash) 2010-05-06 NA NA
7 ETS(Cash) 2010-05-07 NA NA
8 ETS(Cash) 2010-05-08 NA NA
9 ETS(Cash) 2010-05-09 NA NA
10 ETS(Cash) 2010-05-10 NA NA
# ℹ 20 more rows
$naive
# A fable: 30 x 4 [1D]
# Key: .model [1]
.model DATE Cash .mean
<chr> <date> <dist> <dbl>
1 NAIVE(Cash) 2010-05-01 N(85, 2516) 85
2 NAIVE(Cash) 2010-05-02 N(85, 5032) 85
3 NAIVE(Cash) 2010-05-03 N(85, 7548) 85
4 NAIVE(Cash) 2010-05-04 N(85, 10064) 85
5 NAIVE(Cash) 2010-05-05 N(85, 12580) 85
6 NAIVE(Cash) 2010-05-06 N(85, 15096) 85
7 NAIVE(Cash) 2010-05-07 N(85, 17611) 85
8 NAIVE(Cash) 2010-05-08 N(85, 20127) 85
9 NAIVE(Cash) 2010-05-09 N(85, 22643) 85
10 NAIVE(Cash) 2010-05-10 N(85, 25159) 85
# ℹ 20 more rows
$snaive
# A fable: 30 x 4 [1D]
# Key: .model [1]
.model DATE Cash .mean
<chr> <date> <dist> <dbl>
1 SNAIVE(Cash) 2010-05-01 N(85, 772) 85
2 SNAIVE(Cash) 2010-05-02 N(109, 772) 109
3 SNAIVE(Cash) 2010-05-03 N(74, 772) 74
4 SNAIVE(Cash) 2010-05-04 N(4, 772) 4
5 SNAIVE(Cash) 2010-05-05 N(96, 772) 96
6 SNAIVE(Cash) 2010-05-06 N(82, 772) 82
7 SNAIVE(Cash) 2010-05-07 N(85, 772) 85
8 SNAIVE(Cash) 2010-05-08 N(85, 1543) 85
9 SNAIVE(Cash) 2010-05-09 N(109, 1543) 109
10 SNAIVE(Cash) 2010-05-10 N(74, 1543) 74
# ℹ 20 more rows
Warning: 1 error encountered for ETS
[1] ETS does not support missing values.
# Fit a seasonal naive model on the Box-Cox transformed series.
# lambda_1 is assumed to be the Guerrero lambda for ATM1, computed as for ATM2 below.
fit_atm1_tsc <- atm1_ts %>%
  model(SNAIVE = SNAIVE(box_cox(Cash, lambda_1)))
fc_atm1_ts <- fit_atm1_tsc %>% forecast(h = 31)
fc_atm1_ts %>%
  autoplot(atm1_ts, level = 80) +
  labs(y = "Cash in hundreds of dollars",
       title = "ATM1 Forecast for May 2010")
4.1.6. Adding ARIMA to the models
# Ensure time gaps are filled and missing values are interpolated in `Cash`
atm1_ts <- atm1_ts %>%
  fill_gaps() %>%
  mutate(Cash = na_interpolation(Cash))  # impute missing values in the Cash column
# Fit multiple models in a single call, including the seasonal naive (SNAIVE)
atm1_models <- atm1_ts %>%
  model(
    ARIMA  = ARIMA(Cash),
    ETS    = ETS(Cash),
    NAIVE  = NAIVE(Cash),
    SNAIVE = SNAIVE(Cash)
  )
# Forecast 30 days ahead with each model
atm1_forecasts <- atm1_models %>% forecast(h = 30)
# View the forecast results
atm1_forecasts
4.2 ATM2

plot(atm2_ts, main = "ATM2 Time Series", xlab = "Date", ylab = "Cash",
     type = "l", col = "grey", lwd = 2)
library(zoo)
# Build a tsibble with DATE as the index and ATM as the key
# (note: this tsibble still contains all four ATM series)
atm2_ts <- atm_ts %>% as_tsibble(index = DATE, key = ATM)
sum(is.na(atm2_ts$Cash))  # returns 5: the genuine missing values noted earlier
[1] 5
# Impute missing values in the Cash column
atm2_ts$Cash <- na.approx(atm2_ts$Cash, na.rm = FALSE)
# Fit candidate models
model_results <- atm2_ts %>%
  model(
    RW    = RW(Cash),
    ETS   = ETS(Cash),
    Naive = NAIVE(Cash),
    Drift = NAIVE(Cash ~ drift()),
    ARIMA = ARIMA(Cash)
  )
# Generate forecasts
forecasts <- model_results %>% forecast(h = 30)
# Plot the forecasts
forecasts %>%
  autoplot(atm2_ts) +
  labs(title = "Forecasts with Different Predictive Models",
       x = "Time", y = "Forecasted Cash") +
  theme_minimal()
#| label: extract the 'Cash' column for plotting
ggtsdisplay(atm2_ts$Cash, points = FALSE, plot.type = "histogram")
ggtsdisplay(atm2_ts$Cash, plot.type ="partial" )
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
The presence of autocorrelation suggests the existence of underlying seasonal patterns or a trend in the data. We may consider an ARIMA model to capture them.
4.2.3 Apply necessary Transformations to stabilize the variance
# Calculate lambda using Guerrero's method
lambda_2 <- atm2_ts %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)
# Fit SNAIVE, ETS, and ARIMA models to the Box-Cox transformed data
fit_atm2_tsa <- atm2_ts %>%
  model(
    SNAIVE = SNAIVE(box_cox(Cash, lambda_2)),
    ETS    = ETS(box_cox(Cash, lambda_2) ~ error("A") + trend("A") + season("A")),
    ARIMA  = ARIMA(box_cox(Cash, lambda_2))
  )
# Fit the ETS model separately
fit_atm2_tsb <- atm2_ts %>%
  model(ETS = ETS(box_cox(Cash, lambda_2) ~ error("A") + trend("A") + season("A")))
# Fit the SNAIVE model separately
fit_atm2_tsc <- atm2_ts %>%
  model(SNAIVE = SNAIVE(box_cox(Cash, lambda_2)))
# Generate forecasts with a 31-day horizon
fc_atm2_ts <- fit_atm2_tsa %>% forecast(h = 31)
# Plot forecasts using autoplot()
fc_atm2_ts %>%
  autoplot(atm2_ts, level = 80) +
  labs(y = "",
       title = latex2exp::TeX(paste0("Transformed Cash withdrawals $\\lambda = ",
                                     round(lambda_2, 2))))
4.2.4 Use ARIMA() to automatically find the best ARIMA model
Apply ARIMA model alone on ATM2
#| label: residuals plots
atm2_ts <- atm2_ts %>%
  fill_gaps() %>%
  mutate(Cash = na_interpolation(Cash))  # impute missing values if necessary
fit_arima <- atm2_ts %>% model(ARIMA(Cash))
# Extract the residuals from the model (augment() keeps the DATE index and ATM key)
residuals_tsibble <- fit_arima %>%
  augment() %>%
  dplyr::select(Date = DATE, Residuals = .resid)
# Plot the residuals, ACF, and residual histogram for the ATM2 model
gg_tsresiduals(fit_arima %>% filter(ATM == "ATM2"))
4.2.5 Check residuals
#| label: plot the residuals directly
residuals_tsibble %>%
  ggplot(aes(x = Date, y = Residuals)) +
  geom_line() +
  labs(title = "Residuals of ARIMA Model", x = "Date", y = "Residuals") +
  theme_minimal()
4.2.6 Conduct a portmanteau test: run the Ljung-Box test
# Perform the Ljung-Box test on the Cash series
Box.test(atm2_ts$Cash, lag = 24, fitdf = 0, type = "Ljung-Box")
The Ljung-Box test on the raw atm2_ts$Cash series reveals significant autocorrelation. This suggests that a low-order model such as ARIMA(2,0,0) may not adequately capture the structure of the data.
adf.test(atm2_ts$Cash)
Warning in adf.test(atm2_ts$Cash): p-value smaller than printed p-value
Augmented Dickey-Fuller Test
data: atm2_ts$Cash
Dickey-Fuller = -6.046, Lag order = 7, p-value = 0.01
alternative hypothesis: stationary
The Augmented Dickey-Fuller (ADF) test is commonly used to determine whether a time series is stationary. Its results provide strong evidence that the Cash time series does not have a unit root and can be treated as stationary, a favorable condition for many time series modeling techniques.
# Ensure continuity and handle missing values in Cash
atm2_ts_ <- atm2_ts %>%
  fill_gaps() %>%                          # fill implicit time gaps
  mutate(Cash = na_interpolation(Cash))    # impute missing values in the Cash column
# Fit an ARIMA model with a constant
forecast_model <- atm2_ts_ %>%
  model(ARIMA = ARIMA(Cash ~ 1))
# Forecast May 2010 (h = 30)
forecast_results <- forecast_model %>% forecast(h = 30)
# Plot the forecast results for May 2010
autoplot(forecast_results, atm2_ts_) +
  labs(title = "ATM2 Forecast for May 2010", x = "Date", y = "Cash Withdrawals") +
  theme_minimal()
atm2_ts <- atm2_ts %>%
  fill_gaps() %>%
  mutate(Cash = na_interpolation(Cash))  # impute missing values if any
# Fit an ARIMA(2,0,0) model
forecast_models <- atm2_ts %>%
  model(ARIMA_2_0_0 = ARIMA(Cash ~ pdq(2, 0, 0)))
# Forecast over a 30-step horizon (adjust h as needed)
forecast_results <- forecast_models %>% forecast(h = 30)
# Plot the forecast results
autoplot(forecast_results, atm2_ts) +
  labs(title = "ATM2 Cash Forecast using ARIMA(2,0,0)",
       x = "Date", y = "Cash Withdrawals") +
  theme_minimal()
4.2.7 Evaluate model performance on ATM2
augment(forecast_models) %>% features(.innov, ljung_box, lag = 7)
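4.3 ATM3

ATM3 is examined next. Its series construction is sketched below, assuming the same pipeline used for ATM1 (filter to the ATM, keep DATE and Cash, convert to a tsibble, fill gaps, and interpolate):

atm3_ts <- atm_ts %>%
  dplyr::filter(ATM == "ATM3") %>%
  dplyr::select(DATE, Cash) %>%
  as_tsibble(index = DATE) %>%
  fill_gaps() %>%
  mutate(Cash = na_interpolation(Cash))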
#| label: extract the ATM3 'Cash' column for plotting
ggtsdisplay(atm3_ts$Cash, points = FALSE, plot.type = "histogram")
4.3.1 Evaluate model performance on ATM3
#| label: Evaluate model accuracy using the Ljung-Box test
# Fit each model separately on atm3_ts
ets_model    <- atm3_ts %>% model(ETS = ETS(Cash))
snaive_model <- atm3_ts %>% model(Snaïve = SNAIVE(Cash))
arima_model  <- atm3_ts %>% model(ARIMA = ARIMA(Cash))
# Function to evaluate each model with the Ljung-Box test
evaluate_model <- function(model) {
  model %>%
    augment() %>%                         # extract residuals
    features(.innov, ljung_box, lag = 7)  # apply the Ljung-Box test on residuals
}
# Apply the evaluation function to each model
ets_accuracy    <- evaluate_model(ets_model)
snaive_accuracy <- evaluate_model(snaive_model)
arima_accuracy  <- evaluate_model(arima_model)
# Combine results for easier viewing
accuracy_results <- bind_rows(ETS = ets_accuracy,
                              Snaïve = snaive_accuracy,
                              ARIMA = arima_accuracy,
                              .id = "Model")
print(accuracy_results)
Based on the output of the Ljung-Box test, the ETS model is recommended for forecasting ATM3 withdrawals because it captures the underlying data structure effectively while leaving residuals that behave like white noise. The ARIMA model can also be considered if further analysis or specific features of the data warrant it. The Snaïve model should be avoided due to its failure to account for autocorrelation.
4.4 ATM4
# Filter the data for ATM4
atm4_ts <- atm_ts %>% filter(ATM == "ATM4")
# Create the plot
ggplot(atm4_ts, aes(x = DATE, y = Cash)) +
  geom_line() +
  labs(title = "ATM4 Cash Over Time", x = "Date", y = "Cash")
4.4.1. Apply STL on ATM4
#| label: Apply STL decomposition on the ATM4 data
stl_atm4 <- atm_ts %>%
  filter(ATM == "ATM4") %>%
  fill_gaps() %>%                            # fill implicit time gaps
  mutate(Cash = na_interpolation(Cash)) %>%  # impute missing values in the Cash column
  model(stl = STL(Cash ~ trend(window = 7) + season(window = "periodic"),
                  robust = TRUE))
# Extract and plot the components (note: stl_atm4, not stl_atm1)
stl_components <- fabletools::components(stl_atm4)
autoplot(stl_components)
4.4.2. Calculate the ideal lambda for transformation
(atm4_lambda <- atm4_ts %>%
   features(Cash, features = guerrero) %>%
   pull(lambda_guerrero))
[1] -0.0737252
4.4.3 Modeling ATM4 series
# Fit models and forecast 30 periods ahead
forecast_results <- atm4_ts %>%
  model(
    RW    = RW(Cash),
    ETS   = ETS(Cash),
    Naïve = NAIVE(Cash),
    Drift = NAIVE(Cash ~ drift()),
    ARIMA = ARIMA(Cash),
    Holt  = ETS(Cash ~ error("A") + trend("A") + season("N"))  # Holt's method
  ) %>%
  forecast(h = 30)
# Plot the forecasts along with the original data
autoplot(forecast_results, atm4_ts) +
  labs(title = "ATM 4 Withdrawals Forecast by Models")
#| label: extract the 'Cash' column for plotting
ggtsdisplay(atm4_ts$Cash, points = FALSE, plot.type = "histogram")
4.4.4 Evaluate model performance on ATM4
#| label: Evaluate model accuracy using the Ljung-Box test
# Fit each model separately on atm4_ts
rw_model    <- atm4_ts %>% model(RW = RW(Cash))
ets_model   <- atm4_ts %>% model(ETS = ETS(Cash))
naive_model <- atm4_ts %>% model(Naïve = NAIVE(Cash))
drift_model <- atm4_ts %>% model(Drift = NAIVE(Cash ~ drift()))
arima_model <- atm4_ts %>% model(ARIMA = ARIMA(Cash))
holt_model  <- atm4_ts %>% model(Holt = ETS(Cash ~ error("A") + trend("A") + season("N")))  # Holt's method
# Function to evaluate each model with the Ljung-Box test
evaluate_model <- function(model) {
  model %>%
    augment() %>%                         # extract residuals
    features(.innov, ljung_box, lag = 7)  # apply the Ljung-Box test on residuals
}
# Apply the evaluation function to each model
rw_accuracy    <- evaluate_model(rw_model)
ets_accuracy   <- evaluate_model(ets_model)
naive_accuracy <- evaluate_model(naive_model)
drift_accuracy <- evaluate_model(drift_model)
arima_accuracy <- evaluate_model(arima_model)
holt_accuracy  <- evaluate_model(holt_model)
# Combine results for easier viewing
accuracy_results <- bind_rows(RW = rw_accuracy,
                              ETS = ets_accuracy,
                              Naïve = naive_accuracy,
                              Drift = drift_accuracy,
                              ARIMA = arima_accuracy,
                              Holt = holt_accuracy,
                              .id = "Model")
print(accuracy_results)
Based on the Ljung-Box test output, the ARIMA and Holt models are the best options for the ATM4 series, as they have the highest p-values, suggesting that their residuals are closest to white noise (uncorrelated). Either model would likely be effective.
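To produce the Excel-readable deliverable for Part A, the chosen point forecasts can be exported with openxlsx (loaded above). A minimal sketch using the ATM1 forecasts from section 4.1.6; the choice of the ARIMA means and the file name are assumptions:

library(openxlsx)
# Keep the ARIMA mean forecasts and write them to an Excel-readable file
atm1_out <- atm1_forecasts %>%
  as_tibble() %>%
  dplyr::filter(.model == "ARIMA") %>%
  dplyr::select(DATE, Cash = .mean)
write.xlsx(atm1_out, "ATM624_forecasts_May2010.xlsx")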
PART B
Part B consists of a simple dataset of residential power usage from January 1998 to December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
1. Exploratory data analysis
Import and read the Excel data into R
library(readxl)
file_path <- "~/Downloads/ResidentialCustomerForecastLoad-624.xlsx"
# Read the Excel file
data <- read_excel(file_path)
power_ts <- data
# Create a histogram of KWH
ggplot(data, aes(x = KWH)) +
  geom_histogram(binwidth = 500000, color = "pink", fill = "blue", alpha = 0.7) +
  labs(title = "Histogram of KWH", x = "KWH", y = "Frequency") +
  theme_minimal()
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
summary(power_ts)
CaseSequence YYYY-MMM KWH
Min. :733.0 Length:192 Min. : 770523
1st Qu.:780.8 Class :character 1st Qu.: 5429912
Median :828.5 Mode :character Median : 6283324
Mean :828.5 Mean : 6502475
3rd Qu.:876.2 3rd Qu.: 7620524
Max. :924.0 Max. :10655730
NA's :1
Data wrangling and manipulation:
Rename the date column for convenience and drop the columns that are not needed, as sketched below.
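A sketch of that step, mirroring the pipeline used again in the Modeling section: parse `YYYY-MMM` into a yearmonth index, rescale KWH to thousands, and drop CaseSequence.

power_ts <- power_ts %>%
  mutate(DATE = yearmonth(`YYYY-MMM`), KWH = KWH / 1000) %>%
  dplyr::select(-CaseSequence, -`YYYY-MMM`) %>%
  as_tsibble(index = DATE)
head(power_ts)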
# A tsibble: 6 x 2 [1M]
KWH DATE
<dbl> <mth>
1 6863. 1998 Jan
2 5838. 1998 Feb
3 5421. 1998 Mar
4 5010. 1998 Apr
5 4665. 1998 May
6 6467. 1998 Jun
Graph the series
# Plot the time series
power_ts %>%
  autoplot(KWH) +
  labs(title = "Monthly Distributions Residential Power Usage | Jan '98 - Dec '13")
The data appear right-skewed and may benefit from a transformation. First, let’s create a duplicate data frame.
# Load necessary libraries
library(forecast)  # for na.interp()
library(dplyr)     # for data manipulation
# Make a copy of the original data and ensure it is a data frame
powerts2 <- as.data.frame(power_ts)
# Check that the KWH column exists
if (!"KWH" %in% names(powerts2)) {
  stop("The KWH column does not exist in the dataset.")
}
# Impute missing values using na.interp() from the forecast package
powerts2$KWH <- na.interp(powerts2$KWH)
# Replace the minimum KWH value with the median of the KWH column
powerts2$KWH <- replace(powerts2$KWH,
                        powerts2$KWH == min(powerts2$KWH, na.rm = TRUE),
                        median(powerts2$KWH, na.rm = TRUE))
# Inspect the first rows to check the changes
head(powerts2)
       KWH     DATE
1 6862.583 1998 Jan
2 5838.198 1998 Feb
3 5420.658 1998 Mar
4 5010.364 1998 Apr
5 4665.377 1998 May
6 6467.147 1998 Jun
# ts plot
power_ts %>%
  autoplot() +
  labs(title = "Monthly Distributions Residential Power Usage | Jan '98 - Dec '13") +
  ylab(label = "KWH (Thousands)")
Plot variable not specified, automatically selected `.vars = KWH`
Check the histogram and summary statistics for a general sense of the distribution.
ggplot(powerts2, aes(x = KWH)) +
  geom_histogram() +
  labs(title = "Monthly Distributions Residential Power Usage | Jan '98 - Dec '13")
Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The data are not normally distributed, but the bulk of it is concentrated at the center. We also see a few outliers that need to be dealt with. Run summary statistics:
# Summary
summary(powerts2$KWH)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4313 5444 6330 6532 7609 10656
power_ts %>%
  autoplot() +
  labs(y = "KWH in Thousands",
       title = "Transformed KWH with Lambda = -0.2130548")
Plot variable not specified, automatically selected `.vars = KWH`
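The transformation step itself is not shown above. A sketch of how that lambda would be obtained and applied, assuming Guerrero's method as in Part A:

# Estimate the Box-Cox lambda via Guerrero's method and plot the transformed series
power_lambda <- power_ts %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)
power_ts %>%
  autoplot(box_cox(KWH, power_lambda)) +
  labs(y = "KWH (Box-Cox transformed)", title = "Transformed KWH")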
There is not much difference before and after the transformation, but ndiffs() and the ACF will identify whether differencing is needed.
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
ggtsdisplay(power_ts, plot.type ="partial" )
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
The data are not stationary. The variance remains, and we still see spikes falling outside the critical values on both the ACF and PACF plots; these spikes can also flag outliers, as shown below.
ggseasonplot(power_ts)
There is a pronounced anomaly in July, with one year falling far below the others, indicating some sort of out-of-the-ordinary activity. Could it be an annual vacation period when families are away and therefore using less power? Could it be related to something else? That July behavior is definitely intriguing.
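To inspect those July values directly, a quick check (this chunk is an added illustration; format() pulls the month label from the yearmonth index):

power_ts %>%
  as_tibble() %>%
  mutate(Month = format(DATE, "%b")) %>%
  filter(Month == "Jul")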
Modeling
# Plot the historical data and overlay the mean forecasts
# (uses the `forecasts` fable produced in the next chunk)
plot(as.Date(power_ts$DATE), power_ts$KWH, type = "l", col = "blue", lwd = 2,
     main = "Power Time Series Forecasts", xlab = "Time", ylab = "Power (KWH, thousands)")
fc_cols <- c(ETS = "red", ARIMA = "orange", Naive = "green")
for (m in names(fc_cols)) {
  fc <- forecasts[forecasts$.model == m, ]
  lines(as.Date(fc$DATE), fc$.mean, col = fc_cols[[m]], lty = 2)
}
# Add a legend
legend("topright", legend = c("Historical", names(fc_cols)),
       col = c("blue", fc_cols), lty = c(1, 2, 2, 2), lwd = 2)
file_path <- "~/Downloads/ResidentialCustomerForecastLoad-624.xlsx"
# Read the Excel file
data <- read_excel(file_path)
power_ts <- data
# Change variable types: parse the month, rescale KWH to thousands, drop unused columns
power_ts <- power_ts %>%
  mutate(DATE = yearmonth(`YYYY-MMM`), KWH = KWH / 1000) %>%
  dplyr::select(-CaseSequence, -`YYYY-MMM`) %>%
  as_tsibble(index = DATE)
# Handle missing values using imputation by interpolation
power_ts <- power_ts %>% mutate(KWH = na_interpolation(KWH))
# Fit the models
model_results <- power_ts %>%
  model(
    ETS   = ETS(KWH),    # exponential smoothing state space model
    ARIMA = ARIMA(KWH),  # auto ARIMA
    Naive = NAIVE(KWH)   # naive forecast
  )
# Generate forecasts for the next 12 months (or adjust h as needed)
forecasts <- model_results %>% forecast(h = "12 months")
# View the forecast results
print(forecasts)
# A fable: 36 x 4 [1M]
# Key: .model [3]
.model DATE KWH .mean
<chr> <mth> <dist> <dbl>
1 ETS 2014 Jan N(9432, 1272329) 9432.
2 ETS 2014 Feb N(8199, 978288) 8199.
3 ETS 2014 Mar N(7040, 733808) 7040.
4 ETS 2014 Apr N(6204, 579768) 6204.
5 ETS 2014 May N(5872, 528015) 5872.
6 ETS 2014 Jun N(7652, 911557) 7652.
7 ETS 2014 Jul N(9283, 1363541) 9283.
8 ETS 2014 Aug N(9650, 1497129) 9650.
9 ETS 2014 Sep N(9001, 1323148) 9001.
10 ETS 2014 Oct N(6639, 730887) 6639.
# ℹ 26 more rows
# Plot the historical data and the forecasts
power_ts %>%
  ggplot(aes(x = DATE, y = KWH)) +
  geom_line(color = "blue", linewidth = 1) +                     # historical data
  geom_line(data = forecasts, aes(y = .mean, color = .model),
            linewidth = 1, linetype = "dashed") +                # forecasts
  labs(title = "KWH Forecasts from Different Models",
       x = "Time", y = "KWH") +
  scale_color_manual(values = c("ETS" = "red", "ARIMA" = "green", "Naive" = "orange")) +
  theme_minimal() +
  theme(legend.title = element_blank())
ARIMA and ETS appear to be the best models for this forecast.
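Finally, the 2014 point forecasts can be written to an Excel-readable file, as the assignment requires. A sketch with openxlsx, choosing the ARIMA means and converting back from thousands; the file name is an assumption:

library(openxlsx)
power_out <- forecasts %>%
  as_tibble() %>%
  dplyr::filter(.model == "ARIMA") %>%
  dplyr::mutate(KWH = .mean * 1000) %>%  # undo the earlier /1000 rescaling
  dplyr::select(DATE, KWH)
write.xlsx(power_out, "ResidentialPower_forecast_2014.xlsx")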