In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
# Load libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.2
## Warning: package 'dplyr' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.4.3
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──
## ✔ tsibble 1.1.6 ✔ feasts 0.4.2
## ✔ tsibbledata 0.4.1 ✔ fable 0.5.0
## Warning: package 'tsibble' was built under R version 4.4.3
## Warning: package 'tsibbledata' was built under R version 4.4.3
## Warning: package 'feasts' was built under R version 4.4.3
## Warning: package 'fabletools' was built under R version 4.4.3
## Warning: package 'fable' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(janitor)
## Warning: package 'janitor' was built under R version 4.4.2
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.4.2
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.3
# Load dataset
atm <- read_excel("ATM624Data.xlsx")
# Review data
glimpse(atm)
## Rows: 1,474
## Columns: 3
## $ DATE <dbl> 39934, 39934, 39935, 39935, 39936, 39936, 39937, 39937, 39938, 39…
## $ ATM <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
## $ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …
class(atm)
## [1] "tbl_df" "tbl" "data.frame"
The data looks relatively clean but the date format should be converted and data frame should be converted to tsibble
# Clean data
atm <- atm |>
mutate(DATE = excel_numeric_to_date(DATE)) |>
as_tsibble(key = "ATM", index = "DATE")
# Plot data
atm1 <- atm |>
filter(ATM == "ATM1")
atm1 |> autoplot(Cash)
The data seems to have seasonal pattern with a relatively consistent
trend.
# ACF plot
atm1 |>
ACF(Cash) |>
autoplot()
The ACF plot shows peaks going outside of the blue dashed lines with
especially prominent peaks every 7 days at 7, 14, and 21 which confirms
the weekly seasonality of the data.
The missing data in the dataset should be replaced with median value of the data.
# Replace missing data with median
atm1$Cash[is.na(atm1$Cash)] <- median(atm1$Cash, na.rm=TRUE)
A box-cox transformation was attempted to try to smooth the data due to the variation in data spikes in the plot.
# Box cox transformation
# lambda
lambda_atm1 <- atm1 |>
features(Cash, features=guerrero) |>
pull(lambda_guerrero)
atm1_bc <- atm1 |>
mutate(bc=box_cox(Cash, lambda_atm1)) |>
select(DATE, bc)
atm1_bc |>
autoplot(bc)
The box-cox transformation does not appear to have made the data more
consistent and actually seems to have increased outliers so it will not
be utilized for the forecast.
Training and test dataset was created using 80% of the data for training and 20% for testing. Due to the qualities of the data described above, 3 different models were evaluated: seasonal naive bayes, ETS, and ARIMA.
# Split test/training data
atm1_train <- atm1[1:round(.8*nrow(atm1)),]
atm1_test <- anti_join(atm1, atm1_train, by = "DATE")
# Forecast test data
atm1_model <- atm1_train |>
model("snaive" = SNAIVE(Cash),
"ets" = ETS(Cash),
"arima" = ARIMA(Cash)) |>
forecast(h=nrow(atm1_test))
# Model accuracy
accuracy(atm1_model, atm1) |>
select(.model, RMSE)
The ETS model has the lowest RMSE in this case so I will use it for the final forecast. The ETS residuals show some white noise but no pattern or significant outliers. There are a few spikes outside the blue dotted line in the acf plot but the plots are mostly within the lines. The residuals also show a normal distribution.
# Residuals
atm1 |>
model(ETS(Cash)) |>
gg_tsresiduals()
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Using the ETS model, a 30 day forecast was created.
# Forecast
atm1_forecast <- atm1 |>
model(ETS(Cash)) |>
forecast(h=30)
atm1_forecast |>
autoplot(atm1, level =NULL)
# Final output
atm1_output <- atm1_forecast[,-c(2,4)] |>
rename(Cash = .mean)
head(atm1_output)
# Plot data
atm2 <- atm |>
filter(ATM == "ATM2")
atm2 |> autoplot(Cash)
This dataset seems similar to the last ATM1 dataset.
# ACF plot
atm2 |>
ACF(Cash) |>
autoplot()
The ACF plot confirms weekly seasonality at 7,14, and 21 days.
Like the previous, missing data will be replaced with median value of the data.
# Replace NA data with median
atm2$Cash[is.na(atm2$Cash)] <- median(atm2$Cash, na.rm=TRUE)
A box-cox transformation was applied to standardize the dataset but the shape of the data does not appear to have changed significantly so it will not be used in the forecast.
# Box cox
# lambda
lambda_atm2 <- atm2 |>
features(Cash, features=guerrero) |>
pull(lambda_guerrero)
atm2_bc <- atm2 |>
mutate(bc=box_cox(Cash, lambda_atm2)) |>
select(DATE, bc)
atm2_bc |>
autoplot(bc)
Like the previous ATM1 dataset, a training and test data split was used with 80% of datapoint for training and 20% for testing. Since the data is also seasonal a seasonal Naive Bayes, ETS, and ARIMA model evaluated for this dataset.
# Split test/training data
atm2_train <- atm2[1:round(.8*nrow(atm2)),]
atm2_test <- anti_join(atm2, atm2_train, by = "DATE")
# forecast test data
atm2_model <- atm2_train |>
model("snaive" = SNAIVE(Cash),
"ets" = ETS(Cash),
"arima" = ARIMA(Cash)) |>
forecast(h=nrow(atm2_test))
# model accuracy
accuracy(atm2_model, atm2) |>
select(.model, RMSE)
Seasonal naive bayes model had the lowest RMSE value so it will be utilized for the final forecast.
# Residuals
atm2 |>
model(SNAIVE(Cash)) |>
gg_tsresiduals()
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 7 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_bin()`).
The residuals plot shows consistency in spikes but the acf plot has a
significant outlier at lag 7. This may be because the cyclical
seasonality every 7 days is extremely consistent. I also checked the
residuals of the ETS and ARIMA models as well to see if there was a
similar pattern in these models.
# Residuals for ETS
atm2 |>
model(ETS(Cash)) |>
gg_tsresiduals()
# Residuals for ARIMA
atm2 |>
model(ARIMA(Cash)) |>
gg_tsresiduals()
Interestingly, the acf model for ARIMA seem closer within the confines
of the blue line than ECF and SNAIVE. However, there is a still a spike
at lag 7 although much closer to the blue lines. Based on these
residuals, I think ARIMA will provide a more accurate forecast but I
will forecast both SNAIVE and ARIMA to see if there is a major
difference.
# ARIMA forecast
atm2_forecast_arima <- atm2 |>
model(ARIMA(Cash)) |>
forecast(h=30)
atm2_forecast_arima |>
autoplot(atm2, level=NULL)
# SNAIVE forecast
atm2_forecast_snaive <- atm2 |>
model(SNAIVE(Cash)) |>
forecast(h=30)
atm2_forecast_snaive |>
autoplot(atm2, level=NULL)
From the plots we can see that both forecasts are very similar but SNAIVE has a wider range of peaks and lows compared to the ARIMA forecast. Visually, based on the shape of the historical data, I would say that the SNAIVE model actually looks more accurate. I would recommend utilizing the SNAIVE model for further forecasts in this dataset.
# Final output
atm2_output <- atm2_forecast_snaive[,-c(2,4)] |>
rename(Cash = .mean)
head(atm2_output)
# Plot data
atm3 <- atm |>
filter(ATM == "ATM3")
atm3 |> autoplot(Cash)
The plot for this data is very interesting showing cash taken out towards the end of the dataset.
tail(atm3, 10)
We can see that cash withdrawals happened starting from 4/28/2010 and there are only 3 nonzero data points which is insufficient for a reliable predictive model.
Since it will not be feasible to use a predictive model, I tried to create a model using the mean. However, the minimal data in this dataset is an extreme limitation to the accuracy of this forecast.
# Forecast data using nonzero values
atm3_nonzero <- atm3 |>
filter(Cash != 0)
atm3_forecast <- atm3_nonzero |>
model(MEAN(Cash)) |>
forecast(h=30)
atm3_forecast |>
autoplot(atm3, level =NULL)
atm3_output <- atm3_forecast[, -c(2,4)] |>
rename(Cash = .mean)
head(atm3_output)
# Plot data
atm4 <- atm |>
filter(ATM == "ATM4")
atm4 |> autoplot(Cash)
There is a clear outlier between February and March. To avoid this
outlier impacting the forecast, it was replaced with the median value of
the data set. Missing data points were also replaced with the
median.
# Clean data to remove outlier
atm4$Cash[atm4$Cash == max(atm4$Cash, na.rm=TRUE)] <- median(atm4$Cash, na.rm=TRUE)
atm4$Cash[is.na(atm4$Cash)] <- median(atm4$Cash, na.rm=TRUE)
atm4 |> autoplot(Cash)
Now it’s easier to see the actual shape of the other data points.
# ACF plot
atm4 |>
ACF(Cash) |>
autoplot()
The ACF plot shows that this data also has weekly seasonality like the
previous plots.
A box-cox transformation was applied to limit variation of data but it did not appear to be effective so it will not be utilized in the forecast.
# Find lambda
lambda_atm4 <- atm4 |>
features(Cash, features=guerrero) |>
pull(lambda_guerrero)
# Box cox transformation
atm4_bc <- atm4 |>
mutate(bc=box_cox(Cash, lambda_atm4)) |>
select(DATE, bc)
atm4_bc |>
autoplot(bc)
Like ATM1 and ATM2, 80% of the data will be used for training and 20% for testing. SNAIVE, ETS, and ARIMA models were utilized for this data.
# Test/train split
atm4_train <- atm4[1:round(.8*nrow(atm4)),]
atm4_test <- anti_join(atm4, atm4_train, by = "DATE")
# forecast test data
atm4_model <- atm4_train |>
model("snaive" = SNAIVE(Cash),
"ets" = ETS(Cash),
"arima" = ARIMA(Cash)) |>
forecast(h=nrow(atm4_test))
# model accuracy
accuracy(atm4_model, atm4) |>
select(.model, RMSE)
The ARIMA model had the lowest RMSE so I will be using this model for the final forecast.
# Residuals
atm4 |>
model(ARIMA(Cash)) |>
gg_tsresiduals()
The residuals show a plot skewed to the left and a spike outside the
blue lines. This may indicate that ARIMA is not the ideal plot but I
will try it to see what the plot visually looks like.
# Forecast
atm4_forecast <- atm4 |>
model(ARIMA(Cash)) |>
forecast(h=30)
atm4_forecast |>
autoplot(atm4, level =NULL)
The forecast does not seem accurate based on the plot as the range of
data points seems too narrow. I tried plotting ETS and seasonal naive
bayes model as well to see if either of these seemed more accurate.
# ETS Forecast
atm4_forecast_ETS <- atm4 |>
model(ETS(Cash)) |>
forecast(h=30)
atm4_forecast_ETS |>
autoplot(atm4, level =NULL)
# SNAIVE Forecast
atm4_forecast_snaive <- atm4 |>
model(SNAIVE(Cash)) |>
forecast(h=30)
atm4_forecast_snaive |>
autoplot(atm4, level =NULL)
The SNAIVE model forecast actually seems most accurate visually. This
will be used for the final output.
# Final output
atm4_output <- atm4_forecast_snaive[,-c(2,4)] |>
rename(Cash = .mean)
head(atm4_output)
# Combine all predictions
full_atm_output <- rbind(as_tibble(atm1_output), as_tibble(atm2_output), as_tibble(atm3_output), as_tibble(atm4_output)) |>
arrange(DATE)
# Create wide-format
atm_wide <- full_atm_output |>
pivot_wider(names_from = ATM, values_from = Cash)
# Write to CSV files
write.csv(full_atm_output, 'atm_output_Song_J.csv')
write.csv(atm_wide, 'atm_output_wide_Song_J.csv')
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
# Load data
power <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
# view data
head(power)
# clean data
power$"YYYY-MMM" <- yearmonth(power$"YYYY-MMM")
power <- power |>
rename(Date = "YYYY-MMM") |>
select(Date, KWH) |>
as_tsibble(index = "Date")
power |>
autoplot(KWH)
The plot seems to have a seasonal trend and the ACF plot confirms this
with peaks at 6, 12, and 18 months and lows at 3, 9, 15, and 21 months.
There is also a significant outlier in the plot.
power |>
ACF(KWH) |>
autoplot()
Missing data was replaced with the median of the dataset as well as the
outlier.
# Replace outlier with median
power$KWH[power$KWH == min(power$KWH, na.rm = TRUE)] <- median(power$KWH, na.rm = TRUE)
# Replace missing values with median
power$KWH[is.na(power$KWH)] <- median(power$KWH, na.rm=TRUE)
power |>
autoplot(KWH)
The plot is now easier to read once adjusting the outlier.
A box-cox transformation was applied to see if the variances would be more consistent although the data looks relatively consistent already. The resulting plot does not appear to have made a difference so it will not be utilized in the forecast.
# lambda
lambda_power <- power |>
features(KWH, features = guerrero) |>
pull(lambda_guerrero)
# Box-cox transformation
power_bc <- power |>
mutate(bc = box_cox(KWH, lambda_power)) |>
select(Date, bc)
power_bc |>
autoplot(bc)
A test and training split was created using 80% of data for training and 20% for testing. Like in part A, seasonal naive bayes, ETS, and ARIMA models were evaluated.
# Train and test data
power_train <- power[1:round(.8*nrow(power)),]
power_test <- anti_join(power, power_train, by = "Date")
# Predict test data
power_model <- power_train |>
model("snaive" = SNAIVE(KWH),
"ets" = ETS(KWH),
"arima" = ARIMA(KWH)) |>
forecast(h=nrow(power_test))
accuracy(power_model, power) |>
select(.model, RMSE)
The ARIMA model has the lowest RMSE would it will be used for the forecast.
# Residuals
power |>
model(ARIMA(KWH)) |>
gg_tsresiduals()
The residuals seem to white noise. There is 1 spike outside the blue
line in the acf plot but the rest are within the confines.
# 12 month forecast
power_forecast <- power |>
model(ARIMA(KWH)) |>
forecast(h=12)
power_forecast |>
autoplot(power, level=NULL)
# Final output
power_output <- power_forecast[,-c(1,3)] |>
rename(KWH = .mean)
print(power_output)
## # A tsibble: 12 x 2 [1M]
## Date KWH
## <mth> <dbl>
## 1 2014 Jan 10138894.
## 2 2014 Feb 8757344.
## 3 2014 Mar 6702667.
## 4 2014 Apr 6008940.
## 5 2014 May 5943702.
## 6 2014 Jun 8207930.
## 7 2014 Jul 9520071.
## 8 2014 Aug 10010697.
## 9 2014 Sep 8488807.
## 10 2014 Oct 5867038.
## 11 2014 Nov 6155281.
## 12 2014 Dec 8307433.
write.csv(power_output, "power_ouput_Song_J.csv")
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
# Load data
pipe1 <- read_xlsx('Waterflow_Pipe1.xlsx')
pipe2 <- read_xlsx('Waterflow_Pipe2.xlsx')
head(pipe1)
head(pipe2)
# Combine datasets
pipe <- rbind(pipe1 , pipe2) |>
rename(date_time = 'Date Time')
# Convert date time format
pipe <- pipe |>
mutate(date_time = convertToDateTime(date_time),
date = as.Date(date_time),
hour = format(date_time,"%H")) |>
group_by(date, hour) |>
mutate(hour_avg = mean(WaterFlow)) |>
ungroup() |>
select(date, hour, hour_avg) |>
distinct() |>
mutate(date_time = paste0(date, " ", hour, ':00:00'))
pipe$date_time <- as.POSIXct(pipe$date_time)
pipe <- pipe |> select(date_time, hour_avg) |> as_tsibble(index = 'date_time')
print(head(pipe))
## # A tsibble: 6 x 2 [1h] <?>
## date_time hour_avg
## <dttm> <dbl>
## 1 2015-10-23 00:00:00 26.1
## 2 2015-10-23 01:00:00 18.8
## 3 2015-10-23 02:00:00 24.5
## 4 2015-10-23 03:00:00 25.6
## 5 2015-10-23 04:00:00 22.4
## 6 2015-10-23 05:00:00 23.6
pipe |> autoplot(hour_avg)
The variance in data is lower in the beginning and higher towards the
middle and end of the dataset.
A box-cox transformation was applied due to the variance in this dataset.
# Find lambda
lambda_pipe <- pipe |>
features(hour_avg, features = guerrero) |>
pull(lambda_guerrero)
# Apply Box-Cox transformation
pipe_bc <- pipe |>
mutate(bc = box_cox(hour_avg, lambda_pipe)) |>
select(date_time, bc)
pipe_bc |>
autoplot(bc)
No seasonal differencing and one order of differencing were found to be
necessary to achieve stationarity
# Seasonal differencing
pipe_bc |>
features(bc, unitroot_nsdiffs)
# One order of differencing
pipe_bc |>
features(bc, unitroot_ndiffs)
# Apply one order of differencing
pipe_bc |>
mutate(diff_bc = difference(bc)) |>
autoplot(diff_bc)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
The data now does not appear to show an overall trend or
seasonality.
Due to the shape of the data, an ARIMA model is likely the best model to use in this case.
# Fill gaps in data
pipe_bc <- fill_gaps(pipe_bc)
# Generate forecast (forecast was made out to 5 days (120 hrs)
pipe_forecast <- pipe_bc |>
model(ARIMA(bc)) |>
forecast(h = 120)
pipe_forecast |>
autoplot(pipe_bc, level = NULL)
# Final output
pipe_output <- pipe_forecast[,-c(1,3)] |>
rename(bc = .mean) |>
mutate(hour_avg = inv_box_cox(bc, lambda_pipe)) |>
select(date_time, hour_avg)
head(pipe_output)
write.csv(pipe_output, 'pipe_output_Song_J.csv')