This project consists of 3 parts - two required and one bonus.
Part A – ATM Forecast, ATM624Data.xlsx
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx
PART A Forecasting Analysis
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.
ATM 4 as more cash withdrawals.ATM 3 only has few data points near the end of the year, could be a new machine.ATM 4 Has one usually high outlier so we may have to impute this. Remove values that have both ATM and cash has NA. Then convert the data to tsibble.
Code
# Plot cash taken out plot1 <- inputf1 %>%group_by(ATM) %>%plot_ly(x=~DATE) %>%add_lines(y=~Cash, color =~factor(ATM)) %>%layout(title ="ATM Withdwals by Date")plot1
DATE ATM Cash
Min. :2009-05-01 ATM1:365 Min. : 0.0
1st Qu.:2009-07-31 ATM2:365 1st Qu.: 0.5
Median :2009-10-30 ATM3:365 Median : 73.0
Mean :2009-10-30 ATM4:365 Mean : 155.6
3rd Qu.:2010-01-29 3rd Qu.: 114.0
Max. :2010-04-30 Max. :10919.8
NA's :5
Review distributions
Let’s take a look at the distributions for the other three ATMs that have steady cash withdrawals first.
Code
plot1 <- tsibbleATM |>filter(ATM !="ATM3") |>ggplot(aes(x=Cash)) +geom_histogram(aes(fill=ATM), bins=30) +facet_wrap(~ATM, nrow=1, scales ="free_x") +theme(legend.position="none") +labs(title ="Histograms of ATM Cash Withdrawals", y ="Count", x ="Cash (Hundreds of $)")plot2 <- tsibbleATM |>filter(ATM !="ATM3") |>ggplot(aes(x=Cash)) +geom_boxplot(aes(fill=ATM)) +facet_wrap(~ATM, nrow=1, scales ="free_x") +theme(legend.position="none") +labs(title ="Boxplots of ATM Cash Withdrawals", y ="", x ="Cash (Hundreds of $)")plot_grid(plot1, plot2, ncol=1)
Impute missing values
ATM1 is missing 3 values for Cash. The data appears to be bimodally distributed as can be seen from the boxplot, we will use median imputation for this ATM.
ATM2 is missing 2 values. ATM2 appears to be bimodally distributed with about equal observations in each peak. We will use mode imputation for this ATM.
We will use median imputation to replace the large outlier in ATM4.
Code
# Various NA plots to inspect dataknitr::kable(miss_var_summary(tsibbleATM), caption ='Missing Values',format="html", table.attr="style='width:50%;'") %>% kableExtra::kable_styling()
tsibbleATM_imputed |>autoplot(Cash) +facet_wrap(~ATM, ncol=1, scales ="free_y") +labs(title ="ATM Withdrawal with imputed for Four ATMs (May 2009 - April 2010)")
ATM1 and ATM2 both require seasonal differencing to make the data stationary. So, we can check differencing with a lag 7 to see what happens.The spikes on the ACF are still higher than we want in a residual after differencing.
# df = with(ATM.df, ATM.df[(DATE >= "2009-07-01" & DATE <= "2009-08-31"), ])ts =ggplot(tsibbleATM_imputed, aes(x = DATE, y = Cash, group = ATM, color = ATM)) +geom_line() +geom_point() +labs(title ="ATM Cash Withdrawal", subtitle ="1 July, 2009 to 31 August, 2009",x ="Date") +scale_y_continuous("Amount of Cash Withdrawal",labels = scales::dollar_format(scale =0.1, suffix ="K")) +theme_minimal() #+ transition_reveal(DATE)ts
TimeSeries Models
Code
# Transform the dataframe into a time seriestemp = tsibbleATM_imputed %>%drop_na() %>%spread(ATM, Cash)ATM.ts =ts(temp %>% dplyr::select(-DATE), frequency =7)
ATM1 Subseries using TS
From the time series plot, we can see seasonality in this set, weekly as previously reviewed. While there is no steady trend over the weeks, there are increasing and decreasing activities over periods. The ACF shows slow decrease in every 7th lags due to the trend.Plotting subseries to show the variations during the week
Code
# ggtsdisplay(ATM.ts[,1], main = "ATM #1 Cash Withdrawal", ylab = "Cash Withdrawal (in hundreds)", xlab = "Week")ggsubseriesplot(ATM.ts[,1]) +labs(title ="ATM #1 Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",x ="Days of the Week") +scale_y_continuous("Amount of Cash Withdrawal",labels = scales::dollar_format(scale =0.1, suffix ="K"))
# A tsibble: 365 x 7 [1D]
# Key: ATM, .model [1]
ATM .model DATE Cash .fitted .resid .innov
<fct> <chr> <date> <dbl> <dbl> <dbl> <dbl>
1 ATM1 stlf 2009-05-01 96 NA NA NA
2 ATM1 stlf 2009-05-02 82 NA NA NA
3 ATM1 stlf 2009-05-03 85 NA NA NA
4 ATM1 stlf 2009-05-04 90 NA NA NA
5 ATM1 stlf 2009-05-05 99 NA NA NA
6 ATM1 stlf 2009-05-06 88 NA NA NA
7 ATM1 stlf 2009-05-07 8 NA NA NA
8 ATM1 stlf 2009-05-08 104 100. 3.62 3.62
9 ATM1 stlf 2009-05-09 87 90.4 -3.43 -3.43
10 ATM1 stlf 2009-05-10 93 93.0 0.0293 0.0293
# ℹ 355 more rows
fc_atm_1A <- atm1_mod %>%forecast(h=31)fc_atm_1A %>%autoplot(df_atm_1 , level =80) +labs(y ="Cash in hundreds of dollars",title ="ATM1 Forecast in May 2010")
ATM2 Subseries, lambda transformation and SNAIVE, ETS
Code
ggsubseriesplot(ATM.ts[,2]) +labs(title ="ATM #2 Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",x ="Days of the Week") +scale_y_continuous("Amount of Cash Withdrawal",labels = scales::dollar_format(scale =0.1, suffix ="K"))
fc_atm_2A <- atm2_mod %>%forecast(h=31)fc_atm_2A %>%autoplot(df_atm_2 , level =80) +labs(y ="Cash in hundreds of dollars",title ="ATM1 Forecast in May 2010")
ggsubseriesplot(ATM.ts[,3]) +labs(title ="ATM #3 Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",x ="Days of the Week") +scale_y_continuous("Amount of Cash Withdrawal",labels = scales::dollar_format(scale =0.1, suffix ="K"))
ATM4 Subseries, lambda transformation , ETS and SNAIVE
Code
ggtsdisplay(ATM.ts[,4], main ="ATM #4 Cash Withdrawal", ylab ="Cash Withdrawal (in hundreds)", xlab ="Week")
Code
ggsubseriesplot(ATM.ts[,4]) +labs(title ="ATM #4 Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",x ="Days of the Week") +scale_y_continuous("Amount of Cash Withdrawal",labels = scales::dollar_format(scale =0.1, suffix ="K"))
Code
df_atm_4 <- tsibbleATM_imputed |>filter(ATM =="ATM4")lambda_4 <- df_atm_4 %>%features(Cash, features = guerrero) %>%pull(lambda_guerrero)fit_atm_4_a <- df_atm_4%>%model(SNAIVE =SNAIVE(box_cox(Cash, lambda_4)),ETS =ETS((box_cox(Cash, lambda_4) ~error("A") +trend("A") +season("A"))))fit_atm_4_b <- df_atm_4%>%model(ETS =ETS((box_cox(Cash, lambda_4) ~error("A") +trend("A") +season("A"))))fit_atm_4_c <- df_atm_4%>%model(SNAIVE =SNAIVE(box_cox(Cash, lambda_4)))fc_atm_4 <- fit_atm_4_a %>%forecast(h=31)fc_atm_4%>%autoplot(df_atm_4, level =80) +labs(y ="Cash in hundreds of dollars",title ="ATM4 Forecast in May 2010")
fc_atm_4A <- atm4_mod %>%forecast(h=31)fc_atm_4A %>%autoplot(atm4 , level =80) +labs(y ="Cash in hundreds of dollars",title ="ATM4 Forecast in May 2010")
fc_atm_4b <- fit_atm_4_b %>%forecast(h=61)fc_atm_4b%>%autoplot(train4 , level =80) +labs(y ="Cash in hundreds of dollars",title ="ATM4 Forecast in May 2010")
Forecast - Write to file final forecast
Write out Arima for the forecasts. For ATM4, the model for ETS was better on review, therefore I have forecast ETS for ATM4.
Code
may_forecasts <- fc |>filter(DATE >="2010-05-01")may_forecasts
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above.
Data Exploration & Plots
First we start reviewing the data and identify any missing variables. The data consists 192 rows and 3 columns. CaseSequence is redundant and can be removed. The date column (YYYY.MMM) is coded as a character. For timeseries analysis we will mutate and create a tsibble with the date after formatting. September 2008 is missing the recorded KWH.
The line plot shows seasonality of the series. The data follows a clear seasonal trend. There is more power usage in the beginning, middle, and end of the year. There is an outlier point outside of the range during July 2010. We will update the outlier point to NA so we can impute it. The seasonal trend also shows a pattern with usage higher in the beginning, middle and end of the year.
Code
plot1 <- Filepower_ts |>autoplot(KWH) +labs(title ="Lineplot of Power Usage", x ="Date") +scale_y_continuous(labels=scales::comma)plot2 <- Filepower_ts |>ggplot(aes(x = KWH)) +geom_histogram(bins=15, fill='lightpink') +scale_x_continuous(labels=scales::comma) +labs(title ="Histogram of KWH")plot3 <- Filepower_ts |>ggplot(aes(x = KWH)) +geom_boxplot(fill='lightpink') +scale_x_continuous(labels=scales::comma) +labs(title ="Boxplot of KWH")plot_grid(plot1, plot_grid(plot2, plot3, nrow=1), ncol=1)
Code
#Identify the outlierFilepower_ts |>filter(KWH <3000000)
# A tsibble: 1 x 2 [1M]
Date KWH
<mth> <dbl>
1 2010 Jul 770523
To impute the missing values, we can use the mean of the same month the 2 years prior. Given the right-skewed nature of the histogram, we can perform a box-cox transformation to reduce the skewness.
In light of the observed seasonality,we can try some seasonal methods such as Arima. Here we can conver the dataframe into time series and see the ACF plot. The spikes show outside of the range, indicates it is not white noise and not-normal.
Code
power_ts <-ts(Filepower_ts$KWH, start=1998, frequency =12)Acf(power_ts)
Code
ggtsdisplay(power_ts, main="Time series KWH before transformation")
Code
Filepower_ts%>%ACF(KWH) %>%autoplot() +labs(title="KWH consumption before transformation ACF")
The aim now is to find an appropriate ARIMA model based on the ACF and PACF. Both the ACF and PACF plots are sinusoidal decaying and there is a large significant spike at lag 12. There are smaller significant spikes.
We can use auto.arima() function to determine the appropriate seasonal AR and MA models. Also I have used lambda to see if we can tranform the data to minimize the variance.
Filepower_ts |>model(classical_decomposition(KWH, type ="additive") ) |>components() |>autoplot() +labs(title ="Classical additive decomposition of total KWH")
Here I have tried some seasonal methods such as holt winter and damp holt winter. I have also plotted the histogram showing that for the large part the models show a normal distribution bell curve on the residual, however, there is one large spike on the ACF. We can proceed to look at ARIMA to see how it does by reviewing the AICc value, we can proceed with the lowest AICc on the ARIMA model and write out our forecast.
# A mable: 2 x 2
# Key: Model name [2]
`Model name` Orders
<chr> <model>
1 holt <ETS(A,A,A)>
2 damp <ETS(M,Ad,M)>
Code
t |> dplyr::select(holt) |>gg_tsresiduals(lag=24)
Code
t |> dplyr::select(damp ) |>gg_tsresiduals(lag=24)
ARIMA models
The auto arima model <ARIMA(0,0,1)(2,1,0)[12] w/ drift> seems to have the lowest AICc value, hence we will select this model and store our forecast. Additionally, the residuals on this model seem better, however, that the auto arima model and the scores of AICc and BIC are lower as well. ARIMA(0,0,1)(0,1,1)[12] with drift.
# A fable: 12 x 4 [1M]
# Key: .model [1]
.model Date KWH .mean
<chr> <mth> <dist> <dbl>
1 auto 2014 Jan N(9082193, 5.1e+11) 9082193.
2 auto 2014 Feb N(8825149, 5.2e+11) 8825149.
3 auto 2014 Mar N(6827872, 5.2e+11) 6827872.
4 auto 2014 Apr N(6e+06, 5.2e+11) 6028873.
5 auto 2014 May N(5679457, 5.2e+11) 5679457.
6 auto 2014 Jun N(7723045, 5.2e+11) 7723045.
7 auto 2014 Jul N(9507586, 5.2e+11) 9507586.
8 auto 2014 Aug N(9795634, 5.2e+11) 9795634.
9 auto 2014 Sep N(8816531, 5.2e+11) 8816531.
10 auto 2014 Oct N(6e+06, 5.2e+11) 6007628.
11 auto 2014 Nov N(6060119, 5.2e+11) 6060119.
12 auto 2014 Dec N(7943802, 5.2e+11) 7943802.
Code
fc_2014a
# A fable: 12 x 4 [1M]
# Key: .model [1]
.model Date KWH .mean
<chr> <mth> <dist> <dbl>
1 arima001011drift 2014 Jan N(8922240, 4.9e+11) 8922240.
2 arima001011drift 2014 Feb N(8223524, 4.9e+11) 8223524.
3 arima001011drift 2014 Mar N(6801798, 4.9e+11) 6801798.
4 arima001011drift 2014 Apr N(6e+06, 4.9e+11) 5959830.
5 arima001011drift 2014 May N(5557517, 4.9e+11) 5557517.
6 arima001011drift 2014 Jun N(7198250, 4.9e+11) 7198250.
7 arima001011drift 2014 Jul N(8709822, 4.9e+11) 8709822.
8 arima001011drift 2014 Aug N(9089526, 4.9e+11) 9089526.
9 arima001011drift 2014 Sep N(8429024, 4.9e+11) 8429024.
10 arima001011drift 2014 Oct N(6118739, 4.9e+11) 6118739.
11 arima001011drift 2014 Nov N(5664756, 4.9e+11) 5664756.
12 arima001011drift 2014 Dec N(7291513, 4.9e+11) 7291513.
Code
dataframe1 <-data.frame (fc_2014a)
Write final forecast to a file
In conclusion :
From the book we looked at several models after checking seasonality, stationarity,looked at residuals, ran holt winter, ARIMA models and finally made ARIMA model selection and wrote the forecast to the file.
Definition from the book.
“Transformations such as logarithms can help to stabilise the variance of a time series. Differencing can help stabilise the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.
A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time.”
Code
date =seq(as.Date('2014-01-01'), as.Date('2014-12-01'), by ='month') %>%format('%Y-%b')openxlsx::write.xlsx(data.frame(YYYY.MM = date, KWH = fc_2014a$.mean), "forecasts_Banu.xlsx")
Part C – Bonus
Part C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file.
This time series has gaps or missing values, therefore we will use na_interpolation to fill gaps and plot the ACF. The ACF shows one significant spike.
plot1 <- hourly_pipe1 |>fill_gaps() |>fill(avg_flow, .direction="down") |>autoplot(avg_flow) +labs(title ="Pipe 1: Hourly Water Flow", x ="Date", y ="Average Water Flow")plot2 <- hourly_pipe1 |>fill_gaps() |>fill(avg_flow, .direction="down") |>ACF(avg_flow) |>autoplot() +labs(title ="ACF")plot_grid(plot1, plot2, ncol=1)
Plot pipe2
Code
plot1 <- hourly_pipe2 |>fill_gaps() |>fill(avg_flow, .direction="down") |>autoplot(avg_flow) +labs(title ="Pipe 2: Hourly Water Flow", x ="Date", y ="Average Water Flow")plot2 <- hourly_pipe2 |>fill_gaps() |>fill(avg_flow, .direction="down") |>ACF(avg_flow) |>autoplot() +labs(title ="ACF")plot_grid(plot1, plot2, ncol=1)
Check Stationarity
I have checked ome ways to check differencing and then looking at the features to see the effects.
From the book: “Differencing can help stabilise the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.A stationary time series is one whose properties do not depend on the time at which the series is observed.17 Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time.”
Code
pipes2_tsb |>ACF(difference(avg_flow)) |>autoplot() +labs(subtitle ="Hourly Water Flow Pipe 2")
fit_atm_4_b |>gg_tsresiduals() +labs(title ="Hourly Water Flow boxcox ETS")
Code
fit_atm_4_c |>gg_tsresiduals() +labs(title ="Hourly Water Flow boxcox SNAIVE")
Code
fc_atm_4b <- fit_atm_4_b %>%forecast(h ="1 week")fc_atm_4c <- fit_atm_4_c %>%forecast(h="1 week")fc_atm_4b%>%autoplot(pipes2_tsb , level =80) +labs(y ="Hourly Water Flow Pipe 2 with boxcox and ETS",title ="Hourly Water Flow Pipe 2 with boxcox and ETS")
Code
fc_atm_4c%>%autoplot(pipes2_tsb , level =80) +labs(y ="Hourly Water Flow Pipe 2 with boxcox SNAIVE",title ="Hourly Water Flow Pipe 2 with boxcox and SNAIVE")
---title: "DATA624 Project 1"author: "Banu"date: "10/22/2024"toc: trueformat: html: html-math-method: katex code-fold: true code-tools: true self-contained: true toc_depth: 3execute: warning: false---```{r}#Load Packageslibrary(arrow)library(dplyr)library(ggplot2)library(reshape2)library(tidymodels)library(tidyr)library(broom)library(patchwork)library(lubridate)library(rpart.plot)library(kableExtra)library(DataExplorer)library(skimr)#install.packages("ranger")#install.packages("openxlsx")library(DescTools)library(corrplot)library(ggcorrplot)library(caret)library(rpart)library(car)library(rattle)library(ROSE)library(mice)library(MASS)library(fable)library(imputeTS)library(psych)library(forecast)library(openxlsx)library(plotly)library(naniar)library(cowplot)library(fpp3)```## Project### **Deliverable**This project consists of 3 parts - two required and one bonus.**Part A – ATM Forecast, ATM624Data.xlsx****Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx****Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx**## PART A Forecasting AnalysisIn part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars, other than that it is straight forward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an excel file, please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual.rmd file Also please submit the forecast which you will put in an Excel readable file.### Data Exploratory Analysis```{r}library(readxl)# inputf1 <- read.csv("C:/Users/Banu/Documents/RScriptfiles/DATA 624/PROJECT 1/ATM624Data.csv",stringsAsFactors = F,header=TRUE)inputf1 <- readxl::read_excel("C:/Users/Banu/Documents/RScriptfiles/DATA 624/PROJECT 1/ATM624Data.xlsx")#Glimpse variablesintroduce(inputf1)plot_intro(inputf1)glimpse(inputf1)inputf1$DATE <-as.Date(as.numeric(inputf1$DATE),origin ="1899-12-30")#sorted by dateinputf1 <- inputf1[order(inputf1$DATE),]inputf1$ATM <-as.factor(inputf1$ATM)#displayknitr::kable(inputf1[1:10, 1:3], caption ='ATM data',format="html", table.attr="style='width:50%;'") %>% kableExtra::kable_styling()knitr::kable(skim(inputf1), format="html") %>% kableExtra::kable_styling()```### Review missing variablesATM 4 as more cash withdrawals.ATM 3 only has few data points near the end of the year, could be a new machine.ATM 4 Has one usually high outlier so we may have to impute this. Remove values that have both ATM and cash has NA. Then convert the data to tsibble.```{r}# Plot cash taken out plot1 <- inputf1 %>%group_by(ATM) %>%plot_ly(x=~DATE) %>%add_lines(y=~Cash, color =~factor(ATM)) %>%layout(title ="ATM Withdwals by Date")plot1#Look at missing valuesplot_missing(inputf1)knitr::kable(miss_var_summary(inputf1), caption ='Missing data',format="html", table.attr="style='width:50%;'") %>% kableExtra::kable_styling()gg_miss_upset(inputf1)# summarize the missing data by ATMinputf1 %>%group_by(as.factor(ATM)) %>%summarise(`Missing Values`=sum(is.na(Cash)))# Drop any row that are missing both ATM and Cash valuestsibbleATMclean <- inputf1%>%filter(!(is.na(ATM) &is.na(Cash))) %>%drop()# # #Create a Tsibble object# #tsibbleATM<- read_excel("ATM624DATA.xlsx") %>%# # converting into date format# tsibbleATMclean# # %>% mutate(Date = as.Date(as.numeric(DATE),origin = "1899-12-30"),ATM= as.factor(ATM),Cash = as.numeric(Cash)) %>%# # renaming column name# rename(Date = DATE) %>%# # converting to tsibble# tsibbleATM<- tsibbleATMclean %>% mutate(DATE=DATE,ATM=ATM,Cash=Cash) %>%# as_tsibble(index = Date, key = ATM) tsibbleATM <-as_tsibble(tsibbleATMclean, index=DATE, key=ATM)# Confirm 14 rows were droppedprint(c('Original: ', nrow(inputf1), 'After Drop:', nrow(tsibbleATMclean)))tsibbleATM %>%autoplot(Cash) +facet_wrap(~ATM, scales ="free", nrow =4) +labs(title =" ATM Before Outlier Removal")summary(tsibbleATM)```### Review distributionsLet’s take a look at the distributions for the other three ATMs that have steady cash withdrawals first.```{r}plot1 <- tsibbleATM |>filter(ATM !="ATM3") |>ggplot(aes(x=Cash)) +geom_histogram(aes(fill=ATM), bins=30) +facet_wrap(~ATM, nrow=1, scales ="free_x") +theme(legend.position="none") +labs(title ="Histograms of ATM Cash Withdrawals", y ="Count", x ="Cash (Hundreds of $)")plot2 <- tsibbleATM |>filter(ATM !="ATM3") |>ggplot(aes(x=Cash)) +geom_boxplot(aes(fill=ATM)) +facet_wrap(~ATM, nrow=1, scales ="free_x") +theme(legend.position="none") +labs(title ="Boxplots of ATM Cash Withdrawals", y ="", x ="Cash (Hundreds of $)")plot_grid(plot1, plot2, ncol=1)```### Impute missing valuesATM1 is missing 3 values for Cash. The data appears to be bimodally distributed as can be seen from the boxplot, we will use median imputation for this ATM.ATM2 is missing 2 values. ATM2 appears to be bimodally distributed with about equal observations in each peak. We will use mode imputation for this ATM.We will use median imputation to replace the large outlier in ATM4.```{r}# Various NA plots to inspect dataknitr::kable(miss_var_summary(tsibbleATM), caption ='Missing Values',format="html", table.attr="style='width:50%;'") %>% kableExtra::kable_styling()tsibbleATM |>filter(is.na(Cash)) |> knitr::kable()# function for modemode <-function(x) { uniq_x <-unique(x) uniq_x[which.max(tabulate(match(x, uniq_x)))]}tsibbleATM_imputed <- tsibbleATM |>mutate(Cash =ifelse(Cash ==max(tsibbleATM$Cash, na.rm = T), NA, Cash)) |>group_by(ATM) |>mutate(Cash =case_when(is.na(Cash) & ATM %in%c("ATM1", "ATM4") ~median(Cash, na.rm=T),is.na(Cash) & ATM =="ATM2"~mode(Cash),TRUE~ Cash ))#no missingtsibbleATM_imputed |>filter(is.na(Cash)) |> knitr::kable()tsibbleATM_imputed |>autoplot(Cash) +facet_wrap(~ATM, ncol=1, scales ="free_y") +labs(title ="ATM Withdrawal with imputed for Four ATMs (May 2009 - April 2010)") ```ATM1 and ATM2 both require seasonal differencing to make the data stationary. So, we can check differencing with a lag 7 to see what happens.The spikes on the ACF are still higher than we want in a residual after differencing.```{r}tsibbleATM_imputed |>ACF(Cash) |>autoplot() +facet_wrap(~ATM, ncol=1, scales ="free_y") +labs(title ="ACF Plots after imputation") tsibbleATM_imputed |>features(Cash, unitroot_kpss) |> knitr::kable() tsibbleATM_imputed |>features(Cash, unitroot_ndiffs)``````{r}# atm_finimp |># features(Cash, unitroot_kpss) |># knitr::kable()# # atm_finimp |># features(Cash, unitroot_ndiffs)tsibbleATM_imputed |>features(difference(Cash,7), unitroot_ndiffs)``````{r}tsibbleATM_imputed |>filter(ATM =="ATM1") |>gg_tsdisplay(difference(Cash, lag=7), plot_type="partial") +labs(title ="ATM1: Seasonally Differenced")``````{r}tsibbleATM_imputed |>filter(ATM =="ATM2") |>gg_tsdisplay(Cash, plot_type="partial") +labs(title ="ATM2")``````{r}tsibbleATM_imputed |>filter(ATM =="ATM4") |>gg_tsdisplay(difference(Cash, lag=7), plot_type="partial") +labs(title ="ATM4: Seasonally Differenced")``````{r}# df = with(ATM.df, ATM.df[(DATE >= "2009-07-01" & DATE <= "2009-08-31"), ])ts =ggplot(tsibbleATM_imputed, aes(x = DATE, y = Cash, group = ATM, color = ATM)) +geom_line() +geom_point() +labs(title ="ATM Cash Withdrawal", subtitle ="1 July, 2009 to 31 August, 2009",x ="Date") +scale_y_continuous("Amount of Cash Withdrawal",labels = scales::dollar_format(scale =0.1, suffix ="K")) +theme_minimal() #+ transition_reveal(DATE)ts```### TimeSeries Models```{r}# Transform the dataframe into a time seriestemp = tsibbleATM_imputed %>%drop_na() %>%spread(ATM, Cash)ATM.ts =ts(temp %>% dplyr::select(-DATE), frequency =7)```#### ATM1 Subseries using TSFrom the time series plot, we can see seasonality in this set, weekly as previously reviewed. While there is no steady trend over the weeks, there are increasing and decreasing activities over periods. The ACF shows slow decrease in every 7th lags due to the trend.Plotting subseries to show the variations during the week```{r}# ggtsdisplay(ATM.ts[,1], main = "ATM #1 Cash Withdrawal", ylab = "Cash Withdrawal (in hundreds)", xlab = "Week")ggsubseriesplot(ATM.ts[,1]) +labs(title ="ATM #1 Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",x ="Days of the Week") +scale_y_continuous("Amount of Cash Withdrawal",labels = scales::dollar_format(scale =0.1, suffix ="K"))df_atm_1 <- tsibbleATM_imputed |>filter(ATM =="ATM1")df_atm_1 %>%ACF(Cash)%>%autoplot()```#### ATM 1 Transformation boxcox using TS classWe can try lambda to stabilize the variance similar to the above PART A analysis. The AICc is 2339.304 for the ETS model.```{r}bct =function(ts, title){ a =autoplot(ts) +labs(title =sprintf("Before: %s", title), x ="Week") +scale_y_continuous("Amount of Cash Withdrawal",labels = scales::dollar_format(scale =0.1, suffix ="K")) lambda =BoxCox.lambda(ts) at =autoplot(BoxCox(ts, lambda)) +labs(title =sprintf("Transformed: %s", title),subtitle =sprintf("lambda = %0.2f", lambda),y ="Box-Cox Transformed",x ="Week") gridExtra::grid.arrange(a, at)}lambda =BoxCox.lambda(ATM.ts[,1])bct(ATM.ts[,1], title ="ATM #1 Cash Withdrawal")ATM.ts[,1] %>%stl(s.window =7, robust =TRUE) %>%autoplot() +labs(title ="Seasonal & Trend Decomposition for ATM #1 Time Series", x ="Week")t<- ATM.ts[,1] %>%ets(lambda = lambda, biasadj =TRUE)tt %>%autoplot() +labs(title ="ETS for ATM #1 Time Series", x ="Week")```#### ATM 1 STL and ETS and holtwinter methods using tsibbleI have tried some seasonal methods such as STL and holtwinter since the above data has seasonality and stationarity to it.```{r}fit_dcmp <- df_atm_1 |>model(stlf =decomposition_model(STL(Cash ~trend(window =7), robust =TRUE),NAIVE(season_adjust) ))fit_dcmp |>forecast() |>autoplot(df_atm_1)+labs(y ="cash withdrawal",title ="ATM1 STL")fit_dcmp |>gg_tsresiduals()augment(fit_dcmp)fc <- fit_dcmp |>forecast(h =24)fc |>autoplot( df_atm_1,level=NULL)+labs(y ="cash withdrawal",title ="ATM1 STL forecast")fit3 <- df_atm_1 |>model(additive =ETS(Cash ~error("A") +trend("A") +season("A")),multiplicative =ETS(Cash ~error("M") +trend("A") +season("M")) )fc3 <- fit3 |>forecast(h =24)fc3 |>autoplot(df_atm_1, level =NULL) +labs(title="ATM1 ETS",y="cash withdrawal") +guides(colour =guide_legend(title ="ATM1 STL"))#augment(fit3)fit2 <- df_atm_1 %>%filter(DATE <="2010-04-30") |>model(hw =ETS(Cash ~error("M") +trend("Ad") +season("M")) ) |>forecast(h ="2 weeks") |>autoplot(df_atm_1 |>filter(DATE <="2010-05-14")) +labs(title ="ATM1 Holt Winter",y="cash withdrawal")fit2a <- df_atm_1 %>%filter(DATE <="2010-04-30") |>model(hw =ETS(Cash ~error("M") +trend("Ad") +season("M")) )fit2report(fit2a)fit2b <- fit2a %>%forecast(h ="2 weeks")fit2b#augment(fit2a)fit2c <- df_atm_1 %>%filter(DATE <="2010-04-30") |>model(hw =ETS(Cash ~error("M") +trend("Ad") +season("M")) ) %>%augment() autoplot(fit2c, .innov) +labs(y ="$US",title ="Residuals from the HW method")fit2c |>ggplot(aes(x = .innov)) +geom_histogram() +labs(title ="Histogram of HW residuals")fit2c |>ACF(.innov) |>autoplot() +labs(title ="Residuals from the HW method")components(fit2a) |>autoplot() +labs(title ="ATM1 Holtwinter")components(fit3) |>autoplot() +labs(title ="ATM1 ETS, additive and multiplicative")```#### ATM 1 ARIMAWe can use seasonal ARIMA modeling to forecast future withdrawals for each ATM.Let’s use the April 2010 data as a validation set and the rest of the data for training.We will use an automatically generated ARIMA model and compare with our assumptions from the ACF and PACF plots.```{r}atm1_mod <- df_atm_1 |>model(auto =ARIMA(Cash),arima001011 =ARIMA(Cash ~0+pdq(0,0,1) +PDQ(0,1,1, period=7)))atm1_mod |> knitr::kable()report(atm1_mod) |> dplyr::select(.model:BIC) |> knitr::kable()augment(atm1_mod) |>filter(.model =="auto") |>features(.innov, ljung_box, lag=14)augment(atm1_mod) |>filter(.model =="arima001011") |>features(.innov, ljung_box, lag=14)auto.arima(df_atm_1) df_atm_1 %>%model(auto =ARIMA(Cash ~0+pdq(0,0,1) +PDQ(0,1,2, period=7)))%>%residuals() %>%gg_tsdisplay()fc_atm_1A <- atm1_mod %>%forecast(h=31)fc_atm_1A %>%autoplot(df_atm_1 , level =80) +labs(y ="Cash in hundreds of dollars",title ="ATM1 Forecast in May 2010")```#### ATM2 Subseries, lambda transformation and SNAIVE, ETS```{r}ggsubseriesplot(ATM.ts[,2]) +labs(title ="ATM #2 Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",x ="Days of the Week") +scale_y_continuous("Amount of Cash Withdrawal",labels = scales::dollar_format(scale =0.1, suffix ="K"))df_atm_2 <- tsibbleATM_imputed |>filter(ATM =="ATM2")df_atm_2 %>%ACF(Cash)%>%autoplot()lambda_2 <- df_atm_2%>%features(Cash, features = guerrero) %>%pull(lambda_guerrero)fit_atm_2_a <- df_atm_2%>%model(SNAIVE =SNAIVE(box_cox(Cash, lambda_2)),ETS =ETS((box_cox(Cash, lambda_2) ~error("A") +trend("A") +season("A"))))fit_atm_2_b <- df_atm_2%>%model(ETS =ETS((box_cox(Cash, lambda_2) ~error("A") +trend("A") +season("A"))))fit_atm_2_c <- df_atm_2%>%model(SNAIVE =SNAIVE(box_cox(Cash, lambda_2)))fc_atm_2 <- fit_atm_2_a %>%forecast(h=31)fc_atm_2%>%autoplot(df_atm_2, level =80) +labs(y ="",title = latex2exp::TeX(paste0("Transformed gas production with $\\lambda$ = ",round(lambda_2,2))))fit_atm_2_b |>gg_tsresiduals()fit_atm_2_c |>gg_tsresiduals()augment(fit_atm_2_a)%>%features(.innov, ljung_box, lag =7)```#### ATM2 ARIMA```{r}atm2 <- df_atm_2 |>filter(ATM =="ATM2")atm2_mod <- atm2 |>model(auto =ARIMA(Cash),arima000011 =ARIMA(Cash ~0+pdq(0,0,0) +PDQ(0,1,1, period=7)),arima000012 =ARIMA(Cash ~0+pdq(0,0,0) +PDQ(0,1,2, period=7)))atm2_mod |> knitr::kable()report(atm2_mod) |> dplyr::select(.model:BIC) |> knitr::kable()augment(atm2_mod)%>%features(.innov, ljung_box, lag =7)fc_atm_2A <- atm2_mod %>%forecast(h=31)fc_atm_2A %>%autoplot(df_atm_2 , level =80) +labs(y ="Cash in hundreds of dollars",title ="ATM1 Forecast in May 2010")auto.arima(df_atm_2) df_atm_2 %>%model(auto =ARIMA(Cash ~0+pdq(2,0,2) +PDQ(0,1,1, period=7)))%>%residuals() %>%gg_tsdisplay()```#### ATM3```{r}ggsubseriesplot(ATM.ts[,3]) +labs(title ="ATM #3 Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",x ="Days of the Week") +scale_y_continuous("Amount of Cash Withdrawal",labels = scales::dollar_format(scale =0.1, suffix ="K"))```#### ATM4 Subseries, lambda transformation , ETS and SNAIVE```{r}ggtsdisplay(ATM.ts[,4], main ="ATM #4 Cash Withdrawal", ylab ="Cash Withdrawal (in hundreds)", xlab ="Week")ggsubseriesplot(ATM.ts[,4]) +labs(title ="ATM #4 Cash Withdrawal", subtitle ="1 May, 2009 to 30 April, 2010",x ="Days of the Week") +scale_y_continuous("Amount of Cash Withdrawal",labels = scales::dollar_format(scale =0.1, suffix ="K"))df_atm_4 <- tsibbleATM_imputed |>filter(ATM =="ATM4")lambda_4 <- df_atm_4 %>%features(Cash, features = guerrero) %>%pull(lambda_guerrero)fit_atm_4_a <- df_atm_4%>%model(SNAIVE =SNAIVE(box_cox(Cash, lambda_4)),ETS =ETS((box_cox(Cash, lambda_4) ~error("A") +trend("A") +season("A"))))fit_atm_4_b <- df_atm_4%>%model(ETS =ETS((box_cox(Cash, lambda_4) ~error("A") +trend("A") +season("A"))))fit_atm_4_c <- df_atm_4%>%model(SNAIVE =SNAIVE(box_cox(Cash, lambda_4)))fc_atm_4 <- fit_atm_4_a %>%forecast(h=31)fc_atm_4%>%autoplot(df_atm_4, level =80) +labs(y ="Cash in hundreds of dollars",title ="ATM4 Forecast in May 2010")```#### ATM4 ARIMA```{r}atm4 <- tsibbleATM_imputed |>filter(ATM =="ATM4")atm4_mod <- atm4 |>model(auto =ARIMA(Cash),arima000100 =ARIMA(Cash ~0+pdq(0,0,0) +PDQ(1,0,0, period=7)))atm4_mod |> knitr::kable()report(atm4_mod) |> dplyr::select(.model:BIC) |> knitr::kable()fc_atm_4A <- atm4_mod %>%forecast(h=31)fc_atm_4A %>%autoplot(atm4 , level =80) +labs(y ="Cash in hundreds of dollars",title ="ATM4 Forecast in May 2010")auto.arima(atm4) atm4 %>%model(auto =ARIMA(Cash ~0+pdq(3,0,2) +PDQ(1,0,0, period=7)))%>%residuals() %>%gg_tsdisplay()```#### ForecastLet's do ARIMA forecast and NAIVE for the ATM3 model and show forecast plots below.```{r}atm_train <- tsibbleATM_imputed |>filter(DATE <"2010-04-01")atms_mod <- atm_train |>filter(ATM !="ATM3") |>model(ARIMA(Cash))fc <- atms_mod |>forecast(h=61)``````{r}atm3 <- tsibbleATM_imputed |>filter(ATM =="ATM3")atm3_mod <- atm3 |>model(NAIVE(Cash))atm3_fc <- atm3_mod |>forecast(h=31)atm3_fc%>%autoplot(atm3, level =80) +labs(y ="",title ="ATM3 with NAIVE ", )fc <- fc |>bind_rows(atm3_fc)fc |>autoplot(tsibbleATM_imputed)```#### Forecast - Plot residualsWhite we have tried several types of models. We have proceeded with ARIMA as our final forecasts.```{r}atms_mod |>filter(ATM =="ATM1") |>gg_tsresiduals() +labs(title ="ATM1")atms_mod |>filter(ATM =="ATM2") |>gg_tsresiduals() +labs(title ="ATM2")atms_mod |>filter(ATM =="ATM4") |>gg_tsresiduals() +labs(title ="ATM4")train4 <- atm_train|>filter(ATM =="ATM4")lambda_4 <- train4 %>%features(Cash, features = guerrero) %>%pull(lambda_guerrero)fit_atm_4_a <- train4 %>%model(SNAIVE =SNAIVE(box_cox(Cash, lambda_4)),ETS =ETS((box_cox(Cash, lambda_4) ~error("A") +trend("A") +season("A"))))fit_atm_4_b <- train4 %>%model(ETS =ETS((box_cox(Cash, lambda_4) ~error("A") +trend("A") +season("A"))))fit_atm_4_c <- train4 %>%model(SNAIVE =SNAIVE(box_cox(Cash, lambda_4)))fc_atm_4 <- fit_atm_4_a %>%forecast(h=31)fc_atm_4%>%autoplot(train4 , level =80) +labs(y ="Cash in hundreds of dollars",title ="ATM4 Forecast in May 2010")fit_atm_4_c |>gg_tsresiduals() +labs(title ="ATM4")fit_atm_4_b |>gg_tsresiduals() +labs(title ="ATM4")fc_atm_4b <- fit_atm_4_b %>%forecast(h=61)fc_atm_4b%>%autoplot(train4 , level =80) +labs(y ="Cash in hundreds of dollars",title ="ATM4 Forecast in May 2010")```#### Forecast - Write to file final forecastWrite out Arima for the forecasts. For ATM4, the model for ETS was better on review, therefore I have forecast ETS for ATM4.```{r}may_forecasts <- fc |>filter(DATE >="2010-05-01")may_forecasts may_forecasts_ets <- fc_atm_4b|>filter(DATE >="2010-05-01")may_forecasts_etsautoplot(may_forecasts_ets)openxlsx::write.xlsx(data.frame(may_forecasts), "forecasts_ATM_Banu_ARIMA.xlsx")openxlsx::write.xlsx(data.frame(may_forecasts_ets), "forecasts_ATM_Banu_ETSATM4.xlsx")```## Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsxPart B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add this to your existing files above. ### Data Exploration & PlotsFirst we start reviewing the data and identify any missing variables. The data consists 192 rows and 3 columns. CaseSequence is redundant and can be removed. The date column (YYYY.MMM) is coded as a character. For timeseries analysis we will mutate and create a tsibble with the date after formatting. September 2008 is missing the recorded KWH.```{r}set.seed(2024)Filepower <- readxl::read_excel("C:/Users/Banu/Documents/RScriptfiles/DATA 624/PROJECT 1/ResidentialCustomerForecastLoad-624.xlsx")#Glimpse variablesintroduce(Filepower)plot_intro(Filepower)glimpse(Filepower)describe(Filepower)skim(Filepower)#Look at missing valuesknitr::kable(miss_var_summary(Filepower), caption ='Missing data',format="html", table.attr="style='width:50%;'") %>% kableExtra::kable_styling()Filepower_ts <- Filepower |>rename(Date ="YYYY-MMM") |>mutate(Date =yearmonth(Date)) |> dplyr::select(-CaseSequence) |>as_tsibble(index=Date)Filepower_ts |>filter(is.na(KWH)) |> knitr::kable()```### Data DistributionsThe line plot shows seasonality of the series. The data follows a clear seasonal trend. There is more power usage in the beginning, middle, and end of the year. There is an outlier point outside of the range during July 2010. We will update the outlier point to NA so we can impute it. The seasonal trend also shows a pattern with usage higher in the beginning, middle and end of the year.```{r}plot1 <- Filepower_ts |>autoplot(KWH) +labs(title ="Lineplot of Power Usage", x ="Date") +scale_y_continuous(labels=scales::comma)plot2 <- Filepower_ts |>ggplot(aes(x = KWH)) +geom_histogram(bins=15, fill='lightpink') +scale_x_continuous(labels=scales::comma) +labs(title ="Histogram of KWH")plot3 <- Filepower_ts |>ggplot(aes(x = KWH)) +geom_boxplot(fill='lightpink') +scale_x_continuous(labels=scales::comma) +labs(title ="Boxplot of KWH")plot_grid(plot1, plot_grid(plot2, plot3, nrow=1), ncol=1)#Identify the outlierFilepower_ts |>filter(KWH <3000000)outlierpt <-which(Filepower_ts$Date ==yearmonth("2010 Jul"))Filepower_ts$KWH[outlierpt] =NAFilepower_ts |>filter(is.na(KWH)) |> knitr::kable()Filepower_ts |>gg_lag(KWH, geom="point", lags =1:12)Filepower_ts |>gg_season(KWH) +scale_y_continuous(labels=scales::comma) +labs(title ="Seasonal Decomposition of KWH", x ="Month")Filepower_ts |>features(difference(KWH, lag=12), unitroot_kpss) |> knitr::kable()```### Data ImputationTo impute the missing values, we can use the mean of the same month the 2 years prior. Given the right-skewed nature of the histogram, we can perform a box-cox transformation to reduce the skewness.```{r}set.seed(2024)imputejul <- Filepower_ts |>filter(Date %in%c("2008 Jul", "2009 Jul"))imputesep <- Filepower_ts |>filter(Date %in%c("2006 Sep", "2007 Sep")) # imputeFilepower_ts <- Filepower_ts |>mutate(KWH =ifelse(Date ==yearmonth("2010 Jul"), mean(imputejul$KWH), KWH),KWH =ifelse(Date ==yearmonth("2008 Sep"), mean(imputesep$KWH), KWH))ggplot(Filepower_ts, aes(x = KWH)) +geom_histogram() +labs(title ="Histogram of KWH")Filepower_ts %>%plot_ly(x=~year(Date)) %>%add_lines(y=~KWH) %>%layout(title ="Power Usage",xaxis =list(title="Date"),yaxis =list(title="KWH"))Filepower_ts |>autoplot(KWH) +labs(title ="Lineplot of Power Usage", x ="Date") +scale_y_continuous(labels=scales::comma)knitr::kable(miss_var_summary(Filepower_ts), caption ='Missing data',format="html", table.attr="style='width:50%;'") %>% kableExtra::kable_styling()```### ModelsIn light of the observed seasonality,we can try some seasonal methods such as Arima. Here we can conver the dataframe into time series and see the ACF plot. The spikes show outside of the range, indicates it is not white noise and not-normal.```{r}power_ts <-ts(Filepower_ts$KWH, start=1998, frequency =12)Acf(power_ts)ggtsdisplay(power_ts, main="Time series KWH before transformation")Filepower_ts%>%ACF(KWH) %>%autoplot() +labs(title="KWH consumption before transformation ACF")```The aim now is to find an appropriate ARIMA model based on the ACF and PACF. Both the ACF and PACF plots are sinusoidal decaying and there is a large significant spike at lag 12. There are smaller significant spikes.We can use auto.arima() function to determine the appropriate seasonal AR and MA models. Also I have used lambda to see if we can tranform the data to minimize the variance.#### Train Models```{r}set.seed(2024)Filepower_ts |>gg_tsdisplay(KWH) +labs(title ="Power Usage")Filepower_ts |>gg_tsdisplay(difference(KWH, lag=12), plot_type="partial") +labs(title ="Power Usage: Seasonally Differenced")Filepower_ts|>features(difference(KWH, lag=12), unitroot_kpss) |>kable()Filepower_ts |>gg_subseries(KWH) +labs(title ="Power Usage")lambda <- Filepower_ts |>features(KWH, features = guerrero) |>pull(lambda_guerrero)Filepower_ts |>autoplot(box_cox(KWH, lambda)) +labs(y ="",title = latex2exp::TeX(paste0("Transformed KWH $\\lambda$ = ",round(lambda,2))))Filepower_ts |>model(classical_decomposition(KWH, type ="additive") ) |>components() |>autoplot() +labs(title ="Classical additive decomposition of total KWH")Filepower_ts %>%model(STL(KWH ~season(window ="periodic"),robust =TRUE)) |>components() |>autoplot()```#### Predictions and EvaluationHere I have tried some seasonal methods such as holt winter and damp holt winter. I have also plotted the histogram showing that for the large part the models show a normal distribution bell curve on the residual, however, there is one large spike on the ACF. We can proceed to look at ARIMA to see how it does by reviewing the AICc value, we can proceed with the lowest AICc on the ARIMA model and write out our forecast.```{r}Filepower_ts %>%auto.arima(lambda = lambda, biasadj =TRUE)Filepower_ts |>model(holt =ETS(KWH ~error("A") +trend("A") +season("A")),damp =ETS(KWH ~error("M") +trend("Ad") +season("M")) ) |>forecast(h =24) |>autoplot(Filepower_ts, level =NULL) +labs(title ="Power generation",y ="KWH") +guides(colour =guide_legend(title ="Forecast holt winters and damp holt winter"))count_gaps(Filepower_ts)t <- Filepower_ts |>model(holt =ETS(KWH ~error("A") +trend("A") +season("A")),damp =ETS(KWH ~error("M") +trend("Ad") +season("M")) )t |>pivot_longer(everything(), names_to ="Model name",values_to ="Orders")t |> dplyr::select(holt) |>gg_tsresiduals(lag=24)t |> dplyr::select(damp ) |>gg_tsresiduals(lag=24)```#### ARIMA modelsThe auto arima model \<ARIMA(0,0,1)(2,1,0)\[12\] w/ drift\> seems to have the lowest AICc value, hence we will select this model and store our forecast. Additionally, the residuals on this model seem better, however, that the auto arima model and the scores of AICc and BIC are lower as well. ARIMA(0,0,1)(0,1,1)\[12\] with drift.```{r}power_mods <-Filepower_ts |>filter(year(Date) <2012) |>model(auto =ARIMA(KWH),arima000011 =ARIMA(KWH ~0+pdq(0,0,0) +PDQ(0,1,1, period=12)),arima000110 =ARIMA(KWH ~0+pdq(0,0,0) +PDQ(1,1,0, period=12)),arima001011 =ARIMA(KWH ~0+pdq(0,0,1) +PDQ(0,1,1, period=12)),arima001110 =ARIMA(KWH ~0+pdq(0,0,1) +PDQ(1,1,0, period=12)),arima001011drift =ARIMA(KWH ~1+pdq(0,0,1) +PDQ(0,1,1, period=12)),arima001110drift =ARIMA(KWH ~1+pdq(0,0,1) +PDQ(1,1,0, period=12)))power_mods |>pivot_longer(everything(), names_to ="Model name",values_to ="Orders")report(power_mods) |> dplyr::select(.model:BIC) |> knitr::kable()#power_mods |> summary(auto)power_mods |>forecast(h="3 years") |>autoplot(Filepower_ts |>filter(year(Date) >2010), level=NULL) +labs(title ="KWH Forecast Models ARIMA") +scale_y_continuous(labels=scales::comma)power_mods |> dplyr::select(auto) |>gg_tsresiduals() +labs(title ="Auto ARIMA")power_mods |> dplyr::select(arima001011drift) |>gg_tsresiduals() +labs(title ="ARIMA(0,0,1)(0,1,1)[12] with drift")fc_2014 <- power_mods |>dplyr::select(auto) |>forecast(h="3 years") |>filter(year(Date) ==2014)fc_2014a <- power_mods |>dplyr::select(arima001011drift) |>forecast(h="3 years") |>filter(year(Date) ==2014)fc_2014b <- power_mods |>dplyr::select(arima001011drift) |>forecast(h="3 years") |>filter(year(Date) ==2014)fc_2014fc_2014a dataframe1 <-data.frame (fc_2014a)```#### Write final forecast to a fileIn conclusion :From the book we looked at several models after checking seasonality, stationarity,looked at residuals, ran holt winter, ARIMA models and finally made ARIMA model selection and wrote the forecast to the file.Definition from the book."Transformations such as logarithms can help to stabilise the variance of a time series. Differencing can help stabilise the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time."```{r}date =seq(as.Date('2014-01-01'), as.Date('2014-12-01'), by ='month') %>%format('%Y-%b')openxlsx::write.xlsx(data.frame(YYYY.MM = date, KWH = fc_2014a$.mean), "forecasts_Banu.xlsx")```## Part C – BonusPart C consists of two data sets. These are simple 2 columns sets, however they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (example of what this looks like, follows). Note for multiple recordings within an hour, take the mean. Then to determine if the data is stationary and can it be forecast. If so, provide a week forward forecast and present results via Rpubs and .rmd and the forecast in an Excel readable file. ```{r}# Load the Excel datapipe1_df <- readxl::read_excel('C:/Users/Banu/Documents/RScriptfiles/DATA 624/PROJECT 1/Waterflow_Pipe1.xlsx')colnames(pipe1_df) <-c('DATE', 'WaterFlow')pipe2_df <- readxl::read_excel('C:/Users/Banu/Documents/RScriptfiles/DATA 624/PROJECT 1/Waterflow_Pipe2.xlsx')colnames(pipe2_df) <-c('DATE', 'WaterFlow')glimpse(pipe1_df)```Let's format the Date.Time columns to actual date types.```{r}waterpipe1_df <- pipe1_df |>mutate(Date.Time =as.POSIXct(pipe1_df$DATE *86400, origin ="1899-12-30", tz="GMT"))waterpipe2_df <- pipe2_df |>mutate(Date.Time =as.POSIXct(pipe2_df$DATE *86400, origin ="1899-12-30", tz="GMT"))```Look at missing data , aggregate the mean or average flow and look at hourly data.```{r}head(waterpipe1_df)head(waterpipe2_df)knitr::kable(miss_var_summary(waterpipe1_df), caption ='Pipe 1 Missing Data',format="html", table.attr="style='width:60%;'") %>% kableExtra::kable_styling()knitr::kable(miss_var_summary(waterpipe2_df), caption ='Pipe 1 Missing Data',format="html", table.attr="style='width:60%;'") %>% kableExtra::kable_styling()hourly_pipe1 <- waterpipe1_df |>mutate(Date.Time =floor_date(Date.Time, "hour")) |>group_by(Date.Time) |>summarize(avg_flow =mean(WaterFlow, na.rm=T)) |>as_tsibble(index=Date.Time)hourly_pipe2 <- waterpipe2_df |>mutate(Date.Time =floor_date(Date.Time, "hour")) |>group_by(Date.Time) |>summarize(avg_flow =mean(WaterFlow, na.rm=T)) |>as_tsibble(index=Date.Time)hourly_pipe1 |>features(avg_flow, unitroot_kpss) |> knitr::kable()# hourly_pipe1 |># ACF(avg_flow) |># autoplot() ```### Check for gaps and interpolateThis time series has gaps or missing values, therefore we will use na_interpolation to fill gaps and plot the ACF. The ACF shows one significant spike.```{r}count_gaps(hourly_pipe1)count_gaps(hourly_pipe2)pipes1_tsb <- hourly_pipe1 %>%fill_gaps(.full=TRUE) %>%na_interpolation(option="spline")pipes2_tsb <- hourly_pipe2%>%fill_gaps(.full=TRUE) %>%na_interpolation(option="spline")pipes1_tsb |>ACF(avg_flow) |>autoplot() pipes2_tsb |>ACF(avg_flow) |>autoplot() ```### Plot pipe1```{r}plot1 <- hourly_pipe1 |>fill_gaps() |>fill(avg_flow, .direction="down") |>autoplot(avg_flow) +labs(title ="Pipe 1: Hourly Water Flow", x ="Date", y ="Average Water Flow")plot2 <- hourly_pipe1 |>fill_gaps() |>fill(avg_flow, .direction="down") |>ACF(avg_flow) |>autoplot() +labs(title ="ACF")plot_grid(plot1, plot2, ncol=1)```### Plot pipe2```{r}plot1 <- hourly_pipe2 |>fill_gaps() |>fill(avg_flow, .direction="down") |>autoplot(avg_flow) +labs(title ="Pipe 2: Hourly Water Flow", x ="Date", y ="Average Water Flow")plot2 <- hourly_pipe2 |>fill_gaps() |>fill(avg_flow, .direction="down") |>ACF(avg_flow) |>autoplot() +labs(title ="ACF")plot_grid(plot1, plot2, ncol=1)```### Check StationarityI have checked ome ways to check differencing and then looking at the features to see the effects.From the book: "Differencing can help stabilise the mean of a time series by removing changes in the level of a time series, and therefore eliminating (or reducing) trend and seasonality.A stationary time series is one whose properties do not depend on the time at which the series is observed.17 Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. On the other hand, a white noise series is stationary — it does not matter when you observe it, it should look much the same at any point in time."```{r}pipes2_tsb |>ACF(difference(avg_flow)) |>autoplot() +labs(subtitle ="Hourly Water Flow Pipe 2")pipes2_tsb |>features(avg_flow, unitroot_ndiffs)pipes2_tsb |>mutate(diff_avg_flow =difference(avg_flow)) |>features(diff_avg_flow, unitroot_kpss)t<- pipes2_tsb |>mutate(diff_avg_flow =difference(avg_flow)) t %>%gg_tsdisplay(diff_avg_flow)```### Boxcox,ETS,SNAIVEThis time, the p-value is reported as 0.1 (and so it could be larger than that). We can conclude that the differenced data appear stationary.```{r}lambda_5 <- pipes2_tsb %>%features(avg_flow, features = guerrero) %>%pull(lambda_guerrero)fit_atm_4_a <- pipes2_tsb %>%model(SNAIVE =SNAIVE(box_cox(avg_flow, lambda_5)),ETS =ETS((box_cox(avg_flow, lambda_4) ~error("A") +trend("A") +season("A"))))fit_atm_4_b <- pipes2_tsb%>%model(ETS =ETS((box_cox(avg_flow, lambda_4) ~error("A") +trend("A") +season("A"))))fit_atm_4_c <- pipes2_tsb %>%model(SNAIVE =SNAIVE(box_cox(avg_flow, lambda_4)))fc_atm_4 <- fit_atm_4_a %>%forecast(h="1 week")fc_atm_4%>%autoplot(pipes2_tsb , level =80) +labs(y ="Hourly Water Flow",title ="Hourly Water Flow Pipe 2 with boxcox and SNAIVE, ETS")fit_atm_4_b |>gg_tsresiduals() +labs(title ="Hourly Water Flow boxcox ETS")fit_atm_4_c |>gg_tsresiduals() +labs(title ="Hourly Water Flow boxcox SNAIVE")fc_atm_4b <- fit_atm_4_b %>%forecast(h ="1 week")fc_atm_4c <- fit_atm_4_c %>%forecast(h="1 week")fc_atm_4b%>%autoplot(pipes2_tsb , level =80) +labs(y ="Hourly Water Flow Pipe 2 with boxcox and ETS",title ="Hourly Water Flow Pipe 2 with boxcox and ETS")fc_atm_4c%>%autoplot(pipes2_tsb , level =80) +labs(y ="Hourly Water Flow Pipe 2 with boxcox SNAIVE",title ="Hourly Water Flow Pipe 2 with boxcox and SNAIVE")```### Forecast ARIMA```{r}pipe1_mod <- hourly_pipe1 |>fill_gaps() |>fill(avg_flow, .direction="down") |>model(ARIMA(avg_flow))report(pipe1_mod)``````{r}pipe2_mod <- hourly_pipe2 |>fill_gaps() |>fill(avg_flow, .direction="down") |>model(ARIMA(avg_flow))report(pipe2_mod)``````{r}pipe1_fc <- pipe1_mod |>forecast(h="1 week")pipe1_fc |>autoplot(hourly_pipe1) +labs(title ="Pipe 1 Week Forecast arima", y ="Average Water Flow", x ="Date")``````{r}pipe2_fc <- pipe2_mod |>forecast(h="1 week")pipe2_fc |>autoplot(hourly_pipe2) +labs(title ="Pipe 2 Week Forecast arima", y ="Average Water Flow", x ="Date")``````{r}openxlsx::write.xlsx(data.frame(pipe1_fc), "forecasts_ATM_Banu_pipe1.xlsx")openxlsx::write.xlsx(data.frame(pipe2_fc), "forecasts_ATM_Banu_pipe2.xlsx")```