All libraries needed for the Project

library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.0
## v ggplot2   3.5.1     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.1
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.3.3
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## -- Attaching packages -------------------------------------------- fpp3 1.0.0 --
## v tsibble     1.1.5     v fable       0.3.4
## v tsibbledata 0.4.1     v fabletools  0.4.2
## v feasts      0.3.2
## Warning: package 'tsibble' was built under R version 4.3.3
## Warning: package 'tsibbledata' was built under R version 4.3.3
## Warning: package 'feasts' was built under R version 4.3.3
## Warning: package 'fabletools' was built under R version 4.3.3
## Warning: package 'fable' was built under R version 4.3.3
## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x gridExtra::combine() masks dplyr::combine()
## x lubridate::date()    masks base::date()
## x dplyr::filter()      masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval()  masks lubridate::interval()
## x dplyr::lag()         masks stats::lag()
## x tsibble::setdiff()   masks base::setdiff()
## x tsibble::union()     masks base::union()
library(readxl)
## Warning: package 'readxl' was built under R version 4.3.3
library(imputeTS)
## Warning: package 'imputeTS' was built under R version 4.3.3

#Project 1 Description

This project consists of three parts (two required and one bonus) and is worth 15% of your grade. The project is due at 11:59 PM on Sunday, Apr 11. I will accept late submissions, with a penalty, until the meetup after that, when we review some projects.

#Part A – ATM Forecast, ATM624Data.xlsx

In Part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that, it is straightforward. I am being somewhat ambiguous on purpose to give this a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also, please submit the forecast in an Excel-readable file.

#PART A

##Data Acquisition

# reading in the ATM Excel file from the project folder
atm_df <- read_excel("ATM624Data.xlsx")

##Data Preprocessing

#The "Date" field is in the wrong format.Will convert it into the correct date format
atm_df$DATE<-as.Date(atm_df$DATE, origin = "1899-12-30")
atm_df <- atm_df %>%
  mutate(DATE = as_date(DATE))

#The "Cash" field will be converted to integer,since we only distribute whole dollar amount into the ATM
atm_df <- atm_df %>%
  mutate(Cash = as.numeric(Cash))

# Impute missing values using mean imputation
# (note: this uses the overall mean pooled across all four ATMs)
atm_df <- atm_df %>%
  mutate(Cash = replace_na(Cash, mean(Cash, na.rm = TRUE)))
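
One caveat: the mean above is pooled across all four ATMs, whose cash levels differ considerably. A per-ATM imputation would be more faithful to each machine; a minimal sketch of that alternative (not used here, so all results below reflect the pooled mean):

# Alternative (not run): impute each ATM's missing Cash with that ATM's own mean,
# so high-volume machines do not inflate imputations for low-volume ones
atm_df_per_atm <- atm_df %>%
  group_by(ATM) %>%
  mutate(Cash = replace_na(Cash, mean(Cash, na.rm = TRUE))) %>%
  ungroup()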

##Data Exploration

Plots of the daily cash withdrawals series for each ATM

#ATM 1
ATM_1 <- atm_df %>%
  filter(ATM == "ATM1") %>%
  as_tsibble(index = DATE)
atm_1_plot <- autoplot(ATM_1, Cash) +
  labs(title="ATM 1 Daily Withdrawals", y="Hundreds USD")

#ATM 2
ATM_2 <- atm_df %>%
  filter(ATM == "ATM2") %>%
  as_tsibble(index = DATE)
atm_2_plot <- autoplot(ATM_2, Cash) +
  labs(title="ATM 2 Daily Withdrawals", y="Hundreds USD")

#ATM 3
ATM_3 <- atm_df %>%
  filter(ATM == "ATM3") %>%
  as_tsibble(index = DATE)
atm_3_plot <- autoplot(ATM_3, Cash) +
  labs(title="ATM 3 Daily Withdrawals", y="Hundreds USD")

#ATM 4
ATM_4 <- atm_df %>%
  filter(ATM == "ATM4") %>%
  as_tsibble(index = DATE)
atm_4_plot <- autoplot(ATM_4, Cash) +
  labs(title="ATM 4 Daily Withdrawals", y="Hundreds USD")

grid.arrange(atm_1_plot, atm_2_plot, atm_3_plot, atm_4_plot, nrow = 4)

Looking at the time series of ATM 1 and ATM 2, the fluctuations/variance remain fairly consistent over time, and no trend is apparent; at this daily zoom level any weekly seasonality is hard to see by eye. ATM 3 shows essentially no activity until the very end of the series. ATM 4, however, has one enormous spike (an outlier) between Jan 2010 and April 2010.
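
Because a weekly cycle is easy to miss in a dense daily plot, an autocorrelation check is a useful complement. A minimal sketch for ATM 1 (the same call works for the other machines); spikes at lags 7, 14, 21 and 28 would indicate weekly seasonality:

# ACF of ATM 1 daily withdrawals; spikes at multiples of 7 suggest a weekly cycle
ATM_1 %>%
  ACF(Cash, lag_max = 28) %>%
  autoplot() +
  labs(title = "ATM 1 Daily Withdrawals: ACF")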

Applying a Box-Cox transformation to each ATM's daily cash withdrawals series to normalize the data and stabilize the variance

# For simplicity, I will now apply a Box-Cox transformation to each of the ATM series

# retrieve the ATM 1 lambda (Guerrero method)
lambda_atm_1 <- ATM_1 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)
# plot ATM 1
ATM_1_Transform <- ATM_1 %>%
  autoplot(box_cox(Cash, lambda_atm_1)) +
  labs(title="Transformed: ATM 1 Daily Withdrawals", y="Hundreds USD")

# retrieve the ATM 2 lambda
lambda_atm_2 <- ATM_2 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)
# plot ATM 2
ATM_2_Transform <- ATM_2 %>%
  autoplot(box_cox(Cash, lambda_atm_2)) +
  labs(title="Transformed: ATM 2 Daily Withdrawals", y="Hundreds USD")

# retrieve the ATM 3 lambda
lambda_atm_3 <- ATM_3 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)
## Warning in optimise(lambda_coef_var, c(lower, upper), x = x, .period =
## max(.period, : NA/Inf replaced by maximum positive value

## Warning in optimise(lambda_coef_var, c(lower, upper), x = x, .period =
## max(.period, : NA/Inf replaced by maximum positive value
# plot ATM 3 (the warnings above likely stem from ATM 3 being almost entirely
# zeros, which makes the Guerrero optimisation degenerate)
ATM_3_Transform <- ATM_3 %>%
  autoplot(box_cox(Cash, lambda_atm_3)) +
  labs(title="Transformed: ATM 3 Daily Withdrawals", y="Hundreds USD")

# retrieve the ATM 4 lambda
lambda_atm_4 <- ATM_4 %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)
# plot ATM 4
ATM_4_Transform <- ATM_4 %>%
  autoplot(box_cox(Cash, lambda_atm_4)) +
  labs(title="Transformed: ATM 4 Daily Withdrawals", y="Hundreds USD")

grid.arrange(ATM_1_Transform, ATM_2_Transform, ATM_3_Transform, ATM_4_Transform, nrow = 4)

We see that the variance of the data has been stabilized: the spike (outlier) between Jan 2010 and April 2010 in ATM 4 is far less pronounced after the transformation.

##Model Building

I will now examine a decomposition of each ATM series to see whether there is strong seasonality, in which case seasonal differencing will be needed.
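
As a quick numeric cross-check before the decompositions, the unitroot_nsdiffs feature from feasts suggests how many seasonal differences a series needs. A minimal sketch for ATM 1 (a result of 0 means no seasonal difference is required; the same call works for the other ATMs):

# Suggested number of seasonal differences for ATM 1
ATM_1 %>%
  features(Cash, unitroot_nsdiffs)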

# ATM 1 Decomposition
ATM_1 %>%
  model(classical_decomposition(box_cox(Cash, lambda_atm_1), type = "additive")) %>%
  components() %>%
  autoplot() +
  labs(title = "Decomposition of ATM1")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

# ATM 2 Decomposition
ATM_2 %>%
  model(classical_decomposition(box_cox(Cash, lambda_atm_2), type = "additive")) %>%
  components() %>%
  autoplot() +
  labs(title = "Decomposition of ATM2")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

# ATM 3 Decomposition
ATM_3 %>%
  model(classical_decomposition(box_cox(Cash, lambda_atm_3), type = "additive")) %>%
  components() %>%
  autoplot() +
  labs(title = "Decomposition of ATM3")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

# ATM 4 Decomposition
ATM_4 %>%
  model(classical_decomposition(box_cox(Cash, lambda_atm_4), type = "additive")) %>%
  components() %>%
  autoplot() +
  labs(title = "Decomposition of ATM4")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

After performing the decompositions, ATM 1, ATM 2 and ATM 4 each show repeating weekly peaks and valleys of fairly consistent magnitude. ATM 3 is almost entirely zeros, so its decomposition carries little information.

I will now fit ETS, seasonal naive (SNAIVE) and ARIMA models on all 4 ATM series and compare their metrics to see which model performs best for each ATM. Every model is fit on the Box-Cox-transformed series.

                             #ATM 1
#Fitting the Models
ATM1_Models <- ATM_1 %>%
  model(
    ATM_ETS = ETS(box_cox(Cash, lambda_atm_1)),
    ATM_ARIMA = ARIMA(box_cox(Cash, lambda_atm_1)),
    ATM_SNaive = SNAIVE(box_cox(Cash, lambda_atm_1)))

#Running the metrics
left_join(glance(ATM1_Models) %>% select(.model:BIC), 
          accuracy(ATM1_Models) %>% select(.model, RMSE)) %>%
  arrange(RMSE)
## Joining with `by = join_by(.model)`
## # A tibble: 3 x 7
##   .model     sigma2 log_lik   AIC  AICc   BIC  RMSE
##   <chr>       <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM_ARIMA    1.45   -575. 1159. 1159. 1174.  25.4
## 2 ATM_ETS      1.49  -1145. 2309. 2310. 2348.  25.5
## 3 ATM_SNaive   2.05     NA    NA    NA    NA   28.3
                             #ATM 2
#Fitting the Models
ATM2_Models <- ATM_2 %>%
  model(
    ATM_ETS = ETS(box_cox(Cash, lambda_atm_2)),
    ATM_ARIMA = ARIMA(box_cox(Cash, lambda_atm_2)),
    ATM_SNaive = SNAIVE(box_cox(Cash, lambda_atm_2)))

#Running the metrics
left_join(glance(ATM2_Models) %>% select(.model:BIC), 
          accuracy(ATM2_Models) %>% select(.model, RMSE)) %>%
  arrange(RMSE)
## Joining with `by = join_by(.model)`
## # A tibble: 3 x 7
##   .model     sigma2 log_lik   AIC  AICc   BIC  RMSE
##   <chr>       <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM_ARIMA    57.2  -1232. 2476. 2476. 2500.  26.5
## 2 ATM_ETS      59.9  -1819. 3658. 3659. 3697.  27.3
## 3 ATM_SNaive   84.9     NA    NA    NA    NA   33.1
                           #ATM 3
#Fitting the Models
ATM3_Models <- ATM_3 %>%
  model(
    ATM_ETS = ETS(box_cox(Cash, lambda_atm_3)),
    ATM_ARIMA = ARIMA(box_cox(Cash, lambda_atm_3)),
    ATM_SNaive = SNAIVE(box_cox(Cash, lambda_atm_3)))

#Running the metrics
left_join(glance(ATM3_Models) %>% select(.model:BIC), 
          accuracy(ATM3_Models) %>% select(.model, RMSE)) %>%
  arrange(RMSE)
## Joining with `by = join_by(.model)`
## # A tibble: 3 x 7
##   .model     sigma2 log_lik   AIC  AICc   BIC  RMSE
##   <chr>       <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM_ARIMA    382.  -1603. 3213. 3213. 3224.  5.02
## 2 ATM_ETS      383.  -2161. 4329. 4329. 4340.  5.03
## 3 ATM_SNaive   914.     NA    NA    NA    NA   8.04
                         #ATM 4
#Fitting the Models
ATM4_Models <- ATM_4 %>%
  model(
    ATM_ETS = ETS(box_cox(Cash, lambda_atm_4)),
    ATM_ARIMA = ARIMA(box_cox(Cash, lambda_atm_4)),
    ATM_SNaive = SNAIVE(box_cox(Cash, lambda_atm_4)))

#Running the metrics
left_join(glance(ATM4_Models) %>% select(.model:BIC), 
          accuracy(ATM4_Models) %>% select(.model, RMSE)) %>%
  arrange(RMSE)
## Joining with `by = join_by(.model)`
## # A tibble: 3 x 7
##   .model     sigma2 log_lik   AIC  AICc   BIC  RMSE
##   <chr>       <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ATM_ETS     0.807  -1033. 2086. 2087. 2125.  665.
## 2 ATM_ARIMA   0.842   -486.  979.  979.  995.  679.
## 3 ATM_SNaive  1.32      NA    NA    NA    NA   896.

After evaluating the results of the three models, the ARIMA model has the lowest AIC, AICc and BIC for every ATM, and the lowest RMSE for ATMs 1-3 (for ATM 4 its RMSE is a close second to ETS). Information criteria are strictly comparable only within a model class, but taken together with RMSE these results indicate that ARIMA strikes the best balance between fitting the data and not overfitting. Thus, I will choose ARIMA to make my forecasts.
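
Before forecasting, it is worth checking which orders the automatic ARIMA search actually selected. A minimal sketch for ATM 1 (report() prints the chosen specification and coefficients; repeat with ATM2_Models, etc., for the other machines):

# Print the automatically selected ARIMA specification for ATM 1
ATM1_Models %>%
  select(ATM_ARIMA) %>%
  report()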

I will now fit the ARIMA model for each of the four ATMs and examine the residuals.

                             #ATM 1
#Fitting the Model
ATM_1_ARIMA <- ATM_1 %>%
  model(ARIMA(box_cox(Cash, lambda_atm_1)))

#getting the residuals
ATM_1_ARIMA %>%
  gg_tsresiduals()

                             #ATM 2
#Fitting the Model
ATM_2_ARIMA <- ATM_2 %>%
  model(ARIMA(box_cox(Cash, lambda_atm_2)))

#getting the residuals
ATM_2_ARIMA %>%
  gg_tsresiduals()

                            #ATM 3
#Fitting the Model
ATM_3_ARIMA <- ATM_3 %>%
  model(ARIMA(box_cox(Cash, lambda_atm_3)))

#getting the residuals
ATM_3_ARIMA %>%
  gg_tsresiduals()

                           #ATM 4

#Fitting the Model
ATM_4_ARIMA <- ATM_4 %>%
  model(ARIMA(box_cox(Cash, lambda_atm_4)))

#getting the residuals
ATM_4_ARIMA %>%
  gg_tsresiduals()

From the residual plots above, for ATM 1, ATM 2 and ATM 4 the residuals remain relatively consistent over time, i.e. the spread does not change as time progresses, and their distributions look approximately normal. ATM 3 has too few non-zero observations for its residual diagnostics to be meaningful.
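
Beyond eyeballing the plots, a Ljung-Box test on the innovation residuals gives a formal whiteness check. A minimal sketch for ATM 1, using lag 14 (two weekly cycles) as one reasonable choice; a large p-value is consistent with white-noise residuals:

# Ljung-Box test on the ATM 1 ARIMA residuals; a large lb_pvalue suggests white noise
ATM_1_ARIMA %>%
  augment() %>%
  features(.innov, ljung_box, lag = 14)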

##Forecasting

Forecast for May 2010 (31 days)

                             #ATM 1
ATM_1_ARIMA_forecast <- ATM_1_ARIMA %>% forecast(h=31)
                            #ATM 2
ATM_2_ARIMA_forecast <- ATM_2_ARIMA %>% forecast(h=31)
                           #ATM 3 
ATM_3_ARIMA_forecast <- ATM_3_ARIMA %>% forecast(h=31)
                           #ATM 4
ATM_4_ARIMA_forecast <- ATM_4_ARIMA %>% forecast(h=31)

#Writing the forecast of each ATM to csv
write.csv(ATM_1_ARIMA_forecast,"ATM1 Forecast.csv")
write.csv(ATM_2_ARIMA_forecast,"ATM2 Forecast.csv")
write.csv(ATM_3_ARIMA_forecast,"ATM3 Forecast.csv")
write.csv(ATM_4_ARIMA_forecast,"ATM4 Forecast.csv")
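
Alternatively, the four forecasts can be bundled into one Excel-readable file. A minimal sketch, assuming the standard fable forecast columns (DATE and the point forecast .mean); tidy_fc is a small hypothetical helper introduced here for illustration:

# helper: keep only the date and the point forecast, tagging the ATM name
tidy_fc <- function(fc, atm) {
  as_tibble(fc) %>%
    mutate(ATM = atm) %>%
    select(ATM, DATE, Cash = .mean)
}

# stack the four forecasts and write a single combined CSV
all_atm_forecasts <- bind_rows(
  tidy_fc(ATM_1_ARIMA_forecast, "ATM1"),
  tidy_fc(ATM_2_ARIMA_forecast, "ATM2"),
  tidy_fc(ATM_3_ARIMA_forecast, "ATM3"),
  tidy_fc(ATM_4_ARIMA_forecast, "ATM4")
)
write.csv(all_atm_forecasts, "ATM Forecasts Combined.csv", row.names = FALSE)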

#Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

#PART B

##Data Acquisition and Preprocessing

I am reading in the residential power usage Excel file from the project folder and converting the data frame into a tsibble object for time series analysis. For that, I need to convert the "YYYY-MMM" column into a date format; I will also rename it to "Month".

rcf_df <- read_excel("ResidentialCustomerForecastLoad-624.xlsx") %>%
  # renaming column name
  rename(Month = 'YYYY-MMM') %>%
  # converting into date format
  mutate(Month = yearmonth(Month)) %>%
  # converting to tsibble
  as_tsibble(index = Month) 

##Data Exploration

#Analyze the usage by year for residential power usage
rcf_ts <- ts(rcf_df[, "KWH"], start = c(1998, 1), frequency = 12)
ggseasonplot(rcf_ts) + ggtitle("Residential Power Usage by Year")

#Retrieve summary to find the outlier value, which would be the lowest one in the plot.
summary(rcf_df$KWH)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##   770523  5429912  6283324  6502475  7620524 10655730        1

From the plot above it is clear that the data has yearly seasonality. We also see a big dip (an outlier) in July 2010, with a value of 770523.
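
To pin down exactly when that minimum occurs, the row can be filtered out directly; a minimal sketch:

# Locate the observation holding the minimum KWH (the suspected outlier)
rcf_df %>%
  filter(KWH == min(KWH, na.rm = TRUE))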

First, I will plot the power usage data with the outlier included. I will then replace the outlier (the value 770523) with NA. Since the data has seasonality, I will fit an ARIMA model to the series containing these NA (missing) values and use interpolation to estimate them from the surrounding data. Finally, I will plot the residential power usage data without the outlier and observe the changes.

# Residential Power Usage plot before outlier is removed
rcf_df %>%
  autoplot(KWH) +
  labs(title = "Residential Power Usage with outlier")

# remove the outlier by replacing it with NA
rcf_df_remove_outlier <- rcf_df %>%
  mutate(KWH = replace(KWH, KWH == 770523, NA))

# Fit an ARIMA model and interpolate the missing values
# (the replaced outlier plus the one originally missing observation)
rcf_df <- rcf_df_remove_outlier %>%
  model(ARIMA(KWH)) %>%
  interpolate(rcf_df_remove_outlier)

# Residential Power Usage plot after the outlier is removed
rcf_df %>%
  autoplot(KWH) +
  ggtitle("Residential Power Usage without Outlier")

Comparing the residential power usage plots before and after the outlier is removed, the data reads much more clearly, and the interpolated values now follow the surrounding seasonal pattern where the huge downward spike used to be.

I will now decompose the series to observe the trend and the recurring yearly pattern in power usage

# STL decomposition
rcf_df %>%
  model(STL(KWH ~ season(window = "periodic"), robust = TRUE)) %>%
  components() %>%
  autoplot() +
  labs(title = "STL Decomposition with Outlier removed")

The STL decomposition shows clear seasonality (repeating yearly peaks and valleys). We can also observe an upward trend in residential power usage.
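
That visual impression can be quantified with STL-based strength features; a minimal sketch (trend_strength and seasonal_strength_year close to 1 indicate a strong trend and strong yearly seasonality):

# Quantify trend and yearly seasonal strength (values near 1 are strong)
rcf_df %>%
  features(KWH, feat_stl)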

##Model Building

I will now fit ETS, SNAIVE and ARIMA models on the residential power usage series and compare their metrics. Each model is fit on the Box-Cox-transformed series.

# lambda for the Box-Cox transformation
lambda_rcf <- rcf_df %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

#Fitting the Models
rcf_Models <- rcf_df %>%
  model(
    RCF_ETS = ETS(box_cox(KWH, lambda_rcf)),
    RCF_ARIMA = ARIMA(box_cox(KWH, lambda_rcf)),
    RCF_SNaive = SNAIVE(box_cox(KWH, lambda_rcf)))

#Running the metrics
left_join(glance(rcf_Models) %>% select(.model:BIC), 
          accuracy(rcf_Models) %>% select(.model, RMSE)) %>%
  arrange(RMSE)
## Joining with `by = join_by(.model)`
## # A tibble: 3 x 7
##   .model          sigma2 log_lik    AIC   AICc    BIC    RMSE
##   <chr>            <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
## 1 RCF_ARIMA  0.0000141      759. -1504. -1503. -1481. 614466.
## 2 RCF_ETS    0.000000633    577. -1124. -1121. -1075. 624963.
## 3 RCF_SNaive 0.0000219       NA     NA     NA     NA  809762.

From the model comparison above, the ARIMA model has the lowest AIC, AICc and BIC, and also the lowest RMSE. (A negative AIC is not meaningful in itself; what matters is that ARIMA's values are the lowest among the candidates.) Consequently, I will choose the ARIMA model.

I will now fit the ARIMA model

#Fitting the Model
rcf_df_ARIMA <- rcf_df %>%
  model(ARIMA(box_cox(KWH, lambda_rcf)))

#getting the residuals
rcf_df_ARIMA %>%
  gg_tsresiduals()

Looking at the residuals, their distribution appears close to normal, and the points in the residual plot look fairly random over time with no obvious pattern, indicating a good fit. This suggests the ARIMA model captures the most important characteristics of the residential power usage time series.

##Forecasting

Monthly Forecast for 2014

rcf_df_ARIMA_forecast <- forecast(rcf_df_ARIMA, h = 12)
rcf_df_ARIMA_forecast <- data.frame(rcf_df_ARIMA_forecast)

write.csv(rcf_df_ARIMA_forecast, "residential power usage Forecast.csv", row.names = FALSE)
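
As a final sanity check, the 2014 forecast can be plotted against the history; it should continue the seasonal pattern. A minimal sketch (calling forecast() on the mable again, since rcf_df_ARIMA_forecast above was coerced to a plain data frame):

# Plot the 12-month 2014 forecast over the historical series
rcf_df_ARIMA %>%
  forecast(h = 12) %>%
  autoplot(rcf_df) +
  labs(title = "Residential Power Usage: 2014 Forecast", y = "KWH")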