Part A – ATM Forecast, ATM624Data.xlsx
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to make this have a little more business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.
Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx
Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours; the rest is straightforward. Add this to your existing files above.
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.0 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.5
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.3.2
## ✔ lubridate 1.9.3 ✔ fable 0.3.4
## ✔ ggplot2 3.5.1 ✔ fabletools 0.4.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(readxl)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
atm_data<-read_excel("F:\\CUNY masters\\Data 624\\Project1\\ATM624Data.xlsx",col_types = c("date", "text", "numeric"))
#head(atm_data)
#view(atm_data)
str(atm_data)
## tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
## $ DATE: POSIXct[1:1474], format: "2009-05-01" "2009-05-01" ...
## $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
## $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...
#atm_data<-atm_data|>spread(ATM, Cash)
head(atm_data)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dttm> <chr> <dbl>
## 1 2009-05-01 00:00:00 ATM1 96
## 2 2009-05-01 00:00:00 ATM2 107
## 3 2009-05-02 00:00:00 ATM1 82
## 4 2009-05-02 00:00:00 ATM2 89
## 5 2009-05-03 00:00:00 ATM1 85
## 6 2009-05-03 00:00:00 ATM2 90
summary(atm_data)
## DATE ATM Cash
## Min. :2009-05-01 00:00:00.00 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 00:00:00.00 Class :character 1st Qu.: 0.5
## Median :2009-11-01 00:00:00.00 Mode :character Median : 73.0
## Mean :2009-10-31 19:11:48.27 Mean : 155.6
## 3rd Qu.:2010-02-01 00:00:00.00 3rd Qu.: 114.0
## Max. :2010-05-14 00:00:00.00 Max. :10919.8
## NA's :19
# plot box plot
ggplot(atm_data, aes(x = ATM, y = Cash)) +
geom_boxplot()
## Warning: Removed 19 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# plot time series with missing values
ggplot(atm_data, aes(x = DATE, y = Cash)) +
geom_line() +
# Plot missing values as red points at y = 0
geom_point(data = atm_data[is.na(atm_data$Cash), ], aes(x = DATE, y =0, color = "Missing Values"), shape = 3, size = 3) +
scale_color_manual(values = "red", name = "Legend") + # Manually set the color for the legend
labs(title = "ATM Cash Data with Missing Values Highlighted",
x = "Date",
y = "Cash Amount") +
theme_minimal() +
theme(legend.position = "right")
## Warning: Removed 14 rows containing missing values or values outside the scale range
## (`geom_line()`).
# Time series plot for individual ATM
atm_data %>%
#filter(DATE < "2010-05-01", !is.na(ATM)) %>%
ggplot(aes(x = DATE, y = Cash, col = ATM)) +
geom_line(show.legend = FALSE) +
facet_wrap(~ ATM, ncol = 1, scales = "free_y") +
labs(title = "Daily ATM Withdrawals", x = "Date") +
scale_y_continuous("Withdrawals (hundreds)")
## Warning: Removed 14 rows containing missing values or values outside the scale range
## (`geom_line()`).
This dataset is a tibble with 1,474 rows and 3 columns, containing daily cash withdrawal records across four ATMs from May 1, 2009, to May 14, 2010. Each row records the cash withdrawn from a specific ATM on a given date (in hundreds of dollars), with withdrawals ranging from a minimum of 0.0 to a maximum of 10,919.8 and an average of 155.6. Key statistics include a first quartile of 0.5, a median of 73.0, and a third quartile of 114.0. Notably, the Cash column contains 19 missing entries.
Applying the view function to the dataset revealed that the Cash values from May 1, 2010, to May 14, 2010, are all missing, and no other May 2010 dates appear in the dataset; these rows are simply placeholders for the forecast period. Beyond those, five observations (three for ATM1 and two for ATM2) are missing within the historical data, and I will assume these occur completely at random. Consequently, I will filter the dataset to include only dates up to April 30, 2010, and impute the five remaining missing values by interpolation. I will then forecast the cash withdrawal data for May 2010 using models such as ARIMA and ETS.
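As a quick sanity check (a sketch using the long-format data loaded above), the missing values can be tallied by ATM before reshaping:
# Tally missing Cash values by ATM; rows with a missing ATM label are
# the May 2010 placeholder rows for the forecast period
atm_data |>
  filter(is.na(Cash)) |>
  count(ATM)
# expected: ATM1 3, ATM2 2, and 14 rows with no ATM label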
# Make dataset wider by getting individual ATM columns
atm_data<-atm_data|>spread(ATM, Cash)
head(atm_data)
## # A tibble: 6 × 6
## DATE ATM1 ATM2 ATM3 ATM4 `<NA>`
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2009-05-01 00:00:00 96 107 0 777. NA
## 2 2009-05-02 00:00:00 82 89 0 524. NA
## 3 2009-05-03 00:00:00 85 90 0 793. NA
## 4 2009-05-04 00:00:00 90 55 0 908. NA
## 5 2009-05-05 00:00:00 99 79 0 52.8 NA
## 6 2009-05-06 00:00:00 88 19 0 52.2 NA
# Filter data for getting all data before May 01, 2010
atm_data1<-atm_data|>filter(DATE<="2010-04-30")
head(atm_data1)
## # A tibble: 6 × 6
## DATE ATM1 ATM2 ATM3 ATM4 `<NA>`
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2009-05-01 00:00:00 96 107 0 777. NA
## 2 2009-05-02 00:00:00 82 89 0 524. NA
## 3 2009-05-03 00:00:00 85 90 0 793. NA
## 4 2009-05-04 00:00:00 90 55 0 908. NA
## 5 2009-05-05 00:00:00 99 79 0 52.8 NA
## 6 2009-05-06 00:00:00 88 19 0 52.2 NA
# Remove NA column from atm_data1
atm_data1<-atm_data1|>select(-"<NA>")
head(atm_data1)
## # A tibble: 6 × 5
## DATE ATM1 ATM2 ATM3 ATM4
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2009-05-01 00:00:00 96 107 0 777.
## 2 2009-05-02 00:00:00 82 89 0 524.
## 3 2009-05-03 00:00:00 85 90 0 793.
## 4 2009-05-04 00:00:00 90 55 0 908.
## 5 2009-05-05 00:00:00 99 79 0 52.8
## 6 2009-05-06 00:00:00 88 19 0 52.2
# Check missing values again in ATM columns
sapply(atm_data1[, c("ATM1", "ATM2", "ATM3", "ATM4")], function(x) sum(is.na(x)))
## ATM1 ATM2 ATM3 ATM4
## 3 2 0 0
# Produce estimates for NAs and replace with new values in ATM1 and ATM2
atm_data1$ATM1 <- na.interp(atm_data1$ATM1)
atm_data1$ATM2 <- na.interp(atm_data1$ATM2)
# Check missing values again in ATM1 and ATM2 columns
sapply(atm_data1[, c("ATM1", "ATM2")], function(x) sum(is.na(x)))
## ATM1 ATM2
## 0 0
# Identify outlier index and generate a replacement for ATM4
tsoutliers(atm_data1$ATM4)
## $index
## [1] 285
##
## $replacements
## [1] 230.1752
# Replace outlier
atm_data1$ATM4[285] <- tsoutliers(atm_data1$ATM4)$replacements
# Rename the DATE column to Date
atm_data1 <- atm_data1|>rename(Date = DATE)
head(atm_data1)
## # A tibble: 6 × 5
## Date ATM1 ATM2 ATM3 ATM4
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2009-05-01 00:00:00 96 107 0 777.
## 2 2009-05-02 00:00:00 82 89 0 524.
## 3 2009-05-03 00:00:00 85 90 0 793.
## 4 2009-05-04 00:00:00 90 55 0 908.
## 5 2009-05-05 00:00:00 99 79 0 52.8
## 6 2009-05-06 00:00:00 88 19 0 52.2
# Convert atm_data1 to a tsibble
atm_data1<-atm_data1|> mutate(Date = as.Date(Date))|>as_tsibble(index=Date)
head(atm_data1)
## # A tsibble: 6 x 5 [1D]
## Date ATM1 ATM2 ATM3 ATM4
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2009-05-01 96 107 0 777.
## 2 2009-05-02 82 89 0 524.
## 3 2009-05-03 85 90 0 793.
## 4 2009-05-04 90 55 0 908.
## 5 2009-05-05 99 79 0 52.8
## 6 2009-05-06 88 19 0 52.2
#atm_data1|>autoplot()
The data preparation process involves several steps to ready the ATM cash withdrawal dataset for analysis. Initially, the dataset is made wider by spreading the ATM column to create an individual column for each ATM’s cash withdrawals. The data is then filtered to include only entries before May 1, 2010, and the NA column is removed. Missing values in the ATM1 and ATM2 columns are estimated and replaced by interpolation. The outlier in the ATM4 column is identified and corrected. Subsequently, the DATE column is renamed to “Date,” and the dataset is converted to a tsibble format with Date as the index. The prepared dataset will be used for the time series analysis of the cash withdrawal patterns. Plots of the individual ATM cash withdrawals will be generated, and forecasts for each ATM’s data will be produced.
# Plot time series for ATM1
atm_data1|>autoplot(ATM1)
# STL decomposition of ATM1
atm1_decomposed<-atm_data1 |>
model(
STL(ATM1 ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)
) |> components()
autoplot(atm1_decomposed)
The time series plot for ATM1 indicates that the data exhibits seasonality and random fluctuations. STL decomposition reveals a weekly seasonality in the data, while no trend is apparent. The fluctuations toward the end of the remainder part of the series are more pronounced. An ACF plot will be generated to provide a clearer understanding of these features.
Get ACF plot for ATM1 data:
# ACF plot of ATM1 data
atm_data1|>
ACF(ATM1,lag_max = 28) |>
autoplot() + labs(title="ACF of ATM1 Cash Withdrawal")
The ACF plot reveals significant spikes at lag 7, lag 14, lag 21, and lag 28, indicating a weekly seasonality. However, their slow decline implies that the series may exhibit non-stationarity. The number of differences will be checked to address this issue.
Get the number of regular differencing needed:
#lambda1 <- BoxCox.lambda(atm_data1$ATM1)
#atm1_transformed<- BoxCox(atm_data1$ATM1, lambda1)
ndiffs(atm_data1$ATM1)
## [1] 0
#nsdiffs(atm1_transformed)
The number of regular differencings needed for this time series is found to be zero, which is consistent with the absence of a trend in the data.
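The same conclusion can be reached within the fable ecosystem; this is a small sketch using the KPSS-based unit-root features that underlie ndiffs():
# KPSS test and implied number of regular differences; a KPSS p-value
# above 0.05 suggests the series is already stationary
atm_data1 |> features(ATM1, features = list(unitroot_kpss, unitroot_ndiffs))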
Build train data to fit different models suited for capturing seasonality and make forecast for April 2010:
# Create train data using data upto March 2010 by filtering out the April 2010 data
atm1_train <- atm_data1|> filter(Date < "2010-04-01")
# Fit different models
atm1_fit <- atm1_train |>
model(
SNAIVE = SNAIVE(ATM1),
ETS = ETS(ATM1),
ARIMA = ARIMA(ATM1),
`Auto ARIMA` = ARIMA(ATM1, stepwise = FALSE, approx = FALSE)
)
# ATM1 forecasts for April 2010
atm1_forecast <- atm1_fit |>
forecast(h = 30)
# Plot the forecasts for April 2010
atm1_forecast |>
autoplot(atm_data1, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "ATM1 Forecasts",
subtitle = "April 2010") +
xlab("Date") +
ylab("Cash Withdrawalin Hundreds")
A training dataset was created by filtering the data to include only entries up to March 2010, excluding the April 2010 data. Various models were then fitted to this training set, including SNAIVE, ETS, ARIMA, and Auto ARIMA, to capture the seasonality. Forecasts for April 2010 were generated from these models and visualized. The forecasts from all the models appear quite similar to one another, so accuracy metrics such as RMSE will be used to identify the best performing model.
Get accuracy metrics for model evaluation:
accuracy(atm1_forecast,atm_data1) |>select(.model, RMSE:MAPE)
## # A tibble: 4 × 5
## .model RMSE MAE MPE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 12.4 9.71 -83.3 86.4
## 2 Auto ARIMA 12.9 9.75 -98.5 102.
## 3 ETS 11.6 9.11 -61.0 64.7
## 4 SNAIVE 16.0 13.5 -66.2 73.4
The ETS model here has the lowest RMSE value, reflecting the best performing model for the ATM1 dataset. The MAE and MAPE values are also lowest for the ETS model. Therefore, I will use this model to produce the desired forecasts for the month of May 2010.
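Before refitting, it can be helpful to inspect which ETS specification was selected on the training data; a small sketch using report():
# Print the chosen error/trend/season components and parameter estimates
atm1_fit |> select(ETS) |> report()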
Reproduce the best performed ETS model using original dataset and generate forecasts for May 2010:
# Produce the best ETS model again using original dataset
atm1_ets_fit <- atm_data1|>
model(
ETS = ETS(ATM1))
# Produce forecast for May 2010
atm1_ets_forecast <- atm1_ets_fit |>
forecast(h=31)
# Plot the forecasts
atm1_ets_forecast |>
autoplot(atm_data1) +
labs(title = "ATM1 Forecasts",
subtitle = "May 2010",
y = " ATM1 Withdrawals (in Hundreds)")
The best performing ETS model is re-fitted using the original dataset to generate forecasts for May 2010. The forecasts are produced by applying the ETS model to the ATM1 data, resulting in a 31-day forecast. The forecasted values are then visualized in a plot, highlighting the ATM1 withdrawals for May 2010. The forecasted values generated by the model appear to align well with the previous data patterns.
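A residual check (a sketch) can help confirm the model has captured the structure; the innovation residuals should resemble white noise:
# Residual diagnostics for the fitted ETS model
atm1_ets_fit |> gg_tsresiduals()
# Ljung-Box test on the innovation residuals; a large p-value is
# consistent with white noise (lag 14 spans two weekly cycles)
augment(atm1_ets_fit) |> features(.innov, ljung_box, lag = 14)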
Get ATM1 forecasts value and export the values in a CSV file:
# Convert tsibble to data frame
atm1_forecast<- as.data.frame(atm1_ets_forecast)|>
select(Date, .mean)
#head(atm1_forecast)
# Rename column names
atm1_forecast <- atm1_forecast |>
rename(Cash = .mean)
# Rounding cash values
atm1_forecast <- atm1_forecast |>
mutate(Cash= round(Cash))
# See forecasted values
atm1_forecast
## Date Cash
## 1 2010-05-01 87
## 2 2010-05-02 101
## 3 2010-05-03 73
## 4 2010-05-04 6
## 5 2010-05-05 100
## 6 2010-05-06 79
## 7 2010-05-07 85
## 8 2010-05-08 87
## 9 2010-05-09 101
## 10 2010-05-10 73
## 11 2010-05-11 6
## 12 2010-05-12 100
## 13 2010-05-13 79
## 14 2010-05-14 85
## 15 2010-05-15 87
## 16 2010-05-16 101
## 17 2010-05-17 73
## 18 2010-05-18 6
## 19 2010-05-19 100
## 20 2010-05-20 79
## 21 2010-05-21 85
## 22 2010-05-22 87
## 23 2010-05-23 101
## 24 2010-05-24 73
## 25 2010-05-25 6
## 26 2010-05-26 100
## 27 2010-05-27 79
## 28 2010-05-28 85
## 29 2010-05-29 87
## 30 2010-05-30 101
## 31 2010-05-31 73
# Export file as CSV
atm1_forecast |> write.csv("ATM1_forecasts.csv")
The forecasted values for ATM1 are converted from a tsibble to a data frame and the data frame is then exported as a CSV file for future use.
# Time series plot of ATM2
atm_data1|>autoplot(ATM2)
# STL decomposition of ATM2
atm2_decomposed<-atm_data1 |>
model(
STL(ATM2 ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)
) |> components()
autoplot(atm2_decomposed)
The time series plot for ATM2 indicates that the data exhibits seasonality and random fluctuations. STL decomposition reveals a weekly seasonality in the data, while no trend is apparent. The fluctuations toward the end of the trend and the remainder part of the series are more pronounced. An ACF plot will be generated to provide a clearer understanding of these features.
Get ACF plot for ATM2 data:
# ACF plot of ATM2 data
atm_data1|>
ACF(ATM2,lag_max = 28) |>
autoplot() + labs(title="ACF of ATM2 Cash Withdrawal")
The ACF plot of ATM2 data reveals significant spikes at lag 7, lag 14, lag 21, and lag 28, indicating a weekly seasonality. However, their slow decline implies that the series may exhibit non-stationarity. The number of differences will be checked to address this issue.
Get the number of regular differencing needed for ATM2 series:
ndiffs(atm_data1$ATM2)
## [1] 1
One order of differencing is needed to stabilize the data.
Build train data to fit different models suited for capturing seasonality and make forecast for April 2010:
# Add differenced data for ATM2
atm_data1 <- atm_data1|>
mutate(ATM2_diff= difference(ATM2))
# Create train data using data upto March 2010 by filtering out the April 2010 data
atm2_train <- atm_data1|> filter(Date < "2010-04-01")
#view(atm2_train)
# Fit seasonal models without differenced data
atm2_fit1 <- atm2_train |>
model(
SNAIVE = SNAIVE(ATM2),
ETS = ETS(ATM2),
)
# Fit ARIMA models with differenced data
atm2_fit2 <- atm2_train |>
slice(2:335) |>
model(
ARIMA = ARIMA(ATM2_diff),
`Auto ARIMA` = ARIMA(ATM2_diff, stepwise = FALSE, approx = FALSE)
)
# ATM2 forecasts for April 2010 (SNAIVE and ETS models)
atm2_forecast1 <- atm2_fit1 |>
forecast(h = 30)
# ATM2 forecasts for April 2010 (ARIMA models)
atm2_forecast2 <- atm2_fit2 |>
forecast(h = 30)
# Plot the forecasts for April 2010
atm2_forecast1 |>
autoplot(atm_data1, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "ATM2 1st Forecasts",
subtitle = "April 2010") +
xlab("Date") +
ylab("ATM2 Withdrawals in Hundreds")
# Plot the forecasts for April 2010
atm2_forecast2 |>
autoplot(atm_data1, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "ATM2 2nd Forecasts",
subtitle = "April 2010") +
xlab("Date") +
ylab("ATM2 Withdrawals in Hundreds")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
A training dataset is created here by selecting entries only up to March 2010. Two sets of models are fitted: the seasonal SNAIVE and ETS models on the original data, and ARIMA models on the differenced data. Forecasts for April 2010 are generated from both model sets, and the results are displayed in separate plots showing the predicted cash withdrawals for ATM2 during that month.
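As an aside, manually differencing before ARIMA() is not strictly required, since fable’s ARIMA() selects the differencing orders itself using unit-root tests. A sketch of this alternative (atm2_fit_alt is a hypothetical name):
# Fit ARIMA on the original series and let it choose d and D itself
atm2_fit_alt <- atm2_train |> model(ARIMA(ATM2))
atm2_fit_alt |> report()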
# Get accuracy metrics for ATM2 forecasts-1st models
accuracy(atm2_forecast1,atm_data1) |> select(.model, RMSE:MAPE)
## # A tibble: 2 × 5
## .model RMSE MAE MPE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ETS 19.4 14.3 -25.4 57.4
## 2 SNAIVE 26.2 17.8 35.1 45.1
# Get accuracy metrics for ATM2 forecasts-2nd models
accuracy(atm2_forecast2,atm_data1)|> select(.model, RMSE:MAPE)
## # A tibble: 2 × 5
## .model RMSE MAE MPE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 26.0 19.5 206. 244.
## 2 Auto ARIMA 24.0 18.8 201. 255.
The ETS model here has the lowest RMSE value, reflecting the best performing model for the ATM2 dataset. The MAE value is also lowest for the ETS model. Therefore, I will use this model to produce the desired ATM2 forecasts for the month of May 2010.
Reproduce the best performed ETS model using original dataset and generate forecasts for ATM2 for May 2010:
# Produce the best ETS model again using original dataset
atm2_ets_fit <- atm_data1|>
model(
ETS = ETS(ATM2))
# Produce forecast for May 2010
atm2_ets_forecast <- atm2_ets_fit |>
forecast(h=31)
# Plot the forecasts
atm2_ets_forecast |>
autoplot(atm_data1) +
labs(title = "ATM2 Forecasts",
subtitle = "May 2010",
y = " ATM Withdrawals (in Hundreds)")
For the ATM2 data, the best performing ETS model is re-fitted using the original dataset to generate forecasts for May 2010. Applying the ETS model produces a 31-day forecast, which is then visualized in a plot highlighting the ATM2 withdrawals for May 2010. The forecasted values generated by the model appear to align well with the previous data patterns.
Get ATM2 forecasts value and export the values in a CSV file:
# Convert tsibble to data frame
atm2_forecast<- as.data.frame(atm2_ets_forecast)|>
select(Date, .mean)
#head(atm1_forecast)
# Rename column names
atm2_forecast <- atm2_forecast |>
rename(Cash = .mean)
# Rounding cash values
atm2_forecast <- atm2_forecast |>
mutate(Cash= round(Cash))
# See forecasted values
atm2_forecast
## Date Cash
## 1 2010-05-01 68
## 2 2010-05-02 74
## 3 2010-05-03 11
## 4 2010-05-04 2
## 5 2010-05-05 102
## 6 2010-05-06 92
## 7 2010-05-07 69
## 8 2010-05-08 68
## 9 2010-05-09 74
## 10 2010-05-10 11
## 11 2010-05-11 2
## 12 2010-05-12 102
## 13 2010-05-13 92
## 14 2010-05-14 69
## 15 2010-05-15 68
## 16 2010-05-16 74
## 17 2010-05-17 11
## 18 2010-05-18 2
## 19 2010-05-19 102
## 20 2010-05-20 92
## 21 2010-05-21 69
## 22 2010-05-22 68
## 23 2010-05-23 74
## 24 2010-05-24 11
## 25 2010-05-25 2
## 26 2010-05-26 102
## 27 2010-05-27 92
## 28 2010-05-28 69
## 29 2010-05-29 68
## 30 2010-05-30 74
## 31 2010-05-31 11
# Export file as CSV
atm2_forecast |> write.csv("ATM2_forecasts.csv")
The forecasted values for ATM2 are converted from a tsibble to a data frame and the data frame is then exported as a CSV file for future use.
ATM3 has only three non-zero data points (all other days are recorded as zero), which are insufficient for forecasting an entire month or identifying any trend or seasonality. A reasonable approach is to use the average of these three days to forecast the values for May 2010.
# Plot ATM3 data
atm_data1|>
autoplot(ATM3) +
ggtitle("ATM3 Series")
# Fit mean model for ATM3
atm3_fit <- atm_data1 |>
filter(ATM3 != 0) |>
model(MEAN(ATM3))
# Produce Forecasts for 31 days
atm3_mean_forecast <- atm3_fit |>
forecast(h = 31)
# Plot forecast
atm3_mean_forecast|>
autoplot(atm_data1) +
ggtitle("ATM3 Forecasts")
Here, a mean model is fitted to the ATM3 data, excluding zero values to improve the model’s effectiveness. Following this, forecasts for the next 31 days are generated based on the mean model. The results are then visualized in a plot.
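Since MEAN() simply averages the included observations, the forecast level can be verified by hand; a quick sketch:
# Average of the three non-zero ATM3 days; should match the MEAN() forecast
mean(atm_data1$ATM3[atm_data1$ATM3 != 0])
# expected: 87.66667, as in the forecast table below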
Get ATM3 forecasts value and export the values in a CSV file:
# Convert tsibble to data frame
atm3_forecast<- as.data.frame(atm3_mean_forecast)|>
select(Date, .mean)
head(atm3_forecast)
## Date .mean
## 1 2010-05-01 87.66667
## 2 2010-05-02 87.66667
## 3 2010-05-03 87.66667
## 4 2010-05-04 87.66667
## 5 2010-05-05 87.66667
## 6 2010-05-06 87.66667
# Rename column names
atm3_forecast <- atm3_forecast |>
rename(Cash = .mean)
# Rounding cash values
atm3_forecast <- atm3_forecast |>
mutate(Cash= round(Cash))
# See forecasted values
atm3_forecast
## Date Cash
## 1 2010-05-01 88
## 2 2010-05-02 88
## 3 2010-05-03 88
## 4 2010-05-04 88
## 5 2010-05-05 88
## 6 2010-05-06 88
## 7 2010-05-07 88
## 8 2010-05-08 88
## 9 2010-05-09 88
## 10 2010-05-10 88
## 11 2010-05-11 88
## 12 2010-05-12 88
## 13 2010-05-13 88
## 14 2010-05-14 88
## 15 2010-05-15 88
## 16 2010-05-16 88
## 17 2010-05-17 88
## 18 2010-05-18 88
## 19 2010-05-19 88
## 20 2010-05-20 88
## 21 2010-05-21 88
## 22 2010-05-22 88
## 23 2010-05-23 88
## 24 2010-05-24 88
## 25 2010-05-25 88
## 26 2010-05-26 88
## 27 2010-05-27 88
## 28 2010-05-28 88
## 29 2010-05-29 88
## 30 2010-05-30 88
## 31 2010-05-31 88
# Export file as CSV
atm3_forecast |> write.csv("ATM3_forecasts.csv")
The forecasted values for ATM3 are converted from a tsibble to a data frame and the data frame is then exported as a CSV file for future use.
Plot time series for ATM4 and Perform STL decomposition:
# Time series
atm_data1|>autoplot(ATM4)
# STL decomposition
atm4_decomposed<-atm_data1 |>
model(
STL(ATM4 ~ trend(window = 7) +
season(window = "periodic"),
robust = TRUE)
) |> components()
autoplot(atm4_decomposed)
The time series exhibits significantly more variance than the previous series. The STL decomposition reveals that the data shows seasonality, lacks a clear trend, and contains substantial fluctuations with inconsistent variability. The ACF plot will assist in assessing these features, including the trend and seasonality, further.
Get ACF plot for ATM4 data:
# ACF plot of ATM4 data
atm_data1|>
ACF(ATM4,lag_max = 28) |>
autoplot() + labs(title="ACF of ATM4 Cash Withdrawals")
The ACF plot indicates that the data exhibits weekly seasonality with varying magnitudes at lags 7, 14, 21, and 28. Additionally, the ACF values decrease slowly from lag 7 to lag 21, then increase again at lag 28. Moreover, the inconsistent variability suggests that a transformation may be necessary to stabilize the variance. Based on the lambda value, I will apply the required transformation to the data. Also, I will check whether the data needs any differencing before building models.
lambda <- atm_data1 |>
features(ATM4, features = guerrero) |>
pull(lambda_guerrero)
lambda
## [1] 0.4497706
atm_data1 |>
autoplot(sqrt(ATM4)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed ATM4 Withdrawals with $\\lambda$ = ",
round(lambda,2))))
# Add a new squared root transformed column to the original data
atm_data1 <- atm_data1 |>
mutate(ATM4_trans= sqrt(ATM4))
#head(atm_data1)
A lambda value of 0.45 is close to 0.5, suggesting that a square root transformation is appropriate for the data. Consequently, the square root transformation has been applied and a new transformed column has been added to the original dataset.
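The square root is effectively a Box-Cox transformation with lambda = 0.5, because box_cox(y, 0.5) = 2 * (sqrt(y) - 1) is an affine rescaling of sqrt(y); a quick sketch verifying this:
# Confirm that box_cox(y, 0.5) is an affine function of sqrt(y)
y <- atm_data1$ATM4[1:5]
all.equal(box_cox(y, 0.5), 2 * (sqrt(y) - 1)) # should be TRUE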
Get the number of regular differencing needed for ATM4 series:
ndiffs(atm_data1$ATM4)
## [1] 0
ndiffs(atm_data1$ATM4_trans)
## [1] 0
There is no differencing needed for the data.
Build train data to fit different models suited for capturing seasonality and make forecast for April 2010:
# Create train data using data upto March 2010 by filtering out the April 2010 data
atm4_train <- atm_data1|> filter(Date < "2010-04-01")
# Fit different models
atm4_fit <- atm4_train |>
model(
SNAIVE = SNAIVE(ATM4_trans),
ETS = ETS(ATM4_trans),
ARIMA = ARIMA(ATM4_trans),
`Auto ARIMA` = ARIMA(ATM4_trans, stepwise = FALSE, approx = FALSE)
)
# ATM4 forecasts for April 2010
atm4_forecast <- atm4_fit |>
forecast(h = 30)
# Plot the forecasts for April 2010
atm4_forecast |>
autoplot(atm_data1, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "ATM4 Forecasts",
subtitle = "April 2010") +
xlab("Date") +
ylab(" ATM4 Withdrawalin Hundreds")
A training dataset is created using data up to March 2010 by filtering out the April 2010 entries. Various models are fitted to the training data, including SNAIVE, ETS, ARIMA, and Auto ARIMA, specifically for the transformed ATM4 data. Then forecasts for April 2010 are generated. The forecasts are then visualized in a plot and the predictions for ATM4 withdrawals are displayed in hundreds.
Get accuracy metrics for model evaluation:
accuracy(atm4_forecast,atm_data1) |>select(.model, RMSE:MAPE)
## # A tibble: 4 × 5
## .model RMSE MAE MPE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 7.82 6.33 -82.9 102.
## 2 Auto ARIMA 7.82 6.33 -82.8 102.
## 3 ETS 6.89 5.45 -69.3 84.2
## 4 SNAIVE 7.97 5.82 -61.8 70.1
The ETS model here has the lowest RMSE value (note that these metrics are computed on the square-root-transformed scale), reflecting the best performing model for the ATM4 dataset. Therefore, I will use this model to produce the desired forecasts for the month of May 2010.
Reproduce the best performed ETS model using original dataset and generate forecasts for May 2010:
# Produce the best ETS model again using original dataset
atm4_ets_fit <- atm_data1|>
model(
ETS = ETS(ATM4_trans))
# Produce forecast for May 2010
atm4_ets_forecast <- atm4_ets_fit |>
forecast(h=31)
# Plot the forecasts
atm4_ets_forecast |>
autoplot(atm_data1) +
labs(title = "ATM4 Forecasts",
subtitle = "May 2010",
y = " ATM4 Withdrawals (in Hundreds)")
The best performing ETS model is re-fitted using the original dataset to generate forecasts for May 2010. After fitting the ETS model to the transformed ATM4 data, forecasts for 31 days are produced. These forecasts are then visualized in a plot, displaying the predicted ATM4 withdrawals in hundreds for May 2010.
Get ATM4 forecasts value and export the values in a CSV file:
# Convert tsibble to data frame
atm4_forecast<- as.data.frame(atm4_ets_forecast)|>
select(Date, .mean)
#head(atm4_forecast)
# Rename column names
atm4_forecast <- atm4_forecast |>
rename(Cash = .mean)
# Back transform values by squaring and rounding
atm4_forecast <- atm4_forecast |>
mutate(Cash= round(Cash**2))
# See forecasted values
atm4_forecast
## Date Cash
## 1 2010-05-01 281
## 2 2010-05-02 377
## 3 2010-05-03 383
## 4 2010-05-04 42
## 5 2010-05-05 474
## 6 2010-05-06 316
## 7 2010-05-07 497
## 8 2010-05-08 281
## 9 2010-05-09 377
## 10 2010-05-10 383
## 11 2010-05-11 42
## 12 2010-05-12 474
## 13 2010-05-13 316
## 14 2010-05-14 497
## 15 2010-05-15 281
## 16 2010-05-16 378
## 17 2010-05-17 384
## 18 2010-05-18 42
## 19 2010-05-19 475
## 20 2010-05-20 316
## 21 2010-05-21 497
## 22 2010-05-22 281
## 23 2010-05-23 378
## 24 2010-05-24 384
## 25 2010-05-25 42
## 26 2010-05-26 475
## 27 2010-05-27 316
## 28 2010-05-28 498
## 29 2010-05-29 282
## 30 2010-05-30 378
## 31 2010-05-31 384
# Export file as CSV
atm4_forecast |> write.csv("ATM4_forecasts.csv")
The forecasted values for ATM4 are converted from a tsibble to a data frame and the data frame is then exported as a CSV file for future use.
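One caveat: squaring the point forecast of a square-root-transformed series gives approximately the median, not the mean, on the original scale. An alternative sketch (atm4_fit_alt is a hypothetical name) writes the transformation into the model formula so that fable back-transforms the forecast distribution automatically, applying a bias adjustment to the reported .mean:
# Let fable handle the back-transformation of sqrt(ATM4) itself
atm4_fit_alt <- atm_data1 |> model(ETS = ETS(sqrt(ATM4)))
atm4_fit_alt |> forecast(h = 31) |> autoplot(atm_data1)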
Produce a single CSV file using all four ATMs forecasts:
# Combine all forecasts to a single data frame
atm_forecasts <- bind_rows(
atm1_forecast |> mutate(ATM = "ATM1"),
atm2_forecast |> mutate(ATM = "ATM2"),
atm3_forecast |> mutate(ATM = "ATM3"),
atm4_forecast |> mutate(ATM = "ATM4")
)
# Pivot data to create individual ATM column
atm_forecasts <- atm_forecasts |>
pivot_wider(names_from = ATM, values_from = Cash)
head(atm_forecasts)
## # A tibble: 6 × 5
## Date ATM1 ATM2 ATM3 ATM4
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2010-05-01 87 68 88 281
## 2 2010-05-02 101 74 88 377
## 3 2010-05-03 73 11 88 383
## 4 2010-05-04 6 2 88 42
## 5 2010-05-05 100 102 88 474
## 6 2010-05-06 79 92 88 316
# Write a CSV file
write.csv(atm_forecasts, "ATM_forecasts.csv", row.names = FALSE)
A single data frame is created by combining the forecasted data from all four ATMs and then this comprehensive dataset is exported to a CSV file for future use.
kwh_data<-read_excel("F:\\CUNY masters\\Data 624\\Project1\\ResidentialCustomerForecastLoad-624.xlsx")
#head(kwh_data)
#view(kwh_data)
str(kwh_data)
## tibble [192 × 3] (S3: tbl_df/tbl/data.frame)
## $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
## $ YYYY-MMM : chr [1:192] "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
## $ KWH : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
summary(kwh_data)
## CaseSequence YYYY-MMM KWH
## Min. :733.0 Length:192 Min. : 770523
## 1st Qu.:780.8 Class :character 1st Qu.: 5429912
## Median :828.5 Mode :character Median : 6283324
## Mean :828.5 Mean : 6502475
## 3rd Qu.:876.2 3rd Qu.: 7620524
## Max. :924.0 Max. :10655730
## NA's :1
hist(kwh_data$KWH)
boxplot(kwh_data$KWH)
The dataset is a tibble with 192 observations of 3 variables. The CaseSequence variable is not important for the analysis. The other two variables represent the time in Year-Month format and the power consumption in KWH. The histogram of the power consumption data shows that the KWH distribution is right-skewed. The boxplot shows one outlier, which is also the minimum KWH value. Additionally, the summary statistics reveal that power consumption ranges from a minimum of 770,523 KWH to a maximum of 10,655,730 KWH, with a mean of 6,502,475 KWH and a median of 6,283,324 KWH. The KWH column has one missing value. The dataset will need to be converted to a tsibble, and the Year-Month column’s data type changed to a date format, as required for time series analysis.
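Before converting, the location of the single missing value can be pinned down (a sketch):
# Show the row with the missing KWH value
kwh_data |> filter(is.na(KWH))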
# Convert the tibble to a tsibble: derive a Month index from the
# YYYY-MMM column, drop unneeded columns, and set the index
kwh_data1 <- kwh_data |> mutate(Month = yearmonth(`YYYY-MMM`))|>
select(-CaseSequence, -`YYYY-MMM`)|>
as_tsibble(index = Month)
head(kwh_data1)
## # A tsibble: 6 x 2 [1M]
## KWH Month
## <dbl> <mth>
## 1 6862583 1998 Jan
## 2 5838198 1998 Feb
## 3 5420658 1998 Mar
## 4 5010364 1998 Apr
## 5 4665377 1998 May
## 6 6467147 1998 Jun
# Plot time series with missing data and potential outlier
kwh_data1 |>
autoplot(KWH) +
labs(title = "Monthly Residential Power Consumption from January 1998 to December 2013",
subtitle="Missing data and outlier not addressed")
# Replace missing value by interpolation
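# Note: na.interp() returns a ts object, which is what triggers the
# "Don't know how to automatically pick scale" messages in the later
# plots; wrapping the call in as.numeric() would avoid them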
kwh_data1$KWH<- na.interp(kwh_data1$KWH)
# Replace outlier with median value
kwh_data1$KWH <- replace(kwh_data1$KWH, kwh_data1$KWH == min(kwh_data1$KWH),
median(kwh_data1$KWH))
# Plot time series after handling missing data and outlier
kwh_data1 |>
autoplot(KWH) +
labs(title = "Monthly Residential Power Consumption from January 1998 to December 2013",
subtitle="Missing data and oulier addressed")
## Don't know how to automatically pick scale for object of type <ts>. Defaulting
## to continuous.
After converting the tibble to a tsibble, unnecessary columns were removed and a Month column in date format was added. The missing KWH value was replaced through interpolation, and the outlier was substituted with the median KWH value. Two time series plots are displayed: the first includes the missing value and outlier, showing a significant drop in KWH that appears unusual for this series; the second reflects the cleaned data, giving a more uniform representation of the series. The series still shows some inconsistency in variance, so a transformation may help to stabilize it.
Data transformation and plot time series with transformed data:
# Find lambda
lambda1 <- kwh_data1 |>
features(KWH, features = guerrero) |>
pull(lambda_guerrero)
lambda1
## [1] -0.2130548
# Add box-cox transformed column
kwh_data1<- kwh_data1 |>
mutate(KWH_trans = box_cox(KWH, lambda1))
# Plot time series after transformation
kwh_data1 |>
autoplot(KWH_trans) +
labs(title = "Transformed Monthly Residential Power Consumption",
subtitle="January 1998 to December 2013")
## Don't know how to automatically pick scale for object of type <ts>. Defaulting
## to continuous.
After the Box-Cox transformation, the variability in the data has been substantially reduced. An STL decomposition will be performed to understand the trend, seasonality, and remainder patterns of the data more clearly.
kwh_data1|>
model(STL(KWH_trans~ season(window = "periodic"), robust = TRUE))|>
components()|>
autoplot() +
labs(title = "STL Decomposition of Power Consumption Series")
The STL decomposition shows that the data exhibits yearly seasonality and an upward trend in power consumption, with the remainder component containing random fluctuations.
kwh_data1 |>
gg_season(KWH_trans, labels = "both") +
labs(title = "Seasonal plot of Residential Power Consumption (KWH)")
## Don't know how to automatically pick scale for object of type <ts>. Defaulting
## to continuous.
kwh_data1 |>
gg_subseries(KWH_trans) +
labs(title = "Monthly Residential Power Consumption(KWH)")
## Don't know how to automatically pick scale for object of type <ts>. Defaulting
## to continuous.
The seasonal plot further shows the increase in power consumption every summer and winter. There is a clear increasing trend in power consumption during the peak months. The off-peak months likely vary from year to year due to temperature variations under different weather conditions.
ACF plot:
kwh_data1|>
ACF(KWH_trans, lag_max = 36) %>%
autoplot()
The ACF plot shows that the highest correlation at lag 12 is associated with the seasonal pattern of the data, occurring at every 12th lag value. The 12-month lags show a relatively slow decline, suggesting the data may be non-stationary and could benefit from differencing. Therefore, an additional check using the ndiffs() function will be performed.
Get the number of differencing needed:
ndiffs(kwh_data1$KWH_trans)
## [1] 1
One order of regular differencing is needed to stabilize the data.
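Seasonal differencing can be checked separately (a sketch); a result of 0 would mean no seasonal differencing is called for:
# Number of seasonal differences suggested by the seasonal strength measure
kwh_data1 |> features(KWH_trans, unitroot_nsdiffs)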
Add differenced data for building ARIMA models:
# Add differenced data
kwh_data1_diff <- kwh_data1 |>
mutate(KWH_diff= difference(KWH), KWH_trans_diff = difference(KWH_trans))|>
slice(-1) # Remove first observation row as it contains NA due to differencing
ndiffs(kwh_data1_diff$KWH_trans_diff)
## [1] 0
No more differencing is needed.
Build ARIMA models and get forecast for the year 2013:
# Get train data
kwh_data1_diff_train <- kwh_data1_diff |>
filter(year(Month) < 2013)
# Fit ARIMA models
kwh_data1_diff_fit <- kwh_data1_diff_train |>
model(
ARIMA = ARIMA(KWH_diff),
`Auto ARIMA` = ARIMA(KWH_diff, stepwise = FALSE, approx = FALSE)
)
# Produce forecast for the year 2013
kwh_data1_diff_forecast <- kwh_data1_diff_fit |>
forecast(h = "1 year")
# Plot the forecasts
kwh_data1_diff_forecast |>
autoplot(kwh_data1_diff, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "Power Consumption (kwh) Forecasts",
subtitle = "Jan 2013 to Dec 2013")+
xlab("Month") +
ylab("KWH")
Utilizing the differenced data, two ARIMA models were built here. A training dataset spanning 1998 to 2012 was created by filtering the data, and forecasts for the year 2013 were then generated and displayed. The plots make it difficult to distinguish the forecasted patterns, indicating that the two models produced nearly identical forecasts.
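For comparison, fable’s ARIMA() could also be fitted to the undifferenced transformed series and left to pick the differencing orders itself; a sketch (kwh_arima_alt is a hypothetical name):
# Fit ARIMA on the transformed, not manually differenced, series
kwh_arima_alt <- kwh_data1_diff_train |>
  model(ARIMA(KWH_trans, stepwise = FALSE, approx = FALSE))
# report() shows the selected (p,d,q)(P,D,Q) orders
kwh_arima_alt |> report()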
Build other models utilizing non-differenced data and get forecasts for the year 2013:
# Get train data
kwh_data1_train <- kwh_data1|>
filter(year(Month) < 2013)
# Fit different models
kwh_data1_fit <- kwh_data1_train |>
model(
ETS = ETS(KWH),
`Additive ETS` = ETS(KWH~ error("A") + trend("A") + season("A")),
`Multiplicative ETS` = ETS(KWH ~ error("M") + trend("A") + season("M")),
`Damped ETS`=ETS(KWH ~ error("A") + trend("Ad") + season("A")),
SNAIVE = SNAIVE(KWH)
)
# Produce forecast for the year 2013
kwh_data1_forecast <- kwh_data1_fit |>
forecast(h = "1 year")
# Plot the forecasts
kwh_data1_forecast |>
autoplot(kwh_data1, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "Power Consumption (kwh) Forecasts",
subtitle = "Jan 2013 to Dec 2013")+
xlab("Month") +
ylab("KWH")
Utilizing the non-differenced data, five different models, namely an automatically selected ETS, additive ETS, multiplicative ETS, damped ETS, and SNAIVE, were built here. Again, a training dataset spanning 1998 to 2012 was created by filtering the original data, and forecasts for the year 2013 were generated and displayed. From the plots, all the models appear to perform very close to one another. Accuracy metrics will be used to identify the best performing model.
Get accuracy metrics:
accuracy(kwh_data1_diff_forecast,kwh_data1_diff)|>
select(.model, RMSE:MAPE)
## # A tibble: 2 × 5
## .model RMSE MAE MPE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA 1167795. 774342. 239. 316.
## 2 Auto ARIMA 1153727. 770642. 258. 333.
accuracy(kwh_data1_forecast,kwh_data1) |>
select(.model, RMSE:MAPE)
## # A tibble: 5 × 5
## .model RMSE MAE MPE MAPE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Additive ETS 1036110. 616891. 2.59 7.16
## 2 Damped ETS 1091327. 653034. 4.83 7.61
## 3 ETS 1032588. 691304. 3.52 8.40
## 4 Multiplicative ETS 1021694. 697734. 2.59 8.41
## 5 SNAIVE 1035538. 618606. 4.55 7.06
Based on RMSE, the multiplicative ETS model has the lowest value among the models fitted on the level data, indicating the best performing model. (The ARIMA RMSEs above are computed on the differenced series, so they are not directly comparable to the level-scale metrics.) Therefore, I will rebuild this model to forecast the monthly power consumption for the year 2014.
# Refit the multiplicative ETS model Fit using original data
kwh_data1_new_fit <- kwh_data1 |>
model(
`Multiplicative ETS` = ETS(KWH~ error("M") + trend("A") + season("M"))
)
# Produce forecast for the months of the year 2014
kwh_data1_fc <- kwh_data1_new_fit |>
forecast(h = 12)
# Plot the forecasts
kwh_data1_fc |>
autoplot(kwh_data1, level = NULL)+
facet_wrap( ~ .model, scales = "free_y") +
guides(colour = guide_legend(title = "Forecast"))+
labs(title= "Power Consumption (kwh) Forecasts",
subtitle = "Multiplicative ETS Forecasts of 2014")+
xlab("Month") +
ylab("KWH")
The best-performing multiplicative ETS model has been rebuilt here using the full dataset, generating forecasts of the monthly power consumption for the year 2014. The forecasts align well with the historical patterns in the data.
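A final residual check (a sketch) on the refitted model; the innovation residuals should resemble white noise:
# Residual diagnostics for the multiplicative ETS model
kwh_data1_new_fit |> gg_tsresiduals()
# Ljung-Box test; lag 24 spans two seasonal cycles for monthly data
augment(kwh_data1_new_fit) |> features(.innov, ljung_box, lag = 24)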
Get a data frame of the forecasts and export it to a CSV file:
# Convert to dataframe
kwh_forecast <- as.data.frame(kwh_data1_fc) |>
select(Month, .mean)
head(kwh_forecast)
## Month .mean
## 1 2014 Jan 9365454
## 2 2014 Feb 8371660
## 3 2014 Mar 7111674
## 4 2014 Apr 6343895
## 5 2014 May 6017327
## 6 2014 Jun 7977068
# Rename columns
colnames(kwh_forecast) <- c('Month', 'KWH')
kwh_forecast
## Month KWH
## 1 2014 Jan 9365454
## 2 2014 Feb 8371660
## 3 2014 Mar 7111674
## 4 2014 Apr 6343895
## 5 2014 May 6017327
## 6 2014 Jun 7977068
## 7 2014 Jul 9332307
## 8 2014 Aug 9990075
## 9 2014 Sep 9268560
## 10 2014 Oct 6866275
## 11 2014 Nov 5918582
## 12 2014 Dec 7625763
# Export file as CSV
kwh_forecast |> write.csv("KWH_forecasts_2014.csv")
The monthly power consumption forecasts have been organized into a data frame, and a CSV file has been generated to store the data for future use.