Project 1 - Data 624

Author

Heleine Fouda

Published

October 27, 2024

Part A – ATM Forecast, ATM624Data.xlsx

In Part A, the task is to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. The instructions are somewhat ambiguous on purpose, to give this a slightly more professional tone. Explain and demonstrate your process, techniques used and not used, and your actual forecast. The data is given via an Excel file; please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel-readable file.

Part B – Forecasting Power, ResidentialCustomerForecastLoad-624.xlsx

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Part C – BONUS, optional (part or all), Waterflow_Pipe1.xlsx and Waterflow_Pipe2.xlsx

Part C consists of two data sets. These are simple two-column sets, but they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate by hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine whether the data is stationary and can be forecast. If so, provide a one-week-ahead forecast and present the results via RPubs and the .rmd file, with the forecast in an Excel-readable file.

Load necessary libraries

knitr::opts_chunk$set(echo = TRUE)

library(fpp3)        # also attaches tibble, dplyr, tidyr, lubridate, ggplot2, tsibble, tsibbledata, feasts, fable and fabletools
library(tidyverse)
library(ggthemes)
library(gt)
library(GGally)
library(knitr)
library(latex2exp)
library(broom)
library(gridExtra)
library(tseries)
library(urca)
library(scales)
library(readxl)
library(psych)
library(forecast)
library(openxlsx)
library(imputeTS)
library(zoo)

Part A – ATM Forecasts from ATM624Data.xlsx

1. Data Preparation: Import and read the data into R

#|label: Import the data from my Github repository for reproducibility and view the first three rows.

data <- read.csv("https://raw.githubusercontent.com/Heleinef/Data-Science-Master_Heleine/refs/heads/main/ATM624Data.csv")
head(data, n=3)
                  DATE  ATM Cash
1 5/1/2009 12:00:00 AM ATM1   96
2 5/1/2009 12:00:00 AM ATM2  107
3 5/2/2009 12:00:00 AM ATM1   82
library(readxl)
file_path <- "/Users/heleinefouda/Library/Mobile Documents/com~apple~CloudDocs/Machine Learning- Data 622/ATM624Data-11.xlsx"
atm_data <- read_excel(file_path)
atm_data 
# A tibble: 1,474 × 3
    DATE ATM    Cash
   <dbl> <chr> <dbl>
 1 39934 ATM1     96
 2 39934 ATM2    107
 3 39935 ATM1     82
 4 39935 ATM2     89
 5 39936 ATM1     85
 6 39936 ATM2     90
 7 39937 ATM1     90
 8 39937 ATM2     55
 9 39938 ATM1     99
10 39938 ATM2     79
# ℹ 1,464 more rows

1.1. Data wrangling

#|label: Convert the DATE column 

atm_data$DATE <- as.Date(atm_data$DATE, origin = "1899-12-30")
#|label: The code uses Excel's origin date for dates
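As a quick sanity check on that origin (a standard fact about Excel's Windows 1900 date system rather than anything specific to this file), the first serial number in the raw workbook should map back to the first date we saw in the CSV import above:

as.Date(39934, origin = "1899-12-30")
[1] "2009-05-01"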

#|label: Ensure other columns are of correct type
atm_data$ATM <- as.character(atm_data$ATM)
atm_data$Cash <- as.numeric(atm_data$Cash)
head(atm_data, n=3)
# A tibble: 3 × 3
  DATE       ATM    Cash
  <date>     <chr> <dbl>
1 2009-05-01 ATM1     96
2 2009-05-01 ATM2    107
3 2009-05-02 ATM1     82

1.2. Take a glimpse of the data and check its structure

#|label: Glimpse at the data
glimpse(atm_data)
Rows: 1,474
Columns: 3
$ DATE <date> 2009-05-01, 2009-05-01, 2009-05-02, 2009-05-02, 2009-05-03, 2009…
$ ATM  <chr> "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "ATM1", "ATM2", "…
$ Cash <dbl> 96, 107, 82, 89, 85, 90, 90, 55, 99, 79, 88, 19, 8, 2, 104, 103, …
#|label: Check the data structure
str(atm_data)
tibble [1,474 × 3] (S3: tbl_df/tbl/data.frame)
 $ DATE: Date[1:1474], format: "2009-05-01" "2009-05-01" ...
 $ ATM : chr [1:1474] "ATM1" "ATM2" "ATM1" "ATM2" ...
 $ Cash: num [1:1474] 96 107 82 89 85 90 90 55 99 79 ...

The structure function reveals that we have a data frame. A conversion to a tsibble will be necessary to make time series analysis and forecasting possible.

1.3. Run summary statistics on the data frame

#|label: Summary statistics
summary(atm_data)
      DATE                ATM                 Cash        
 Min.   :2009-05-01   Length:1474        Min.   :    0.0  
 1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
 Median :2009-11-01   Mode  :character   Median :   73.0  
 Mean   :2009-10-31                      Mean   :  155.6  
 3rd Qu.:2010-02-01                      3rd Qu.:  114.0  
 Max.   :2010-05-14                      Max.   :10919.8  
                                         NA's   :19       

The summary statistics reveal 19 NA values in the Cash column that need to be dealt with before we make any forecast; the check below also shows 14 rows with a missing ATM label.

sapply(atm_data, function(x) sum(is.na(x)))
DATE  ATM Cash 
   0   14   19 
#|label: Finding the Rows that have NA's
atm_data[which(is.na(atm_data$Cash)), ]
# A tibble: 19 × 3
   DATE       ATM    Cash
   <date>     <chr> <dbl>
 1 2009-06-13 ATM1     NA
 2 2009-06-16 ATM1     NA
 3 2009-06-18 ATM2     NA
 4 2009-06-22 ATM1     NA
 5 2009-06-24 ATM2     NA
 6 2010-05-01 <NA>     NA
 7 2010-05-02 <NA>     NA
 8 2010-05-03 <NA>     NA
 9 2010-05-04 <NA>     NA
10 2010-05-05 <NA>     NA
11 2010-05-06 <NA>     NA
12 2010-05-07 <NA>     NA
13 2010-05-08 <NA>     NA
14 2010-05-09 <NA>     NA
15 2010-05-10 <NA>     NA
16 2010-05-11 <NA>     NA
17 2010-05-12 <NA>     NA
18 2010-05-13 <NA>     NA
19 2010-05-14 <NA>     NA
tail(atm_data)
# A tibble: 6 × 3
  DATE       ATM    Cash
  <date>     <chr> <dbl>
1 2010-04-25 ATM4  542. 
2 2010-04-26 ATM4  404. 
3 2010-04-27 ATM4   13.7
4 2010-04-28 ATM4  348. 
5 2010-04-29 ATM4   44.2
6 2010-04-30 ATM4  482. 

1.4 Get a visual sense of the cash distribution per ATM

#|label: Filter the dataset to include data up to April 30, 2010, where ATM values are non-missing. 
atm_data %>% 
  filter(DATE < "2010-05-01", !is.na(ATM)) %>% 
  ggplot(aes(x = Cash)) +
    geom_histogram(bins = 30, color= "lightblue") +
    facet_wrap(~ ATM, ncol = 2, scales = "free")

1.5. Take a closer look at each individual ATM

#|label: Take a closer look at ATM4
head(atm_data |>
       filter(ATM=="ATM4"))
# A tibble: 6 × 3
  DATE       ATM    Cash
  <date>     <chr> <dbl>
1 2009-05-01 ATM4  777. 
2 2009-05-02 ATM4  524. 
3 2009-05-03 ATM4  793. 
4 2009-05-04 ATM4  908. 
5 2009-05-05 ATM4   52.8
6 2009-05-06 ATM4   52.2
#|label: Take a closer look at ATM3
head(atm_data |>
       filter(ATM=="ATM3"))
# A tibble: 6 × 3
  DATE       ATM    Cash
  <date>     <chr> <dbl>
1 2009-05-01 ATM3      0
2 2009-05-02 ATM3      0
3 2009-05-03 ATM3      0
4 2009-05-04 ATM3      0
5 2009-05-05 ATM3      0
6 2009-05-06 ATM3      0
#|label: Take a closer look at ATM2
head(atm_data |>
       filter(ATM=="ATM2"))
# A tibble: 6 × 3
  DATE       ATM    Cash
  <date>     <chr> <dbl>
1 2009-05-01 ATM2    107
2 2009-05-02 ATM2     89
3 2009-05-03 ATM2     90
4 2009-05-04 ATM2     55
5 2009-05-05 ATM2     79
6 2009-05-06 ATM2     19
#|label: Take a closer look at ATM1
head(atm_data |>
       filter(ATM=="ATM1"))
# A tibble: 6 × 3
  DATE       ATM    Cash
  <date>     <chr> <dbl>
1 2009-05-01 ATM1     96
2 2009-05-02 ATM1     82
3 2009-05-03 ATM1     85
4 2009-05-04 ATM1     90
5 2009-05-05 ATM1     99
6 2009-05-06 ATM1     88

1.6 Identify the rows with the missing data

#|label: Identify the rows with the missing data

missing_rows <- which(is.na(atm_data$Cash))
missing_rows
 [1]  87  93  98 105 110 731 732 733 734 735 736 737 738 739 740 741 742 743 744

2. Further Data cleaning & Data wrangling

2.1. Remove NA values

library(tidyr)
#|label: Remove NA values

atm_clean <- na.omit(atm_data)
head(atm_clean)
# A tibble: 6 × 3
  DATE       ATM    Cash
  <date>     <chr> <dbl>
1 2009-05-01 ATM1     96
2 2009-05-01 ATM2    107
3 2009-05-02 ATM1     82
4 2009-05-02 ATM2     89
5 2009-05-03 ATM1     85
6 2009-05-03 ATM2     90
options(scipen = 999)
cash <-aggregate(Cash ~ ATM, data = atm_clean, summary)
print(cash)
   ATM     Cash.Min.  Cash.1st Qu.   Cash.Median     Cash.Mean  Cash.3rd Qu.
1 ATM1     1.0000000    73.0000000    91.0000000    83.8867403   108.0000000
2 ATM2     0.0000000    25.5000000    67.0000000    62.5785124    93.0000000
3 ATM3     0.0000000     0.0000000     0.0000000     0.7205479     0.0000000
4 ATM4     1.5632595   124.3344146   403.8393360   474.0433449   704.5070416
      Cash.Max.
1   180.0000000
2   147.0000000
3    96.0000000
4 10919.7616377

2.2. Run summary statistics on the cleaned data

#|label: Summary statistics
summary(atm_clean)
      DATE                ATM                 Cash        
 Min.   :2009-05-01   Length:1455        Min.   :    0.0  
 1st Qu.:2009-08-01   Class :character   1st Qu.:    0.5  
 Median :2009-10-31   Mode  :character   Median :   73.0  
 Mean   :2009-10-30                      Mean   :  155.6  
 3rd Qu.:2010-01-29                      3rd Qu.:  114.0  
 Max.   :2010-04-30                      Max.   :10919.8  

The summary statistics tell us that, pooled across all four machines, the average daily withdrawal is about 155.6 (in hundreds of dollars, i.e. roughly $15,560), the minimum is 0.0, and the maximum is 10,919.8, an extreme value that we will see belongs to ATM4.

2.3. Pivot wider to get a parallel and broader view of all 4 ATMs

#|label: Pivot wider
(atm_wider <- atm_clean %>% 
  mutate(DATE = as.Date(DATE)) %>%
   filter(DATE<"2010-05-01")%>%
  pivot_wider(names_from=ATM, values_from = Cash))
# A tibble: 365 × 5
   DATE        ATM1  ATM2  ATM3  ATM4
   <date>     <dbl> <dbl> <dbl> <dbl>
 1 2009-05-01    96   107     0 777. 
 2 2009-05-02    82    89     0 524. 
 3 2009-05-03    85    90     0 793. 
 4 2009-05-04    90    55     0 908. 
 5 2009-05-05    99    79     0  52.8
 6 2009-05-06    88    19     0  52.2
 7 2009-05-07     8     2     0  55.5
 8 2009-05-08   104   103     0 559. 
 9 2009-05-09    87   107     0 904. 
10 2009-05-10    93   118     0 879. 
# ℹ 355 more rows
head(atm_wider)
# A tibble: 6 × 5
  DATE        ATM1  ATM2  ATM3  ATM4
  <date>     <dbl> <dbl> <dbl> <dbl>
1 2009-05-01    96   107     0 777. 
2 2009-05-02    82    89     0 524. 
3 2009-05-03    85    90     0 793. 
4 2009-05-04    90    55     0 908. 
5 2009-05-05    99    79     0  52.8
6 2009-05-06    88    19     0  52.2

2.4 Convert the ATM data frame to a tsibble to make time series analysis possible

#|label: Convert data frame into a tsibble

atm_wider<-atm_wider%>%
  as_tsibble(index=DATE)
head(atm_wider)
# A tsibble: 6 x 5 [1D]
  DATE        ATM1  ATM2  ATM3  ATM4
  <date>     <dbl> <dbl> <dbl> <dbl>
1 2009-05-01    96   107     0 777. 
2 2009-05-02    82    89     0 524. 
3 2009-05-03    85    90     0 793. 
4 2009-05-04    90    55     0 908. 
5 2009-05-05    99    79     0  52.8
6 2009-05-06    88    19     0  52.2
#|label: Check the structure
str(atm_wider)
tbl_ts [365 × 5] (S3: tbl_ts/tbl_df/tbl/data.frame)
 $ DATE: Date[1:365], format: "2009-05-01" "2009-05-02" ...
 $ ATM1: num [1:365] 96 82 85 90 99 88 8 104 87 93 ...
 $ ATM2: num [1:365] 107 89 90 55 79 19 2 103 107 118 ...
 $ ATM3: num [1:365] 0 0 0 0 0 0 0 0 0 0 ...
 $ ATM4: num [1:365] 777 524.4 792.8 908.2 52.8 ...
 - attr(*, "key")= tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
  ..$ .rows: list<int> [1:1] 
  .. ..$ : int [1:365] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..@ ptype: int(0) 
 - attr(*, "index")= chr "DATE"
  ..- attr(*, "ordered")= logi TRUE
 - attr(*, "index2")= chr "DATE"
 - attr(*, "interval")= interval [1:1] 1D
  ..@ .regular: logi TRUE

2.5 Run summary statistics on the tsibble

#|label: Tsibble summary statistics
summary(atm_wider)
      DATE                 ATM1             ATM2             ATM3        
 Min.   :2009-05-01   Min.   :  1.00   Min.   :  0.00   Min.   : 0.0000  
 1st Qu.:2009-07-31   1st Qu.: 73.00   1st Qu.: 25.50   1st Qu.: 0.0000  
 Median :2009-10-30   Median : 91.00   Median : 67.00   Median : 0.0000  
 Mean   :2009-10-30   Mean   : 83.89   Mean   : 62.58   Mean   : 0.7206  
 3rd Qu.:2010-01-29   3rd Qu.:108.00   3rd Qu.: 93.00   3rd Qu.: 0.0000  
 Max.   :2010-04-30   Max.   :180.00   Max.   :147.00   Max.   :96.0000  
                      NA's   :3        NA's   :2                         
      ATM4          
 Min.   :    1.563  
 1st Qu.:  124.334  
 Median :  403.839  
 Mean   :  474.043  
 3rd Qu.:  704.507  
 Max.   :10919.762  
                    

3. Exploratory data analysis (EDA)

3.1 Pivot longer

#|label: Pivot the data from wide to long format
atm_longer <- atm_wider %>%
  pivot_longer(cols = c("ATM1", "ATM2", "ATM3", "ATM4"), names_to = "ATM", values_to = "Cash")
head(atm_longer)
# A tsibble: 6 x 3 [1D]
# Key:       ATM [4]
  DATE       ATM    Cash
  <date>     <chr> <dbl>
1 2009-05-01 ATM1    96 
2 2009-05-01 ATM2   107 
3 2009-05-01 ATM3     0 
4 2009-05-01 ATM4   777.
5 2009-05-02 ATM1    82 
6 2009-05-02 ATM2    89 
#|label: Check structure of atm_ts to confirm it is a time series
atm_ts <- atm_longer
str(atm_ts)
tbl_ts [1,460 × 3] (S3: tbl_ts/tbl_df/tbl/data.frame)
 $ DATE: Date[1:1460], format: "2009-05-01" "2009-05-01" ...
 $ ATM : chr [1:1460] "ATM1" "ATM2" "ATM3" "ATM4" ...
 $ Cash: num [1:1460] 96 107 0 777 82 ...
 - attr(*, "key")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ ATM  : chr [1:4] "ATM1" "ATM2" "ATM3" "ATM4"
  ..$ .rows: list<int> [1:4] 
  .. ..$ : int [1:365] 1 5 9 13 17 21 25 29 33 37 ...
  .. ..$ : int [1:365] 2 6 10 14 18 22 26 30 34 38 ...
  .. ..$ : int [1:365] 3 7 11 15 19 23 27 31 35 39 ...
  .. ..$ : int [1:365] 4 8 12 16 20 24 28 32 36 40 ...
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE
 - attr(*, "index")= chr "DATE"
  ..- attr(*, "ordered")= logi TRUE
 - attr(*, "index2")= chr "DATE"
 - attr(*, "interval")= interval [1:1] 1D
  ..@ .regular: logi TRUE
#|label: Cash summary statistics
options(scipen = 999)
aggregate(Cash ~ ATM, data = atm_ts , summary)
   ATM     Cash.Min.  Cash.1st Qu.   Cash.Median     Cash.Mean  Cash.3rd Qu.
1 ATM1     1.0000000    73.0000000    91.0000000    83.8867403   108.0000000
2 ATM2     0.0000000    25.5000000    67.0000000    62.5785124    93.0000000
3 ATM3     0.0000000     0.0000000     0.0000000     0.7205479     0.0000000
4 ATM4     1.5632595   124.3344146   403.8393360   474.0433449   704.5070416
      Cash.Max.
1   180.0000000
2   147.0000000
3    96.0000000
4 10919.7616377

3.2: Grouped and combined visualization of the 4 ATM series

#|label: Time Series Graph by ATM
ggplot(atm_ts, aes(x = DATE, y = Cash, color = ATM)) +
  geom_line() +
  labs(title = "ATM Cash Withdrawals", x = "Date", y = "Cash in ($US)") +
  theme_minimal()

atm1_ts <- atm_ts %>%
  filter(ATM == "ATM1") %>%
  autoplot(Cash) +
  labs(title= "ATM1")

atm2_ts <- atm_ts %>%
  filter(ATM == "ATM2") %>%
  autoplot(Cash) +
  labs(title= "ATM2")

atm3_ts <- atm_ts %>%
  filter(ATM == "ATM3") %>%
  autoplot(Cash) +
  labs(title= "ATM3")

atm4_ts <-atm_ts %>%
  filter(ATM == "ATM4") %>%
  autoplot(Cash) +
  labs(title= "ATM4")

grid.arrange(atm1_ts, atm2_ts, atm3_ts, atm4_ts, nrow = 2)

#|label: Combined graphs using the facet_wrap function and standard ggplot
ggplot(atm_ts, aes(x = DATE, y = Cash, col = ATM)) +
  geom_line() +
  facet_wrap(~ATM, ncol = 2, scales = "free_y") +
  labs(title = "Cash Withdrawal per ATM", x = "Date", y = "Cash in ($US)") +
  theme_minimal()

#|label: Combined graphs using the facet_wrap function and the time series autoplot function
atm_ts %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Cash Withdrawals")

4. Modeling and Forecasting each ATM as a separate time series

Our analysis of the 4 ATM series will lean heavily on seasonal and trend decomposition using Loess (STL), a method recommended for series like these in the third edition of Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos. The authors present STL as a versatile and robust method for decomposing a time series to assess the strength of its trend and seasonality. They highlight STL's ability to handle any seasonal pattern, including patterns beyond monthly or quarterly data; it allows the seasonal component to evolve over time, at a rate controlled by the user, and it can perform a robust decomposition that limits the influence of outliers on the trend-cycle and seasonal estimates.
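Concretely, an additive STL decomposition splits the observed series into three components, $y_t = T_t + S_t + R_t$, where $T_t$ is the trend-cycle, $S_t$ the seasonal component and $R_t$ the remainder. The trend(window = ...) and season(window = ...) arguments used in the model calls below control how quickly $T_t$ and $S_t$ are allowed to change.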

4.1 ATM1

# atm1_ts currently holds the ggplot object built in section 3.2, so draw the series directly
atm_ts %>%
  filter(ATM == "ATM1") %>%
  autoplot(Cash) +
  labs(title = "ATM1 Time Series", x = "Date", y = "Cash")

We have opted to impute missing values through interpolation instead of filling them with the mean. Interpolation is generally recommended over mean imputation for most time series analyses, as it produces a smooth, trend-sensitive series, which aligns better with STL's need to detect trend and seasonality accurately.
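A toy example makes the difference concrete (the values are made up purely for illustration; na_interpolation() is the imputeTS call used in the chunk below):

x <- c(1, 2, NA, 10, 20)
na_interpolation(x)    # linear fill: the NA becomes 6, midway between its neighbours 2 and 10
mean(x, na.rm = TRUE)  # mean imputation would instead insert 8.25, ignoring the local trend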

4.1.1. Apply STL on ATM1

#|label: Apply STL decomposition on ATM1 data
library(imputeTS)  # Load the imputeTS library for imputation

stl_atm1 <- atm_ts %>%
  filter(ATM == "ATM1") %>%
#|label: Fill implicit time gaps
  fill_gaps() %>%                      
  mutate(Cash = na_interpolation(Cash)) %>%  # Impute missing values in Cash column
  model(
    stl = STL(Cash ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)
  )

#|label: Extract components using fabletools
stl_components <- fabletools::components(stl_atm1)

#|label: Plot the components
autoplot(stl_components)

4.1.2. Check residuals

#|label: Apply the ggtsdisplay function on atm1_ts
atm1_ts <- atm_ts %>%
 dplyr:: filter(ATM == "ATM1") %>%
 dplyr:: select(DATE, Cash) %>%
  as_tsibble(index = DATE)

#|label: check its structure
str(atm1_ts)
tbl_ts [365 × 2] (S3: tbl_ts/tbl_df/tbl/data.frame)
 $ DATE: Date[1:365], format: "2009-05-01" "2009-05-02" ...
 $ Cash: num [1:365] 96 82 85 90 99 88 8 104 87 93 ...
 - attr(*, "key")= tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
  ..$ .rows: list<int> [1:1] 
  .. ..$ : int [1:365] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..@ ptype: int(0) 
 - attr(*, "index")= chr "DATE"
  ..- attr(*, "ordered")= logi TRUE
 - attr(*, "index2")= chr "DATE"
 - attr(*, "interval")= interval [1:1] 1D
  ..@ .regular: logi TRUE
#|label: extract the 'Cash' column for plotting
ggtsdisplay(atm1_ts$Cash, points = FALSE, plot.type = "histogram")

#|label: Check residuals using the standard ggtsdisplay function
ggtsdisplay(atm1_ts$Cash, plot.type = "partial" )

The ACF plot shows spikes outside the critical values at lags 5, 10, 11, 12, 13, 14 and 21. The non-stationary pattern revealed in the ACF of the remainder is evidence of some form of seasonality. Most importantly, the shape of the histogram violates the assumption of normality: the data are right-skewed, indicating a need for a transformation before moving forward.

4.1.3. Apply a Box-Cox transformation and re-check residuals

#|label: Load the required library
library(MASS)  # For the boxcox() function

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
#|label: Box-Cox transformation
lambda <- boxcox(Cash ~ 1, data = atm1_ts, lambda = seq(-2, 2, by = 0.1))

#|label: Use the best lambda found
best_lambda <- lambda$x[which.max(lambda$y)]

#|label: Perform Box-Cox transformation
atm1_ts_log <- atm1_ts %>%
  mutate(Cash_BoxCox = (Cash^best_lambda - 1) / best_lambda)

#|label: Extract the transformed 'Cash' column for plotting
ggtsdisplay(atm1_ts_log$Cash_BoxCox, points = FALSE, plot.type = "histogram")
Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_bin()`).

4.1.4. Run the Ljung-Box test

# Perform the Ljung-Box test on the Cash series
Box.test(atm1_ts$Cash, lag = 24, fitdf = 0, type = "Ljung-Box")

    Box-Ljung test

data:  atm1_ts$Cash
X-squared = 571.11, df = 24, p-value < 0.00000000000000022

The results from the Box-Ljung test indicate that our Cash time series exhibits significant autocorrelation, which is a critical consideration for effective time series modeling and forecasting. Given that the p-value is far below the usual 0.05 significance level, we reject the null hypothesis of independence: there is significant autocorrelation in the Cash series at the lags tested (up to lag 24).
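For reference, the statistic reported above is $Q = n(n+2)\sum_{k=1}^{h} \hat{\rho}_k^2/(n-k)$, where $n$ is the series length, $h = 24$ is the number of lags tested, and $\hat{\rho}_k$ is the sample autocorrelation at lag $k$; under the null hypothesis of no autocorrelation, $Q$ follows a chi-squared distribution with $h$ degrees of freedom, matching the df = 24 in the output.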

4.1.5 Fit models and run accuracy metrics

# Fit models
# (note: atm1_ts still contains missing values at this point, so ETS fails;
# the gaps are filled and the models refitted in section 4.1.6)

ets_model <- atm1_ts %>% model(ETS(Cash))
Warning: 1 error encountered for ETS(Cash)
[1] ETS does not support missing values.
naive_model <- atm1_ts %>% model(NAIVE(Cash))
snaive_model <- atm1_ts %>% model(SNAIVE(Cash))

# Combine models into a list
models <- list(ets = ets_model, naive = naive_model, snaive = snaive_model)

# Generate forecasts for each model with a 30-day horizon
forecasts <- map(models, ~forecast(.x, h = 30))

# Print forecasts
print(forecasts)
$ets
# A fable: 30 x 4 [1D]
# Key:     .model [1]
   .model    DATE         Cash .mean
   <chr>     <date>     <dist> <dbl>
 1 ETS(Cash) 2010-05-01     NA    NA
 2 ETS(Cash) 2010-05-02     NA    NA
 3 ETS(Cash) 2010-05-03     NA    NA
 4 ETS(Cash) 2010-05-04     NA    NA
 5 ETS(Cash) 2010-05-05     NA    NA
 6 ETS(Cash) 2010-05-06     NA    NA
 7 ETS(Cash) 2010-05-07     NA    NA
 8 ETS(Cash) 2010-05-08     NA    NA
 9 ETS(Cash) 2010-05-09     NA    NA
10 ETS(Cash) 2010-05-10     NA    NA
# ℹ 20 more rows

$naive
# A fable: 30 x 4 [1D]
# Key:     .model [1]
   .model      DATE               Cash .mean
   <chr>       <date>           <dist> <dbl>
 1 NAIVE(Cash) 2010-05-01  N(85, 2516)    85
 2 NAIVE(Cash) 2010-05-02  N(85, 5032)    85
 3 NAIVE(Cash) 2010-05-03  N(85, 7548)    85
 4 NAIVE(Cash) 2010-05-04 N(85, 10064)    85
 5 NAIVE(Cash) 2010-05-05 N(85, 12580)    85
 6 NAIVE(Cash) 2010-05-06 N(85, 15096)    85
 7 NAIVE(Cash) 2010-05-07 N(85, 17611)    85
 8 NAIVE(Cash) 2010-05-08 N(85, 20127)    85
 9 NAIVE(Cash) 2010-05-09 N(85, 22643)    85
10 NAIVE(Cash) 2010-05-10 N(85, 25159)    85
# ℹ 20 more rows

$snaive
# A fable: 30 x 4 [1D]
# Key:     .model [1]
   .model       DATE               Cash .mean
   <chr>        <date>           <dist> <dbl>
 1 SNAIVE(Cash) 2010-05-01   N(85, 772)    85
 2 SNAIVE(Cash) 2010-05-02  N(109, 772)   109
 3 SNAIVE(Cash) 2010-05-03   N(74, 772)    74
 4 SNAIVE(Cash) 2010-05-04    N(4, 772)     4
 5 SNAIVE(Cash) 2010-05-05   N(96, 772)    96
 6 SNAIVE(Cash) 2010-05-06   N(82, 772)    82
 7 SNAIVE(Cash) 2010-05-07   N(85, 772)    85
 8 SNAIVE(Cash) 2010-05-08  N(85, 1543)    85
 9 SNAIVE(Cash) 2010-05-09 N(109, 1543)   109
10 SNAIVE(Cash) 2010-05-10  N(74, 1543)    74
# ℹ 20 more rows
lambda_1 <- atm1_ts %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

fit_atm1_tsa <- atm1_ts%>%
  model(SNAIVE = SNAIVE(box_cox(Cash, lambda_1)),
        ETS = ETS((box_cox(Cash, lambda_1) ~ error("A") + trend("A") +
                                                season("A"))))
Warning: 1 error encountered for ETS
[1] ETS does not support missing values.
fit_atm1_tsb <- atm1_ts%>%
  model(ETS = ETS((box_cox(Cash, lambda_1) ~ error("A") + trend("A") +
                                                season("A"))))
Warning: 1 error encountered for ETS
[1] ETS does not support missing values.
fit_atm1_tsc <- atm1_ts%>%
  model(SNAIVE = SNAIVE(box_cox(Cash, lambda_1)))

fc_atm1_ts <- fit_atm1_tsa %>%
  forecast(h= 31)

fc_atm1_ts%>%
  autoplot(atm1_ts, level = 80) +
  labs(y = "Cash in hundreds of dollars",
       title = "ATM1 Forecast in May 2010 per various models")
Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf
Warning: Removed 31 rows containing missing values or values outside the scale range
(`geom_line()`).

4.1.6. Adding ARIMA to the models

# Ensure time gaps are filled and missing values are interpolated in `Cash`
atm1_ts <- atm1_ts %>%
  fill_gaps() %>%                 
  mutate(Cash = na_interpolation(Cash))  # Impute missing values in Cash column

# Fit multiple models in a single call, including Seasonal Naive (SNAIVE)
atm1_models <- atm1_ts %>% 
  model(
    ARIMA = ARIMA(Cash),
    ETS = ETS(Cash),
    NAIVE = NAIVE(Cash),
    SNAIVE = SNAIVE(Cash)  # Add seasonal naive model
  )

# Forecast using each model (example: h = 30 for next 30 periods)
atm1_forecasts <- atm1_models %>% 
  forecast(h = 30)

# View the forecast results
atm1_forecasts
# A fable: 120 x 4 [1D]
# Key:     .model [4]
   .model DATE              Cash  .mean
   <chr>  <date>          <dist>  <dbl>
 1 ARIMA  2010-05-01  N(87, 556)  87.2 
 2 ARIMA  2010-05-02 N(104, 578) 104.  
 3 ARIMA  2010-05-03  N(73, 578)  73.1 
 4 ARIMA  2010-05-04   N(8, 578)   8.03
 5 ARIMA  2010-05-05 N(100, 578) 100.  
 6 ARIMA  2010-05-06  N(81, 578)  80.9 
 7 ARIMA  2010-05-07  N(87, 578)  86.7 
 8 ARIMA  2010-05-08  N(88, 675)  88.0 
 9 ARIMA  2010-05-09 N(103, 679) 103.  
10 ARIMA  2010-05-10  N(73, 679)  73.4 
# ℹ 110 more rows
atm1_forecasts %>%
  autoplot(atm1_ts, level = 80) +
  labs(y = "Cash in hundreds of dollars",
       title = "ATM1 Forecast in May 2010 per models")

4.1.7 Check residuals

#|label: SNAIVE residuals
fit_atm1_tsc|> gg_tsresiduals()

With the lowest RMSE and residuals closest to white noise, ARIMA, followed by ETS, appears to be the best-fitting model for the ATM1 series, as the evaluation below confirms.

4.1.8 Evaluate model performance on ATM1

augment(fit_atm1_tsa)%>%
  features(.innov, ljung_box, lag = 7)
# A tibble: 2 × 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 ETS       NA   NA       
2 SNAIVE    81.7  6.33e-15
augment(atm1_models ) %>%
  features(.innov, ljung_box, lag = 7)
# A tibble: 4 × 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 ARIMA     8.26  3.10e- 1
2 ETS      22.1   2.47e- 3
3 NAIVE   314.    0       
4 SNAIVE   82.3   4.66e-15
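The Ljung-Box test only checks whether residuals behave like white noise; it says nothing about point-forecast accuracy. To back up the RMSE claim above, fabletools' accuracy() can be run directly on the fitted mable (a sketch, assuming the atm1_models object from section 4.1.6):

accuracy(atm1_models) %>%
  dplyr::select(.model, RMSE, MAE, MAPE) %>%
  dplyr::arrange(RMSE)   # the model in the top row has the lowest in-sample RMSE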

4.2 ATM2

# atm2_ts is built below; draw the ATM2 series directly from atm_ts
atm_ts %>%
  filter(ATM == "ATM2") %>%
  autoplot(Cash) +
  labs(title = "ATM2 Time Series", x = "Date", y = "Cash")

library(zoo)

atm2_ts <- atm_ts %>%
  as_tsibble(index = DATE, key = ATM)  # Note: still keyed by all four ATMs; ATM2 is isolated in section 4.2.2
sum(is.na(atm2_ts$Cash))  # Count the remaining NAs in Cash
[1] 5
# Impute missing values in the Cash column
atm2_ts$Cash <- na.approx(atm2_ts$Cash, na.rm = FALSE)

# Fit models
model_results <- atm2_ts %>%
  model(
    RW = RW(Cash),
    ETS = ETS(Cash),
    Naive = NAIVE(Cash),
    Drift = NAIVE(Cash ~ drift()),
    ARIMA = ARIMA(Cash)
  )

# Generate forecasts
forecasts <- model_results %>%
  forecast(h = 30)

# Plot the forecasts
forecasts %>%
  autoplot(atm2_ts) +
  labs(title = "Forecasts with Different Predictive Models", x = "Time", y = "Forecasted Cash") +
  theme_minimal()

library(ggplot2)
library(tsibble)
library(zoo)

atm2_ts <- atm_ts %>%
as_tsibble(index = DATE, key = ATM) 
sum(is.na(atm2_ts$Cash))
[1] 5
model_results <- atm2_ts %>%
  model(
    RW = RW(Cash),
    ETS = ETS(Cash),
    Naive = NAIVE(Cash),
    Drift = NAIVE(Cash ~ drift()),
    ARIMA = ARIMA(Cash)
  )

# Generate forecasts for each model
forecasts <- model_results %>%
  forecast(h = 30)

# Plot all forecasts on a single plot
autoplot(atm2_ts, Cash) +
  autolayer(forecasts, level = NULL, aes(color = .model)) +
  labs(title = "Forecasts from Different Predictive Models", x = "Time", y = "Forecasted Cash") +
  scale_color_manual(values = c("RW" = "red", "ETS" = "blue", "Naive" = "green", "Drift" = "purple", "ARIMA" = "orange")) +
  theme_minimal() +
  theme(legend.title = element_blank())

4.2.2. Check residuals

#|label: Apply the ggtsdisplay function on atm2_ts
atm2_ts <- atm_longer %>%
 dplyr:: filter(ATM == "ATM2") %>%
 dplyr:: select(DATE, Cash) %>%
  as_tsibble(index = DATE)

#|label: check its structure
str(atm2_ts)
tbl_ts [365 × 2] (S3: tbl_ts/tbl_df/tbl/data.frame)
 $ DATE: Date[1:365], format: "2009-05-01" "2009-05-02" ...
 $ Cash: num [1:365] 107 89 90 55 79 19 2 103 107 118 ...
 - attr(*, "key")= tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
  ..$ .rows: list<int> [1:1] 
  .. ..$ : int [1:365] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..@ ptype: int(0) 
 - attr(*, "index")= chr "DATE"
  ..- attr(*, "ordered")= logi TRUE
 - attr(*, "index2")= chr "DATE"
 - attr(*, "interval")= interval [1:1] 1D
  ..@ .regular: logi TRUE
#|label: extract the 'Cash' column for plotting
ggtsdisplay(atm2_ts$Cash, points = FALSE, plot.type = "histogram")

ggtsdisplay(atm2_ts$Cash, plot.type = "partial" )
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

The presence of autocorrelation suggests the existence of underlying seasonal patterns or a trend in the data. We may consider an ARIMA model to capture this structure.

4.2.3 Apply necessary Transformations to stabilize the variance

# Calculate lambda using Guerrero's method
lambda_2 <- atm2_ts %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero)

# Fit multiple models including SNAIVE, ETS, and ARIMA to the transformed data
fit_atm2_tsa <- atm2_ts %>%
  model(
    SNAIVE = SNAIVE(box_cox(Cash, lambda_2)),
    ETS = ETS(box_cox(Cash, lambda_2) ~ error("A") + trend("A") + season("A")),
    ARIMA = ARIMA(box_cox(Cash, lambda_2))  # Adding ARIMA model
  )

# Fit ETS model separately
fit_atm2_tsb <- atm2_ts %>%
  model(
    ETS = ETS(box_cox(Cash, lambda_2) ~ error("A") + trend("A") + season("A"))
  )

# Fit SNAIVE model separately
fit_atm2_tsc <- atm2_ts %>%
  model(
    SNAIVE = SNAIVE(box_cox(Cash, lambda_2))
  )

# Generate forecasts with a 31-day horizon
fc_atm2_ts <- fit_atm2_tsa %>%
  forecast(h = 31)

# Plot forecasts using autoplot()
fc_atm2_ts %>%
  autoplot(atm2_ts, level = 80) +
  labs(
    y = "",
    title = latex2exp::TeX(paste0(
      "Transformed Cash withdrawals $\\lambda = ", round(lambda_2, 2)
    ))
  )

4.2.4 Use ARIMA() to automatically find the best ARIMA model

Apply ARIMA model alone on ATM2

#|label: residuals plots
atm2_ts <- atm2_ts %>%
  fill_gaps() %>%  
  mutate(Cash = na_interpolation(Cash)) # Impute missing values if necessary


model <- atm2_ts %>%
  model(ARIMA(Cash))

# Extract the residuals from the model; augment() adds them to the data
# and keeps the DATE index, so we just select and rename
residuals_tsibble <- model %>%
  augment() %>%
  as_tibble() %>%
  dplyr::select(DATE, .resid) %>%
  dplyr::rename(Date = DATE, Residuals = .resid)

# Plot the residuals using gg_tsresiduals
gg_tsresiduals(model)

4.2.5 Check residuals

#|label: plot the residuals directly
residuals_tsibble %>%
  ggplot(aes(x = Date, y = Residuals)) +
  geom_line() +
  labs(title = "Residuals of ARIMA Model", x = "Date", y = "Residuals") +
  theme_minimal()

4.2.6 Conduct a portmanteau test: run the Ljung-Box test

# Perform the Ljung-Box test on the Cash series
Box.test(atm2_ts$Cash, lag = 24, fitdf = 0, type = "Ljung-Box")

    Box-Ljung test

data:  atm2_ts$Cash
X-squared = 788.81, df = 24, p-value < 0.00000000000000022

The results of the Ljung-Box test reveal that the atm2_ts$Cash series exhibits significant autocorrelation. This suggests that an ARIMA(2,0,0) model alone may not adequately capture the structure of the data.

adf.test(atm2_ts$Cash)
Warning in adf.test(atm2_ts$Cash): p-value smaller than printed p-value

    Augmented Dickey-Fuller Test

data:  atm2_ts$Cash
Dickey-Fuller = -6.046, Lag order = 7, p-value = 0.01
alternative hypothesis: stationary

The Augmented Dickey-Fuller (ADF) test is commonly used to determine whether a time series is stationary. The results of the ADF test provide strong evidence that the Cash series does not have a unit root and can be treated as stationary, which is a favorable condition for many time series modeling techniques.
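The same conclusion can be cross-checked with the urca package loaded at the top of the report; ur.df() runs the ADF regression and reports the test statistic against the Dickey-Fuller critical values (a sketch; the "drift" specification and AIC-based lag selection are assumptions):

summary(ur.df(atm2_ts$Cash, type = "drift", selectlags = "AIC"))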

# Ensure continuity and handle missing values in Cash
atm2_ts_ <- atm2_ts %>% 
  fill_gaps() %>%                      # Fill implicit time gaps
  mutate(Cash = na_interpolation(Cash))  # Impute missing values in Cash column

# Fit the ARIMA model (assuming a specified ARIMA model)
forecast_model <- atm2_ts_ %>% 
  model(ARIMA = ARIMA(Cash ~ 1))  

# Forecast for May 2010 (h = 30 )
forecast_results <- forecast_model %>% 
  forecast(h = 30)

# Plot forecast results for May 2010
autoplot(forecast_results, atm2_ts_) + 
  labs(title = "ATM2 Forecast for May 2010", 
       x = "Date", 
       y = "Cash Withdrawals") +
  theme_minimal()  # Optional: add minimal theme for better appearance

atm2_ts <- atm2_ts %>%
  fill_gaps() %>%  
  mutate(Cash = na_interpolation(Cash))  # Impute missing values if any

# Step 2: Fit ARIMA(2,0,0) and ARIMA(5,0,0) models
forecast_models <- atm2_ts %>% 
  model(
    ARIMA_2_0_0 = ARIMA(Cash ~ pdq(2,0,0))
  )

# Step 3: Forecast for a 30-step horizon (adjust 'h' based on your needs)
forecast_results <- forecast_models %>% 
  forecast(h = 30)

# Step 4: Plot forecast results
autoplot(forecast_results, atm2_ts) + 
  labs(
    title = "ATM2 Cash Forecast using ARIMA (2,0,0)",
    x = "Date",
    y = "Cash Withdrawals"
  ) +
  theme_minimal()

4.2.7 Evaluate model performance on ATM2

augment(forecast_models)%>%
  features(.innov, ljung_box, lag = 7)
# A tibble: 1 × 3
  .model      lb_stat lb_pvalue
  <chr>         <dbl>     <dbl>
1 ARIMA_2_0_0    16.6    0.0201
model = auto.arima(atm2_ts$Cash)
summary(model)
Series: atm2_ts$Cash 
ARIMA(1,1,5) with drift 

Coefficients:
          ar1      ma1      ma2     ma3     ma4      ma5    drift
      -0.0338  -0.9765  -0.4023  0.0334  0.6431  -0.2860  -0.0596
s.e.   0.1565   0.1490   0.1324  0.0962  0.0798   0.0751   0.0244

sigma^2 = 1182:  log likelihood = -1803.27
AIC=3622.54   AICc=3622.94   BIC=3653.72

Training set error measures:
                      ME     RMSE      MAE  MPE MAPE      MASE          ACF1
Training set -0.05896491 33.99488 28.29514 -Inf  Inf 0.6650371 -0.0006432107

Based on the RMSE, the auto.arima fit (ARIMA(1,1,5) with drift) and ARIMA(2,0,0) are the best fits for the ATM2 series.

4.2.8. Forecast from the chosen ARIMA model:

(atm2_forecast_results <- 
  as.data.frame(forecast_results ) %>%
   dplyr:: select(DATE, .mean) %>% 
    dplyr::  rename(Date = DATE, Cash = .mean)%>%
        mutate(Cash=round(Cash,2)))
         Date   Cash
1  2010-05-01  69.33
2  2010-05-02  70.19
3  2010-05-03  11.00
4  2010-05-04   2.87
5  2010-05-05 101.20
6  2010-05-06  91.86
7  2010-05-07  68.12
8  2010-05-08  68.52
9  2010-05-09  72.69
10 2010-05-10  11.14
11 2010-05-11   2.58
12 2010-05-12 101.17
13 2010-05-13  91.90
14 2010-05-14  68.12
15 2010-05-15  68.52
16 2010-05-16  72.69
17 2010-05-17  11.15
18 2010-05-18   2.58
19 2010-05-19 101.17
20 2010-05-20  91.90
21 2010-05-21  68.12
22 2010-05-22  68.52
23 2010-05-23  72.69
24 2010-05-24  11.15
25 2010-05-25   2.58
26 2010-05-26 101.17
27 2010-05-27  91.90
28 2010-05-28  68.12
29 2010-05-29  68.52
30 2010-05-30  72.69
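Since the assignment asks for the forecast in an Excel-readable file and openxlsx is already loaded, a single call covers that deliverable (a sketch; the file name is an assumption):

openxlsx::write.xlsx(atm2_forecast_results, "ATM2_Forecast_May2010.xlsx")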

4.3. ATM3

# atm3_ts is re-derived below; draw the ATM3 series directly from atm_ts
atm_ts %>%
  filter(ATM == "ATM3") %>%
  autoplot(Cash) +
  labs(title = "ATM3 Time Series", x = "Date", y = "Cash")

4.3.1. Apply STL

#|label: Apply STL decomposition on ATM3 
library(imputeTS)  
stl_atm3 <- atm_longer %>%
  filter(ATM == "ATM3") %>%
#|label: Fill implicit time gaps
  fill_gaps() %>%                      
  mutate(Cash = na_interpolation(Cash)) %>%  # Impute missing values in Cash column
  model(
    stl = STL(Cash ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)
  )

#|label: Extract components using fabletools
stl_components <- fabletools::components(stl_atm3)

#|label: Plot the components
autoplot(stl_components)

4.3.2 Modeling the ATM3 series

atm3_ts <- atm_ts %>%
  as_tsibble(index = DATE, key = ATM)  # Note: still keyed by all four ATMs; ATM3 is isolated below
sum(is.na(atm3_ts$Cash))  # Count the remaining NAs in Cash
[1] 5
# Impute missing values in the Cash column
atm3_ts$Cash <- na.approx(atm3_ts$Cash, na.rm = FALSE)

forecast_results <- atm3_ts %>% 
  model(
    ETS = ETS(Cash),
    Snaïve = SNAIVE(Cash),

    ARIMA = ARIMA(Cash)  
  ) %>% 
  forecast(h = 30)

autoplot(forecast_results, atm3_ts) + 
  labs(title = "ATM 3 Withdrawals Forecast by Models")

4.3.3. Check residuals

#|label: Apply the ggtsdisplay function on atm3_ts
atm3_ts <- atm_ts %>%
 dplyr:: filter(ATM == "ATM3") %>%
 dplyr:: select(DATE, Cash) %>%
  as_tsibble(index = DATE)

#|label: check its structure
str(atm3_ts)
tbl_ts [365 × 2] (S3: tbl_ts/tbl_df/tbl/data.frame)
 $ DATE: Date[1:365], format: "2009-05-01" "2009-05-02" ...
 $ Cash: num [1:365] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "key")= tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
  ..$ .rows: list<int> [1:1] 
  .. ..$ : int [1:365] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..@ ptype: int(0) 
 - attr(*, "index")= chr "DATE"
  ..- attr(*, "ordered")= logi TRUE
 - attr(*, "index2")= chr "DATE"
 - attr(*, "interval")= interval [1:1] 1D
  ..@ .regular: logi TRUE
#|label: extract the 'Cash' column for plotting
ggtsdisplay(atm3_ts$Cash, points = FALSE, plot.type = "histogram")

4.3.4 Evaluate model performance on ATM3

#|label: Evaluate model accuracy using the Ljung-Box test

# Fit each model separately on atm3_ts
ets_model <- atm3_ts %>% model(ETS = ETS(Cash))
snaive_model <- atm3_ts %>% model(Snaïve = SNAIVE(Cash))
arima_model <- atm3_ts %>% model(ARIMA = ARIMA(Cash))

# Function to evaluate each model with Ljung-Box test
evaluate_model <- function(model) {
  model %>%
    augment() %>%                       # Extract residuals
    features(.innov, ljung_box, lag = 7) # Apply Ljung-Box test on residuals
}

# Apply the evaluation function to each model

ets_accuracy <- evaluate_model(ets_model)
snaive_accuracy <- evaluate_model(snaive_model)
arima_accuracy <- evaluate_model(arima_model)

# Combine results for easier viewing
accuracy_results <- bind_rows(
 
  ETS = ets_accuracy,
  Snaïve = snaive_accuracy,
  ARIMA = arima_accuracy,
  .id = "Model"
)

print(accuracy_results)
# A tibble: 3 × 4
  Model  .model lb_stat lb_pvalue
  <chr>  <chr>    <dbl>     <dbl>
1 ETS    ETS      0.306      1.00
2 Snaïve Snaïve 193.         0   
3 ARIMA  ARIMA    0.132      1.00

Based on the output of the Ljung-Box test, the ETS model is recommended for forecasting ATM3 withdrawals because it captures the underlying data structure effectively while leaving residuals that behave like white noise. The ARIMA model can also be considered if further analysis or specific features of the data warrant its use. The Snaïve model should be avoided due to its failure to account for autocorrelation.
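One caveat worth keeping in mind: ATM3 recorded zero withdrawals for almost the entire year (its mean is 0.72 against a maximum of 96 in the earlier summaries), so every model here is fitted mostly to zeros. A quick count of active days puts the forecast in context (a sketch):

atm3_ts %>%
  as_tibble() %>%
  dplyr::summarise(days_active = sum(Cash > 0, na.rm = TRUE))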

4.4 ATM4

# Filter the data for ATM4
atm4_ts <- atm_ts %>%
  filter(ATM == "ATM4")

# Create the plot
ggplot(atm4_ts, aes(x = DATE, y = Cash)) + 
  geom_line() +
  labs(title = "ATM4 Cash Over Time", x = "Date", y = "Cash")

4.4.1. Apply STL on ATM4

#|label: Apply STL decomposition on ATM4 data
library(imputeTS)  # Load the imputeTS library for imputation

stl_atm4 <- atm_ts %>%
  filter(ATM == "ATM4") %>%
#|label: Fill implicit time gaps
  fill_gaps() %>%                      
  mutate(Cash = na_interpolation(Cash)) %>%  # Impute missing values in Cash column
  model(
    stl = STL(Cash ~ trend(window = 7) + season(window = "periodic"), robust = TRUE)
  )

#|label: Extract components using fabletools
stl_components <- fabletools::components(stl_atm4)

#|label: Plot the components
autoplot(stl_components)

4.4.2. Calculate the ideal lambda for transformation

(atm4_lambda <- atm4_ts %>%
  features(Cash, features = guerrero) %>%
  pull(lambda_guerrero))
[1] -0.0737252
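Note that this lambda is computed but not used in the model fits that follow; if variance stabilization is wanted, it can be passed into the model formula exactly as was done for ATM1 and ATM2 (a sketch):

atm4_ts %>%
  model(ARIMA = ARIMA(box_cox(Cash, atm4_lambda)))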

4.4.3 Modeling ATM4 series

# Fit models and forecast for 30 periods
forecast_results <- atm4_ts %>% 
  model(
    RW = RW(Cash),
    ETS = ETS(Cash),
    Naïve = NAIVE(Cash),
    Drift = NAIVE(Cash ~ drift()),
    ARIMA = ARIMA(Cash),
    Holt = ETS(Cash ~ error("A") + trend("A") + season("N"))  # Holt's method
  ) %>% 
  forecast(h = 30)

# Plot the forecasts along with the original data
autoplot(forecast_results, atm4_ts) + 
  labs(title = "ATM 4 Withdrawals Forecast by Models")

4.4.4 Remove Outliers

atm4_ts <- atm_ts %>%
  filter(ATM == 'ATM4') %>%
  filter(Cash <= 2000 ) 


atm4_ts <- atm4_ts %>% fill_gaps()

atm4_ts %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM 4 Cash Withdrawals(Outlier Removed)")

4.4.5. Check residuals

#|label: Apply the ggtsdisplay function on atm4_ts (note: this re-derives the full ATM4 series, so the outlier is back)
atm4_ts <- atm_ts %>%
 dplyr:: filter(ATM == "ATM4") %>%
 dplyr:: select(DATE, Cash) %>%
  as_tsibble(index = DATE)

#|label: check its structure
str(atm4_ts)
tbl_ts [365 × 2] (S3: tbl_ts/tbl_df/tbl/data.frame)
 $ DATE: Date[1:365], format: "2009-05-01" "2009-05-02" ...
 $ Cash: num [1:365] 777 524.4 792.8 908.2 52.8 ...
 - attr(*, "key")= tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
  ..$ .rows: list<int> [1:1] 
  .. ..$ : int [1:365] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..@ ptype: int(0) 
 - attr(*, "index")= chr "DATE"
  ..- attr(*, "ordered")= logi TRUE
 - attr(*, "index2")= chr "DATE"
 - attr(*, "interval")= interval [1:1] 1D
  ..@ .regular: logi TRUE
#|label: extract the 'Cash' column for plotting
ggtsdisplay(atm4_ts$Cash, points = FALSE, plot.type = "histogram")

4.4.6 Evaluate model performance on ATM4

#|label: Evaluate model accuracy using the Ljung-Box test

# Fit each model separately on atm4_ts
rw_model <- atm4_ts %>% model(RW = RW(Cash))
ets_model <- atm4_ts %>% model(ETS = ETS(Cash))
naive_model <- atm4_ts %>% model(Naïve = NAIVE(Cash))
drift_model <- atm4_ts %>% model(Drift = NAIVE(Cash ~ drift()))
arima_model <- atm4_ts %>% model(ARIMA = ARIMA(Cash))
holt_model <- atm4_ts %>% model(Holt = ETS(Cash ~ error("A") + trend("A") + season("N")))  # Holt's method

# Function to evaluate each model with Ljung-Box test
evaluate_model <- function(model) {
  model %>%
    augment() %>%                       # Extract residuals
    features(.innov, ljung_box, lag = 7) # Apply Ljung-Box test on residuals
}

# Apply the evaluation function to each model
rw_accuracy <- evaluate_model(rw_model)
ets_accuracy <- evaluate_model(ets_model)
naive_accuracy <- evaluate_model(naive_model)
drift_accuracy <- evaluate_model(drift_model)
arima_accuracy <- evaluate_model(arima_model)
holt_accuracy <- evaluate_model(holt_model)

# Combine results for easier viewing
accuracy_results <- bind_rows(
  RW = rw_accuracy,
  ETS = ets_accuracy,
  Naïve = naive_accuracy,
  Drift = drift_accuracy,
  ARIMA = arima_accuracy,
  Holt = holt_accuracy,
  .id = "Model"
)

print(accuracy_results)
# A tibble: 6 × 4
  Model .model lb_stat lb_pvalue
  <chr> <chr>    <dbl>     <dbl>
1 RW    RW       90.1   1.11e-16
2 ETS   ETS       2.84  9.00e- 1
3 Naïve Naïve    90.1   1.11e-16
4 Drift Drift    90.1   1.11e-16
5 ARIMA ARIMA     2.51  9.26e- 1
6 Holt  Holt      2.45  9.31e- 1

Based on the Ljung-Box test output, the ARIMA and Holt models are the best options for the ATM4 series, as they have the highest p-values, suggesting that their residuals are closest to white noise (uncorrelated). Either model would likely be effective.

PART B

Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. The assignment is to model these data and produce a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

1. Exploratory data analysis

Import and read the data into R

library(readxl)
file_path <- "~/Downloads/ResidentialCustomerForecastLoad-624.xlsx"
# Read the Excel file
data <- read_excel(file_path)
power_ts <-data

Data analysis

head(power_ts)
# A tibble: 6 × 3
  CaseSequence `YYYY-MMM`     KWH
         <dbl> <chr>        <dbl>
1          733 1998-Jan   6862583
2          734 1998-Feb   5838198
3          735 1998-Mar   5420658
4          736 1998-Apr   5010364
5          737 1998-May   4665377
6          738 1998-Jun   6467147
str(power_ts)
tibble [192 × 3] (S3: tbl_df/tbl/data.frame)
 $ CaseSequence: num [1:192] 733 734 735 736 737 738 739 740 741 742 ...
 $ YYYY-MMM    : chr [1:192] "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
 $ KWH         : num [1:192] 6862583 5838198 5420658 5010364 4665377 ...
glimpse(power_ts)
Rows: 192
Columns: 3
$ CaseSequence <dbl> 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 74…
$ `YYYY-MMM`   <chr> "1998-Jan", "1998-Feb", "1998-Mar", "1998-Apr", "1998-May…
$ KWH          <dbl> 6862583, 5838198, 5420658, 5010364, 4665377, 6467147, 891…
names(power_ts)
[1] "CaseSequence" "YYYY-MMM"     "KWH"         
# Create the histogram of monthly KWH
ggplot(data, aes(x = KWH)) +
  geom_histogram(binwidth = 500000, color = "pink", fill = "blue", alpha = 0.7) +  # Fixed color argument
  labs(title = "Histogram of KWH", x = "KWH", y = "Frequency") +  # Fixed labels
  theme_minimal()
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).

summary(power_ts)
  CaseSequence     YYYY-MMM              KWH          
 Min.   :733.0   Length:192         Min.   :  770523  
 1st Qu.:780.8   Class :character   1st Qu.: 5429912  
 Median :828.5   Mode  :character   Median : 6283324  
 Mean   :828.5                      Mean   : 6502475  
 3rd Qu.:876.2                      3rd Qu.: 7620524  
 Max.   :924.0                      Max.   :10655730  
                                    NA's   :1         

Data wrangling and data manipulation:

Change the name of the year column for convenience, remove the columns we don't need, and locate the month with the missing KWH value:
 data.frame(power_ts$`YYYY-MMM`[power_ts$KWH %in% NA])
  power_ts..YYYY.MMM..power_ts.KWH..in..NA.
1                                  2008-Sep
#change variable type 
power_ts <- power_ts %>% 
  mutate(DATE = yearmonth(`YYYY-MMM`), KWH = KWH/1000) %>%
  dplyr::select(-CaseSequence, -'YYYY-MMM') %>% 
  tsibble(index= DATE)
head(power_ts)
# A tsibble: 6 x 2 [1M]
    KWH     DATE
  <dbl>    <mth>
1 6863. 1998 Jan
2 5838. 1998 Feb
3 5421. 1998 Mar
4 5010. 1998 Apr
5 4665. 1998 May
6 6467. 1998 Jun

Graph the series

# Plot the time series
power_ts %>%
  autoplot(KWH) +
  labs(title = "Monthly Distributions Residential Power Usage | Jan '98 - Dec '13")

The data appear right-skewed and may benefit from a transformation. First, let's create a duplicate data frame.

# Load necessary libraries
library(zoo)       # For na.interp
library(dplyr)     # For data manipulation

# Make a copy of the original data and ensure it's a data frame
powerts2 <- as.data.frame(power_ts)  # Convert to a data frame if it's not already

# Check if the 'KWH' column exists
if (!"KWH" %in% names(powerts2)) {
  stop("The KWH column does not exist in the dataset.")
}

# Impute missing values using na.interp from the zoo package
powerts2$KWH <- na.interp(powerts2$KWH)

# Replace minimum values with the median of the KWH column
powerts2$KWH <- replace(powerts2$KWH, powerts2$KWH == min(powerts2$KWH, na.rm = TRUE), 
                         median(powerts2$KWH, na.rm = TRUE))

# Preview the modified data to check the changes
head(powerts2)
       KWH     DATE
1 6862.583 1998 Jan
2 5838.198 1998 Feb
3 5420.658 1998 Mar
4 5010.364 1998 Apr
5 4665.377 1998 May
6 6467.147 1998 Jun
#ts plot
power_ts %>%
  autoplot(KWH) +
  labs(title = "Monthly Distributions Residential Power Usage | Jan '98 - Dec '13") +
  ylab(label = "KWH (Thousands)")

Check the histogram and summary statistics for a general sense of the distribution

ggplot(powerts2, aes(x=KWH))+
  geom_histogram()+
  labs(title = "Monthly Distributions Residential Power Usage | Jan '98 - Dec '13")
Don't know how to automatically pick scale for object of type <ts>. Defaulting
to continuous.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The data are not normally distributed, but the bulk of the observations sit near the center. We also see a few outliers that will need to be dealt with. Next, run summary statistics:

#summary
summary(powerts2$KWH)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   4313    5444    6330    6532    7609   10656 
power_ts %>% 
  autoplot(box_cox(KWH, -0.2130548)) +
  labs(y = "Box-Cox transformed KWH",
       title = "Transformed KWH with Lambda = -0.2130548")

There is not much difference before and after the transformation, but ndiffs() and the ACF will identify whether differencing is needed.

Check residuals

power_ts <- ts(power_ts[, "KWH"], start = c(1998, 1), frequency = 12)
ggtsdisplay(power_ts, points = FALSE, plot.type = "histogram") 
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).

ggtsdisplay(power_ts, plot.type = "partial" )
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

The data are not stationary. The variance remains, and we still see spikes falling outside the critical values on both the ACF and PACF plots. Those spikes can also be indicators of outliers.
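Since the plots point to non-stationarity, the number of differences needed can be checked directly with the unit-root heuristics in the forecast package (a sketch; the single missing month is interpolated first so the tests can run):

kwh <- ts(as.numeric(power_ts), start = c(1998, 1), frequency = 12)  # univariate copy of the series
kwh <- forecast::na.interp(kwh)   # fill the one NA (2008-Sep)
forecast::ndiffs(kwh)             # suggested number of regular differences
forecast::nsdiffs(kwh)            # suggested number of seasonal differences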

ggseasonplot(power_ts)

There is a pronounced peak in July, indicating some out-of-the-ordinary activity. Since the peak reflects higher usage, summer air-conditioning demand is a more plausible explanation than families being away on vacation. Could it be related to something else? Either way, that spike is definitely intriguing.

Modeling

# Plot the historical data
plot(power_ts, main = "Power Time Series", ylab = "KWH (thousands)", xlab = "Time", col = "blue", lwd = 2)

# The original base-R overlay (lines(forecasts$.model == ..., ...)) is dropped:
# `forecasts` is not built until the next chunk, logical comparisons are not
# x-coordinates, and no "Seasonal_Naive" model is ever fitted. The forecast
# overlay is drawn properly with ggplot2 after the models are fitted below.

file_path <- "~/Downloads/ResidentialCustomerForecastLoad-624.xlsx"
# Read the Excel file
data <- read_excel(file_path)
power_ts <-data

#change variable type 
power_ts <- power_ts %>% 
  mutate(DATE = yearmonth(`YYYY-MMM`), KWH = KWH/1000) %>%
  dplyr::select(-CaseSequence, -'YYYY-MMM') %>% 
  tsibble(index= DATE)


# Handle missing values using imputation by interpolation
power_ts <- power_ts %>%
  mutate(KWH = na_interpolation(KWH))

# Fit the models
model_results <- power_ts %>%
  model(
    ETS = ETS(KWH),          # Exponential Smoothing State Space Model
    ARIMA = ARIMA(KWH),      # Auto ARIMA
    Naive = NAIVE(KWH)       # Naive Forecast
  )

# Generate forecasts for the next 12 months (or adjust h as needed)
forecasts <- model_results %>%
  forecast(h = "12 months")

# View the forecast results
print(forecasts)
# A fable: 36 x 4 [1M]
# Key:     .model [3]
   .model     DATE              KWH .mean
   <chr>     <mth>           <dist> <dbl>
 1 ETS    2014 Jan N(9432, 1272329) 9432.
 2 ETS    2014 Feb  N(8199, 978288) 8199.
 3 ETS    2014 Mar  N(7040, 733808) 7040.
 4 ETS    2014 Apr  N(6204, 579768) 6204.
 5 ETS    2014 May  N(5872, 528015) 5872.
 6 ETS    2014 Jun  N(7652, 911557) 7652.
 7 ETS    2014 Jul N(9283, 1363541) 9283.
 8 ETS    2014 Aug N(9650, 1497129) 9650.
 9 ETS    2014 Sep N(9001, 1323148) 9001.
10 ETS    2014 Oct  N(6639, 730887) 6639.
# ℹ 26 more rows
# Plot the historical data together with the forecasts
power_ts %>%
  ggplot(aes(x = DATE, y = KWH)) +
  geom_line(color = "blue", linewidth = 1) +  # Historical data
  geom_line(data = forecasts, aes(y = .mean, color = .model), linewidth = 1, linetype = "dashed") +  # Forecasts
  labs(title = "KWH Forecasts from Different Models",
       x = "Time",
       y = "KWH") +
  scale_color_manual(values = c("ETS" = "red", "ARIMA" = "green", "Naive" = "orange")) +
  theme_minimal() +
  theme(legend.title = element_blank())

ARIMA and ETS appear to be the best models to use for this forecast.
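To support that choice, the in-sample accuracy of the three fitted models can be compared, and the chosen 2014 forecast written out to the Excel-readable file the assignment asks for (a sketch; the file name and the choice of the ARIMA column are assumptions):

# Compare in-sample fit across the three models; the lowest RMSE sits in the top row
accuracy(model_results) %>%
  dplyr::arrange(RMSE)

# Export the 2014 ARIMA forecast (KWH is in thousands here)
forecasts %>%
  as_tibble() %>%
  dplyr::filter(.model == "ARIMA") %>%
  dplyr::mutate(DATE = as.Date(DATE)) %>%
  dplyr::select(DATE, KWH = .mean) %>%
  openxlsx::write.xlsx("PowerForecast_2014.xlsx")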