Libraries used in the project
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──
## ✔ tibble 3.3.0 ✔ tsibble 1.1.6
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.2
## ✔ lubridate 1.9.4 ✔ fable 0.4.1
## ✔ ggplot2 3.5.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(readxl)
library(dplyr)
library(tidyr)
library(purrr)
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this more of a business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.
Data Preparation
#Loading the excel file
atm <- read_excel("ATM624Data.xlsx")
#Display the top observations and variables' data type
head(atm)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
#Converting data types to the proper formats
atm$DATE <- as.Date(atm$DATE, origin="1899-12-30")
atm$Cash <- as.integer(atm$Cash)
#Define as a tsibble, keyed by ATM, for time series analysis
atm <- atm %>%
as_tsibble(index = DATE, key = ATM)
#Quick overview of each variable's summary
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.5
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.0
## NA's :19
#Check for missing (NA) values
atm[!complete.cases(atm),]
## # A tsibble: 19 x 3 [1D]
## # Key: ATM [3]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
## 6 2010-05-01 <NA> NA
## 7 2010-05-02 <NA> NA
## 8 2010-05-03 <NA> NA
## 9 2010-05-04 <NA> NA
## 10 2010-05-05 <NA> NA
## 11 2010-05-06 <NA> NA
## 12 2010-05-07 <NA> NA
## 13 2010-05-08 <NA> NA
## 14 2010-05-09 <NA> NA
## 15 2010-05-10 <NA> NA
## 16 2010-05-11 <NA> NA
## 17 2010-05-12 <NA> NA
## 18 2010-05-13 <NA> NA
## 19 2010-05-14 <NA> NA
#Remove rows missing both the ATM identifier and the Cash value (the May 2010 placeholder rows), to avoid bias and date gaps
atm_clean <- atm %>%
filter(!is.na(Cash) | !is.na(ATM))
# Fill Cash with 0 where the ATM is known but the withdrawal is NA, assuming no withdrawal occurred on those days
atm_filled <- atm_clean %>%
replace_na(list(Cash = 0))
Finding: overall, 14 rows missing both the ATM identifier and the Cash value were removed, and the remaining 5 NA withdrawals (on ATM1 and ATM2) were filled with 0.
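A few sanity checks on the cleaning steps; a minimal sketch (the expected counts come from the NA listing above):
# Rows dropped (missing both ATM and Cash): expect 14
nrow(atm) - nrow(atm_clean)
# Cash NAs that were filled with 0 (ATM1 and ATM2): expect 5
sum(is.na(atm_clean$Cash))
# Confirm no implicit date gaps remain per ATM
has_gaps(atm_filled)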
# Visualize overall dataframe for the time series
atm_filled %>%
autoplot(Cash) +
labs(title = "ATM Daily Cash Withdrawals", y="Cash (hundreds of dollars)")
# Based on the above graph, ATM4 contains a significant outlier
ggplot(atm_filled, aes(x=ATM, y=Cash)) +
geom_boxplot()
#Examine each ATM's time series individually
#ts for ATM1
atm_filled %>%
filter(ATM == "ATM1") %>%
autoplot(Cash) +
labs(title = "ATM1 Daily Cash Withdrawals", y = "Cash (hundreds of dollars)")
#ts for ATM2
atm_filled %>%
filter(ATM == "ATM2") %>%
autoplot(Cash) +
labs(title = "ATM2 Daily Cash Withdrawals", y = "Cash (hundreds of dollars)")
#ts for ATM3
atm_clean %>%
filter(ATM == "ATM3") %>%
autoplot(Cash) +
labs(title = "ATM3 Daily Cash Withdrawals", y = "Cash (hundreds of dollars)")
#ts for ATM4
atm_filled %>%
filter(ATM == "ATM4") %>%
autoplot(Cash) +
labs(title = "ATM4 Daily Cash Withdrawals", y = "Cash (hundreds of dollars)")
Finding: ATM4 has the highest withdrawal values among all the ATMs, and one observation stands out as an outlier.
My approach for each ATM, based on the visualizations above, is as follows.
ATM1: The time series shows clear weekly seasonality and appears very stable, with no evident trend. I begin with the Augmented Dickey-Fuller (ADF) test and the autocorrelation function (ACF), examining lags at multiples of 7 (weekly intervals), to assess stationarity and seasonality. If stationarity is confirmed, I proceed with the Holt-Winters exponential smoothing method, which is well suited to seasonal, stable data.
atm1 <- atm_filled %>%
filter (ATM == "ATM1")
# ADF test for stationarity
adf.test(atm1$Cash)
## Warning in adf.test(atm1$Cash): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: atm1$Cash
## Dickey-Fuller = -4.435, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
# ACF plot to check for weekly seasonality
atm1 %>%
ACF(Cash, lag_max = 28) %>%
autoplot()
# Holt-Winters approach: let ETS() select the model form by minimising the AICc
atm1_fit <- atm1 %>%
model(ETS(Cash))
report(atm1_fit)
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.009973623
## gamma = 0.2852744
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 80.03332 -61.68553 6.05267 7.807144 4.195869 21.06173 10.93578 11.63234
##
## sigma^2: 686.8422
##
## AIC AICc BIC
## 4548.568 4549.189 4587.567
# Plot forecasts for ATM1
atm1_fit %>%
forecast(h=31) %>%
autoplot(atm1)
# Export a csv file with the final forecast for atm1
forecast_atm1 <- atm1_fit %>%
forecast(h=31)
write.csv(forecast_atm1,"forecast_atm1.csv")
Based on this approach for ATM1, the ADF test confirms that the selected lag order (7) aligns with my weekly-seasonality assumption. With a p-value of 0.01, we reject the null hypothesis of a unit root, indicating the series is stationary, so I proceed with the Holt-Winters exponential smoothing method.
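As a cross-check, the KPSS test reverses the hypotheses, with stationarity as the null, so agreement between the two tests strengthens the conclusion; a minimal sketch using feasts:
# KPSS test: a large p-value (fail to reject the stationarity null)
# agrees with the ADF rejection of a unit root
atm1 %>%
features(Cash, unitroot_kpss)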
ATM2: This time series also shows weekly seasonality and appears relatively stable, with no clear trend. As with ATM1, I start with the ADF test and the ACF at lags in multiples of 7 to evaluate stationarity. If the series is stationary, I apply both Holt-Winters exponential smoothing and ARIMA, which can accommodate seasonality and any residual trend.
atm2 <- atm_filled %>%
filter (ATM == "ATM2")
# ADF test for stationarity
adf.test(atm2$Cash)
## Warning in adf.test(atm2$Cash): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: atm2$Cash
## Dickey-Fuller = -5.9542, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
# ACF plot to check for weekly seasonality
atm2 %>%
ACF(Cash, lag_max = 28) %>%
autoplot()
# Box-Cox transformation
lambda2 <- atm2%>%
features(Cash,features = guerrero)%>%
pull(lambda_guerrero)
atm2 %>%
features(box_cox(Cash, lambda2), unitroot_nsdiffs)
## # A tibble: 1 × 2
## ATM nsdiffs
## <chr> <int>
## 1 ATM2 1
atm2 %>%
features(box_cox(Cash, lambda2), unitroot_ndiffs)
## # A tibble: 1 × 2
## ATM ndiffs
## <chr> <int>
## 1 ATM2 1
# Fit both ETS (Holt-Winters) and ARIMA, then compare the models by AICc
atm2_fit <- atm2 %>%
model(arima = ARIMA(box_cox(Cash, lambda2)),
ets =ETS(Cash))
report(atm2_fit)
## Warning in report.mdl_df(atm2_fit): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.
## # A tibble: 2 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM2 arima 38.0 -1159. 2329. 2330. 2353. <cpl [2]> <cpl [9]> NA NA
## 2 ATM2 ets 627. -2248. 4515. 4516. 4554. <NULL> <NULL> 611. 613.
## # ℹ 1 more variable: MAE <dbl>
# Plot the forecast from the model with the better AICc (arima)
atm2_fit %>%
forecast(h=31) %>%
filter(.model == 'arima') %>%
autoplot(atm2)
# Export a csv file with the final forecast for atm2 (model with the better AICc)
forecast_atm2 <- atm2_fit %>%
forecast(h=31)%>%
filter(.model=='arima')
write.csv(forecast_atm2,"forecast_atm2.csv")
Similar to ATM1, the ADF test for this series confirms that the selected lag aligns with my initial assumption. With a p-value of 0.01, we reject the unit-root null hypothesis, indicating the series is stationary. I therefore fit both Holt-Winters exponential smoothing and ARIMA models; based on the AICc values, the ARIMA model demonstrated the better fit, making it the preferred choice for this time series.
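Because AICc is not directly comparable between a model of box_cox(Cash) and a model of raw Cash, a holdout comparison on the original scale is a fairer tie-breaker. A minimal sketch, assuming April 2010 as the test window:
# Hold out April 2010, refit both models, and compare accuracy on the
# original Cash scale (point forecasts are back-transformed automatically)
atm2_train <- atm2 %>%
filter(DATE < as.Date("2010-04-01"))
atm2_train %>%
model(arima = ARIMA(box_cox(Cash, lambda2)),
ets = ETS(Cash)) %>%
forecast(h = 30) %>%
accuracy(atm2)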
ATM3: The cash withdrawal values remain at zero for most of the series, with a sudden spike in the final few data points. Given the lack of consistent historical patterns, I use the naïve forecasting method; this simple, conservative approach is appropriate when data are sparse or highly irregular.
# Fit the model for ATM3
atm3_fit <- atm_clean %>%
filter(ATM == "ATM3") %>%
model(Naive = NAIVE(Cash))
# Generate forecasts for 31 days for May 2010
atm3_fc <- atm3_fit%>%
forecast(h=31)
# Plot forecasts
atm3_fc %>%
autoplot(atm_clean %>% filter(ATM == "ATM3"))
# Export a csv file with the final forecast for atm3
forecast_atm3 <- atm3_fit %>%
forecast(h=31)
write.csv(forecast_atm3,"forecast_atm3.csv")
Taking a conservative approach, I believe holding the forecast flat is appropriate until more reliable data become available. This method minimizes the risk of overfitting or of misreading noise as signal. Alternatively, once more reliable data have accumulated, I would consider a panel regression that leverages data from ATM1, ATM2, and ATM4 to inform the forecast for ATM3, capturing shared patterns across similar ATMs; a simpler middle ground is sketched below.
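That middle ground is a short rolling-window mean over only the recent observations; a hedged sketch, assuming fable's window() special for MEAN and a window of 3 to match the few nonzero points:
# Mean of the last 3 observations only, since earlier values are all zero
atm_clean %>%
filter(ATM == "ATM3") %>%
model(recent_mean = MEAN(Cash ~ window(size = 3))) %>%
forecast(h = 31) %>%
autoplot(atm_clean %>% filter(ATM == "ATM3"))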
ATM4: The series is relatively stable aside from a significant spike in the first quarter of 2010. Because of this outlier, I consider two approaches: one without imputing the data point, and one imputing it with the average of its neighbouring points on both sides. I believe an ARIMA (AutoRegressive Integrated Moving Average) model is the most suitable, as it can handle both irregularities and stable patterns effectively.
atm4 <- atm_filled %>%
filter (ATM == "ATM4")
# Explore the outlier, which is more than six times the second-highest value
atm4 %>%
arrange(desc(Cash))
## Warning: Current temporal ordering may yield unexpected results.
## ℹ Suggest to sort by `ATM`, `DATE` first.
## # A tsibble: 365 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2010-02-09 ATM4 10919
## 2 2009-09-22 ATM4 1712
## 3 2010-01-29 ATM4 1574
## 4 2009-09-04 ATM4 1495
## 5 2009-06-09 ATM4 1484
## 6 2009-09-05 ATM4 1301
## 7 2010-02-28 ATM4 1275
## 8 2009-08-23 ATM4 1245
## 9 2009-06-14 ATM4 1220
## 10 2009-09-29 ATM4 1195
## # ℹ 355 more rows
atm4 %>%
model(stl=STL(Cash))%>%
components(Cash) %>%
autoplot()
# Without Imputation
# ADF test for stationarity
adf.test(atm4$Cash)
## Warning in adf.test(atm4$Cash): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: atm4$Cash
## Dickey-Fuller = -6.3252, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
# ACF plot to check for weekly seasonality
atm4 %>%
ACF(Cash, lag_max = 28) %>%
autoplot()
atm4 %>%
gg_tsdisplay(lag=7)
## Warning: `gg_tsdisplay()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsdisplay()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Plot variable not specified, automatically selected `y = Cash`
# Fit the model for ATM4 w/o imputation
atm4_fit_test1 <- atm4 %>%
model(ARIMA(Cash))
report(atm4_fit_test1)
## Series: Cash
## Model: ARIMA(0,0,0) w/ mean
##
## Coefficients:
## constant
## 473.5726
## s.e. 34.0241
##
## sigma^2 estimated as 423698: log likelihood=-2882.02
## AIC=5768.05 AICc=5768.08 BIC=5775.85
# Plot the forecasts
atm4_fit_test1 %>%
forecast(h=31) %>%
autoplot(atm4) +
labs(title ="ATM4 Cash Withdrawals")
Based on the non-imputed series, the selected model is ARIMA(0,0,0) with a constant mean, i.e., white noise: the errors are uncorrelated across time and the model carries no dynamics, so the forecast is just a flat mean. I would not rely on this approach.
# With Imputation by taking the average of the previous and next days' cash values
# Find the index of the row to impute
target_index <- which(atm4$DATE == as.Date("2010-02-09"))
# Calculate the average of the previous and next day's Cash values
prev_cash <- atm4$Cash[target_index - 1]
next_cash <- atm4$Cash[target_index + 1]
imputed_value <- round((prev_cash + next_cash) / 2)
# Update the imputed value in a new data frame
atm4_imputed <- atm4
atm4_imputed$Cash[target_index] <- imputed_value
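Since the series has weekly seasonality, an alternative worth sketching is imputing with the median of the same weekday (2010-02-09 is a Tuesday) rather than a two-point neighbour average; the object name here is illustrative:
# Median Cash across all other Tuesdays as a seasonality-aware impute
tuesday_median <- atm4 %>%
as_tibble() %>%
filter(weekdays(DATE) == weekdays(as.Date("2010-02-09")),
DATE != as.Date("2010-02-09")) %>%
summarise(med = median(Cash)) %>%
pull(med)
tuesday_median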
# Box-Cox transformation
lambda4_test2 <- atm4_imputed %>%
features(Cash,features = guerrero)%>%
pull(lambda_guerrero)
# No seasonal differencing required
atm4_imputed %>%
features(box_cox(Cash, lambda4_test2), unitroot_nsdiffs)
## # A tibble: 1 × 2
## ATM nsdiffs
## <chr> <int>
## 1 ATM4 0
# No first differencing required
atm4_imputed %>%
features(box_cox(Cash, lambda4_test2), unitroot_ndiffs)
## # A tibble: 1 × 2
## ATM ndiffs
## <chr> <int>
## 1 ATM4 0
# Fit two candidate models for ATM4 on the imputed data
atm4_fit_test2 <-atm4_imputed %>%
model(lambda = ARIMA(box_cox(Cash, lambda4_test2)),
auto=ARIMA(Cash, stepwise= FALSE, approximation = FALSE))
report(atm4_fit_test2)
## Warning in report.mdl_df(atm4_fit_test2): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.
## # A tibble: 2 × 9
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ATM4 lambda 179. -1463. 2936. 2936. 2956. <cpl [14]> <cpl [1]>
## 2 ATM4 auto 117389. -2645. 5306. 5306. 5337. <cpl [9]> <cpl [3]>
# Plot the forecasts from both models over the imputed series
forecast(atm4_fit_test2, h=31) %>%
autoplot(atm4_imputed)
# Plot the forecasts from both models over the original series
forecast(atm4_fit_test2, h=31) %>%
autoplot(atm4)
# Export a csv file with the final forecast for atm4 (model with the better AICc)
forecast_atm4 <- atm4_fit_test2 %>%
forecast(h=31)%>%
filter(.model=="lambda")
write.csv(forecast_atm4,"forecast_atm4.csv")
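Since the deliverable is a single Excel-readable file, the four forecasts can also be stacked into one CSV; a minimal sketch (the output file name is an assumption):
# Keep the point forecasts (.mean) and drop the distribution column
list(forecast_atm1, forecast_atm2, forecast_atm3, forecast_atm4) %>%
map_dfr(~ as_tibble(.x) %>% select(ATM, DATE, .mean)) %>%
write.csv("forecast_atm_all.csv", row.names = FALSE)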
After comparing the two approaches, I prefer the imputation method. Addressing the outlier on that single date smooths out the spike and reveals a clearer seasonal pattern; without imputation, the series is misleadingly identified as white noise. The imputation approach therefore gives a more accurate representation of the underlying seasonality. Based on the AICc values, the ARIMA model on the Box-Cox-transformed series demonstrated the better fit, though AICc values computed on different response scales should be compared with caution; it remains the preferred choice here.
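As a final check on the chosen model, its residuals should resemble white noise; a minimal sketch using the mable above:
# Residual diagnostics for the selected Box-Cox ARIMA (the `lambda` model)
atm4_fit_test2 %>%
select(lambda) %>%
gg_tsresiduals()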
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
#Loading the excel file
rcfl <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
#Display the top observations and variables' data types
head(rcfl)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
# Convert the YYYY-MMM column to a yearmonth index
rcfl <- rcfl %>%
select(`YYYY-MMM`, KWH) %>%
mutate(YearMonth = yearmonth(`YYYY-MMM`)) %>%
as_tsibble(index = YearMonth) %>%
select(-`YYYY-MMM`)
# Overview of the KWH values
summary(rcfl$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
# Plot the data
rcfl %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = KWH`
# Check for missing (NA) values
rcfl[!complete.cases(rcfl),]
## # A tsibble: 1 x 2 [1M]
## KWH YearMonth
## <dbl> <mth>
## 1 NA 2008 Sep
To preserve the seasonality of the data, rather than removing the missing data point, I prefer to impute it using the average of its neighboring values on both sides.
# Find the index of the missing value
missing_index <- which(is.na(rcfl$KWH))
# Get previous and next KWH values
prev_kwh <- rcfl$KWH[missing_index - 1]
next_kwh <- rcfl$KWH[missing_index + 1]
# Calculate the average
imputed_kwh <- mean(c(prev_kwh, next_kwh), na.rm = TRUE)
rcfl_imputed <- rcfl
# Replace the missing value
rcfl_imputed$KWH[missing_index] <- imputed_kwh
# Plot the dataset
rcfl_imputed %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = KWH`
# Plot the data with gg_season
rcfl_imputed %>%
gg_season(KWH, labels = "both") +
labs(title = "Seasonal Plot: KWH")
## Warning: `gg_season()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_season()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# ADF test for stationarity
adf.test(rcfl_imputed$KWH)
## Warning in adf.test(rcfl_imputed$KWH): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: rcfl_imputed$KWH
## Dickey-Fuller = -4.7933, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
# Test for seasonality
rcfl_imputed %>%
gg_tsdisplay(lag=12)
## Plot variable not specified, automatically selected `y = KWH`
Based on the plots above, the time series shows strong seasonality and an outlier in 2010. The ADF p-value is below 0.05, so we reject the unit-root null hypothesis and treat the series as stationary. Even though the ADF test indicates stationarity, I still check whether the Box-Cox-transformed series requires seasonal or first differencing before fitting ARIMA.
# Box-Cox transformation
lambda_rcfl <- rcfl_imputed%>%
features(KWH,features = guerrero)%>%
pull(lambda_guerrero)
# Check whether seasonal differencing is required (none needed here)
rcfl_imputed %>%
features(box_cox(KWH, lambda_rcfl), unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
# Check whether first differencing is required (one difference needed)
rcfl_imputed %>%
features(box_cox(KWH, lambda_rcfl), unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
# applying first differencing
rcfl_imputed <- rcfl_imputed %>%
mutate(KWH_diff = difference(box_cox(KWH, lambda_rcfl), differences = 1))
# Verify no further differencing is required after the first difference
rcfl_imputed %>%
features(KWH_diff, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
# On the untransformed KWH, one seasonal difference is required
rcfl_imputed %>%
features(KWH, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
# On the untransformed KWH, one first difference is also required
rcfl_imputed %>%
features(KWH, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
# Apply both seasonal (lag 12) and first differencing to the imputed series
rcfl_update <- rcfl_imputed %>%
mutate(KWH_diff = difference(difference(KWH, lag = 12), differences = 1))
# Confirm no seasonal differencing remains
rcfl_update %>%
features(KWH_diff, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
# Confirm no first differencing remains
rcfl_update %>%
features(KWH_diff, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
# Fit ARIMA models
# Model 1: ARIMA on the Box-Cox-transformed, first-differenced series
rcfl_fit <- rcfl_imputed %>%
model(
lambda = ARIMA(KWH_diff)
)
report(rcfl_fit)
## Series: KWH_diff
## Model: ARIMA(5,0,0)(1,0,0)[12]
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 sar1
## -0.6262 -0.5669 -0.5727 -0.5079 -0.2472 0.2420
## s.e. 0.0720 0.0753 0.0760 0.0762 0.0710 0.0785
##
## sigma^2 estimated as 1.491: log likelihood=-307.82
## AIC=629.64 AICc=630.25 BIC=652.44
rcfl_fit %>%
gg_tsresiduals()
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
# Model 2: automatic ARIMA on the seasonally and first-differenced series, without Box-Cox
rcfl_fit_2 <-rcfl_update %>%
model(
auto=ARIMA(KWH_diff, stepwise= FALSE, approximation = FALSE))
report(rcfl_fit_2)
## Series: KWH_diff
## Model: ARIMA(1,0,1)(0,0,2)[12]
##
## Coefficients:
## ar1 ma1 sma1 sma2
## 0.2086 -0.9517 -0.9322 0.1762
## s.e. 0.0788 0.0250 0.0844 0.0825
##
## sigma^2 estimated as 6.902e+11: log likelihood=-2706.09
## AIC=5422.17 AICc=5422.5 BIC=5438.46
rcfl_fit_2 %>%
gg_tsresiduals()
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Plot the model 1 forecast over the imputed (differenced) series
forecast(rcfl_fit, h=12)%>%
autoplot(rcfl_imputed)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
forecast_diff <- forecast(rcfl_fit, h=12)
# Get the last observed KWH value and move it to the Box-Cox scale,
# since the model 1 forecasts are differences of the transformed series
last_value <- tail(rcfl_imputed$KWH, 1)
last_bc <- box_cox(last_value, lambda_rcfl)
# Reverse the differencing on the Box-Cox scale, then back-transform
# to KWH with inv_box_cox()
forecast_kwh <- forecast_diff %>%
as_tibble() %>%
mutate(
KWH = inv_box_cox(accumulate(.x = .mean, .f = `+`, .init = last_bc)[-1], lambda_rcfl)
)
# Plot the model 2 forecast over the doubly differenced series
forecast(rcfl_fit_2, h=12)%>%
autoplot(rcfl_update)
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_line()`).
forecast2_diff <- forecast(rcfl_fit_2, h=12)
# Get the last observed KWH value from the original data
last_value_2 <- tail(rcfl_update$KWH, 1)
# Reverse the first difference manually; note KWH_diff was also seasonally
# differenced (lag 12), so a full inversion would additionally add back the
# seasonal terms
forecast_kwh_2 <- forecast2_diff %>%
as_tibble() %>%
mutate(
KWH = accumulate(.x = .mean, .f = `+`, .init = last_value_2)[-1]
)
# Export csv files with the forecasts
write.csv(forecast_kwh,"forecast_kwh_with_lambda.csv")
write.csv(forecast_kwh_2,"forecast_kwh_without_lambda.csv")
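An alternative that avoids manual reversal entirely is to model the Box-Cox-transformed KWH directly and let ARIMA() choose its own differencing, in which case fable back-transforms the forecasts to the KWH scale automatically; a minimal sketch:
# ARIMA() selects the differencing orders internally; forecasts come out in KWH
rcfl_direct <- rcfl_imputed %>%
model(direct = ARIMA(box_cox(KWH, lambda_rcfl)))
rcfl_direct %>%
forecast(h = 12) %>%
autoplot(rcfl_imputed)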
Conclusion:
The dataset includes monthly residential power consumption (KWH) from January 1998 to December 2013. After identifying one missing value, imputation was performed using the average of adjacent months to preserve seasonal continuity. Exploratory plots revealed clear annual seasonality and a notable outlier in 2010.
Two ARIMA models were developed: one on the Box-Cox-transformed series and one on the untransformed (but differenced) series using automatic ARIMA selection. Based on the model output, the Box-Cox-transformed model reports the better AICc, although AICc values computed on differently transformed responses are not directly comparable, so this ranking should be read with caution.
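A like-for-like check would hold out the final year and compare accuracy on raw KWH; a minimal sketch, assuming 2013 as the test window:
# Refit both specifications on 1998-2012 and score the 2013 holdout
rcfl_train <- rcfl_imputed %>%
filter(YearMonth < yearmonth("2013 Jan"))
rcfl_train %>%
model(bc = ARIMA(box_cox(KWH, lambda_rcfl)),
auto = ARIMA(KWH, stepwise = FALSE, approximation = FALSE)) %>%
forecast(h = 12) %>%
accuracy(rcfl_imputed)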