Libraries used in the project
library(fpp3)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──
## ✔ tibble 3.3.0 ✔ tsibble 1.1.6
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.2
## ✔ lubridate 1.9.4 ✔ fable 0.4.1
## ✔ ggplot2 3.5.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(readxl)
library(dplyr)
library(tidyr)
library(purrr)
In part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this more of a business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file. Please provide your written report on your findings, visuals, discussion and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast in an Excel-readable file.
Data Preparation
#Loading the excel file
atm <- read_excel("ATM624Data.xlsx")
#Display the top observations and variables' data type
head(atm)
## # A tibble: 6 × 3
## DATE ATM Cash
## <dbl> <chr> <dbl>
## 1 39934 ATM1 96
## 2 39934 ATM2 107
## 3 39935 ATM1 82
## 4 39935 ATM2 89
## 5 39936 ATM1 85
## 6 39936 ATM2 90
#Converting data types to the proper formats
atm$DATE <- as.Date(atm$DATE, origin="1899-12-30")
atm$Cash <- as.integer(atm$Cash)
#Define as a tsibble, keyed by ATM, for time series analysis
atm <- atm %>%
as_tsibble(index = DATE, key = ATM)
#Quick overview of each variable's summary
summary(atm)
## DATE ATM Cash
## Min. :2009-05-01 Length:1474 Min. : 0.0
## 1st Qu.:2009-08-01 Class :character 1st Qu.: 0.5
## Median :2009-11-01 Mode :character Median : 73.0
## Mean :2009-10-31 Mean : 155.5
## 3rd Qu.:2010-02-01 3rd Qu.: 114.0
## Max. :2010-05-14 Max. :10919.0
## NA's :19
#Check for missing (NA) values
atm[!complete.cases(atm),]
## # A tsibble: 19 x 3 [1D]
## # Key: ATM [3]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2009-06-13 ATM1 NA
## 2 2009-06-16 ATM1 NA
## 3 2009-06-22 ATM1 NA
## 4 2009-06-18 ATM2 NA
## 5 2009-06-24 ATM2 NA
## 6 2010-05-01 <NA> NA
## 7 2010-05-02 <NA> NA
## 8 2010-05-03 <NA> NA
## 9 2010-05-04 <NA> NA
## 10 2010-05-05 <NA> NA
## 11 2010-05-06 <NA> NA
## 12 2010-05-07 <NA> NA
## 13 2010-05-08 <NA> NA
## 14 2010-05-09 <NA> NA
## 15 2010-05-10 <NA> NA
## 16 2010-05-11 <NA> NA
## 17 2010-05-12 <NA> NA
## 18 2010-05-13 <NA> NA
## 19 2010-05-14 <NA> NA
#Remove rows missing both the ATM identifier and the Cash value (the May 2010 placeholder rows), to avoid bias and date gaps
atm_clean <- atm %>%
filter(!is.na(Cash) | !is.na(ATM))
# Fill Cash with 0 where the ATM is known but the withdrawal is NA, assuming no withdrawal occurred on those days
atm_filled <- atm_clean %>%
replace_na(list(Cash = 0))
Finding: overall, 14 rows missing both the ATM identifier and the Cash value were removed, and the remaining 5 NA withdrawals (on ATM1 and ATM2) were filled with 0.
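A few sanity checks on the cleaning steps; a minimal sketch (the expected counts come from the NA listing above):
# Rows dropped (missing both ATM and Cash): expect 14
nrow(atm) - nrow(atm_clean)
# Cash NAs that were filled with 0 (ATM1 and ATM2): expect 5
sum(is.na(atm_clean$Cash))
# Confirm no implicit date gaps remain per ATM
has_gaps(atm_filled)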
# Visualize overall dataframe for the time series
atm_filled %>%
autoplot(Cash) +
labs(title = "ATM Daily Cash Withdrawals", y="Cash (hundreds of dollars)")
# Based on the above graph, ATM4 contains a significant outlier
ggplot(atm_filled, aes(x=ATM, y=Cash)) +
geom_boxplot()
#Examine each ATM's time series individually
#ts for ATM1
atm_filled %>%
filter(ATM == "ATM1") %>%
autoplot(Cash) +
labs(title = "ATM1 Daily Cash Withdrawals", y = "Cash (hundreds of dollars)")
#ts for ATM2
atm_filled %>%
filter(ATM == "ATM2") %>%
autoplot(Cash) +
labs(title = "ATM2 Daily Cash Withdrawals", y = "Cash (hundreds of dollars)")
#ts for ATM3
atm_clean %>%
filter(ATM == "ATM3") %>%
autoplot(Cash) +
labs(title = "ATM3 Daily Cash Withdrawals", y = "Cash (hundreds of dollars)")
#ts for ATM4
atm_filled %>%
filter(ATM == "ATM4") %>%
autoplot(Cash) +
labs(title = "ATM4 Daily Cash Withdrawals", y = "Cash (hundreds of dollars)")
Finding: ATM4 has the highest withdrawal values among all the ATMs, and one observation stands out as an outlier.
My approach for each ATM, based on the visualizations above, is as follows.
ATM1: The time series shows clear weekly seasonality and appears very stable, with no evident trend. I begin with the Augmented Dickey-Fuller (ADF) test and the autocorrelation function (ACF), examining lags at multiples of 7 (weekly intervals), to assess stationarity and seasonality. If stationarity is confirmed, I proceed with the Holt-Winters exponential smoothing method, which is well suited to seasonal, stable data.
atm1 <- atm_filled %>%
filter (ATM == "ATM1")
# ADF test for stationarity
adf.test(atm1$Cash)
## Warning in adf.test(atm1$Cash): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: atm1$Cash
## Dickey-Fuller = -4.435, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
# ACF plot to check for weekly seasonality
atm1 %>%
ACF(Cash, lag_max = 28) %>%
autoplot()
# Holt-Winters approach: let ETS() select the model form by minimising the AICc
atm1_fit <- atm1 %>%
model(ETS(Cash))
report(atm1_fit)
## Series: Cash
## Model: ETS(A,N,A)
## Smoothing parameters:
## alpha = 0.009973623
## gamma = 0.2852744
##
## Initial states:
## l[0] s[0] s[-1] s[-2] s[-3] s[-4] s[-5] s[-6]
## 80.03332 -61.68553 6.05267 7.807144 4.195869 21.06173 10.93578 11.63234
##
## sigma^2: 686.8422
##
## AIC AICc BIC
## 4548.568 4549.189 4587.567
# Plot forecasts for ATM1
atm1_fit %>%
forecast(h=31) %>%
autoplot(atm1)
# Export a csv file with the final forecast for atm1
forecast_atm1 <- atm1_fit %>%
forecast(h=31)
write.csv(forecast_atm1,"forecast_atm1.csv")
Based on this approach for ATM1, the ADF test confirms that the selected lag order (7) aligns with my weekly-seasonality assumption. With a p-value of 0.01, we reject the null hypothesis of a unit root, indicating the series is stationary, so I proceed with the Holt-Winters exponential smoothing method.
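As a cross-check, the KPSS test reverses the hypotheses, with stationarity as the null, so agreement between the two tests strengthens the conclusion; a minimal sketch using feasts:
# KPSS test: a large p-value (fail to reject the stationarity null)
# agrees with the ADF rejection of a unit root
atm1 %>%
features(Cash, unitroot_kpss)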
ATM2: This time series also shows weekly seasonality and appears relatively stable, with no clear trend. As with ATM1, I start with the ADF test and the ACF at lags in multiples of 7 to evaluate stationarity. If the series is stationary, I apply both Holt-Winters exponential smoothing and ARIMA, which can accommodate seasonality and any residual trend.
atm2 <- atm_filled %>%
filter (ATM == "ATM2")
# ADF test for stationarity
adf.test(atm2$Cash)
## Warning in adf.test(atm2$Cash): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: atm2$Cash
## Dickey-Fuller = -5.9542, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
# ACF plot to check for weekly seasonality
atm2 %>%
ACF(Cash, lag_max = 28) %>%
autoplot()
# Box-Cox transformation
lambda2 <- atm2%>%
features(Cash,features = guerrero)%>%
pull(lambda_guerrero)
atm2 %>%
features(box_cox(Cash, lambda2), unitroot_nsdiffs)
## # A tibble: 1 × 2
## ATM nsdiffs
## <chr> <int>
## 1 ATM2 1
atm2 %>%
features(box_cox(Cash, lambda2), unitroot_ndiffs)
## # A tibble: 1 × 2
## ATM ndiffs
## <chr> <int>
## 1 ATM2 1
# Fit both ETS (Holt-Winters) and ARIMA, then compare the models by AICc
atm2_fit <- atm2 %>%
model(arima = ARIMA(box_cox(Cash, lambda2)),
ets =ETS(Cash))
report(atm2_fit)
## Warning in report.mdl_df(atm2_fit): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.
## # A tibble: 2 × 12
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ATM2 arima 38.0 -1159. 2329. 2330. 2353. <cpl [2]> <cpl [9]> NA NA
## 2 ATM2 ets 627. -2248. 4515. 4516. 4554. <NULL> <NULL> 611. 613.
## # ℹ 1 more variable: MAE <dbl>
# Plot the forecast from the model with the better AICc (arima)
atm2_fit %>%
forecast(h=31) %>%
filter(.model == 'arima') %>%
autoplot(atm2)
# Export a csv file with the final forecast for atm2 (model with the better AICc)
forecast_atm2 <- atm2_fit %>%
forecast(h=31)%>%
filter(.model=='arima')
write.csv(forecast_atm2,"forecast_atm2.csv")
Similar to ATM1, the ADF test for this series confirms that the selected lag aligns with my initial assumption. With a p-value of 0.01, we reject the unit-root null hypothesis, indicating the series is stationary. I therefore fit both Holt-Winters exponential smoothing and ARIMA models; based on the AICc values, the ARIMA model demonstrated the better fit, making it the preferred choice for this time series.
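Because AICc is not directly comparable between a model of box_cox(Cash) and a model of raw Cash, a holdout comparison on the original scale is a fairer tie-breaker. A minimal sketch, assuming April 2010 as the test window:
# Hold out April 2010, refit both models, and compare accuracy on the
# original Cash scale (point forecasts are back-transformed automatically)
atm2_train <- atm2 %>%
filter(DATE < as.Date("2010-04-01"))
atm2_train %>%
model(arima = ARIMA(box_cox(Cash, lambda2)),
ets = ETS(Cash)) %>%
forecast(h = 30) %>%
accuracy(atm2)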
ATM3: The cash withdrawal values remain at zero for most of the series, with a sudden spike in the final few data points. Given the lack of consistent historical patterns, I use the naïve forecasting method; this simple, conservative approach is appropriate when data are sparse or highly irregular.
# Fit the model for ATM3
atm3_fit <- atm_clean %>%
filter(ATM == "ATM3") %>%
model(Naive = NAIVE(Cash))
# Generate forecasts for 31 days for May 2010
atm3_fc <- atm3_fit%>%
forecast(h=31)
# Plot forecasts
atm3_fc %>%
autoplot(atm_clean %>% filter(ATM == "ATM3"))
# Export a csv file with the final forecast for atm3
forecast_atm3 <- atm3_fit %>%
forecast(h=31)
write.csv(forecast_atm3,"forecast_atm3.csv")
Taking a conservative approach, I believe holding the forecast flat is appropriate until more reliable data become available. This method minimizes the risk of overfitting or of misreading noise as signal. Alternatively, once more reliable data have accumulated, I would consider a panel regression that leverages data from ATM1, ATM2, and ATM4 to inform the forecast for ATM3, capturing shared patterns across similar ATMs; a simpler middle ground is sketched below.
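That middle ground is a short rolling-window mean over only the recent observations; a hedged sketch, assuming fable's window() special for MEAN and a window of 3 to match the few nonzero points:
# Mean of the last 3 observations only, since earlier values are all zero
atm_clean %>%
filter(ATM == "ATM3") %>%
model(recent_mean = MEAN(Cash ~ window(size = 3))) %>%
forecast(h = 31) %>%
autoplot(atm_clean %>% filter(ATM == "ATM3"))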
ATM4: The series is relatively stable aside from a significant spike in the first quarter of 2010. Because of this outlier, I consider two approaches: one without imputing the data point, and one imputing it with the average of its neighbouring points on both sides. I believe an ARIMA (AutoRegressive Integrated Moving Average) model is the most suitable, as it can handle both irregularities and stable patterns effectively.
atm4 <- atm_filled %>%
filter (ATM == "ATM4")
# Explore the outlier, which is more than six times the second-highest value
atm4 %>%
arrange(desc(Cash))
## Warning: Current temporal ordering may yield unexpected results.
## ℹ Suggest to sort by `ATM`, `DATE` first.
## # A tsibble: 365 x 3 [1D]
## # Key: ATM [1]
## DATE ATM Cash
## <date> <chr> <int>
## 1 2010-02-09 ATM4 10919
## 2 2009-09-22 ATM4 1712
## 3 2010-01-29 ATM4 1574
## 4 2009-09-04 ATM4 1495
## 5 2009-06-09 ATM4 1484
## 6 2009-09-05 ATM4 1301
## 7 2010-02-28 ATM4 1275
## 8 2009-08-23 ATM4 1245
## 9 2009-06-14 ATM4 1220
## 10 2009-09-29 ATM4 1195
## # ℹ 355 more rows
atm4 %>%
model(stl=STL(Cash))%>%
components(Cash) %>%
autoplot()
# Without Imputation
# ADF test for stationarity
adf.test(atm4$Cash)
## Warning in adf.test(atm4$Cash): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: atm4$Cash
## Dickey-Fuller = -6.3252, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
# ACF plot to check for weekly seasonality
atm4 %>%
ACF(Cash, lag_max = 28) %>%
autoplot()
atm4 %>%
gg_tsdisplay(lag=7)
## Warning: `gg_tsdisplay()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsdisplay()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Plot variable not specified, automatically selected `y = Cash`
# Fit the model for ATM4 w/o imputation
atm4_fit_test1 <- atm4 %>%
model(ARIMA(Cash))
report(atm4_fit_test1)
## Series: Cash
## Model: ARIMA(0,0,0) w/ mean
##
## Coefficients:
## constant
## 473.5726
## s.e. 34.0241
##
## sigma^2 estimated as 423698: log likelihood=-2882.02
## AIC=5768.05 AICc=5768.08 BIC=5775.85
# Plot the forecasts
atm4_fit_test1 %>%
forecast(h=31) %>%
autoplot(atm4) +
labs(title ="ATM4 Cash Withdrawals")
Based on the non-imputed series, the selected model is ARIMA(0,0,0) with a constant mean, i.e., white noise: the errors are uncorrelated across time and the model carries no dynamics, so the forecast is just a flat mean. I would not rely on this approach.
# With Imputation by taking the average of the previous and next days' cash values
# Find the index of the row to impute
target_index <- which(atm4$DATE == as.Date("2010-02-09"))
# Calculate the average of the previous and next day's Cash values
prev_cash <- atm4$Cash[target_index - 1]
next_cash <- atm4$Cash[target_index + 1]
imputed_value <- round((prev_cash + next_cash) / 2)
# Update the imputed value in a new data frame
atm4_imputed <- atm4
atm4_imputed$Cash[target_index] <- imputed_value
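Since the series has weekly seasonality, an alternative worth sketching is imputing with the median of the same weekday (2010-02-09 is a Tuesday) rather than a two-point neighbour average; the object name here is illustrative:
# Median Cash across all other Tuesdays as a seasonality-aware impute
tuesday_median <- atm4 %>%
as_tibble() %>%
filter(weekdays(DATE) == weekdays(as.Date("2010-02-09")),
DATE != as.Date("2010-02-09")) %>%
summarise(med = median(Cash)) %>%
pull(med)
tuesday_median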
# Box-Cox transformation
lambda4_test2 <- atm4_imputed %>%
features(Cash,features = guerrero)%>%
pull(lambda_guerrero)
# No seasonal differencing required
atm4_imputed %>%
features(box_cox(Cash, lambda4_test2), unitroot_nsdiffs)
## # A tibble: 1 × 2
## ATM nsdiffs
## <chr> <int>
## 1 ATM4 0
# No first differencing required
atm4_imputed %>%
features(box_cox(Cash, lambda4_test2), unitroot_ndiffs)
## # A tibble: 1 × 2
## ATM ndiffs
## <chr> <int>
## 1 ATM4 0
# Fit two candidate models for ATM4 on the imputed data
atm4_fit_test2 <-atm4_imputed %>%
model(lambda = ARIMA(box_cox(Cash, lambda4_test2)),
auto=ARIMA(Cash, stepwise= FALSE, approximation = FALSE))
report(atm4_fit_test2)
## Warning in report.mdl_df(atm4_fit_test2): Model reporting is only supported for
## individual models, so a glance will be shown. To see the report for a specific
## model, use `select()` and `filter()` to identify a single model.
## # A tibble: 2 × 9
## ATM .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ATM4 lambda 179. -1463. 2936. 2936. 2956. <cpl [14]> <cpl [1]>
## 2 ATM4 auto 117389. -2645. 5306. 5306. 5337. <cpl [9]> <cpl [3]>
# Plot the forecasts from both models over the imputed series
forecast(atm4_fit_test2, h=31) %>%
autoplot(atm4_imputed)
# Plot the forecasts from both models over the original series
forecast(atm4_fit_test2, h=31) %>%
autoplot(atm4)
# Export a csv file with the final forecast for atm4 (model with the better AICc)
forecast_atm4 <- atm4_fit_test2 %>%
forecast(h=31)%>%
filter(.model=="lambda")
write.csv(forecast_atm4,"forecast_atm4.csv")
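Since the deliverable is a single Excel-readable file, the four forecasts can also be stacked into one CSV; a minimal sketch (the output file name is an assumption):
# Keep the point forecasts (.mean) and drop the distribution column
list(forecast_atm1, forecast_atm2, forecast_atm3, forecast_atm4) %>%
map_dfr(~ as_tibble(.x) %>% select(ATM, DATE, .mean)) %>%
write.csv("forecast_atm_all.csv", row.names = FALSE)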
After comparing the two approaches, I prefer the imputation method. Addressing the outlier on that single date smooths out the spike and reveals a clearer seasonal pattern; without imputation, the series is misleadingly identified as white noise. The imputation approach therefore gives a more accurate representation of the underlying seasonality. Based on the AICc values, the ARIMA model on the Box-Cox-transformed series demonstrated the better fit, though AICc values computed on different response scales should be compared with caution; it remains the preferred choice here.
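As a final check on the chosen model, its residuals should resemble white noise; a minimal sketch using the mable above:
# Residual diagnostics for the selected Box-Cox ARIMA (the `lambda` model)
atm4_fit_test2 %>%
select(lambda) %>%
gg_tsresiduals()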
Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.
#Loading the excel file
rcfl <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
#Display the top observations and variables' data types
head(rcfl)
## # A tibble: 6 × 3
## CaseSequence `YYYY-MMM` KWH
## <dbl> <chr> <dbl>
## 1 733 1998-Jan 6862583
## 2 734 1998-Feb 5838198
## 3 735 1998-Mar 5420658
## 4 736 1998-Apr 5010364
## 5 737 1998-May 4665377
## 6 738 1998-Jun 6467147
# Convert the YYYY-MMM column to a yearmonth index
rcfl <- rcfl %>%
select(`YYYY-MMM`, KWH) %>%
mutate(YearMonth = yearmonth(`YYYY-MMM`)) %>%
as_tsibble(index = YearMonth) %>%
select(-`YYYY-MMM`)
# Overview of the KWH values
summary(rcfl$KWH)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 770523 5429912 6283324 6502475 7620524 10655730 1
# Plot the data
rcfl %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = KWH`
# Check for missing (NA) values
rcfl[!complete.cases(rcfl),]
## # A tsibble: 1 x 2 [1M]
## KWH YearMonth
## <dbl> <mth>
## 1 NA 2008 Sep
To preserve the seasonality of the data, rather than removing the missing data point, I prefer to impute it using the average of its neighboring values on both sides.
# Find the index of the missing value
missing_index <- which(is.na(rcfl$KWH))
# Get previous and next KWH values
prev_kwh <- rcfl$KWH[missing_index - 1]
next_kwh <- rcfl$KWH[missing_index + 1]
# Calculate the average
imputed_kwh <- mean(c(prev_kwh, next_kwh), na.rm = TRUE)
rcfl_imputed <- rcfl
# Replace the missing value
rcfl_imputed$KWH[missing_index] <- imputed_kwh
# Plot the dataset
rcfl_imputed %>%
autoplot()
## Plot variable not specified, automatically selected `.vars = KWH`
# Plot the data with gg_season
rcfl_imputed %>%
gg_season(KWH, labels = "both") +
labs(title = "Seasonal Plot: KWH")
## Warning: `gg_season()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_season()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# ADF test for stationarity
adf.test(rcfl_imputed$KWH)
## Warning in adf.test(rcfl_imputed$KWH): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: rcfl_imputed$KWH
## Dickey-Fuller = -4.7933, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
# Test for seasonality
rcfl_imputed %>%
gg_tsdisplay(lag=12)
## Plot variable not specified, automatically selected `y = KWH`
Based on the plots above, the time series shows strong seasonality and an outlier in 2010. The ADF p-value is below 0.05, so we reject the unit-root null hypothesis and treat the series as stationary. Even though the ADF test indicates stationarity, I still check whether the Box-Cox-transformed series requires seasonal or first differencing before fitting ARIMA.
# Box-Cox transformation
lambda_rcfl <- rcfl_imputed%>%
features(KWH,features = guerrero)%>%
pull(lambda_guerrero)
# Check whether seasonal differencing is required (none needed here)
rcfl_imputed %>%
features(box_cox(KWH, lambda_rcfl), unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
# Check whether first differencing is required (one difference needed)
rcfl_imputed %>%
features(box_cox(KWH, lambda_rcfl), unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
# applying first differencing
rcfl_imputed <- rcfl_imputed %>%
mutate(KWH_diff = difference(box_cox(KWH, lambda_rcfl), differences = 1))
# Verify no further differencing is required after the first difference
rcfl_imputed %>%
features(KWH_diff, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
# On the untransformed KWH, one seasonal difference is required
rcfl_imputed %>%
features(KWH, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
# On the untransformed KWH, one first difference is also required
rcfl_imputed %>%
features(KWH, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
# Apply both seasonal (lag 12) and first differencing to the imputed series
rcfl_update <- rcfl_imputed %>%
mutate(KWH_diff = difference(difference(KWH, lag = 12), differences = 1))
# Confirm no seasonal differencing remains
rcfl_update %>%
features(KWH_diff, unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 0
# Confirm no first differencing remains
rcfl_update %>%
features(KWH_diff, unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
# Fit ARIMA models
# Model 1: ARIMA on the Box-Cox-transformed, first-differenced series
rcfl_fit <- rcfl_imputed %>%
model(
lambda = ARIMA(KWH_diff)
)
report(rcfl_fit)
## Series: KWH_diff
## Model: ARIMA(5,0,0)(1,0,0)[12]
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 sar1
## -0.6262 -0.5669 -0.5727 -0.5079 -0.2472 0.2420
## s.e. 0.0720 0.0753 0.0760 0.0762 0.0710 0.0785
##
## sigma^2 estimated as 1.491: log likelihood=-307.82
## AIC=629.64 AICc=630.25 BIC=652.44
rcfl_fit %>%
gg_tsresiduals()
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
# Model 2: automatic ARIMA on the seasonally and first-differenced series, without Box-Cox
rcfl_fit_2 <-rcfl_update %>%
model(
auto=ARIMA(KWH_diff, stepwise= FALSE, approximation = FALSE))
report(rcfl_fit_2)
## Series: KWH_diff
## Model: ARIMA(1,0,1)(0,0,2)[12]
##
## Coefficients:
## ar1 ma1 sma1 sma2
## 0.2086 -0.9517 -0.9322 0.1762
## s.e. 0.0788 0.0250 0.0844 0.0825
##
## sigma^2 estimated as 6.902e+11: log likelihood=-2706.09
## AIC=5422.17 AICc=5422.5 BIC=5438.46
rcfl_fit_2 %>%
gg_tsresiduals()
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 13 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Plot the model 1 forecast over the imputed (differenced) series
forecast(rcfl_fit, h=12)%>%
autoplot(rcfl_imputed)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
forecast_diff <- forecast(rcfl_fit, h=12)
# Get the last observed KWH value and move it to the Box-Cox scale,
# since the model 1 forecasts are differences of the transformed series
last_value <- tail(rcfl_imputed$KWH, 1)
last_bc <- box_cox(last_value, lambda_rcfl)
# Reverse the differencing on the Box-Cox scale, then back-transform
# to KWH with inv_box_cox()
forecast_kwh <- forecast_diff %>%
as_tibble() %>%
mutate(
KWH = inv_box_cox(accumulate(.x = .mean, .f = `+`, .init = last_bc)[-1], lambda_rcfl)
)
# Plot the model 2 forecast over the doubly differenced series
forecast(rcfl_fit_2, h=12)%>%
autoplot(rcfl_update)
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_line()`).
forecast2_diff <- forecast(rcfl_fit_2, h=12)
# Get the last observed KWH value from the original data
last_value_2 <- tail(rcfl_update$KWH, 1)
# Reverse the first difference manually; note KWH_diff was also seasonally
# differenced (lag 12), so a full inversion would additionally add back the
# seasonal terms
forecast_kwh_2 <- forecast2_diff %>%
as_tibble() %>%
mutate(
KWH = accumulate(.x = .mean, .f = `+`, .init = last_value_2)[-1]
)
# Export csv files with the forecasts
write.csv(forecast_kwh,"forecast_kwh_with_lambda.csv")
write.csv(forecast_kwh_2,"forecast_kwh_without_lambda.csv")
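An alternative that avoids manual reversal entirely is to model the Box-Cox-transformed KWH directly and let ARIMA() choose its own differencing, in which case fable back-transforms the forecasts to the KWH scale automatically; a minimal sketch:
# ARIMA() selects the differencing orders internally; forecasts come out in KWH
rcfl_direct <- rcfl_imputed %>%
model(direct = ARIMA(box_cox(KWH, lambda_rcfl)))
rcfl_direct %>%
forecast(h = 12) %>%
autoplot(rcfl_imputed)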
Conclusion:
The dataset includes monthly residential power consumption (KWH) from January 1998 to December 2013. After identifying one missing value, imputation was performed using the average of adjacent months to preserve seasonal continuity. Exploratory plots revealed clear annual seasonality and a notable outlier in 2010.
Two ARIMA models were developed: one on the Box-Cox-transformed series and one on the untransformed (but differenced) series using automatic ARIMA selection. Based on the model output, the Box-Cox-transformed model reports the better AICc, although AICc values computed on differently transformed responses are not directly comparable, so this ranking should be read with caution.
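A like-for-like check would hold out the final year and compare accuracy on raw KWH; a minimal sketch, assuming 2013 as the test window:
# Refit both specifications on 1998-2012 and score the 2013 holdout
rcfl_train <- rcfl_imputed %>%
filter(YearMonth < yearmonth("2013 Jan"))
rcfl_train %>%
model(bc = ARIMA(box_cox(KWH, lambda_rcfl)),
auto = ARIMA(KWH, stepwise = FALSE, approximation = FALSE)) %>%
forecast(h = 12) %>%
accuracy(rcfl_imputed)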