DATA 624 Project 1 (B)

Project Introduction

This project involves analyzing and forecasting residential power usage data spanning from January 1998 to December 2013, with the goal of generating monthly forecasts for the year 2014. The dataset, provided in a single file, includes the variable ‘KWH,’ which represents power consumption in kilowatt-hours. This data will be examined to identify underlying trends, seasonal patterns, and any anomalies, which will inform the selection and calibration of appropriate time series models.

Residential Power Usage

Loading the data that will be used for the analysis.

residential_power_og <- read_xlsx("ResidentialCustomerForecastLoad-624.xlsx")

residential_power <- residential_power_og |> mutate(Date = yearmonth(`YYYY-MMM`)) |> as_tsibble(index = Date) |> select(-`YYYY-MMM`) |> select(-CaseSequence)

residential_power |> head() |>  kable() |> kable_styling() |>  kable_classic()

KWH	Date
6862583	1998 Jan
5838198	1998 Feb
5420658	1998 Mar
5010364	1998 Apr
4665377	1998 May
6467147	1998 Jun

Time Series Exploration

In this section, we examine the presence and impact of missing values within the ATM withdrawal dataset.

missing1 <- plot_intro(residential_power, title = 'Missing Information on the Residential Power Comsumption Dataset',
           ggtheme = theme_minimal())

# Plot missing volume in Column 
missing2 <- plot_missing(residential_power,title = 'Information about Missing Values in the Residential Power Comsumption Dataset',ggtheme = theme_minimal())

grid.arrange(missing1, missing2, nrow = 1,layout_matrix= rbind(1,2))

# residential_power |> filter(is.na(KWH)) 
residential_power |> filter(is.na(KWH)) |>  kable() |> kable_styling() |>  kable_classic()

KWH	Date
NA	2008 Sep

Missing data accounts for only the 0.52% of the total observations, which represents only one missing observation and thus suggests a minor impact on overall data integrity. Nonetheless, handling missing values is essential to ensure that our forecasting models are robust and not influenced by gaps in the dataset.

Visualizing Energy Comsumption in kilowatt-hour

residential_power |> filter(!is.na(KWH)) |> autoplot(KWH) +ggtitle("Residential Power Comsumption without Missing Values")

plotResiduals <- function(data){
plot1 <- data |> autoplot(.innov) + labs(title = "innovation Residuals")
plot2 <-data |> ACF(.innov)  |> autoplot() + labs(title = "Auto Correlation Function")
plot3 <-data  |> ggplot(aes(x = .innov)) + geom_histogram(binwidth = bins_cal) + labs(title = "Histogram of residuals")
  
return( grid.arrange(plot1, plot2, plot3, nrow = 2,layout_matrix= rbind(1,c(2,3))))
}

bins_cal <- function(data){
  num_bins <- nclass.FD(data)
bin_width <- (max(data) - min(data)) / num_bins
return(bin_width)
}

get_IQR <- function(var){
  Q1 <- quantile(var, 0.25)
  Q3 <- quantile(var, 0.75)
  IQR_value <- IQR(var)
  print(Q1)
    print(Q3)
    print(IQR_value)

    df <- data.frame(Q1,
                   Q3,
                   IQR_value,
                   )
  return(df)
}


residential_power |> filter(!is.na(KWH)) |> ggplot(aes(x = KWH)) +  geom_boxplot(outlier.colour="red", outlier.shape=8,outlier.size=4)

residential_power |> filter(!is.na(KWH)) |> ggplot(aes(x = KWH)) +   geom_histogram(binwidth = bins_cal)

residential_power |> filter(!is.na(KWH)) |> ggplot(aes(x=Date, y = KWH)) +   geom_point()

The analysis of energy consumption data reveals a right-skewed distribution with an outlier on the left side, as observed in the histogram, boxplot, and scatterplot. This left-side outlier creates a long tail, impacting the overall symmetry of the distribution. The right-skew in the histogram indicates that most values are concentrated at lower consumption levels, with fewer high values extending to the right.

In the scatterplot, values are mostly centered and aligned in a pattern resembling a “beam,” with an increasing trend toward the right end of the plot. This indicates that higher consumption values are gradually increasing over time or with another variable of interest. The beam shape, with values concentrated in the center and tapering outward, reflects a stable pattern with periodic higher values, potentially driven by seasonal or cyclical factors in consumption behavior.

The left-side outlier’s impact on the data suggests that addressing it may improve symmetry and model reliability. Potential next steps include investigating the outlier’s source to determine if it is a valid observation or a data entry error. If retained, applying a transformation, such as a log transformation, could help reduce skewness, compress the spread of high values, and bring the data closer to a normal distribution, facilitating more accurate analysis and modeling.

res_power_no_na <- residential_power |> filter(!is.na(KWH))


res_power_outlier <- res_power_no_na  |> filter(KWH == min(KWH))

z_scores_manual <- (res_power_outlier$KWH - median(res_power_no_na$KWH)) / mad(res_power_no_na$KWH ,constant = 1.4826)

res_power_outlier["Z-score"] <- z_scores_manual

# res_power_outlier
res_power_outlier |>  kable() |> kable_styling() |>  kable_classic()

KWH	Date	Z-score
770523	2010 Jul	-3.57261

In this analysis, a modified z-score approach was used to detect outliers in the KWH (kilowatt-hour) values for residential power usage. The modified z-score was calculated using the Median Absolute Deviation (MAD), scaled by a constant factor of 1.4826, to provide a robust measure of spread that is less influenced by extreme values. This method effectively identifies anomalies without being overly sensitive to existing outliers.

The KWH value recorded in July 2010 produced a modified z-score of -3.57, indicating it lies approximately 3.57 scaled MADs below the median, well beyond the threshold of ±3 typically used to flag outliers. This substantial deviation confirms July 2010 as an outlier, suggesting that its power usage figure does not align with the typical distribution seen across the dataset.

Given that no additional context was provided with the dataset, there is no clear information to guide an investigation into the underlying cause of this outlier. Therefore, to maintain the consistency of the data, we will impute a more representative value in place of the July 2010 entry. A suitable imputation approach might involve replacing the outlier with the median KWH value for surrounding months or estimating a value based on the overall time series trend, ensuring minimal disruption to the dataset’s integrity.

residential_power[residential_power$Date ==  res_power_outlier$Date[1],"KWH"] <- NA

After removing the outlier, the distribution will look as follows:

plt1<- residential_power |> filter(!is.na(KWH)) |> ggplot(aes(x = KWH)) +  geom_boxplot(outlier.colour="red", outlier.shape=8,outlier.size=4)+ ggtitle("Residential Power Comsumption")
plt2 <-residential_power |> filter(!is.na(KWH)) |> ggplot(aes(x = KWH)) +   geom_histogram(binwidth = bins_cal) + ggtitle("Residential Power Comsumption")
plt3 <-residential_power |> filter(!is.na(KWH)) |> ggplot(aes(x=Date, y = KWH)) +   geom_point()+ ggtitle("Residential Power Comsumption")


grid.arrange(plt3,plt1, plt2, nrow = 2,layout_matrix= rbind(1,c(2,3)))

Imputing Data

In this analysis step, multiple imputation is conducted on missing data within the Power Comsumption dataset to ensure complete and accurate time series for subsequent analysis. Using the mice package, Predictive Mean Matching (PMM) is applied, a method that fills in missing values by selecting observed values from similar cases, thereby maintaining realistic values within the range of existing data.

power_com_imp <- mice(residential_power, m = 5, method = 'pmm', maxit = 5, seed = 123, printFlag = FALSE)
residential_power_c <- complete(power_com_imp, 1) |> as_tsibble(index=Date)

Visualizations after Imputed Data

power_box <- residential_power_c |> ggplot(aes(x = KWH)) +  geom_boxplot(outlier.colour="red", outlier.shape=8,outlier.size=4) + ggtitle("Residential Power Comsumption")
power_point <-residential_power_c |> ggplot(aes(x=Date, y = KWH)) +   geom_point() + ggtitle("Residential Power Comsumption")
power_hist <- residential_power_c |> ggplot(aes(x = KWH)) +  geom_histogram(binwidth = bins_cal) + ggtitle("Residential Power Comsumption")

arr1 <- grid.arrange(power_point,power_box, power_hist, nrow = 2,layout_matrix= rbind(1,c(2,3)))

The difference is barely noticeable because only on observation was missing in the data.

Data Transformation

To enhance the reliability of the residential power consumption data (KWH), we will address outliers and stabilize variance through data transformation. An initial outlier detection using a modified z-score flagged an anomalous value in July 2010, with a z-score of -3.57. Lacking additional context to explore the cause, we will impute a more representative value, such as the median of neighboring months, to ensure consistency.

Following this, a log transformation will be applied to stabilize variance and reduce skewness, making the data more suitable for time series modeling. This transformation will help compress extreme values, further reducing the influence of any outliers, and create a more stable foundation for accurate forecasting of power consumption.

res_power_lambda <- residential_power_c |> features(KWH,features = guerrero) |> pull(lambda_guerrero)


# #Boxcox
# residential_power_c$KWH_TR <- box_cox(residential_power_c$KWH, res_power_lambda)
# res_power_lambda
residential_power_c$KWH_TR <- log(residential_power_c$KWH)

Data Post-Transformation

After the Log transformation, that data looks very similar to a normal distribution.

power_box <- residential_power_c |> ggplot(aes(x = KWH_TR)) +  geom_boxplot(outlier.colour="red", outlier.shape=8,outlier.size=4) + ggtitle("Residential Power Comsumption")
power_point <-residential_power_c |> ggplot(aes(x=Date, y = KWH_TR)) +   geom_point() + ggtitle("Residential Power Comsumption")
power_hist <- residential_power_c |> ggplot(aes(x = KWH_TR)) +  geom_histogram(binwidth = bins_cal) + ggtitle("Residential Power Comsumption")

arr1 <- grid.arrange(power_point,power_box, power_hist, nrow = 2,layout_matrix= rbind(1,c(2,3)))

Power Comsumption Insights

To gain insights into the Power Consumption Data, STL decomposition will be applied to separate the time series into its seasonal, trend, and residual components, offering a clearer view of the underlying patterns and structures within the data.

power_con_comp <- residential_power_c |> model(stl = STL(KWH_TR)) |> components()

power_con_comp |> autoplot()

As part of this process, visualizing the ACF and PACF will aid in identifying any autocorrelation patterns and seasonal structures within the time series, which are essential for informing model selection and parameter tuning.

power_con_comp |> gg_tsdisplay(KWH_TR,plot_type = "partial",lag_max = 30)

After decomposing the residential power consumption time series using STL (Seasonal-Trend decomposition using Loess), the Autocorrelation Function (ACF) reveals an interesting pattern, with a regular six-month cycle in both the positive and negative correlations.

Specifically, the ACF displays positive autocorrelation every six lags, starting at lag 6 and repeating at 12, 18, 24, and so on. Each of these main positive peaks is flanked by neighboring lags on either side, each with significant, though smaller, autocorrelations (approximately half the size of the main peaks). This positive pattern suggests a six-monthly recurring relationship in power consumption, where each peak reflects similar consumption levels half a year apart.

Similarly, on the negative side, there is a recurring negative autocorrelation every six months, beginning at lag 3 and continuing at lags 9, 15, 21, and so forth. Each of these main negative correlations also has neighboring lags with high autocorrelation, creating a cyclical structure where power usage shows inverse relationships at half-year intervals.

This pattern highlights a rich seasonal structure in the data, with power consumption fluctuating cyclically in both positive and negative directions every six months.

Model Preparations

To forecast residential power consumption, several models will be implemented to capture the seasonal and trend patterns identified in the data. The models under consideration include a Seasonal Naïve model (SNAIVE), an ETS model with additive error, damped trend, and additive seasonality suited to the annual cycle (ETS(A, Ad, A)), a standard ETS model with additive error, additive trend, and additive seasonality (ETS(A, A, A)), a Seasonal ARIMA model with annual differencing (informed by this analysis), and a dynamic ARIMA model that will automatically determine optimal parameters, likely selecting seasonal components to align with the yearly structure as needed.

For the ARIMA models, ensuring stationarity is essential to produce reliable forecasts. To evaluate stationarity, we will apply unit root tests to assess whether seasonal or non-seasonal differencing is necessary, guiding us in preparing the dataset for effective ARIMA modeling and enabling robust forecasts of power consumption.

ndiffs <- bind_cols(
residential_power_c |>
  features(KWH_TR, unitroot_ndiffs),

residential_power_c |>
  features(KWH_TR, unitroot_nsdiffs)
)

# ndiffs
ndiffs |> kable() |> kable_styling() |>  kable_classic()

ndiffs	nsdiffs
1	1

Unit root tests indicate that only one seasonal difference is required to make the data stationary for the ARIMA model. Ensuring stationarity enhances model accuracy and reliability in forecasting. While ARIMA will be a primary model, we will also explore other forecasting models to determine the most effective approach for our data.

residential_power_c$diff <- difference(residential_power_c$KWH_TR,lag = 12)
# residential_power_c$diff <- difference(residential_power_c$diff)

residential_power_c |>
  features(diff, unitroot_kpss) |> kable() |> kable_styling() |>  kable_classic()

kpss_stat	kpss_pvalue
0.1477957	0.1

residential_power_c |>
  features(diff, unitroot_nsdiffs) |> kable() |> kable_styling() |>  kable_classic()

nsdiffs
0

The KPSS test was conducted to assess the stationarity of the data. The KPSS statistic of 0.1477957 and a p-value of 0.1 suggest that the data is likely stationary, as the high p-value (greater than common significance levels like 0.05 or 0.01) does not provide evidence to reject the null hypothesis of stationarity.

residential_power_c |> autoplot(diff)

residential_power_c |> gg_tsdisplay(diff,plot_type='partial',lag_max = 50)

After applying yearly differencing (lag 12) to the residential power consumption data, the autocorrelation structure reveals a few persistent patterns. The ACF shows a strong positive autocorrelation at lag 1, suggesting a significant relationship between consecutive years. This pattern often indicates a short-term trend or dependency, where power consumption in one year closely relates to that of the following year. Additionally, the ACF displays a notable negative autocorrelation at lag 12, which suggests a seasonal effect that persists on an annual basis despite the differencing applied. In the PACF, a similar structure emerges, with an additional negative autocorrelation at lag 24, indicating a potential dampening or alternating effect over two-year intervals.

Based on these insights, a SARIMA model is well-suited to capture the remaining structure in the data. To address the positive lag 1 autocorrelation, a non-seasonal AR(1) term will be included to capture the short-term dependency. The negative autocorrelation at lag 12 can be effectively addressed by incorporating a seasonal MA(1) term, which models the annual cycle that still influences the data after differencing. Given the two-year effect indicated by lag 24 in the PACF, a seasonal AR(1) term is also recommended to capture any extended seasonal structure.

The proposed SARIMA model configuration is:

ARIMA(1,0,0)(0,1,1)[12]

Model Training

In this section, we outline the approach used to train and evaluate several forecasting models on the power consumption dataset. For training, we utilized 80% of the data, allowing for a thorough evaluation on the remaining test set. The models explored include a Seasonal Naïve model (SNAIVE), an ETS model with additive error, damped trend, and additive seasonality (ETS(A, Ad, A)) tailored to the yearly cycle, an ETS model with purely additive components (ETS(A, A, A)), a Seasonal ARIMA model incorporating yearly seasonal differencing, and an ARIMA model with parameters automatically inferred by R. By applying and comparing these models, we aim to identify the best-performing approach in terms of accuracy and generalizability for reliable power consumption forecasting.

power_con_count <- residential_power_c |> nrow() * 0.8
power_con_count <- floor(power_con_count)
residential_power_train <- residential_power_c |> select(-KWH) |> slice(1:power_con_count)

residential_power_forecast_len <- nrow(residential_power_c) - power_con_count

residential_power_models <- residential_power_train |> model(snaive=SNAIVE(KWH_TR  ~ lag("year")),
                                AAdA=ETS(KWH_TR ~ error("A")+trend("Ad")+season("A",period=12)),
                                AAA=ETS(KWH_TR ~ error("A")+trend("A")+season("A",period=12)),
                                sarima = ARIMA(KWH_TR ~ pdq(1, 0, 0) + PDQ(0, 1, 1, period = 12)),
                                sarima_auto = ARIMA(KWH_TR)
                                )

residential_power_fit_snaive <- residential_power_models |> select(snaive)
residential_power_fit_aada <- residential_power_models |> select(AAdA)
residential_power_fit_aaa <- residential_power_models |> select(AAA)
residential_power_fit_sarima <- residential_power_models |> select(sarima)
residential_power_fit_sarima_auto <- residential_power_models |> select(sarima_auto)

Forecast

This section presents the forecasted values produced by each model, evaluated against the actual values in the test set. By comparing these forecasts with the observed data, we can later assess each model’s accuracy and ability to generalize beyond the training period.

residential_power_forecast1 <-  residential_power_fit_snaive |> forecast(h=residential_power_forecast_len)
residential_power_forecast2 <- residential_power_fit_aada |> forecast(h=residential_power_forecast_len)
residential_power_forecast3 <- residential_power_fit_aaa |> forecast(h=residential_power_forecast_len)
residential_power_forecast4 <-  residential_power_fit_sarima |> forecast(h=residential_power_forecast_len)
residential_power_forecast5 <-  residential_power_fit_sarima_auto |> forecast(h=residential_power_forecast_len)

residential_power_forecast1 |> autoplot(level = NULL) + autolayer(residential_power_c,KWH_TR)+ggtitle("Snaive Forecast")

residential_power_forecast2 |> autoplot(level = NULL) + autolayer(residential_power_c,KWH_TR)+ggtitle("ETS (A,Ad,A) Forecast")

residential_power_forecast3 |> autoplot(level = NULL) + autolayer(residential_power_c,KWH_TR)+ggtitle("ETS (A,A,A) Forecast")

residential_power_forecast4 |> autoplot(level = NULL) + autolayer(residential_power_c,KWH_TR)+ggtitle("SARIMA(1,0,0)(0,1,1)[12]")

residential_power_forecast5 |> autoplot(level = NULL) + autolayer(residential_power_c,KWH_TR)+ggtitle("SARIMA Forecast (Automatic Parameter Selection)")

Model Evaluation

The forecast performance of each model on the test set offers critical insights into their effectiveness and reliability for predicting future power consumption, aiding in the selection of the optimal forecasting approach. The models assessed include the Seasonal Naïve (SNAIVE), ETS(AAA), Seasonal ARIMA with automatic parameter selection (SARIMA_auto), and ARIMA models, each evaluated on both training and test datasets. Key performance metrics considered are Mean Error (ME), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Scaled Error (MASE), Root Mean Square Scaled Error (RMSSE), and Residual Autocorrelation at Lag 1 (ACF1).

bind_rows(
  residential_power_fit_snaive |> accuracy(),
residential_power_fit_aada |> accuracy(),
residential_power_fit_aaa |> accuracy(),
residential_power_fit_sarima |> accuracy(),
residential_power_fit_sarima_auto |> accuracy(),
residential_power_forecast1 |> accuracy(residential_power_c),
residential_power_forecast2 |> accuracy(residential_power_c),
residential_power_forecast3 |> accuracy(residential_power_c),
residential_power_forecast4 |> accuracy(residential_power_c),
residential_power_forecast5 |> accuracy(residential_power_c)) |> select(-MAPE) |> select(-MPE) |> kable() |> kable_styling() |>  kable_classic()

.model	.type	ME	RMSE	MAE	MASE	RMSSE	ACF1
snaive	Training	0.0113601	0.1143120	0.0914488	1.0000000	1.0000000	0.2594660
AAdA	Training	0.0028725	0.0847129	0.0676075	0.7392936	0.7410672	0.2513395
AAA	Training	-0.0007568	0.0844239	0.0658837	0.7204437	0.7385389	0.2806531
sarima	Training	-0.0005121	0.0830166	0.0638212	0.6978901	0.7262278	0.0048314
sarima_auto	Training	-0.0005804	0.0829831	0.0640263	0.7001324	0.7259348	-0.0050456
snaive	Test	0.0498867	0.1281067	0.1069666	1.1696882	1.1206755	0.3441222
AAdA	Test	0.0792147	0.1357593	0.1034322	1.1310393	1.1876205	0.1866026
AAA	Test	0.0738503	0.1307530	0.0988158	1.0805582	1.1438248	0.2021484
sarima	Test	0.0857929	0.1335112	0.0999421	1.0928744	1.1679539	0.2226685
sarima_auto	Test	0.0855011	0.1334425	0.0998193	1.0915317	1.1673525	0.2229940

Seasonal Naive Residuals

residential_power_models |> select(snaive) |> report()

## Series: KWH_TR 
## Model: SNAIVE 
## 
## sigma^2: 0.013

residential_power_snaive <- residential_power_models |> select(snaive) |> augment() 

plotResiduals(residential_power_snaive)

ETS(A,Ad,A) Residuals

residential_power_models |> select(AAdA) |> report()

## Series: KWH_TR 
## Model: ETS(A,Ad,A) 
##   Smoothing parameters:
##     alpha = 0.1164068 
##     beta  = 0.0001002402 
##     gamma = 0.0001006765 
##     phi   = 0.9784401 
## 
##   Initial states:
##      l[0]        b[0]        s[0]      s[-1]       s[-2]     s[-3]     s[-4]
##  15.59394 0.001677884 -0.07297249 -0.2697882 -0.08340296 0.1809632 0.2498246
##     s[-5]        s[-6]      s[-7]      s[-8]       s[-9]     s[-10]    s[-11]
##  0.189836 -0.005676624 -0.2359714 -0.1927465 -0.05773899 0.07897907 0.2186942
## 
##   sigma^2:  0.0081
## 
##       AIC      AICc       BIC 
##  50.29991  55.40439 104.84780

residential_power_aada <- residential_power_models |> select(AAdA) |> augment()

plotResiduals(residential_power_aada)

ETS(A,A,A) Residuals

residential_power_models |> select(AAA) |> report()

## Series: KWH_TR 
## Model: ETS(A,A,A) 
##   Smoothing parameters:
##     alpha = 0.07147638 
##     beta  = 0.00010008 
##     gamma = 0.0001006445 
## 
##   Initial states:
##      l[0]         b[0]        s[0]      s[-1]      s[-2]     s[-3]     s[-4]
##  15.62357 0.0006794174 -0.06511499 -0.2678763 -0.1006579 0.1901032 0.2546851
##      s[-5]        s[-6]      s[-7]     s[-8]       s[-9]     s[-10]    s[-11]
##  0.1963945 -0.004332634 -0.2424415 -0.173999 -0.05527932 0.06741975 0.2010991
## 
##   sigma^2:  0.008
## 
##      AIC     AICc      BIC 
## 47.25414 51.78747 98.77158

residential_power_aaa <- residential_power_models |> select(AAA) |> augment()

plotResiduals(residential_power_aaa)

Seasonal ARIMA (1, 0, 0),(0, 1, 1)m=12 Residuals

residential_power_models |> select(sarima) |> report()

## Series: KWH_TR 
## Model: ARIMA(1,0,0)(0,1,1)[12] w/ drift 
## 
## Coefficients:
##         ar1     sma1  constant
##       0.250  -0.7423    0.0066
## s.e.  0.083   0.0737    0.0026
## 
## sigma^2 estimated as 0.007641:  log likelihood=140.44
## AIC=-272.89   AICc=-272.59   BIC=-261.09

residential_power_sarima <- residential_power_models |> select(sarima) |> augment()
plotResiduals(residential_power_sarima)

Seasonal ARIMA (0, 0, 2),(0, 1, 1)m=7 Residuals (Automatic Parameter Selection)

residential_power_models |> select(sarima_auto) |> report()

## Series: KWH_TR 
## Model: ARIMA(0,0,1)(0,1,1)[12] w/ drift 
## 
## Coefficients:
##          ma1     sma1  constant
##       0.2740  -0.7387    0.0087
## s.e.  0.0928   0.0730    0.0033
## 
## sigma^2 estimated as 0.007635:  log likelihood=140.56
## AIC=-273.13   AICc=-272.83   BIC=-261.33

residential_power_sarima_auto <- residential_power_models |> select(sarima_auto) |> augment()
plotResiduals(residential_power_sarima_auto)

In evaluating the residential power consumption data, several models were considered, each showing varying degrees of effectiveness in capturing underlying patterns and generalizing to new data. The snaive model served as a basic benchmark, yielding the highest test set errors with an RMSE of 0.1281 and MAE of 0.1069, reflecting its simplicity and limited capacity for accurate forecasting. The ETS models—ETS(AAdA) and ETS(AAA)—performed relatively well on the training data, achieving lower training RMSE values of 0.0847 and 0.0844, respectively, and comparable MAE values. However, these models showed higher error on the test set, with ETS(AAdA) yielding an RMSE of 0.1358 and ETS(AAA) slightly lower at 0.1308. While effective on historical data, the ETS models’ increased test set errors suggest a degree of overfitting, reducing their reliability for forecasting.

The SARIMA models, both manual and automatic, emerged as the top-performing models. The SARIMA (manual) configuration demonstrated strong accuracy across datasets, with a training RMSE of 0.0830 and MAE of 0.0638. On the test set, it achieved competitive values, with an RMSE of 0.1335 and an MAE of 0.0999. The residual autocorrelation (ACF1) was minimal (0.0048), indicating that the SARIMA (manual) model successfully captured the core seasonal and trend components in the data without leaving significant residual patterns.

Selected Model Forecasts*

The SARIMA (auto) model provided the best overall performance, with slightly lower error metrics on the test set compared to the SARIMA (manual). It achieved a test RMSE of 0.1334 and an MAE of 0.0998, demonstrating its high accuracy in forecasting. The MASE (1.0915) and RMSSE (1.1674) metrics confirm its ability to generalize effectively, while the ACF1 of 0.2229 suggests minimal residual autocorrelation, indicating that the model captures the majority of the data’s structure. These results make the SARIMA (auto) model the optimal choice for forecasting power consumption, offering both low forecast error and strong generalizability.

 residential_power_final_forecast <- residential_power_fit_sarima_auto |> forecast(h=residential_power_forecast_len*2)

residential_power_final_forecast |> autoplot() + autolayer(residential_power_train,KWH_TR) + labs(title = "ARIMA(0,0,1)(0,1,1)[12] w/ drift")

residential_power_final_forecast |> autoplot() + autolayer(residential_power_c,KWH_TR) + labs(title = "ARIMA(0,0,1)(0,1,1)[12] w/ drift")

Saving Predictions

Forecast predictions are saved in both the original and transformed scales to ensure clarity and interpretability. While transformations improve model stability and performance, reverting predictions to the original scale makes results more accessible. This dual-saving approach maintains analytic precision and allows for straightforward interpretation of forecasted values.

residential_power_predicted <-  data.frame(Date=as.character(residential_power_final_forecast$Date),
                         KWH_Yhat=residential_power_final_forecast$.mean,
                         KWH_OG_Scale= round(exp(residential_power_final_forecast$.mean),2) )

write_xlsx(residential_power_predicted,"residential_power_predicted.xlsx")

Conclusion

This analysis and forecasting project on residential power consumption data from January 1998 to December 2013 offers significant insights into the dataset’s patterns and the performance of various forecasting models. A range of models was implemented, including a Seasonal Naïve model, ETS models with additive and damped trend adjustments, and two SARIMA configurations—one with manually set parameters and another with automatic parameter selection. The data was right-skewed with an outlier, which was managed with a modified z-score approach and imputed for stability, while a log transformation helped reduce skewness and enhance the distribution.

The evaluation of models revealed that the SARIMA models, both manual and automatic, were most effective in capturing the seasonal and trend structures in the data. The SARIMA with automatic parameter selection emerged as the best-performing model, achieving low test set errors, including an RMSE of 0.1334 and an MAE of 0.0998, while maintaining minimal residual autocorrelation. These metrics underscore its ability to generalize well, making it the optimal choice for forecasting future power consumption.