Project1 Part B: Residential Power Forecast

Exploratory Data Analysis (EDA)
ACF Plots
Power Forecast and Residual diagnostics

Part B comprises a straightforward dataset detailing residential power consumption from January 1998 to December 2013. The task involves modeling this dataset and generating a monthly forecast for 2014. The data is provided in a single file, with the variable ‘KWH’ representing power consumption in Kilowatt hours, while the remaining variables are self-explanatory.

Exploratory Data Analysis (EDA)

Uploading the data

Checking datatypes

str(df_power)

## 'data.frame':    192 obs. of  3 variables:
##  $ CaseSequence: int  733 734 735 736 737 738 739 740 741 742 ...
##  $ YYYY.MMM    : chr  "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
##  $ KWH         : int  6862583 5838198 5420658 5010364 4665377 6467147 8914755 8607428 6989888 6345620 ...

skim(df_power)

Data summary
Name	df_power
Number of rows	192
Number of columns	3
_______________________
Column type frequency:
character	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
YYYY.MMM	0	1	8	8	0	192	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
CaseSequence	0	1.00	828.5	55.57	733	780.75	828.5	876.25	924	▇▇▇▇▇
KWH	1	0.99	6502474.6	1447570.89	770523	5429912.00	6283324.0	7620523.50	10655730	▁▁▇▅▁

The CaseSequence column needs to be double-checked to confirm that it functions solely as a unique identifier for power consumption within a given year and month.

unique_ids <- unique(df_power$CaseSequence)
length_unique_ids <- length(unique_ids)
print(length_unique_ids)

## [1] 192

With 192 unique variables observed in the CaseSequence column, it appears more of an identifier. Consequently, using this information for forecasting purposes will be avoided.

Creating a tsibble object

df_power_ts<- df_power %>%
  mutate(Month = yearmonth(YYYY.MMM, "%Y-%b")) %>%
  select(-YYYY.MMM)%>%
  as_tsibble(index = Month)

Plotting the Residential Power data to provide a clearer understanding of the time series.

df_power_ts %>%
  autoplot(KWH) +
  labs(title= "Residential Power")

To address the single missing value and the outlier observed after January 2010 in the forecast, imputing the mean of the power for both instances has been chosen as the corrective measure.

df_power_mean <- df_power_ts %>%
  mutate(KWH = if_else(is.na(KWH), round(mean(KWH, na.rm = TRUE)), KWH))%>%
  mutate(KWH= if_else(KWH == 770523, round(mean(KWH, na.rm = TRUE)), KWH))

df_power_mean%>%
  autoplot(KWH) +
  labs(title= "Residential Power")

The decision to impute the mean effectively stabilized the data.

Examining the histogram to assess whether normalization of the data is necessary.

ggplot(df_power_mean, aes(x = KWH)) +
  geom_histogram() +
  labs(title = "Histogram of KWH")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Given the right-skewed nature of the histogram, incorporating a Box-Cox transformation into the forecasting process is advisable.

ACF Plots

A notable spike is observed at certain lags, indicating the presence of seasonality in the data. Specifically, there is a significant peak at lag 6, suggesting a 6-month seasonality pattern.

df_power_mean %>%
  ACF(KWH) %>%
  autoplot() +
  labs(title="Residential Power")

Power Forecast and Residual diagnostics

In light of the observed seasonality, both an SNAIVE model and a Holt-Winters’ method model were employed to forecast the power for each month of 2014. However, upon conducting residual diagnostics tests, although the plots of the residuals, ACF, and histogram provided the characteristics needed, the Portmanteau test of the models yielded significant p-values. A significant p-value typically indicates the presence of autocorrelation in the residuals. Autocorrelation in the residuals suggests that the model has not captured all the information present in the data, and there are still patterns or structures that the model has not adequately explained. Therefore, a potential future forecasting model may involve an ARIMA model.

lambda <- df_power_mean %>%
  features(KWH, features = guerrero) %>%
  pull(lambda_guerrero)

fit_df_power_a <- df_power_mean%>%
  model(SNAIVE = SNAIVE(box_cox(KWH, lambda)),
        ETS = ETS((box_cox(KWH, lambda) ~ error("A") + trend("A") +
                                                season("A"))))

fit_df_power_b <- df_power_mean%>%
  model(ETS = ETS((box_cox(KWH, lambda) ~ error("A") + trend("A") +
                                                season("A"))))
fit_df_power_c <- df_power_mean%>%
  model(SNAIVE = SNAIVE(box_cox(KWH, lambda)))


fc_df_power <- fit_df_power_a%>%
  forecast(h= 12)

fc_df_power%>%
  autoplot(df_power_mean, level = 80) +
  labs(y = "KWH in hundreds of dollars",
       title = "ATM1 Forecast in May 2010")

fit_df_power_b |> gg_tsresiduals()

fit_df_power_b |> gg_tsresiduals()

augment(fit_df_power_a)%>%
  features(.innov, ljung_box, lag = 6)

## # A tibble: 2 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 ETS       20.5   0.00223
## 2 SNAIVE    19.8   0.00303

Project1 Part B: Residential Power Forecast

Melissa Bowman

2024-03-24

Exploratory Data Analysis (EDA)

ACF Plots

Power Forecast and Residual diagnostics