Part B comprises a straightforward dataset detailing residential power consumption from January 1998 to December 2013. The task involves modeling this dataset and generating a monthly forecast for 2014. The data is provided in a single file, with the variable ‘KWH’ representing power consumption in Kilowatt hours, while the remaining variables are self-explanatory.
Uploading the data
Checking datatypes
str(df_power)
## 'data.frame': 192 obs. of 3 variables:
## $ CaseSequence: int 733 734 735 736 737 738 739 740 741 742 ...
## $ YYYY.MMM : chr "1998-Jan" "1998-Feb" "1998-Mar" "1998-Apr" ...
## $ KWH : int 6862583 5838198 5420658 5010364 4665377 6467147 8914755 8607428 6989888 6345620 ...
skim(df_power)
| Name | df_power |
| Number of rows | 192 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| YYYY.MMM | 0 | 1 | 8 | 8 | 0 | 192 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CaseSequence | 0 | 1.00 | 828.5 | 55.57 | 733 | 780.75 | 828.5 | 876.25 | 924 | ▇▇▇▇▇ |
| KWH | 1 | 0.99 | 6502474.6 | 1447570.89 | 770523 | 5429912.00 | 6283324.0 | 7620523.50 | 10655730 | ▁▁▇▅▁ |
The CaseSequence column needs to be double-checked to confirm that it functions solely as a unique identifier for power consumption within a given year and month.
unique_ids <- unique(df_power$CaseSequence)
length_unique_ids <- length(unique_ids)
print(length_unique_ids)
## [1] 192
With 192 unique variables observed in the CaseSequence column, it appears more of an identifier. Consequently, using this information for forecasting purposes will be avoided.
Creating a tsibble object
df_power_ts<- df_power %>%
mutate(Month = yearmonth(YYYY.MMM, "%Y-%b")) %>%
select(-YYYY.MMM)%>%
as_tsibble(index = Month)
Plotting the Residential Power data to provide a clearer understanding of the time series.
df_power_ts %>%
autoplot(KWH) +
labs(title= "Residential Power")
To address the single missing value and the outlier observed after January 2010 in the forecast, imputing the mean of the power for both instances has been chosen as the corrective measure.
df_power_mean <- df_power_ts %>%
mutate(KWH = if_else(is.na(KWH), round(mean(KWH, na.rm = TRUE)), KWH))%>%
mutate(KWH= if_else(KWH == 770523, round(mean(KWH, na.rm = TRUE)), KWH))
df_power_mean%>%
autoplot(KWH) +
labs(title= "Residential Power")
The decision to impute the mean effectively stabilized the
data.
Examining the histogram to assess whether normalization of the data is necessary.
ggplot(df_power_mean, aes(x = KWH)) +
geom_histogram() +
labs(title = "Histogram of KWH")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Given the right-skewed nature of the histogram, incorporating a Box-Cox transformation into the forecasting process is advisable.
A notable spike is observed at certain lags, indicating the presence of seasonality in the data. Specifically, there is a significant peak at lag 6, suggesting a 6-month seasonality pattern.
df_power_mean %>%
ACF(KWH) %>%
autoplot() +
labs(title="Residential Power")
In light of the observed seasonality, both an SNAIVE model and a Holt-Winters’ method model were employed to forecast the power for each month of 2014. However, upon conducting residual diagnostics tests, although the plots of the residuals, ACF, and histogram provided the characteristics needed, the Portmanteau test of the models yielded significant p-values. A significant p-value typically indicates the presence of autocorrelation in the residuals. Autocorrelation in the residuals suggests that the model has not captured all the information present in the data, and there are still patterns or structures that the model has not adequately explained. Therefore, a potential future forecasting model may involve an ARIMA model.
lambda <- df_power_mean %>%
features(KWH, features = guerrero) %>%
pull(lambda_guerrero)
fit_df_power_a <- df_power_mean%>%
model(SNAIVE = SNAIVE(box_cox(KWH, lambda)),
ETS = ETS((box_cox(KWH, lambda) ~ error("A") + trend("A") +
season("A"))))
fit_df_power_b <- df_power_mean%>%
model(ETS = ETS((box_cox(KWH, lambda) ~ error("A") + trend("A") +
season("A"))))
fit_df_power_c <- df_power_mean%>%
model(SNAIVE = SNAIVE(box_cox(KWH, lambda)))
fc_df_power <- fit_df_power_a%>%
forecast(h= 12)
fc_df_power%>%
autoplot(df_power_mean, level = 80) +
labs(y = "KWH in hundreds of dollars",
title = "ATM1 Forecast in May 2010")
fit_df_power_b |> gg_tsresiduals()
fit_df_power_b |> gg_tsresiduals()
augment(fit_df_power_a)%>%
features(.innov, ljung_box, lag = 6)
## # A tibble: 2 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ETS 20.5 0.00223
## 2 SNAIVE 19.8 0.00303