Predictive Analytics

Abstract

My objective was to forecast monthly net generation of solar power in the United States. Solar power generation has grown almost 10-fold over the last 10 years alone. As renewable energy, which does not have static production capability, becomes a greater share of energy production it is important that accurate forecasts can be made. Grid operators need to combine expected consumer demand with expected production when making decisions. I used data from the U.S. Energy Information Administration from Jan 2014 to April 2023. The approach was to create three models recommended from previous literature on the topic. I created ETS, ARIMA, and TSLM models to forecast the data. The ARIMA model was particularly promising with the best fit of the data. The ARIMA and TSLM models required transformations of the data to mitigate issues with autocorrelation. Additionally, I created an ensemble model averaging the forecasts from the previous models. The results found that the ETS model predicted the test set the best with a MPE of -3.54%. However, it is worth noting that this corresponds to an error on the magnitude of 500,000 megawatt hours (MWh). Depending on the requirements of decision-makers, the model may not forecast to the accuracy required, but is still a useful benchmark for future modeling.

Introduction and Significance

With most modern power systems, forecasting is usually focused on the consumer end of the pipeline. Most grid operators will want to ensure they can provide enough power to meet consumer demand. This works well when the generation capability of the plant is static, such as with nuclear power, coal plants, or natural gas. However, when the generation from a particular source is more random, such as with wind or solar generation, it is not as simple as forecasting demand. It is important to forecast the source itself.

Why is this an important outcome to model? Solar power generation, while only representing about 5% of the monthly net generation over the past year (Source: EIA.gov), has increased almost 10 fold since the EIA began tracking the monthly output less than 10 years ago. As this trend continues, our grid will be powered by a greater share of a somewhat volatile source. It will be increasingly more important to forecast what a grid’s generation will be to enable informed decision making from operators. When does output from traditional sources (i.e. nuclear or fossil fuels) need to be increased to account for a shortfall from renewable sources? When is the best time to turn off a plant for retrofit or overhaul? These are just a couple examples of questions that could be answered with an accurate model.

In this paper I look to test different models and predict solar power output over a horizon of 23 months. I will create and test 3 different models along with an ensemble model that will average predictions.

Methods

Data Set

The data set I have chosen for the midterm is the monthly solar power generation data from the U.S. Energy Information Administration. This also includes monthly wind power generation data for one of the models I will be testing. Each observation is the monthly generation (in thousands of megawatt hours) of the two renewable energy sources.

Importing Data and Initial Plots

# Import Data Set
raw_data = read.csv(file.path(path, "solar_generation_monthly.csv"))

# Create the monthly_generation object for autoplotting the time series data
# Convert to make better time series data visually
monthly_generation = pivot_longer(raw_data,
                                  cols = c("wind", "solar"),
                                  names_to = "type",
                                  values_to = "generation")

# Convert data to tsibble
monthly_generation = monthly_generation %>% 
  filter(!is.na(generation)) %>% 
  mutate(date = ym(date)) %>% 
  mutate(date = yearmonth(date)) %>% 
  as_tsibble(key = type,
             index = date)

# Build Initial Plot
monthly_generation %>% 
  autoplot() +
  labs(x = "Date", y = "Generation (1000 MWh)",
       title = "Monthly Electricity Generation (Jan 2001 - April 2023)")

# Build the acutal data set for modeling
solar_gen_data = raw_data %>% 
  filter(!is.na(solar)) %>% 
  mutate(date = ym(date)) %>% 
  mutate(date = yearmonth(date)) %>% 
  as_tsibble(index = date)

# Conduct an STL Decomp
solar_gen_data %>% 
  model(STL(solar)) %>% 
  components() %>% 
  autoplot() + 
  labs(x = "Date", y = "Solar Generation (1000 MWh)",
       title = "Solar Power Generation STL Decomposition")

The wind generation data is from as far back as 2001 and the solar generation data starts in 2014. As shown in the plot there is a noticeable upward trend (as there is increased investment within the U.S.) and seasonality to the data (likely tied to weather patterns). It is worth noting that the seasonality has a lot of variance as time goes on and modeling may require transformation.

Models

Next I will construct three models and one ensemble model using the means of the previous models’ forecasts. The models I have chosen (ETS, ARIMA, and TSLM) are in line with what the literature suggests to use for forecasting solar generation (Source: Energy Conversion and Management). Additional research has found that ARIMA-based models can be effective when the time series data is stochastic in nature, like with wind or solar (Source: IEEE Transactions on Power Systems). Additionally, I will make an ensemble model averaging the three models. There are 112 months of solar generation data to work. The data will be split 80/20 into training and test sets.

# Build a training set and a test set (80/20 split)
total_obs = dim(solar_gen_data)[1]
train_obs = floor(total_obs*0.8)
test_obs = total_obs - train_obs

solar_gen_train = head(solar_gen_data, train_obs)
solar_gen_test = tail(solar_gen_data, test_obs)

ETS

I will begin with an Error, Trend, Seasonal (ETS) model. Fable will pick the model with the best fit. Unless there are issues with the residuals or forecasts that is the model I will stick with.

# Create model and print information
solar_gen_ets = solar_gen_train %>% 
  model(ETS(solar))

report(solar_gen_ets)

## Series: solar 
## Model: ETS(M,A,M) 
##   Smoothing parameters:
##     alpha = 0.3165826 
##     beta  = 0.05298594 
##     gamma = 0.000126038 
## 
##   Initial states:
##      l[0]     b[0]      s[0]     s[-1]     s[-2]    s[-3]    s[-4]    s[-5]
##  1943.621 95.00579 0.6228554 0.7233501 0.9298238 1.055176 1.222603 1.289202
##     s[-6]    s[-7]    s[-8]    s[-9]    s[-10]    s[-11]
##  1.295394 1.261241 1.140846 1.033158 0.7640633 0.6622869
## 
##   sigma^2:  0.0031
## 
##      AIC     AICc      BIC 
## 1441.004 1449.624 1483.311

solar_gen_ets %>% 
  gg_tsresiduals(lag_max = 12) +
  labs(title = "ETS Model (M,A,M) Residuals")

components(solar_gen_ets) %>% 
  autoplot() +
  labs(title = "ETS Model (M,A,M) Components")

# One of the lag spikes is close to the limit, run a Ljung-Box test
augment(solar_gen_ets) %>% 
  features(.innov, ljung_box, lag = 12)

## # A tibble: 1 × 3
##   .model     lb_stat lb_pvalue
##   <chr>        <dbl>     <dbl>
## 1 ETS(solar)    13.8     0.314

The ETS model that was created is a Holt-Winters Multiplicative model w/ multiplicative errors (M,A,M). There is a continuous upward trend and notable seasonality. One of the lag spikes is close to being significant, but a Ljung-Box test shows that is not a problem. The Akaike Information Criterion (AIC) is 1441.

ARIMA

As with the ETS model I will let the Fable library pick the autoregressive integrated moving average (ARIMA) model that best fits the training set. Previous research shows this is a good choice for models that have a random component.

# Create model and print information
solar_gen_arima = solar_gen_train %>% 
  model(ARIMA(solar))

report(solar_gen_arima)

## Series: solar 
## Model: ARIMA(0,1,0)(2,1,0)[12] 
## 
## Coefficients:
##          sar1    sar2
##       -0.2674  0.2539
## s.e.   0.1179  0.1350
## 
## sigma^2 estimated as 315848:  log likelihood=-589.64
## AIC=1185.29   AICc=1185.62   BIC=1192.28

solar_gen_arima %>% 
  gg_tsresiduals(lag_max = 12) +
  labs(title = "ARIMA Model (0,1,0)(2,1,0)[12] Residuals")

# Significant lag spike
augment(solar_gen_arima) %>% 
  features(.innov, ljung_box, lag = 12)

## # A tibble: 1 × 3
##   .model       lb_stat lb_pvalue
##   <chr>          <dbl>     <dbl>
## 1 ARIMA(solar)    22.3    0.0343

The ARIMA model created by the Fable library has a lower AIC, which suggests a better fit. However, there appears to be autocorrelation in the residuals which is confirmed with a Ljung-Box test (p-value of 0.034). I will build another ARIMA model by performing a natural log transformation on the dependent variable (Solar Power Generation).

# Autoplot both Wind and Solar with log transformation
monthly_generation %>% 
  autoplot(log(generation)) +
  labs(x = "Date", y = "Natural Log of Generation",
       title = "Monthly Electricity Generation (Jan 2001 - April 2023)")

# Build revised model based on transformed data
solar_gen_arima2 = solar_gen_train %>% 
  model(ARIMA(log(solar)))

report(solar_gen_arima2)

## Series: solar 
## Model: ARIMA(0,1,1)(0,1,1)[12] 
## Transformation: log(solar) 
## 
## Coefficients:
##           ma1     sma1
##       -0.5286  -0.7443
## s.e.   0.1010   0.1667
## 
## sigma^2 estimated as 0.003496:  log likelihood=103.29
## AIC=-200.58   AICc=-200.24   BIC=-193.59

solar_gen_arima2 %>% 
  gg_tsresiduals(lag_max = 12) +
  labs(title = "ARIMA Model (0,1,1)(0,1,1)[12] Residuals (Log Transformed Data)")

augment(solar_gen_arima2) %>% 
  features(.innov, ljung_box, lag = 12)

## # A tibble: 1 × 3
##   .model            lb_stat lb_pvalue
##   <chr>               <dbl>     <dbl>
## 1 ARIMA(log(solar))    13.3     0.351

Transforming the data solves the problem with autocorrelation. There is still a lag spike, but another Ljung-Box test confirms it is no longer an issue (p-value of 0.351). This model has a negative AIC (-200) and appears to be a better fit than the ETS model.

Time Series Linear Model (TSLM)

The last model will be a time series linear model (TSLM). Ideally, monthly solar activity data would be useful to predict output, but I was unable to find the data. Using data provided by the EIA as well, I will see if there is any correlation between another renewable energy source, wind.

# Plot the initial data
solar_gen_data %>% 
  ggplot(aes(solar, wind)) +
  geom_point() +
  labs(x = "Solar (1000 MWh)",
       y = "Wind (1000 MWh)",
       title = "Relationship between Wind and Solar Generation")

# Calculate Correlation
cat("Correlation between solar and wind:",
    cor(solar_gen_data$solar, solar_gen_data$wind),
    "\n")

## Correlation between solar and wind: 0.7174577

# Plot the data with a log transformation
solar_gen_data %>% 
  ggplot(aes(log(solar), log(wind))) +
  geom_point() +
  labs(x = "Natural Log of Solar Generatioin",
       y = "Natural Log of Wind Generation",
       title = "Relationship between Wind and Solar Generation")

# Calculate Correlation
cat("Correlation between natural log transformations of solar and wind:",
    cor(log(solar_gen_data$solar), log(solar_gen_data$wind)),
    "\n")

## Correlation between natural log transformations of solar and wind: 0.7391381

There is strong correlation between solar and wind generation, but the points begin to fan out the greater the generation. I instead take a natural log transformation of both the predictor and the response. There is a tighter grouping of observations and a slightly stronger correlation. I will make the model with the transformed data.

solar_gen_tslm = solar_gen_train %>% 
  model(TSLM(log(solar) ~ log(wind)))

report(solar_gen_tslm)

## Series: solar 
## Model: TSLM 
## Transformation: log(solar) 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20155 -0.27865 -0.05712  0.33138  0.90422 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5.5501     1.7043  -3.257  0.00161 ** 
## log(wind)     1.4273     0.1713   8.332 1.04e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.448 on 87 degrees of freedom
## Multiple R-squared: 0.4438,  Adjusted R-squared: 0.4374
## F-statistic: 69.42 on 1 and 87 DF, p-value: 1.0448e-12

solar_gen_tslm %>% 
  gg_tsresiduals(lag_max = 12) + 
  labs(title = "TSLM Model (log(solar) ~ log(wind)) Residuals")

augment(solar_gen_tslm) %>% 
  features(.innov, ljung_box, lag = 12)

## # A tibble: 1 × 3
##   .model                       lb_stat lb_pvalue
##   <chr>                          <dbl>     <dbl>
## 1 TSLM(log(solar) ~ log(wind))    250.         0

The first model has an Adjusted \(R^2\) of 0.4374 and it looks like the predictor has a significant effect on the response. Residuals show a seasonal pattern and the Ljung-Box test p-value is effectively 0. I will rebuild the model with a seasonal effect.

solar_gen_tslm2 = solar_gen_train %>% 
  model(TSLM(log(solar) ~ log(wind) + season(period = "1 year")))

report(solar_gen_tslm2)

## Series: solar 
## Model: TSLM 
## Transformation: log(solar) 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55441 -0.12025  0.01413  0.12256  0.37193 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          -12.950940   0.865474 -14.964  < 2e-16 ***
## log(wind)                              2.116949   0.086283  24.535  < 2e-16 ***
## season(period = "1 year")season_122    0.253588   0.097553   2.599 0.011211 *  
## season(period = "1 year")season_123    0.274874   0.097838   2.809 0.006306 ** 
## season(period = "1 year")season_124    0.355807   0.098030   3.630 0.000512 ***
## season(period = "1 year")season_125    0.673477   0.097503   6.907 1.31e-09 ***
## season(period = "1 year")season_126    0.886482   0.101326   8.749 4.04e-13 ***
## season(period = "1 year")season_127    1.228356   0.103428  11.876  < 2e-16 ***
## season(period = "1 year")season_128    1.368781   0.105096  13.024  < 2e-16 ***
## season(period = "1 year")season_129    1.020431   0.102829   9.924 2.32e-15 ***
## season(period = "1 year")season_1210   0.488572   0.100932   4.841 6.64e-06 ***
## season(period = "1 year")season_1211   0.133622   0.100918   1.324 0.189450    
## season(period = "1 year")season_1212  -0.007837   0.100934  -0.078 0.938314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1949 on 76 degrees of freedom
## Multiple R-squared: 0.908,   Adjusted R-squared: 0.8935
## F-statistic: 62.49 on 12 and 76 DF, p-value: < 2.22e-16

solar_gen_tslm2 %>% 
  gg_tsresiduals(lag_max = 12) +
  labs(title = "TSLM Model (log(solar) ~ log(wind) + season()) Residuals")

augment(solar_gen_tslm2) %>% 
  features(.innov, ljung_box, lag = 12)

## # A tibble: 1 × 3
##   .model                                                       lb_stat lb_pvalue
##   <chr>                                                          <dbl>     <dbl>
## 1 "TSLM(log(solar) ~ log(wind) + season(period = \"1 year\"))"    30.5   0.00231

There is still autocorrelation, but since it is confined to the first three lag spikes I will choose to continue with this model. The Adjusted \(R^2\) is significantly better, suggesting a better fitted model. Now that the three models have been created, I will move to forecasting and testing against the hold-out set.

Results

I will create a single model object to work with and consolidate the code.

# Create combined model
solar_gen_model = solar_gen_train %>% 
  model(`ETS` = ETS(solar),
        `ARIMA` = ARIMA(log(solar)),
        `TSLM` = TSLM(log(solar) ~ log(wind) + season(period = "1 year")),
        Ensemble = combination_ensemble(ets = ETS(solar),
                                        arima = ARIMA(solar),
                                        tslm = TSLM(log(solar) ~ log(wind) + 
                                                      season(period = "1 year"))))

Prediction and Testing

With the consolidated mable created I can build forecasts for each of the four models against the test set.

# Forecast
solar_gen_forecast = solar_gen_model %>% 
  forecast(solar_gen_test)

# Plot
solar_gen_forecast %>% 
  autoplot(level = NULL, lwd = 1) +
  labs(x = "Date", y = "Solar Generation (1000 MWh)",
       title = "Model Forecasts",
       subtitle = "Test Set is Black Line") + 
  guides(color = guide_legend(title = "Model")) + 
  geom_line(data = solar_gen_test, 
            aes(x = date, y = solar), 
            lwd = 1.25)

## Warning in ggplot2::geom_point(mapping = mapping, data =
## dplyr::semi_join(object, : Ignoring unknown parameters: `linewidth`

# Calculate accuracy
accuracy_table = solar_gen_forecast %>% 
  accuracy(solar_gen_test) %>% 
  select(-c(.type, MASE, RMSSE)) %>% 
  rename("Model" = ".model")

kable(accuracy_table,
      format = "markdown",
      caption = "Model Accuracy")

Model Accuracy
Model	ME	RMSE	MAE	MPE	MAPE	ACF1
ARIMA	-1202.2588	1589.1686	1305.6956	-7.275340	8.138507	0.6757117
ETS	-563.9701	869.4812	705.5944	-3.538574	4.709508	0.4034226
Ensemble	-1262.2853	1804.5184	1500.5235	-9.022739	10.376915	0.2493008
TSLM	-1745.0511	3710.2948	3089.2130	-12.384797	19.916922	0.2526623

Discussion

###Analysis The ETS model makes an accurate model from a percentage error standpoint (an absolute error of 4.71%). However, this corresponds to an error of hundreds of thousands of megawatt hours (MWh). At the national level this may be acceptable given the size of the generation. At the more local level an error rate of that magnitude may prove untenable. Still, the ETS model does serve as a useful benchmark for future work.

###Future Work From a research perspective I would drill down into acceptable error rates with power generation and consumer demand forecasts. This will be the key decider in the usefulness of the model going forward. From a continued modeling perspective, I would attempt to correct for the bias in the ETS model (ME of -564). Additionally, I would attempt to drill down the data to the state or county level. Will the error rate hold or will the rate increase the smaller the generation becomes.

# Create plot for most accurate model with prediction intervals
solar_gen_ets_forecast = solar_gen_ets %>% 
  forecast(solar_gen_test)

solar_gen_ets_forecast %>% 
  autoplot(lwd = 1) +
  labs(x = "Date", y = "Solar Generation (1000 MWh)",
       title = "ETS Model Forecast") +
  geom_line(data = solar_gen_test, 
            aes(x = date, y = solar), 
            lwd = 1)

Conclusion

While it would be ideal to conduct more analysis of potential predictors to create better models, an ETS model can create an accurate prediction based on percentage error. I would defer to grid operators about what error is acceptable when using the data to influence decisions. If more accurate forecasts are required, this ETS model can serve as the benchmark to further refine models.

Predictive Analytics - Midterm

Eric Kenney

2023-07-20