Fitting Coefficients — First Attempt

Robert Link

3/8/2018

knitr::opts_chunk$set(fig.show='hold', results='hold', fig.width=7, fig.height=4)
library(dplyr, warn.conflicts=FALSE)
library(GCAMelec)

Data

For each fuel, in each year, we need the share and the weighted mean cost. We’ll start with the dataset that gives cost and share by technology and aggregate those by fuel.

rawdata <- GCAMelec::electricity_raw
rawdata.exzero <- filter(rawdata, mshare > 0)
share_by_fuel.exzero <- aggregate_by_fuel(rawdata.exzero)
head(share_by_fuel.exzero, 10)

## # A tibble: 10 x 4
##     year fuel_gen    share  cost
##    <int> <fct>       <dbl> <dbl>
##  1  1995 biomass   0.00422 198. 
##  2  1995 coal      0.368    74.0
##  3  1995 gas       0.598    48.6
##  4  1995 MSW       0.00674 535. 
##  5  1995 petroleum 0.00258  34.7
##  6  1995 water     0.0180  124. 
##  7  1995 wind      0.00264 158. 
##  8  1996 biomass   0.0185  151. 
##  9  1996 coal      0.297    64.3
## 10  1996 gas       0.524    37.2

Loss function

We need to define a loss function, which tells us how far our model is from agreement with the data. We will use the cross-entropy loss function, \(H\). For a set of parameters \(\mathbf{a}\), observed shares \(p_i\), and model shares \(q_i\), \[ H(\mathbf{a}) = -\sum_i p_i \ln q_i \]

In order to have a little more flexibility, GCAMelec has defined a function that creates a loss function customized to a specific share model and data set. Thus, if our share model is called gcam_old_logit, and our dataset is called d, we would call xentropy(gcam_old_logit, d, 'share') to get back a function that will calculate the loss for gcam’s old logit, using the share column of data frame d. That function can then be passed to a minimization function like nlm.

GCAMelec also defines the function(s) that compute the model predictions, given the parameters. The general assumption (unless otherwise specified) is that the first parameter in the vector is the logit exponent (or equivalent), and the remaining ones are the share weight parameters in the same order as the levels in the factor of choices.

Model fit (ex. rows with zero share)

Fit the full model

Run the fit on the gcam old logit and print the coefficients retrieved.

fun <- xentropy(gcam_old_logit, share_by_fuel.exzero)
pstart <- c(-3, rep(0, length(levels(share_by_fuel.exzero$fuel_gen))))
moldlogit.1 <- nlm(fun, pstart)

beta <- moldlogit.1$estimate[1]
alpha <- exp(moldlogit.1$estimate[-1])
names(alpha) <- levels(share_by_fuel.exzero$fuel_gen)
cat('logit exponent = ', signif(beta,2), '\n')
cat('share weights=\n')
signif(alpha,2)

## logit exponent =  -0.46 
## share weights=
##    biomass       coal        gas geothermal        MSW    nuclear 
##      0.480      2.900     18.000      0.140      0.097      4.700 
##  petroleum      solar      water       wind 
##      0.170      2.700      0.220      6.200

Construct a plot of actual shares vs model predictions. Actual shares are in blue, model shares in black.

plt1 <- plotfit(share_by_fuel.exzero, gcam_old_logit, moldlogit.1)
plt1

The fit here is not great. The small absolute value for the logit exponent means that the shares don’t change very much with price, and the model is predicting that the shares stay more or less constant at their average values over the data set.

Fit a restricted model

We can fix the logit exponent to the value used in gcam and compare the results.

fixed_logit_val <- -3
logit_fixed3 <- old_logit_fixed_exp(fixed_logit_val)
fun <- xentropy(logit_fixed3, share_by_fuel.exzero)
pstart <- rep(0, length(levels(share_by_fuel.exzero$fuel_gen)))
moldlogit.2 <- nlm(fun,pstart)

alpha <- exp(moldlogit.2$estimate)
names(alpha) <- levels(share_by_fuel.exzero$fuel_gen)
cat('logit exponent = ', signif(fixed_logit_val,2), '\n')
cat('share weights=\n')
signif(alpha,2)

## logit exponent =  -3 
## share weights=
##    biomass       coal        gas geothermal        MSW    nuclear 
##      1.200      1.300      7.800      0.110      0.075      4.400 
##  petroleum      solar      water       wind 
##      0.062     29.000      0.100     13.000

Plot the restricted model against the actual shares. As before, actual shares are in blue, model shares in black.

plt2 <- plotfit(share_by_fuel.exzero, logit_fixed3, moldlogit.2)
plt2

The version with the exponent fixed at -3 “looks better” because it shows more variation over time, but it doesn’t actually fit the data better:

cat('Cross-entropy for model 1 (free logit exponent):  ', signif(moldlogit.1$minimum,3),
    '\nCross-entropy for model 2 (fixed logit exponent): ', signif(moldlogit.2$minimum,3), '\n')

## Cross-entropy for model 1 (free logit exponent):   22.4 
## Cross-entropy for model 2 (fixed logit exponent):  25

Comments

Modeling

As written, the GCAM model doesn’t do a very good job of reproducing the observed data. However, one intriguing observation is that in the fixed logit exponent fit, the model appears to be leading the observational data by 3-5 years. This can be seen especially in the coal data, where the model shows two peaks in new coal capacity that can be seen in the data a few years later. A similar, albeit weaker, effect can be seen in gas and wind. This could be telling us that choices are influenced by backward-looking cost estimates. It’s not hard to add a fittable lag to the model, but before I do that I want to be sure that I understand what “startyr” means in the input data.

Another refinement is that GCAM runs in five-year time steps, so the values being modeled should actually be five-year average shares. I think the best way to do this is to continue to model one-year shares, but to average them over five years and compare those to similarly averaged observation (vs. the alternative of trying to compute some kind of five-year averaged prices to use in modeling).

Data

This dataset still isn’t where we want it to be. Several of the fuel categories are missing in many of the years. Presumably this is because there was no additional capacity added in those years, but we still need those costs in the dataset. An observation that a choice wasn’t used is an observation and needs to be included as such. I haven’t looked at the technology-level categories, but I expect this same comment applies to them. If we genuinely can’t get costs for some of the options in some years, then we need to include those as explicitly missing data, in case we want to use Bayesian imputation in some future analysis.

Our original plan was to work with aggregate shares, but in the long run, I think we will need to work with actual counts of plant deployments, if we have them. If not, then we at least need the amount of capacity deployed in each category, so that we can try to estimate counts based on average plant sizes. The problem with what we are doing here is that although we can find the parameters that best reproduce the historical shares, we can’t compute any kind of uncertainty on the parameters. To do that, we need to know how many decisions were being made. If our model predicts a share of 0.3, and the actual was 0.5, that’s an excellent prediction if there were two plants constructed that year; it’s a terrible prediction if there were 1000.

Model fit (incl. rows with zero share)

Here, we’ll redo the fits including the zero-share rows.

share_by_fuel.inczero <- aggregate_by_fuel(rawdata)

Full model

fun <- xentropy(gcam_old_logit, share_by_fuel.inczero)
pstart <- c(-3, rep(1, length(levels(share_by_fuel.inczero$fuel_gen))))
moldlogit.3 <- nlm(fun, pstart)

beta <- moldlogit.3$estimate[1]
alpha <- exp(moldlogit.3$estimate[-1])
names(alpha) <- levels(share_by_fuel.inczero$fuel_gen)
cat('logit exponent = ', signif(beta,2), '\n')
cat('share weights=\n')
signif(alpha,2)

## logit exponent =  -0.6 
## share weights=
##    biomass       coal        gas geothermal        MSW    nuclear 
##       2.10      11.00      67.00       0.45       0.19       1.20 
##  petroleum      solar      water       wind 
##       0.64       9.80       0.85      26.00

plt3 <- plotfit(share_by_fuel.inczero, gcam_old_logit, moldlogit.3)
plt3

The magnitude of the logit exponent is a little larger, but not much, and the model still doesn’t fit very well. In particular, it shows virtually no variation over time. As before, we’ll try again with the logit exponent fixed at -3.

Restricted model

fixed_logit_val <- -3
logit_fixed3 <- old_logit_fixed_exp(fixed_logit_val)
fun <- xentropy(logit_fixed3, share_by_fuel.inczero)
pstart <- rep(0, length(levels(share_by_fuel.inczero$fuel_gen)))
moldlogit.4 <- nlm(fun,pstart)

alpha <- exp(moldlogit.4$estimate)
names(alpha) <- levels(share_by_fuel.inczero$fuel_gen)
cat('logit exponent = ', signif(fixed_logit_val,2), '\n')
cat('share weights=\n')
signif(alpha,2)

## logit exponent =  -3 
## share weights=
##    biomass       coal        gas geothermal        MSW    nuclear 
##      1.600      1.600      9.900      0.120      0.061      0.960 
##  petroleum      solar      water       wind 
##      0.077     33.000      0.130     16.000

Plot the restricted model against the actual shares. As before, actual shares are in blue, model shares in black.

plt4 <- plotfit(share_by_fuel.inczero, logit_fixed3, moldlogit.4)
plt4

These results are very similar to the results excluding zero. Basically, including the zero-share rows affects our estimate for the fuels that only appear in a few years (i.e., by pushing it closer to zero), but it doesn’t affect the results for the fuels that have positive shares in most years very much at all.

Comments

When we compare the cross entropies of the two models:

cat('Cross-entropy for model 1 (free logit exponent):  ', signif(moldlogit.3$minimum,3),
    '\nCross-entropy for model 2 (fixed logit exponent): ', signif(moldlogit.4$minimum,3), '\n')

## Cross-entropy for model 1 (free logit exponent):   23 
## Cross-entropy for model 2 (fixed logit exponent):  25.3

We see that the version with a free logit exponent is in fact a better fit, so we aren’t having local minimum problems. The model just doesn’t fit very well. The apparent lag between the data and observations is still suggestive, so we’ll follow up by developing a model where the costs can be lagged.