knitr::opts_chunk$set(fig.show='hold', results='hold', fig.width=7, fig.height=4)
library(dplyr, warn.conflicts=FALSE)
library(GCAMelec)
For each fuel, in each year, we need the share and the weighted mean cost. We’ll start with the dataset that gives cost and share by technology and aggregate those by fuel.
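# Technology-level costs and shares shipped with the package
rawdata <- GCAMelec::electricity_raw
# Keep only the rows with a positive share (mshare > 0)
rawdata.exzero <- filter(rawdata, mshare > 0)
# Aggregate the technology-level rows to one row per fuel and year
share_by_fuel.exzero <- aggregate_by_fuel(rawdata.exzero)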
head(share_by_fuel.exzero, 10)
## # A tibble: 10 x 4
##     year fuel_gen    share  cost
##    <int> <fct>       <dbl> <dbl>
##  1  1995 biomass   0.00422 198.
##  2  1995 coal      0.368    74.0
##  3  1995 gas       0.598    48.6
##  4  1995 MSW       0.00674 535.
##  5  1995 petroleum 0.00258  34.7
##  6  1995 water     0.0180  124.
##  7  1995 wind      0.00264 158.
##  8  1996 biomass   0.0185  151.
##  9  1996 coal      0.297    64.3
## 10  1996 gas       0.524    37.2
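To make the aggregation step concrete, here is a minimal base-R sketch of one way to get a total share and a weighted mean cost per fuel and year. It is not the package's aggregate_by_fuel: the function name aggregate_sketch, the toy values, and the choice of the technology share as the weight are all assumptions made for illustration.

```r
# Toy technology-level data. Column names mirror the output above; the values are invented.
tech <- data.frame(
  year     = c(2000, 2000, 2000),
  fuel_gen = c("gas", "gas", "coal"),
  mshare   = c(0.2, 0.3, 0.5),
  cost     = c(40, 50, 70),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the aggregation step:
# total share and share-weighted mean cost for each (year, fuel) group.
aggregate_sketch <- function(d) {
  groups <- split(d, list(d$year, d$fuel_gen), drop = TRUE)
  rows <- lapply(groups, function(g) {
    data.frame(
      year     = g$year[1],
      fuel_gen = g$fuel_gen[1],
      share    = sum(g$mshare),
      # Assumption: each technology's cost is weighted by its share
      cost     = sum(g$mshare * g$cost) / sum(g$mshare),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out[order(out$year, out$fuel_gen), ]
}

agg <- aggregate_sketch(tech)
agg
```

In the toy data the two gas technologies combine into a single gas row whose cost lies between 40 and 50, pulled toward the technology with the larger share.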
We need to define a loss function, which measures how far our model is from agreement with the data. We will use the cross-entropy loss function, \(H\). For a parameter vector \(\mathbf{a}\), observed shares \(p_i\), and model shares \(q_i(\mathbf{a})\), where \(i\) runs over the choices, \[ H(\mathbf{a}) = -\sum_i p_i \ln q_i(\mathbf{a}) \]
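As a quick check on the formula, the sketch below computes the cross-entropy directly in base R. The function name xent_sketch and the example shares are made up for illustration; this is not the package's implementation.

```r
# Cross-entropy between observed shares p and model shares q.
# Terms with p_i = 0 contribute nothing, so they are dropped to avoid 0 * log(0).
xent_sketch <- function(p, q) {
  keep <- p > 0
  -sum(p[keep] * log(q[keep]))
}

p       <- c(0.5, 0.3, 0.2)   # invented observed shares
uniform <- rep(1 / 3, 3)      # a model that predicts equal shares

h_perfect <- xent_sketch(p, p)        # model matches the data exactly
h_uniform <- xent_sketch(p, uniform)  # equals log(3) for any p that sums to 1
```

The loss is smallest when the model shares equal the observed shares, and it grows as the model moves away from the data, which is the property the minimizer relies on.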
For flexibility, GCAMelec defines a function that creates a loss function customized to a specific share model and dataset. Thus, if our share model is called gcam_old_logit, and our dataset is called d, we would call xentropy(gcam_old_logit, d, 'share') to get back a function that will calculate the loss for GCAM’s old logit, using the share column of data frame d. That function can then be passed to a minimization function like nlm.
GCAMelec also defines the function(s) that compute the model predictions, given the parameters. Unless otherwise specified, the first parameter in the vector is the logit exponent (or its equivalent), and the remaining parameters are the share weights, in the same order as the levels of the factor of choices.
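The pattern described above, a factory that returns a loss function of the parameter vector alone, can be sketched in a few lines of base R. Everything here is hypothetical: logit_share_sketch and make_loss_sketch are illustrative names, the power-law form of the share model is an assumption about what a logit of this kind looks like, and the costs are invented. The signatures of the package's own xentropy and gcam_old_logit may differ.

```r
# Hypothetical share model following the stated parameter layout:
# params[1] is the logit exponent, params[-1] are the share weights.
# Assumed functional form: share_i is proportional to sw_i * cost_i^beta.
logit_share_sketch <- function(params, cost) {
  beta <- params[1]
  sw   <- params[-1]
  w    <- sw * cost^beta
  w / sum(w)
}

# Hypothetical loss factory: binds a model and data, and returns a
# function of the parameter vector only, the form a minimizer expects.
make_loss_sketch <- function(model, cost, observed) {
  function(params) {
    q <- model(params, cost)
    -sum(observed * log(q))
  }
}

cost        <- c(coal = 70, gas = 50, wind = 150)  # invented costs
true_params <- c(-3, 1, 1, 1)                      # exponent, then three share weights
observed    <- logit_share_sketch(true_params, cost)

loss <- make_loss_sketch(logit_share_sketch, cost, observed)
loss_at_truth <- loss(true_params)
loss_elsewhere <- loss(c(-1, 1, 1, 1))
```

Because the returned closure takes only the parameter vector, it has the shape that nlm expects for its first argument. Note that in this assumed form the shares are unchanged if all share weights are scaled by the same constant, so a real fit would need to pin one weight or otherwise constrain them.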
Comments
Modeling
As written, the GCAM model doesn’t do a very good job of reproducing the observed data. However, one intriguing observation is that in the fixed logit exponent fit, the model appears to be leading the observational data by 3-5 years. This can be seen especially in the coal data, where the model shows two peaks in new coal capacity that can be seen in the data a few years later. A similar, albeit weaker, effect can be seen in gas and wind. This could be telling us that choices are influenced by backward-looking cost estimates. It’s not hard to add a fittable lag to the model, but before I do that I want to be sure that I understand what “startyr” means in the input data.
Another refinement is that GCAM runs in five-year time steps, so the values being modeled should actually be five-year average shares. I think the best way to do this is to continue to model one-year shares, but to average them over five years and compare those to similarly averaged observations (vs. the alternative of trying to compute some kind of five-year averaged prices to use in modeling).
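A minimal sketch of the averaging idea, in base R: the same binning function is applied to the modeled and the observed one-year shares, so the two are compared on equal footing. The function name five_year_mean, the way years are assigned to bins, and the share values are all assumptions for illustration; how the bins should line up with GCAM's actual time steps is not settled here.

```r
# Average one-year shares over fixed five-year bins.
# Assumption: bins start on years divisible by five (1995-1999, 2000-2004, ...).
five_year_mean <- function(year, share, width = 5) {
  period <- width * floor(year / width)
  tapply(share, period, mean)
}

years    <- 1995:2004
observed <- c(0.30, 0.32, 0.34, 0.36, 0.38, 0.40, 0.42, 0.44, 0.46, 0.48)  # invented
modeled  <- observed + 0.05                                                # invented

avg_obs <- five_year_mean(years, observed)
avg_mod <- five_year_mean(years, modeled)
```

The averaged vectors, one value per five-year bin, are what would then be passed to the loss function in place of the annual shares.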
Data
This dataset still isn’t where we want it to be. Several of the fuel categories are missing in many of the years. Presumably this is because there was no additional capacity added in those years, but we still need those costs in the dataset. An observation that a choice wasn’t used is an observation and needs to be included as such. I haven’t looked at the technology-level categories, but I expect this same comment applies to them. If we genuinely can’t get costs for some of the options in some years, then we need to include those as explicitly missing data, in case we want to use Bayesian imputation in some future analysis.
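One way to make the unused choices explicit is to expand the data to every year and fuel combination, as in the base-R sketch below. The toy values are invented, and setting the share to zero for an absent fuel relies on the presumption stated above that absence means no capacity was added. The unknown costs are left as NA so that they remain visible as missing data.

```r
# Toy fuel-level data in which gas is absent in 1996. Values are invented.
obs <- data.frame(
  year     = c(1995, 1995, 1996),
  fuel_gen = c("coal", "gas", "coal"),
  share    = c(0.4, 0.6, 1.0),
  cost     = c(74, 48, 64),
  stringsAsFactors = FALSE
)

# Every combination of year and fuel that appears anywhere in the data
grid <- expand.grid(
  year     = unique(obs$year),
  fuel_gen = unique(obs$fuel_gen),
  stringsAsFactors = FALSE
)

# Left join: combinations with no observation get NA for share and cost
full <- merge(grid, obs, by = c("year", "fuel_gen"), all.x = TRUE)

# Assumption: a fuel missing from a year had no new capacity, so its share is zero.
# Its cost is genuinely unknown and stays NA.
full$share[is.na(full$share)] <- 0
full
```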
Our original plan was to work with aggregate shares, but in the long run, I think we will need to work with actual counts of plant deployments, if we have them. If not, then we at least need the amount of capacity deployed in each category, so that we can try to estimate counts based on average plant sizes. The problem with what we are doing here is that although we can find the parameters that best reproduce the historical shares, we can’t compute any kind of uncertainty on the parameters. To do that, we need to know how many decisions were being made. If our model predicts a share of 0.3, and the actual was 0.5, that’s an excellent prediction if there were two plants constructed that year; it’s a terrible prediction if there were 1000.
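The example in the last two sentences can be made quantitative with an exact binomial test from base R. This is only an illustration of why counts matter, under two simplifying assumptions: each plant is treated as an independent decision, and the choice is collapsed to one category versus everything else.

```r
predicted <- 0.3  # model share for the category

# Observed share of 0.5 in both cases, but very different numbers of decisions
small <- binom.test(x = 1,   n = 2,    p = predicted)  # 1 of 2 plants
large <- binom.test(x = 500, n = 1000, p = predicted)  # 500 of 1000 plants

small$p.value  # large: two plants cannot distinguish 0.3 from 0.5
large$p.value  # tiny: with 1000 plants, a share of 0.3 is clearly ruled out
```

With only shares in hand these two cases are indistinguishable, which is why parameter uncertainties cannot be computed without counts or a capacity-based estimate of them.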