What's in a Zelig object?

tl;dr

If you already know what's in a Zelig object, you might still be curious about this (or might care to correct me if I'm wrong): it looks to me like plot.ci() is a bit of a general-purpose grapher, able to draw up predicted and expected confidence intervals when they can be told apart, i.e., in models with an ancillary parameter that models explicitly the stochastic component of total variation in the response variable. In the logit, which has no such parameter, the picture drawn by a call to plot.ci() with the qi='pr' argument seems to be simply identical to what you see if you call plot.ci() with the qi='ev' argument. This is kind of sneaky. Maybe a warning message should be issued when you call plot.ci() with the qi='pr' argument after you simulate a logit.

Motivation

The Zelig package's stated aim to be “everybody's statistical software” rests on the correct assumption that everybody's statistical models ought to be motivated by some obvious quantities of interest.

It used to be enough, for a journal article, to publish a table of parameter estimates with the appropriate number of stars by any such estimate that met any of the usual thresholds of statistical significance. For all I know, that may still be true (I'm in the private sector; I don't read journal articles).

But when you investigate relationships between variables, parameter estimates are only the first step. Next you simulate how your outcome responds as you change one or two covariates that you are curious about, while holding any other variables fixed at their typical levels. You will also want an idea of the uncertainty bounds around this simulated response.

The job of Zelig is to make this kind of work easy and easily replicable. How does it do it?

The working example

require(Zelig)
rm(list = ls(all = TRUE))
data(turnout)
z.out <- zelig(vote ~ race + educate + age + I(age^2) + income, model = "logit", 
    data = turnout)
age.range <- 18:95
x.low <- setx(z.out, educate = 12, age = age.range)
x.high <- setx(z.out, educate = 16, age = age.range)
s.out <- sim(z.out, x = x.low, x1 = x.high)

The example above uses Zelig to model the change in the probability of voting as voters age, separately by two groups: one with a high school education, the other with a college education. The z.out object holds the results of running a logistic regression. It is similar to a stock glm object from base R. The s.out object holds all the information you need for plotting the simulated relationship between voting behavior and voter age, using the age range set by age.range and the education levels set at 12, then 16 years.

You can plot any one of three quantities of interest (qi): expected values (qi="ev"), predicted values (qi="pr"), and first differences (qi="fd").

First differences are what you might think they are: a covariate is set at its baseline first, and simulations are computed; then this same covariate is set at some other level, and a new round of simulations is run. The results of each simulation run are paired and differences taken, so you can tell, literally, what difference this covariate makes to the outcome. In epidemiology first differences are called risk differences when the covariate in question describes a characteristic of the population, and average treatment effect when it describes some intervention.

The difference between expected and predicted values is subtle, and it motivated this whole post. Expected values average over the stochastic component. Predicted values do not. As a consequence, the latter have wider confidence intervals. These are the confidence intervals that you should believe when you're assessing your model's predictive prowess. More on this in a minute.

Expected values

xl <- "Age in Years"
yl <- "Expected Probability of Voting"
ma <- "Effect of Education and Age on Voting Behavior"
su <- "(blue is high school and red is college)"
plot.ci(s.out, qi = "ev", xlab = xl, ylab = yl, main = ma, sub = su)

plot of chunk showZeligEV

Predicted values

yl <- "Predicted Probability of Voting"
plot.ci(s.out, qi = "pr", xlab = xl, ylab = yl, main = ma, sub = su)

plot of chunk showZeligPR

First differences

yl <- "Difference in Expected Probability of Voting"
su <- "(E(P(voting | college)) - E(P(voting | high school)))"
plot.ci(s.out, qi = "fd", xlab = xl, ylab = yl, main = ma, sub = su)

plot of chunk showZeligFD

So, what exactly does s.out look like on the inside? I am curious because looking at the first two plots above, I had expected the shaded bands to be wider in the Predicted values plot and to have smoother edges in the Expected values plot.

My reason for expecting this is that Zelig co-author, booster and personal hero of mine Gary King has a nice discussion in his Gov-2001 class about how the uncertainty surrounding predicted values is greater than the uncertainty surrounding expected values, because the latter average over fundamental uncertainty, and only account for the estimation uncertainty, while the former are properly afflicted by both.

Dissecting `s.out`

By default, Zelig's sim() function runs 1,000 simulations. We have an age range of 78 years, so that's 78,000 numbers each for the expected and predicted probability of voting at educate = 12 and educate = 16. That's a total of 4 times 78,000, that is 312,000 numbers. Then there are the first differences between the expected probabilities of voting at educate = 12 and educate = 16, taken another thousand times for each age point. So, I expect that s.out stores a grand total of 5 times 78,000 numbers, that is 390,000 numbers in some fashion. Normally, I'd run str() to peer into the guts of any R object, but I don't recommend str(s.out). It will tie up your computer with scrolling what looks like a lot of gibberish for a long time. We need a more surgical approach.

First, notice that s.out is a list of lists:

length(s.out)

## [1] 78

length(s.out[1])

## [1] 1

length(s.out[[1]])

## [1] 15

length(s.out[[2]])

## [1] 15

length(s.out[[78]])

## [1] 15

head(names(s.out), n = 3)

## [1] "race=white, income=3.88663995, educate=12, age=18"
## [2] "race=white, income=3.88663995, educate=12, age=19"
## [3] "race=white, income=3.88663995, educate=12, age=20"

tail(names(s.out), n = 3)

## [1] "race=white, income=3.88663995, educate=12, age=93"
## [2] "race=white, income=3.88663995, educate=12, age=94"
## [3] "race=white, income=3.88663995, educate=12, age=95"

So, for each year in our age range, s.out stores the same 15 things. What are they? Actually, s.out appears to be a list of 78 lists of 15 items, each of which, I'm starting to suspect, is another list:

length(s.out[[1]][1])

## [1] 1

names(s.out[[1]])

##  [1] "model"              "x"                  "x1"                
##  [4] "stats"              "qi"                 "titles"            
##  [7] "bootfn"             "cond.data"          "zelig"             
## [10] "call"               "zcall"              "result"            
## [13] "num"                "special.parameters" "package.name"

for (i in 1:length(s.out[[1]])) {
    print(paste("Element ", i, " of list s.out[[1]], called ", names(s.out[[1]])[i], 
        ", has length ", length(s.out[[1]][[i]]), sep = ""))
}

## [1] "Element 1 of list s.out[[1]], called model, has length 1"
## [1] "Element 2 of list s.out[[1]], called x, has length 16"
## [1] "Element 3 of list s.out[[1]], called x1, has length 16"
## [1] "Element 4 of list s.out[[1]], called stats, has length 5"
## [1] "Element 5 of list s.out[[1]], called qi, has length 5"
## [1] "Element 6 of list s.out[[1]], called titles, has length 5"
## [1] "Element 7 of list s.out[[1]], called bootfn, has length 0"
## [1] "Element 8 of list s.out[[1]], called cond.data, has length 0"
## [1] "Element 9 of list s.out[[1]], called zelig, has length 11"
## [1] "Element 10 of list s.out[[1]], called call, has length 9"
## [1] "Element 11 of list s.out[[1]], called zcall, has length 6"
## [1] "Element 12 of list s.out[[1]], called result, has length 29"
## [1] "Element 13 of list s.out[[1]], called num, has length 1"
## [1] "Element 14 of list s.out[[1]], called special.parameters, has length 0"
## [1] "Element 15 of list s.out[[1]], called package.name, has length 1"

Hey, look: some of these lists have five items. That's a familiar number:

names(s.out[[1]][["stats"]])

## [1] "Expected Values: E(Y|X)"            
## [2] "Expected Values: E(Y|X1)"           
## [3] "Predicted Values: Y|X"              
## [4] "Predicted Values: Y|X1"             
## [5] "First Differences: E(Y|X1) - E(Y|X)"

names(s.out[[1]][["qi"]])  # qi = quantities of interest

## [1] "Expected Values: E(Y|X)"            
## [2] "Expected Values: E(Y|X1)"           
## [3] "Predicted Values: Y|X"              
## [4] "Predicted Values: Y|X1"             
## [5] "First Differences: E(Y|X1) - E(Y|X)"

names(s.out[[1]][["titles"]])

## NULL

s.out[[1]][["titles"]]

## [1] "Expected Values: E(Y|X)"            
## [2] "Expected Values: E(Y|X1)"           
## [3] "Predicted Values: Y|X"              
## [4] "Predicted Values: Y|X1"             
## [5] "First Differences: E(Y|X1) - E(Y|X)"

So it looks like for each year we have five kinds of stats and quantities of interest, with names as shown in the titles vector: expected and predicted values for each level of the covariate that we toggle, and first differences.

What kind of stats are we talking about?

s.out[[1]][["stats"]][[1]]  # this would be 'Expected Values: E(Y|X)'

##    mean      sd    50%   2.5%  97.5%
##  0.5302 0.03644 0.5314 0.4603 0.6011

OK, so for the first year of our simulated age range, age = 18, and at educate = 12, and with the remaining covariates held at levels shown in names(s.out)[1] (that is to say race=white, income=3.88663995, educate=12, age=18), sim() ran the model 1,000 times and it got 1,000 estimates of the probability of voting whose summary statistics work out to what you see above: a mean of 0.5302, a median of 0.5314, etc.

Could we possibly see these 1,000 actual values, and confirm these summary statistics with R's stock summary() function?

The simulation matrix

Turns out there's a simulation.matrix() function that will let you collect the simulation results in matrix form. Depending on which of the five titles you ask for, the simulation matrix will turn up the corresponding slice of 78,000 numbers as a 1,000 by 78 matrix. So, the slice we care about might be Expected Values: E(Y|X) – specifically, its first column. Let's see:

checkthese <- simulation.matrix(s.out, which = summary(s.out)$titles[1])[, 1]
summary(checkthese)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.402   0.505   0.531   0.530   0.555   0.636

Looks close. Let's see if the hand-rolled version of the summary statistics returned by s.out[[1]][['stats']][[1]] matches:

mean(checkthese)

## [1] 0.5302

sd(checkthese)

## [1] 0.03644

sort(checkthese)[c(499, 500, 501)]  # median

## [1] 0.5313 0.5313 0.5314

sort(checkthese)[c(24, 25, 26)]  # 2.5-th percentile

## [1] 0.4589 0.4601 0.4603

sort(checkthese)[c(974, 975, 976)]  # 97.5-th percentile

## [1] 0.6008 0.6011 0.6012

I'm almost sure that the summary stats s.out[[1]][['stats']][[1]] describe the vector simulation.matrix(s.out, which=summary(s.out)$titles[1])[,1]. This, again is Expected Values: E(Y|X) at age 18, high school education (remember, that's what X stands for; college is X1).

How about I try to pick out first differences at age 31? In other words, do the summary stats s.out[[14]][["stats"]][[5]] describe the vector simulation.matrix(s.out, which = summary(s.out)$titles[5])[, 14]?

names(s.out)[14]

## [1] "race=white, income=3.88663995, educate=12, age=31"

names(s.out[[1]][["stats"]])[5]

## [1] "First Differences: E(Y|X1) - E(Y|X)"

s.out[[14]][["stats"]][[5]]

##    mean      sd    50%    2.5%  97.5%
##  0.1201 0.01272 0.1204 0.09321 0.1434

checkthese <- simulation.matrix(s.out, which = summary(s.out)$titles[5])[, 14]
summary(checkthese)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0797  0.1120  0.1200  0.1200  0.1290  0.1580

mean(checkthese)

## [1] 0.1201

sd(checkthese)

## [1] 0.01272

sort(checkthese)[c(499, 500, 501)]  # median

## [1] 0.1204 0.1204 0.1204

sort(checkthese)[c(24, 25, 26)]  # 2.5-th percentile

## [1] 0.09273 0.09304 0.09321

sort(checkthese)[c(974, 975, 976)]  # 97.5-th percentile

## [1] 0.1428 0.1434 0.1438

Close enough. Now, one last thing. We have 78 sets of stats, curresponding to years of age in our range, each set consisting of Expected Values: E(Y|X), Expected Values: E(Y|X1), Predicted Values: Y|X, Predicted Values: Y|X1, First Differences: E(Y|X1) - E(Y|X). We know what the expected values and the first differences are. What exactly are the predicted values? I mean, where do they come from? Here's an example:

summary(s.out)[["stats"]][[1]][["Predicted Values: Y|X"]]

##      0     1
##  0.496 0.504

summary(s.out)[["stats"]][[1]][["Predicted Values: Y|X1"]]

##      0     1
##  0.319 0.681

Could they be the proportions in which the 1,000 simulated values of Predicted Values: Y|X and Predicted Values: Y|X1 fall below and above .5 respectively?

p.x <- simulation.matrix(s.out, which = summary(s.out)$titles[3])[, 1]
p.x1 <- simulation.matrix(s.out, which = summary(s.out)$titles[4])[, 1]
sum(p.x < 0.5)/length(p.x)  # should be same as summary(s.out)[['stats']][[1]][[3]][,'0']

## [1] 0.496

sum(p.x >= 0.5)/length(p.x)  # should be same as summary(s.out)[['stats']][[1]][[3]][,'1']

## [1] 0.504

sum(p.x1 < 0.5)/length(p.x1)  # should be same as summary(s.out)[['stats']][[1]][[4]][,'0']

## [1] 0.319

sum(p.x1 >= 0.5)/length(p.x1)  # should be same as summary(s.out)[['stats']][[1]][[4]][,'1']

## [1] 0.681

That is indeed the case. In fact, the predicted values are all either 0 or 1:

head(p.x)

## [1] "0" "0" "0" "0" "1" "0"

head(p.x1)

## [1] "1" "0" "0" "1" "0" "0"

Fundamental vs. estimation uncertainty in logistic regression

How exactly do you separate the two? In models where there's a so-called error term or nuisance parameter (bad names both) which is assumed to have mean 0, expected values average over it and wind up with tighter uncertainty bands as a result. This effectively hides the fundamental component of total uncertainty, causing a corresponding increase in unwarranted optimism about your model's prospects of market success. But logistic regression has no such parameter. In what sense are you still at risk of thinking that your model is better than it is?

Well, let's see what your simulated probabilities of vote/no vote would be if you applied the same .5 cutoff to expected, rather than predicted values. This will induce 'false confidence', so to speak, if doing so will cause P(votes) and P(does not vote) to separate better:

p.x <- simulation.matrix(s.out, which = summary(s.out)$titles[1])[, 1]
p.x1 <- simulation.matrix(s.out, which = summary(s.out)$titles[2])[, 1]
sum(p.x < 0.5)/length(p.x)  # is it < summary(s.out)[['stats']][[1]][[3]][,'0']?

## [1] 0.209

sum(p.x >= 0.5)/length(p.x)  # is it > summary(s.out)[['stats']][[1]][[3]][,'1']?

## [1] 0.791

sum(p.x1 < 0.5)/length(p.x1)  # is it < summary(s.out)[['stats']][[1]][[4]][,'0']?

## [1] 0

sum(p.x1 >= 0.5)/length(p.x1)  # is it > summary(s.out)[['stats']][[1]][[4]][,'1']?

## [1] 1

As you can see, the increase in separation is quite spectacular. Using expected rather than predicted probabilities you would be 100% certain that an 18 year-old prodigy with a college degree votes.

OK. But I think I pretty well exhausted whatever there is to be found in the guts of s.out. At the same time, the first two of the three plots above look indentical to me, even though the function calls that generated them have a different qi argument. It looks to me like plot.ci() is a bit of a general-purpose grapher, able to draw up predicted and expected confidence intervals when they can be separated in models with an error term (I know, I know, bad name). In models like the logit, where there's no such parameter, the two pictures will be simply identical. This is kind of sneaky. Maybe a warning message should be issued when you call plot.ci() with the qi='pr' argument after you simulate a logit. It's just a thought. If you've made it this far, congratulations.