Measuring Uncertainty

Author

Christopher P. Adams

Introduction

In microeconometrics it is important to provide information to the reader about how uncertain we are about the estimates. Standard microeconomic models of decision making suggest that people are risk averse and care about the uncertainty surrounding estimates (Kreps 1990). Less standard, but increasingly common models suggest that decision makers care explicitly about the extent of the uncertainty (Klibanoff, Marinacci, and Mukerji 2005). How do we measure and report this uncertainty?

This chapter introduces the two main approaches used in statistics; classical statistics and Bayesian statistics. It may surprise you that statistics does not have a standard approach. Rather, statistics has two approaches that differ in important ways. The chapter discusses the problems and benefits of each. It also introduces a third way, called **empirical Bayesian} statistics. The third method marries classical and Bayesian statistics. The three approaches are illustrated by considering the question of whether a baseball player who played one game of major league baseball, John Paciorek, was better than the great Babe Ruth.

Classical Statistics

The approach you are probably most familiar with from economics is called classical statistics. Classical statistics uses the analogy principle. This principle states that if the econometrician is interested in some characteristic of the population, then she should use the analogy in the sample. Consider that we have a sample of data on player statistics for the WNBA. If we are interested in the average of height of WNBA players then we should use the average height of players in the sample. This is the sample analogy.

This section shows how the analogy principle can be used to provide information about the uncertainty surrounding our estimate.

A Model of a Sample

Consider a case where we observe a sample, \(x = \{x_1, ...,x_N\}\), with \(N\) observations. In addition, assume that sample comes from a normal distribution, \(x \sim \mathcal{N}(\mu, \sigma^2)\), where \(\mu\) is the mean and \(\sigma^2\) is the variance of the normal distribution.

In a typical problem we are interested in estimating the mean, \(\mu\). Let that estimate be denoted \(\hat{\mu}\). We know that \(\hat{\mu} \neq \mu\), but by how much do they differ?

Simulation of a Sample

Consider the simulation of a sample of size 10 drawn from a normal distribution with mean -3 and standard deviation of 5.

set.seed(123456789)
N <- 10
mu <- -3
sigma <- 5
x <- rnorm(N, mean = mu, sd = sigma)
dens_x <- density(x)

Figure 1: Density plot of the sample and the true density function.

The Figure 1 shows that the sample distribution differs substantially from the true distribution. However, we are generally not interested in determining the true underlying distribution. Rather, we are interested in some aggregate characteristic of the distribution such as the mean.

mean(x)

[1] -3.484872

sd(x)

[1] 5.097328

The sample mean is -3.49, which is also substantially different from the true value of -3. The sample standard deviation is 5.10 which is not that different from the true value of 5.

How do we convey to the reader or the policy maker that our estimate of various statistics may not be accurate?

Many Imaginary Samples

One idea is to think about how our estimate would vary if we had a large number of different samples, all drawn from the same true distribution. For each imaginary sample that we draw from the true distribution, we will get a different estimate. As the number of samples gets large, we will have a distribution of the estimates. This distribution provides information about the uncertainty of our sample estimate. Luckily, with computers we can actually conduct this thought experiment.

M <- 500
sample_est <- matrix(NA,M,N)
for (m in 1:M) {
  sample_est[m,] <- rnorm(N, mean = -3, sd = 5)
}

Figure 2: Histogram of samples means where the true mean is -3.

Consider the following simulated data. Figure 2 presents the histogram of the distribution of sample means. The distribution is centered around the true mean of -3, but varies by more than 3 on each side. This illustrates that if we took a large number of samples our estimate would be correct on average.

We say that our estimate of the mean is unbiased. I generally don’t find this particularly interesting. “Your estimate is almost certainly wrong but correct in expectation.” Of course, it is also good to know if your estimate will have a tendency to under-estimate or over-estimate the true value.

More importantly, the weight of the distribution is around the true value. However, we cannot rule out the possibility of our mean estimate being as low as -6 or as high as 0. The extent of the dispersion is determined by the size of the sample. Try running the same experiment with a larger sample, say \(N = 100\).

If we were able to observe a large number of imaginary samples, we would be able to show how accurate our estimate is. But we don’t observe any imaginary samples.

Law of Large Numbers

Another idea is to consider what happens if we have a large number of imaginary observations. If we had a very large sample our estimate of the mean would be very close to the true value. With 100,000 observations the sample mean is -3.001, which is quite close to -3. An estimator with this property is called consistent.

set.seed(123456789)
N <- 100000
x <- rnorm(N, mean = -3, sd = 5)
mean(x)

[1] -3.001334

Theorem 1 Law of Large Numbers. If \(E(x_i) = \mu\) and \(E(x_i^2) < \infty\), then

\[ \lim_{N \rightarrow \infty} \frac{1}{N}\sum_{i=1}^N x_i = \mu \tag{1}\]

Various versions of Theorem 1 state that as the sample size gets large, the sample estimate converges (in some sense) to the true value. The Law of Large Numbers suggests that if our sample is “large enough” then our estimate may be close to the true value. Seems nice, but may not be that useful if our sample size is 10. It also does not provide us with any information about how uncertain our estimate actually is.

Central Limit Theorem

A central result to much of standard classical statistics is the Central Limit Theorem. The theorem states that as the number of observations gets large, then the estimated sample mean is distributed normally with a mean equal to the true mean and a standard deviation equal to the true standard deviation divided by the square root of the number of observations.

Theorem 2 Central Limit Theorem. If \(E(x_i) = \mu\) and \(E(x_i^2) - \mu^2 = \sigma^2 < \infty\), then

\[ \lim_{N \rightarrow \infty} \sqrt{N} \left(\frac{1}{N}\sum_{i=1}^N x_i - \mu \right) \sim \mathcal{N}(0, \sigma^2) \tag{2}\]

The following is a simulation of the Central Limit Theorem. It presents the density of the distribution of estimated sample means as the size of the sample increases. Note that the distributions are normalized so that they will be standard normal (if the sample size is large enough).

M <- 1000
sample_means <- matrix(NA,M,4)
for (m in 1:M) {
  sample_means[m,1] <- mean(rnorm(10, mean = -3, sd = 5))
  sample_means[m,2] <- mean(rnorm(100, mean = -3, sd = 5))
  sample_means[m,3] <- mean(rnorm(1000, mean = -3, sd = 5))
  sample_means[m,4] <- mean(rnorm(10000, mean = -3, sd = 5))
}

Figure 3: Densities of samples means for various sample sizes with known true mean and variance.

The interesting part of Figure 3 is that even for small samples the density is close to a standard normal distribution.¹

The Central Limit Theorem suggests a way of determining the uncertainty associated with the estimate. It gives a distribution of the sample mean. Moreover, the variance of the distribution, the uncertainty, is determined by the true variance and the sample size. The uncertainty associated with the estimate is smaller if the variance is smaller and if the sample size is larger.

The problem is that we don’t know the variance. We don’t know the component parts of the distribution of the sample mean.

Approximation of the Limiting Distribution

The standard solution to not knowing the true values is to use an approximation based on the analogy principle (Manski 1990). Thus, we simply replace the unknown true mean with its known analogy, the sample mean. We replace the unknown true variance with the known sample analogy, the sample variance.

Approximate Distribution

Assumption 1: \[ \begin{array}{l} \hat{\mu} \sim \mathcal{N}\left(\hat{\mu},\frac{\hat{\sigma}^2}{N}\right)\\ \\ s.t. \\ \hat{\mu} = \frac{1}{N} \sum_{i=1}^N x_i\\ \\ \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^N x_i^2 - \hat{\mu}^2 \end{array} \tag{3}\]

The Assumption 1 states that the estimated mean is distributed normally with a mean equal to itself and a variance equal to the sample variance divided by the sample size. This approximation gives us a measure of uncertainty that we can actually use.

There are two problems with this approach. First, there is no particularly good reason to believe that the sample mean is normally distributed. Sure, we know that it must be if the sample size is large, but our sample size is 10. Second, the measure of uncertainty is also estimated. Worse, it may be estimated poorly. Our measure of uncertainty is itself uncertain. The more uncertainty there is, the less certain we can be about how much uncertainty there is.

Simulation of Approximate Distributions

Consider five different samples and the corresponding approximate distribution of the sample mean from each.

M <- 5
N <- 10
sample_means <- matrix(NA,M,2)
x2 <- NULL
for (m in 1:M) {
  x1 <- rnorm(N,mean=mu,sd=sigma)
  sample_means[m,] <- c(mean(x1),sd(x1)/(N)^(.5))
  x2 <- c(x2,x1)
}
x2 <- sort(x2)

Figure 4: Density plots of normal distributions using five different “analogies,” where the true mean is -3.

The Figure 4 does not paint as rosy a picture as Figure 3. The estimates of the distribution of the sample mean are all over the place. Our estimate of uncertainty is affected by how uncertain our estimates are.

Bootstrap

An alternative approach to representing uncertainty, is called bootstrapping. This idea was developed by Stanford statistician, Brad Efron. The idea takes the analogy principle to its natural conclusion. If we need the true distribution we should use its sample analog, that is the sample itself. Using a computer we can create counterfactual pseudo samples by re-drawing from the sample with replacement.

Consider the following example using sample \(x_1\).

N <- 10
x1 <- rnorm(N, mean=mu, sd=sigma) 
M <- 1000
bs_mean <- matrix(NA,M,1)
for (m in 1:M) {
  index_bs <- round(runif(N,min=1,max=N))
  bs_mean[m,1] <- mean(x1[index_bs])
}
dens_bs <- density(bs_mean)

Figure 5: Histogram of bootstrapped sample means where the true mean is -3 (vertical line).

The Figure 5 shows that our estimate of uncertainty is still wrong. The bootstrap measure of uncertainty no longer makes the arbitrary assumption that the sample mean is distributed normally. However, the small sample implies that we still have a poor estimate of the uncertainty.

Hypothesis Testing

A standard method for representing uncertainty is via an hypothesis test. Above, the sample average is -3.49. Let’s say it is important to know the probability that the true mean is in fact 0 or is in fact less than 0. Can we rule out the possibility that the true mean is greater than 0?

\[ \begin{array}{l} H_0: \mu \ge 0\\ H_1: \mu < 0 \end{array} \tag{4}\]

The Equation 4 presents a standard representation of the hypothesis that the true mean is greater than or equal to 0. We call this the null-hypothesis.
Note that we are not testing whether the true mean is -3.49. Rather we are testing whether it is not greater than 0. We are ruling out a possibility.

To test the hypothesis, we begin by assuming the hypothesis is true. That is, we assume that the true mean is 0. Given that assumption, we can ask whether it is reasonable that we would observe a sample average of -3.49. How do we determine what is reasonable?

What is the probability that we could see a sample with a sample average of -3.49 if the true mean is 0? To answer this question we need to know the distribution of sample means. Above, we assumed that the distribution of sample means is approximately normal with a mean equal to the true mean and a variance equal to the true variance divided by the sample size. By assumption, the true mean is 0. We don’t know the true variance, but we can use the analogy principle and replace it with the sample variance, which is 5.10.

\[ \Pr(\hat{\mu} = -3.49) = \phi \left(\sqrt(10) \frac{-3.49}{5.10} \right) \tag{5}\]

This will produce a small probability but that is partly because we are asking about a point rather than a region. A better way is to ask at what sample mean do we no longer think the hypothesis is true?

\[ \begin{array}{l} \Pr(\hat{\mu} < t) = \alpha\\ \alpha = \Phi \left(\sqrt(10) \frac{t}{5.10} \right)\\ \mbox{ and}\\ t = \frac{5.10 \Phi^{-1}(\alpha)}{\sqrt(10)} \end{array} \tag{6}\]

Let \(\alpha\) represent some small probability. It is the probability that \(\hat{\mu}\) is less than some critical value \(t\).

alpha = 0.05
sd(x)*qnorm(alpha)/sqrt(10)

[1] -2.596642

The critical region is less than -2.60. In Figure 6, this is represented as the area under the curve to the left of the solid vertical line. Given the assumptions, we can reject the hypothesis that the true mean is 0 at the 5% level if the sample mean is less than -2.60, which it is.

Figure 6: Density of a normal distribution with standard deviation equal to \(\frac{5.10}{\sqrt(10)}\). The critical value is -2.597 (solid line). Under the hypothesis of a mean of 0, the probability of observing a mean below the critical value is 0.05. The estimated mean is -3.49 (dashed line).

An alternative is to calculate the p-value. This is the probability of observing a sample mean less than or equal to -3.49 given the hypothesis is true.

\[ \Pr(\hat{\mu} < -3.49) = \Phi \left(\sqrt(10) \frac{-3.49}{5.10} \right) \tag{7}\]

pnorm(sqrt(10)*((-3.49)/sd(x)))

[1] 0.01352641

This is 0.014 or 1.4%.

The nice property of the hypothesis test is that it allows the econometrician to assume the true mean, which is an unknown object. In addition, there are occasions where the null hypothesis is a value of interest. For example, Chapter 2 discusses the hypothesis that the direct effect of race on mortgage lending is 0. This particular hypothesis may have legal implications for bankers.

However, it is unclear why a hypothesis of 0 makes sense when describing the uncertainty around our estimate of -3.49. It may be preferable to present the 5th and 95th percentile of the approximate or bootstrap distribution. This is referred to as the **confidence interval}.

Bayesian Statistics

The major competitor to the classical approach is given the appellation “Bayesian” statistics (Berger 1985). One should be careful about the names because “Bayesian” is also used to refer to the standard approach to decision making under uncertainty used in microeconomic theory. It may be best to think of them as two different approaches to decision making under uncertainty with the same name (Berger 1985, Kreps:1990).

The section discusses how to calculate the posterior.

Bayes’ Rule

The Bayesian approach also provides us with a measure of the uncertainty but it uses Bayes’ rule to do so. If we knew the set of possible true parameters, \(\theta \in \Theta\), the likelihood of observing the sample, \(x\), given the true parameters, \(h(x | \theta)\), and the **prior} distribution over this set, \(g(\theta)\), then we can use the following formula.²

\[ \gamma(\theta | x) = \frac{h(x | \theta) g(\theta)}{\int_{\theta' \in \Theta} h(x | \theta') g(\theta')} \tag{8}\]

where \(\gamma(\theta | x)\) is the posterior distribution. This tells us the probability distribution over the true parameters given the sample that is observed. By construction this posterior distribution provides information about the extent of the uncertainty of the estimate.

Determining the Posterior

To calculate the posterior, we need to know the prior and the likelihood functions. Assume we know the likelihood functions. Let’s say we know the possible set of distributions that generated the data. This may not be as crazy as it sounds. Consider the problem of determining a true probability. Say the probability that you will become King or Queen of England. Unless your name is Prince William, this is a pretty small number. Nevertheless it is a number between 0 and 1. I don’t know the probability that you will become King or Queen of England but I do know that it is a number between 0 and 1. Here the problem is more complicated, but let’s say that I know that the true mean must lie between -5 and 10. Also for simplicity, assume that I know that the true standard deviation is 5. So \(\theta \in [-5, 10]\). And, I know that the true distribution is normal.

If I know the possible set of true means, the likelihood of observing the sample conditional on the true mean is just the likelihood function presented for OLS in Chapter 5. So we have the sample, the set of parameters and the likelihood function. All we need to use Equation 8 is the prior distribution.

What is the prior distribution? IDK. We could make an assumption that the prior distribution is uniform. In some cases, the uniform prior gives a sense of uncertainty around what the estimate could be.

Determining the Posterior in `R`

log_norm <- function(x, mu, sigma) {
  z = (x - mu)/sigma
  return(sum(log(dnorm(z)) - log(sigma)))
}
Theta <- 15*c(1:100)/100 - 10
g <- rep(1/100,100) # uniform approximation
h <- sapply(Theta, function(theta) log_norm(x1, theta, 5))
gamma <- exp(h + log(g))/sum(exp(h + log(g)))

Figure 7: Plot of Bayesian posterior distribution of mean estimate. The true value is -3 (dotted line). The analog estimate from the sample is lower (solid line).

The Figure 7 presents the Bayesian posterior of the mean given the sample and a uniform prior. This posterior suggests that the estimate is quite uncertain with the weight of the distribution running between -7 and 0.

Given the assumptions of the Bayesian approach, the Bayesian approach is the “correct” and rational approach to presenting statistical results. It is even consistent with Bayesian decision theory.

However, the assumptions are where the rubber hits the road. The Bayesian approach makes very strong assumptions about what information the econometrician has available. Alternatively, it makes strong assumptions about things that the econometrician does not know. Moreover, these assumptions matter.

Neither the classical nor the Bayesian approach makes a lot of sense when data sets are relatively small. One uses non-credible information to make non-credible claims, while the other uses non-credible assumptions to make non-credible claims.

Empirical Bayesian Estimation

In the middle of last century, Harvard statistician, Herb Robbins, suggested that there may exist a third way. He proposed an approach that used the logic of Bayesian statistics but replaced the strong assumptions with the classical idea of estimating the prior distribution using the analogy principle. He called this approach empirical Bayesian.

A Large Number of Samples

The empirical Bayesian approach may make sense when the econometrician has access to a large number of related samples. In panel data models, we can think of each time series for each individual as a separate but related sample. Similarly, we can think of each auction as a sample of bids from a large number of related samples.

Robbins pointed out that with a large number of related samples it may be possible to estimate the prior. That is, instead of assuming the prior, the data would provide an estimate of the prior. Given that the prior can be estimated, the rest of the procedure is Bayesian.

Robbins (1956) states that the observed distribution of sample means can be written as a mixture distribution, where a likelihood function is weighted by the true prior distribution.

\[ f(x) = \int_{\theta \in \Theta} h(x | \theta) g(\theta) \tag{9}\]

where \(f\) is the observed distribution of sample means and \(x \in \mathcal{X}\) is from the set of samples.

Robbins states that if the equation can be uniquely solved then we have an estimate of the prior. In a series of papers, Brad Efron and coauthors present assumptions which allow the equation to be uniquely solved.

Solving for the Prior and Posterior

Given both \(f\) and \(h\) are known from Equation 9, Efron (2014) shows that under certain regularity conditions on \(h\), we can uniquely solve for \(g\). The proof uses linear algebra, but linear algebra doesn’t actually help to find the solution.

Here we use an idea developed by statisticians at Penn State. It is similar to the mixture model estimator used in Chapter 11 (Benaglia, Chauveau, and Hunter 2009).

\[ \gamma_0(\theta | x) = \frac{h(x | \theta) g_0(\theta)}{\int_{\theta' \in \Theta} h(x | \theta') g_0(\theta')} \tag{10}\]

where \(\gamma_0\) is the posterior given the posited prior \(g_0\) for each \(x \in \mathcal{X}\).

The posteriors can be aggregated up to provide an estimate of the prior.

\[ g_1(\theta) = \sum_{x \in \mathcal{X}} \gamma_0(\theta | x) f(x) \tag{11}\]

where \(g_1\) is the next iteration of the estimate of the prior.

We can say that the system converges if the following inequality holds.

\[ \int_{\theta \in \Theta} |g_t(\theta) - g_{t-1}(\theta)| < \epsilon \tag{12}\]

for some small number \(\epsilon\), then \(\hat{g}(\theta) = g_t(\theta)\) is the solution.

I am unaware of any proof showing that this method will actually lead to the solution, but if it does, then if conditions presented in Efron (2014) hold, it must be the unique solution.³

Solving for the Prior in `R`

Previously, we generated 500 imaginary samples in order to illustrate the idea that estimates of means have distributions. Now imagine we actually have a data set of these 500 samples. We don’t know all the possible true means, but it seems reasonable to assume that the set of true means spans the observed data. We can approximate that set with 300 “true means” evenly spaced across the range. To make things simpler, assume that we know the mean is distributed normally with a standard deviation of 5. What we don’t know is the true mean.

Note that we have 500 samples and we will calculate the posterior for each. We use a matrix with 500 rows and 300 columns.

Theta = (max(sample_est) - min(sample_est))*c(1:300)/300 + 
  min(sample_est)
H <- matrix(NA,500,300)
for (i in 1:500) {
  x <- sample_est[i,]
  H[i,] <- sapply(c(1:300), function(k) 
    log_norm(x, mean(Theta[k]), 5))
  #print(i)
}

Given the assumptions described above, we can calculate the likelihood of observing each sample given all the possible “true distributions.” This provides an estimate of \(h\) from Equation 9.

The simplest place to start is to assume that the initial prior is a uniform distribution over the possible parameter values. This is similar to the assumption made for the Bayesian estimator. Here, it is only the starting point for estimating the prior.

g0 <- rep(1/300,300) 
log_G0 <- log(t(matrix(rep(g0, 500),nrow=300)))
gamma0 <- exp(H + log_G0)/sum(exp(H) + log_G0)
g1 <- colSums(gamma0)/sum(gamma0) # new estimate of prior

Given this assumption and the likelihood calculated above, we have our first iteration of the estimate of the posterior. Integrating over the posterior estimates provides the next iteration of the prior.

We repeat the process to get a second iteration of the estimate of the prior. With these two we can then create a while () loop to determine when the process converges.

log_G1 <- log(t(matrix(rep(g1, 500), nrow=300)))
gamma1 <- exp(H + log_G1)/sum(exp(H + log_G1))
g2 <- colSums(gamma1)/sum(gamma1)

Figure 8: Density of each of the first two iterations of the estimated.

The Figure 8 presents the results of the first two iterations of the estimated priors. Note we start with a uniform or “flat” prior. The first iteration looks like a normal distribution. The second iteration is a normal distribution with thinner tails.

eps <- 1e-3 # small number
g_old <- g1
g_new <- g2
diff <- sum(abs(g_new - g_old))
while (diff > eps) {
  g_old <- g_new
  log_G <- log(t(matrix(rep(g_old,500),nrow=300)))
  gamma <- exp(H + log_G)/sum(exp(H + log_G))
  g_new <- colSums(gamma)/sum(gamma)
  diff <- sum(abs(g_new - g_old))
  #print(diff)
}

The following is a method for creating a representation of the prior distribution. It creates a long vector where each sample mean is repeated in correspondence to its frequency draw from the prior.

J <- 100000
g_dist <- unlist(lapply(c(1:300), function(k) 
  rep(mean(Theta[k]),round(J*g_new[k]))))

Figure 9: Density of the estimated prior (solid line) and the observed sample means (dashed line) and estimated prior mean is -2.88 (solid vertical line) where the true mean is -3 (dashed line).

The Figure 9 presents the estimated density of the prior distribution for the simulated data generated above. For comparison, it presents the density of the observed sample means. Note that the empirical Bayesian prior is a lot tighter than the observed distribution of sample means. This is a standard result and is reason that the eBayesian estimator is called a shrinkage estimator. It “shrinks” the estimates in toward the true mean.

Estimating the Posterior of the Mean

Once we have the prior estimated, we can use Bayes’ rule to determine the posterior distribution of the mean conditional on the observed sample. Consider the sample of 10 observations generated above.

h_x1 <- 
  sapply(c(1:300), function(k) log_norm(x1, Theta[k], 5))
gamma_x1 <- exp(h_x1 + log(g_new))/sum(exp(h_x1 + log(g_new)))

Given this calculation of the posterior distribution, we can use the same procedure as above to create a density that can be plotted.

J <- 100000
g_dist_x1 <- unlist(lapply(c(1:300), function(k) 
  rep(Theta[k],round(J*gamma_x1[k]))))

Figure 10: Density of the estimated posterior for the sample where the true mean is -3 (vertical dashed line), mean of the sample is -3.72 (vertical dotted line) and mean of the eBayesian posterior is -2.88 (vertical solid line).

The Figure 10 presents the analog estimate which is close to -4, the true value which is -3 and the posterior estimate of the mean which is -2.88. Figure 10 presents the posterior distribution using the eBayesian procedure. The fact that the posterior estimate is close to the true value is not really that interesting. It occurs because the sample has only 10 observations, so the posterior is mostly determined by the prior. The variance is a lot smaller than suggested by the classical procedures and even the standard Bayesian procedure.

The Sultan of the Small Sample Size

In baseball, the Sultan of Swing refers to Babe Ruth. Over his lifetime in major league baseball, the Bambino had a batting average of 0.341. That is, for every 1,000 “at bats,” he hit the ball and got on base or scored a home run, 341 times.⁴

In 2013, the New York Times ran an article about John Paciorek (Hoffman 2013). You have probably never heard of Paciorek, but if you have, you know that he has a lifetime batting average in major league baseball of 1.000. Paciorek had a “hit” with every “at bat,” which totaled 3. Paciorek was a rookie who came up to the major leagues with the Houston Colt .45s for the last game of the 1963 season. He and his teammates had a great game against the hapless Mets. In the off-season, Paciorek was injured and never played another game in the majors.

We may ask whether Paciorek is “better” than Ruth because he has a higher lifetime batting average. You know the answer. Paciorek had only 3 at bats, while Ruth had thousands. If we think that each player has a “true” lifetime batting average, then the Law of Large Numbers suggests that Ruth’s is probably pretty close to it. How close is Paciorek’s? How uncertain is our estimate of Paciorek’s lifetime batting average?

The section presents the standard classical and Bayesian approaches to representing the uncertainty around the estimate of John Paciorek’s lifetime batting average. It shows that neither provides a satisfactory answer. It shows how to determine the empirical Bayesian posterior and shows that it does provide a satisfactory answer.

Classical or Bayesian?

Classical statistics gives a surprising answer. The analogy principle states that if we want to compare Paciorek to Ruth on their true lifetime batting average then we should use the lifetime batting average we observe. It says that Paciorek is better than Ruth, because 1.000 > 0.341. Moreover there is little uncertainty. Using both the approximation method and the bootstrap method we get the same answer; there is no variance around the estimate of 1.000. Remember our sample data is \(\{1, 1, 1\}\).

Bayesian statistics does slightly better. With only 3 at bats, the posterior distribution of Paciorek’s lifetime batting average is completely dominated by the prior. What is the prior? Batting averages are numbers between 0 and 1, so one guess is a uniform distribution. However, anyone who knows anything about baseball will tell you that a uniform prior is ridiculous.

Uniform Prior

In addition to being a ridiculous assumption about baseball, the analysis in this section shows that it gives a ridiculous result.

Nevertheless, it is worth going through the exercise. As above, we calculate the likelihood function. Let all the true possible probabilities be the 999 numbers from \(\frac{1}{1000}\) to \(\frac{999}{1000}\). In this case the likelihood function is the binomial function. Note that this is not an assumption; it is really the binomial function.⁵

log_bin <- function(N, p_hat, p) {
  K = round(p_hat*N)
  return(lchoose(N,K) + K*log(p) + (N-K)*log(1-p))
}
# lchoose gives the log of the binomial coefficient.
Theta <- c(1:999)/1000
log_g <- log(rep(1/1000,999))
h_jp <- sapply(Theta, function(theta) log_bin(3,1,theta)) 
gamma_jp <- exp(h_jp + log_g)/sum(exp(h_jp + log_g))
J <- 100000
gamma_dist_jp <- unlist(lapply(c(1:999), function(j) 
    rep(j,round(J*gamma_jp[j]))))
# mean
sum(gamma_jp*Theta)

[1] 0.7995997

Figure 11: Density of the estimated posterior for Paciorek’s lifetime batting average given a uniform prior. The vertical line is Ruth’s lifetime batting average.

The Figure 11 suggests that Paciorek’s lifetime batting average is very likely to be greater than Ruth’s. In fact, the posterior probability is 0.986. However, this result is all driven by the assumption that the prior is uniform.

sum(gamma_jp[342:999])

[1] 0.9863721

Estimating the Prior

To estimate the prior we need to observe a large number of lifetime batting averages. Luckily, thanks to people like Sean Lahman, we have access to data on almost every player to play Major League Baseball or Negro League Baseball between 1871 and 2018.⁶

M <- 1000
x <- read.csv("bat_ave.csv", as.is = TRUE,nrows=M)
# data set created from Lahman's data.
# limited the observations for computational reasons.
H <- matrix(NA,M,999)
for (m in 1:M) {
  H[m,] <- sapply(Theta, function(theta) 
    log_bin(round(x$AB[m]),x$AVE[m],theta))
  #print(m)
  # sapply uses apply on a vector.
}
log_G <- t(matrix(rep(log_g,M),nrow = 999))
gamma0 <- exp(H + log_G)/sum(exp(H + log_G))
g1 <- colSums(gamma0)/sum(gamma0)

This uses the same method for estimating the prior as above.

eps <- 0.001
g_old <- exp(log_g)
g_new <- g1
diff <- sum(abs(g_new - g_old))
while (diff > eps) {
  g_old <- g_new
  Log_G = log(t(matrix(rep(g_old,M),nrow=999)))
  Gamma <- exp(H + Log_G)
  Gamma <- Gamma/rowSums(Gamma)
  g_new <- colSums(Gamma)/sum(Gamma)
  diff <- sum(abs(g_new - g_old))
  #print(diff)
}

Paciorek’s Posterior

Using the likelihood for John Paciorek calculated in the previous section and the prior estimated above we can create vectors for graphing.

gamma_jp2 <- 
  exp(h_jp + log(g_new))/sum(exp(h_jp + log(g_new)))
J <- 100000
g_dist_mlb <- unlist(lapply(c(1:999), function(j) 
  rep(Theta[j],round(J*g_new[j]))))
# lapply uses apply on a list.
# unlist turns a list into a vector.
gamma_dist_jp2 <- unlist(lapply(c(1:999), function(j) 
  rep(Theta[j],round(J*gamma_jp2[j]))))
# mean
sum(gamma_jp2*Theta)

[1] 0.2506879

The mean of the posterior is 0.251. We could take this is our estimate of John Paciorek’s true lifetime batting average. John’s brothers Tom and Jim finished their MLB careers with lifetime batting averages of 0.282 and 0.228 respectively.

Figure 12: Density of the estimated posterior (dashed line) for Paciorek’s lifetime batting average given an empirical Bayesian prior (solid line). The mean of the posterior is the dashed vertical line. The sold vertical line is Ruth’s lifetime batting average.

The Figure 12 presents the density functions of the estimated prior and the new posterior distribution for John Paciorek. The prior using data from almost 150 years of baseball suggests that the probability of any player having a lifetime batting average of 0.341 or greater is remote. The probability that John Paciorek’s innate lifetime batting average is greater than 0.341 is 0.005, which is a lot smaller than both 1.000 and 0.986.

sum(gamma_jp2[342:999])

[1] 0.004715468

Decision Theory

Statistics is fundamentally about making decisions under uncertainty. In some cases the statistician herself has a decision problem. Should I collect more data given data collection is costly and the data I have provides some information? In other cases, the statistician acts as a go-between. They sit between the data and the ultimate decision maker. The statistician provides a summary of data to the decision maker. But how should she summarize the data? What information should the decision maker get?

Consider the problem above. If you are limited to providing one number about John Paciorek’s batting average, which number would you choose? Consider a more realistic example. You work for a baseball team and you are considering a trade of one first baseman for another. Assume, unrealistically, that the team-manager only cares about batting average. If you are to provide a single number to the team-manager that summarizes the information about the two players, which number does that?

Decision Making Under Uncertainty

Economists have been thinking about how to make decisions under uncertainty for a very long time. The standard paradigm is called expected utility theory. In this model there is some outcome \(x\), say hits from our baseball example, and a probability distribution \(p\), which is the probability of a hit. The probability of some outcome \(x_k\) is then \(p_k\), \(\Pr(x=x_k) = p_k\).

\[ U(x, p) = \sum_{k=1}^K p_k u(x_k) \tag{13}\]

If there are \(K\) possible outcomes then the decision maker’s utility is represented as expectation over the utility the decision maker gets from the hit. Note that if \(u(x_k) = x_k\) then expected utility is equal to the average. In our baseball case, it is the batting average. More generally, it is not equal to the average.

The problem with this standard economic approach for our purposes is that we don’t know what \(p\) is. In statistics, we have a sample estimate of \(p\), \(\hat{p}\) or a posterior distribution \(\gamma(p)\) or an estimate of that distribution \(\hat{\gamma}(p)\).

Compound Lotteries

One response to the fact that we don’t know \(p\) is to introduce another distribution \(q\), which represents the probability of \(p\). The probability of some distribution \(p_l\) is \(q_l\), \(\Pr(p = p_l) = q_l\). In the baseball example above, \(p_l\) represents batting average (the probability of a hit) and \(q_l\) is probability that a particular player has that batting average.

If we don’t know \(p\) but we do know \(q\) then we can write down expected utility as a compound expected utility.

\[ U(x, p, q) = \sum_{l=1}^L \sum_{k=1}^K q_l p_{lk} u(x_k) \tag{14}\]

where there are \(L\) possible distributions of \(p\) and \(K\) possible outcomes. This representation uses something called reduction of compound lotteries. This means that the two vectors of probabilities are multiplied together to give one reduced vector of probabilities.

For most economists, this is generally where they stop. In fact, some argue that a rational decision maker must use Equation 14 as a basis for decision making under uncertainty. The problem with the representation in Equation 14 is that it does not really account for uncertainty over what \(p\) is.

It is easiest to see the implication if we assume that \(u(x_k)=x_k\). In this case, Equation 14 implies the decision maker only cares about the average of the average. For a classical statistician, the average of the average is generally the analog estimate of the average. It is the number 1.000, for the case of John Paciorek. For a Bayesian statistician, it is the mean of the posterior distribution which is 0.800 for the standard Bayesian with a uniform prior and 0.251 for the eBayesian. The eBayesian is the only one of the three who would “rationally” prefer Babe Ruth to John Paciorek.

This is pretty strange given we have the whole field of statistics dedicated to determining the best way to measure uncertainty over \(p\). Even economists spend a good chunk of time and writing talking about the uncertainty over \(p\). Why spend so much time talking about something a rational decision maker cares naught for?

What in the World Does Wald Think?

In the 1940s, Hungarian-American mathematician, Abraham Wald, took decision theory directly to the statistician’s problem. In Wald’s set up, the decision maker has sample data and must make a choice given their utility. Interestingly, Wald considered the problem as a game with two players, Nature and Statistician. Nature chooses the true state of the world, although Nature could have mixed strategies (a distribution over the true state of the world). The Statistician does not know Nature’s choice, although she may know the mixed strategies (Wald 1949).⁷

Wald states that the Statistician should optimize against Nature by considering expectations over a utility function of the true state of the world and the outcome the Statistician receives.

\[ U(x, p, q) = \sum_{l=1}^L q_l L(p_l, x) \tag{15}\]

where \(L\) is for “loss” gives the decision maker’s utility given the true state of the world \(p_l\) and the set of outcomes \(x\). In Classical statistics, \(q_l\) is determined by the approximation to the limit distribution (Assumption 1). In Bayesian statistics, it is the posterior distribution.

Importantly, if the “loss” function \(L\) is not linear then we no longer have reduction of compound lotteries. Our utility can now explicitly account for the uncertainty.

Two Types of Uncertainty?

In the discussion above there are two probability distributions, \(p\) and \(q\). Some people use different words for the two types of uncertainty. The first is often called “risk” and the second is called “ambiguity.” The first is the probability distribution over outcomes; it is the player’s true batting average. The second is a distribution of Nature’s choice of the state of the world. In Bayesian statistics, it is the posterior distribution. Can we have two types of uncertainty? Should they be treated differently?

Yes. The first type is fixed, albeit unknown. It is Nature’s choice. The second type is not fixed. The second type of uncertainty varies with data and information. It may even be under control of the Statistician. For example, the Statistician may have the ability to create more data.

If we are to give just one number to summarize the data it should explicitly account for uncertainty associated with the sample data. For example the utility representation of Klibanoff, Marinacci, and Mukerji (2005),

\[ U(x, p, q) = \sum_{l=1}^L q_l h \left(\sum_{k=1}^K p_k u(x_k) \right) \tag{16}\]

where \(h\) may be some non-linear function. With this function we can help the decision maker by providing summaries of the data that explicitly account for the uncertainty of our estimates.

Discussion and Further Reading

Given that much of the focus of statistics is providing information about uncertainty, there is surprisingly little agreement about the best approach. Moreover, the two leading candidates, classical and Bayesian statistics, do a surprisingly poor job at representing the uncertainty.

I find the empirical Bayesian approach suggested in Robbins (1956) and developed by Brad Efron and coauthors to be a compelling third way. Admittedly, the applicability of the approach may be limited, but I come across more and more applications every day. The approach also seems to be consistent with modern ideas on rational decision making under uncertainty. See Klibanoff, Marinacci, and Mukerji (2005) and Gilboa (2013).

Here, the eBayesian approach is illustrated with strong assumptions on the distribution functions. Efron (2014) points out that if the set of outcomes is finite then under random sampling the likelihood function is known; it is the multinomial function. Note, in the baseball example the likelihood function is a binomial function under random sampling.

References

Benaglia, Tatiana, Didier Chauveau, and David R. Hunter. 2009. “An EM-Like Algorithm for Semi- and Non-Parametric Estimation in Multivariate Mixtures.” Journal of Computational and Graphical Statistics 18 (2): 505–26.

Berger, James O. 1985. Statistical Decision Theory and Bayesian Analysis. Springer.

Efron, Bradley. 2014. “Two Modeling Strategies for Empirical Bayes Estimation.” Statistical Science 29: 285–301.

Gilboa, Itzhak. 2013. “Lecture Notes for Introduction to Decision Theory.”

Hoffman, Benjamin. 2013. “For the Sultan of Small Sample Size, a 1.000 Career Average.” The New York Times 10 (1).

Klibanoff, Peter, Massimo Marinacci, and Sujoy Mukerji. 2005. “A Smooth Model of Decision Making Under Ambiguity.” Econometrica 73 (6): 1849–92.

Kreps, David. 1990. A Course in Microeconomic Theory. Harvester Wheatsheaf.

Levine, M., D. R. Hunter, and D. Chauveau. 2011. “Maximum Smoothed Likelihood for Multivariate Mixtures.” Biometrika 98 (2): 403–16.

Manski, Charles F. 1990. “Nonparametric Bounds on Treatment Effects.” American Economic Review Papers and Proceedings 80: 319–23.

Robbins, Herbert. 1956. “The Empirical Bayes Approach to Statistical Decision Problems.” Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 157–63.

Wald, Abraham. 1949. “Statistical Decision Functions.” Annals of Mathematical Statistics 20 (2): 165–205.

Footnotes

We should be a little careful because we are using a normal distribution to approximate the density.↩︎
The prior distribution is the probability distribution over the true values prior to observing the data. The posterior distribution is distribution after observing the data.↩︎
Levine, Hunter, and Chauveau (2011) shows that a similar algorithm does converge.↩︎
An “at bat” refers to each time the batter goes up to bat, but does not include the times when the batter is “walked” and moves to first base without hitting the ball.↩︎
Of course, there is an assumption. We are assuming random sampling. That is each at-bat is iid.↩︎
Go to http://www.seanlahman.com/baseball-archive/statistics/ This is almost 20,000 players.↩︎
A more modern game-theoretic approach would use the Nash equilibrium and set it up as a game of incomplete information. That is, the Statistician does not know Nature’s type. Each type plays an equilibrium strategy in choosing the true state of the world. The Statistician does not know Nature’s type, however by interacting with Nature through experiments, the Statistician can learn Nature’s type. In Bayes-Nash equilibrium, the Statistician uses Bayes’ rule to update the distribution over Nature’s type and the true state of the world.↩︎