\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\prob}[1]{\operatorname{P}\left(#1\right)}\) \(\newcommand{\Var}[1]{\operatorname{Var}\left(#1\right)}\) \(\newcommand{\E}[1]{\operatorname{E}\left(#1\right)}\) \(\newcommand{\Cov}[1]{\operatorname{Cov}\left(#1\right)}\) \(\newcommand{\Cor}[1]{\operatorname{Corr}\left(#1\right)}\) \(\newcommand{\sd}[1]{\operatorname{sd}\left(#1\right)}\) \(\newcommand{\prob}[1]{\operatorname{P}\left(#1\right)}\) \(\DeclareMathOperator*{\minf}{minimize \quad}\) \(\newcommand{\mininlineeq}[3]{\mbox{minimize}_{#1}\quad#2\qquad\mbox{subject to }#3}\) \(\newcommand{\mininline}[2]{\mbox{minimize}_{#1}\quad#2}\)
Explosion of Models and Algorithms starting in the 1950s
Bayesian Regularisation and Sparsity
Hierarchical Models and Shrinkage
Hidden Markov Models
Nonlinear Non-Gaussian State Space Models
Algorithms
Monte Carlo Method (von Neumann and Ulam, 1940s)
Metropolis-Hastings (Metropolis et al., 1953; Hastings, 1970)
Gibbs Sampling (Geman and Geman, Gelfand and Smith, 1980s)
Sequential Particle Filtering
Bayesian Probability (Ramsey, 1926, de Finetti, 1931)
Beta-Binomial Learning: Black Swans
Elections: Nate Silver
Baseball: Kenny Lofton and Derek Jeter
Monte Carlo (von Neumann and Ulam, Metropolis, 1940s)
Shrinkage Estimation (Lindley and Smith, Efron and Morris, 1970s)
Explicit use of probability for summarizing uncertainty.
A model for data given parameters \[f(y| \theta ) \; \; \text{Likelihood}\]
A distribution for unknown parameters \[p(\theta) \; \; \text{Prior}\]
Inference for unknowns conditional on observed data
Inverse probability (Bayes' Theorem)
Formal decision making (Loss, Utility)
Use Bayes' theorem to derive posterior distributions \[\begin{aligned} p( \theta | y ) & = \frac{p(y| \theta)p( \theta)}{p(y)} \\ p(y) & = \int p(y| \theta)p( \theta)d \theta\end{aligned}\] This allows you to make probability statements about the unknowns
They can be very different from p-values!
Hypothesis testing and Sequential problems
Markov chain Monte Carlo (MCMC) and Particle Filtering (PF)
Let \({\cal F}\) denote the class of distributions \(f ( y | \theta )\).
A class \(\Pi\) of prior distributions is conjugate for \({\cal F}\) if the
posterior distribution is in the class \(\Pi\) for all
\(f \in {\cal F} , \pi \in \Pi , y \in {\cal Y}\).
Example: Binomial/Beta:
Suppose that \(Y_1 , \ldots , Y_n \sim Ber ( p )\).
Let \(p \sim Beta ( \alpha , \beta )\) where \(( \alpha , \beta )\) are known hyper-parameters.
The beta-family is very flexible
Prior mean \(E ( p ) = \frac{\alpha}{ \alpha + \beta }\).
Likelihood for Bernoulli \[p\left( y|\theta\right) =\prod_{t=1}^{T}p\left( y_{t}|\theta\right) =\theta^{\sum_{t=1}^{T}y_{t}}\left( 1-\theta\right) ^{T-\sum_{t=1}^{T}y_{t}}.\] Initial prior distribution \(\theta\sim\mathcal{B}\left( a,A\right)\) given by \[p\left( \theta|a,A\right) =\frac{\theta^{a-1}\left( 1-\theta\right) ^{A-1}}{B\left( a,A\right) }\]
Updated posterior distribution is also Beta \[p\left( \theta|y\right) \sim\mathcal{B}\left( a_{T},A_{T}\right) \; {\rm and} \; a_{T}=a+\sum_{t=1}^{T}y_{t} , A_{T}=A+T-\sum_{t=1}^{T}y_{t}\] The posterior mean and variance are \[E\left[ \theta|y\right] =\frac{a_{T}}{a_{T}+A_{T}}\text{ and }var\left[ \theta|y\right] =\frac{a_{T}A_{T}}{\left( a_{T}+A_{T}\right) ^{2}\left( a_{T}+A_{T}+1\right) }\]
\(p ( p | \bar{y} )\) is the posterior distribution for \(p\)
\(\bar{y}\) is a sufficient statistic.
Bayes theorem gives \[\begin{aligned} p ( p | y ) & \propto f ( y | p ) p ( p | \alpha , \beta )\\ & \propto p^{\sum y_i} (1 - p )^{n - \sum y_i } \cdot p^{\alpha - 1} ( 1 - p )^{\beta - 1} \\ & \propto p^{ \alpha + \sum y_i - 1 } ( 1 - p )^{ n - \sum y_i + \beta - 1} \\ & \sim Beta ( \alpha + \sum y_i , \beta + n - \sum y_i ) \end{aligned}\]
The posterior mean is a shrinkage estimator
Combination of sample mean \(\bar{y}\) and prior mean \(E( p )\) \[E(p|y) = \frac{\alpha + \sum_{i=1}^n y_i}{\alpha + \beta + n} = \frac{n}{n+ \alpha +\beta} \bar{y} + \frac{\alpha + \beta}{\alpha + \beta+n} \frac{\alpha}{\alpha+\beta}\]
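A small numerical sketch of this updating and shrinkage, with hypothetical values (a Beta(2, 2) prior and 14 successes in \(n = 20\) trials, not taken from the text):

alpha <- 2; beta <- 2                               # hypothetical prior
n <- 20; s <- 14                                    # hypothetical data: 14 successes in 20 trials
(alpha + s)/(alpha + beta + n)                      # posterior mean = 16/24
w <- n/(n + alpha + beta)                           # weight on the sample mean
w * (s/n) + (1 - w) * alpha/(alpha + beta)          # same value via the shrinkage form
qbeta(c(0.025, 0.975), alpha + s, beta + n - s)     # 95% posterior interval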
Suppose you’ve only seen a sequence of White Swans, having never seen a Black Swan.
What’s the probability of a Black Swan event sometime in the future?
Suppose that after \(T\) trials you have only seen successes \(( y_1 , \ldots , y_T ) = ( 1 , \ldots , 1 )\). The next trial being a success has probability \[p( y_{T+1} =1 | y_1 , \ldots , y_T ) = \frac{T+1}{T+2}\] For large \(T\) this is almost certain. Here \(a=A=1\).
Principle of Induction (Hume)
The probability of seeing only White Swans in the next \(n\) trials is \[p( y_{T+1} =1 , \ldots , y_{T+n} = 1 | y_1 , \ldots , y_T ) = \frac{ T+1 }{ T+n+1 } \rightarrow 0 \quad \text{as } n \rightarrow \infty\]
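A quick numerical check of these two formulas, taking, say, \(T = 100\) white swans observed so far:

T <- 100
(T + 1)/(T + 2)                     # probability the next swan is white: about 0.99
n <- c(10, 100, 1000, 10000)
round((T + 1)/(T + n + 1), 4)       # probability of only white swans in the next n trials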
A Black Swan will eventually happen: don’t be surprised when it actually does.
Poisson/Gamma: Suppose that \(Y_1 , \ldots , Y_n \mid \lambda \sim Poi ( \lambda )\).
Let \(\lambda \sim Gamma ( \alpha , \beta )\)
\(( \alpha , \beta )\) are known hyper-parameters.
The posterior distribution is
\[\begin{aligned} p ( \lambda | y ) & \propto \exp ( - n \lambda ) \lambda^{ \sum y_i } \lambda^{ \alpha - 1 } \exp ( - \beta \lambda ) \\ & \sim Gamma ( \alpha + \sum y_i , n + \beta ) \end{aligned}\]
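A small numerical sketch of the update, with hypothetical counts and a hypothetical Gamma(2, 1) (shape, rate) prior:

y <- c(3, 0, 2, 4, 1)                            # hypothetical Poisson counts
alpha <- 2; beta <- 1                            # hypothetical prior shape and rate
a.post <- alpha + sum(y); b.post <- beta + length(y)
a.post/b.post                                    # posterior mean of lambda
qgamma(c(0.025, 0.975), a.post, rate = b.post)   # 95% posterior interval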
Novick and Grizzle: Bayesian Analysis of Clinical Trials
Four treatments for duodenal ulcers.
Doctors assess the state of the patient.
Sequential data
(\(\alpha\)-spending function, can only look at prespecified times).
| Treat | Excellent | Fair | Death |
|---|---|---|---|
| A | 76 | 17 | 7 |
| B | 89 | 10 | 1 |
| C | 86 | 13 | 1 |
| D | 88 | 9 | 3 |
The classical analysis cannot reject at the 5% level.
Conjugate binomial/beta model + sensitivity analysis.
Let \(p_i\) be the death rate proportion under treatment \(i\).
To compare treatment \(A\) to \(B\) directly compute \(P ( p_1 > p_2 | D )\).
Prior \(beta ( \alpha , \beta )\) with prior mean \(E ( p ) = \frac{\alpha}{\alpha + \beta }\).
Posterior \(beta ( \alpha + \sum x_i , \beta + n - \sum x_i )\)
For \(A\), \(beta ( 1 , 1 ) \rightarrow beta ( 8 , 94 )\)
For \(B\), \(beta ( 1 , 1 ) \rightarrow beta ( 2 , 100 )\)
Inference: \(P ( p_1 > p_2 | D ) \approx 0.98\)
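This probability is easy to check by simulation from the two posteriors above (a sketch using the Beta(8, 94) and Beta(2, 100) posteriors):

set.seed(1)
pA <- rbeta(1e5, 8, 94)    # posterior draws for treatment A's death rate
pB <- rbeta(1e5, 2, 100)   # posterior draws for treatment B's death rate
mean(pA > pB)              # approximately 0.98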
Important to do a sensitivity analysis.
Poisson-Gamma model: prior \(\lambda_i \sim \Gamma ( m , z)\), where \(\lambda_i\) is the expected death rate under treatment \(i\).
Compute \(P \left ( \frac{ \lambda_1 }{ \lambda_2 } > c | D \right )\)
| Probability | \((m,z) = (0, 0)\) | \((m,z) = (100, 2)\) | \((m,z) = (200, 5)\) |
|---|---|---|---|
| \(P \left ( \frac{ \lambda_1 }{ \lambda_2 } > 1.3 | D \right )\) | 0.95 | 0.88 | 0.79 |
| \(P \left ( \frac{ \lambda_1 }{ \lambda_2 } > 1.6 | D \right )\) | 0.91 | 0.80 | 0.64 |
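A simulation sketch of how such ratio probabilities can be computed. The exact \((m, z)\) parameterization behind the table isn't spelled out here, so the function below simply assumes a Gamma(shape \(= m +\) deaths, rate \(= z +\) patients) posterior for each treatment; it illustrates the mechanics rather than reproducing the table entries.

ratio.prob <- function(cutoff, m, z, yA = 7, nA = 100, yB = 1, nB = 100, nsim = 1e5) {
  lA <- rgamma(nsim, m + yA, rate = z + nA)   # posterior draws for lambda_A
  lB <- rgamma(nsim, m + yB, rate = z + nB)   # posterior draws for lambda_B
  mean(lA/lB > cutoff)
}
ratio.prob(1.3, m = 0, z = 0)   # noninformative column, cutoff c = 1.3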
Using Bayes’ rule we get \[p( \mu | y ) \propto p( y| \mu ) p( \mu )\]
SAT scores (\(200-800\)): 8 high schools; estimate the coaching effects.
| School | Estimated \(y_j\) | St. Error \(\sigma_j\) | Average Treatment Effect \(\theta_j\) |
|---|---|---|---|
| A | 28 | 15 | ? |
| B | 8 | 10 | ? |
| C | -3 | 16 | ? |
| D | 7 | 11 | ? |
| E | -1 | 9 | ? |
| F | 1 | 11 | ? |
| G | 18 | 10 | ? |
| H | 12 | 18 | ? |
\(\theta_j\): average effect of the coaching program in school \(j\)
\(y_j\): estimated treatment effect for school \(j\), with standard error \(\sigma_j\).
Two programs appear to work (improvements of \(18\) and \(28\))
Large standard errors. Overlapping Confidence Intervals?
Classical hypothesis test fails to reject the hypothesis that the \(\theta_j\)’s are equal.
Pooled estimate has standard error of \(4.2\) with \[\hat{\theta} = \frac{ \sum_j ( y_j / \sigma_j^2 ) }{ \sum_j ( 1 / \sigma_j^2 ) } = 7.9\]
Neither separate nor pooled estimates seem sensible.
Bayesian shrinkage!
Hierarchical Model (\(\sigma_j^2\) known) is given by \[y_j | \theta_j \sim N ( \theta_j , \sigma_j^2 )\] Unequal variances lead to differential shrinkage.
Prior Distribution: \(\theta_j \sim N ( \mu , \tau^2 )\) for \(1 \leq j \leq 8\).
Traditional random effects model.
Exchangeable prior for the treatment effects.
As \(\tau \rightarrow 0\) (complete pooling) and as \(\tau \rightarrow \infty\) (separate estimates).
Hyper-prior Distribution: \(p( \mu , \tau^2 ) \propto 1 / \tau\).
The posterior \(p( \mu , \tau^2 | y )\) can be used to “estimate” \(( \mu , \tau^2 )\).
Joint Posterior Distribution \(y = ( y_1 , \ldots , y_J )\) \[p( \theta , \mu , \tau | y ) \propto p( y| \theta ) p( \theta | \mu , \tau )p( \mu , \tau )\] \[\propto p( \mu , \tau^2) \prod_{j=1}^8 N ( \theta_j | \mu , \tau^2 ) \prod_{j=1}^8 N ( y_j | \theta_j , \sigma_j^2 )\] \[\propto \tau^{-9} \exp \left ( - \frac{1}{2} \sum_j \frac{1}{\tau^2} ( \theta_j - \mu )^2 - \frac{1}{2} \sum_j \frac{1}{\sigma_j^2} ( y_j - \theta_j )^2 \right )\] MCMC!
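A minimal Gibbs sampler sketch for this model, conditioning on the tabled \(y_j, \sigma_j\) and the hyper-prior \(p(\mu, \tau^2) \propto 1/\tau\) (an illustration, not necessarily the exact sampler behind the table that follows):

y     <- c(28, 8, -3, 7, -1, 1, 18, 12)      # estimated effects from the table
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)    # standard errors from the table
J <- length(y)
n.iter <- 5000
theta <- matrix(NA, n.iter, J); mu <- numeric(n.iter); tau2 <- numeric(n.iter)
theta[1, ] <- y; mu[1] <- mean(y); tau2[1] <- var(y)
for (s in 2:n.iter) {
  # theta_j | mu, tau2, y ~ N(V_j (y_j/sigma_j^2 + mu/tau2), V_j)
  V <- 1/(1/sigma^2 + 1/tau2[s - 1])
  theta[s, ] <- rnorm(J, V * (y/sigma^2 + mu[s - 1]/tau2[s - 1]), sqrt(V))
  # mu | theta, tau2 ~ N(mean(theta), tau2/J)
  mu[s] <- rnorm(1, mean(theta[s, ]), sqrt(tau2[s - 1]/J))
  # tau2 | theta, mu ~ Inv-Gamma((J-1)/2, sum((theta - mu)^2)/2)
  S <- sum((theta[s, ] - mu[s])^2)
  tau2[s] <- 1/rgamma(1, (J - 1)/2, rate = S/2)
}
burn <- 1:1000
round(t(apply(theta[-burn, ], 2, quantile, c(0.025, 0.25, 0.5, 0.75, 0.975))), 1)
round(quantile(mu[-burn], c(0.025, 0.25, 0.5, 0.75, 0.975)), 1)
round(quantile(sqrt(tau2[-burn]), c(0.025, 0.25, 0.5, 0.75, 0.975)), 1)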
Report posterior quantiles
| School | 2.5% | 25% | 50% | 75% | 97.5% |
|---|---|---|---|---|---|
| A | -2 | 6 | 10 | 16 | 32 |
| B | -5 | 4 | 8 | 12 | 20 |
| C | -12 | 3 | 7 | 11 | 22 |
| D | -6 | 4 | 8 | 12 | 21 |
| E | -10 | 2 | 6 | 10 | 19 |
| F | -9 | 2 | 6 | 10 | 19 |
| G | -1 | 6 | 10 | 15 | 27 |
| H | -7 | 4 | 8 | 13 | 23 |
| \(\mu\) | -2 | 5 | 8 | 11 | 18 |
| \(\tau\) | 0.3 | 2.3 | 5.1 | 8.8 | 21 |
Schools \(A\) and \(G\) are similar!
Bayesian shrinkage provides a way of modeling complex datasets.
Baseball batting averages: Stein’s Paradox
Batter-pitcher match-up: Kenny Lofton and Derek Jeter
Bayes Elections
Toxoplasmosis
Bayes MoneyBall
Bayes Portfolio Selection
Batter-pitcher match-up?
Prior information on overall ability of a player.
Small sample size, pitcher variation.
Let \(p_i\) denote Lofton’s ability. Observed number of hits \(y_i\) \[(y_i | p_i ) \sim Bin ( T_i , p_i ) \; \; {\rm with} \; \; p_i \sim Be ( \alpha , \beta )\] where \(T_i\) is the number of at-bats against pitcher \(i\). A priori \(E( p_i ) = \alpha / (\alpha+\beta ) = \bar{p}_i\).
The extra heterogeneity leads to a prior variance \(Var (p_i ) = \bar{p}_i (1 - \bar{p}_i ) \phi\) where \(\phi = ( \alpha + \beta + 1 )^{-1}\).
| Pitcher | At-bats | Hits | ObsAvg |
|---|---|---|---|
| J.C. Romero | 9 | 6 | .667 |
| S. Lewis | 5 | 3 | .600 |
| B. Tomko | 20 | 11 | .550 |
| T. Hoffman | 6 | 3 | .500 |
| K. Tapani | 45 | 22 | .489 |
| A. Cook | 9 | 4 | .444 |
| J. Abbott | 34 | 14 | .412 |
| A.J. Burnett | 15 | 6 | .400 |
| K. Rogers | 43 | 17 | .395 |
| A. Harang | 6 | 2 | .333 |
| K. Appier | 49 | 15 | .306 |
| R. Clemens | 62 | 14 | .226 |
| C. Zambrano | 9 | 2 | .222 |
| N. Ryan | 10 | 2 | .200 |
| E. Hanson | 41 | 7 | .171 |
| E. Milton | 19 | 1 | .056 |
| M. Prior | 7 | 0 | .000 |
| Total | 7630 | 2283 | .299 |
Kenny Lofton versus individual pitchers.
Kenny Lofton (career \(.299\) average, and a current \(.308\) average for the \(2006\) season) was facing the pitcher E. Milton (current match-up record: \(1\) for \(19\)).
Is putting in a weaker player really a better bet?
Over-reaction to bad luck?
\(\mathbb{P}\left ( \leq 1 \; {\rm hit \; in \; } 19 \; {\rm attempts} | p = 0.3 \right ) = 0.01\)
An unlikely \(1\)-in-\(100\) event.
Bayes solution: shrinkage. Borrow strength across pitchers
Bayes estimate: use the posterior mean
Lofton’s batting estimates vary from \(.265\) to \(.340\).
The lowest being against Milton.
\(.265 < .275\)
Conclusion: resting Lofton against Milton was justified!!
Here’s our model again ...
Small sample sizes and pitcher variation.
Let \(p_i\) denote Jeter’s ability. Observed number of hits \(y_i\) \[(y_i | p_i ) \sim Bin ( T_i , p_i ) \; \; {\rm with} \; \; p_i \sim Be ( \alpha , \beta )\] where \(T_i\) is the number of at-bats against pitcher \(i\).
Estimate \(( \alpha , \beta )\) from the data (empirical Bayes).
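For illustration, a sketch of how estimates like the EstAvg column below can be produced, plugging in the overall \(\bar{p} = .316\) and the \(\hat{\phi} = 0.002\) quoted for Jeter (a moment-style plug-in, not necessarily Stern's exact fitting procedure):

phi <- 0.002; pbar <- 0.316
alpha <- pbar * (1/phi - 1)                                  # about 158
beta  <- (1 - pbar) * (1/phi - 1)                            # about 341
post.mean <- function(y, T) (alpha + y)/(alpha + beta + T)   # posterior ability against one pitcher
post.mean(14, 28)    # about .326, close to the tabled .324 for the 14-of-28 match-up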
| Pitcher | At-bats | Hits | ObsAvg | EstAvg | 95% Int |
|---|---|---|---|---|---|
| R. Mendoza | 6 | 5 | .833 | .322 | (.282,.394) |
| H. Nomo | 20 | 12 | .600 | .326 | (.289,.407) |
| A.J.Burnett | 5 | 3 | .600 | .320 | (.275,.381) |
| E. Milton | 28 | 14 | .500 | .324 | (.291,.397) |
| D. Cone | 8 | 4 | .500 | .320 | (.218,.381) |
| R. Lopez | 45 | 21 | .467 | .326 | (.291,.401) |
| K. Escobar | 39 | 16 | .410 | .322 | (.281, .386) |
| J. Wettland | 5 | 2 | .400 | .318 | (.275,.375) |
| T. Wakefield | 81 | 26 | .321 | .318 | (.279,.364) |
| P. Martinez | 83 | 21 | .253 | .312 | (.254,.347) |
| K. Benson | 8 | 2 | .250 | .317 | (.264,.368) |
| T. Hudson | 24 | 6 | .250 | .315 | (.260,.362) |
| J. Smoltz | 5 | 1 | .200 | .314 | (.253,.355) |
| F. Garcia | 25 | 5 | .200 | .314 | (.253,.355) |
| B. Radke | 41 | 8 | .195 | .311 | (.247,.347) |
| D. Kolb | 5 | 0 | .000 | .316 | (.258,.363) |
| J. Julio | 13 | 0 | .000 | .312 | (.243,.350 ) |
| Total | 6530 | 2061 | .316 | | |
Derek Jeter 2006 season versus individual pitchers.
Stern estimates \(\hat{\phi} = ( \alpha + \beta + 1 )^{-1} = 0.002\) for Jeter.
His ability doesn’t vary much across the population of pitchers.
The extreme observed averages are shrunk the most, as are the match-ups with the smallest sample sizes.
Jeter had a season \(.308\) average.
Bayes estimates vary from \(.311\) to \(.327\)–he’s very consistent.
If all players had a similar record then a constant batting average would make sense.
Predicting the Electoral Vote (EV)
Multinomial-Dirichlet: \((\hat{p} | p) \sim Multi (p), \; ( p | \alpha ) \sim Dir (\alpha)\). The posterior for the Obama shares is \[( p_{1}, \ldots ,p_{51}) | \hat{p} \sim Dir \left ( \alpha + \hat{p} \right )\]
Flat uninformative prior \(\alpha\equiv 1\).
http://www.electoral-vote.com/evp2012/Pres/prespolls.csv
Calculate probabilities via simulation: rdirichlet \[{\color{blue}{p \left ( p_{j,O} | {\rm data} \right ) \;\; {\rm and} \; \; p \left ( EV >270 | {\rm data} \right )}}\]
The election vote prediction is given by the sum \[{\color{red}{EV =\sum_{j=1}^{51} EV(j) \mathbb{E} \left ( p_{j} | {\rm data} \right )}}\] where \(EV(j)\) is the number of electoral votes for state \(j\).
electoral-vote.com: Electoral Vote (EV) and polling data, Romney and Obama percentages.
library(kableExtra)
library(LearnBayes)
library(plyr)
library(lattice)
file1 <- "http://www.electoral-vote.com/evp2012/Pres/pres_polls.csv"
election.2012 <- read.csv(file = file1)
# Remove a pollster: elect2012 <- election.2012[!grepl('Rasmussen', election.2012$Pollster),]
elect2012 <- election.2012
# Aggregate the data
elect2012 <- ddply(elect2012, .(State), subset, Day == max(Day))
elect2012 <- ddply(elect2012, .(State), summarise, R.pct = mean(GOP), O.pct = mean(Dem),
EV = mean(EV))
| State | R.pct | O.pct | EV |
|---|---|---|---|
| Alabama | 61 | 38 | 9 |
| Alaska | 55 | 42 | 3 |
| Arizona | 54 | 44 | 11 |
| Arkansas | 61 | 37 | 6 |
| California | 38 | 59 | 55 |
| Colorado | 47 | 51 | 9 |
| Connecticut | 40 | 58 | 7 |
| D.C. | 7 | 91 | 3 |
| Delaware | 40 | 59 | 3 |
| Florida | 49 | 50 | 29 |
| Georgia | 53 | 45 | 16 |
| Hawaii | 28 | 71 | 4 |
| Idaho | 65 | 33 | 4 |
| Illinois | 41 | 57 | 20 |
| Indiana | 54 | 44 | 11 |
| Iowa | 47 | 52 | 6 |
| Kansas | 60 | 38 | 6 |
| Kentucky | 61 | 38 | 8 |
| Louisiana | 58 | 41 | 8 |
| Maine | 41 | 56 | 4 |
| Maryland | 37 | 62 | 10 |
| Massachusetts | 38 | 61 | 11 |
| Michigan | 45 | 54 | 16 |
| Minnesota | 45 | 53 | 10 |
| Mississippi | 56 | 44 | 6 |
# Run the Simulation
prob.Obama <- function(mydata) {
p <- rdirichlet(1000, 500 * c(mydata$R.pct, mydata$O.pct, 100 - mydata$R.pct -
mydata$O.pct)/100 + 1)
mean(p[, 2] > p[, 1])
}
win.probs <- ddply(elect2012, .(State), prob.Obama)
win.probs$Romney <- 1 - win.probs$V1
names(win.probs)[2] <- "Obama"
win.probs$EV <- elect2012$EV
win.probs <- win.probs[order(win.probs$EV), ]
rownames(win.probs) <- win.probs$State
# Plot Probabilities by State
barplot(t(as.matrix(win.probs[, 2:3])), legend = TRUE, args.legend = list(x = "topright",
cex = 0.5), names.arg = win.probs$State, horiz = TRUE, cex.names = 0.5,
las = 2, xlab = "Probability", main = "Probability of Obama or Romney by State")
# Simulate the electoral college using these state-level probabilities
sim.election <- function(win.probs) {
winner <- rbinom(51, 1, win.probs$Obama)
sum(win.probs$EV * winner)
}
sim.EV <- replicate(10000, sim.election(win.probs))
oprob <- sum(sim.EV >= 270)/length(sim.EV)
# probability of Obama getting 270 EV or more:
oprob
## [1] 0.9709
# Lattice Graph
densityplot(sim.EV, plot.points = "rug", xlab = "Electoral Votes for Obama",
panel = function(x, ...) {
panel.densityplot(x, ...)
panel.abline(v = 270)
panel.text(x = 285, y = 0.01, "270 EV to Win")
panel.abline(v = 332)
panel.text(x = 347, y = 0.01, "Actual Obama")
}, main = "Electoral College Results Probability")
Let’s do the same for 2008 elections!
library(LearnBayes)
data(election.2008)
attach(election.2008)
## Dirichlet simulation
prob.Obama = function(j)
{
p=rdirichlet(5000,500*c(M.pct[j],O.pct[j],100-M.pct[j]-O.pct[j])/100+1)
mean(p[,2]>p[,1])
}
## sapply function to compute Obama win prob for all states
Obama.win.probs=sapply(1:51,prob.Obama)
## sim.EV function
sim.election = function()
{
winner = rbinom(51,1,Obama.win.probs)
sum(EV*winner)
}
sim.EV = replicate(1000,sim.election())
## histogram of simulated election
hist(sim.EV,min(sim.EV):max(sim.EV),col="blue",prob=T)
abline(v=365,lwd=3) # Obama received 365 votes
text(375,30,"Actual \n Obama \n total")
Update our beliefs in light of new data.
In the 2014-2015 NFL season:
The Bears suffered back-to-back \(50\)-point defeats.
Patriots-Bears \(51-23\)
Packers-Bears \(55-14\)
Their next game was at home against the Minnesota Vikings.
The current line against the Vikings was \(-3.5\).
Slightly over a field goal
What’s the Bayes approach to learning the line?
Hierarchical model for the current average win/lose margin this year \[\begin{aligned} \bar{y} | \theta & \sim N \left ( \theta , \frac{\sigma^2}{n} \right ) \sim N \left ( \theta , \frac{18.34^2}{9} \right )\\ \theta & \sim N( 0 , \tau^2 )\end{aligned}\] Here \(n = 9\) games so far, with sample standard deviation \(s = 18.34\) points.
Pre-season prior mean \(\mu_0 = 0\), standard deviation \(\tau = 4\).
Record so far: \(\bar{y} = -9.22\).
Bayes Shrinkage estimator \[\mathbb{E} \left ( \theta | \bar{y} , \tau \right ) = \frac{ \tau^2 }{ \tau^2 + \frac{\sigma^2}{n} } \bar{y}\]
The shrinkage factor \(\frac{\tau^2}{\tau^2 + \sigma^2/n}\) is \(0.3\)!!
That’s quite a bit of shrinkage.
Our updated estimator is \[\mathbb{E} \left ( \theta | \bar{y} , \tau \right ) = - 2.75 > -3.5\] where the current line is \(-3.5\).
Based on our hierarchical model the line looks like an over-reaction to the two blow-out losses.
One point change on the line is about \(3\)% on a probability scale.
Alternatively, calculate a win probability given the line of \(-3.5\).
In each of the last two defeats the opponent scored over \(50\) points (2014-15).
bears = c(-3, 8, 8, -21, -7, 14, -13, -28, -41)   # point margins in the 9 games so far
mean(bears)
## [1] -9.222222
sd(bears)
## [1] 18.34242
tau = 4
sig2 = sd(bears) * sd(bears)/9
tau^2/(sig2 + tau^2)                              # shrinkage factor
## [1] 0.2997225
0.2997 * -9.22                                    # shrunk estimate of theta
## [1] -2.763234
pnorm(-2.76/18)
## [1] 0.4390677
Home advantage is worth about \(3\) points, and the Vikings had an average record.
It is possible to make a uniform improvement on the MLE in terms of MSE.
Mistrust of the statistical interpretation of Stein’s result.
In particular, the loss function.
Difficulties in adapting the procedure to special cases
Long familiarity with the good properties of the MLE
Any gains from a “complicated” procedure could not be worth the extra trouble (Tukey: savings of not more than 10% in practice)
For \(k\ge 3\), we have the remarkable inequality \[MSE(\hat \theta_{JS},\theta) < MSE(\bar y,\theta) \; \forall \theta\] Bias-variance explanation! Inadmissibility of the classical estimator.
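A small simulation sketch of this dominance, assuming unit sampling variances and an arbitrary set of true means (it uses a positive-part James-Stein rule shrinking toward zero):

set.seed(1)
k <- 10; nsim <- 10000
theta <- rnorm(k, 0, 2)                                   # arbitrary true means
y <- matrix(rnorm(k * nsim, theta, 1), k, nsim)           # one N(theta_i, 1) draw per mean, per replication
js <- function(y) pmax(1 - (k - 2)/sum(y^2), 0) * y       # positive-part James-Stein estimator
mse.mle <- mean(colSums((y - theta)^2))
mse.js  <- mean(colSums((apply(y, 2, js) - theta)^2))
c(MLE = mse.mle, JS = mse.js)                             # total squared-error risk: JS comes out smaller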
Data: 18 major-league players after 45 at bats (1970 season)
bball = read.table("http://www.swarthmore.edu/NatSci/peverso1/Sports%20Data/JamesSteinData/Efron-Morris%20Baseball/EfronMorrisBB.txt",header=TRUE, stringsAsFactors=FALSE)
bball$shrinkage = bball$BattingAverage * .212 + .788 * (0.265)
knitr::kable(bball[c("LastName","BattingAverage","RemainingAverage", "shrinkage")], digits=3)
| LastName | BattingAverage | RemainingAverage | shrinkage |
|---|---|---|---|
| Clemente | 0.400 | 0.346 | 0.294 |
| Robinson | 0.378 | 0.298 | 0.289 |
| Howard | 0.356 | 0.276 | 0.284 |
| Johnstone | 0.333 | 0.222 | 0.279 |
| Berry | 0.311 | 0.273 | 0.275 |
| Spencer | 0.311 | 0.270 | 0.275 |
| Kessinger | 0.289 | 0.264 | 0.270 |
| Alvarado | 0.267 | 0.210 | 0.265 |
| Santo | 0.244 | 0.269 | 0.261 |
| Swaboda | 0.244 | 0.230 | 0.261 |
| Petrocelli | 0.222 | 0.264 | 0.256 |
| Rodriguez | 0.222 | 0.226 | 0.256 |
| Scott | 0.222 | 0.303 | 0.256 |
| Unser | 0.222 | 0.264 | 0.256 |
| Williams | 0.222 | 0.330 | 0.256 |
| Campaneris | 0.200 | 0.285 | 0.251 |
| Munson | 0.178 | 0.316 | 0.247 |
| Alvis | 0.156 | 0.200 | 0.242 |
Let \(\theta_i\) denote the end of season average
Lindley: shrink to the overall grand mean \[c = 1 - \frac{ ( k - 3 ) \sigma^2 }{ \sum ( \bar{y}_i - \bar{y} )^2 }\] where \(\bar{y}\) is the overall grand mean and \[\hat{\theta} = c \bar{y}_i + ( 1 - c ) \bar{y}\]
Baseball data: \(c = 0.212\) and \(\bar{y} = 0.265\).
Compute \(\sum ( \hat{\theta}_i - \bar{y}^{obs}_i )^2\) and see which is lower: \[MLE = 0.077 \; \; STEIN = 0.022\] That’s a factor of \(3.5\) times better!
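A direct check of this comparison using the table columns (a sketch; the exact numbers depend on whether the remaining-season or full-season averages are used as the target, so they come out close to, but not exactly, the \(0.077\) and \(0.022\) quoted):

sum((bball$BattingAverage - bball$RemainingAverage)^2)   # squared error of the raw first-45 averages
sum((bball$shrinkage - bball$RemainingAverage)^2)        # squared error of the shrinkage estimates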
bball$js = bball$BattingAverage * .212 + .788 * (0.265)
bball$LastName[!is.na(match(bball$LastName, c("Scott","Williams", "Rodriguez", "Unser","Swaboda","Spencer")))] = ""
a = matrix(rep(1:3, nrow(bball)), 3, nrow(bball))
b = matrix(c(bball$BattingAverage, bball$SeasonAverage, bball$js), 3, nrow(bball), byrow=TRUE)
matplot(a, b, pch=" ", ylab="predicted average", xaxt="n", xlim=c(0.5, 3.1), ylim=c(0.13, 0.42))
matlines(a, b)
text(rep(0.7, nrow(bball)), bball$BattingAverage, bball$LastName, cex=0.6)
text(1, 0.14, "First 45\nat bats", cex=0.5)
text(2, 0.14, "Average\nof remainder", cex=0.5)
text(3, 0.14, "J-S\nestimator", cex=0.5)
# Data was scraped from here: https://www.baseball-almanac.com/players/hittinglogs.php?p=clemero01&y=1970
d = read.csv("https://www.dropbox.com/s/hxp5bieq4ykah84/clemente.csv?dl=1")[-(1:2),]
ab = cumsum(d$AB)
h = cumsum(d$H)
plot(sqrt(ab), h/ab, bty='n', xlab="Number at Bats (square root scale)", ylab="Batting Average")
lines(sqrt(ab), h/ab, col="green")
abline(v=sqrt(45))
Criticism: the shrinkage may be too severe, e.g. for Clemente \(z_{Cl} = 0.265 + 0.212 ( 0.400 - 0.265) = 0.294\).
The \(0.212\) seems a little severe
Limited translation rules: cap the maximum shrinkage, e.g. at 80%.
Not enough shrinkage in other cases, e.g. O’Connor (\(y = 1\) hit in \(n = 2\) at-bats) gets \(z_{O'C} = 0.421\).
Still higher than Ted Williams’s \(0.406\) in 1941.
Adding an unrelated series such as foreign car sales (making \(k = 19\)) will further improve the total MSE performance! It will also change the shrinkage factors.
A clear improvement over the Stein estimator is the positive-part version \[\hat{\theta}_{S+} = \max \left ( \left ( 1 - \frac{k-2}{ \sum \bar{Y}_i^2 } \right ) , 0 \right ) \bar{Y}_i\]
Include extra prior knowledge
Empirical distribution of all major league players \[\theta_i \sim N ( 0.270 , 0.015 )\] The \(0.270\) provides another origin to shrink to and the prior variance \(0.015\) would give a different shrinkage factor.
To fully understand this, maybe we should build a probabilistic model and use the posterior mean as our estimator of the unknown parameters.
Model \(Y_i | \theta_i \sim N ( \theta_i , D_i )\) where \(\theta_i \sim N ( \theta_0 , A ) = N ( 0.270 , 0.015 )\).
The \(D_i\) can be different – unequal variances
Bayes posterior means are given by \[E ( \theta_i | Y ) = ( 1 - B_i ) Y_i + B_i \theta_0 \; \; {\rm where} \; \; B_i = \frac{ D_i }{ D_i + A }\] with \(A\) replaced by an estimate \(\hat{A}\) from the data; see Efron and Morris (1975).
Different shrinkage factors as different variances \(D_i\).
\(D_i \propto n_i^{-1}\) and so smaller sample sizes are shrunk more.
Makes sense.
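A closing numerical sketch with hypothetical at-bat counts, showing how \(D_i \propto n_i^{-1}\) gives more shrinkage to smaller samples (using the \(N(0.270, 0.015)\) prior above):

A    <- 0.015                             # prior variance A from above
n    <- c(10, 45, 200)                    # hypothetical at-bat counts
ybar <- c(0.100, 0.400, 0.280)            # hypothetical observed averages
D    <- 0.27 * (1 - 0.27)/n               # binomial-style sampling variances D_i
B    <- D/(D + A)                         # shrinkage factors B_i
post <- (1 - B) * ybar + B * 0.270        # posterior means E(theta_i | Y)
round(cbind(n, ybar, B, post), 3)         # smaller n => larger B => more shrinkage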