\(\newcommand{\ci}{\perp\!\!\!\perp}\) \(\newcommand{\prob}[1]{\operatorname{P}\left(#1\right)}\) \(\newcommand{\Var}[1]{\operatorname{Var}\left(#1\right)}\) \(\newcommand{\E}[1]{\operatorname{E}\left(#1\right)}\) \(\newcommand{\Cov}[1]{\operatorname{Cov}\left(#1\right)}\) \(\newcommand{\Cor}[1]{\operatorname{Corr}\left(#1\right)}\) \(\newcommand{\sd}[1]{\operatorname{sd}\left(#1\right)}\) \(\newcommand{\prob}[1]{\operatorname{P}\left(#1\right)}\) \(\DeclareMathOperator*{\minf}{minimize \quad}\) \(\newcommand{\mininlineeq}[3]{\mbox{minimize}_{#1}\quad#2\qquad\mbox{subject to }#3}\) \(\newcommand{\mininline}[2]{\mbox{minimize}_{#1}\quad#2}\)

Things to Know

Explosion of Models and Algorithms starting in 1950s

Algorithms

Probabilistic Reasoning

Bayesian Inference

Explicit use of probability for summarizing uncertainty.

  1. A probability model for data given parameters \[f(y| \theta ) \; \; \text{Likelihood}\]

  2. A prior distribution for unknown parameters \[p(\theta) \; \; \text{Prior}\]

  3. Inference for unknowns conditional on observed data

    Inverse probability (Bayes Theorem);

    Formal decision making (Loss, Utility)

Posterior Inference

Use Bayes' rule to derive posterior distributions \[\begin{aligned} p( \theta | y ) & = \frac{p(y| \theta)p( \theta)}{p(y)} \\ p(y) & = \int p(y| \theta)p( \theta)d \theta\end{aligned}\] This allows you to make probability statements about the unknown parameters.

Conjugate Priors

Let \({\cal F}\) denote the class of distributions \(f ( y | \theta )\).

A class \(\Pi\) of prior distributions is conjugate for \({\cal F}\) if the
posterior distribution is in the class \(\Pi\) for all
\(f \in {\cal F} , \pi \in \Pi , y \in {\cal Y}\).

Bayes Learning: Beta-Binomial

How do I update my beliefs about a coin toss?

Likelihood for Bernoulli \[p\left( y|\theta\right) =\prod_{t=1}^{T}p\left( y_{t}|\theta\right) =\theta^{\sum_{t=1}^{T}y_{t}}\left( 1-\theta\right) ^{T-\sum_{t=1}^{T}y_{t}}.\] Initial prior distribution \(\theta\sim\mathcal{B}\left( a,A\right)\) given by \[p\left( \theta|a,A\right) =\frac{\theta^{a-1}\left( 1-\theta\right) ^{A-1}}{B\left( a,A\right) }\]

Bayes Learning: Beta-Binomial

Updated posterior distribution is also Beta \[p\left( \theta|y\right) \sim\mathcal{B}\left( a_{T},A_{T}\right) \; {\rm and} \; a_{T}=a+\sum_{t=1}^{T}y_{t} , A_{T}=A+T-\sum_{t=1}^{T}y_{t}\] The posterior mean and variance are \[E\left[ \theta|y\right] =\frac{a_{T}}{a_{T}+A_{T}}\text{ and }var\left[ \theta|y\right] =\frac{a_{T}A_{T}}{\left( a_{T}+A_{T}\right) ^{2}\left( a_{T}+A_{T}+1\right) }\]
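A minimal R sketch of this update, using hypothetical data (7 heads in 10 tosses) and a uniform Beta(1, 1) prior:

a <- 1; A <- 1                          # Beta(a, A) prior
y <- c(1, 1, 0, 1, 1, 1, 0, 1, 0, 1)    # hypothetical coin-toss data
T <- length(y)                          # number of tosses (T in the slide notation)
aT <- a + sum(y)                        # posterior parameters a_T, A_T
AT <- A + T - sum(y)
aT / (aT + AT)                          # posterior mean: 8/12 = 0.667
curve(dbeta(x, aT, AT), 0, 1, xlab = expression(theta), ylab = "posterior density")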

Binomial-Beta

\(p ( p | \bar{y} )\) is the posterior distribution for \(p\).

\(\bar{y}\) is a sufficient statistic.

Black Swans

Taleb, The Black Swan: The Impact of the Highly Improbable

Suppose you’ve only ever seen a sequence of White Swans, having never seen a Black Swan.

What’s the probability of a Black Swan event sometime in the future?

Suppose that after \(T\) trials you have only seen successes, \(( y_1 , \ldots , y_T ) = ( 1 , \ldots , 1 )\). The next trial being a success has probability \[p( y_{T+1} =1 | y_1 , \ldots , y_T ) = \frac{T+1}{T+2}\] For large \(T\) this is almost certain. Here \(a=A=1\) (a uniform prior).

Black Swans

Principle of Induction (Hume)

The probability of never seeing a Black Swan in the next \(n\) trials is \[p( y_{T+1} =1 , \ldots , y_{T+n} = 1 | y_1 , \ldots , y_T ) = \frac{ T+1 }{ T+n+1 } \rightarrow 0 \quad \text{as } n \rightarrow \infty\]

A Black Swan will eventually happen – don’t be surprised when it actually happens.
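A quick numerical check of these formulas, assuming (hypothetically) \(T = 100\) white swans observed so far:

T <- 100
(T + 1) / (T + 2)                # probability the next swan is also white
n <- c(10, 100, 1000, 10000)
(T + 1) / (T + n + 1)            # probability of no Black Swan in the next n trials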

Bayesian Learning: Poisson-Gamma

Poisson/Gamma: Suppose that \(Y_1 , \ldots , Y_n \mid \lambda \sim Poi ( \lambda )\).

Let \(\lambda \sim Gamma ( \alpha , \beta )\)

\(( \alpha , \beta )\) are known hyper-parameters.
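Assuming the shape-rate parameterization of the Gamma, the standard conjugate update is \[\lambda \mid y_1 , \ldots , y_n \sim Gamma \left( \alpha + \sum_{i=1}^n y_i , \; \beta + n \right), \qquad E \left[ \lambda \mid y \right] = \frac{ \alpha + \sum_{i=1}^n y_i }{ \beta + n }.\]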

Example: Clinical Trials

Novick and Grizzle: Bayesian Analysis of Clinical Trials

Four treatments for duodenal ulcers.

Doctors assess the state of the patient.

Sequential data

(\(\alpha\)-spending function, can only look at prespecified times).

Treat Excellent Fair Death
A 76 17 7
B 89 10 1
C 86 13 1
D 88 9 3

Cannot reject at the 5% level

Conjugate binomial/beta model + sensitivity analysis.

Binomial-Beta

Let \(p_i\) be the death rate proportion under treatment \(i\).

Sensitivity Analysis

Important to do a sensitivity analysis.

Treat Excellent Fair Death
A 76 17 7
B 89 10 1
C 86 13 1
D 88 9 3

Poisson-Gamma model with prior \(\Gamma ( m , z)\); let \(\lambda_i\) be the expected death rate under treatment \(i\).

Compute \(P \left ( \frac{ \lambda_1 }{ \lambda_2 } > c | D \right )\)

Probability \((m,z)=(0,0)\) \((m,z)=(100,2)\) \((m,z)=(200,5)\)
\(P \left ( \frac{ \lambda_1 }{ \lambda_2 } > 1.3 | D \right )\) 0.95 0.88 0.79
\(P \left ( \frac{ \lambda_1 }{ \lambda_2 } > 1.6 | D \right )\) 0.91 0.80 0.64
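These tail probabilities can be computed by Monte Carlo from the independent Gamma posteriors. Below is a minimal sketch; the shape and rate arguments are placeholders, not the exact posterior parameters from Novick and Grizzle.

# P(lambda_1 / lambda_2 > c | D) by simulation from independent Gamma posteriors
ratio.prob <- function(shape1, rate1, shape2, rate2, c, N = 1e5) {
    lam1 <- rgamma(N, shape = shape1, rate = rate1)
    lam2 <- rgamma(N, shape = shape2, rate = rate2)
    mean(lam1 / lam2 > c)
}
# example call with placeholder posterior parameters:
# ratio.prob(shape1 = 7, rate1 = 100, shape2 = 3, rate2 = 100, c = 1.3)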

Bayesian Learning: Normal-Normal

Using Bayes' rule we get \[p( \mu | y ) \propto p( y| \mu ) p( \mu )\]
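For the conjugate case with \(\bar{y} \mid \mu \sim N( \mu , \sigma^2 / n )\) and \(\mu \sim N( \mu_0 , \tau^2 )\), the standard precision-weighted result (used in the Bears example later) is \[\mu \mid \bar{y} \sim N \left( \mu_n , \tau_n^2 \right), \qquad \mu_n = \frac{ \frac{n}{\sigma^2} \bar{y} + \frac{1}{\tau^2} \mu_0 }{ \frac{n}{\sigma^2} + \frac{1}{\tau^2} }, \qquad \frac{1}{\tau_n^2} = \frac{n}{\sigma^2} + \frac{1}{\tau^2}.\] With \(\mu_0 = 0\) this reduces to the shrinkage estimator \(\frac{ \tau^2 }{ \tau^2 + \sigma^2 / n } \bar{y}\).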

SAT Scores

SAT scores (scale \(200\)–\(800\)): estimate the effects of coaching programs in 8 high schools.

School Estimated Effect \(y_j\) Std. Error \(\sigma_j\) Average Treatment Effect \(\theta_j\)
A 28 15 ?
B 8 10 ?
C -3 16 ?
D 7 11 ?
E -1 9 ?
F 1 11 ?
G 18 10 ?
H 12 18 ?

Estimates

Two programs appear to work (improvements of \(18\) and \(28\))

Hierarchical Model

Hierarchical Model (\(\sigma_j^2\) known) is given by \[\bar{y}_j | \theta_j \sim N ( \theta_j , \sigma_j^2 ), \qquad \theta_j | \mu , \tau \sim N ( \mu , \tau^2 )\] Unequal variances lead to differential shrinkage.

Posterior

Joint Posterior Distribution, with \(y = ( y_1 , \ldots , y_J )\): \[p( \theta , \mu , \tau | y ) \propto p( y| \theta ) p( \theta | \mu , \tau )p( \mu , \tau )\] \[\propto p( \mu , \tau ) \prod_{j=1}^8 N ( \theta_j | \mu , \tau^2 ) \prod_{j=1}^8 N ( y_j | \theta_j , \sigma_j^2 )\] \[\propto \tau^{-9} \exp \left ( - \frac{1}{2} \sum_j \frac{1}{\tau^2} ( \theta_j - \mu )^2 - \frac{1}{2} \sum_j \frac{1}{\sigma_j^2} ( y_j - \theta_j )^2 \right )\] Sample from it with MCMC!
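A minimal Gibbs-sampler sketch for this hierarchical model, assuming a flat prior on \((\mu, \tau)\) (one standard choice; the prior is not stated above), with the data taken from the SAT table:

y     <- c(28, 8, -3, 7, -1, 1, 18, 12)    # estimated effects y_j
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)  # known standard errors sigma_j
J <- length(y); n.iter <- 5000
theta <- matrix(NA, n.iter, J)
mu <- tau <- numeric(n.iter)
theta[1, ] <- y; mu[1] <- mean(y); tau[1] <- sd(y)
set.seed(42)
for (t in 2:n.iter) {
    # theta_j | mu, tau, y ~ N( V_j (y_j/sigma_j^2 + mu/tau^2), V_j )
    V <- 1 / (1 / sigma^2 + 1 / tau[t - 1]^2)
    theta[t, ] <- rnorm(J, V * (y / sigma^2 + mu[t - 1] / tau[t - 1]^2), sqrt(V))
    # mu | theta, tau ~ N( mean(theta), tau^2 / J )
    mu[t] <- rnorm(1, mean(theta[t, ]), tau[t - 1] / sqrt(J))
    # tau^2 | theta, mu ~ scaled Inv-chi^2( J - 1, S/(J - 1) ) under the flat prior
    S <- sum((theta[t, ] - mu[t])^2)
    tau[t] <- sqrt(S / rchisq(1, J - 1))
}
# posterior quantiles, comparable to the table on the next slide
round(t(apply(theta, 2, quantile, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))))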

Posterior Inference

Report posterior quantiles

School 2.5% 25% 50% 75% 97.5%
A -2 6 10 16 32
B -5 4 8 12 20
C -12 3 7 11 22
D -6 4 8 12 21
E -10 2 6 10 19
F -9 2 6 10 19
G -1 6 10 15 27
H -7 4 8 13 23
\(\mu\) -2 5 8 11 18
\(\tau\) 0.3 2.3 5.1 8.8 21

Schools \(A\) and \(G\) are similar!

Bayesian Shrinkage

Bayesian shrinkage provides a way of modeling complex datasets.

  1. Baseball batting averages: Stein’s Paradox

  2. Batter-pitcher match-up: Kenny Lofton and Derek Jeter

  3. Bayes Elections

  4. Toxoplasmosis

  5. Bayes MoneyBall

  6. Bayes Portfolio Selection

Example: Baseball

Batter-pitcher match-up?

Prior information on overall ability of a player.

Small sample size, pitcher variation.

Sports Data: Baseball

Kenny Lofton hitting
Pitcher At-bats Hits ObsAvg
J.C. Romero 9 6 .667
S. Lewis 5 3 .600
B. Tomko 20 11 .550
T. Hoffman 6 3 .500
K. Tapani 45 22 .489
A. Cook 9 4 .444
J. Abbott 34 14 .412
A.J. Burnett 15 6 .400
K. Rogers 43 17 .395
A. Harang 6 2 .333
K. Appier 49 15 .306
R. Clemens 62 14 .226
C. Zambrano 9 2 .222
N. Ryan 10 2 .200
E. Hanson 41 7 .171
E. Milton 19 1 .056
M. Prior 7 0 .000
Total 7630 2283 .299

Kenny Lofton versus individual pitchers.

Baseball

Kenny Lofton

Kenny Lofton (career \(.299\) average, and a current \(.308\) average for the \(2006\) season) was facing the pitcher Milton (current record: \(1\) hit in \(19\) at-bats).

  • Is putting in a weaker player really a better bet?

  • Over-reaction to bad luck?

    \(\mathbb{P}\left ( \leq 1 \; {\rm hit \; in \; } 19 \; {\rm attempts} | p = 0.3 \right ) = 0.01\)

    An unlikely \(1\)-in-\(100\) event.
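This tail probability is a one-line check in R:

pbinom(1, size = 19, prob = 0.3)    # P(at most 1 hit in 19 at-bats given p = 0.3), about 0.010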

Baseball

Kenny Lofton

Bayes solution: shrinkage. Borrow strength across pitchers

Bayes estimate: use the posterior mean

Lofton’s batting estimates vary from \(.265\) to \(.340\).

The lowest being against Milton.

\(.265 < .275\)

Conclusion: resting Lofton against Milton was justified!!

Bayes Batter-pitcher match-up

Here’s our model again ...
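One common specification for such a match-up model (a hedged sketch; \(\alpha\) and \(\beta\) are population-level hyperparameters, as in Stern's analysis below): for match-up \(j\) with \(n_j\) at-bats and \(y_j\) hits, \[y_j \mid p_j \sim Bin ( n_j , p_j ), \qquad p_j \sim Beta ( \alpha , \beta ), \qquad E ( p_j \mid y_j ) = \frac{ \alpha + y_j }{ \alpha + \beta + n_j },\] so each observed average is shrunk toward the player's overall rate \(\alpha / ( \alpha + \beta )\), and most heavily when \(n_j\) is small.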

Example: Derek Jeter

Derek Jeter hierarchical model estimates
Pitcher At-bats Hits ObsAvg EstAvg 95% Int
R. Mendoza 6 5 .833 .322 (.282,.394)
H. Nomo 20 12 .600 .326 (.289,.407)
A.J.Burnett 5 3 .600 .320 (.275,.381)
E. Milton 28 14 .500 .324 (.291,.397)
D. Cone 8 4 .500 .320 (.218,.381)
R. Lopez 45 21 .467 .326 (.291,.401)
K. Escobar 39 16 .410 .322 (.281, .386)
J. Wettland 5 2 .400 .318 (.275,.375)
T. Wakefield 81 26 .321 .318 (.279,.364)
P. Martinez 83 21 .253 .312 (.254,.347)
K. Benson 8 2 .250 .317 (.264,.368)
T. Hudson 24 6 .250 .315 (.260,.362)
J. Smoltz 5 1 .200 .314 (.253,.355)
F. Garcia 25 5 .200 .314 (.253,.355)
B. Radke 41 8 .195 .311 (.247,.347)
D. Kolb 5 0 .000 .316 (.258,.363)
J. Julio 13 0 .000 .312 (.243,.350 )
Total 6530 2061 .316

Derek Jeter 2006 season versus individual pitchers.

Bayes Estimates

Stern estimates \(\hat{\phi} = ( \alpha + \beta + 1 )^{-1} = 0.002\) for Jeter.

The estimated average doesn’t vary much across the population of pitchers.

The extremes are shrunk the most; these are also the matchups with the smallest sample sizes.

Jeter had a season \(.308\) average.

Bayes estimates vary from \(.311\) to \(.327\)–he’s very consistent.

If all players had a similar record then a constant batting average would make sense.

Bayes Elections: Nate Silver

Multinomial-Dirichlet

Predicting the Electoral Vote (EV)

  • Multinomial-Dirichlet: \((\hat{p} | p) \sim Multi (p), \; ( p | \alpha ) \sim Dir (\alpha)\) \[p_{Obama} = ( p_{1}, \ldots ,p_{51} ), \qquad p_{Obama} \mid \hat{p} \sim Dir \left ( \alpha + \hat{p} \right )\]

  • Flat uninformative prior \(\alpha\equiv 1\).

    http://www.electoral-vote.com/evp2012/Pres/prespolls.csv

Bayes Elections: Nate Silver

Simulation

Calculate probabilities via simulation: rdirichlet \[{\color{blue}{p \left ( p_{j,O} | {\rm data} \right ) \;\; {\rm and} \; \; p \left ( EV >270 | {\rm data} \right )}}\]

The election vote prediction is given by the sum \[{\color{red}{EV =\sum_{j=1}^{51} EV(j) \mathbb{E} \left ( p_{j} | {\rm data} \right )}}\] where \(EV(j)\) is the number of electoral votes for state \(j\).

Polling Data: electoral-vote.com

Electoral Vote (EV) polling data: Romney and Obama percentages

library(kableExtra)
library(LearnBayes)
library(plyr)
library(lattice)
file1 <- "http://www.electoral-vote.com/evp2012/Pres/pres_polls.csv"
election.2012 <- read.csv(file = file1)

# Remove a pollster: elect2012 <- election.2012[!grepl('Rasmussen', election.2012$Pollster),]

elect2012 <- election.2012


# Aggregate the data

elect2012 <- ddply(elect2012, .(State), subset, Day == max(Day))

elect2012 <- ddply(elect2012, .(State), summarise, R.pct = mean(GOP), O.pct = mean(Dem), 
    EV = mean(EV))
State R.pct O.pct EV
Alabama 61 38 9
Alaska 55 42 3
Arizona 54 44 11
Arkansas 61 37 6
California 38 59 55
Colorado 47 51 9
Connecticut 40 58 7
D.C. 7 91 3
Delaware 40 59 3
Florida 49 50 29
Georgia 53 45 16
Hawaii 28 71 4
Idaho 65 33 4
Illinois 41 57 20
Indiana 54 44 11
Iowa 47 52 6
Kansas 60 38 6
Kentucky 61 38 8
Louisiana 58 41 8
Maine 41 56 4
Maryland 37 62 10
Massachusetts 38 61 11
Michigan 45 54 16
Minnesota 45 53 10
Mississippi 56 44 6
# Run the Simulation

prob.Obama <- function(mydata) {
    p <- rdirichlet(1000, 500 * c(mydata$R.pct, mydata$O.pct, 100 - mydata$R.pct - 
        mydata$O.pct)/100 + 1)
    mean(p[, 2] > p[, 1])
}

win.probs <- ddply(elect2012, .(State), prob.Obama)
win.probs$Romney <- 1 - win.probs$V1
names(win.probs)[2] <- "Obama"

win.probs$EV <- elect2012$EV

win.probs <- win.probs[order(win.probs$EV), ]
rownames(win.probs) <- win.probs$State

# Plot Probabilities by State

barplot(t(as.matrix(win.probs[, 2:3])), legend = TRUE, args.legend = list(x = "topright", 
    cex = 0.5), names.arg = win.probs$State, horiz = TRUE, cex.names = 0.5, 
    las = 2, xlab = "Probability", main = "Probability of Obama or Romney by State")

# Simulate elections from these state-level probabilities

sim.election <- function(win.probs) {
    winner <- rbinom(51, 1, win.probs$Obama)
    sum(win.probs$EV * winner)
}

sim.EV <- replicate(10000, sim.election(win.probs))
oprob <- sum(sim.EV >= 270)/length(sim.EV)
# probability of Obama winning 270 EV or more:
oprob
## [1] 0.9709


# Lattice Graph
densityplot(sim.EV, plot.points = "rug", xlab = "Electoral Votes for Obama", 
    panel = function(x, ...) {
        panel.densityplot(x, ...)
        panel.abline(v = 270)
        panel.text(x = 285, y = 0.01, "270 EV to Win")
        panel.abline(v = 332)
        panel.text(x = 347, y = 0.01, "Actual Obama")
}, main = "Electoral College Results Probability")

Let’s do the same for the 2008 election!

library(LearnBayes)
data(election.2008)
attach(election.2008)

##  Dirichlet simulation


prob.Obama = function(j)
 {
 p=rdirichlet(5000,500*c(M.pct[j],O.pct[j],100-M.pct[j]-O.pct[j])/100+1)
 mean(p[,2]>p[,1])
 }

## sapply function to compute Obama win prob for all states

Obama.win.probs=sapply(1:51,prob.Obama)

##  sim.EV function

sim.election = function()
 {
 winner = rbinom(51,1,Obama.win.probs)
 sum(EV*winner)
 }

sim.EV = replicate(1000,sim.election())

## histogram of simulated election
hist(sim.EV,min(sim.EV):max(sim.EV),col="blue",prob=T)
abline(v=365,lwd=3)   # Obama received 365 votes
text(375,30,"Actual \n Obama \n total")

Chicago Bears 2014-2015 Season

Hierarchical Model

Update our beliefs in light of new information.

  • In the 2014-2015 season.

    The Bears suffered back-to-back defeats in which their opponents scored over \(50\) points.

    Patriots-Bears \(51-23\)

    Packers-Bears \(55-14\)

  • Their next game was at home against the Minnesota Vikings.

    The current line against the Vikings was \(-3.5\).

    Slightly over a field goal

What’s the Bayes approach to learning the line?

Hierarchical Model

Hierarchical model for the current average win/lose margin this year \[\begin{aligned} \bar{y} | \theta & \sim N \left ( \theta , \frac{\sigma^2}{n} \right ) \sim N \left ( \theta , \frac{18.34^2}{9} \right )\\ \theta & \sim N( 0 , \tau^2 )\end{aligned}\] Here \(n = 9\) games so far, with sample standard deviation \(s = 18.34\) points.

Pre-season prior mean \(\mu_0 = 0\), standard deviation \(\tau = 4\).

Record so-far. Data \(\bar{y} = -9.22\).

Chicago Bears

Bayes Shrinkage estimator \[\mathbb{E} \left ( \theta | \bar{y} , \tau \right ) = \frac{ \tau^2 }{ \tau^2 + \frac{\sigma^2}{n} } \bar{y}\]

The shrinkage factor is \(0.3\)!!

That’s quite a bit of shrinkage.

  • Our updated estimator is \[\mathbb{E} \left ( \theta | \bar{y} , \tau \right ) = - 2.75 > -3.5\] where the current line is \(-3.5\).

  • Based on our hierarchical model, the \(-3.5\) line looks like an over-reaction to the recent blowouts.

    One point change on the line is about \(3\)% on a probability scale.

    Alternatively, calculate a win probability given the line \(=-3.5\).

Chicago Bears

In their last two defeats, the opponents scored over \(50\) points (2014-15 season).

> bears = c(-3, 8, 8, -21, -7, 14, -13, -28, -41)   # point margins in the 9 games so far
> mean(bears)
[1] -9.222222
> sd(bears)
[1] 18.34242
> tau = 4                          # prior standard deviation

> sig2 = sd(bears)*sd(bears)/9     # sampling variance of the mean, sigma^2/n
> tau^2/(sig2+tau^2)               # shrinkage factor
[1] 0.2997225
> 0.29997*-9.22                    # shrunk estimate of theta
[1] -2.765723
> pnorm(-2.76/18)                  # P(Bears win), taking margin ~ N(-2.76, 18^2)
[1] 0.4390677

Home advantage is worth about \(3\) points, and the Vikings had an average record.

Stein’s Paradox

Stein showed that it is possible to make a uniform improvement on the MLE in terms of MSE.

Previously it was thought that any gains from a “complicated” procedure could not be worth the extra trouble (Tukey: savings of not more than 10% in practice).

For \(k\ge 3\), we have the remarkable inequality \[MSE(\hat \theta_{JS},\theta) < MSE(\bar y,\theta) \; \forall \theta\] Bias-variance explanation! Inadmissibility of the classical estimator.
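For reference, one standard form of the James–Stein estimator (a sketch; shrinking toward a fixed point such as \(0\), with \(y_i \sim N( \theta_i , \sigma^2 )\)): \[\hat \theta_{JS} = \left( 1 - \frac{ (k-2) \sigma^2 }{ \sum_{i=1}^k y_i^2 } \right) y ,\] which dominates the MLE for \(k \ge 3\); shrinking toward the grand mean instead uses the factor \((k-3)\) and requires \(k \ge 4\).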

Baseball Data and Shrinkage Estimator: Efron and Morris

Data: 18 major-league players after 45 at bats (1970 season)

bball = read.table("http://www.swarthmore.edu/NatSci/peverso1/Sports%20Data/JamesSteinData/Efron-Morris%20Baseball/EfronMorrisBB.txt",header=TRUE, stringsAsFactors=FALSE)
bball$shrinkage = bball$BattingAverage * .212 + .788 * (0.265)
knitr::kable(bball[c("LastName","BattingAverage","RemainingAverage", "shrinkage")], digits=3)
LastName BattingAverage RemainingAverage shrinkage
Clemente 0.400 0.346 0.294
Robinson 0.378 0.298 0.289
Howard 0.356 0.276 0.284
Johnstone 0.333 0.222 0.279
Berry 0.311 0.273 0.275
Spencer 0.311 0.270 0.275
Kessinger 0.289 0.264 0.270
Alvarado 0.267 0.210 0.265
Santo 0.244 0.269 0.261
Swaboda 0.244 0.230 0.261
Petrocelli 0.222 0.264 0.256
Rodriguez 0.222 0.226 0.256
Scott 0.222 0.303 0.256
Unser 0.222 0.264 0.256
Williams 0.222 0.330 0.256
Campaneris 0.200 0.285 0.251
Munson 0.178 0.316 0.247
Alvis 0.156 0.200 0.242

Shrinkage

Let \(\theta_i\) denote the end of season average

bball$js = bball$BattingAverage * .212 + .788 * (0.265)
bball$LastName[!is.na(match(bball$LastName,   c("Scott","Williams", "Rodriguez", "Unser","Swaboda","Spencer")))] = ""

a = matrix(rep(1:3, nrow(bball)), 3, nrow(bball))
b = matrix(c(bball$BattingAverage, bball$SeasonAverage, bball$js),    3, nrow(bball), byrow=TRUE)

matplot(a, b, pch=" ", ylab="predicted average", xaxt="n", xlim=c(0.5, 3.1), ylim=c(0.13, 0.42))
matlines(a, b)
text(rep(0.7, nrow(bball)), bball$BattingAverage, bball$LastName, cex=0.6)
text(1, 0.14, "First 45\nat bats", cex=0.5)
text(2, 0.14, "Average\nof remainder", cex=0.5)
text(3, 0.14, "J-S\nestimator", cex=0.5)

Batting Averages

# Data was scraped from here: https://www.baseball-almanac.com/players/hittinglogs.php?p=clemero01&y=1970
d = read.csv("https://www.dropbox.com/s/hxp5bieq4ykah84/clemente.csv?dl=1")[-(1:2),]
ab = cumsum(d$AB)
h = cumsum(d$H)
plot(sqrt(ab), h/ab, bty='n', xlab="Number at Bats (square root scale)", ylab="Batting Average")
lines(sqrt(ab), h/ab, col="green")
abline(v=sqrt(45))

Baseball Paradoxes

Is the shrinkage too severe? For Clemente, \(z_{Cl} = 0.265 + 0.212 ( 0.400 - 0.265) = 0.294\).

The \(0.212\) seems a little severe

Baseball Prior

Include extra prior knowledge

Empirical distribution of all major league players \[\theta_i \sim N ( 0.270 , 0.015 )\] The \(0.270\) provides another origin to shrink to and the prior variance \(0.015\) would give a different shrinkage factor.

To fully understand this, perhaps we should build a probabilistic model and use the posterior mean as our estimator of the unknown parameters.

Shrinkage: Unequal Variances

Model \(Y_i | \theta_i \sim N ( \theta_i , D_i )\) where \(\theta_i \sim N ( \theta_0 , A ) \sim N ( 0.270 , 0.015 )\).
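Under this model the posterior mean is the familiar precision-weighted shrinkage estimator (a standard conjugate result): \[E ( \theta_i \mid Y_i ) = ( 1 - B_i ) Y_i + B_i \theta_0 , \qquad B_i = \frac{ D_i }{ D_i + A } ,\] so observations with larger sampling variance \(D_i\) are shrunk more heavily toward \(\theta_0\).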