Estimating proportions and predicting future samples.

A study has been conducted on the effects of exposure to moderate levels of lead (found in certain foods) in the long-term cognitive development of children. Researchers analyzed the lead content in baby teeth of the children, once they had been dropped. Of the 29 children studied who had a lead content greater than 22.22 ppm (parts per million), 22 completed Education Secondary and 7 others did not finish it. We assume a prior distribution p ⇠ Be (1, 1) for the proportion of students completing Secondary Education.

1. Write the likelihood function, L (p) = P (X = x | p), and the posterior distribution, f (p | x) (except multiplicative constants).

I write the likelihood function (or the model) which is, in this case, binomial, P r (X = x | p) = nxpx (1 - p) n − x, x = 0, 1,. . . n. A posteriori distribution is given by: f (p | x) = f (p) P r (X = x | p) / integral f (p) P r (X = x | p) dp ⇒ f (p | x) ∝ f (p) P r (X = x | p) f (p | x) ∝ f (p) P r (x | p) ∝ pα − 1 (1 - p) β − 1 px (1 - p) n − x ∝ px + α − 1 (1 - p) n − x + β − 1, where p ∈ (0, 1) and f (p | x) = 0 for p / ∈ (0, 1). It is observed, then, that p | x ∼ Beta (x + α, n - x + β) α ’= x + α = 22 + 1 = 23 and β’ = n - x + β = 29-22 + 1 = 8 Where: p | x ∼ Beta (23,8)

library(Bolstad)
binogcp(22,29,density="beta", params = c(1,1),n.pi = 1e4)
results<-binogcp(22,29,density="beta", params = c(1,1),n.pi = 1e4)

#Likelihood function
head(results$likelihood,15)
##  [1] 2.383351e-95 7.473975e-85 5.674402e-80 9.298925e-77 2.340491e-74
##  [6] 1.933333e-72 7.622846e-71 1.774463e-69 2.783756e-68 3.213882e-67
## [11] 2.903828e-66 2.147111e-65 1.343439e-64 7.298541e-64 3.513053e-63
mean(results$likelihood)
## [1] 2.135684e-08
#Posterior probability function
head(results$posterior,15)
##  [1] 1.115966e-87 3.499569e-77 2.656948e-72 4.354073e-69 1.095897e-66
##  [6] 9.052523e-65 3.569276e-63 8.308638e-62 1.303449e-60 1.504849e-59
## [11] 1.359671e-58 1.005350e-57 6.290436e-57 3.417425e-56 1.644931e-55
mean(results$posterior)
## [1] 1
head(results$pi,15)
##  [1] 0.00005 0.00015 0.00025 0.00035 0.00045 0.00055 0.00065 0.00075 0.00085
## [10] 0.00095 0.00105 0.00115 0.00125 0.00135 0.00145
mean(results$pi)
## [1] 0.5
mean(results$pi.prior)
## [1] 1

2. Obtain a random sample of size 1000 from the posterior distribution of p, and represent its empirical density, comparing it on the same graph with the prior and posterior density of p. With a sample n = 1000

n=1000
x1 <-rbeta (1000, shape1 = 23, shape2 = 8)

I represent it with:

plot (density (x1), type = "beta", xlab = "Values", ylab = "Empirical Probability Function", lwd = 3, col = 4, main = "")
curve (expr = dbeta (x, 1,1), from =, to =, col = 2, lwd = 3, add = T,) #For the a priori Blue
curve (expr = dbeta (x, 23,8), from =, to =, col = 3, lwd = 3, add = T,) #and later the Green

3. Calculate the posterior mean and posterior standard deviation.

To calculate the posterior mean and posterior standard deviation, I follow a reasoning as in the practical exercises, I use the sintegral function. I call

#For posteriori mean
p1<-results$pi
p2<-results$posterior
dens<-p1*p2
post.mean<-sintegral(p1,dens)$value
post.mean
## [1] 0.7419355

And for the standard deviation a posteriori I do

dens2<-(p1-post.mean)^2*p2
post.var<-sintegral(p1,dens2)$value
post.var
## [1] 0.005983351
round(post.var,4)
## [1] 0.006
post.sd<-sqrt(post.var)
post.sd
## [1] 0.07735212

4. Calculate a 90% probability interval for p.

To calculate the probability interval, first I use the empirical distribution function, for that I call p1 and p2, already found, and use the command

cdf <-sintegral (p1, p2, n.pts = length (p1)) $ cdf
# I calculate the lower limit
lcb <-cdf $ x [with (cdf, which.max (x [y <= 0.05]))]
lcb
## [1] 0.6059394
# and the upper limit of the probability interval
ucb <-cdf $ x [with (cdf, which.max (x [y <= 0.95]))]
ucb
## [1] 0.859714

Remaining the probability interval [0.6059394; 0.859714]

5. Contrast the hypothesis H0: p <= 0.4 against H1: p> 0.4.

I have to calculate the probabilities P0 = R p0 0 f (p | data) and P1 = R 1 p0 f (p | data), with p0 = 0.4. To do this, I define two vectors

p0 = 0.4
x0 = p1 [p1 <= p0]
x1 = p1 [p1> p0]
P0 <-sintegral (x0, p2 [p1 <= p0]) $ value
P1 <-sintegral (x1, p2 [p1> p0]) $ value

Null hypothesis round (P0,6) = 4.9e-05 Alternative Hypothesis round (P1,6) = 0.999951 So we reject the null hypothesis.

6. In a next phase of the study, data are received from 10 other children who have had a lead content of more than 22.22 ppm. Calculate the predictive probability that at least 9 of them finish Secondary Education.

Beta Binomial Predictive Distribution Function

BetaBinom <- Vectorize(function(rp){
  log.val <- lchoose(np, rp) + lbeta(rp+a+r,b+n-r+np-rp) - lbeta(a+r,b+n-r)
  return(exp(log.val))})

n <- 29; r <-22; a <- 1; b <- 1; np <- 10
barplot(1:10,BetaBinom(1:10),type="b",xlab="r*",ylab="P(R=r*|Data)", main = "Posterior predictive: a=1, b=1",space=1,col=4,cex.axis= 1.5,cex.lab=1.5,lwd=4)

sum(BetaBinom(9:10))
## [1] 0.2663718