In-class Project: EM Algorithm and Bootstrap

Epidemiologists are interested in studying the sexual behavior of individuals at risk for HIV infection. Suppose 1500 gay men were surevyed and each was asked how many risky sexual encounts he had in the previous 30 days. Let \(n_i\) denote the number of respondents reporting \(i\) encounters, for \(i=1,2,...,16\).

It is realistic to assume that the responders comprise three groups. First, there is a group of people who, for whatever reason, report zero risky encounters even if this is not true. Suppose a respondent has probability \(\alpha\) of belonging to this group.

With probability \(\beta\), a respondent belongs to the second group representing typical behavior. Such people respond truthfully, and their numbers of risky encounters are assumed to follow a Poisson(\(\mu\)) distribution.

Finally, with probability \(1-\alpha-\beta\), a respondent belongs to a high-risk group. Such people respond truthfully, and their numbers of risky encounters are assumed to follow a Poisson(\(\lambda\)) distribution.

The parameters of the model are \(\theta = (\alpha; \beta; \mu; \lambda)\). At the \(t^{th}\) iteration of EM, we use \(\theta^{(t)} = (\alpha^{(t)}; \beta^{(t)}; \mu^{(t)}; \lambda^{(t)})\) to denote the current parameter values. The likelihood of the observed data is given by \[L(\theta|n_0,...,n_{16})\propto \prod_{i=0}^{16}\left[\frac{\pi_i(\theta)}{i!}\right]^{n_i}\], where \[\pi_i(\theta)=\alpha1_{{i=0}}+\beta\mu^iexp(-\mu)+(1-\alpha-\beta)\lambda^iexp(-\lambda)\] for \(i=1,...,16\).

The EM algorithm provides the following updates:

From the given information, we know that the observed data is \[\textbf{X}=\left(n_0,n_1,...,n_{16}\right)\] And the complete data is \[\textbf{Y}=(n_{z,0},n_{t,0},n_{p,0},n_{t,1},n_{p,1},......,n_{t,16},n_{p,16})\] ,where \[n_0=n_{z,0}+n_{t,0}+n_{p,0} \mbox{ and } n_i=n_{t,i}+n_{p,i}\] Hence, the complete data is distributed as Multinomial distribution. The likelihood function is defined by: \[L(\theta|\textbf{Y})=const\times[z_0(\theta)]^{n_{z,0}}\times[t_0(\theta)]^{n_{t,0}}\times[p_0(\theta)]^{n_{p,0}}...\times[t_{16}(\theta)]^{n_{t,16}}\times[p_{16}(\theta)]^{n_{p,16}}\] Then log-likelihood function is given by: \[logL(\theta|\textbf{Y})=log(const)+{n_{z,0}}log[z_0(\theta)]+{n_{t,0}}log[t_0(\theta)]+{n_{p,0}}log[p_0(\theta)]+...+{n_{t,16}}log[t_{16}(\theta)]+{n_{p,16}}log[p_{16}(\theta)]\] ,where \[\begin{aligned} z_0(\theta)&=\frac{\alpha}{\pi_0(\theta)}\\ t_i(\theta)&=\frac{\beta\mu^i exp(-\mu)}{\pi_i(\theta)}\\ p_i(\theta)&=\frac{(1-\alpha-\beta)\lambda^i exp(-\lambda)}{\pi_i(\theta)} \end{aligned}\] and \[\pi_i(\theta)=\alpha 1_{(i=0)}+\beta\mu^i exp(-\mu)+(1-\alpha-\beta)\lambda^i exp(-\lambda)\] Take expectation of log-likelihood, we have \[ \begin{aligned} E(logL(\theta|\textbf{Y})|\textbf{X},\theta^{(t)})&=E(log(const))+E({n_{z,0}}log[z_0(\theta^{(t)})])+...+E({n_{t,16}}log[t_{16}(\theta^{(t)})])+E({n_{p,16}}log[p_{16}(\theta^{(t)})])\\ &=log(const)+E({n_{z,0}})log[z_0(\theta^{(t)})]+...+E({n_{t,16}})log[t_{16}(\theta^{(t)})]+E({n_{p,16}})log[p_{16}(\theta^{(t)})]\\ &=log(const)+{n_{z,0}}z_{0}(\theta)log[z_0(\theta^{(t)})]+...+{n_{t,16}}t_{16}(\theta)log[t_{16}(\theta^{(t)})]+{n_{p,16}}p_{16}(\theta)log[p_{16}(\theta^{(t)})] \end{aligned} \] Take partial derivatives of expectation of log-likelihood with respect to each parameter, and equate to zero, we have \[\frac{\partial{E(logL(\theta|\textbf{Y})|\textbf{X},\theta^{(t)})}}{\partial{\alpha}}=0, \alpha^{(t+1)}=\frac{n_0 z_0(\theta^{(t)})}{N}\] \[\frac{\partial{E(logL(\theta|\textbf{Y})|\textbf{X},\theta^{(t)})}}{\partial{\beta}}=0, \beta^{(t+1)} = \sum_{i=0}^{16} \frac{n_i t_i(\theta^{(t)})}{N}\] \[\frac{\partial{E(logL(\theta|\textbf{Y})|\textbf{X},\theta^{(t)})}}{\partial{\mu}}=0, \mu^{(t+1)} = \frac{\sum_{i=1}^{16}i n_i t_i(\theta^{(t)})}{\sum_{i=1}^{16}n_i t_i(\theta^{(t)})}\] \[\frac{\partial{E(logL(\theta|\textbf{Y})|\textbf{X},\theta^{(t)})}}{\partial{\lambda}}=0, \lambda^{(t+1)} = \frac{\sum_{i=1}^{16}i n_i p_i(\theta^{(t)})}{\sum_{i=1}^{16}n_i p_i(\theta^{(t)})}\]

Estimate the parameters of the model, using the observed data.

dat <- c(379,299,222,145,109,95,73,59,45,30,24,12,4,2,0,1,1)

pi_i <- function(alpha, beta, mu, lambda,i) {
  if (i==0) return(alpha + beta * exp(-mu) + (1-alpha-beta)*exp(-lambda))
  else return(beta* mu^i* exp(-mu) + (1-alpha-beta)* lambda^i *exp(-lambda))
}

z_0 <- function(alpha, beta, mu, lambda,i) {
  return(alpha / pi_i(alpha, beta, mu, lambda,i))
}

t_i <- function(alpha, beta, mu, lambda,i){
  return(beta * mu^i * exp(-mu)/pi_i(alpha, beta, mu, lambda,i))
}
p_i <- function(alpha, beta, mu, lambda,i){
  return((1-alpha-beta) * lambda^i * exp(-lambda)/pi_i(alpha, beta, mu, lambda,i))
}

##Initial Values
alpha <- 0.6
beta <- 0.1
mu <- 3
lambda <- 4
j <- c(0:16)
n_i <- dat
result <- c(1, alpha, beta, mu, lambda)

esp <- 1e-3
r <- 1

#Run iterations
repeat{
  alpha_t <- alpha
  beta_t <- beta
  mu_t <- mu
  lambda_t <- lambda
  
  alpha_t1 <- n_i[1] * z_0(alpha_t,  beta_t, mu_t, lambda_t, j[1])/sum(n_i)
  
  b <- 0
  for (i in 1:length(j))  {
    b <- b + n_i[i]*t_i(alpha_t,  beta_t, mu_t, lambda_t, j[i])
  }
  beta_t1 <- b/sum(n_i)
  
  m <- 0
  for (i in 1:length(j))  {
    m <- m + j[i]*n_i[i]*t_i(alpha_t,  beta_t, mu_t, lambda_t, j[i])/b
  }
  mu_t1 <- m
  
  l <- 0
  p <- 0
  for (i in 1:length(j))  {
    p <- p + n_i[i]*p_i(alpha_t,  beta_t, mu_t, lambda_t, j[i])
  }
  for (i in 1:length(j))  {
    l <- l + j[i]*n_i[i]*p_i(alpha_t,  beta_t, mu_t, lambda_t, j[i])/p
  }
  lambda_t1 <- l
  
  if(abs(alpha_t1-alpha_t) < esp && abs(beta_t1-beta_t) < esp 
     && abs(mu_t1-mu_t) < esp && abs(lambda_t1-lambda_t) < esp) break
  r <- r+1
  result <- rbind(result, c(r, alpha_t1,beta_t1,mu_t1,lambda_t1))
  
  alpha <- alpha_t1
  beta <- beta_t1
  mu <- mu_t1
  lambda <- lambda_t1
}
colnames(result) <- c("Iteration", "Alpha","Beta","Mu","Lambda")
tail(result)

##  Iteration     Alpha      Beta       Mu   Lambda
##         34 0.1198552 0.5625465 1.453134 5.921141
##         35 0.1200528 0.5625478 1.454361 5.922672
##         36 0.1202342 0.5625479 1.455484 5.924067
##         37 0.1204004 0.5625474 1.456513 5.925341
##         38 0.1205527 0.5625465 1.457454 5.926505
##         39 0.1206921 0.5625454 1.458316 5.927569

From the output, the estimated parameters are \(\hat{\alpha}=0.12\), \(\hat{\beta}=0.56\), \(\hat{\mu}=1.46\), \(\hat{\lambda}=5.93\). Note: if we set the initial value \(\mu = \lambda\), we would get a different result.

Estimate the standard errors and pairwise correlations of your parameter estimates, using any available method.

The bootstrapping method is used for this question.

B = 1000                 # number of boostrapped samples
n = sum(dat)            
prob0 = dat/n           

set.seed(2)
result.boot <- NULL

for(i in 1: B) {          # loops begin
  freqi = rmultinom(1, size=n, prob=prob0);
  ##Initial Values
  alpha <- 0.6
  beta <- 0.1
  mu <- 3
  lambda <- 4
  j <- c(0:16)
  n_i <- freqi
  
  esp <- 1e-3
  #Run iterations
  repeat{
    alpha_t <- alpha
    beta_t <- beta
    mu_t <- mu
    lambda_t <- lambda
    
    alpha_t1 <- n_i[1] * z_0(alpha_t,  beta_t, mu_t, lambda_t, j[1])/sum(n_i)
    
    b <- 0
    for (i in 1:length(j))
    {
      b <- b + n_i[i]*t_i(alpha_t,  beta_t, mu_t, lambda_t, j[i])
    }
    beta_t1 <- b/sum(n_i)
    
    m <- 0
    for (i in 1:length(j))
    {
      m <- m + j[i]*n_i[i]*t_i(alpha_t,  beta_t, mu_t, lambda_t, j[i])/b
    }
    mu_t1 <- m
    
    l <- 0
    p <- 0
    for (i in 1:length(j))
    {
      p <- p + n_i[i]*p_i(alpha_t,  beta_t, mu_t, lambda_t, j[i])
    }
    for (i in 1:length(j))
    {
      l <- l + j[i]*n_i[i]*p_i(alpha_t,  beta_t, mu_t, lambda_t, j[i])/p
    }
    lambda_t1 <- l
    
    if(abs(alpha_t1-alpha_t) < esp && abs(beta_t1-beta_t) < esp
       && abs(mu_t1-mu_t) < esp && abs(lambda_t1-lambda_t) < esp) break
    
    
    alpha <- alpha_t1
    beta <- beta_t1
    mu <- mu_t1
    lambda <- lambda_t1
  }
  result.boot <- rbind(result.boot, c(alpha_t1,beta_t1,mu_t1,lambda_t1))
}
cov(result.boot)

##               [,1]          [,2]         [,3]        [,4]
## [1,]  0.0004240705 -2.068329e-04 1.631691e-03 0.001491914
## [2,] -0.0002068329  4.899533e-04 3.871143e-05 0.001410703
## [3,]  0.0016316909  3.871143e-05 1.190641e-02 0.012663369
## [4,]  0.0014919137  1.410703e-03 1.266337e-02 0.036775035

cor(result.boot)

##            [,1]        [,2]       [,3]      [,4]
## [1,]  1.0000000 -0.45375697 0.72615349 0.3777879
## [2,] -0.4537570  1.00000000 0.01602771 0.3323392
## [3,]  0.7261535  0.01602771 1.00000000 0.6051768
## [4,]  0.3777879  0.33233922 0.60517682 1.0000000