MATH8670 Homework 3

1. [15 points] EM algorithm for missing data.

Simulate data with missing values. Generate \(x= (x1,...,x_10)\) from the Normal distribution \(N(\mu,\sigma^2)\). Randomly pick 1 out of the 10 values and mask it as a missing value. Form the final data set by \(z= (x_1,...,x_9,y)\) where \(y\) is NA.

set.seed(1777)

x <- rnorm(10, 5, 2)

y <- x[10]

x[10] <- NA

x; y

##  [1] 5.086176 9.393123 4.262776 5.740724 4.102214 4.351987 5.454146 6.777824
##  [9] 4.185534       NA

## [1] 5.876828

If your goal is to impute the missing value, what strategies would you try?

There are many methods to choose from and it typically depends on the dimensions of your data set. If this case, assuming I knew nothing about the distribution from which the data came from, I would likely use a Shapiro-Wilk to test for normality. From there, a simple solution would be replacing the missing value with the mean. If the data was not normal I might choose the median instead. If I had a bigger data set I might use some sort imputation via bagged trees or knn.

Now consider the EM algorithm. Start with \(\theta_0 = (\mu_0, \sigma_0)\) as the initial estimate of parameters. You may use your own way to find the initiate estimate

theta_0 <- c(mean(sample(x, 4), na.rm = T), sd(sample(x, 4), na.rm = T))

E-Step: Given \(\theta_0\), derive the pdf of conditional distribution of \(y \ | \ x_1, . . . , x_9, \theta_0\).

We first note the following: \[ \\ g(x \ | \ \theta_0) = \text{The joint pdf of }x, \\ h(x, y \ | \ \theta_0) = \text{The joint pdf of }x \text{ and } y.\]

Now, we can define the conditional pdf as: \[ \\ k(y \ | \ \theta_0, x) = \frac{h(x, y \ | \ \theta_0)}{g(x \ | \ \theta_0)}\]

E-Step: Find the formula of the complete log-likelihood function \(l(\theta; x_1, . . . , x_9, y)\).

\[\begin{split} \log L(\theta \ | \ x) =&\ \int \log L(\theta \ | \ x)k(y \ | \ \theta_0, x)dy\\ =&\ \int \log g(x \ | \ \theta)k(y \ | \ \theta_0, x)dy\\ =&\ \int [\log h(x, y \ | \ \theta) - \log k(y \ | \ \theta, x)]k(y \ | \ \theta_0, x)dy\\ =&\ \int [\log h(x, y \ | \ \theta)]k(y \ | \ \theta_0, x)dy - \int \log[k(y \ | \ \theta, x)]k(y \ | \ \theta_0, x)dy \end{split}\]

E-Step: Derive the formula of conditional expectation \(Q(\theta) = E_y \ | \ x_1,...,x_9,\theta_0 [l(\theta; x_1, . . . , x_9, y)].\)

\[\begin{split} Q(\theta \ | \ \theta_0, x) =&\ \int [\log h(x, y \ | \ \theta)]k(y \ | \ \theta_0, x)dy\\ =&\ E_{\theta_0}[\log L(\theta \ | \ x, y) \ | \ \theta_0, x] \end{split}\]

M-step. Find the expression of \(\theta\) that maximizes \(Q(\theta)\).

\[\hat \theta^{m + 1}=arg \ max[Q(\theta \ | \ \hat \theta^m, x)]\]

Write R code for the E-step and M-step. Repeat them until the result converges.

e_step <- function(parvec) {
  sum(-0.5 * log(parvec[2]) - 0.5 * (x[-10] - parvec[1])^2 / parvec[2])
}

m_step <- optim(
  par = theta_0, 
  fn = e_step, 
  method = "L-BFGS-B",         # this method lets set lower bounds
  lower = 0.0000001,           # lower limit for parameters
  control = list(fnscale = -1) # maximize the function
)

m_step$message

## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

paste0(c("Mu: ", "Sigma: "), round(m_step$par, 3))

## [1] "Mu: 5.484"    "Sigma: 2.616"

Use the final estimate \(\theta\) to impute the missing value.

rnorm(1, m_step$par[1], m_step$par[2])

## [1] 4.554501

2. [15 points] Proofs in Hidden Markov Model.

\(\alpha_t(j) = \sum^K_{i=1} \alpha_{t-1}(i) \alpha_{ij}\phi_j(xt).\)

\[\begin{split} \alpha_t(j) =&\ P(x_1, . . . x_t, z_t = j) \\ =&\ \sum_{i =1}^K P(x_1, . . . x_t, z_t = j, z_{t-1} = i) \\ =&\ \sum_{i =1}^K P(x_1, . . . x_{t-1}, z_{t-1} = i) \cdot P(x_t, z_t = j \ | \ z_{t-1} = i, (x_1, . . . x_{t-1}) \\ =&\ \sum_{i =1}^K\alpha_{t-1}(i)\cdot P(x_t, z_t = j \ | \ z_{t-1} = i) \\ =&\ \sum_{i =1}^K\alpha_{t-1}(i)\cdot P(x_t \ | \ z_t = j, z_{t-1} = i) \cdot P(z_t = j \ | \ z_{t-1} = i) \\ =&\ \sum_{i =1}^K\alpha_{t-1}(i)\cdot P(x_t \ | \ z_t = j) \cdot P(z_t = j \ | \ z_{t-1} = i) \\ =&\ \sum^K_{i=1} \alpha_{t-1}(i) \cdot \alpha_{ij}\phi_j(xt) \end{split}\]

\(\beta_t(i) = \sum_K^{j=1} \alpha_{ij}\phi_j (x_{t+1})\beta_{t+1}(j).\)

\[\begin{split} \beta_t(i) =&\ P(x_{t+1}, . . . x_T \ | \ z_t = i) \\ =&\ \sum_{j =1}^K P(x_{t+1}, . . . x_T, z_{t+1} = j \ | \ z_{t} = i) \\ =&\ \sum_{j =1}^K P(z_{t+1}=j \ | \ z_t = i)\cdot P(x_{t+1}, . . . x_T \ | \ z_{t+1} = j, z_{t} = i) \\ =&\ \sum_{j =1}^K\alpha_{ij}\cdot P(x_{t+1}, . . . x_T \ | \ z_{t+1} = j ) \\ =&\ \sum_{j =1}^K\alpha_{ij}\cdot P(x_{t+1} \ | \ z_{t+1} = j) \cdot P(x_{t+2}, . . . x_T, \ | \ z_{t+1} = j, x_{t+1}) \\ =&\ \sum_{j =1}^K\alpha_{ij}\cdot P(x_{t+1} \ | \ z_{t+1} = j) \cdot P(x_{t+2}, . . . x_T, \ | \ z_{t+1}) \\ =&\ \sum_K^{j=1} \alpha_{ij}\cdot \phi_j (x_{t+1})\cdot \beta_{t+1}(j) \end{split}\]

\(f(z_t = j, x_1, . . . x_T \ | \ \theta) = \alpha_t(j)\beta_t(j).\)

\[\begin{split} f(z_t = j, x_1, . . . x_T \ | \ \theta) =&\ f(x_1, . . . x_t, . . . x_T, z_t = j) \\ =&\ f(x_1, . . . x_t, z_t = j)\cdot f(x_{t+1}, . . . x_T \ | \ z_t = J, x_1, . . . x_t) \\ =&\ \alpha_t(j) \cdot P(x_{t+1}, . . . x_T \ | \ z_t = j) \\ =&\ \alpha_t(j)\cdot \beta_t(j) \end{split}\]

\(f(x1, . . . xT \ | \ \theta) = \sum_{j =1}^K \alpha_t(j)\beta_t(j).\)

\[\begin{split} f(x1, . . . xT \ | \ \theta) =&\ f(x_1, . . . x_t, . . . x_T) \\ =&\ \sum_{j =1}^K f(x_1, . . . x_t, . . . x_T, z_t=j) \\ =&\ \sum_{j =1}^K f(x_1, . . . x_t, z_t = j)\cdot P(x_{t+1}, . . . x_T \ | \ z_t, x_1, . . . x_t)\\ =&\ \sum_{j =1}^K \alpha_t(j) \cdot P(x_{t+1}, . . . x_T \ | \ z_t) \\ =&\ \sum_{j =1}^K \alpha_t(j)\cdot \beta_t(j) \end{split}\]

\(f(z_t = i, z_{t+1} = j, x_1, . . . , x_T \ | \ \theta) = \alpha_t(i)\beta_{t+1}(j)\alpha_{ij}\phi_j (x_{t+1}).\)

\[\begin{split} (z_t = i, z_{t+1} = j, x_1, . . . , x_T \ | \ \theta) =&\ f(x_1, . . . x_t, . . . x_T, z_t=i, z_{t+1} = j) \\ =&\ f(x_1, . . . x_t, z_t = i)\cdot f(x_{t+1}, z_{t+1}= j \ | \ z_t = i, x_1, . . . x_t)\cdot \\ &\ f(x_{t+2}, . . . x_T \ | \ z_{t+1} = j, z_t = i, x_1, . . . x_{t+1}) \\ =&\ \alpha_t(i)\cdot f(x_{t+1}, z_{t+1} = j \ | \ z_t = i) \cdot f(x_{t+2}, . . . x_T \ | \ z_{t+1} = j) \\ =&\ \alpha_t(i)\cdot f(x_{t+1} \ | \ z_t = i) \cdot f(x_{t+2} \ | \ z_{t+1} = j, z_t = i) \beta_{t+1}(j) \\ =&\ \alpha_t(j)\cdot \alpha_{ij} \cdot f(x_{t+1} \ | \ z_{t+1} = j) \cdot \beta_{t+1}(j) \\ =&\ \alpha_t(j)\cdot \alpha_{ij} \cdot \phi_j (x_{t+1}) \cdot \beta_{t+1}(j) \end{split}\]

3. [10 points] Bayesian Networks.

Exercise 3.3 & 3.4 in textbook “Bayesian Reasoning and Machine Learning”

The Chest Clinic network concerns the diagnosis of lung disease (tuberculosis, lung cancer, or both, or neither), see the following image. In this model a visit to Asia is assumed to increase the probability of tuberculosis. State if the following conditional independence relationships are true or false.

tuberculosis \(\perp\) smoking | shortness of breath

Path is not blocked from t to s: e is a collider and d (its decendant) is in the conditioning set, thus FALSE.

lung cancer \(\perp\) bronchitis | smoking

Path from l to b: every path is blocked, thus TRUE.

visit to Asia \(\perp\) smoking | lung cancer

Path from a to s: every path is blocked, thus TRUE.

visit to Asia \(\perp\) smoking | lung cancer, shortness of breath

Path from a to s: d is a collider and in the conditioning - all other nodes for that path are not colliders and not in the conditioning set, thus FALSE.

Calculate by hand the values for \(p(d)\), \(p(d \ | \ s = \text{T})\), \(p(d \ | \ s = \text{F})\)

\[\begin{split} p(d) =&\ \sum p(a,s,t,l,b,e,x,d) \\ =&\ \sum p(a)p(s)p(t|a)p(l|s)p(b|s)p(e|t, l)p(x|e)p(d|b, e) \\ =&\ \sum \bigg(p(a)p(s)p(t|a)p(l|s)p(b|s)p(e|t, l)p(d|b, e)\sum p(x|e) \bigg) \\ =&\ \sum \bigg(p(s)p(l|s)p(b|s)p(e|t, l)p(d|b, e) \sum p(t|a)p(a) \bigg) \\ &\ p(t = \text{T}) = 0.05\cdot0.01+0.01\cdot0.99 = 0.0104 \\ =&\ \sum \bigg(p(s)p(l|s)p(b|s)p(d|b, e) \sum p(e|t, l)p(t) \bigg) \\ &\ p(e = \text{T}\ | \ l = \text{T}) = 1\cdot0.0104+1\cdot0.9896 = 1 \\ &\ p(e = \text{T}\ | \ l = \text{F}) = 1 \cdot 0.0104 + 0 \cdot 0.9896 = 0.0104 \\ =&\ \sum p(s)p(b|s)p(d|b, e) \sum p(e|l)p(l|s) \\ =&\ \sum p(b|s)p(s) \sum p(d|b, e)p(e|s) \\ &\ p(d = \text{T}\ | \ b = \text{T}, s = \text{T}) = 0.9 \cdot 0.10936 + 0.8 \cdot 0.89064 = 0.8109360 \\ &\ p(d = \text{T}\ | \ b = \text{T}, s = \text{F}) = 0.9 \cdot 0.020296 + 0.8 \cdot 0.979704 = 0.8020296 \\ &\ p(d = \text{T}\ | \ b = \text{F}, s = \text{T}) = 0.7 \cdot 0.10936 + 0.1 \cdot 0.89064 = 0.1656160 \\ &\ p(d = \text{T}\ | \ b = \text{F}, s = \text{F}) = 0.7 \cdot 0.020296 + 0.1 \cdot 0.979704 = 0.1121776 \\ =&\ \sum p(s) \sum p(d|b, s)p(b|s) \\ &\ p(d = \text{T}\ | \ s = \text{T}) = 0.8109360 \cdot 0.6 + 0.1656160 \cdot 0.4 = 0.5528080 * \\ &\ p(d = \text{T}\ | \ s = \text{F}) = 0.8020296 \cdot 0.3 + 0.1121776 \cdot 0.7 = 0.3191332 * \\ =&\ \sum p(d|s)p(s) \\ &\ p(d = \text{T}) = 0.5528080 \cdot 0.5 + 0.3191332 \cdot 0.5 = 0.4359706 * \end{split}\]

Thus, we have: \[\begin{split} p(d = \text{T}) =&\ 0.4359706 \\ p(d = \text{T}\ | \ s = \text{T}) =&\ 0.5528080 \\ p(d = \text{T}\ | \ s = \text{F}) =&\ 0.3191332 \\ \end{split}\]