1. Inference in DAGs using rejection sampling

Here is a little Greek-gods-and-weather DAG written up in R, with an example of how to take “forward samples” from the full joint distribution on its variables.

flip = function(p) runif(1) < p  # returns TRUE with probability p
rain.sim = function(i) {  # i is a dummy index so rain.sim can be used with sapply()
    flip = function(p) runif(1) < p
    Zeus.angry = flip(.3)
    Zeus.crying = flip(.2)
    thunder = (Zeus.angry && flip(.7)) || flip(.3)
    rain = (Zeus.angry && flip(.8)) || (Zeus.crying && flip(.99))
    forgetful.gardener = flip(.4)
    sprinkler = forgetful.gardener && flip(.7)
    wet.grass = (rain && flip(.8)) || (sprinkler && flip(.9)) || flip(.1)
    return(c(Zeus.angry = Zeus.angry, Zeus.crying = Zeus.crying, thunder = thunder, rain=rain,
             forgetful.gardener = forgetful.gardener, sprinkler = sprinkler, wet.grass = wet.grass))
}
sapply(1:10, FUN=rain.sim)
##                     [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]
## Zeus.angry         FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Zeus.crying        FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## thunder            FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## rain               FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## forgetful.gardener  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
## sprinkler           TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## wet.grass           TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
##                    [,10]
## Zeus.angry          TRUE
## Zeus.crying        FALSE
## thunder            FALSE
## rain                TRUE
## forgetful.gardener FALSE
## sprinkler          FALSE
## wet.grass           TRUE
  a. Work through the code and draw a DAG that encodes the relationships of immediate dependence and conditional independence between the variables. You can submit this part of the problem in any format you like (using fancy drawing software, pencil-and-paper, or anything in between - just make sure you put your name on the result!).

[Figure: DAG for the Greek-gods-and-weather model]

  b. What is the relationship between thunder and wet.grass? If you hear thunder while in bed at night, have you learned something about how likely it is that the grass will be wet in the morning? Why or why not?

They are not (unconditionally) independent, so yes: hearing thunder makes it more likely that Zeus is angry, which makes it more likely that it is raining, which in turn makes it more likely that the grass will be wet in the morning. In the DAG, information flows along the path thunder ← Zeus.angry → rain → wet.grass, which is unblocked when nothing else is observed.

  c. What if you also saw rain through the window - should learning about whether there was thunder still influence your best guess about the probability of wet grass?

No. thunder and wet.grass are conditionally independent given rain: the only path between them, thunder ← Zeus.angry → rain → wet.grass, passes through rain, which is the parent of wet.grass. Conditioning on rain therefore blocks the flow of information along this path, so learning that there was thunder, in addition to knowing that it was raining, does not change the probability of wet grass.

Here’s an example of how to estimate probabilities from this DAG using Monte Carlo sampling:

query = function(variable.name, n.sims = 500) {
    sims = sapply(1:n.sims, FUN=rain.sim)
    proportion = length(which(sims[variable.name,] == TRUE))/n.sims
    return(proportion)
}
query('Zeus.angry')
## [1] 0.3
query('thunder')
## [1] 0.44
  d. For each variable, run the query() function to find out its approximate unconditional probability. (Or, more precisely, its approximate probability conditional on this model.) Make sure that you surround the variable name in quotes (single or double), both here and when working with the conditional.query() function below.
query('Zeus.angry')
## [1] 0.328
query('Zeus.crying')
## [1] 0.208
query('thunder')
## [1] 0.44
query('rain')
## [1] 0.36
query('forgetful.gardener')
## [1] 0.418
query('sprinkler')
## [1] 0.234
query('wet.grass')
## [1] 0.542
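
As a quick sanity check on these estimates (my own addition, not part of the assignment), the marginals can also be computed exactly, since in this particular DAG the disjuncts feeding each OR are mutually independent, so \(P(A \text{ or } B) = 1 - (1 - P(A))(1 - P(B))\). A minimal sketch:

# Exact marginals for comparison with the Monte Carlo estimates above.
p.angry <- 0.3
p.crying <- 0.2
p.thunder <- 1 - (1 - p.angry * 0.7) * (1 - 0.3)             # (Zeus.angry && flip(.7)) || flip(.3)
p.rain <- 1 - (1 - p.angry * 0.8) * (1 - p.crying * 0.99)    # (Zeus.angry && flip(.8)) || (Zeus.crying && flip(.99))
p.sprinkler <- 0.4 * 0.7                                     # forgetful.gardener && flip(.7)
p.wet.grass <- 1 - (1 - p.rain * 0.8) * (1 - p.sprinkler * 0.9) * (1 - 0.1)
c(thunder = p.thunder, rain = p.rain, sprinkler = p.sprinkler, wet.grass = p.wet.grass)
# thunder = 0.447, rain ~ 0.390, sprinkler = 0.28, wet.grass ~ 0.537,
# broadly in line with the query() estimates above, given that each is based on only 500 samples.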

Most often, what we’re doing with a DAG is observing some ‘downstream’ variables (effects) and trying to infer the values of some ‘upstream’ variables (causes). This means that the unconditional probability is not really what’s of interest, but rather the conditional probability given our observations. Here’s an example of how to write a conditional query function that will estimate conditional probability using rejection sampling - a simple and reliable, but frequently inefficient, method of Monte Carlo approximation.

conditional.query = function(query.variable, conditioners, n.sims=1000) {
    samples = c()
    while (length(samples) < n.sims) {
      Zeus.angry = flip(.3)
      Zeus.crying = flip(.2)
      thunder = (Zeus.angry && flip(.7)) || flip(.3)
      rain = (Zeus.angry && flip(.8)) || (Zeus.crying && flip(.99))
      forgetful.gardener = flip(.4)
      sprinkler = forgetful.gardener && flip(.7)
      wet.grass = (rain && flip(.8)) || (sprinkler && flip(.9)) || flip(.1)
      # names must match the variable-name strings passed as query.variable / conditioners
      truth.values = c(Zeus.angry = Zeus.angry, Zeus.crying = Zeus.crying, thunder = thunder, rain = rain,
                       forgetful.gardener = forgetful.gardener, sprinkler = sprinkler, wet.grass = wet.grass)
      query.index = which(query.variable == names(truth.values))
      conditioner.indices = which(names(truth.values) %in% conditioners)
      if (all(truth.values[conditioner.indices] == T)) { 
        samples = c(samples, truth.values[query.index])
      }
      # accept sample only if all conditioning statements are true
      # repeat until you have n.sims-many samples
    } 
    return(length(which(samples == T)) / n.sims) # proportion 'true' among accepted samples
}
  e. What is the estimated \(P(\mathbf{wet\ grass})\)? What about \(P(\mathbf{wet\ grass}|\mathbf{thunder})\), \(P(\mathbf{wet\ grass}|\mathbf{thunder},\mathbf{Zeus\ angry})\), and \(P(\mathbf{wet\ grass}|\mathbf{Zeus\ angry})\)? Explain, in terms of the conditional independence relationships encoded in the DAG you drew above, why the approximate probabilities are distributed in this way.
query('wet.grass')
## [1] 0.584
conditional.query('wet.grass', c('thunder'))
## [1] 0.635
conditional.query('wet.grass', c('thunder', 'Zeus.angry'))
## [1] 0.791
conditional.query('wet.grass', c('Zeus.angry'))
## [1] 0.807

We can see that \(P(\mathbf{wet\ grass}|\mathbf{thunder})\) is greater than \(P(\mathbf{wet\ grass})\), because knowing thunder increases the probability of Zeus.angry, which in turn increases the probability of rain and ultimately wet.grass. For the same reason, we can see \(P(\mathbf{wet\ grass}|\mathbf{Zeus\ angry})\) is also greater than \(P(\mathbf{wet\ grass})\).

However, Zeus.angry lies on the only path from thunder to wet.grass (thunder ← Zeus.angry → rain → wet.grass) as a common cause of thunder and rain, so conditioning on it blocks the information flow from thunder to wet.grass; i.e., thunder and wet.grass are conditionally independent given Zeus.angry. Therefore \(P(\mathbf{wet\ grass}|\mathbf{thunder},\mathbf{Zeus\ angry})\) is roughly the same as \(P(\mathbf{wet\ grass}|\mathbf{Zeus\ angry})\): knowing about thunder in addition to Zeus.angry gives no further information about wet.grass.
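
To back up the d-separation claims from parts (c) and (e) numerically (my own check, not required by the problem; the exact values will vary from run to run), we can compare a few more rejection-sampling estimates:

# thunder and wet.grass should be (approximately) conditionally independent given rain ...
conditional.query('wet.grass', c('rain'))
conditional.query('wet.grass', c('rain', 'thunder'))
# ... and conditionally independent given Zeus.angry
conditional.query('wet.grass', c('Zeus.angry'))
conditional.query('wet.grass', c('Zeus.angry', 'thunder'))
# Each pair of estimates should agree up to Monte Carlo noise (n.sims = 1000 per call).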

  f. How is the performance (runtime) of conditional.query() affected as you increase the number of items in the conditioners vector that you feed to this function? Using a for-loop (or something fancier if you want), test this question systematically for conditioners vectors with lengths from 0 to 7. Use system.time() to time the results.

Ideally, for each variable \(X\) and each number \(i\), we would average the time it takes to estimate the conditional probability of \(X\) given every possible set of \(i\) conditioning variables. However, this would take too long (e.g., for \(i = 3\) alone there are \({7 \choose 3} = 35\) different sets of 3 variables, and 7 choices of \(X\)). Instead, we randomly permute the seven variables, query the last variable in the permutation, and condition on the first \(i\) variables for \(i = 0, \dots, 7\) (note that at \(i = 7\) the query variable itself is among the conditioners), observing how the runtime changes as the conditioning set grows. We repeat this process 3 times.

rain.vars <- c('Zeus.angry', 'Zeus.crying', 'thunder', 'rain', 'forgetful.gardener', 'sprinkler', 'wet.grass')

track.time <- function(variables){
  variables <- sample(variables, 7)  # randomly permute the variables
  times <- rep(0, 8)

  # unconditional probability of the last variable (empty conditioning set)
  times[1] <- system.time(conditional.query(variables[7], c()))['elapsed']
  # probability of the last variable, conditioned on the first i variables of the permutation
  for (i in 1:7){
    times[i+1] <- system.time(conditional.query(variables[7], variables[1:i]))['elapsed']
  }
  # label each timing with the number of conditioners and the variable added at that step
  names(times) <- paste(0:7, c('unconditional', variables))
  return(times)
}
track.time(rain.vars) # variable 7 conditioned on the first i variables
##      0 unconditional        1 Zeus.crying         2 Zeus.angry 
##                0.108                0.263                0.807 
##               3 rain          4 wet.grass          5 sprinkler 
##                0.780                0.858                2.743 
## 6 forgetful.gardener            7 thunder 
##                2.554                3.353
track.time(rain.vars) # variable 7 conditioned on the first i variables
##      0 unconditional 1 forgetful.gardener         2 Zeus.angry 
##                0.074                0.081                0.167 
##            3 thunder          4 wet.grass        5 Zeus.crying 
##                0.207                0.265                1.017 
##               6 rain          7 sprinkler 
##                1.018                3.313
track.time(rain.vars) # variable 7 conditioned on the first i variables
##      0 unconditional        1 Zeus.crying 2 forgetful.gardener 
##                0.075                0.259                0.236 
##          3 wet.grass               4 rain            5 thunder 
##                0.274                0.265                0.614 
##          6 sprinkler         7 Zeus.angry 
##                1.699                3.404

We can see that as the number of items in the conditioners vector increases, it generally takes longer to do the query.

  g. Explain, in terms of (a) the unconditional probability of the conditioning statements and (b) the implementation of conditional.query() using a while-loop, why the result in (1f) obtains.

As the number of items in the conditioners vector increases, the unconditional probability that all of the conditioning statements are simultaneously true decreases, since more constraints have to be satisfied. A full joint sample drawn from the model (one pass through the body of the while-loop) is therefore less likely to satisfy all of the conditions and be accepted, so on average conditional.query() must run through more iterations of the while-loop before it has accumulated n.sims accepted samples. Hence the runtime grows as the conditioners vector gets longer.
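
This argument can be made concrete: the expected number of while-loop iterations is n.sims divided by the probability that a forward sample satisfies all of the conditioners. A rough sketch of how one might estimate that acceptance probability, reusing rain.sim() (my own illustration, not part of the assignment):

# Estimate the probability that a forward sample from the joint distribution satisfies
# every conditioner, and the implied expected number of proposals for n.sims = 1000.
acceptance.rate <- function(conditioners, n.sims = 10000) {
  sims <- sapply(1:n.sims, FUN = rain.sim)
  if (length(conditioners) == 0) return(1)               # empty conditioning set: accept everything
  mean(apply(sims[conditioners, , drop = FALSE], 2, all))
}
p.accept <- acceptance.rate(c('thunder', 'rain', 'sprinkler'))
p.accept          # chance that a single proposal is accepted
1000 / p.accept   # expected number of while-loop iterations to collect 1000 samples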

2. Multivariate probability practice

  a. [Levy, exercise 3.1: Simpler formula for variance.] Show from the definition of the variance as \(\mathbf{Var}(X) ≡ \mathbb{E}[(X − \mathbf{E}[X])^2]\) that it can equivalently be written as \(\mathbf{Var}(X) = \mathbb{E}[X^2] − \mathbf{E}[X]^2\), which we stated without proof in Section 2.9.2.

\[ \begin{aligned} \mathbf{Var}(X) & ≡ \mathbb{E}[(X − \mathbf{E}[X])^2] \\ & = \mathbb{E}[X^2 − 2 X \mathbf{E}[X] + \mathbf{E}[X]^2] \\ & = \mathbb{E}[X^2] − \mathbb{E}[2 X \mathbf{E}[X]] + \mathbb{E}[\mathbf{E}[X]^2] \quad \text{ (linearity)} \\ & = \mathbb{E}[X^2] − 2\mathbf{E}[X]\mathbb{E}[X] + \mathbf{E}[X]^2 \quad \text{ (linearity, $\mathbf{E}[X]$ is a constant)} \\ & = \mathbb{E}[X^2] − 2\mathbf{E}[X]^2 + \mathbf{E}[X]^2 \\ & = \mathbb{E}[X^2] − \mathbf{E}[X]^2 \end{aligned} \]

I am not sure whether the typesetting difference between \(\mathbb{E}\) and \(\mathbf{E}\) is intentional. Here I treat \(\mathbb{E}\) as the operator that takes an expectation and use \(\mathbf{E}\) to denote the result of applying it, which makes it clearer that \(\mathbf{E}[X]\) is just a constant (like \(2\)). Of course, this distinction is not strictly necessary.
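
As an informal numerical check of the identity (my own addition, not part of the exercise), the two formulas can be compared on a large simulated sample, treating the sample itself as the distribution (i.e. dividing by \(n\) rather than \(n-1\)):

set.seed(1)
x <- rnorm(1e5, mean = 2, sd = 3)
mean((x - mean(x))^2)   # E[(X - E[X])^2]
mean(x^2) - mean(x)^2   # E[X^2] - E[X]^2; identical up to floating-point error (about sd^2 = 9 here)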

  b. [Levy, Exercise 3.2: Covariance of conditionally independent random variables.] Use linearity of the expectation [section 3.3.1] to prove that if two discrete random variables \(X\) and \(Y\) are conditionally independent given your state of knowledge, then \(\mathbf{Cov}(X, Y) = 0\). (Hint: you can rewrite \(\sum_{x,y} [x \times p(X = x) \times y \times p(Y = y)]\) as \([\sum_{x} x \times p(X = x)] \times [\sum_y y \times p(Y = y)]\), since \(x\) and \(p(X = x)\) are constant with respect to \(y\), and vice versa.)

First, similarly to the previous exercise, we show that \(\mathbf{Cov}(X, Y)=\mathbb{E}[XY] − \mathbf{E}[X]\mathbf{E}[Y]\):

\[ \begin{aligned} \mathbf{Cov}(X, Y) & ≡ \mathbb{E}[(X − \mathbf{E}[X])(Y - \mathbf{E}[Y])]\\ & = \mathbb{E}[XY − X \mathbf{E}[Y] − \mathbf{E}[X] Y + \mathbf{E}[X]\mathbf{E}[Y]] \\ & = \mathbb{E}[XY] − \mathbf{E}[Y]\mathbb{E}[X] - \mathbf{E}[X]\mathbb{E}[Y] + \mathbb{E}[\mathbf{E}[X]\mathbf{E}[Y]] \quad \text{ (linearity)} \\ & = \mathbb{E}[XY] − \mathbf{E}[Y]\mathbf{E}[X] - \mathbf{E}[X]\mathbf{E}[Y] + \mathbf{E}[X]\mathbf{E}[Y] \\ & = \mathbb{E}[XY] − \mathbf{E}[X]\mathbf{E}[Y] \quad \text{ (!)} \end{aligned} \] Next, we calculate \(\mathbb{E}[XY]\) by definition: \[ \begin{aligned} \mathbb{E}[XY] & ≡ \sum_{x,y} xyP(X = x, Y = y) \\ & = \sum_{x,y} xyP(X = x)P(Y = y) \quad \text{ (conditional independence of $X$ and $Y$ given our state of knowledge)}\\ & = \sum_x\sum_y xP(X = x) \times yP(Y = y) \\ & = \left(\sum_x x P(X = x)\right) \times \left(\sum_y yP(Y = y)\right) \quad \text{ (constant wrt each other)}\\ & = \mathbf{E}[X]\mathbf{E}[Y] \end{aligned} \]

Substituting this back into (!), we have \[ \begin{aligned} \mathbf{Cov}(X, Y) & = \mathbb{E}[XY] − \mathbf{E}[X]\mathbf{E}[Y] \\ & = \mathbf{E}[X]\mathbf{E}[Y] − \mathbf{E}[X]\mathbf{E}[Y] \\ & = 0, \end{aligned} \] which completes the proof.
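
As a matching informal check for the covariance result (again my own addition), the shortcut formula and R's built-in cov() should both be close to zero for two independently generated discrete variables:

set.seed(1)
x <- rbinom(1e5, size = 1, prob = 0.3)
y <- rbinom(1e5, size = 1, prob = 0.6)
mean(x * y) - mean(x) * mean(y)   # E[XY] - E[X]E[Y], approximately 0
cov(x, y)                         # built-in covariance (n - 1 denominator), also approximately 0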