Exercises

  1. Levy ch.1, pp.34-6, exercise 2.2, modified: “Give an example in words - involving language understanding, and one that was not specifically discussed in class - where two events \(A\) and \(B\) are conditionally independent given some state of knowledge \(C\), but when another piece of knowledge \(D\) is learned, \(A\) and \(B\) lose conditional independence.”

Imagine we hear a Spanish translation of the sentence fragment “The mouse was chased by …” Let \(A\) be the event that the noun of the following NP is masculine and \(B\) be the event that the noun of the following NP is plural. Given the state of knowledge at this point, \(C\), \(A\) and \(B\) are conditionally independent: knowing the noun’s gender tells us nothing about its number, and vice versa.

However, suppose we further learn \(D\): the definite article of this NP starts with ‘l.’ Then \(A\) and \(B\) lose conditional independence. On the one hand, \(p(B|C,D)<1\), since the article could be ‘la,’ ‘los,’ or ‘las,’ so the noun could still be either singular or plural. On the other hand, \(p(B|A,C,D)=1\): if the noun is masculine and its definite article starts with ‘l,’ the article must be the plural ‘los,’ since the masculine singular ‘el’ does not start with ‘l.’
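
To make the argument concrete, here is a minimal numerical sketch in R. The uniform prior over the four definite articles is purely an illustrative assumption, not part of the exercise:

article <- c('el', 'la', 'los', 'las')
masc    <- c(TRUE, FALSE, TRUE, FALSE)   # event A: the noun is masculine
plural  <- c(FALSE, FALSE, TRUE, TRUE)   # event B: the noun is plural
p       <- rep(0.25, 4)                  # prior over articles (assumed uniform)
# Under C alone, A and B are independent:
sum(p[masc & plural]) == sum(p[masc]) * sum(p[plural])  # TRUE: 0.25 == 0.5 * 0.5
# Event D: the article starts with 'l'; condition on D and renormalize:
D <- substr(article, 1, 1) == 'l'
p.D <- p[D] / sum(p[D])
sum(p.D[plural[D]])                                # p(B | C, D) = 2/3 < 1
sum(p.D[masc[D] & plural[D]]) / sum(p.D[masc[D]])  # p(B | A, C, D) = 1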

  1. Levy ch.1, pp.34-6, exercise 2.3:

The probability of drawing any particular permutation of ‘tea’ is \(\pi_t\cdot\pi_e\cdot\pi_a\). Since there are \(3!\) different permutations, the probability of being able to spell ‘tea’ is \(3!\cdot\pi_t\cdot\pi_e\cdot\pi_a\).

Similarly, the probability of drawing any particular permutation of ‘tee’ is \(\pi_t\cdot\pi_e\cdot\pi_e\), but there are only \(3!/2! = 3\) distinct permutations (‘tee’, ‘ete’ and ‘eet’), since the two e’s are indistinguishable. Hence the probability of being able to spell ‘tee’ is \(3\cdot\pi_t\cdot\pi_e\cdot\pi_e\).

# letter frequencies from Levy 2.5.2
p.e <- 0.126
p.t <- 0.099
p.a <- 0.082
factorial(3) * p.t * p.e * p.a  # probability of 'tea'
## [1] 0.006137208
3 * p.t * p.e * p.e  # probability of 'tee'
## [1] 0.004715172

The stipulation of infinitely many copies ensures that drawing letters does not affect the probability distribution of the next letter: sampling from an infinite supply is equivalent to sampling with replacement. If there were only one copy (or finitely many copies, for that matter), then sampling without replacement would change the distribution, however slightly, after each draw. In that case, since we do not know the total number of letters in a copy, we could not compute the exact probability. However, because the count of each letter is very large, drawing a few letters barely changes the distribution, so the exact probability should be very close to what we obtained above.
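
To see how small the without-replacement correction would be, suppose (hypothetically) that a single copy contained \(N\) letters in total, with \(n_t\), \(n_e\), and \(n_a\) tokens of ‘t,’ ‘e,’ and ‘a.’ The probability of spelling ‘tea’ without replacement would be \(3!\cdot\frac{n_t}{N}\cdot\frac{n_e}{N-1}\cdot\frac{n_a}{N-2}\), and that of ‘tee’ would be \(3\cdot\frac{n_t}{N}\cdot\frac{n_e}{N-1}\cdot\frac{n_e-1}{N-2}\); both converge to the with-replacement values above as \(N\) grows with the letter proportions held fixed.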

  1. Re-do both parts of the previous exercise using a simulation in R instead of mathematical reasoning. That is, think carefully about the generative process by which this example proceeds, and write a program which implements a model of this process and uses it to generate many draws of 3 letters. Use the sample() function, and take the proportion of samples that have the desired property as your best estimate of the probability. Hints: you can learn about sample() by typing ?sample into the console. Make sure that you think carefully about what vector you are sampling from. You may fill in letters for which Levy does not specify frequencies in 2.5.2 with a generic ‘other’ value. Also, make sure that you think carefully about whether to set sample(..., replace=TRUE) or sample(..., replace=FALSE) when answering each sub-question.

Alice in Wonderland has around 120000 letters, but to illustrate the point that sampling with or without replacement makes little difference in this case, we pretend that it has only a tenth as many, i.e., 12000. The smaller vector makes each draw faster, so we can run more samples and estimate the probabilities more precisely; moreover, since without-replacement effects are larger in a smaller population, if they are negligible here they are certainly negligible at full size.

We will use 'o' for any letter other than 'e', 't' and 'a'.

N <- 12000
n.e <- round(N * p.e)  # count of 'e' (round() guards against floating-point error in the products)
n.t <- round(N * p.t)  # count of 't'
n.a <- round(N * p.a)  # count of 'a'
vec.alice <- rep('o', N)  # start with all-'other', then fill in the three letters of interest
vec.alice[1 : n.e] <- 'e'
vec.alice[(n.e + 1) : (n.e + n.t)] <- 't'
vec.alice[(n.e + n.t + 1) : (n.e + n.t + n.a)] <- 'a'

sample.letters <- function(vec, n, replace=FALSE) {
  # sample n letters from vec and concatenate them into a single string
  return(paste0(sample(x=vec, size=n, replace=replace), collapse=''))
}

spell.tea <- function(s){
  # whether the letters in the string can spell 'tea'
  return(s=='tea' || s=='tae' || s=='eta' || s=='eat' || s=='ate' || s=='aet')
}
spell.tee <- function(s){
  # whether the letters in the string can spell 'tee'
  return(s=='tee' || s=='ete' || s=='eet')
}
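
A quick sanity check of these helper functions on a few hand-picked strings:

spell.tea('ate')  # TRUE
spell.tea('tee')  # FALSE
spell.tee('eet')  # TRUE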

First we consider sampling from infinite copies, which effectively means sampling with replacement.

n.samples <- 50000
samples.rep <- sapply(1:n.samples, 
                      function(i) sample.letters(vec.alice, 3, replace=TRUE)) 
p.tea.rep <- sum(sapply(samples.rep, spell.tea)) / n.samples 
p.tea.rep  # expected 0.0061
## [1] 0.00568
p.tee.rep <- sum(sapply(samples.rep, spell.tee)) / n.samples 
p.tee.rep  # expected 0.0047
## [1] 0.0042

Now we consider sampling from only one copy, i.e., without replacement.

samples.norep <- sapply(1:n.samples, 
                      function(i) sample.letters(vec.alice, 3, replace=FALSE)) 
p.tea.norep <- sum(sapply(samples.norep, spell.tea)) / n.samples 
p.tea.norep  # expected roughly 0.0061
## [1] 0.00596
p.tee.norep <- sum(sapply(samples.norep, spell.tee)) / n.samples 
p.tee.norep  # expected roughly 0.0047
## [1] 0.0049
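
As a further cross-check, we can compute the exact without-replacement probabilities for our pretend 12000-letter copy, using the counting formulas sketched above (this relies on N and the counts n.e, n.t and n.a defined earlier):

factorial(3) * (n.t/N) * (n.e/(N-1)) * (n.a/(N-2))  # exact 'tea' without replacement, ~0.0061
3 * (n.t/N) * (n.e/(N-1)) * ((n.e-1)/(N-2))         # exact 'tee' without replacement, ~0.0047
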
  1. Levy ch.1, pp.34-6, exercise 2.10: “For adult female native speakers of American English, the distribution of first-formant frequencies for the vowel [ɛ] is reasonably well modeled as a normal distribution with mean 608Hz and standard deviation 77.5Hz. What is the probability that the first-formant frequency of an utterance of [ɛ] for a randomly selected adult female native speaker of American English will be between 555Hz and 697Hz?” Show the R code that you used to calculate the answer. (Hint: look back at Monday’s class notes, specifically the part about using R to find the cumulative probability of continuous distributions.)

pnorm(697, mean=608, sd=77.5) - pnorm(555, mean=608, sd=77.5)
## [1] 0.6275673
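
As a quick sanity check (not part of the exercise), we can estimate the same probability by simulation with rnorm(); the sample size of 100000 is an arbitrary choice:

f1 <- rnorm(100000, mean=608, sd=77.5)  # simulated first-formant frequencies
mean(f1 > 555 & f1 < 697)               # should come out close to 0.6276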
  1. Design a simulation implementing Pearl’s rain/sprinkler/wet grass example as discussed in class. Make sure that the samples that you generate assign a truth-value to each variable - rain, sprinkler, and wet grass - and that they have the dependency structure assumed: rain and sprinkler are uncaused, occurring with some fixed probability (say, both are flip(.3)), and wet grass occurs if and only if either it rained or the sprinkler was on. You’ll want to begin your code by defining the ‘coin flip’ function.

flip = function(p) runif(1,0,1) < p  # TRUE with probability p
sim = function(i) {
  rain = flip(0.3)
  sprinkler = flip(0.3)
  wet.grass = rain || sprinkler  # wet grass iff it rained or the sprinkler was on
  return(c(rain, sprinkler, wet.grass))  # truth-values of all three variables
}
samples = sapply(1:1000, FUN=sim)

Each column is the result of one simulation and each row records the values of one random variable (row 1 for rain, row 2 for sprinkler, and row 3 for wet.grass).

rownames(samples)=c("rain", "sprinkler", "wet.grass")
colnames(samples)=paste('sample', 1:1000) # create a vector ("sample 1", "sample 2", ... "sample 1000")
samples.wet.grass <- samples[, which(samples["wet.grass", ])] # keep only samples where the grass is wet
dim(samples.wet.grass)
## [1]   3 483
mean(samples.wet.grass["rain", ]) # estimate of p(rain | wet grass)
## [1] 0.5652174
mean(samples.wet.grass["sprinkler", ]) # estimate of p(sprinkler | wet grass)
## [1] 0.610766
samples.wet.grass.rain <- samples[, which(samples["wet.grass", ] & samples["rain", ])] # condition on wet grass and rain
mean(samples.wet.grass.rain["sprinkler", ]) # estimate of p(sprinkler | wet grass, rain)
## [1] 0.3113553

We can see that the proportion of samples in which sprinkler is true drops back to the base rate (around 0.3). The reason is that the knowledge that it rained fully explains the observation of wet grass: rain “explains away” the sprinkler. Given rain, the wet grass tells us nothing beyond our prior about whether the sprinkler was on, since either way would be equally consistent with the observation.
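
We can confirm this with exact arithmetic under the model’s stated probabilities (rain and sprinkler each true with probability 0.3, independently):

p.rain <- 0.3
p.sprinkler <- 0.3
p.wet <- 1 - (1 - p.rain) * (1 - p.sprinkler)  # p(wet grass) = 0.51
p.sprinkler / p.wet  # p(sprinkler | wet grass) ~ 0.588, since sprinkler implies wet grass
p.sprinkler          # p(sprinkler | wet grass, rain) = p(sprinkler) = 0.3, since rain alone implies wet grass

The simulated conditional proportions above are all within sampling error of these exact values.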