Prob & Stats for Linguists: HW 2

Due before class on Wednesday, 1/14/15. Ideally, submit via RPubs as outlined in HW1. If you prefer, you can also use knitr to compile to .pdf and submit as an email attachment. As a last resort, if you can’t get knitr to work, send me your .R or .Rmd source file. Please don’t submit a .pdf with code.

Exercises

If you postponed it last week, do exercises 1-4 from Baayen ch.2.
Levy ch.1, pp.34-6, exercise 2.2, modified: “Give an example in words - involving language understanding, and one that was not specifically discussed in class - where two events \(A\) and \(B\) are conditionally independent given some state of knowledge \(C\), but when another piece of knowledge \(D\) is learned, \(A\) and \(B\) lose conditional independence.”
Levy ch.1, pp.34-6, exercise 2.3:

“You obtain infinitely many copies of the text Alice in Wonderland and decide to play a word game with it. You cut apart each page of each copy into individual letters, throw all the letters in a bag, shake the bag, and draw three letters at random from the bag. What is the probability that you will be able to spell ‘tea’? What about ‘tee’? [Hint: see Section 2.5.2; perhaps peek at Section A.8 as well.]”
“Why did the problem specify that you obtained infinitely many copies of the text? Suppose that you obtained only one copy of the text? Would you have enough information to compute the probability of being able to spell ‘tea’? Why?”

Redo both parts of the previous exercise using a simulation in R instead of mathematical reasoning. That is, think carefully about the generative process behind this example, and write a program that implements a model of this process and uses it to generate many draws of 3 letters. Use the sample() function, and estimate each probability as the proportion of samples that have the desired property. Hints: you can learn about sample() by typing ?sample into the console. Think carefully about what vector you are sampling from. You may lump together the letters for which Levy does not specify frequencies in 2.5.2 under a generic ‘other’ value. Also, think carefully about whether to set sample(..., replace = TRUE) or sample(..., replace = FALSE) when answering each sub-question.
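
As a warm-up for sample(), here is a minimal sketch on a toy problem that is unrelated to the letter exercise. The urn contents and the names urn, draw_two, and n_sims are made up for illustration. It estimates the probability of drawing two red balls, with and without replacement, as the proportion of simulated draws that have that property.

```r
# Toy urn (an assumption for illustration): 3 red balls and 2 blue balls.
urn = c('red', 'red', 'red', 'blue', 'blue')

# Draw two balls and report whether both are red.
draw_two = function(i, replace) {
  draws = sample(urn, size = 2, replace = replace)
  all(draws == 'red')
}

set.seed(1)      # make the simulation reproducible
n_sims = 10000   # number of simulated draws

# Proportion of simulations in which both balls are red.
p_with = mean(sapply(1:n_sims, FUN = draw_two, replace = TRUE))
p_without = mean(sapply(1:n_sims, FUN = draw_two, replace = FALSE))

# Exact values for comparison: (3/5)^2 = 0.36 and (3/5) * (2/4) = 0.30.
print(c(with = p_with, without = p_without))
```

The two estimates differ because, without replacement, the first draw changes what is left in the urn for the second draw.
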
Levy ch.1, pp.34-6, exercise 2.10: “For adult female native speakers of American English, the distribution of first-formant frequencies for the vowel [ɛ] is reasonably well modeled as a normal distribution with mean 608 Hz and standard deviation 77.5 Hz. What is the probability that the first-formant frequency of an utterance of [ɛ] for a randomly selected adult female native speaker of American English will be between 555 Hz and 697 Hz?” Show the R code that you used to calculate the answer. (Hint: look back at Monday’s class notes, specifically the part about using R to find the cumulative probability of continuous distributions.)
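
As a reminder of how cumulative probabilities give the probability of an interval, here is a minimal sketch using pnorm(). The values of mu, sigma, a, and b are illustrative (a standard normal) and are not the ones from the exercise.

```r
# P(a < X < b) for X ~ Normal(mu, sigma) is
# pnorm(b, mu, sigma) - pnorm(a, mu, sigma).
# Illustrative values only (a standard normal), not those in the exercise.
mu = 0
sigma = 1
a = -1
b = 1

p_between = pnorm(b, mean = mu, sd = sigma) - pnorm(a, mean = mu, sd = sigma)
print(p_between)  # about 0.683: roughly 68% of the mass lies within one sd
```
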
Design a simulation implementing Pearl’s rain/sprinkler/wet grass example, as discussed in class. Make sure that the samples you generate assign a truth-value to each variable (rain, sprinkler, and wet grass) and that they have the assumed dependency structure: rain and sprinkler are uncaused, each occurring with some fixed probability (say, both are flip(.3)); and wet grass occurs if and only if it rained or the sprinkler was on (or both). You’ll want to begin your code by defining the ‘coin flip’ function.

flip = function(p) runif(1,0,1) < p
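
A quick way to convince yourself that flip() behaves as intended is to call it many times and check the proportion of TRUE values. This is only a sanity-check sketch; the name flips and the choice of 10000 calls are arbitrary.

```r
flip = function(p) runif(1, 0, 1) < p  # TRUE with probability p

set.seed(1)  # make the check reproducible
flips = sapply(1:10000, FUN = function(i) flip(.3))

# The proportion of TRUE values should be close to .3.
print(mean(flips))
```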

Generate 1000 samples from your model using the sapply()-based simulation method discussed in class on Monday, starting with this schematic code chunk:

sim = function(i) {
  rain = ...
  sprinkler = ...
    ...
  return(...) # what you return should be a non-trivial vector
}
samples = sapply(1:1000, FUN=sim)

Notice that, since your function returns a length-\(n\) vector, the result of using sapply(1:m, ...) is an \(n \times m\) matrix - i.e., one with \(n\) rows and \(m\) columns. What do the columns represent? What does each row represent?
Use rownames() and colnames() to add informative row and column names to the matrix of samples. [Hint: what does paste('sample', 1:1000) do?]
Use which() to define a new matrix with only the samples in which your observation was true: wet.grass == TRUE. What are the dimensions of this matrix? What is the proportion of these samples in which each of rain and sprinkler is true?
Now you learn that it rained - so use which() to select the subset in which rain and wet.grass are BOTH true. What is the proportion of these samples in which sprinkler is true? On an intuitive level, why is this?
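
The matrix-handling steps above can be sketched on a toy model that is deliberately not the sprinkler model. Everything named here (toy_sim, toy, keep, toy_same, and the variables a, b, same) is hypothetical and only illustrates the mechanics of sapply(), rownames(), colnames(), and which().

```r
flip = function(p) runif(1, 0, 1) < p

# Toy model (hypothetical): two fair coins, plus a third variable
# recording whether they landed the same way.
toy_sim = function(i) {
  a = flip(.5)
  b = flip(.5)
  return(c(a, b, a == b))
}

set.seed(1)
toy = sapply(1:200, FUN = toy_sim)
print(dim(toy))  # 3 200

rownames(toy) = c('a', 'b', 'same')
colnames(toy) = paste('draw', 1:200)

# Keep only the columns in which 'same' is TRUE.
keep = which(toy['same', ])
toy_same = toy[, keep]

# Proportion of the retained columns in which 'a' is TRUE.
print(mean(toy_same['a', ]))
```

Indexing with toy[, keep] selects whole columns, so every variable stays aligned with the condition you selected on.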

Dan Lassiter
