Context: Imagine I am interpreting the utterance “I’m hot!” from a person named Bill. State A - Bill is feeling overheated. State B - Bill is feeling lucky. Knowledge C - Bill and I are in the U.S. Knowledge D - Bill and I are in a casino playing poker and Bill has won three consecutive pots. Given knowledge C, A and B are conditionally independent: C does nothing to alter the likelihood of either state, so learning that one holds tells us nothing more about the other. Given knowledge D, A and B are dependent: D dramatically increases the likelihood that Bill is feeling lucky (State B), which in turn decreases the likelihood of State A as the reading of the utterance.
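For reference, the two claims can be written out in standard probability notation. This is only a sketch of the usual textbook definitions applied to the states A, B and the knowledge C, D named above; it is not taken from the original question.

```latex
% Conditional independence of A and B given C
P(A, B \mid C) = P(A \mid C)\, P(B \mid C)

% Dependence given D: learning B changes the probability of A
P(A \mid B, D) \neq P(A \mid D)
```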
t = .099
e = .126
a = .082
prob.tea = t * e * a
prob.tea
## [1] 0.001022868
However, since order is of no consequence, there are six orderings in which we might pick these letters: tea, tae, eat, eta, aet, ate. Therefore, the probability that we make three selections that allow us to spell “tea” is 6 * prob.tea.
6 * prob.tea #Probability of spelling tea
## [1] 0.006137208
We can follow the same steps to calculate the probability of spelling “tee.” However, in this case there are only 3 distinct orderings of the letters, because the two e’s are interchangeable: tee, ete, eet.
prob.tee = t * e * e
3 * prob.tee #Probability of spelling tee
## [1] 0.004715172
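The counting step above (6 orderings for “tea”, 3 for “tee”) can be sketched as a small helper. The function name “prob.any.order” and the named vector “letter.probs” are hypothetical, introduced only for illustration. The helper counts distinct orderings as n! divided by the factorial of each letter’s count, and assumes the draws are independent.

```r
# Hypothetical helper: probability of drawing the letters of `word` in any order.
# `probs` is a named vector of per-letter probabilities (an assumption of this sketch).
letter.probs = c(t = .099, e = .126, a = .082)

prob.any.order = function(word, probs) {
  chars = strsplit(word, "")[[1]]
  # Distinct orderings: n! divided by the factorial of each letter's count
  counts = as.vector(table(chars))
  orderings = factorial(length(chars)) / prod(factorial(counts))
  # Probability of one specific ordering, times the number of orderings
  orderings * prod(probs[chars])
}

p.tea = prob.any.order("tea", letter.probs)
p.tee = prob.any.order("tee", letter.probs)
print(c(tea = p.tea, tee = p.tee))
```

The two values should agree with 6 * prob.tea and 3 * prob.tee computed by hand above.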
To simulate the above question in R, we first populate a vector of letters “chars” as well as a vector of letter probabilities “probs”. I’ve initially given each letter a random relative frequency between .0300 and .0302. That is close to the probability left over once “t”, “e”, and “a” are accounted for, split evenly across the other 23 letters (0.693 / 23 ≈ 0.0301), rather than exactly 1/26. Of course, we do have the relative frequencies for “t”, “e”, and “a”, so I populate those manually.
chars = letters
probs = runif(26, .0300, .0302)
probs[1] = .082
probs[5] = .126
probs[20] = .099
Now, I keep a “count” var to count all the occurrences of “t”, “e”, “a” permutations, sampling 1,000,000 times and incrementing the count every time we get a match in letters.
count = 0
for (i in 1:1000000) {
  selection = paste(sample(chars, size = 3, replace = TRUE, prob = probs), collapse = "")
  if (selection %in% c("tea", "tae", "eat", "eta", "aet", "ate"))
    count = count + 1
}
Dividing the number of letter matches by the number of samples, we get a frequency of:
frequency = count / 1000000
print(frequency)
## [1] 0.006071
We can do the same for “tee” permutations:
count = 0
for (i in 1:1000000) {
  selection = paste(sample(chars, size = 3, replace = TRUE, prob = probs), collapse = "")
  if (selection %in% c("tee", "ete", "eet"))
    count = count + 1
}
frequency = count / 1000000
print(frequency)
## [1] 0.004798
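The same simulation can also be sketched without an explicit loop, by drawing every letter at once and counting letters per trial. This is an alternative sketch, not the approach used above. The seed, the trial count “n.trials”, and the exact 0.693 / 23 filler probability are assumptions chosen for illustration.

```r
set.seed(1)          # arbitrary seed, only so the sketch is repeatable
n.trials = 100000    # assumed trial count; smaller than above to keep it quick

chars = letters
probs = rep(0.693 / 23, 26)               # leftover mass split over the other 23 letters
probs[c(1, 5, 20)] = c(.082, .126, .099)  # a, e, t

# One column per trial, three draws per column
draws = matrix(sample(chars, size = 3 * n.trials, replace = TRUE, prob = probs),
               nrow = 3)

# Count how often each target letter appears within each trial
n.t = colSums(draws == "t")
n.e = colSums(draws == "e")
n.a = colSums(draws == "a")

freq.tea = mean(n.t == 1 & n.e == 1 & n.a == 1)  # any ordering of t, e, a
freq.tee = mean(n.t == 1 & n.e == 2)             # any ordering of t, e, e
print(c(tea = freq.tea, tee = freq.tee))
```

Counting letters per trial makes the match independent of ordering, so no list of permutations has to be typed out.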
First, we draw a sample from a normal distribution with mean = 608 and sd = 77.5. I’ve arbitrarily set the number of observations “n” in rnorm to 1000. Here’s a histogram of the sample for fun.
n.distr = rnorm(1000, mean = 608, sd = 77.5)
hist(n.distr, main = "First-formant frequencies for the vowel [E]\nfor female native speakers of American English", xlab = "first-formant frequencies", ylab = "Occurrences")
Now, we can feed this into an empirical Cumulative Distribution Function (“c.distr”) using ecdf(), again plotting for fun…
c.distr = ecdf(n.distr)
plot(c.distr, main = "Cumulative Probability\nfor the vowel [E]\nfemale native speakers of American English", xlab = "first-formant frequencies", ylab = "Probability")
Since we have our Cumulative Probability Distribution, we can call c.distr with a frequency as its argument, and it will return the probability of drawing a value at or below that frequency.
c.distr(555)
## [1] 0.275
c.distr(697)
## [1] 0.876
#Which is to say there's a probability of c.distr(555) of randomly selecting a frequency at or below 555Hz and a probability of c.distr(697) of randomly selecting a frequency at or below 697Hz
Since we’re interested in the probability of randomly selecting a frequency between 555Hz and 697Hz, we simply take the difference of the values returned at those frequencies:
prob = c.distr(697) - c.distr(555)
prob
## [1] 0.601
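Because the empirical CDF is built from only 1000 random draws, its answer varies from run to run. As a cross-check, here is a minimal sketch of the same interval probability computed directly from the normal CDF with pnorm(), assuming the same mean and sd as above. The variable name “prob.exact” is introduced here for illustration.

```r
# Probability of a value between 555Hz and 697Hz under Normal(608, 77.5),
# computed from the theoretical CDF rather than a simulated sample
prob.exact = pnorm(697, mean = 608, sd = 77.5) - pnorm(555, mean = 608, sd = 77.5)
print(prob.exact)
```

This comes out to roughly 0.63, so the simulated 0.601 is in the right neighborhood, with the gap attributable to sampling noise.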
First we define our flip function, which we will use for our “rain” and “sprinkler” states, and implement our simulation function “sim” using “flip.” I’ve made two vars to hold the rain and sprinkler probabilities we’ll pass to flip. We assign the matr
flip = function(p) runif(1, 0, 1) < p
rain.prob = .24
sprinkler.prob = .31
sim = function(i) {
  rain = flip(rain.prob)
  sprinkler = flip(sprinkler.prob)
  wetgrass = rain || sprinkler
  return(c(rain, sprinkler, wetgrass))
}
We generate 10000 samples from our model, storing them in a matrix we’ll call “samples”. Since our “sim” function returns a vector of length 3 (for the truth-functional outcomes of rain, sprinkler and wetgrass), our matrix will be 3 x 10000, with rows 1:3 representing (rain, sprinkler, wetgrass) and each of the columns representing a trial (observation). I name the rows and cols to reflect this. Having the matrix oriented this way doesn’t feel as intuitive as having each row represent an individual trial, so I transpose the rows and columns of samples, storing the result in a new matrix “samples.transposed”.
samples = sapply(1:10000, FUN=sim)
dim(samples)
## [1] 3 10000
rownames(samples) = c("rain", "sprinkler", "wetgrass")
colnames(samples) = paste('sample', 1:10000)
samples[,1:10]
##           sample 1 sample 2 sample 3 sample 4 sample 5 sample 6 sample 7
## rain         FALSE    FALSE     TRUE    FALSE     TRUE    FALSE    FALSE
## sprinkler    FALSE    FALSE    FALSE    FALSE     TRUE     TRUE    FALSE
## wetgrass     FALSE    FALSE     TRUE    FALSE     TRUE     TRUE    FALSE
##           sample 8 sample 9 sample 10
## rain         FALSE    FALSE     FALSE
## sprinkler     TRUE    FALSE     FALSE
## wetgrass      TRUE    FALSE     FALSE
samples.transposed = t(samples)
dim(samples.transposed)
## [1] 10000 3
Here’s a look at our first 10 samples now stored in our “samples.transposed” matrix:
samples.transposed[1:10,]
##            rain sprinkler wetgrass
## sample 1  FALSE     FALSE    FALSE
## sample 2  FALSE     FALSE    FALSE
## sample 3   TRUE     FALSE     TRUE
## sample 4  FALSE     FALSE    FALSE
## sample 5   TRUE      TRUE     TRUE
## sample 6  FALSE      TRUE     TRUE
## sample 7  FALSE     FALSE    FALSE
## sample 8  FALSE      TRUE     TRUE
## sample 9  FALSE     FALSE    FALSE
## sample 10 FALSE     FALSE    FALSE
Now I’m going to pull the indices out of samples.transposed for the samples in which wetgrass == TRUE. I’m breaking this into two steps for clarity.
wetgrass.indices = which(samples.transposed[, "wetgrass"] == T)
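As an aside, we can see how many TRUE wetgrass states we have by taking the length of this vector. Dividing that by the total number of samples approximates how often we observed wet grass, as a relative frequency. (Note that “rfrequency.wetgrass” below holds the raw count; the relative frequency itself is “percent.wetgrass”.)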
rfrequency.wetgrass = length(wetgrass.indices)
percent.wetgrass = rfrequency.wetgrass / nrow(samples.transposed)
percent.wetgrass
## [1] 0.4693
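The simulated frequency can be compared against the value the model implies. Since the grass is wet unless it neither rains nor the sprinkler runs, and the two flips are independent, a minimal sketch of the exact probability follows. The name “p.wetgrass” is introduced here for illustration.

```r
rain.prob = .24
sprinkler.prob = .31

# P(wetgrass) = 1 - P(no rain) * P(no sprinkler), using independence of the two flips
p.wetgrass = 1 - (1 - rain.prob) * (1 - sprinkler.prob)
print(p.wetgrass)
```

This gives 0.4756, close to the 0.4693 observed in the 10000 samples.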
We can save all samples in which the grass was wet into another matrix we’ll call “samples.wetgrassT”
samples.wetgrassT = samples.transposed[wetgrass.indices,]
dim(samples.wetgrassT)
## [1] 4693 3
From samples.wetgrassT we can pull data to populate two more vector variables, “rainT” and “sprinklerT”, storing all the indices in which “rain” is true and “sprinkler” is true. We can then find the proportion of wet-grass samples in which each is true.
#Indices in which it rained
rainT = which(samples.wetgrassT[, "rain"] == TRUE)
#Proportion of instances in which it rained:
length(rainT) / rfrequency.wetgrass
## [1] 0.5024505
#Indices in which it sprinklered
sprinklerT = which(samples.wetgrassT[, "sprinkler"] == TRUE)
#Proportion of instances in which it sprinklered:
length(sprinklerT) / rfrequency.wetgrass
## [1] 0.6535265
To find the instances in which it both rained and sprinklered I introduce a new vector “rain.sprinklerT” and use which() again:
rain.sprinklerT = which(samples.wetgrassT[rainT, "sprinkler"] == TRUE)
#Number of times it both rained and sprinklered
length(rain.sprinklerT)
## [1] 732
#Relative frequency that it both rained and sprinklered
length(rain.sprinklerT) / rfrequency.wetgrass
## [1] 0.155977
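These three proportions can also be checked against the model directly. Because rain or sprinkler always implies wet grass, each conditional probability is just the unconditional probability divided by P(wetgrass). The following is a sketch under that assumption, with variable names introduced here for illustration.

```r
rain.prob = .24
sprinkler.prob = .31
p.wet = 1 - (1 - rain.prob) * (1 - sprinkler.prob)

# Rain (or sprinkler) implies wet grass, so P(X and wet) = P(X)
p.rain.given.wet      = rain.prob / p.wet
p.sprinkler.given.wet = sprinkler.prob / p.wet
p.both.given.wet      = rain.prob * sprinkler.prob / p.wet

print(round(c(rain = p.rain.given.wet,
              sprinkler = p.sprinkler.given.wet,
              both = p.both.given.wet), 4))
```

The results (about 0.505, 0.652 and 0.156) line up with the simulated 0.502, 0.654 and 0.156.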
For fun I also did this using nested for-loops and stored the samples in which it BOTH rained and sprinkled in “bothT.”
bothT = numeric()
for (i in seq_along(rainT)) {
  for (j in seq_along(sprinklerT)) {
    if (rainT[i] == sprinklerT[j]) {
      bothT = c(bothT, rainT[i])
    }
  }
}
#Number of times it both rained and sprinklered
length(bothT)
## [1] 732
Obviously using “which()” is way better…
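Two further loop-free options are intersect() on the index vectors and a logical “&” on the columns themselves. The sketch below uses a small made-up matrix “toy” as a stand-in for samples.wetgrassT, so that it runs on its own. The data in it are invented for illustration.

```r
# Toy stand-in for samples.wetgrassT (invented data; wetgrass is TRUE in every row)
toy = cbind(rain      = c(TRUE, FALSE, TRUE, TRUE),
            sprinkler = c(TRUE, TRUE, FALSE, TRUE),
            wetgrass  = TRUE)

rainT      = which(toy[, "rain"])
sprinklerT = which(toy[, "sprinkler"])

# Option 1: row indices present in both index vectors
both.idx = intersect(rainT, sprinklerT)

# Option 2: combine the logical columns directly and count the TRUEs
n.both = sum(toy[, "rain"] & toy[, "sprinkler"])

print(both.idx)
print(n.both)
```

Both avoid the quadratic cost of comparing every pair of indices in nested loops.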
The last bit of the question is somewhat vague… If the sprinkler was on, we’d want to select the subset in which sprinkler==T (wetgrass would of course also be T)… So the proportion of that subset in which sprinkler==T would be 100%… Is it supposed to be the proportion of the sprinkler==T samples in which rain is also T? I calculate that here:
length(which(samples.wetgrassT[sprinklerT, "rain"] == T)) / nrow(samples.wetgrassT[sprinklerT,])
## [1] 0.2386697
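Notice how close this number is to the original probability that it would rain. This makes sense because rain and sprinkler are independent random variables, and once the sprinkler is on the grass is wet regardless, so conditioning on wet grass adds no further information about rain.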
rain.prob
## [1] 0.24