When you roll a fair die 3 times, how many possible outcomes are there?
A single roll has six possible outcomes, so three rolls have \(6^3 = 216\) possible ordered outcomes.
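The count can be double-checked by brute force. The snippet below is a minimal sketch that lists every ordered outcome with base R's `expand.grid`; the variable names `faces`, `rolls`, and `n.outcomes` are illustrative only.

```r
# Enumerate every ordered outcome of three rolls of a six-sided die.
faces <- 1:6
rolls <- expand.grid(first = faces, second = faces, third = faces)
# Each row is one ordered outcome, so the row count is the answer.
n.outcomes <- nrow(rolls)
print(n.outcomes)  # 216
```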
What is the probability of getting a sum total of 3 when you roll a die two times?
There are two ways for two consecutive die rolls to sum to 3: a 1 followed by a 2, or a 2 followed by a 1. There are \(6^2 = 36\) equally likely elementary events in total, so the probability is \(2/36 = 1/18\).
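The same enumeration idea confirms this result. As a sketch (the names `two.rolls`, `favorable`, and `p.sum3` are illustrative), count the ordered pairs whose total is 3 and divide by the number of pairs:

```r
# Enumerate all ordered pairs of two die rolls and count those summing to 3.
faces <- 1:6
two.rolls <- expand.grid(first = faces, second = faces)
totals <- two.rolls$first + two.rolls$second
favorable <- sum(totals == 3)             # (1, 2) and (2, 1)
p.sum3 <- favorable / nrow(two.rolls)     # 2 / 36
print(p.sum3)
```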
Assume there are 25 strangers in a room. What is the probability of [at least] two sharing the same birthday? What about the probability with 50 people?
The probability of no one in the room sharing the same birthday is: \[\frac{365!}{340!365^{25}}\approx 0.43\] This indicates that the probability of at least two people sharing a birthday is \(1 - 0.43 = 0.57\).
For n people, this generalizes to: \[1 - \frac{365!}{(365-n)!\,365^n}\] Hence, the probability in the case of n = 50 people is: \[1 - \frac{365!}{315!\,365^{50}}\approx 1 - 0.03 = 0.97\]
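Evaluating the factorials directly overflows double precision, so one way to compute the general formula is on the log scale. The function below is a minimal sketch under the same assumption as the formula (365 equally likely birthdays, no leap years); the name `p.shared.birthday` and its `days` parameter are hypothetical.

```r
# Probability that at least two of n people share a birthday,
# assuming `days` equally likely birthdays.
p.shared.birthday <- function(n, days = 365) {
  if (n > days) return(1)  # pigeonhole: a shared birthday is certain
  # log of days! / ((days - n)! * days^n), using lgamma(k + 1) = log(k!)
  log.p.distinct <- lgamma(days + 1) - lgamma(days - n + 1) - n * log(days)
  1 - exp(log.p.distinct)
}

round(p.shared.birthday(25), 2)  # 0.57
round(p.shared.birthday(50), 2)  # 0.97
```

Base R also ships `pbirthday()` in the stats package, which can serve as an independent cross-check.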
Write a program to read an English language document and write out the estimated probabilities for each word of the document. Extend it to calculate the probability of two words (order does not matter).
library(stringr)
calc.word.prob <- function(num.words = 1) {
  fname <- 'assign6.sample.txt'
  # read the document; the 'UTF-8-BOM' encoding drops a leading byte-order mark
  con <- file(fname, open = 'r', encoding = 'UTF-8-BOM')
  lines <- readLines(con)
  close(con)
  # remove blank lines
  lines <- lines[lines != '']
  # create empty dataframe to store words and counts
  words <- data.frame(word = I('a'), count = 0)
  words <- words[FALSE,]
  # split & clean
  for (line in lines) {
    wds <- str_split(line, '[[:space:]]')
    wds <- str_replace_all(wds[[1]], '[^[:alpha:]]', '')
    wds <- wds[wds != ''] # remove empty strings
    n <- num.words - 1
    # seq_len() gives no iterations when a line has fewer than num.words words
    # (1:(length(wds) - n) would count backwards and produce NA entries)
    for (i in seq_len(max(0, length(wds) - n))) {
      w <- tolower(wds[i:(i + n)])
      # sort the words so that order within the n-gram does not matter;
      # str_c() replaces the removed str_join()
      w <- str_c(w[order(w)], collapse = '.')
      # check if in words; increment if true, add otherwise
      if (w %in% words$word) {
        words[words$word == w, 'count'] <- words[words$word == w, 'count'] + 1
      } else {
        words[nrow(words) + 1,] <- list(w, 1)
      }
    }
  }
  # calculate probabilities
  words$prob <- words$count / sum(words$count)
  # sort by prob
  words <- words[order(words$prob, decreasing = TRUE),]
  return(words)
}
The function calc.word.prob takes a single argument, the length of the n-gram for which probabilities should be calculated (default is 1), and returns a data frame with three columns: word, count, and prob. The words within an n-gram are sorted alphabetically and joined with a period, so word order does not matter.
head(calc.word.prob(1))
## word count prob
## 11 the 76 0.05697151
## 2 a 45 0.03373313
## 21 and 38 0.02848576
## 1 for 31 0.02323838
## 30 to 28 0.02098951
## 38 of 28 0.02098951
head(calc.word.prob(2))
## word count prob
## 84 a.in 6 0.004629630
## 147 at.tutwiler 6 0.004629630
## 44 of.the 5 0.003858025
## 102 in.the 5 0.003858025
## 334 he.said 5 0.003858025
## 531 said.she 5 0.003858025
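Growing a data frame one row at a time becomes slow on longer documents. As an alternative, here is a minimal sketch of the same counting idea using base R's `table()`. The helper `word.probs` and the two-line `sample.lines` input are hypothetical and are not part of the original solution; unlike calc.word.prob, the helper takes the text as a character vector instead of reading a file.

```r
# Hypothetical helper: estimate unordered n-gram probabilities from a
# character vector of text lines, using table() to do the counting.
word.probs <- function(lines, num.words = 1) {
  grams <- unlist(lapply(lines, function(line) {
    wds <- strsplit(line, '[[:space:]]+')[[1]]
    wds <- tolower(gsub('[^[:alpha:]]', '', wds))
    wds <- wds[wds != '']  # drop tokens that had no letters
    starts <- seq_len(max(0, length(wds) - num.words + 1))
    # sort the words within each n-gram so that order does not matter
    vapply(starts, function(i) {
      paste(sort(wds[i:(i + num.words - 1)]), collapse = '.')
    }, character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(word = names(counts),
             count = as.vector(counts),
             prob = as.vector(counts) / sum(counts),
             stringsAsFactors = FALSE)
}

# Made-up input, for illustration only
sample.lines <- c('The cat saw the dog.', 'The dog saw the cat!')
uni <- word.probs(sample.lines, 1)  # single words
bi <- word.probs(sample.lines, 2)   # unordered pairs of adjacent words
```

On this toy input, "the" accounts for 4 of the 10 words, and the pair "cat.the" is counted twice because "the cat" and "cat ... the" orderings collapse to the same key.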
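According to the calc.word.prob output above, the word “the” appears in the sample document with a relative frequency of about 5.7%. For comparison, the per-decade frequencies of “the” (per million words) from the BYU Time Corpus are plotted below: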
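library(ggplot2)
# frequency of "the" per million words, one value per decade
decades <- seq(from = 1920, to = 2000, by = 10)
freq.per.mil <- c(64263.20, 55426.82, 57755.03, 58320.10, 58633.99, 59661.06, 60522.72, 55095.48, 50590.06)
# convert to a percentage of all words
pct <- freq.per.mil / 1000000 * 100
time.the <- data.frame(decades = decades, percent = pct)
# scatter plot; ggplot() is used because qplot() is deprecated in recent ggplot2
ggplot(time.the, aes(x = decades, y = percent)) + geom_point()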
As the chart shows, the relative frequency of the word “the” fluctuated between roughly 5 and 6.5 percent over the period 1920-2000, so our estimate of about 5.7% is consistent with the corpus data.
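The comparison can also be made numerically instead of by eye. The sketch below reuses the corpus figures and the 0.05697151 estimate reported earlier; the names `estimate` and `in.range` are illustrative.

```r
# Compare the sample estimate for "the" with the per-decade corpus values.
freq.per.mil <- c(64263.20, 55426.82, 57755.03, 58320.10, 58633.99,
                  59661.06, 60522.72, 55095.48, 50590.06)
pct <- freq.per.mil / 1e6 * 100   # percent of all words, per decade
estimate <- 0.05697151 * 100      # prob for "the" from calc.word.prob(1)
# TRUE when the estimate lies between the smallest and largest decade value
in.range <- estimate >= min(pct) && estimate <= max(pct)
print(range(pct))
print(mean(pct))
print(in.range)
```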