IS605 Assignment 6

PS 1

When you roll a fair dice 3 times, how many possible outcomes are there?
A single roll has six possible outcomes, so the number of possible combination of outcomes is \(6^3=216\).
What is the probability of getting a sum total of 3 when you roll a die two times?
There are two ways for two consecutive die rolls to sum to 3: the first roll = 1, second roll = 2 and vice versa. There are a total of \(6^2\) possible elementary events, so the probability is \(1/18\).
Assume there are 25 strangers in a room. What is the probability of [at least] two sharing the same birthday? What about the probability with 50 people?
The probability of no one in the room sharing the same birthday is: \[\frac{365!}{340!365^{25}}\approx 0.43\] This indicates that the probability of at least two people sharing a birthday is \(1 - 0.43 = 0.57\).

This generalizes to: \[1 - \frac{365!}{(365-n)!365^n}\]Hence, the probability in the case of n = 50 people is: \[1 - \frac{365!}{(315)!365^{50}}\approx 1 - 0.03 = 0.97\]

PS 2

Write a program to read an English language document and write out the estimated probabilities for each word of the document. Extend it to calculate the probability of two words (order does not matter).

require(stringr)

calc.word.prob <- function(num.words = 1) {
  fname <- 'assign6.sample.txt'
  con  <- file(fname, open = 'r', encoding = 'UTF-8-BOM')
  
  
  lines <- readLines(con)
  close(con)
  
  # remove blank lines
  lines <- lines[lines != '']
  
  # create empty dataframe to store words and counts
  words <- data.frame(word = I('a'), count = 0)
  words <- words[FALSE,]
  
  # split & clean
  for (line in lines) {
    
    wds <- str_split(line, '[[:space:]]')
    wds <- str_replace_all(wds[[1]], '[^[:alpha:]]', '')
    wds <- wds[wds != '']  # remove empty strings
    
    n <- num.words - 1
    
    for (i in 1:(length(wds) - n)) {
      
      w <- tolower(wds[i:(i + n)])
      w <- str_join(w[order(w)], collapse = '.')
      
      # check if in words; increment if true, add otherwise
      if (w %in% words$word) {
        words[words$word == w, 'count'] <- words[words$word == w, 'count'] + 1
      } else {
        words[nrow(words) + 1,] <- list(w, 1)
      }
      
    }
  }
  
  # calculate probabilities
  words$prob <- words$count / sum(words$count)
  
  # sort by prob
  words <- words[order(words$prob, decreasing = TRUE),]
  
  return(words)
}

The function calc.word.prob takes a single argument representing the length of the n-gram for which the probabilities should be calculated (default is 1) and returns a dataframe with three columns: word, count, and probability.

head(calc.word.prob(1))

##    word count       prob
## 11  the    76 0.05697151
## 2     a    45 0.03373313
## 21  and    38 0.02848576
## 1   for    31 0.02323838
## 30   to    28 0.02098951
## 38   of    28 0.02098951

head(calc.word.prob(2))

##            word count        prob
## 84         a.in     6 0.004629630
## 147 at.tutwiler     6 0.004629630
## 44       of.the     5 0.003858025
## 102      in.the     5 0.003858025
## 334     he.said     5 0.003858025
## 531    said.she     5 0.003858025

According to the above calculations, the word “the” appears with frequency between 5-6%. According to the BYU Time Corpus:

require(ggplot2)

decades <- seq(from = 1920, to = 2000, by = 10)
freq.per.mil <- c(64263.20,  55426.82,  57755.03,   58320.10,   58633.99,   59661.06,   60522.72,   55095.48,   50590.06)
pct <- freq.per.mil / 1000000 * 100
time.the <- data.frame(decades = decades, percent = pct)
qplot(decades, percent, data = time.the)

As we can see from the chart, the probability of the word being “the” has fluctuated between 5 and 6.5 percent over the period 1920-2000, showing that our estimate is fairly accurate in this case.