Data 605 - Week 6

Problem Set 1

(1) When you roll a fair die 3 times, how many possible outcomes are there?

Given that each trial of rolling the dice can produce 6 possible outcomes (1 to 6), the number of outcomes is equal to \(6^n\), where \(n\) is the number of “rolls”. Given 3 rolls, the number of combinations would be equal to \(6^3 = 216\).

(2) What is the probability of getting a sum total of 3 when you roll a die two times?

To get a sum of 3 when rolling 3 die, all 3 die would need to be 1. Therefore, only 1 out of 216 combinations (all being equally probable during any given 3 roll trial). Thus, the probability would be \(\frac{1}{216} = 0.0046\).

(3) Assume a room of 25 strangers. What is the probability that two of them have the same birthday? Assume that all birthdays are equally likely and equal to 1/365 each. What happens to this probability when there are 50 people in the room?

NOTE: I borrowed the methodology from Dr. Anthony at the Math Forum here

There are \(365^n\) combinations (in our first case, \(n\) = 25) of birthday possibilities. There are \(\binom{n}{k}\) ways to choose the two individuals who will share the same birthday (again, in our case, k will equal 2). The probability of a given individual have any birthday is \(= \frac{365}{365}\). The probability of another individual having the same birthday is \(\frac{1}{365}\). All of the other individuals must have different birthdays restricted to the remaining 364 days of the year. Therefore, they have \(\binom{n}{k}\frac{365!}{(365 - n + k - 1)!}\) possible birthdays combinations out of the total \(365^n\). The probability would be:

\[bday(n,k|n >= k) \rightarrow\binom{n}{k}\frac{365!}{(365 - n + k - 1)!}\times\frac{1}{365^n}\]

bday <- function(n, k) {
    return(nchoosek(n,k) * prod(seq(from  = 365 - n + k, to = 365)) * 1/(365^n))
}

p <- seq(2,100)
bday_df <- data.frame(people = p, probability = sapply(p,bday, k = 2))

The probability of only 2 people sharing a birthday in a room of 25 people would be \(\binom{25}{2}\frac{365!}{(365 - 25 + 2 - 1)!}\times\frac{1}{365^{25}} = 0.3794\). For a room of 50 random people the probability would drop to \(0.1148\). This drop is due to the unlikelihood of only 2 people sharing a birthday as the people in the room increases. A plot of probability ranging from 2 people in the room up to 100 people in the room is plotted below. The probability of only 2 people sharing a birthday is highest with 28 people in the room , which is 0.3864.

plot(bday_df$probability ~ bday_df$people)

Problem 2

Part 1 - Find the probability for each word occuring in the document

#***---User Defined Variables---***
txt_url <- "https://raw.githubusercontent.com/Liam-O/Data_605/master/Wk6/assign6.sample.txt"

doc <- Corpus(URISource(txt_url, encoding = "UTF-8"))

w1 <- "she"
w2 <- "said"
#***------***

#Function for for creating document statistics table, namely word probabilities
# and frequencies
get_word_stat <- function(doc) {
    #Clean text
    doc <- tm_map(doc, stripWhitespace)
    doc <- tm_map(doc, content_transformer(tolower))
    doc <- tm_map(doc, content_transformer(gsub),
                  pattern = intToUtf8(8217),
                  replace = "'")
    doc <- tm_map(doc, content_transformer(gsub),
                  pattern = "[^a-z ']",
                  replace = "")
    doc <- tm_map(doc, PlainTextDocument)
    
    dtm <- DocumentTermMatrix(doc)
    tdm <- TermDocumentMatrix(doc)
    
    freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
    
    return(data.frame(freq = freq, prob = round(freq/sum(freq), 4)))
}

w_stat <- get_word_stat(doc)

datatable(w_stat)

Part 2 - Find the joint probability of a word pair occuring in the document

word_prob <- function(w, w_stat) {
    return(ifelse(is.na(w_stat[w,2]),0,w_stat[w,2]))
}

word_freq <- function(w, w_stat) {
    return(ifelse(is.na(w_stat[w,1]),0,w_stat[w,1]))
}

#Function for calculating bigram probabilities
bigram_prob <- function(w1, w2, doc, w_stat = get_word_stat(doc)) {
    raw_txt <- unlist(lapply(doc, as.character), use.names = FALSE)
    raw_txt <- raw_txt[raw_txt != ""]
    raw_txt <- paste0(raw_txt, collapse = " | ")
    
    pair_list <- textcnt(raw_txt, n = 2L, method = "string", recursive = TRUE)
    
    return(ifelse(paste(w1, w2) %in% names(pair_list),
                  pair_list[[paste(w1, w2)]],0)/
               (ifelse(word_freq(w1,w_stat) != 0, word_freq(w1,w_stat), 1)))
}

# Variables defined for output below
w1_prob <- word_prob(w1, w_stat)
w2_prob <- word_prob(w2, w_stat)
w2_given_w1 <- bigram_prob(w1,w2, doc, w_stat)
w1_given_w2 <- bigram_prob(w2,w1, doc, w_stat)

Using the word pair, she and said, the joint probability of the pair occurring is:

\[P\left ( she, said \right ) = P\left ( she \right )P\left ( said|she \right ) = 0.0082\times0.4444 = 0.0036\]

If order doesn’t matter:

\[P\left ( she, said \cup said, she\right ) = P\left ( she \right )P\left ( said|she \right ) + P\left ( said \right )P\left ( she|said \right ) =\\0.0082\times0.4444 + 0.0199\times0.0455 = 0.0036+9\times 10^{-4} = 0.0045\]

The Time Magazine Corpus contains 106,418,475 words. Using the words “she” and “said” and assuming order matters. The respective counts are:

she = 155,151
she said = 3,041

The probability would be: \[P\left ( she, said \right ) = P\left ( she \right )P\left ( said|she \right ) = \frac{155151}{106418475}\times\frac{3041}{155151} = 0.0015\times 0.0196 = 2.9\times 10^{-5}\]

This probability does not match our document because of the limited word count (1104) in our single document corpus and the limited combinations present. If one were to build a Natural Language Processing (NLP) predictor model, a large corpus, like that of Time magazine’s, would need to be used.

Data 605 - Week 6

Liam Byrne

March 7, 2017

Problem Set 1

Problem 2