Problem set 1

  1. When you roll a fair die 3 times, how many possible outcomes are there?
6^3
## [1] 216
  2. What is the probability of getting a sum total of 3 when you roll a die two times?

Only (1,2) or (2,1) gives a sum of 3, so there are two favourable outcomes out of the \(6^2 = 36\) possibilities.

library(MASS)
fractions(2/36)
## [1] 1/18
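
Both answers can be checked by brute force. This is a minimal sketch that lists every outcome with base R's expand.grid; the names rolls3, rolls2, hits, n_outcomes and p_sum3 are my own and not part of the problem set.

```r
# Enumerate every outcome of three rolls and count them.
rolls3 <- expand.grid(first = 1:6, second = 1:6, third = 1:6)
n_outcomes <- nrow(rolls3)

# Enumerate every outcome of two rolls and keep those whose sum is 3.
rolls2 <- expand.grid(first = 1:6, second = 1:6)
hits <- rolls2[rolls2$first + rolls2$second == 3, ]
p_sum3 <- nrow(hits) / nrow(rolls2)

print(n_outcomes)  # number of outcomes for three rolls
print(hits)        # the outcomes that sum to 3
print(p_sum3)      # share of two-roll outcomes that sum to 3
```

Enumeration only works while the sample space is small, but it is a handy sanity check on the counting argument.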

Assume a room of 25 strangers. What is the probability that at least two of them have the same birthday? Assume that all 365 birthdays are equally likely, each with probability 1/365.

It is easier to compute the complement: the probability that no two people in the room share a birthday. The first person can have any of the 365 dates, and each subsequent person must avoid every date already taken, so one fewer date is available each time. For 25 people this gives \(\frac{365}{365}\cdot\frac{364}{365}\cdots\frac{341}{365}\). Subtracting this product from one gives the probability that at least two people share a birthday. https://en.wikipedia.org/wiki/Birthday_problem

1 - prod(365:341)/(365^25)
## [1] 0.5686997

What happens to this probability when there are 50 people in the room?

1 - prod(365:316)/(365^50)
## [1] 0.9703736
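
The two calculations above follow the same pattern, so they can be wrapped in one function. This is a sketch under the same equal-likelihood assumption; the function name p_shared_birthday and its days argument are my own. It sums logs instead of multiplying, so the intermediate values stay small even for large rooms.

```r
# Probability that at least two of n people share a birthday,
# assuming `days` equally likely birthdays.
p_shared_birthday <- function(n, days = 365) {
  # With more people than days a shared birthday is certain.
  if (n > days) return(1)
  # log of (days/days) * ((days-1)/days) * ... * ((days-n+1)/days)
  log_p_distinct <- sum(log((days - seq_len(n) + 1) / days))
  1 - exp(log_p_distinct)
}

print(p_shared_birthday(25))
print(p_shared_birthday(50))
```

Base R also ships stats::pbirthday, which should give the same figures and makes a useful cross-check.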

Problem set 2

Sometimes you cannot compute the probability of an outcome by measuring the sample space and examining the symmetries of the underlying physical phenomenon, as you could when you rolled a die or picked a card from a shuffled deck. You have to estimate probabilities by other means. For instance, when you have to compute the probability of various English words, the sample space is too large to examine directly, so you have to resort to empirical techniques to get a good enough estimate. One such approach is to take a large corpus of documents, count the number of occurrences of a particular character or word in those documents, and base your estimate on that count. Write a program that takes a file containing a large document in English and writes out the estimated probability of each word that appears in that document. Please remove all punctuation (quotes, commas, hyphens, etc.) and convert the words to lower case before you perform your calculations.

corper <- function(path) {
  # read the document and build a cleaned tm corpus (one document per line)
  lines <- readLines(path)
  corpus <- Corpus(VectorSource(lines))
  # make all text lowercase so that "The" and "the" are counted together
  corpus <- tm_map(corpus, content_transformer(tolower))
  # reduce different tenses and plurals to a common stem (needs the SnowballC package)
  corpus <- tm_map(corpus, stemDocument)
  # remove punctuation and digits, then collapse runs of whitespace
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}

wordprob <- function(corpus) {
  # create a document-term matrix to get the counts; by default tm keeps
  # only terms of at least three characters, so "of", "to", "a" etc. are excluded
  dtm <- DocumentTermMatrix(corpus)
  dtm2 <- as.matrix(dtm)

  # total count of each word across all lines, sorted by frequency
  frequency <- sort(colSums(dtm2), decreasing=TRUE)
  # total number of counted words
  words <- sum(frequency)
  # estimated probability of each word; return the 50 most frequent
  prob.words <- frequency/words
  return(prob.words[1:50])
}

library (tm)
## Warning: package 'tm' was built under R version 3.3.3
## Loading required package: NLP
corp <-  corper("https://www.gutenberg.org/files/2701/2701-0.txt")
wordprob(corp)
##         the         and        that         his         but        with 
## 0.083637947 0.036948974 0.016821552 0.014420928 0.010181164 0.010129599 
##         was         for         all        this       whale         not 
## 0.009373317 0.009304564 0.008496717 0.008044094 0.006732058 0.006560176 
##        from         him         you         one        have         had 
## 0.006319541 0.005809623 0.005288247 0.005116364 0.004784058 0.004446023 
##       there         now        were        they       which       their 
## 0.004274141 0.004234035 0.003896000 0.003775682 0.003706929 0.003563694 
##        some         are        then        when        like        upon 
## 0.003523588 0.003512129 0.003454835 0.003420459 0.003351706 0.003248576 
##        what        into         out        more        them        seem 
## 0.003059506 0.002990753 0.002939188 0.002876165 0.002606882 0.002601153 
##         old        ship       other         man       would        been 
## 0.002532400 0.002526670 0.002498023 0.002469376 0.002446459 0.002377706 
##        ahab         sea        over        will       these        time 
## 0.002343329 0.002320412 0.002303223 0.002291765 0.002286035 0.002257388 
##        such      though 
## 0.002137070 0.002074047
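
The estimate itself is just each word's count divided by the total word count. Here is a minimal base-R sketch of that step on a tiny made-up text, without tm and without stemming; the function name word_probs and the sample sentences are my own, for illustration only.

```r
# Estimate word probabilities from a character vector of text lines.
word_probs <- function(text) {
  # join the lines, lowercase, and strip punctuation
  text <- tolower(paste(text, collapse = " "))
  text <- gsub("[[:punct:]]+", "", text)
  # split on whitespace and drop any empty strings
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  # relative frequency of each distinct word
  counts <- table(words)
  probs <- setNames(as.numeric(counts), names(counts)) / length(words)
  sort(probs, decreasing = TRUE)
}

toy <- c("The whale, the ship;", "and THE sea.")
probs <- word_probs(toy)
print(probs)
```

Unlike the document-term matrix above, this keeps one- and two-letter words, so the probabilities sum to one over every word in the text.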

Extend your program to calculate the probability of two words occurring adjacent to each other. It should take in a document and two words (say "the" and "for") and compute the probability of each word occurring in the document and the joint probability of the two occurring together. The order of the two words is not important. Use the accompanying document for your testing purposes. Compare your probabilities of various words with the Time Magazine corpus: http://corpus.byu.edu/time/

https://www.r-bloggers.com/24-days-of-r-day-11/

The tau package was the only approach I could get working for the bigram counts, and it is also more efficient than my previous calculations, so I will use its textcnt function going forward.

library (tau)
## Warning: package 'tau' was built under R version 3.3.3
Prob <- function(corpus, search) {
  # split the search phrase so that each word's probability can be calculated
  parts <- unlist(strsplit(search, " "))
  one <- parts[1]
  two <- parts[2]

  # counts of every single word
  uni <- textcnt(corpus, n = 1, method = "string")
  # total number of words (the sum of the counts, not the number of distinct words)
  numUni <- sum(uni)
  # individual word probabilities
  probUni <- uni/numUni

  print(probUni[one])
  print(probUni[two])

  # counts of every pair of adjacent words
  bigrams <- textcnt(corpus, n = 2, method = "string")
  # total number of bigrams
  numBi <- sum(bigrams)
  # probability of each bigram
  probBi <- bigrams/numBi

  print(probBi[search])
}
# enter the corpus and the two-word phrase as arguments
Prob(corp, "the whale")
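
Because the problem says the order of the two words is not important, "the whale" and "whale the" should both count towards the joint probability. This is a small base-R sketch of that idea on an already tokenised word vector. It is not what tau does internally, and the function name adjacent_pair_prob and the sample words are my own.

```r
# Probability that two adjacent words are w1 and w2, in either order.
adjacent_pair_prob <- function(words, w1, w2) {
  n <- length(words)
  # fewer than two words means there are no adjacent pairs
  if (n < 2) return(0)
  left <- words[-n]   # first word of each adjacent pair
  right <- words[-1]  # second word of each adjacent pair
  hits <- (left == w1 & right == w2) | (left == w2 & right == w1)
  # matching pairs divided by the total number of adjacent pairs
  sum(hits) / (n - 1)
}

words <- c("the", "whale", "saw", "the", "ship", "and", "whale", "the", "sea")
p_pair <- adjacent_pair_prob(words, "the", "whale")
print(p_pair)
```

Here two of the eight adjacent pairs match, one in each order, so swapping the two arguments gives the same result.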