CunyData605_Assignment#6

Problem Set # 1

How many outcomes are possible when rolling a fair die 3 times?

When rolling a fair die, there are 6 possible outcomes: 1 through 6. When the same die is rolled 3 times, each roll again has 6 outcomes, since the result of one roll has no bearing on the next. The rolls are independent events.

Total number of outcomes = \(6 \times 6 \times 6 = 6^{3} = 216\)
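
As a quick sanity check, the sample space can be enumerated directly. The sketch below uses base R's expand.grid(); the names rolls and n_outcomes are introduced here for illustration and are not part of the assignment.

```r
# Enumerate every ordered outcome of three rolls of a six-sided die
rolls <- expand.grid(roll1 = 1:6, roll2 = 1:6, roll3 = 1:6)

# Each row is one ordered outcome, so the row count is the size of the sample space
n_outcomes <- nrow(rolls)
print(n_outcomes)
```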

What is the probability of getting a sum total of 3 when you roll a die two times?

We assume that the die is fair and that the throws are identical, so the probability of getting any given number on a single roll is 1/6. Let X denote the random variable corresponding to the sum of the faces of a die rolled twice. We are looking for Pr(X = 3).
Because the die is fair and the rolls are made in the same manner, the 2 rolls are independent: the outcome of the first roll has no bearing on the outcome of the second. Each ordered pair of faces therefore has probability 1/6 × 1/6 = 1/36. The remaining question is how many outcomes give a sum of 3. There are only 2 such outcomes: {(1,2), (2,1)}.
Hence,
Pr(X = 3) = \(\frac{1}{6} \times \frac{1}{6} + \frac{1}{6} \times \frac{1}{6} = \frac{2}{36} = \frac{1}{18}\)
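
The count of favourable outcomes can also be checked by brute force. This is a minimal sketch in base R, assuming a fair six-sided die; pairs, favourable and p_sum_3 are illustrative names.

```r
# All ordered pairs (first roll, second roll) of a fair six-sided die
pairs <- expand.grid(first = 1:6, second = 1:6)

# Keep the pairs whose faces add up to 3
favourable <- pairs[pairs$first + pairs$second == 3, ]

# Every pair is equally likely, so the probability is favourable / total
p_sum_3 <- nrow(favourable) / nrow(pairs)
print(p_sum_3)
```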

What is the probability that for 25 strangers in a room 2 of them have the same birthday?

We assume that all 365 birthdays are equally likely, each with probability 1/365. The question of whether 2 people have the same birthday is not really restricted to exactly 2; it should be understood as at least 2 people having the same birthday. Computing this directly would mean considering 2 people sharing a birthday, 3 people sharing a birthday, and so on. Let X denote the event that at least 2 people share a birthday. It is easier to consider the complementary event Y: no 2 people share the same birthday.

Hence, Pr(X) = 1 - Pr(Y)

Let us find the probability that no 2 people in the group of 25 share the same birthday. We assume that each person's birthday is independent of the others.

Pr(Y) can be written as a product of probabilities, one per person, that the person does not share a birthday with any of the previous people.

Pr(Y) = Pr(1) × Pr(2) × Pr(3) × … × Pr(25)

For the first person, there are no previous people to consider, so Pr(1) = 365/365 = 1.
For the 2nd person to avoid sharing a birthday with person 1, 364 of the 365 days are available, so Pr(2) = 364/365.
For the 3rd person, only 363 days remain, so Pr(3) = 363/365.
For the kth person, 1 ≤ k ≤ 25, Pr(k) = (365 - (k-1))/365.
For the 25th person, Pr(25) = 341/365.

Pr(Y) = 1 × 364/365 × 363/365 × … × (365-(k-1))/365 × … × 341/365

If we rewrite 1 as 365/365:

Pr(Y) = (365 × 364 × 363 × … × (365-(k-1)) × … × 341)/365^25

Pr(Y) = \(\frac{365 \times 364 \times 363 \times \cdots \times 341}{365^{25}} = \frac{365!}{340! \times 365^{25}}\)

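m <- 365  # number of possible birthdays
n <- 25   # number of people in the room

# Direct evaluation overflows, since factorial(365) exceeds the double range:
# p_y <- factorial(m)/(factorial(m-n)*m^n)
## Warning in factorial(m): value out of range in 'gammafn'

# Work on the log scale with lfactorial() and exponentiate at the end
p_y <- exp((lfactorial(m) - lfactorial(m - n)) - n * log(m))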

Pr(Y) = 0.4313

Pr(X) = 0.5687
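
The log-factorial result can be cross-checked in two ways. The sketch below multiplies the 25 per-person factors directly and compares with pbirthday() from base R's stats package; the names p_y_product and p_x_builtin are introduced here for illustration.

```r
m <- 365  # number of possible birthdays
n <- 25   # number of people

# Direct product of the per-person factors (365/365) * (364/365) * ... * (341/365)
p_y_product <- prod((m - 0:(n - 1)) / m)

# stats::pbirthday() returns Pr(X), the chance of at least one shared birthday
p_x_builtin <- pbirthday(n, classes = m, coincident = 2)

print(round(c(p_y = p_y_product, p_x = 1 - p_y_product, p_x_builtin = p_x_builtin), 4))
```

The product form is safe here because every factor lies between 0 and 1, so no intermediate value overflows the way factorial(365) does.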

We would expect this probability to increase with 50 people. Replacing n with 50 in the calculation above gives the new probability.

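m <- 365  # number of possible birthdays
n <- 50   # number of people in the room

# Direct evaluation overflows, since factorial(365) exceeds the double range:
# p_y <- factorial(m)/(factorial(m-n)*m^n)
## Warning in factorial(m): value out of range in 'gammafn'

# Work on the log scale with lfactorial() and exponentiate at the end
p_y <- exp((lfactorial(m) - lfactorial(m - n)) - n * log(m))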

Pr(Y) = 0.0296

Pr(X) = 0.9704

We can see that with 50 people, it is almost guaranteed that at least 2 people will have the same birthday.
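
To see how quickly the probability grows with group size, the same formula can be wrapped in a function and evaluated over a range of n. This is a sketch under the same equal-likelihood assumption; p_shared, group_sizes and n_half are hypothetical names introduced here.

```r
# Probability that at least two of n people share a birthday,
# assuming m equally likely birthdays (same log-scale formula as above)
p_shared <- function(n, m = 365) {
  1 - exp(lfactorial(m) - lfactorial(m - n) - n * log(m))
}

group_sizes <- 2:60
probs <- sapply(group_sizes, p_shared)

# Smallest group size for which a shared birthday is more likely than not
n_half <- group_sizes[which(probs > 0.5)[1]]
print(n_half)
```

Under these assumptions the probability first exceeds one half at 23 people, which is consistent with the value of 0.5687 found for 25.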

Problem Set # 2

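The following addresses how to find the relative frequency of single words and of 2-word combinations (bigrams) in a given text. The text is expected to be in a .txt file, and the full path should be passed to the function. The function returns a list of 2 named vectors, one for single words and one for bigrams, sorted from most to least frequent. Note that, as written, each count is divided by the number of distinct terms (length(freq)), not by the total number of words.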

library(tm)

## Loading required package: NLP

library(RWeka)

dtm_generate <- function(path_file){
  
  # Load the text file as a character vector, one element per line
  text_doc <- readLines(path_file, encoding = 'UTF-8')

  # Create a Corpus from the character vector. The .txt file is not loaded directly with DirectorySource so that it can be read with encoding 'UTF-8'
  docs <- Corpus(VectorSource(text_doc))
  
  # Perform some clean-up on the Corpus
  
  # Custom content transformer that replaces every match of a pattern with a space
  toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})

  # Replace the following characters with a space: "-", ":", "'", " -"
  docs <- tm_map(docs, toSpace, "-")
  docs <- tm_map(docs, toSpace, ":")
  docs <- tm_map(docs, toSpace, "'")
  docs <- tm_map(docs, toSpace, "'")
  docs <- tm_map(docs, toSpace, " -")

  # Remove Punctuation
  docs <- tm_map(docs, removePunctuation)

  # Remove Numbers
  docs <- tm_map(docs, removeNumbers)

  # Remove Extra Whitespace
  docs <- tm_map(docs, stripWhitespace)

  # Convert to Lowercase
  docs <- tm_map(docs, content_transformer(tolower))

  # Treat Document as plain text
  docs <- tm_map(docs, PlainTextDocument)  
  
  # Create Document Term Matrices for individual words and for bigrams
  dtm <- DocumentTermMatrix(docs)
  
  # Sets the default number of threads to use
  options(mc.cores=1)
  #http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka/20251039#20251039
  
  # Create Bigram
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  
  dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))
  
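  # Calculate frequencies
  # Single-term calculation: total count of each term across all lines
  freq <- colSums(as.matrix(dtm))

  # Sort counts in decreasing order
  ord <- order(freq, decreasing = TRUE)

  # Divide each count by the number of distinct terms
  freq_p <- freq[ord] / length(freq)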
  
  # Bigrams Calculations
  freq_biagram <- colSums(as.matrix(dtm_bigram))

  ord_biagram <- order(freq_biagram, decreasing = TRUE)

  freq_biagram_p <- freq_biagram[ord_biagram]/length(freq_biagram)
  
  # Return both sorted vectors as a list (single words first, bigrams second)
  l_result <- list(freq_p, freq_biagram_p)
  l_result
}

path_file <- "C:/Users/vbrio/Documents/Cuny/DATA_605/assign6/Texts/assign6.sample.txt"

frequency_list <- dtm_generate(path_file)

## Warning in readLines(path_file, encoding = "UTF-8"): incomplete final
## line found on 'C:/Users/vbrio/Documents/Cuny/DATA_605/assign6/Texts/
## assign6.sample.txt'

single_words <- frequency_list[1]
words_pair  <- frequency_list[2]

# print top 10 entries
unlist(single_words)[1:10]

##         the         and         for        said        that    tutwiler 
##  0.13097345  0.06725664  0.05309735  0.03893805  0.03185841  0.02477876 
##      prison corrections         are        been 
##  0.01946903  0.01769912  0.01592920  0.01592920

unlist(words_pair)[1:10]

##        at tutwiler               in a             in the 
##        0.005071851        0.005071851        0.004226543 
##             of the federal government           had been 
##        0.004226543        0.003381234        0.003381234 
##            he said justice department     of corrections 
##        0.003381234        0.003381234        0.003381234 
##           she said 
##        0.003381234
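
The values printed above are counts divided by the number of distinct terms (length(freq)). If the goal is each term's share of all words, the denominator would be the total word count instead. The sketch below illustrates that normalization in base R only, without tm or RWeka, on a made-up sentence; sample_text and all other names here are hypothetical, and the tokenization is deliberately simpler than the tm pipeline.

```r
# A made-up sample text for illustration (not the assignment file)
sample_text <- "The cat sat on the mat. The dog sat too."

# Lowercase, replace runs of non-letters with a space, then split on spaces
clean <- gsub("[^a-z]+", " ", tolower(sample_text))
words <- strsplit(trimws(clean), " +")[[1]]

# Word shares: count of each word divided by the total number of words
word_share <- sort(table(words) / length(words), decreasing = TRUE)

# Bigrams: paste each word with its successor
bigrams <- paste(head(words, -1), tail(words, -1))
bigram_share <- sort(table(bigrams) / length(bigrams), decreasing = TRUE)

print(head(word_share, 3))
print(head(bigram_share, 3))
```

With this denominator the shares in each vector sum to 1, so they can be read directly as proportions of the text.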