When rolling a fair die, there are 6 possible outcomes: 1 through 6. When the same fair die is rolled 3 times, each roll again has 6 possible outcomes, since the result of one roll has no bearing on the next. The rolls are independent events.
Total number of outcomes = \(6 \times 6 \times 6 = 6^{3} = 216\)
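As a quick check of the count, the ordered outcomes can be enumerated explicitly. This is only a sketch; the names `rolls` and `n_outcomes` are introduced here for illustration.

```r
# Enumerate every ordered outcome of three rolls of a six-sided die.
rolls <- expand.grid(first = 1:6, second = 1:6, third = 1:6)

# Each row is one ordered outcome, so the row count is the number of outcomes.
n_outcomes <- nrow(rolls)
n_outcomes
```

The row count should agree with \(6^{3}\).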
We assume that the die is fair and that the throws are identical, so the probability of getting any given number on a roll is 1/6. Let X be the random variable corresponding to the sum of the faces when the die is rolled twice. We are looking for Pr(X = 3).
Again, since the die is fair and the rolls are made in the same manner, the 2 rolls are independent: the outcome of the first roll has no bearing on the outcome of the 2nd. Each specific ordered pair of faces therefore has probability 1/6 x 1/6 = 1/36. The question remains: how many outcomes give a sum of 3? Only 2 outcomes have the desired sum: {(1,2), (2,1)}.
Hence,
Pr(X = 3) = \(\frac{1}{6} \times \frac{1}{6} + \frac{1}{6} \times \frac{1}{6} = \frac{2}{36} = \frac{1}{18}\)
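The same result can be confirmed by listing all 36 ordered outcomes of two rolls and counting those that sum to 3. This is a minimal sketch; `two_rolls`, `favourable` and `p_sum_3` are illustrative names.

```r
# All 36 equally likely ordered outcomes of two rolls.
two_rolls <- expand.grid(first = 1:6, second = 1:6)
roll_sum <- two_rolls$first + two_rolls$second

# Keep only the outcomes whose faces sum to 3.
favourable <- two_rolls[roll_sum == 3, ]

# Probability = favourable outcomes / total outcomes.
p_sum_3 <- nrow(favourable) / nrow(two_rolls)
p_sum_3
```

`favourable` should contain exactly the pairs (1,2) and (2,1), and `p_sum_3` should equal 1/18.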
We assume that all birthdays are equally likely, each with probability 1/365. The probability that 2 people have the same birthday is not restricted to exactly 2 people; it should be understood as at least 2 people sharing a birthday. Computing this directly would mean considering 2 people sharing a birthday, 3 people sharing a birthday, and so on. Let X denote the event that at least 2 people share a birthday. It is easier to work with the complementary event Y: no 2 people share a birthday.
Hence, Pr(X) = 1 - Pr(Y)
Let us find the probability that no 2 people in the group of 25 share a birthday. We treat each person's birthday as independent.
Pr(Y) can be written as a product of probabilities, one per person, that the person does not share a birthday with any of the previous people:
Pr(Y) = Pr(1) x Pr(2) x Pr(3) x … x Pr(25)
For the first person, there are no previous people to consider, so Pr(1) = 1.
For the 2nd person, 364 of the 365 days avoid person 1's birthday, so Pr(2) = 364/365.
For the 3rd person, only 363 days remain, so Pr(3) = 363/365.
For the kth person, k ≤ 25, Pr(k) = (365 - (k-1))/365.
For the 25th person, Pr(25) = 341/365.
Pr(Y) = 1 x 364/365 x 363/365 x … x (365-(k-1))/365 x … x 341/365
If we rewrite 1 as 365/365:
Pr(Y) = (365 x 364 x 363 x … x (365-(k-1)) x … x 341)/365^25
Pr(Y) = \(\frac{365 \times 364 \times 363 \times \cdots \times 341}{365^{25}} = \frac{365!}{340! \times 365^{25}}\)
m <- 365
n <- 25
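# The direct form below overflows, because factorial(365) is out of range:
# p_y <- factorial(m)/(factorial(m-n)*m^n)
## Warning in factorial(m): value out of range in 'gammafn'
# so the same ratio is computed on the log scale with lfactorial()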
p_y <- exp((lfactorial(m)-lfactorial(m-n)) - n*log(m))
Pr(Y) = 0.4313
Pr(X) = 0.5687
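As a cross-check that avoids factorials entirely, Pr(Y) can also be computed as the plain product of the per-person factors (365 - (k-1))/365. The helper name `p_no_match` is hypothetical and introduced only for this sketch.

```r
# Probability that n people all have distinct birthdays, out of m equally
# likely days, as the product of (m - (k - 1)) / m for k = 1, ..., n.
p_no_match <- function(n, m = 365) {
  k <- seq_len(n)
  prod((m - (k - 1)) / m)
}

p_y_direct <- p_no_match(25)
p_x_direct <- 1 - p_y_direct
round(c(p_y_direct, p_x_direct), 4)
```

Rounded to 4 decimals, these should agree with the values of Pr(Y) and Pr(X) reported above.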
We would expect this probability to increase when we deal with 50 people. By replacing n by 50 in the above calculation we find the new probability.
m <- 365
n <- 50
# As before, the direct form overflows in factorial(m):
# p_y <- factorial(m)/(factorial(m-n)*m^n)
## Warning in factorial(m): value out of range in 'gammafn'
p_y <- exp((lfactorial(m)-lfactorial(m-n)) - n*log(m))
Pr(Y) = 0.0296
Pr(X) = 0.9704
We can see that with 50 people, it is almost guaranteed that at least 2 people will have the same birthday.
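Both results can be sanity-checked with a small Monte Carlo simulation, under the same assumption of 365 equally likely and independent birthdays. This is a rough sketch: the function name `simulate_shared`, the seed and the number of trials are arbitrary choices, and the estimates will only approximate the exact values.

```r
set.seed(605)  # arbitrary seed, used only to make the run reproducible

# Estimate the probability that at least 2 of n people share a birthday.
simulate_shared <- function(n, trials = 10000, m = 365) {
  hits <- replicate(trials, {
    birthdays <- sample.int(m, size = n, replace = TRUE)
    any(duplicated(birthdays))  # TRUE if some birthday occurs twice
  })
  mean(hits)  # fraction of trials with a shared birthday
}

est_25 <- simulate_shared(25)
est_50 <- simulate_shared(50)
c(est_25, est_50)
```

With 10,000 trials the estimates should land within a few hundredths of the exact probabilities for 25 and 50 people.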
The following shows how to find the relative frequency of occurrence of single words and of 2-word combinations (bigrams) in a given text. The text is expected to be in a .txt file, and its full path should be passed to the function. The function returns a list of 2 named vectors containing these frequencies, sorted in decreasing order.
library(tm)
## Loading required package: NLP
library(RWeka)
dtm_generate <- function(path_file){
# Load the text file line by line
text_doc <- readLines(path_file, encoding = 'UTF-8')
# Build the Corpus from the character vector (VectorSource) rather than loading the .txt file through DirectorySource, so that the file is read with encoding 'UTF-8'
docs <- Corpus(VectorSource(text_doc))
# Perform some clean-up on the Corpus
# Create a custom content transformation to input character with space
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x))})
# Replace the following characters with a space: "-", ":", "'", " -"
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")
docs <- tm_map(docs, toSpace, "'")
docs <- tm_map(docs, toSpace, "'")
docs <- tm_map(docs, toSpace, " -")
# Remove Punctuation
docs <- tm_map(docs, removePunctuation)
# Remove Numbers
docs <- tm_map(docs, removeNumbers)
# Remove Extra Whitespace
docs <- tm_map(docs, stripWhitespace)
# Convert to Lowercase
docs <- tm_map(docs, content_transformer(tolower))
# Treat Document as plain text
docs <- tm_map(docs, PlainTextDocument)
# Create Document Term Matrices for individual words and for bigrams
dtm <- DocumentTermMatrix(docs)
# Sets the default number of threads to use
options(mc.cores=1)
#http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka/20251039#20251039
# Create Bigram
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))
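# Calculate frequencies
# Single-term calculation: total count of each term, sorted in decreasing order
freq <- colSums(as.matrix(dtm))
ord <- order(freq, decreasing = TRUE)
# Note: the denominator is the number of distinct terms, not the total word count,
# so these values do not in general sum to 1
freq_p <- freq[ord]/length(freq)
# Bigram calculation, normalized the same way
freq_biagram <- colSums(as.matrix(dtm_bigram))
ord_biagram <- order(freq_biagram, decreasing = TRUE)
freq_biagram_p <- freq_biagram[ord_biagram]/length(freq_biagram)
# Return both vectors as a list
l_result <- list(freq_p, freq_biagram_p)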
}
path_file <- "C:/Users/vbrio/Documents/Cuny/DATA_605/assign6/Texts/assign6.sample.txt"
frequency_list <- dtm_generate(path_file)
## Warning in readLines(path_file, encoding = "UTF-8"): incomplete final
## line found on 'C:/Users/vbrio/Documents/Cuny/DATA_605/assign6/Texts/
## assign6.sample.txt'
single_words <- frequency_list[1]
words_pair <- frequency_list[2]
# print top 10 entries
unlist(single_words)[1:10]
##         the         and         for        said        that    tutwiler
##  0.13097345  0.06725664  0.05309735  0.03893805  0.03185841  0.02477876
##      prison corrections         are        been
##  0.01946903  0.01769912  0.01592920  0.01592920
unlist(words_pair)[1:10]
##        at tutwiler               in a             in the
##        0.005071851        0.005071851        0.004226543
##             of the federal government           had been
##        0.004226543        0.003381234        0.003381234
##            he said justice department     of corrections
##        0.003381234        0.003381234        0.003381234
##           she said
##        0.003381234
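For comparison, the counting step can be sketched in base R without tm or RWeka. This is not the method used by `dtm_generate`. The sample sentence is made up, `rel_freq` is a hypothetical helper, and bigrams are formed across sentence boundaries for simplicity. Unlike `dtm_generate`, which divides by `length(freq)`, this sketch divides by the total number of tokens, so the shares sum to 1.

```r
# Made-up sample text, used only for illustration.
sample_text <- "The warden said the prison was safe. The inspectors said the prison was not."

# Lowercase, replace everything except letters and spaces, split on whitespace.
clean <- gsub("[^a-z ]", " ", tolower(sample_text))
words <- strsplit(trimws(clean), "\\s+")[[1]]

# Pair each word with its successor to form bigrams.
bigrams <- paste(head(words, -1), tail(words, -1))

# Share of all tokens taken by each distinct token, sorted in decreasing order.
rel_freq <- function(tokens) {
  counts <- sort(table(tokens), decreasing = TRUE)
  counts / sum(counts)
}

word_share <- rel_freq(words)
bigram_share <- rel_freq(bigrams)
head(word_share, 3)
head(bigram_share, 3)
```

Dividing by `sum(counts)` gives each value the direct reading "fraction of all words (or bigrams) in the text".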