(1) When you roll a fair die 3 times, how many possible outcomes are there?
For each roll, there are 6 possible outcomes, so the total number of possible outcomes for three rolls of a fair die is:
(6 * 6 * 6) or 6^3 = 216 outcomes
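As a quick check, the sample space can be enumerated in R:
# enumerate every outcome of three rolls and count the rows
nrow(expand.grid(1:6, 1:6, 1:6))
## [1] 216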
(2) What is the probability of getting a sum total of 3 when you roll a die two times?
Number of ways of getting a sum of 3 = 2, namely (1, 2) and (2, 1)
Number of possible outcomes = 6^2 = 36
Probability of getting a sum of 3 on 2 rolls of a die = 2/36 = 1/18
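This too can be verified by enumeration in R (1/18 ≈ 0.0556):
# proportion of two-roll outcomes that sum to 3
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
mean(rolls$die1 + rolls$die2 == 3)
## [1] 0.05555556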
(3) Assume a room of 25 strangers. What is the probability that two of them have the same birthday? Assume that all birthdays are equally likely and equal to 1/365 each. What happens to this probability when there are 50 people in the room?
To solve this problem, we first calculate the probability that no two people in the group share a birthday. Since the probability that an event happens and the probability that it does not happen sum to 1, we can then obtain the probability of a shared birthday in the group as the complement.
To calculate the probability of no shared birthdays, we use the following approach, shown here for a group of n = 5 people:
(365/365) * (364/365) * (363/365) * (362/365) * (361/365)
Alternatively, for n = 5, this can be expressed as:
(1 - 0/365) * (1 - 1/365) * (1 - 2/365) * (1 - 3/365) * (1 - 4/365)
Extending this to R, the probability of no shared birthdays for 5 people can be calculated as follows:
n = 5
p = 1 - seq(0, n - 1)/365
prod(p) # calculate the product of elements 1 through 5 of the vector
## [1] 0.9728644
# or combining the two commands
p <- prod(1 - seq(0, n-1)/365)
Applying the R calculation to 25 people:
n <- 25
p <- prod(1 - seq(0, n-1)/365)
Probability of no shared birthdays = 0.4313003
Probability of at least one shared birthday = 0.5686997 (1 - P(no shared birthdays))
For 50 people:
n <- 50
p <- prod(1 - seq(0, n-1)/365)
Probability of no shared birthdays = 0.0296264
Probability of at least one shared birthday = 0.9703736 (1 - P(no shared birthdays))
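The calculation generalizes naturally into a small helper function (birthday_prob is a name introduced here for illustration; the printed values match those computed above):
# probability that at least two of n people share a birthday
birthday_prob <- function(n) {
    1 - prod(1 - seq(0, n - 1)/365)
}
birthday_prob(25)
## [1] 0.5686997
birthday_prob(50)
## [1] 0.9703736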
Of note, R has a standard function, pbirthday(), to calculate the birthday problem:
(pbirthday(25, coincident = 2))
## [1] 0.5686997
(pbirthday(50, coincident = 2))
## [1] 0.9703736
Sometimes you cannot compute the probability of an outcome by measuring the sample space and examining the symmetries of the underlying physical phenomenon, as you could when you rolled a die or picked a card from a shuffled deck. You have to estimate probabilities by other means. For instance, when you have to compute the probability of various English words, it is not possible to do so by examination of the sample space, as it is too large. You have to resort to empirical techniques to get a good enough estimate. One such approach is to take a large corpus of documents, count the number of occurrences of a particular character or word in those documents, and base your estimate on those counts.
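To illustrate that counting approach before the full solution below, here is a minimal base-R sketch (the file name corpus.txt is a placeholder, not the assignment's actual input):
# read a document, lowercase it, strip punctuation, and estimate each
# word's probability as its relative frequency
text  <- tolower(readLines("corpus.txt", warn = FALSE))
words <- unlist(strsplit(gsub("[[:punct:]]", " ", text), "\\s+"))
words <- words[words != ""]
word.probs <- sort(table(words)/length(words), decreasing = TRUE)
head(word.probs)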
Write a program to take a document in English and print out the estimated probabilities for each of the words that occur in that document. Your program should take in a file containing a large document and write out the probabilities of each of the words that appear in that document. Please remove all punctuation (quotes, commas, hyphens, etc.) and convert the words to lower case before you perform your calculations.
Extend your program to calculate the probability of two words occurring adjacent to each other. It should take in a document and two words (say "the" and "for") and compute the probability of each of the words occurring in the document, as well as the joint probability of both of them occurring together. The order of the two words is not important. Use the accompanying document for your testing purposes. Compare your probabilities of various words with the Time Magazine corpus: http://corpus.byu.edu/time/
This solution will use the quanteda, tidyr, and dplyr packages. Quanteda will be used to create a corpus and clean the text prior to tokenization and the creation of a document-feature matrix (dfm).
suppressWarnings(suppressMessages(library(quanteda)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(knitr)))
# set working directory
setwd("~/DataScience/CUNY/DATA605/Week6/")
input.files <- textfile("assign6/assign6.sample.txt", cache = FALSE, encoding = "UTF-8")
Create the corpus using the text file input(s).
docCorpus <- corpus(input.files)
(doc.summary <- summary(docCorpus, removePunct=TRUE, removeNumbers = TRUE))
## Corpus consisting of 1 document.
##
## Text Types Tokens Sentences
## assign6.sample.txt 596 1334 67
##
## Source: C:/Users/keith/Documents/DataScience/CUNY/DATA605/Week6/* on x86-64 by keith
## Created: Sun Oct 02 20:44:26 2016
## Notes:
## Text Types Tokens Sentences
## 1 assign6.sample.txt 596 1334 67
dfm <- dfm(docCorpus, stem = FALSE, removePunct=TRUE, removeNumbers = TRUE, removeHyphens = TRUE)
## Creating a dfm from a corpus ...
##
## ... lowercasing
##
## ... tokenizing
##
## ... indexing documents: 1 document
##
## ... indexing features:
## 567 feature types
##
## ... created a 1 x 568 sparse dfm
## ... complete.
## Elapsed time: 0.1 seconds.
# convert the document-feature matrix to a dataframe
words.df <- data.frame(Content = features(dfm), Frequency = colSums(dfm),
row.names = NULL, stringsAsFactors = FALSE)
Confirm there are no duplicate words:
filter(words.df, duplicated(words.df))
## [1] Content Frequency
## <0 rows> (or 0-length row.names)
Capture the total number of tokenized words and calculate the relative probability of each word occurring in the text:
total_words <- doc.summary$Tokens
words.df$Probability <- words.df$Frequency/total_words
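Below are the first 20 rows of the result; a sketch of how they can be displayed with knitr::kable (loaded earlier):
kable(head(words.df, 20))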
Content | Frequency | Probability |
---|---|---|
for | 31 | 0.0232384 |
a | 45 | 0.0337331 |
female | 2 | 0.0014993 |
inmate | 1 | 0.0007496 |
there | 6 | 0.0044978 |
are | 9 | 0.0067466 |
few | 1 | 0.0007496 |
places | 1 | 0.0007496 |
worse | 1 | 0.0007496 |
than | 7 | 0.0052474 |
the | 76 | 0.0569715 |
julia | 2 | 0.0014993 |
tutwiler | 15 | 0.0112444 |
prison | 11 | 0.0082459 |
women | 6 | 0.0044978 |
corrections | 10 | 0.0074963 |
officers | 6 | 0.0044978 |
have | 9 | 0.0067466 |
raped | 2 | 0.0014993 |
beaten | 1 | 0.0007496 |
Create a function to return the probability of the occurrence of a word:
get_word_probability <- function(word) {
    Probability <- words.df %>%
        filter(Content == word) %>%
        select(Probability)
    # check nrow(), not length(): the length of a data frame is its
    # column count, so it can never be 0 for an unmatched word
    if (nrow(Probability) == 0) return (0)
    return (Probability)
}
get_word_probability <- Vectorize(get_word_probability)
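As a quick sanity check, look up a couple of words; per the frequency table above, the values should be roughly 0.0570 for "the" and 0.0232 for "for":
# look up the probabilities of two words from the frequency table
unlist(get_word_probability(c("the", "for")))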
# create a document-feature matrix of bigrams
dfm_bigrams <- dfm(docCorpus, stem = FALSE, removePunct=TRUE,
removeNumbers = TRUE, removeHyphens = TRUE, ngrams = 2)
## Creating a dfm from a corpus ...
##
## ... lowercasing
##
## ... tokenizing
##
## ... indexing documents: 1 document
##
## ... indexing features:
## 1,212 feature types
##
## ... created a 1 x 1213 sparse dfm
## ... complete.
## Elapsed time: 0.06 seconds.
# convert the document-feature matrix of bigrams to a dataframe
bigrams.df <- data.frame(Content = features(dfm_bigrams),
Bigram.Frequency = colSums(dfm_bigrams),
row.names = NULL, stringsAsFactors = FALSE)
Confirm there are no duplicate bigrams:
filter(bigrams.df, duplicated(bigrams.df))
## [1] Content Bigram.Frequency
## <0 rows> (or 0-length row.names)
Create a function to return the probability of the occurrence of a specific bigram:
get_bigram_probability <- function(bigram) {
    bigram1 <- bigram
    # reverse the words in the bigram:
    # the bigram "for_the" would become "the_for"
    reversed <- unlist(strsplit(bigram, "_"))
    bigram2 <- paste0(reversed[2], "_", reversed[1])
    # search the dataframe of bigrams for both versions of the bigram: "for_the" and "the_for"
    bigram.prob <- bigrams.df %>%
        filter(Content == bigram1 | Content == bigram2) %>%
        summarize(Probability = sum(Bigram.Frequency)/total_words)
    # summarize() always returns one row, so an unmatched bigram simply yields a probability of 0
    return (bigram.prob)
}
get_bigram_probability <- Vectorize(get_bigram_probability)
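As a quick check of the function: per the results below, "for_the" occurs 4 times and "the_for" not at all, so the joint probability should be roughly 0.0030.
# joint probability of "for" and "the" appearing adjacently, in either order
unlist(get_bigram_probability("for_the"))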
Finally, let’s process the document’s bigrams and, for each one, calculate the probability of each constituent word and the joint probability of the bigram itself:
process_bigrams <- function(df) {
bgram.df <- data.frame()
# separate the bigrams into word1 and word2
words <- df %>% separate(Content, c("word1", "word2"), sep = "_") %>% select(word1, word2)
bgram.df <- cbind(df, words)
# add the probability of the first word in the bigram
bgram.df$word1_probability <- unlist(get_word_probability(bgram.df$word1))
# add the probability of the second word in the bigram
bgram.df$word2_probability <- unlist(get_word_probability(bgram.df$word2))
# add the probability of the bigram occurring
bgram.df$bigram_probability <- unlist(get_bigram_probability(bgram.df$Content))
return (bgram.df)
}
# Calculate the probabilities of the bigrams
result.df <- process_bigrams(bigrams.df)
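The top 20 rows, sorted by joint bigram probability, are shown below; a sketch of how they can be rendered with kable (the sort order and row count are assumptions based on the table itself):
result.df %>%
    arrange(desc(bigram_probability)) %>%
    head(20) %>%
    kable()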
Content | Bigram.Frequency | word1 | word2 | word1_probability | word2_probability | bigram_probability |
---|---|---|---|---|---|---|
in_a | 6 | in | a | 0.0209895 | 0.0337331 | 0.0044978 |
at_tutwiler | 6 | at | tutwiler | 0.0089955 | 0.0112444 | 0.0044978 |
of_the | 5 | of | the | 0.0209895 | 0.0569715 | 0.0037481 |
in_the | 5 | in | the | 0.0209895 | 0.0569715 | 0.0037481 |
he_said | 4 | he | said | 0.0067466 | 0.0164918 | 0.0037481 |
said_he | 1 | said | he | 0.0164918 | 0.0067466 | 0.0037481 |
she_said | 4 | she | said | 0.0067466 | 0.0164918 | 0.0037481 |
said_she | 1 | said | she | 0.0164918 | 0.0067466 | 0.0037481 |
justice_department | 4 | justice | department | 0.0044978 | 0.0052474 | 0.0029985 |
the_federal | 4 | the | federal | 0.0569715 | 0.0037481 | 0.0029985 |
federal_government | 4 | federal | government | 0.0037481 | 0.0029985 | 0.0029985 |
for_the | 4 | for | the | 0.0232384 | 0.0569715 | 0.0029985 |
who_is | 4 | who | is | 0.0067466 | 0.0164918 | 0.0029985 |
to_be | 4 | to | be | 0.0209895 | 0.0029985 | 0.0029985 |
of_corrections | 4 | of | corrections | 0.0209895 | 0.0074963 | 0.0029985 |
had_been | 4 | had | been | 0.0044978 | 0.0067466 | 0.0029985 |
corrections_officers | 3 | corrections | officers | 0.0074963 | 0.0044978 | 0.0022489 |
more_than | 3 | more | than | 0.0044978 | 0.0052474 | 0.0022489 |
but_tutwiler | 1 | but | tutwiler | 0.0067466 | 0.0112444 | 0.0022489 |
the_legislature | 3 | the | legislature | 0.0569715 | 0.0022489 | 0.0022489 |