(1) When you roll a fair die 3 times, how many possible outcomes are there?
For each roll, there are 6 possible outcomes, so the total number of possible outcomes for three rolls of a fair die is:
(6 * 6 * 6) or 6^3 = 216 outcomes
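As a quick check, the sample space can be enumerated in R:
# enumerate every outcome of three rolls and count the rows
nrow(expand.grid(1:6, 1:6, 1:6))
## [1] 216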
(2) What is the probability of getting a sum total of 3 when you roll a die two times?
Number of ways of getting a sum of 3 = 2, namely (1, 2) and (2, 1)
Number of possible outcomes = 6^2 = 36
Probability of getting a sum of 3 on 2 rolls of a die = 2/36 = 1/18
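This too can be verified by enumeration in R (1/18 ≈ 0.0556):
# proportion of two-roll outcomes that sum to 3
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
mean(rolls$die1 + rolls$die2 == 3)
## [1] 0.05555556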
(3) Assume a room of 25 strangers. What is the probability that two of them have the same birthday? Assume that all birthdays are equally likely and equal to 1/365 each. What happens to this probability when there are 50 people in the room?
To solve this problem, we first calculate the probability that no two people in the group share a birthday. Since the probability that an event happens and the probability that it does not happen sum to 1, we can then obtain the probability of a shared birthday in the group as the complement.
To calculate the probability of no shared birthdays, we use the following approach, shown here for a group of n = 5 people:
(365/365) * (364/365) * (363/365) * (362/365) * (361/365)
Alternatively, for n = 5, this can be expressed as:
(1 - 0/365) * (1 - 1/365) * (1 - 2/365) * (1 - 3/365) * (1 - 4/365)
Extending this to R, the probability of no shared birthdays for 5 people can be calculated as follows:
n = 5
p = 1 - seq(0, n - 1)/365
prod(p) # calculate the product of elements 1 through 5 of the vector
## [1] 0.9728644
# or combining the two commands
p <- prod(1 - seq(0, n-1)/365)
Applying the R calculation to 25 people:
n <- 25
p <- prod(1 - seq(0, n-1)/365)
Probability of no shared birthdays = 0.4313003
Probability of at least one shared birthday = 0.5686997 (1 - P(no shared birthdays))
For 50 people:
n <- 50
p <- prod(1 - seq(0, n-1)/365)
Probability of no shared birthdays = 0.0296264
Probability of at least one shared birthday = 0.9703736 (1 - P(no shared birthdays))
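The calculation generalizes naturally into a small helper function (birthday_prob is a name introduced here for illustration; the printed values match those computed above):
# probability that at least two of n people share a birthday
birthday_prob <- function(n) {
    1 - prod(1 - seq(0, n - 1)/365)
}
birthday_prob(25)
## [1] 0.5686997
birthday_prob(50)
## [1] 0.9703736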
Of note, R has a standard function, pbirthday(), to calculate the birthday problem:
(pbirthday(25, coincident = 2))
## [1] 0.5686997
(pbirthday(50, coincident = 2))
## [1] 0.9703736
Sometimes you cannot compute the probability of an outcome by measuring the sample space and examining the symmetries of the underlying physical phenomenon, as you could when you rolled a die or picked a card from a shuffled deck. You have to estimate probabilities by other means. For instance, when you have to compute the probability of various English words, it is not possible to do so by examination of the sample space, as it is too large. You have to resort to empirical techniques to get a good enough estimate. One such approach is to take a large corpus of documents, count the number of occurrences of a particular character or word in those documents, and base your estimate on those counts.
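To illustrate that counting approach before the full solution below, here is a minimal base-R sketch (the file name corpus.txt is a placeholder, not the assignment's actual input):
# read a document, lowercase it, strip punctuation, and estimate each
# word's probability as its relative frequency
text  <- tolower(readLines("corpus.txt", warn = FALSE))
words <- unlist(strsplit(gsub("[[:punct:]]", " ", text), "\\s+"))
words <- words[words != ""]
word.probs <- sort(table(words)/length(words), decreasing = TRUE)
head(word.probs)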
Write a program to take a document in English and print out the estimated probabilities for each of the words that occur in that document. Your program should take in a file containing a large document and write out the probabilities of each of the words that appear in that document. Please remove all punctuation (quotes, commas, hyphens, etc.) and convert the words to lower case before you perform your calculations.
Extend your program to calculate the probability of two words occurring adjacent to each other. It should take in a document and two words (say "the" and "for") and compute the probability of each of the words occurring in the document, as well as the joint probability of both of them occurring together. The order of the two words is not important. Use the accompanying document for your testing purposes. Compare your probabilities of various words with the Time Magazine corpus: http://corpus.byu.edu/time/
This solution will use the quanteda, tidyr, and dplyr packages. Quanteda will be used to create a corpus and clean the text prior to tokenization and the creation of a document-feature matrix (dfm).
suppressWarnings(suppressMessages(library(quanteda)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(knitr)))
# set working directory
setwd("~/DataScience/CUNY/DATA605/Week6/")
input.files <- textfile("assign6/assign6.sample.txt", cache = FALSE, encoding = "UTF-8")
Create the corpus using the text file input(s).
docCorpus <- corpus(input.files)
(doc.summary <- summary(docCorpus, removePunct=TRUE, removeNumbers = TRUE))
## Corpus consisting of 1 document.
##
## Text Types Tokens Sentences
## assign6.sample.txt 596 1334 67
##
## Source: C:/Users/keith/Documents/DataScience/CUNY/DATA605/Week6/* on x86-64 by keith
## Created: Sun Oct 02 20:44:26 2016
## Notes:
## Text Types Tokens Sentences
## 1 assign6.sample.txt 596 1334 67
dfm <- dfm(docCorpus, stem = FALSE, removePunct=TRUE, removeNumbers = TRUE, removeHyphens = TRUE)
## Creating a dfm from a corpus ...
##
## ... lowercasing
##
## ... tokenizing
##
## ... indexing documents: 1 document
##
## ... indexing features:
## 567 feature types
##
## ... created a 1 x 568 sparse dfm
## ... complete.
## Elapsed time: 0.1 seconds.
# convert the document-feature matrix to a dataframe
words.df <- data.frame(Content = features(dfm), Frequency = colSums(dfm),
row.names = NULL, stringsAsFactors = FALSE)
Confirm there are no duplicate words:
filter(words.df, duplicated(words.df))
## [1] Content Frequency
## <0 rows> (or 0-length row.names)
Capture the total number of tokenized words and calculate the relative probability of each word occurring in the text:
total_words <- doc.summary$Tokens
words.df$Probability <- words.df$Frequency/total_words
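Below are the first 20 rows of the result; a sketch of how they can be displayed with knitr::kable (loaded earlier):
kable(head(words.df, 20))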
Content | Frequency | Probability |
---|---|---|
for | 31 | 0.0232384 |
a | 45 | 0.0337331 |
female | 2 | 0.0014993 |
inmate | 1 | 0.0007496 |
there | 6 | 0.0044978 |
are | 9 | 0.0067466 |
few | 1 | 0.0007496 |
places | 1 | 0.0007496 |
worse | 1 | 0.0007496 |
than | 7 | 0.0052474 |
the | 76 | 0.0569715 |
julia | 2 | 0.0014993 |
tutwiler | 15 | 0.0112444 |
prison | 11 | 0.0082459 |
women | 6 | 0.0044978 |
corrections | 10 | 0.0074963 |
officers | 6 | 0.0044978 |
have | 9 | 0.0067466 |
raped | 2 | 0.0014993 |
beaten | 1 | 0.0007496 |
Create a function to return the probability of the occurrence of a word:
get_word_probability <- function(word) {
    Probability <- words.df %>%
        filter(Content == word) %>%
        select(Probability)
    # check nrow(), not length(): the length of a data frame is its
    # column count, so it can never be 0 for an unmatched word
    if (nrow(Probability) == 0) return (0)
    return (Probability)
}
get_word_probability <- Vectorize(get_word_probability)
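As a quick sanity check, look up a couple of words; per the frequency table above, the values should be roughly 0.0570 for "the" and 0.0232 for "for":
# look up the probabilities of two words from the frequency table
unlist(get_word_probability(c("the", "for")))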
# create a document-feature matrix of bigrams
dfm_bigrams <- dfm(docCorpus, stem = FALSE, removePunct=TRUE,
removeNumbers = TRUE, removeHyphens = TRUE, ngrams = 2)
## Creating a dfm from a corpus ...
##
## ... lowercasing
##
## ... tokenizing
##
## ... indexing documents: 1 document
##
## ... indexing features:
## 1,212 feature types
##
## ... created a 1 x 1213 sparse dfm
## ... complete.
## Elapsed time: 0.06 seconds.
# convert the document-feature matrix of bigrams to a dataframe
bigrams.df <- data.frame(Content = features(dfm_bigrams),
Bigram.Frequency = colSums(dfm_bigrams),
row.names = NULL, stringsAsFactors = FALSE)
Confirm there are no duplicate bigrams:
filter(bigrams.df, duplicated(bigrams.df))
## [1] Content Bigram.Frequency
## <0 rows> (or 0-length row.names)
Create a function to return the probability of the occurrence of a specific bigram:
get_bigram_probability <- function(bigram) {
    bigram1 <- bigram
    # reverse the words in the bigram:
    # the bigram "for_the" would become "the_for"
    reversed <- unlist(strsplit(bigram, "_"))
    bigram2 <- paste0(reversed[2], "_", reversed[1])
    # search the dataframe of bigrams for both versions of the bigram: "for_the" and "the_for"
    bigram.prob <- bigrams.df %>%
        filter(Content == bigram1 | Content == bigram2) %>%
        summarize(Probability = sum(Bigram.Frequency)/total_words)
    # summarize() always returns one row, so an unmatched bigram simply yields a probability of 0
    return (bigram.prob)
}
get_bigram_probability <- Vectorize(get_bigram_probability)
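As a quick check of the function: per the results below, "for_the" occurs 4 times and "the_for" not at all, so the joint probability should be roughly 0.0030.
# joint probability of "for" and "the" appearing adjacently, in either order
unlist(get_bigram_probability("for_the"))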
Finally, let’s process the document’s bigrams and, for each one, calculate the probability of each constituent word and the joint probability of the bigram itself:
process_bigrams <- function(df) {
bgram.df <- data.frame()
# separate the bigrams into word1 and word2
words <- df %>% separate(Content, c("word1", "word2"), sep = "_") %>% select(word1, word2)
bgram.df <- cbind(df, words)
# add the probability of the first word in the bigram
bgram.df$word1_probability <- unlist(get_word_probability(bgram.df$word1))
# add the probability of the second word in the bigram
bgram.df$word2_probability <- unlist(get_word_probability(bgram.df$word2))
# add the probability of the bigram occurring
bgram.df$bigram_probability <- unlist(get_bigram_probability(bgram.df$Content))
return (bgram.df)
}
# Calculate the probabilities of the bigrams
result.df <- process_bigrams(bigrams.df)
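The top 20 rows, sorted by joint bigram probability, are shown below; a sketch of how they can be rendered with kable (the sort order and row count are assumptions based on the table itself):
result.df %>%
    arrange(desc(bigram_probability)) %>%
    head(20) %>%
    kable()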
Content | Bigram.Frequency | word1 | word2 | word1_probability | word2_probability | bigram_probability |
---|---|---|---|---|---|---|
in_a | 6 | in | a | 0.0209895 | 0.0337331 | 0.0044978 |
at_tutwiler | 6 | at | tutwiler | 0.0089955 | 0.0112444 | 0.0044978 |
of_the | 5 | of | the | 0.0209895 | 0.0569715 | 0.0037481 |
in_the | 5 | in | the | 0.0209895 | 0.0569715 | 0.0037481 |
he_said | 4 | he | said | 0.0067466 | 0.0164918 | 0.0037481 |
said_he | 1 | said | he | 0.0164918 | 0.0067466 | 0.0037481 |
she_said | 4 | she | said | 0.0067466 | 0.0164918 | 0.0037481 |
said_she | 1 | said | she | 0.0164918 | 0.0067466 | 0.0037481 |
justice_department | 4 | justice | department | 0.0044978 | 0.0052474 | 0.0029985 |
the_federal | 4 | the | federal | 0.0569715 | 0.0037481 | 0.0029985 |
federal_government | 4 | federal | government | 0.0037481 | 0.0029985 | 0.0029985 |
for_the | 4 | for | the | 0.0232384 | 0.0569715 | 0.0029985 |
who_is | 4 | who | is | 0.0067466 | 0.0164918 | 0.0029985 |
to_be | 4 | to | be | 0.0209895 | 0.0029985 | 0.0029985 |
of_corrections | 4 | of | corrections | 0.0209895 | 0.0074963 | 0.0029985 |
had_been | 4 | had | been | 0.0044978 | 0.0067466 | 0.0029985 |
corrections_officers | 3 | corrections | officers | 0.0074963 | 0.0044978 | 0.0022489 |
more_than | 3 | more | than | 0.0044978 | 0.0052474 | 0.0022489 |
but_tutwiler | 1 | but | tutwiler | 0.0067466 | 0.0112444 | 0.0022489 |
the_legislature | 3 | the | legislature | 0.0569715 | 0.0022489 | 0.0022489 |