DATA 605 Assignment 6

1. Problem set 1
2. Problem set 2

1. Problem set 1

When you roll a fair die 3 times, how many possible outcomes are there?

If \(r\) experiments are performed with \(n_i\) possible outcomes for each experiment where \(i\in \mathbb{Z}\), then there are a total of \(\prod _{ i=1 }^{ r }{ n_i }\) possible outcomes.ⁱ

r = 3
n = 6
n^r

## [1] 216

What is the probability of getting a sum total of 3 when you roll a die two times?

There are \(n^r=6^2\) possible outcomes of which two are equal to \(3\), therefore \(P\left( X_{ 1 }+X_{ 2 }=3 \right) =\frac { 2 }{ 36 } =\frac { 1 }{ 18 }\)

outcomes <- seq(1,n)
sum_table <- matrix(outcomes, nrow=n, ncol=n)
sum_totals <- sum_table + t(sum_table); sum_totals

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    3    4    5    6    7
## [2,]    3    4    5    6    7    8
## [3,]    4    5    6    7    8    9
## [4,]    5    6    7    8    9   10
## [5,]    6    7    8    9   10   11
## [6,]    7    8    9   10   11   12

table(sum_totals)[[3]] / length(sum_totals)

## [1] 0.08333333

Assume a room of 25 strangers. What is the probability that two of them have the same birthday? Assume that all birthdays are equally likely and equal to \(\frac { 1 }{ 365 }\) each. What happens to this probability when there are 50 people in the room?

The probability that no two people out of a group of \(n\) will have matching birthdays out of \(d\) equally possible birthdays is equal to \(Q(n, d)=\frac { d! }{ \left( d-n \right) !{ d }^{ n } }\). ⁱⁱ Therefore, the probability that two people out of a group of \(n\) will have matching birthdays out of \(d\) equally possible birthdays is equal to \(1 - Q(n, d)\).

d <- 365
n <- 25
1 - prod(seq(d - n + 1, d)) / d^n

## [1] 0.5686997

n <- 50
1 - prod(seq(d - n + 1, d)) / d^n

## [1] 0.9703736

R cannot handle large factorials such as \(365!\) which has a 779 digit answer.ⁱⁱⁱ A “value out of range in gamma” error is produced. There are workarounds online but the simplest thing to do in this case was to simplify the equation algebraically so that it reduces to \(1-\frac { d! }{ \left( d-n \right) !{ d }^{ r } } =1-\left[ \frac { d! }{ \left( d-n \right) ! } \right] \left( \frac { 1 }{ { d }^{ r } } \right) =1-\left[ \frac { (1)\cdots (d-n)(d-n+1)\cdots (d-1)\left( d \right) }{ (1)\cdots (d-n) } \right] \left( \frac { 1 }{ { d }^{ r } } \right) =1-\left[ \frac { (d-n+1)\cdots (d-1)\left( d \right) }{ 1 } \right] \left( \frac { 1 }{ { d }^{ r } } \right) =1-\frac { (d-n+1)\cdots (d-1)\left( d \right) }{ { d }^{ r } }\).

References

ⁱ http://mathworld.wolfram.com/CountingGeneralizedPrinciple.html

ⁱⁱ http://mathworld.wolfram.com/BirthdayProblem.html

ⁱⁱⁱ http://www.calculatorsoup.com/calculators/discretemathematics/factorials.php

2. Problem set 2

Sometimes you cannot compute the probability of an outcome by measuring the sample space and examining the symmetries of the underlying physical phenomenon, as you could do when you rolled die or picked a card from a shuffled deck. You have to estimate probabilities by other means. For instance, when you have to compute the probability of various English words, it is not possible to do it by examination of the sample space as it is too large. You have to resort to empirical techniques to get a good enough estimate. One such approach would be to take a large corpus of documents and from those documents, count the number of occurrences of a particular character or word and then base your estimate on that.

Write a program to take a document in English and print out the estimated probabilities for each of the words that occur in that document. Your program should take in a file containing a large document and write out the probabilities of each of the words that appear in that document. Please remove all punctuation (quotes, commas, hyphens etc) and convert the words to lower case before you perform your calculations.

Extend your program to calculate the probability of two words occurring adjacent to each other. It should take in a document, and two words (say the and for) and compute the probability of each of the words occurring in the document and the joint probability of both of them occurring together. The order of the two words is not important. Use the accompanying document for your testing purposes. Compare your probabilities of various words with the Time Magazine corpus: http://corpus.byu.edu/time/

Take in files containing a large document.

Reuters-21578 is currently the most widely used test collection for text categorization research. The data was originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system. ^iv,v

Reuters_21578 <- character()
for (i in 0:21) {
  file <- sprintf("reut2-%03d", i)
  url <- paste0("https://raw.githubusercontent.com/",
                "jzuniga123/SPS/master/DATA%20605/", file, ".sgm")
  Reuters_21578 <- paste(paste(readLines(url), collapse=" "), Reuters_21578)
}

The raw corpus data is 26.4MB.

Extract text from HTML body elements and then remove all remaining HTML tags.

library(stringr)

words <- unlist(str_extract_all(Reuters_21578, "<BODY>.+?</BODY>"))
words <- str_replace_all(words, "<.*?>", " ")

The data first reduces to 15.8MB, then to 15.7MB.

Remove all punctuation (quotes, commas, hyphens, etc).

words <- str_replace_all(words, "[^[:alpha:]]", " ")

The data further reduces to 15.6MB.

Convert all the words to lower case.

words <- str_to_lower(words)

The data finally reduces to 15.5MB.

Count the number of occurrences of a particular word.

First the data is split wherever white space is encountered and then empty strings are removed.

word <- unlist(str_split(words, "[[:blank:]]+"))
word <- word[nchar(word) != 0]
obs <- length(word); obs

## [1] 2499923

We are left with 2,499,923 words.

Print out the estimated probabilities for each of the words that occur in the document.

freq <- as.data.frame(table(word))
prob <- round(freq[ ,2] / obs, 6)
results <- cbind(freq, prob)
head(results[order(-results[ ,2], results[ ,1]), ])

##       word   Freq     prob
## 34878  the 144339 0.057737
## 24096   of  72347 0.028940
## 35224   to  68943 0.027578
## 1471   and  53923 0.021570
## 30282 said  52915 0.021167
## 16837   in  52672 0.021069

Calculate the probability of two words (the and for) occurring adjacent to each other (the joint probability of both of them occurring together).

This was calculated first under the assumption of independence then empirically.

the_freq <- results[results[1] == "the"]
for_freq <- results[results[1] == "for"]
the_prob <- as.numeric(the_freq[3])
for_prob <- as.numeric(for_freq[3])
ind_prob <- the_prob * for_prob; ind_prob

## [1] 0.000601735

for_the <- unlist(str_match_all(words, " for the "))
the_for <- unlist(str_match_all(words, " the for "))
for_the_freq <- length(for_the)
the_for_freq <- length(the_for)
emp_prob <- (for_the_freq + the_for_freq) / obs; emp_prob

## [1] 0.002304471

This data set does not exhibit independence between the words “the” and “for” since \(P\left( A \right) P\left( B \right) = 6.0173501\times 10^{-4} \neq P\left( A\cap B \right) = 0.0023045\). This can be also seen in the disparity between the frequency of the ordered pairs “for the” and “the for” which occur 5760 times and 1 time in the data set, respectively.

Compare probabilities of the word occurences with the Time Magazine 100,000,000 word corpus.

the_TIME <- 6367449 / 100000000
for_TIME <- 950685 / 100000000
emp_TIME <- (186862 + 15) / 100000000

The word “the” occurred 6,367,449 times, yeilding a probability of 0.0636745.
The word “for” occurred 950,685 times, yeilding a probability of 0.0095068.
The ordered word pairs “for the” and “the for” occurred 186,862 times and 15 times respectively, yielding a joint probability of 0.0018688.

compare <-  data.frame(c(the_prob, the_TIME, abs((the_prob -the_TIME) / the_TIME)),
                       c(for_prob, for_TIME, abs((for_prob - for_TIME) / for_TIME)),
                       c(emp_prob, emp_TIME, abs((emp_prob - emp_TIME) / emp_TIME)))
rownames(compare) <- c("Reuters", "TIME", "Error")
colnames(compare) <- c("P(the)", "P(for)", "P(the,for)")
compare

##             P(the)     P(for)  P(the,for)
## Reuters 0.05773700 0.01042200 0.002304471
## TIME    0.06367449 0.00950685 0.001868770
## Error   0.09324755 0.09626217 0.233148530

References

^iv http://www.daviddlewis.com/resources/testcollections/reuters21578/

^v https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection

DATA 605 Assignment 6

Jose Zuniga

1. Problem set 1

References

2. Problem set 2

References