# presumably the number of outcomes when rolling a six-sided die three times: 6^3
(6 * 6 * 6)
## [1] 216
# presumably the chance of one specific unordered pair on two dice: 2 * (1/6 * 1/6)
(1/6 * 1/6 * 2)
## [1] 0.05555556
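Assuming the calculation above gives the chance of one specific unordered pair on two dice (the event chosen here, a 1 and a 2 in either order, is an assumed example), a quick enumeration over all 36 equally likely outcomes confirms the value:
outcomes <- expand.grid(d1 = 1:6, d2 = 1:6)
mean((outcomes$d1 == 1 & outcomes$d2 == 2) | (outcomes$d1 == 2 & outcomes$d2 == 1))
## [1] 0.05555556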
# probability that at least two of 25 people share a birthday (pairwise approximation)
(1 - (364/365)^((25 * 24)/2))
## [1] 0.5609078
# probability that at least two of 50 people share a birthday (pairwise approximation)
(1 - (364/365)^((50 * 49)/2))
## [1] 0.9652915
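The formula used above treats the choose(n, 2) pairs as independent, so it is an approximation. For comparison, a minimal sketch of the exact calculation (birthday_exact is my own helper, not part of the assignment):
# Exact probability that at least two of n people share a birthday:
# 1 - (365/365) * (364/365) * ... * ((365 - n + 1)/365)
birthday_exact <- function(n) 1 - prod((365 - seq_len(n) + 1) / 365)
birthday_exact(25)   # roughly 0.569, slightly above the 0.561 approximation
birthday_exact(50)   # roughly 0.970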
Write a program to take a document in English and print out the estimated probabilities for each of the words that occur in that document. Your program should take in a file containing a large document and write out the probabilities of each of the words that appear in it. Please remove all punctuation (quotes, commas, hyphens, etc.) and convert the words to lowercase before you perform your calculations.
Extend your program to calculate the probability of two words occurring adjacent to each other. It should take in a document and two words (say "the" and "for") and compute the probability of each word occurring in the document as well as the joint probability of both occurring together. The order of the two words is not important.
Use the accompanying document for your testing purposes. Compare your probabilities of various words with the Time Magazine corpus: http://corpus.byu.edu/time/
if (!require('stringr')) install.packages('stringr')
## Loading required package: stringr
if (!require('tm')) install.packages('tm')
## Loading required package: tm
## Loading required package: NLP
if (!require('RTextTools')) install.packages('RTextTools')
## Loading required package: RTextTools
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
if (!require('dplyr')) install.packages('dplyr')
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sample.text <- 'assign6.sample.txt'
ebook <- readLines(sample.text)
## Warning in readLines(sample.text): incomplete final line found on
## 'assign6.sample.txt'
length(ebook)
## [1] 75
ebook <- str_c(ebook, collapse = " ")
ProbWords <- function(One_or_two_Words) {
  searchWords <- One_or_two_Words
  # n-gram size = number of words in the search term (1 or 2)
  words.count <- sapply(strsplit(searchWords, " "), length)
  # convert the document into a corpus
  doc.vec <- VectorSource(ebook)
  doc.corpus <- Corpus(doc.vec)
  # corpus cleaning: lowercase, then strip numbers, punctuation and extra whitespace
  doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  # tokenize into n-grams and build a term-document matrix
  xgramTokenizer <- function(x) unlist(lapply(ngrams(words(x), words.count), paste, collapse = " "), use.names = FALSE)
  tdm <- TermDocumentMatrix(doc.corpus, control = list(tokenize = xgramTokenizer))
  # frequency table for the n-grams
  dcount <- sort(slam::row_sums(tdm), decreasing = TRUE)
  dfreq <- data.frame(word = names(dcount), freq = dcount)
  # single-word counts form the base for the probability calculation
  wcount <- TermDocumentMatrix(doc.corpus)
  wordfreq <- slam::row_sums(wcount)
  wordfreq <- data.frame(word = names(wordfreq), freq = wordfreq)
  # probability = n-gram frequency / total number of words in the document
  dfreq$prob <- round(dfreq$freq / sum(wordfreq$freq), 5)
  rownames(dfreq) <- NULL
  # the order of the two words is not important, so also match them reversed
  rev_searchWords <- paste0(strsplit(searchWords, ' ')[[1]][2], ' ', strsplit(searchWords, ' ')[[1]][1])
  dfreq %>%
    dplyr::filter(word == searchWords | word == rev_searchWords) %>%
    dplyr::summarize(Freq = sum(freq), Prob = sum(prob))
}
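The first part of the assignment asks for the probability of every word in the document; ProbWords builds that table internally (dfreq) but only returns the filtered rows. A minimal sketch of that step on its own, reusing the same cleaning pipeline (AllWordProbs is a hypothetical helper name):
# Hypothetical helper: frequency and probability of every word in the document
AllWordProbs <- function() {
  doc.corpus <- Corpus(VectorSource(ebook))
  doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
  doc.corpus <- tm_map(doc.corpus, removeNumbers)
  doc.corpus <- tm_map(doc.corpus, removePunctuation)
  doc.corpus <- tm_map(doc.corpus, stripWhitespace)
  wordfreq <- sort(slam::row_sums(TermDocumentMatrix(doc.corpus)), decreasing = TRUE)
  data.frame(word = names(wordfreq), freq = wordfreq,
             prob = round(wordfreq / sum(wordfreq), 5), row.names = NULL)
}
# e.g. head(AllWordProbs()) to inspect, or write.csv(AllWordProbs(), 'word_probs.csv') to write out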
# Enter one word
ProbWords('the')
## Freq Prob
## 1 76 0.0689
# Enter two words; the order doesn't matter, since the probabilities of the two orderings are combined
ProbWords('he said')
## Freq Prob
## 1 0 0
References:
https://github.com/wwells/CUNY_DATA_605/blob/master/Week06/WWells_Assign6.Rmd
http://www.exegetic.biz/blog/2013/09/text-mining-the-complete-works-of-william-shakespeare/
http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know