DATA 605 - Assignment 6

PROBLEM SET 1

When you roll a fair die 3 times, how many possible outcomes are there?

#6 possible faces on each of 3 independent rolls
6*6*6

## [1] 216
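The count can also be checked by listing every outcome explicitly. The following is a minimal sketch in base R; the column names `first`, `second`, and `third` are illustrative labels, not part of the assignment.

```r
# Enumerate every ordered outcome of three rolls of a six-sided die
rolls <- expand.grid(first = 1:6, second = 1:6, third = 1:6)

# Each row is one outcome, so the row count is the number of outcomes
nrow(rolls)
```

Each row of `rolls` is one ordered triple, so the row count should agree with 6*6*6.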

What is the probability of getting a sum total of 3 when you roll a die two times?

#The following uses getEventProb() from the R dice package / install.packages("dice")
library(dice)

#a sum of 3 requires one roll of 1 and one roll of 2, in either order
getEventProb(nrolls = 2,
  ndicePerRoll = 1,
  nsidesPerDie = 6,
  eventList = list(1,2),
  orderMatters = FALSE)

## [1] 0.05555556
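As a cross-check that does not depend on the dice package, the same probability can be obtained by brute-force enumeration. This is a sketch in base R; the names `rolls` and `probSumThree` are introduced here for illustration only.

```r
# All 36 equally likely ordered outcomes of two rolls of a fair six-sided die
rolls <- expand.grid(first = 1:6, second = 1:6)

# Fraction of outcomes whose sum is 3: only (1,2) and (2,1) qualify
probSumThree <- mean(rolls$first + rolls$second == 3)
probSumThree
```

Two of the 36 outcomes qualify, so the result should equal 2/36, matching the value returned by getEventProb().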

Assume a room of 25 strangers. What is the probability that two of them have the same birthday? Assume that all birthdays are equally likely and equal to 1/365 each. What happens to this probability when there are 50 people in the room?

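#probability that at least two of n people share a birthday:
#1 minus the probability that all n birthdays are distinct
birthFunction<-function(n){
    return(1 - prod((365:(365 - n + 1))/365))
}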
#in a room with 25 people
birthFunction(25)

## [1] 0.5686997

#in a room with 50 people
birthFunction(50)

## [1] 0.9703736
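The closed-form values can be sanity-checked with a small Monte Carlo simulation. The sketch below assumes the same model as the problem (365 equally likely birthdays, no leap years). The function name `simulateBirthday`, the seed, and the trial count are arbitrary choices made for this illustration.

```r
set.seed(605)  # arbitrary seed, used only to make the run reproducible

# Estimate the probability that at least two of n people share a birthday
simulateBirthday <- function(n, trials = 20000) {
  shared <- replicate(trials, {
    days <- sample(365, n, replace = TRUE)  # n independent uniform birthdays
    anyDuplicated(days) > 0                 # TRUE if any birthday repeats
  })
  mean(shared)
}

simulateBirthday(25)
simulateBirthday(50)
```

With this many trials the estimates should land close to the exact values from birthFunction(), with small differences due to sampling noise.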

PROBLEM SET 2

Write a program to take a document in English and print out the estimated probabilities for each of the words that occur in that document. Your program should take in a file containing a large document and write out the probabilities of each of the words that appear in that document. Please remove all punctuation (quotes, commas, hyphens etc) and convert the words to lower case before you perform your calculations.

Extend your program to calculate the probability of two words occurring adjacent to each other. It should take in a document and two words (say "the" and "for") and compute the probability of each of the words occurring in the document and the joint probability of both of them occurring together. The order of the two words is not important.

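#The following uses the R quanteda package / install.packages("quanteda")
#note: toLower, removePunct and ngrams are dfm() arguments from the 2016 quanteda API;
#newer releases may require tokens() and tokens_ngrams() instead
library(quanteda)

words <- readLines('https://raw.githubusercontent.com/RobertSellers/605_MATH/master/assign6/assign6.sample.txt',encoding="UTF-8")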

gramText <- function (characterData){
  #skip blank rows (assumes every second line of the input is blank)
  words <- characterData[seq(1,length(characterData),2)]
  #paste everything into a single string
  wordsInLine = paste(words, collapse = " ")
  #quanteda corpus conversions
  #unigrams
  unigramCorpus<-dfm(wordsInLine, 
                  toLower = TRUE,
                  removePunct = TRUE, 
                  removeNumbers = TRUE)
  #bigrams
  bigramCorpus<-dfm(wordsInLine, ngrams = 2,
                  toLower = TRUE,
                  removePunct = TRUE, 
                  removeNumbers = TRUE)
  #calculate column sums
  unigramFreq <- colSums(unigramCorpus)
  bigramFreq <- colSums(bigramCorpus)
    
  unigramFreq <- sort(unigramFreq,decreasing=TRUE)
  bigramFreq <- sort(bigramFreq,decreasing=TRUE)

  return(list(unigramFreq,bigramFreq))
}

#running the program
grams<-gramText(words)

#probabilities unigram
head(grams[[1]]/sum(grams[[1]]))

##        the          a        and        for         to         of 
## 0.05697151 0.03373313 0.02848576 0.02323838 0.02098951 0.02098951

#probabilities bigram
head(grams[[2]]/sum(grams[[2]]))

##               in_a        at_tutwiler             of_the 
##        0.004501125        0.004501125        0.003750938 
##             in_the justice_department        the_federal 
##        0.003750938        0.003000750        0.003000750
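The problem statement also asks for the probabilities of two specific words and of their occurring next to each other in either order. The following is a minimal base R sketch of that step, independent of quanteda. The helper name `wordPairProb`, its arguments, and the example sentence are hypothetical and introduced only for illustration. It assumes the adjacency probability is normalised by the number of adjacent token pairs, in the same way the bigram frequencies above are divided by their total.

```r
# Hypothetical helper: probability of each word and of the two words being adjacent
wordPairProb <- function(text, word1, word2) {
  word1 <- tolower(word1)
  word2 <- tolower(word2)
  # lower-case, then drop every character that is not a-z or whitespace
  # (this removes punctuation and digits, and also any non-ASCII letters)
  cleaned <- gsub("[^a-z[:space:]]", "", tolower(paste(text, collapse = " ")))
  tokens <- strsplit(trimws(cleaned), "[[:space:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  n <- length(tokens)
  stopifnot(n >= 2)  # at least one adjacent pair is needed
  # compare every token with its right-hand neighbour, ignoring word order
  left <- tokens[-n]
  right <- tokens[-1]
  together <- (left == word1 & right == word2) | (left == word2 & right == word1)
  c(p_word1 = mean(tokens == word1),
    p_word2 = mean(tokens == word2),
    p_adjacent = sum(together) / (n - 1))
}

# Made-up example text, not taken from the assignment's sample file
doc <- "The cat sat on the mat. The dog sat for the cat, and for the bird."
wordPairProb(doc, "the", "for")
```

In the example text there are 16 tokens, "the" occurs 5 times, "for" occurs twice, and the two words are adjacent in 2 of the 15 neighbouring pairs. Stripping punctuation by deletion joins hyphenated words and contractions into single tokens, which is one possible reading of the instructions; replacing punctuation with spaces would be another.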

#plotting the results
unigrams<-unlist(grams[1])
bigrams<-unlist(grams[2])
#keep unigrams that occur more than 10 times and bigrams that occur more than 3 times
uniTop <- unigrams[unigrams > 10]
biTop <- bigrams[bigrams > 3]

barplot(uniTop,main="Most Frequent Unigrams",ylab="Frequency",xlab="",las=2)

barplot(biTop,main="Most Frequent Bigrams",ylab="Frequency",xlab="",las=2)

Robert Sellers

October 2, 2016