When you roll a fair die 3 times, how many possible outcomes are there?
#the sign of the beast
6*6*6
## [1] 216
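As a cross-check on the multiplication rule, the outcomes can also be listed explicitly and counted. This is a minimal sketch in base R; the column names `first`, `second` and `third` and the variable `n_outcomes` are illustrative choices, not part of the original answer.

```r
# Enumerate every ordered outcome of three rolls of a six-sided die
rolls <- expand.grid(first = 1:6, second = 1:6, third = 1:6)

# Each row is one distinct ordered outcome, so the row count is the answer
n_outcomes <- nrow(rolls)
n_outcomes
```

The row count should agree with 6*6*6, since each roll independently contributes six possibilities.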
What is the probability of getting a sum total of 3 when you roll a die two times?
#The following uses the R dice package / install.packages("dice")
library(dice)
#a sum of 3 requires one 1 and one 2, in either order
getEventProb(nrolls = 2,
             ndicePerRoll = 1,
             nsidesPerDie = 6,
             eventList = list(1,2),
             orderMatters = FALSE)
## [1] 0.05555556
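The same probability can be checked without the dice package by enumerating all 36 ordered outcomes of two rolls. This is a small base R sketch; the variable names `sums` and `p_sum3` are illustrative.

```r
# 6 x 6 table holding the sum for every ordered pair (first roll, second roll)
sums <- outer(1:6, 1:6, "+")

# Fraction of the 36 equally likely outcomes whose sum is exactly 3
p_sum3 <- mean(sums == 3)
p_sum3
```

Only (1, 2) and (2, 1) sum to 3, so the fraction is 2/36, which matches the getEventProb result.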
Assume a room of 25 strangers. What is the probability that at least two of them share a birthday? Assume that all birthdays are equally likely, each with probability 1/365. What happens to this probability when there are 50 people in the room?
birthFunction<-function(n){
  return(1 - prod((365:(365 - n + 1))/rep(365, n)))
}
#in a room with 25 people
birthFunction(25)
## [1] 0.5686997
#in a room with 50 people
birthFunction(50)
## [1] 0.9703736
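A quick simulation offers an independent sanity check on the closed-form answer. The sketch below assumes 365 equally likely birthdays, as in the question; the function name `simulateBirthday`, the seed, and the number of trials are arbitrary illustrative choices.

```r
set.seed(605)  # arbitrary seed so the run is repeatable

# Hypothetical helper: estimate the chance that at least two of n people
# share a birthday by drawing random birthdays many times
simulateBirthday <- function(n, trials = 20000) {
  shared <- replicate(trials, {
    days <- sample.int(365, n, replace = TRUE)  # one random room of n people
    anyDuplicated(days) > 0                     # TRUE if any birthday repeats
  })
  mean(shared)  # proportion of rooms with a shared birthday
}

est25 <- simulateBirthday(25)
est50 <- simulateBirthday(50)
c(est25, est50)
```

With this many trials the estimates should land close to the exact values of roughly 0.569 and 0.970, differing only by sampling noise.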
Write a program to take a document in English and print out the estimated probabilities for each of the words that occur in that document. Your program should take in a file containing a large document and write out the probabilities of each of the words that appear in that document. Please remove all punctuation (quotes, commas, hyphens, etc.) and convert the words to lowercase before you perform your calculations.
Extend your program to calculate the probability of two words occurring adjacent to each other. It should take in a document and two words (say "the" and "for") and compute the probability of each of the words occurring in the document and the joint probability of both of them occurring together. The order of the two words is not important.
#The following uses the R quanteda package / install.packages("quanteda")
library(quanteda)
words <- readLines('https://raw.githubusercontent.com/RobertSellers/605_MATH/master/assign6/assign6.sample.txt',encoding="UTF-8")
gramText <- function (characterData){
  #skip blank rows (assumes every second line of the input is blank)
  words <- characterData[seq(1,length(characterData),2)]
  #paste everything into a single string
  wordsInLine = paste(words, collapse = " ")
  #quanteda corpus conversions
  #note: toLower, removePunct, removeNumbers and ngrams are dfm() arguments
  #from the older quanteda release used here; newer releases name them differently
  #unigrams
  unigramCorpus<-dfm(wordsInLine,
                     toLower = TRUE,
                     removePunct = TRUE,
                     removeNumbers = TRUE)
  #bigrams
  bigramCorpus<-dfm(wordsInLine, ngrams = 2,
                    toLower = TRUE,
                    removePunct = TRUE,
                    removeNumbers = TRUE)
  #calculate column sums
  unigramFreq <- colSums(unigramCorpus)
  bigramFreq <- colSums(bigramCorpus)
  unigramFreq <- sort(unigramFreq,decreasing=TRUE)
  bigramFreq <- sort(bigramFreq,decreasing=TRUE)
  return(list(unigramFreq,bigramFreq))
}
#running the program
grams<-gramText(words)
#probabilities unigram
head(grams[[1]]/sum(grams[[1]]))
##        the          a        and        for         to         of
## 0.05697151 0.03373313 0.02848576 0.02323838 0.02098951 0.02098951
#probabilities bigram
head(grams[[2]]/sum(grams[[2]]))
##               in_a        at_tutwiler             of_the
##        0.004501125        0.004501125        0.003750938
##             in_the justice_department        the_federal
##        0.003750938        0.003000750        0.003000750
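The bigram table above counts ordered pairs, so "for_the" and "the_for" appear as separate entries, while the question asks for adjacency regardless of order. The following base R sketch shows one way to combine both orders for a chosen pair of words. The helper name `wordPairProb` is hypothetical and not part of quanteda, and the cleaning rule (delete every character that is not a letter or whitespace, which also drops digits) is an assumption about how punctuation should be handled.

```r
# Hypothetical helper: probability of each word and of the two words
# appearing next to each other in either order
wordPairProb <- function(text, w1, w2) {
  # Lower-case the text and delete everything except letters and whitespace
  clean <- gsub("[^a-z[:space:]]", "", tolower(paste(text, collapse = " ")))
  tokens <- strsplit(trimws(clean), "[[:space:]]+")[[1]]
  n <- length(tokens)

  # Line up each token with its right-hand neighbour
  left  <- tokens[-n]
  right <- tokens[-1]
  together <- (left == w1 & right == w2) | (left == w2 & right == w1)

  c(p_w1 = mean(tokens == w1),             # share of tokens equal to w1
    p_w2 = mean(tokens == w2),             # share of tokens equal to w2
    p_adjacent = sum(together) / (n - 1))  # share of adjacent pairs
}

# Toy sentence: 9 tokens, 8 adjacent pairs, "for the" occurs twice
example <- wordPairProb("The cat sat for the dog, for the cat.", "the", "for")
example
```

On the assignment text this could be called as wordPairProb(words, "the", "for"). The adjacency probability is taken over the number of adjacent pairs, which mirrors how the bigram frequencies above are normalised by their total.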
#plotting the results
unigrams<-unlist(grams[1])
bigrams<-unlist(grams[2])
#keep unigrams seen more than 10 times and bigrams seen more than 3 times
uniTop <- unigrams[unigrams > 10]
biTop <- bigrams[bigrams > 3]
barplot(uniTop,main="Most Frequent Unigrams",ylab="Frequency",xlab="",las=2)
barplot(biTop,main="Most Frequent Bigrams",ylab="Frequency",xlab="",las=2)