Sometimes you cannot compute the probability of an outcome by measuring the sample space and examining the symmetries of the underlying physical phenomenon, as you could when you rolled a die or picked a card from a shuffled deck. You have to estimate probabilities by other means. For instance, when you have to compute the probability of various English words, you cannot do it by examining the sample space, because it is far too large. You have to resort to empirical techniques to get a good enough estimate. One such approach is to take a large corpus of documents, count the number of occurrences of a particular character or word in those documents, and base your estimate on that. Write a program that takes a document in English and prints out the estimated probabilities for each of the words that occur in that document. Your program should take in a file containing a large document and write out the probabilities of each of the words that appear in it. Please remove all punctuation (quotes, commas, hyphens, etc.) and convert the words to lower case before you perform your calculations.
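Before the full solution below, the core idea can be sketched in a couple of lines of base R: count each word and divide by the total number of words. The vector words here is a made-up toy example, not the assignment document.

words <- c("the", "cat", "sat", "on", "the", "mat")   # toy token vector
counts <- table(words)              # frequency of each word
probs <- counts / sum(counts)       # relative frequency = estimated probability
probs["the"]                        # 2/6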

library(tm)
## Loading required package: NLP
library(stringr)
library(plyr)

setwd("C:/Pnl/assign6")

docTxt <- readLines("assign6.sample.txt", encoding="UTF-8") # UTF-8
## Warning in readLines("assign6.sample.txt", encoding = "UTF-8"): incomplete
## final line found on 'assign6.sample.txt'
summary(docTxt)
##    Length     Class      Mode 
##        75 character character
#remove all punctuation (quotes, commas, hyphens etc)
#and convert the words to lower case before you perform your calculations
clean_corpus <- function(corpus) {
  output <- tolower(corpus)                                                        # lower-case everything
  output <- str_replace_all(output, pattern = "'", replacement = " ")              # drop apostrophes
  output <- str_replace_all(output, pattern = "[^[:print:]]", replacement = " ")   # drop non-printable characters
  output <- str_replace_all(output, pattern = "[^[:alnum:][:space:]]", replacement = " ")  # drop remaining punctuation
  output <- str_trim(output, side = 'both')
  # split each line on whitespace; runs of spaces leave empty tokens,
  # which are removed further down
  output <- str_split(output, pattern = "[[:space:]]")
  
  output
}
news_clean <- clean_corpus(docTxt)
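
As a quick sanity check (the sentence below is made up purely for illustration), clean_corpus() lower-cases its input, replaces punctuation with spaces, and splits on whitespace; runs of spaces leave empty tokens, which are dropped later.

clean_corpus("It's a Test -- with, punctuation!")
# returns a list with one element containing the tokens
# "it" "s" "a" "test" ... plus empty strings where punctuation was stripped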

#Total number of words in the document (empty tokens left by the split are excluded)
word.count <- sum(unlist(news_clean) != "")
n <- length(news_clean)   # number of lines in the document

MyTable <- table(unlist(news_clean))

#Convert List to Dataframe
news_clean.df <- as.data.frame(MyTable)

#Removing "" from dataframe
news_clean.df <-news_clean.df[news_clean.df$Var1!="",]

#estimated probabilities
#for each of the words that occur in that document.
data <- data.frame(Word = news_clean.df$Var1,
                   Count = news_clean.df$Freq,
                   Prob = news_clean.df$Freq / word.count)
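
Two quick checks on the resulting table (the word "the" is just an illustrative choice): the estimated probabilities should sum to one, and any individual word can be looked up by name.

# sanity checks on the word-probability table
sum(data$Prob)              # should be 1, up to floating-point error
data[data$Word == "the", ]  # estimated probability of a single word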

Extend your program to calculate the probability of two words occurring adjacent to each other. It should take in a document and two words (say "the" and "for"), compute the probability of each of the words occurring in the document, and compute the joint probability of both of them occurring together. The order of the two words is not important.

# flatten the cleaned word list into one character vector
df <- unlist(news_clean)

# drop stray single-space tokens
df <- df[df != " "]
NROW(df)
## [1] 1573
# one row per adjacent word pair; column 3 holds the two words joined with a space
word.pair.mat <- matrix(nrow = NROW(df) - 1, ncol = 3)

for (i in 1:(NROW(df) - 1))
{
  word.pair.mat[i,] <- c(df[i], df[i + 1], str_c(df[i], df[i + 1], sep = " "))
}

word.pair.df <- data.frame(first_word  = word.pair.mat[,1],
                           second_word = word.pair.mat[,2],
                           word_pair   = word.pair.mat[,3])
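
For reference, the same pairing can be built without an explicit loop; the sketch below (the name word.pair.df2 is mine) uses head() and tail() to line up each word with its successor.

# vectorised construction of the adjacent-word pairs (equivalent to the loop above)
first  <- head(df, -1)   # every word except the last
second <- tail(df, -1)   # every word except the first
word.pair.df2 <- data.frame(first_word  = first,
                            second_word = second,
                            word_pair   = str_c(first, second, sep = " "))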

#find the frequency of adjacent words
mytable <- table(word.pair.mat[,3])
word.pair.dffreq <- as.data.frame(mytable)

# probability of each adjacent pair; the denominator is the number of pairs
word.pair.dfProb <- data.frame(Word  = word.pair.dffreq$Var1,
                               Count = word.pair.dffreq$Freq,
                               Prob  = word.pair.dffreq$Freq / nrow(word.pair.mat))
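
Finally, a small helper ties the two tables together as the second part of the assignment asks: given two words, it reports each word's individual probability and the probability of the pair occurring adjacently, in either order. The function name adjacent_prob and the example words are my own choices; this is only a sketch built on the data frames above.

# P(w1), P(w2) and P(w1 adjacent to w2, in either order)
adjacent_prob <- function(w1, w2, word.df, pair.df) {
  p1 <- sum(word.df$Prob[word.df$Word == w1])
  p2 <- sum(word.df$Prob[word.df$Word == w2])
  # pair keys were built as "first second", so check both orders
  keys <- c(str_c(w1, w2, sep = " "), str_c(w2, w1, sep = " "))
  p.joint <- sum(pair.df$Prob[pair.df$Word %in% keys])
  c(p_first = p1, p_second = p2, p_joint = p.joint)
}

adjacent_prob("the", "for", data, word.pair.dfProb)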