The goal of this project is to build a simple model of the relationships between words. We will start with simple models and work toward more sophisticated modeling techniques. To begin, we were given corpora in four different languages (de_DE, en_US, fi_FI, ru_RU). Each set of corpora contains samples of blog posts, news articles, and tweets. We will focus on the English (en_US) corpora.
Tasks to accomplish
Build a basic n-gram model: using the exploratory analysis performed earlier, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
Build a model to handle unseen n-grams: in some cases people will want to type a combination of words that does not appear in the corpora, so the model must also handle n-grams it has never observed (a minimal backoff sketch follows below).
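To make the second task concrete, here is a minimal backoff-style sketch, not the final prediction model. It assumes frequency tables named corpus_trigram, corpus_bigram, and corpus_unigram with columns word and freq, sorted by decreasing frequency (as built later in this report), and cleaned, space-separated lowercase input.
# Minimal backoff sketch for unseen n-grams: try the trigram table,
# fall back to bigrams, and finally to the most frequent unigram.
predict_next <- function(context, tri, bi, uni) {
  words <- unlist(strsplit(tolower(trimws(context)), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    # Look for trigrams starting with the last two context words
    hits <- tri[grepl(paste0("^", words[n - 1], " ", words[n], " "), tri$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  }
  if (n >= 1) {
    # Back off to bigrams starting with the last context word
    hits <- bi[grepl(paste0("^", words[n], " "), bi$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", as.character(hits$word[1])))
  }
  # Unseen context: back off to the single most frequent unigram
  as.character(uni$word[1])
}
# Example call (once the tables below are built):
# predict_next("thanks for the", corpus_trigram, corpus_bigram, corpus_unigram)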
# Load the required libraries
library(tm)
## Loading required package: NLP
library(stringi)
## Warning: package 'stringi' was built under R version 3.3.3
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.3.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
con = file("en_US.blogs.txt", open = "r")
BlogLines = readLines(con)
BlogLength = length((BlogLines))
close(con)
BlogLength
## [1] 899288
con = file("en_US.news.txt", open = "r")
newsLines = readLines(con)
## Warning in readLines(con): incomplete final line found on 'en_US.news.txt'
newsLength <- length(newsLines)
close(con)
newsLength
## [1] 77259
con = file("en_US.twitter.txt", open = "r")
twitterLines = readLines(con)
## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul
tweetsLength <- length(twitterLines)
close(con)
tweetsLength
## [1] 2360148
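The embedded-nul warnings above are harmless here, but readLines has a skipNul argument that silently drops embedded nuls. And since stringi is already loaded, stri_count_words gives a quick word-count summary alongside the line counts. A brief sketch, reusing the objects read in above:
# Re-read the tweets, skipping embedded nuls to avoid the warnings
con <- file("en_US.twitter.txt", open = "r")
twitterLines <- readLines(con, skipNul = TRUE)
close(con)
# Approximate total word counts per corpus (stringi)
sum(stri_count_words(BlogLines))
sum(stri_count_words(newsLines))
sum(stri_count_words(twitterLines))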
# Convert to ASCII, marking non-convertible characters as byte codes
cleanedTwitter <- iconv(twitterLines, 'UTF-8', 'ASCII', sub = "byte")
# Take a random sample of 1,000 tweets to keep processing manageable
set.seed(1234)  # for a reproducible sample
twitterSample <- sample(cleanedTwitter, 1000)
# Build a tm corpus from the sample
doc.vec <- VectorSource(twitterSample)
doc.corpus <- Corpus(doc.vec)
# Convert to lower case (wrapped in content_transformer so tm keeps the
# corpus structure intact)
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
# Remove all punctuation
doc.corpus <- tm_map(doc.corpus, removePunctuation)
# Remove all numbers
doc.corpus <- tm_map(doc.corpus, removeNumbers)
# Collapse repeated whitespace
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
# Remove English stop words
doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
# Force everything back to plain text documents
doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
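As an optional sanity check (not part of the original pipeline), the first cleaned document can be printed to verify the transformations took effect:
# Peek at the first cleaned tweet
as.character(doc.corpus[[1]])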
# n-gram Modeling
# Bigram tokenizer (RWeka)
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# Trigram tokenizer (RWeka)
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Helper: turn a term-document matrix into a frequency table sorted by count
get_word_Freq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(word = names(freq), freq = freq)
}
# n-gram analysis of the sample data: build term-document matrices,
# drop very sparse terms, and tabulate frequencies
corpus_unigram <- get_word_Freq(removeSparseTerms(TermDocumentMatrix(doc.corpus), 0.9999))
corpus_bigram <- get_word_Freq(removeSparseTerms(TermDocumentMatrix(doc.corpus, control = list(tokenize = bigram)), 0.9999))
corpus_trigram <- get_word_Freq(removeSparseTerms(TermDocumentMatrix(doc.corpus, control = list(tokenize = trigram)), 0.9999))
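ggplot2 is loaded above but not otherwise used in this section; a quick bar chart of the most frequent n-grams (a sketch using the frequency tables just built) helps surface interesting patterns:
# Plot the 20 most frequent bigrams in the Twitter sample
top_bigrams <- head(corpus_bigram, 20)
ggplot(top_bigrams, aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Bigram", y = "Frequency",
       title = "Top 20 bigrams in the Twitter sample")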
Next steps: report any interesting findings amassed so far, and get feedback on the plans for the prediction algorithm and Shiny app.