The goal of this Data Science Capstone project is to build a predictive model for natural language processing (NLP). We are provided three documents - a Twitter feed, a US news feed, and blog data - that form the basis of the corpus used to predict the next word given a sentence.
The corpus used for this project was downloaded from the Capstone Dataset. The files in the data set are en_US.twitter.txt, en_US.news.txt and en_US.blogs.txt.
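For reproducibility, the archive can also be fetched and unpacked from R. The snippet below is a sketch only: the download URL and the corpus_complete directory layout used later in this report are assumptions, so adjust them to your setup.
# Sketch: download and unpack the Capstone dataset (URL and paths are assumptions)
zipUrl  <- "https://d396qusza40orc.cloudfront.net/dscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipFile)) {
    download.file(zipUrl, destfile = zipFile, mode = "wb")
}
if (!dir.exists("corpus_complete")) {
    # extract only the English files into corpus_complete/
    unzip(zipFile, files = c("final/en_US/en_US.twitter.txt",
                             "final/en_US/en_US.news.txt",
                             "final/en_US/en_US.blogs.txt"),
          junkpaths = TRUE, exdir = "corpus_complete")
}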
library(tm)            # text mining framework (corpus, tm_map, TermDocumentMatrix)
library(SnowballC)     # stemming support for tm
library(dplyr)         # data manipulation
library(RColorBrewer)  # color palettes for the word cloud
library(ggplot2)       # plotting
Sys.setenv(JAVA_HOME = "")            # let rJava pick up the default JVM
options(java.parameters = "-Xmx4g")   # allow the JVM 4 GB of heap for RWeka tokenization
library(rJava)
library(RWeka)         # NGramTokenizer for unigrams, bigrams and trigrams
library(wordcloud)     # word cloud visualization
As the dataset is huge, we use a file connection and the readLines function to read the data in chunks. The following code snippet reads the Twitter data and performs the basic analysis used to answer Quiz 1 of this course.
conTwitter <- file("en_US.twitter.txt", "r")
system('wc -l en_US.twitter.txt')
maxLength <- 0
maxLine <- ''
maxLinesPerIterator <- 10000
countLove <- 0
countHate <- 0
# Read the file in chunks of 10,000 lines to keep memory usage low
while (length(dataTwitter <- readLines(conTwitter, n = maxLinesPerIterator, warn = FALSE)) > 0) {
    # longest line in the current chunk
    maxLocation <- which.max(nchar(dataTwitter))
    maxLengthStep <- nchar(dataTwitter[maxLocation])
    maxLineStep <- dataTwitter[maxLocation]
    # count lines containing "love"/"hate" followed by a space or punctuation
    countLove <- countLove + length(grep('love[ .?]', dataTwitter))
    countHate <- countHate + length(grep('hate[ .?]', dataTwitter))
    biostatsLocation <- grep('biostats', dataTwitter)
    if (length(biostatsLocation) > 0) {
        print(dataTwitter[biostatsLocation])
    }
    sentenceLocation <- grep('A computer once beat me at chess, but it was no match for me at kickboxing', dataTwitter)
    if (length(sentenceLocation) > 0) {
        print(dataTwitter[sentenceLocation])
    }
    # keep track of the longest line seen so far
    if (maxLengthStep > maxLength) {
        maxLength <- maxLengthStep
        maxLine <- maxLineStep
    }
}
print(maxLine)
print(maxLength)
print(countLove/countHate)   # love/hate ratio for Quiz 1
close(conTwitter)
Here is a basic summary of the data set: line, word and character counts (wc) and file sizes (du).
system('wc corpus_complete/en_US.twitter.txt')
system('wc corpus_complete/en_US.news.txt')
system('wc corpus_complete/en_US.blogs.txt')
system('du -h corpus_complete/en_US.twitter.txt')
system('du -h corpus_complete/en_US.news.txt')
system('du -h corpus_complete/en_US.blogs.txt')
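Note that system() does not return shell output to R by default, so the counts above will not show up in a knitted report; one option (an assumption about how this report is rendered, not part of the original analysis) is to capture the output with intern = TRUE:
# Capture the shell output so it can be printed in the document
for (f in c("corpus_complete/en_US.twitter.txt",
            "corpus_complete/en_US.news.txt",
            "corpus_complete/en_US.blogs.txt")) {
    cat(system(paste("wc", f), intern = TRUE), "\n")
    cat(system(paste("du -h", f), intern = TRUE), "\n")
}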
For our prediction model, we will build multiple samples of the data for training purposes. The code below draws a sample containing roughly 30% of the original dataset using the sample function; the snippet shown reads the blogs data chunk by chunk and writes the sampled lines to a new file. A reusable version that covers all three files is sketched after it.
set.seed(1000)
maxLinesPerIterator <- 10000
sampleRatio <- 0.30
conUSBlogs <- file("corpus_complete/en_US.blogs.txt", "r")
# read the first chunk; linesRead drops to 0 once the end of the file is reached
linesRead <- length(dataUSBlogs <- readLines(conUSBlogs, n = maxLinesPerIterator, warn = FALSE))
while (linesRead > 0) {
    # keep roughly 30% of the lines in each chunk (the corpus_sample4 directory must already exist)
    sampleData <- sample(dataUSBlogs, round(linesRead * sampleRatio))
    write(sampleData, "corpus_sample4/en_US.blogs_sample4.txt", append = TRUE)
    linesRead <- length(dataUSBlogs <- readLines(conUSBlogs, n = maxLinesPerIterator, warn = FALSE))
}
close(conUSBlogs)
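The blogs file is shown above; the same chunked sampling can be wrapped in a small helper and applied to the news and Twitter files as well. This is a sketch only: the sampleFile function and the sample file names are illustrative, not part of the original code.
# Sketch: chunked sampling of any corpus file (function and file names are illustrative)
sampleFile <- function(inFile, outFile, ratio = 0.30, chunkSize = 10000) {
    con <- file(inFile, "r")
    on.exit(close(con))
    while ((n <- length(chunk <- readLines(con, n = chunkSize, warn = FALSE))) > 0) {
        write(sample(chunk, round(n * ratio)), outFile, append = TRUE)
    }
}
sampleFile("corpus_complete/en_US.news.txt",    "corpus_sample4/en_US.news_sample4.txt")
sampleFile("corpus_complete/en_US.twitter.txt", "corpus_sample4/en_US.twitter_sample4.txt")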
The following code builds a corpus from the sample data and cleans it with the tm_map function: it strips extra white space, converts text to lower case, and removes punctuation and numbers. At this time we will not remove stop words, because we will be doing parts-of-speech analysis later in the course.
cname <- file.path(".", "corpus_sample4")
# build the corpus from all sample files; the encoding belongs on the source, not in readerControl
corpusData <- VCorpus(DirSource(cname, encoding = "UTF-8"),
                      readerControl = list(reader = readPlain, language = "en"))
corpusData <- tm_map(corpusData, stripWhitespace)
corpusData <- tm_map(corpusData, content_transformer(tolower))
corpusData <- tm_map(corpusData, removeNumbers)
corpusData <- tm_map(corpusData, removePunctuation, preserve_intra_word_dashes = TRUE)
# Unigram term-document matrix (terms of 2-15 characters)
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
corpusTDMUnigrams <- TermDocumentMatrix(corpusData, control = list(wordLengths = c(2, 15), tokenize = UnigramTokenizer))
removeSparseTerms(corpusTDMUnigrams, .2)   # inspect the effect of dropping sparse terms (result is not reassigned)
# Bigram term-document matrix
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
corpusTDMBigrams <- TermDocumentMatrix(corpusData, control = list(wordLengths = c(2, 45), tokenize = BigramTokenizer))
removeSparseTerms(corpusTDMBigrams, .2)
# Trigram term-document matrix
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
corpusTDMTrigrams <- TermDocumentMatrix(corpusData, control = list(wordLengths = c(2, 45), tokenize = TrigramTokenizer))
removeSparseTerms(corpusTDMTrigrams, .2)
head(findFreqTerms(corpusTDMUnigrams,lowfreq=10000),10)
[1] "â???"  "about"  "after"  "again"  "all"  "also"  "always"  "am"  "an"  "and"
head(findFreqTerms(corpusTDMBigrams,lowfreq=1000),10)
[1] "a bad"  "a beautiful"  "a better"  "a big"  "a bit"  "a book"  "a bunch"  "a chance"  "a couple"
[10] "a day"
head(findFreqTerms(corpusTDMTrigrams, lowfreq = 1000), 10)
[1] "a bit of"  "a couple of"  "a great day"  "a little bit"  "a long time"  "a lot of"  "all of the"  "all the time"  "and i am"
[10] "and i have"
findAssocs(corpusTDMUnigrams, "love", corlimit=0.99)
[subset of the result] videos 1.00  videotapes 1.00  vilify 1.00  vilma 1.00  vin 1.00  vince 1.00  vinces 1.00
plot(corpusTDMUnigrams, terms = findFreqTerms(corpusTDMUnigrams, lowfreq = 1000)[10:20], corThreshold = 0.8)   # term correlation plot; requires the Rgraphviz package
# Bigram frequencies across the corpus
freqBigram <- rowSums(as.matrix(corpusTDMBigrams))
wordFrameBigram <- data.frame(word = names(freqBigram), count = freqBigram, stringsAsFactors = FALSE)
# Bar chart of bigrams occurring more than 15,000 times
bigramPlot <- ggplot(subset(wordFrameBigram, count > 15000), aes(word, count))
bigramPlot <- bigramPlot + geom_bar(stat = "identity")
bigramPlot <- bigramPlot + theme(axis.text.x = element_text(angle = 45, hjust = 1))
bigramPlot
# Trigram frequencies and a word cloud of the most frequent trigrams
freqTrigram <- rowSums(as.matrix(corpusTDMTrigrams))
wordFrameTrigram <- data.frame(word = names(freqTrigram), count = freqTrigram, stringsAsFactors = FALSE)
wordcloud(wordFrameTrigram$word, wordFrameTrigram$count, min.freq = 1000, colors = brewer.pal(8, "Dark2"))
After building the term document matrices, we will apply smoothing techniques to discount the probabilities of existing terms in order to account for words that do not currently exist in the corpus. Different techniques such as Maximum Likelihood Estimation (MLE), Laplace smoothing and Simple Good-Turing will be evaluated, taking into consideration the size of the corpus and the hardware available to process the data. After smoothing, we will use only the last few words of a sentence to predict the next word, following the Markov assumption and using the smoothed probabilities.
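To make the prediction step concrete, the sketch below scores candidate next words with Laplace-smoothed bigram probabilities built from the unigram and bigram counts computed above. It is an illustration only, not the final model; the predictNextWord helper and the exact smoothing formulation are assumptions.
# Sketch: Laplace-smoothed bigram prediction using the counts above (illustrative only)
freqUnigram <- rowSums(as.matrix(corpusTDMUnigrams))
V <- length(freqUnigram)   # vocabulary size

predictNextWord <- function(sentence, topN = 3) {
    # use only the last word of the (lower-cased) sentence, per the Markov assumption
    words <- unlist(strsplit(tolower(sentence), "\\s+"))
    lastWord <- tail(words, 1)
    # candidate bigrams that start with the last word
    candidates <- freqBigram[grep(paste0("^", lastWord, " "), names(freqBigram))]
    # Laplace smoothing: P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V)
    w1Count <- if (lastWord %in% names(freqUnigram)) freqUnigram[[lastWord]] else 0
    probs <- (candidates + 1) / (w1Count + V)
    # return the most likely continuations (the second word of each candidate bigram)
    nextWords <- sub(paste0("^", lastWord, " "), "", names(sort(probs, decreasing = TRUE)))
    head(nextWords, topN)
}

predictNextWord("i want to go to a")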
Once data modeling is complete and sentence context is taken into consideration, the next step will be to build a Shiny application that mimics the SwiftKey app by predicting the next word as the user types.
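As a rough illustration of where this is headed, a minimal Shiny app (a sketch that assumes the hypothetical predictNextWord helper above, not the final application) could look like this:
library(shiny)

# Minimal sketch of the planned app: a text box and the predicted next words
ui <- fluidPage(
    titlePanel("Next Word Prediction"),
    textInput("sentence", "Type a sentence:", value = ""),
    verbatimTextOutput("prediction")
)

server <- function(input, output) {
    output$prediction <- renderText({
        if (nchar(trimws(input$sentence)) == 0) return("")
        paste(predictNextWord(input$sentence), collapse = ", ")
    })
}

shinyApp(ui = ui, server = server)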