Executive summary:-

This report presents a data analysis and evaluation of the data collected to predict the next word, using natural language processing, in a string entered by the user. The methods of analysis include an N-gram prediction model and exploratory data analysis. The data was extracted from files named LOCALE.blogs.txt, where the LOCALE used here is en_US. The data comes from a corpus called HC Corpora: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

Data

The data we will be using from the corpus is the en_US data folder, which contains 3 files: en_US.blogs.txt, en_US.twitter.txt and en_US.news.txt. First we will check the line and word counts for each file. The files are already loaded on the system, so we will use the Unix commands wc -l and wc -w to get the line and word counts for each file.
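Below is a minimal sketch of how these counts can be obtained by calling wc from R (the relative en_US/ paths are assumptions; adjust them to where the files actually live):

# Count lines and words for each raw file via the Unix wc utility
files <- c("en_US/en_US.blogs.txt", "en_US/en_US.twitter.txt", "en_US/en_US.news.txt")
for (f in files) {
  lines <- system(paste("wc -l <", f), intern = TRUE)  # line count only
  words <- system(paste("wc -w <", f), intern = TRUE)  # word count only
  cat(f, ":", trimws(lines), "lines,", trimws(words), "words\n")
}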

LINE Counts:-

en_US.blogs.txt=899288 rows

en_US.twitter.txt=2360148 rows

en_US.news.txt=1010242 rows

WORD Counts:-

en_US.blogs.txt=37334690 words

en_US.twitter.txt=3037420 words

en_US.news.txt=34372720 words

Based on the above summary we can see that the data files are huge, so we will first reduce the size of the data files to make the corpus manageable. To do so we will cut each file down to its first 50,000 lines and create a sample corpus called dataUS.

library(tm)
library(SnowballC)
library(caret)
library(kernlab)
library(UsingR)
library(data.table)
library(dplyr)
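# Take the first 50,000 non-blank lines of each source file and write them to the sample directory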
write(scan(file="/Users/mayankpundir/mayankpundir83/Capstone_project/en_US/en_US.blogs.txt",what="list",blank.lines.skip=TRUE,n=50000,sep = "\n",skipNul=TRUE),file = "/Users/mayankpundir/mayankpundir83/Capstone_project/sample/sampleblogs.txt")
write(scan(file="/Users/mayankpundir/mayankpundir83/Capstone_project/en_US/en_US.twitter.txt",what="list",blank.lines.skip=TRUE,n=50000,sep = "\n",skipNul=TRUE),file = "/Users/mayankpundir/mayankpundir83/Capstone_project/sample/sampletwitter.txt")
write(scan(file="/Users/mayankpundir/mayankpundir83/Capstone_project/en_US/en_US.news.txt",what="list",blank.lines.skip=TRUE,n=50000,sep = "\n",skipNul=TRUE),file = "/Users/mayankpundir/mayankpundir83/Capstone_project/sample/samplenew.txt")
dataUS <- Corpus(DirSource("/Users/mayankpundir/mayankpundir83/Capstone_project/sample/"))
class(dataUS)
## [1] "VCorpus" "Corpus"
summary(dataUS)
##                   Length Class             Mode
## sampleblogs.txt   2      PlainTextDocument list
## samplenew.txt     2      PlainTextDocument list
## sampletwitter.txt 2      PlainTextDocument list

As you can see from the summary of dataUS, we have built a corpus containing 3 PlainTextDocument files.

We will now perform data cleaning on the corpus to remove numbers, punctuation, extra whitespace and non-ASCII (Unicode) characters, and to change upper-case words to lower case.

dataUS <- tm_map(dataUS,removeNumbers)
dataUS <- tm_map(dataUS,removePunctuation)
dataUS <- tm_map(dataUS,content_transformer(tolower))
dataUS <- tm_map(dataUS,stripWhitespace)
rmUnico <- content_transformer(function(row) iconv(row, "latin1", "ASCII", sub="")) # remove unicode records
dataUS <- tm_map(dataUS,rmUnico)

Now that the data is clean and can be used for creating our predictive model, we will use library(RWeka) to create N-gram tokenizers for 1-grams, 2-grams, 3-grams and 4-grams and their corresponding TermDocumentMatrix objects. We will also remove sparse terms from each TermDocumentMatrix so that only terms occurring in all 3 documents are kept. These TermDocumentMatrix objects will help us create a predictive model to predict the next word in a string.

library(RWeka)
options(mc.cores=1)
onegramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
fourgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
tdm1 <- TermDocumentMatrix(dataUS, control = list(tokenize = onegramTokenizer))
tdm2 <- TermDocumentMatrix(dataUS, control = list(tokenize = bigramTokenizer))
tdm3 <- TermDocumentMatrix(dataUS, control = list(tokenize = trigramTokenizer))
tdm4 <- TermDocumentMatrix(dataUS, control = list(tokenize = fourgramTokenizer))
#removing Sparse Terms from the TermDocumentMatrix
tdm1 <- removeSparseTerms(tdm1, 0.1)
tdm2 <- removeSparseTerms(tdm2, 0.1)
tdm3 <- removeSparseTerms(tdm3, 0.1)
tdm4 <- removeSparseTerms(tdm4, 0.1)

Now that we have the TermDocumentMatrix created for 1-, 2-, 3- and 4-grams, we will see which terms are used most frequently by creating a bar graph for each n-gram and a word cloud for the 1-grams.

#Getting the frequency of words for each TermDocumentMatrix
freq1 <- sort(rowSums(as.matrix(tdm1)), decreasing=TRUE)
freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing=TRUE)
freq3 <- sort(rowSums(as.matrix(tdm3)), decreasing=TRUE)
freq4 <- sort(rowSums(as.matrix(tdm4)), decreasing=TRUE)

#creating data.tables of words and their frequencies for each n-gram
wf1 <- data.table(word=names(freq1), freq=freq1)
wf2 <- data.table(word=names(freq2), freq=freq2)
wf3 <- data.table(word=names(freq3), freq=freq3)
wf4 <- data.table(word=names(freq4), freq=freq4)

As per the table below, the top 6 words in the 1-gram model are:

Top 6 words in the 1-gram

head(wf1)
##    word   freq
## 1:  the 440308
## 2:  and 226249
## 3: that  94742
## 4:  for  91313
## 5:  you  65387
## 6: with  64267

Plotting the bar graph for the 1-gram.

#create bar graph for 1-gram
library(ggplot2)
subset(wf1, freq > 30000) %>%
ggplot(aes(word, freq)) + geom_bar(stat="identity")+ theme(axis.text.x=element_text(angle=45,hjust=1))

Word cloud for the 1-gram.

#create word cloud for 1-gram
library(wordcloud)
set.seed(123)
wordcloud(names(freq1), freq1, min.freq=1000,colors=brewer.pal(5, "Dark2"))

As per the table below, the top 6 terms in the 2-gram model are:

Top 6 terms in the 2-gram

head(wf2)
##       word  freq
## 1:  of the 41272
## 2:  in the 38133
## 3:  to the 19854
## 4:  on the 17529
## 5: for the 16327
## 6:   to be 14038

Bar graph for the 2-gram.

#create bar graph for 2-gram
subset(wf2, freq > 10000) %>%
ggplot(aes(word, freq)) + geom_bar(stat="identity")+ theme(axis.text.x=element_text(angle=45,hjust=1))

As per the table below, the top 6 terms in the 3-gram model are:

Top 6 terms in the 3-gram

head(wf3)
##           word freq
## 1:  one of the 3218
## 2:    a lot of 2764
## 3:     to be a 1487
## 4:  the end of 1434
## 5: going to be 1390
## 6:  as well as 1362

Bar graph for the 3-gram.

#create bar graph for 3-gram
subset(wf3, freq > 1200) %>%
ggplot(aes(word, freq)) + geom_bar(stat="identity")+ theme(axis.text.x=element_text(angle=45,hjust=1))

As per the table below, the top 6 terms in the 4-gram model are:

Top 6 terms in the 4-gram

head(wf4)
##                  word freq
## 1:     the end of the  729
## 2:      at the end of  661
## 3:    the rest of the  596
## 4: for the first time  548
## 5:   at the same time  438
## 6:    one of the most  427

Bar graph for the 4-gram.

#create bar graph for 4-gram
subset(wf4, freq > 400) %>%
ggplot(aes(word, freq)) + geom_bar(stat="identity")+ theme(axis.text.x=element_text(angle=45,hjust=1))

Model prediction algorithm plan:-

With these n-grams we can predict the best-fitting next word. We will use a loop that first checks whether the user has entered a string of 3 or more words; if so, we extract the last 3 words of the string and use the 4-gram table to get the next word. If no match is found for those 3 words, we back off to the last 2 words and the 3-gram table, and finally to the last word and the 2-gram table.
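Below is a minimal sketch of this back-off lookup in R, assuming the wf2, wf3 and wf4 frequency tables built above are in memory; the helper name predictNextWord is illustrative and not the final model:

predictNextWord <- function(text) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  tables <- list(wf2, wf3, wf4)        # n-gram tables indexed by prefix length
  for (n in 3:1) {                     # try a 3-word prefix, then back off to 2 and 1
    if (length(words) < n) next
    prefix <- paste(tail(words, n), collapse = " ")
    matches <- tables[[n]][startsWith(word, paste0(prefix, " "))]
    if (nrow(matches) > 0) {
      best <- matches[which.max(freq), word]         # most frequent matching n-gram
      return(tail(unlist(strsplit(best, " ")), 1))   # word that follows the prefix
    }
  }
  NA_character_                        # no match found in any n-gram table
}

For example, predictNextWord("at the end") searches wf4 for 4-grams beginning with "at the end "; based on the frequency table above, the best match is "at the end of", so the predicted word is "of".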