This report presents an analysis and evaluation of the data collected for predicting, with natural language processing, the next word in a string entered by the user. The method of analysis includes an N-gram prediction model and exploratory data analysis. The data was extracted from files named LOCALE.blogs.txt, where LOCALE here is the en_US data file. The data comes from a corpus called HC Corpora and was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
From the corpus we will be using the en_US data file, which contains three files: en_US.blogs.txt, en_US.twitter.txt and en_US.news.txt. First we will check the line and word counts for each file. The files are already loaded on the system, so we will use the Unix commands wc -l and wc -w to get the line count and word count for each file.
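A minimal sketch of how these counts can be obtained from within R, assuming the same en_US directory used later in this report, is:
#count lines and words for each file by shelling out to the Unix wc utility
usDir <- "/Users/mayankpundir/mayankpundir83/Capstone_project/en_US"
for (f in c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")) {
  system(paste("wc -l", file.path(usDir, f)))   #line count
  system(paste("wc -w", file.path(usDir, f)))   #word count
}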
en_US.blogs.txt   =  899288 lines, 37334690 words
en_US.twitter.txt = 2360148 lines,  3037420 words
en_US.news.txt    = 1010242 lines, 34372720 words
Based on the summary above we know that the data files are huge, so we will first reduce their size before building our corpus. To do so we will cut each file down by taking only 50,000 rows and create a sample corpus called dataUS.
library(tm)
library(SnowballC)
library(caret)
library(kernlab)
library(UsingR)
library(data.table)
library(dplyr)
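#read the first 50000 non-blank lines of each source file and write them to the sample directory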
write(scan(file="/Users/mayankpundir/mayankpundir83/Capstone_project/en_US/en_US.blogs.txt",what="list",blank.lines.skip=TRUE,n=50000,sep = "\n",skipNul=TRUE),file = "/Users/mayankpundir/mayankpundir83/Capstone_project/sample/sampleblogs.txt")
write(scan(file="/Users/mayankpundir/mayankpundir83/Capstone_project/en_US/en_US.twitter.txt",what="list",blank.lines.skip=TRUE,n=50000,sep = "\n",skipNul=TRUE),file = "/Users/mayankpundir/mayankpundir83/Capstone_project/sample/sampletwitter.txt")
write(scan(file="/Users/mayankpundir/mayankpundir83/Capstone_project/en_US/en_US.news.txt",what="list",blank.lines.skip=TRUE,n=50000,sep = "\n",skipNul=TRUE),file = "/Users/mayankpundir/mayankpundir83/Capstone_project/sample/samplenew.txt")
dataUS <- Corpus(DirSource("/Users/mayankpundir/mayankpundir83/Capstone_project/sample/"))
class(dataUS)
## [1] "VCorpus" "Corpus"
summary(dataUS)
## Length Class Mode
## sampleblogs.txt 2 PlainTextDocument list
## samplenew.txt 2 PlainTextDocument list
## sampletwitter.txt 2 PlainTextDocument list
As you can see from the summary of dataUS, we have built a corpus containing three PlainTextDocument files.
We will now perform data cleaning on the corpus to remove numbers, punctuation, extra whitespace and non-ASCII (Unicode) characters, and to change upper-case words to lower case.
dataUS <- tm_map(dataUS,removeNumbers)
dataUS <- tm_map(dataUS,removePunctuation)
dataUS <- tm_map(dataUS,content_transformer(tolower))
dataUS <- tm_map(dataUS,stripWhitespace)
rmUnico <- content_transformer(function(row) iconv(row, "latin1", "ASCII", sub="")) # strip non-ASCII (Unicode) characters
dataUS <- tm_map(dataUS,rmUnico)
Since the data is now clean and can be used for creating our predictive model, we will use library(RWeka) to create N-gram tokenizers for 1-grams, 2-grams, 3-grams and 4-grams and their corresponding TermDocumentMatrix objects. We will also remove the sparse terms from each TermDocumentMatrix so that we only keep terms occurring in all three documents. These TermDocumentMatrix objects will help us create a predictive model to predict the next word in a string.
library(RWeka)
options(mc.cores=1)
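#tokenizer functions returning 1-, 2-, 3- and 4-grams via RWeka's NGramTokenizer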
onegramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
fourgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
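#build a TermDocumentMatrix of the sample corpus for each n-gram size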
tdm1 <- TermDocumentMatrix(dataUS, control = list(tokenize = onegramTokenizer))
tdm2 <- TermDocumentMatrix(dataUS, control = list(tokenize = bigramTokenizer))
tdm3 <- TermDocumentMatrix(dataUS, control = list(tokenize = trigramTokenizer))
tdm4 <- TermDocumentMatrix(dataUS, control = list(tokenize = fourgramTokenizer))
#removing Sparse Terms from the TermDocumentMatrix
tdm1 <- removeSparseTerms(tdm1, 0.1)
tdm2 <- removeSparseTerms(tdm2, 0.1)
tdm3 <- removeSparseTerms(tdm3, 0.1)
tdm4 <- removeSparseTerms(tdm4, 0.1)
Now that we have the TermDocumentMatrix created for the 1-, 2-, 3- and 4-grams, we will see which terms are used most frequently by creating a bar graph and a word cloud for each N-gram.
#Getting the frequency of words for each TermDocumentMatrix
freq1 <- sort(rowSums(as.matrix(tdm1)), decreasing=TRUE)
freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing=TRUE)
freq3 <- sort(rowSums(as.matrix(tdm3)), decreasing=TRUE)
freq4 <- sort(rowSums(as.matrix(tdm4)), decreasing=TRUE)
#creating data.tables of words and their frequency for each ngram
wf1 <- data.table(word=names(freq1), freq=freq1)
wf2 <- data.table(word=names(freq2), freq=freq2)
wf3 <- data.table(word=names(freq3), freq=freq3)
wf4 <- data.table(word=names(freq4), freq=freq4)
As per the table below, the top 6 words in the 1-gram model are:
Top 6 words in One Gram
head(wf1)
## word freq
## 1: the 440308
## 2: and 226249
## 3: that 94742
## 4: for 91313
## 5: you 65387
## 6: with 64267
#create Bar-Graph for 1gram
library(ggplot2)
subset(wf1, freq > 30000) %>%
ggplot(aes(word, freq)) + geom_bar(stat="identity")+ theme(axis.text.x=element_text(angle=45,hjust=1))
#create word cloud for 1gram
library(wordcloud)
set.seed(123)
wordcloud(names(freq1), freq1, min.freq=1000,colors=brewer.pal(5, "Dark2"))
As per the table below, the top 6 terms in the 2-gram model are:
Top 6 words in Two Gram
head(wf2)
## word freq
## 1: of the 41272
## 2: in the 38133
## 3: to the 19854
## 4: on the 17529
## 5: for the 16327
## 6: to be 14038
#create Bar-Graph for 2gram
subset(wf2, freq > 10000) %>%
ggplot(aes(word, freq)) + geom_bar(stat="identity")+ theme(axis.text.x=element_text(angle=45,hjust=1))
As per the table below, the top 6 terms in the 3-gram model are:
Top 6 words in Three Gram
head(wf3)
## word freq
## 1: one of the 3218
## 2: a lot of 2764
## 3: to be a 1487
## 4: the end of 1434
## 5: going to be 1390
## 6: as well as 1362
#create Bar-Graph for 3gram
subset(wf3, freq > 1200) %>%
ggplot(aes(word, freq)) + geom_bar(stat="identity")+ theme(axis.text.x=element_text(angle=45,hjust=1))
As per the table below, the top 6 terms in the 4-gram model are:
Top 6 words in Four Gram
head(wf4)
## word freq
## 1: the end of the 729
## 2: at the end of 661
## 3: the rest of the 596
## 4: for the first time 548
## 5: at the same time 438
## 6: one of the most 427
#create Bar-Graph for 4gram
subset(wf4, freq > 400) %>%
ggplot(aes(word, freq)) + geom_bar(stat="identity")+ theme(axis.text.x=element_text(angle=45,hjust=1))
With these N-grams we can predict the best-fitting next word. We will use a loop that first checks whether the user has entered a string of three or more words; if yes, we extract the last three words from the string and use our 4-gram to get the next word. If we do not find a match on those three words, we back off and use the last two words to predict the next word from the 3-gram; otherwise we use the last word to predict from the 2-gram. A minimal sketch of this back-off lookup is shown below.
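The following sketch reuses the wf2, wf3 and wf4 data.tables built above; the helper names (lastWords, lookupNgram, predictNextWord), the rough cleaning of the input and the simple frequency-based ranking are illustrative assumptions, not the final prediction model.
library(data.table)
lastWords <- function(text, n) {
  #clean the input roughly the same way as the corpus, then keep the last n words
  words <- unlist(strsplit(tolower(gsub("[[:punct:][:digit:]]", "", text)), "\\s+"))
  tail(words[words != ""], n)
}
lookupNgram <- function(wf, prefix) {
  #find n-grams starting with the prefix and return the last word of the most frequent one
  hits <- wf[startsWith(word, paste0(prefix, " "))][order(-freq)]
  if (nrow(hits) == 0) return(NA_character_)
  tail(unlist(strsplit(hits$word[1], " ")), 1)
}
predictNextWord <- function(text) {
  words <- lastWords(text, 3)
  if (length(words) == 0) return(NA_character_)
  #back off from the 4-gram to the 3-gram to the 2-gram table
  if (length(words) >= 3) {
    guess <- lookupNgram(wf4, paste(words, collapse = " "))
    if (!is.na(guess)) return(guess)
  }
  if (length(words) >= 2) {
    guess <- lookupNgram(wf3, paste(tail(words, 2), collapse = " "))
    if (!is.na(guess)) return(guess)
  }
  lookupNgram(wf2, tail(words, 1))
}
predictNextWord("thanks for the")   #example call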