Executive summary:

The goal of this report is to build a basic n-gram model for predicting the next word based on the previous words.

Overview:

In this milestone report for the Data Science Capstone project, we build a prediction algorithm and perform exploratory data analysis to examine the relationships between words, tokens, and phrases in the text.

Required libraries

library(ggplot2) # for plotting
library(stringi) # for character string processing facilities
library(NLP) # natural language processing infrastructure used by tm
library(tm) # for text mining
library(SnowballC) # for text stemming
library(RColorBrewer) # for color palettes
library(RWeka) # for data mining tasks
library(RWekajars) # jar files required by RWeka
library(knitr) # for rendering tables with kable()
library(rJava) # Java interface used by RWeka

Task 0: Obtaining the data

1. Data acquisition
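The three English corpora (blogs, news, and Twitter) are read into memory line by line. A minimal sketch, assuming the same local file paths used in the size checks below; the encoding and skipNul arguments are our additions, to handle non-ASCII characters and embedded nulls:

blogs<-readLines("C:/Users/laksk/Downloads/Coursera-SwiftKey (1)/final/en_US/en_US.blogs.txt", encoding="UTF-8", skipNul=TRUE)

news<-readLines("C:/Users/laksk/Downloads/Coursera-SwiftKey (1)/final/en_US/en_US.news.txt", encoding="UTF-8", skipNul=TRUE)

twitter<-readLines("C:/Users/laksk/Downloads/Coursera-SwiftKey (1)/final/en_US/en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE)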

2. Corpus Dimensions

In this module we check the size of each text file, the number of lines, the total word count, and the length of the longest line.

> Size (MB):

blogsSize<-file.info("C:/Users/laksk/Downloads/Coursera-SwiftKey (1)/final/en_US/en_US.blogs.txt")$size/1024^2

blogsSize
## [1] 200.4242
newsSize<-file.info("C:/Users/laksk/Downloads/Coursera-SwiftKey (1)/final/en_US/en_US.news.txt")$size/1024^2

newsSize
## [1] 196.2775
twitterSize<-file.info("C:/Users/laksk/Downloads/Coursera-SwiftKey (1)/final/en_US/en_US.twitter.txt")$size/1024^2

twitterSize
## [1] 159.3641

> Length (number of lines):

blogsLength<-length(blogs)

newsLength<-length(news)

twitterLength<-length(twitter)

> Word Count:

blogsWords<-sum(stri_count_words(blogs))

newsWords<-sum(stri_count_words(news))

twitterWords<-sum(stri_count_words(twitter))

> Length of the longest line:

blogsMax<-max(nchar(blogs))

newsMax<-max(nchar(news))

twitterMax<-max(nchar(twitter))

3. Summary statistics about the data

Summary<-data.frame(FileName=c("blogs", "news", "twitter"),
                    FileSize=c(blogsSize, newsSize, twitterSize),
                    FileLength=c(blogsLength, newsLength, twitterLength),
                    WordCount=c(blogsWords, newsWords, twitterWords),
                    MaxCharacters=c(blogsMax, newsMax, twitterMax))

kable(Summary)
FileName   FileSize (MB)   FileLength (lines)   WordCount   MaxCharacters
--------   -------------   ------------------   ---------   -------------
blogs           200.4242               899288    37546239           40833
news            196.2775              1010242    34762395           11384
twitter         159.3641              2360148    30093413             140

4. Sampling the data

Because the corpus is very large, we analyze a randomly chosen sample of one percent of each file.
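For reproducibility, a random seed can be fixed before drawing the sample; the seed value below is an arbitrary assumption, since the original run did not record one.

set.seed(1234) # fix the RNG so the 1% sample is reproducible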

Sample<-c(sample(blogs, length(blogs)*0.01),
          sample(news, length(news)*0.01),
          sample(twitter, length(twitter)*0.01))

5. Cleaning the data

A minimal cleaning process is applied here to remove punctuation, numbers, stop words, and extra white space, and to convert the text to lower case.

corpus<-VCorpus(VectorSource(Sample))

corpus<-tm_map(corpus, content_transformer(tolower)) # wrap base tolower so documents remain valid tm documents
corpus<-tm_map(corpus, removeWords, stopwords("en"))
corpus<-tm_map(corpus, removePunctuation)
corpus<-tm_map(corpus, removeNumbers)
corpus<-tm_map(corpus, stripWhitespace) # with content_transformer(), no PlainTextDocument re-wrapping is needed

Task 1: Tokenization and n-gram model

Tokenization is the task of chopping text into individual units called tokens, which can then be grouped into sequences of n tokens, or n-grams.

An n-gram is a sequence of n tokens. In Natural Language Processing, an n-gram model predicts the next word based on the previous 1, 2, or 3 words.

A 1-gram or unigram is a one-word sequence, a 2-gram or bigram is a two-word sequence, and a 3-gram or trigram is a three-word sequence. For example, the sentence "thanks for the follow" yields the bigrams "thanks for", "for the", and "the follow".

Unigram:

uniToken<-function(x) NGramTokenizer(x, Weka_control(min=1, max=1)) # one-word tokenizer

uniMatrix<-TermDocumentMatrix(corpus, control=list(tokenize=uniToken))

uniCorpus<-findFreqTerms(uniMatrix, lowfreq=100) # keep terms that appear at least 100 times

uniWords<-rowSums(as.matrix(uniMatrix[uniCorpus, ])) # total frequency of each term

unigram<-data.frame(Word=names(uniWords), Frequency=uniWords)

kable(head(unigram, n=10))
Word         Frequency
----------   ---------
’ll                168
’re                199
’ve                264
able               340
according          217
account            105
across             196
act                163
action             119
actually           330

Bigram:

biToken<-function(x) NGramTokenizer(x, Weka_control(min=2, max=2))

biMatrix<-TermDocumentMatrix(corpus, control=list(tokenize=biToken))

biCorpus<-findFreqTerms(biMatrix, lowfreq=10) # lower threshold, since two-word sequences repeat less often

biWords<-rowSums(as.matrix(biMatrix[biCorpus, ]))

bigram<-data.frame(Word=names(biWords), Frequency=biWords)

kable(head(bigram, n=10))
Word         Frequency
----------   ---------
– ’s                11
’d like             16
’m going            17
’m sure             23
’re going           16
’s going            21
’s hard             14
’s just             25
’s like             13
’s much             10

Trigram:

triToken<-function(x) NGramTokenizer(x, Weka_control(min=3, max=3))

triMatrix<-TermDocumentMatrix(corpus, control=list(tokenize=triToken))

triCorpus<-findFreqTerms(triMatrix, lowfreq=10)

triWords<-rowSums(as.matrix(triMatrix[triCorpus, ]))

trigram<-data.frame(Word=names(triWords), Frequency=triWords)

kable(head(trigram, n=10))
Word                     Frequency
----------------------   ---------
amazon services llc             10
cake cake cake                  20
cinco de mayo                   12
happy mothers day               29
happy new year                  13
let us know                     25
llc amazon eu                   10
looking forward seeing          13
new york city                   18
new york times                  16

Task 2: Exploratory Data Analysis

Here we perform exploratory data analysis to understand the

> basic relationship between the words,

> distribution of the words,

> variation in the frequencies of words and word pairs in the data,

> frequencies of unigrams, bigrams, and trigrams in the dataset.
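The n-gram data frames built above are ordered alphabetically, so the top-10 slices plotted below would otherwise show the alphabetically first terms rather than the most frequent ones. A small reordering step, not in the original run, fixes this:

# Sort each frame by descending frequency so [1:10, ] picks the top terms
unigram<-unigram[order(-unigram$Frequency), ]
bigram<-bigram[order(-bigram$Frequency), ]
trigram<-trigram[order(-trigram$Frequency), ]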

Plot for Unigram

g1<-ggplot(data=unigram[1:10,], aes(x=Word, y=Frequency))

g1<-g1+geom_bar(stat="identity", color="green", fill="blue", width=0.5)

g1<-g1+ggtitle("Unigram plot")+labs(x="One-word")

g1<-g1+coord_flip()

g1

Plot for Bigram

g2<-ggplot(data=bigram[1:10, ], aes(x=Word, y=Frequency))

g2<-g2+geom_bar(stat="identity", color="blue", fill="green", width=0.5)

g2<-g2+ggtitle("Bigram plot")+labs(x="Two-words")

g2<-g2+coord_flip()

g2

Plot for Trigram

g3<-ggplot(data=trigram[1:10, ], aes(x=Word, y=Frequency))

g3<-g3+geom_bar(stat="identity", color="yellow", fill="deeppink", width=0.5)

g3<-g3+ggtitle("Trigram plot")+labs(x="Three-words")

g3<-g3+coord_flip()

g3

As a next step, we will also build a Shiny app as a user interface for interacting with our predictive model to predict the next word.
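To sketch how these frequency tables will feed the predictor, a minimal bigram lookup is shown below. The function name predictNext and the simple last-word match are illustrative assumptions, not the final algorithm, which will likely add back-off from trigrams to bigrams to unigrams.

# Hypothetical sketch: suggest a next word by finding the most frequent
# bigram that begins with the last word of the user's phrase
predictNext<-function(phrase, bigramFreq){
  last<-tail(strsplit(tolower(phrase), "\\s+")[[1]], 1)
  hits<-bigramFreq[startsWith(as.character(bigramFreq$Word), paste0(last, " ")), ]
  if(nrow(hits)==0) return(NA_character_)
  best<-as.character(hits$Word[which.max(hits$Frequency)])
  strsplit(best, " ")[[1]][2] # return the second word of the best bigram
}

predictNext("let us", bigram) # looks up bigrams starting with "us"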