N-gram word prediction model

Exploratory Analysis


Introduction


This is part of the Data Science Capstone project.

The goal here is to build a simple model of the relationships between words, the first step in building a predictive text-mining application. We will explore simple models first and then move on to more sophisticated modeling techniques.

Tasks to accomplish

  1. Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.

  2. Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora, so build a model to handle cases where a particular n-gram isn’t observed (a minimal backoff sketch follows this list).
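
As a preview of the kind of backoff logic task 2 calls for, here is a minimal sketch, not the final implementation. The function predictNextWord() and the frequency tables freq1, freq2 and freq3 are hypothetical names; they are assumed to hold n-gram counts in columns prefix, word and freq:

# Hypothetical backoff sketch: try the longest matching context first,
# then fall back to shorter contexts, and finally to the most common word.
# freq3, freq2, freq1 are assumed n-gram count tables with columns
# prefix (the preceding words), word (the candidate next word) and freq.
predictNextWord <- function(prefix, freq3, freq2, freq1) {
  words <- unlist(strsplit(tolower(prefix), "\\s+"))
  n <- length(words)
  if (n >= 2) {                       # trigram: last two words as context
    hits <- freq3[freq3$prefix == paste(tail(words, 2), collapse = " "), ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$freq)])
  }
  if (n >= 1) {                       # bigram: last word as context
    hits <- freq2[freq2$prefix == tail(words, 1), ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$freq)])
  }
  freq1$word[which.max(freq1$freq)]   # unigram fallback: most frequent word
}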

Data cleaning and filtering

We will use the HC Corpora data set provided by SwiftKey.

First we will read the data set and compute basic summaries such as the number of lines and words. Then we will clean the data by removing whitespace, punctuation, numbers, etc. Finally we will build unigram, bigram and trigram models.

  • Description of data

After unzipping the file we find a folder “final” containing four sub-folders: de_DE, en_US, fi_FI and ru_RU. Each sub-folder has three files, one each for blogs, Twitter and news. We will work on the English data, as I don’t understand the other languages :)

library(stringi)
library(ggplot2)
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(RWeka)
library(wordcloud)
## Loading required package: RColorBrewer
library(tau)
library(Matrix)
library(data.table)
library(parallel)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
setwd("~/Documents/My/coursera_capstone")

# importing the blogs and twitter datasets in text mode
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 167155 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 268547 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1274086 appears to contain an embedded nul
## Warning in readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8"):
## line 1759032 appears to contain an embedded nul
# importing the news in binary mode 
tmpV <- file("final/en_US/en_US.news.txt", open="rb")
news <- readLines(tmpV, encoding="UTF-8")
close(tmpV)
rm(tmpV)

# drop non UTF-8 characters and normalize quote characters
twitter <- iconv(twitter, from = "latin1", to = "UTF-8", sub="")
twitter <- stri_replace_all_regex(twitter, "\u2019|`", "'")
twitter <- stri_replace_all_regex(twitter, "\u201c|\u201d|\u201f|``", '"')

length(blogs)
## [1] 899288
length(news)
## [1] 1010242
length(twitter)
## [1] 2360148

So these data sets contain roughly 0.9 million, 1 million, and 2.4 million lines of text for blogs, news, and Twitter respectively.

Here is a summary of the texts, covering both character and word counts.

# Character analysis

blogsStats <- stri_stats_general(blogs)
newsStats <- stri_stats_general(news)
twitterStats <- stri_stats_general(twitter)

# more textual analysis

blogsWords <- stri_count_words(blogs)
newsWords <- stri_count_words(news)
twitterWords <- stri_count_words(twitter)

Now let's summarize the words per line, and then the number of characters per line, for each dataset.

summary( blogsWords )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
summary( newsWords )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00
summary( twitterWords )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.79   18.00   61.00
# number of characters per line

summary( nchar(blogs) )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40830
summary( nchar(news) )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   110.0   185.0   201.2   268.0 11380.0
summary( nchar(twitter) )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    37.0    64.0    68.8   100.0   213.0

Now let's plot the word-count distributions.

qplot(blogsWords)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(newsWords)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(twitterWords)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let's create a sample of 5,000 lines from each dataset; the combined sample will be used to build the n-gram models below.

blogsSample <- sample(blogs, 5000)
newsSample <- sample(news, 5000)
twitterSample <- sample(twitter, 5000)

# save samples

sample <-c(blogsSample, newsSample, twitterSample)

Let's clean our data of profanities

Remove any profanities using a word list, then build a tm corpus from the combined sample:

# Profanity filtering

f <- file("./profanity.txt")
profanities <- readLines(f, encoding = "UTF-8")
close(f)

# build a tm corpus from the combined sample and drop the profane words
corpusSample <- Corpus(VectorSource(sample))
corpusSample <- tm_map(corpusSample, removeWords, profanities)
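
The data-cleaning plan in the introduction also mentions removing whitespace, punctuation and numbers. Here is a minimal sketch of those additional tm transformations on the same corpus; lower-casing is included as an assumption, it is not explicitly mentioned above:

# additional cleaning described in the introduction: case folding (assumed),
# punctuation, numbers and extra whitespace
corpusSample <- tm_map(corpusSample, content_transformer(tolower))
corpusSample <- tm_map(corpusSample, removePunctuation)
corpusSample <- tm_map(corpusSample, removeNumbers)
corpusSample <- tm_map(corpusSample, stripWhitespace)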

N-gram modeling

We will create unigram (1-gram), bigram (2-gram) and trigram (3-gram) models for the data set.

Top 20 Unigrams with the highest frequency

options(mc.cores = 1)
OnegramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))

# remove English stop words before counting unigrams
corpusSample.pruned <- tm_map(corpusSample, removeWords, stopwords("english"))
onegramMatrix <- as.matrix(TermDocumentMatrix(corpusSample.pruned,
    control = list(tokenize = OnegramTokenizer)))
onegramRowSum <- rowSums(onegramMatrix)
onegram <- data.frame(onegram = names(onegramRowSum), freq = onegramRowSum)
onegramSorted <- onegram[order(-onegram$freq), ]
par(mar = c(5, 5, 2, 2) + 0.2)
barplot(onegramSorted[1:20, ]$freq / 1000, horiz = F, cex.names = 0.8, xlab = "Unigrams",
    ylab = "Frequency (thousand)", las = 2, names.arg = onegramSorted[1:20, ]$onegram,
    main = "Top 20 Unigrams with the highest frequency")

(plot: top 20 unigrams by frequency)

Word cloud for Unigrams

freq.onegram <- sort(rowSums(onegramMatrix), decreasing = TRUE)
wordcloud(names(freq.onegram), freq.onegram, max.words = 100, colors = brewer.pal(6, "Dark2"))

(plot: word cloud of the most frequent unigrams)

Top 20 Bigrams with the highest frequency

TwogramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

twogramMatrix <- TermDocumentMatrix(corpusSample, control = list(tokenize = TwogramTokenizer))

# keep only bigrams seen at least 10 times and sort by frequency
twoFreq <- findFreqTerms(twogramMatrix, lowfreq = 10)
twogramRowSum <- sort(rowSums(as.matrix(twogramMatrix[twoFreq, ])), decreasing = TRUE)

barplot(twogramRowSum[1:20], horiz = F, cex.names = 0.8, xlab = "Bigrams",
        ylab = "Frequency", las = 2, names.arg = names(twogramRowSum[1:20]),
        main = "Top 20 bigrams with the highest frequency")

(plot: top 20 bigrams by frequency)

Word cloud for Bigrams:

wordcloud(names(twogramRowSum), twogramRowSum, max.words = 100, colors = brewer.pal(6, "Dark2"))

(plot: word cloud of the most frequent bigrams)

Top 20 Trigrams with the highest frequency

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

trigramMatrix <- TermDocumentMatrix(corpusSample, control = list(tokenize = TrigramTokenizer))

# keep only trigrams seen at least 10 times and sort by frequency
triFreq <- findFreqTerms(trigramMatrix, lowfreq = 10)
trigramRowSum <- sort(rowSums(as.matrix(trigramMatrix[triFreq, ])), decreasing = TRUE)

barplot(trigramRowSum[1:20], horiz = F, cex.names = 0.8, xlab = "Trigrams",
        ylab = "Frequency", las = 2, names.arg = names(trigramRowSum[1:20]),
        main = "Top 20 trigrams with the highest frequency")

(plot: top 20 trigrams by frequency)

Word Cloud for Trigrams

wordcloud(names(trigramRowSum), trigramRowSum, max.words = 100, colors = brewer.pal(6, "Dark2"), scale=c(2,0.2))

(plot: word cloud of the most frequent trigrams)
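
Before moving on to prediction, the trigram counts above could be reshaped into a prefix-to-next-word lookup table that matches the column layout assumed in the backoff sketch near the top of this report. This is only a sketch; the object names (freq3, parts) are illustrative:

# Sketch: turn the named vector trigramRowSum into a lookup table with
# columns prefix (first two words), word (third word) and freq.
library(data.table)
parts <- strsplit(names(trigramRowSum), " ")
freq3 <- data.table(
  prefix = sapply(parts, function(p) paste(head(p, -1), collapse = " ")),
  word   = sapply(parts, function(p) tail(p, 1)),
  freq   = as.numeric(trigramRowSum)
)
setkey(freq3, prefix)   # index by prefix for fast lookup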

Now that we have cleaned up our data and done an exploratory analysis, let's focus on the prediction algorithm and the Shiny app. Let's do the following:

  1. Further clean up the data by removing non-descriptive words like “the”, “is”, “that”, etc.

  2. Partition the data sets into training, testing and validation sets. We will then analyze the training set to determine the features and corresponding frequency matrices from which the algorithm can learn word relationships.

  3. Using the training data sets, model development will determine the best fit for predicting the next word based on the previous 3 or 4 words.

  4. Model accuracy and error analysis will be used to validate the model selection. The model will be tested on the partitioned test data and validated on the final validation set.

  5. Build a Shiny app that will allow a user to begin typing words and display a list of predicted words based on the selected prediction model (a minimal skeleton is sketched below).
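
As a preview of that last step, here is a minimal Shiny skeleton. It assumes the hypothetical predictNextWord() function and the frequency tables freq1, freq2 and freq3 from the backoff sketch earlier are already loaded; the final app will be built around the selected model.

library(shiny)

# Minimal sketch: the user types a phrase and the app shows one predicted word.
ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("phrase", "Start typing:", value = ""),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    # hypothetical prediction function and tables from the earlier sketch
    predictNextWord(input$phrase, freq3, freq2, freq1)
  })
}

shinyApp(ui = ui, server = server)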