N-grams word prediction model

Exploratory Analysis

The goal here is to build a first simple model for the relationship between words. This is the first step in building a predictive text mining application. We will explore simple models first and then look at more complicated modeling techniques.

Tasks to accomplish

Build a basic n-gram model: using the exploratory analysis we performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.

Build a model to handle unseen n-grams: in some cases people will want to type a combination of words that does not appear in the corpora, so the model must handle cases where a particular n-gram isn’t observed.
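One common way to cover both tasks is a simple backoff scheme: look up the longest available context first, and fall back to a shorter context when that one was never observed. The sketch below is only an illustration under stated assumptions. The count tables are made-up toy values, and `next_word` and `predict_backoff` are hypothetical helper names, not functions from this analysis. It stops at trigrams (two words of context) to stay short.

```r
# Toy n-gram counts (made-up values, for illustration only).
unigrams <- c("the" = 40, "of" = 25, "to" = 20)
bigrams  <- c("of the" = 9, "of my" = 12, "for the" = 6, "in the" = 8)
trigrams <- c("thanks for the" = 3, "one of the" = 5, "one of my" = 2)

# Hypothetical helper: most frequent word that follows `context` in `counts`,
# or NA if the context was never observed.
next_word <- function(context, counts) {
  prefix <- if (nzchar(context)) paste0(context, " ") else ""
  hits <- counts[startsWith(names(counts), prefix)]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  substring(best, nchar(prefix) + 1)
}

# Hypothetical helper: try a 2-word context, then 1 word, then no context.
predict_backoff <- function(text, tables) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  for (n in c(2, 1, 0)) {
    if (length(words) < n) next
    context <- paste(tail(words, n), collapse = " ")
    guess <- next_word(context, tables[[n + 1]])
    if (!is.na(guess)) return(guess)
  }
  NA_character_
}

tables <- list(unigrams, bigrams, trigrams)
predict_backoff("he is one of", tables)   # context seen in the trigram table
predict_backoff("a piece of", tables)     # unseen trigram, backs off to bigrams
predict_backoff("hello world", tables)    # unseen everywhere, backs off to unigrams
```

Plain backoff like this ignores how much probability mass to reserve for unseen n-grams; smoothing methods address that and are outside the scope of this sketch.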

Cleaning and filtering data

The HC Corpora data was downloaded and unzipped, which created four folders with three txt files in each. Only the data from the en_US folder is used here.

Description of data

Loading the required packages, then reading the en_US data and showing a summary

library(stringi)
library(ggplot2)
library(tm)
library(wordcloud)
library(wordcloud2)
library(tau)
library(Matrix)
library(data.table)
library(parallel)
library(reshape2)
library(stringr)

Setting default directory

setwd("C:/DS/F7/final")

Showing summary info

#Blog file info

FileInfo <- file.info("en_US.blogs.txt")
sizeB <- FileInfo$size
sizeKB <- sizeB/1024
sizeMB <-  sizeKB/1024
sizeMB
## [1] 200.4242
FileBlog <- file("en_US.blogs.txt")
finalDataBlog <- readLines(FileBlog)
close(FileBlog)
summary(nchar(finalDataBlog))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    47.0   157.0   231.7   331.0 40835.0
#head(finalDataBlog)

# News file info
FileInfo <- file.info("en_US.news.txt")
sizeB <- FileInfo$size
sizeKB <- sizeB/1024
sizeMB <-  sizeKB/1024
sizeMB
## [1] 196.2775
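# Note: the "incomplete final line" warning below suggests reading may have stopped early.
# On Windows, opening the connection in binary mode, file("en_US.news.txt", open = "rb"),
# avoids truncation at an embedded control character.
FileNews <- file("en_US.news.txt")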
finalDataNews <- readLines(FileNews)
## Warning in readLines(FileNews): incomplete final line found on
## 'en_US.news.txt'
close(FileNews)
summary(nchar(finalDataNews))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2     111     186     203     270    5760
#head(finalDataNews)

# Twitter file info
FileInfo <- file.info("en_US.Twitter.txt")
sizeB <- FileInfo$size
sizeKB <- sizeB/1024
sizeMB <-  sizeKB/1024
sizeMB
## [1] 159.3641
FileTwitter <- file("en_US.Twitter.txt")
finalDataTwitter <- readLines(FileTwitter)
## Warning in readLines(FileTwitter): line 167155 appears to contain an
## embedded nul
## Warning in readLines(FileTwitter): line 268547 appears to contain an
## embedded nul
## Warning in readLines(FileTwitter): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(FileTwitter): line 1759032 appears to contain an
## embedded nul
close(FileTwitter)
summary(nchar(finalDataTwitter))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    37.0    64.0    68.8   100.0   213.0
#head(finalDataTwitter)
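The three blocks above repeat the same steps, and two of them produce read warnings. As a hedged sketch, the steps could be wrapped in one helper that opens the file in binary mode and skips embedded nuls. `read_corpus_file` is a hypothetical name, not part of the original analysis, and the temporary file only stands in for the real corpus files.

```r
# Hypothetical helper: read a text file in binary mode, skip embedded nuls,
# and return the file size in MB, the line count and a line-length summary.
read_corpus_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  list(
    size_mb       = file.info(path)$size / 1024^2,
    n_lines       = length(lines),
    nchar_summary = summary(nchar(lines, type = "chars", allowNA = TRUE)),
    lines         = lines
  )
}

# Usage on a small temporary file standing in for en_US.blogs.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "a second, longer line", "third"), tmp)
info <- read_corpus_file(tmp)
info$n_lines
info$nchar_summary
```

Reading in binary mode means the line count no longer depends on how the platform treats control characters in text mode, so the numbers may differ from the outputs shown above.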

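Cleaning the data starts with converting the encoding and replacing curly quotes and backticks with plain quotes; whitespace, punctuation and numbers also need to be removed. Then we will create unigram, bigram and trigram models.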

# Each step works on the result of the previous one, so no replacement is lost.
# Double backticks are converted before single ones so that `` becomes ".
blogs <- iconv(finalDataBlog, from = "latin1", to = "UTF-8", sub="")
blogs <- stri_replace_all_regex(blogs, "\u201c|\u201d|\u201f|``",'"')
blogs <- stri_replace_all_regex(blogs, "\u2019|`","'")

# Number of lines in the blog data
length(blogs)
## [1] 899288
nws <- iconv(finalDataNews, from = "latin1", to = "UTF-8", sub="")
nws <- stri_replace_all_regex(nws, "\u201c|\u201d|\u201f|``",'"')
nws <- stri_replace_all_regex(nws, "\u2019|`","'")
# Number of lines in the news data
length(nws)
## [1] 77259
Twitter <- iconv(finalDataTwitter, from = "latin1", to = "UTF-8", sub="")
Twitter <- stri_replace_all_regex(Twitter, "\u201c|\u201d|\u201f|``",'"')
Twitter <- stri_replace_all_regex(Twitter, "\u2019|`","'")
# Number of lines in the Twitter data
length(Twitter)
## [1] 2360148
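The removal of numbers, punctuation and extra whitespace is not shown in the code above. A minimal base R sketch of that step might look like the following. `clean_text` is a hypothetical helper, and keeping apostrophes (so that contractions such as "don't" survive) is an assumption, not something the original analysis states.

```r
# Hypothetical helper: lower-case the text, drop numbers and punctuation
# (apostrophes are kept), and collapse whitespace.
# Note: this also drops non-ASCII letters, which may not be wanted.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[0-9]+", " ", x)      # remove numbers
  x <- gsub("[^a-z' ]+", " ", x)   # remove punctuation and other symbols
  x <- gsub("\\s+", " ", x)        # collapse runs of whitespace
  trimws(x)
}

clean_text("  It's 9am -- don't  PANIC!  ")
```

The same effect can be had with the transformation functions in tm; the base R version is shown only because it has no package dependencies.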

Plots to show summary info

# Words per line in each source, shown as histograms
wordsNews <- stri_count_words(finalDataNews)
qplot(wordsNews)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

wordsTwitter <- stri_count_words(finalDataTwitter)
qplot(wordsTwitter)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

wordsBlog <- stri_count_words(finalDataBlog)
qplot(wordsBlog)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Since the dataset is quite large, we take a small random sample of 500 lines from each source and merge them

    sample_size = 500
    set.seed(3456)
    sample_news = sample(finalDataNews, sample_size, replace = FALSE)
    sample_blog = sample(finalDataBlog, sample_size, replace = FALSE)
    sample_twitter = sample(finalDataTwitter, sample_size, replace = FALSE)
    
    # merging data
    sample_text = c(sample_news, sample_blog, sample_twitter)

For the purpose of prediction, we extract 1-gram, 2-gram and 3-gram sequences and count how often each one occurs.

1-gram

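# Crps is the text corpus and t1 the 1-gram tokenizer; their definitions are not shown in this report.
freq1 <- sort(rowSums(as.matrix(TermDocumentMatrix(Crps, control = list(tokenize = t1)))), decreasing=TRUE)
barplot(freq1[1:10], main = "Most Frequent 1-grams", xlab = "Word Sequence", ylab = "Total", las=2)

# Frequency table as a data frame
df1 <- data.frame(word=names(freq1), freq=freq1)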

The most frequent words are displayed in bold.

2-gram

3-gram

The 3-gram word cloud was drawn with wordcloud(df3$word, df3$freq, min.freq = 2500, max.words = 100). It produced 35 warnings of the form "... could not be fit on page. It will not be plotted.", covering trigrams such as "as well as", "one of the", "the new york" and "thanks for the".

Unsurprisingly, the word “the” appears frequently in the 1-gram sequence. For the 2-gram sequence, “in the” and “of the” are more frequent. Finally, in the 3-gram sequence, “a lot of” appears most often. A more detailed analysis is needed, specifically of the 3-gram sequence.
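The kind of frequency table behind these observations can be sketched in base R, without tm. This is an illustration only: `count_ngrams` is a hypothetical helper, the tokenization rule (split on anything that is not a letter or apostrophe) is an assumption, and the toy sentences are made up, so the counts say nothing about the real corpus.

```r
# Hypothetical helper: count all n-grams of size n in a character vector of lines.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(tolower(lines), "[^a-z']+"), function(w) {
    w <- w[nzchar(w)]                          # drop empty tokens
    if (length(w) < n) return(character(0))    # line too short for an n-gram
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Toy input (made up for illustration)
toy <- c("a lot of people like a lot of things",
         "one of the best",
         "a lot of the time")

freq3 <- count_ngrams(toy, 3)
head(freq3, 3)

# Same word/freq layout as the df1 data frame used earlier
toy_df <- data.frame(word = names(freq3), freq = as.integer(freq3))
```

N-grams are built per line here, so no sequence spans two different documents; whether the original tokenizer behaves the same way is not known from this report.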

Summary

The steps above give a quick exploratory assessment of how frequently certain words and word sequences appear. As mentioned earlier, a more detailed analysis will follow in later sessions.