N-grams word prediction model

Exploratory Analysis

The goal here is to build a first simple model of the relationship between words. This is the first step in building a predictive text mining application. You will explore simple models and discover more complicated modeling techniques.

Tasks to accomplish

Build a basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.

Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.
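The two tasks above can be sketched together in base R. This is only a minimal illustration, not the model built in this report: it uses a plain back-off lookup (longest context first, then shorter ones, then the most frequent word) with no smoothing or discounting. The function names `train_backoff` and `predict_next` and the toy sentence are made up for the example.

```r
# Hypothetical helper: count, for every context of (n - 1) words,
# how often each word follows it. Unigram counts are stored separately.
train_backoff <- function(tokens, max_n = 3) {
  model <- list(unigrams = sort(table(tokens), decreasing = TRUE))
  for (n in seq_len(max_n)[-1]) {
    if (length(tokens) < n) next
    idx <- seq_len(length(tokens) - n + 1)
    context <- vapply(
      idx,
      function(i) paste(tokens[i:(i + n - 2)], collapse = " "),
      character(1)
    )
    following <- tokens[idx + n - 1]
    # One table of next-word counts per distinct context
    model[[paste0("n", n)]] <- lapply(split(following, context), table)
  }
  model
}

# Hypothetical helper: predict the next word from the last words typed.
predict_next <- function(model, history, max_n = 3) {
  history <- tolower(history)
  # Try the longest context first, then back off to shorter ones
  for (n in rev(seq_len(max_n)[-1])) {
    if (length(history) < n - 1) next
    key <- paste(tail(history, n - 1), collapse = " ")
    counts <- model[[paste0("n", n)]][[key]]
    if (!is.null(counts)) return(names(counts)[which.max(counts)])
  }
  # Unseen context at every order: fall back to the most frequent word
  names(model$unigrams)[1]
}

# Toy training text (an assumption for the example, not corpus data)
toy_tokens <- strsplit(
  "i like green eggs i like green tea i like ham i am sam", " "
)[[1]]
toy_model <- train_backoff(toy_tokens)

predict_next(toy_model, c("i", "like"))       # trigram context is known
predict_next(toy_model, c("you", "like"))     # backs off to the bigram
predict_next(toy_model, c("purple", "rain"))  # backs off to the unigrams
```

A real application would normally replace the last step with a smoothed estimate, since always returning the most frequent word gives the same suggestion for every unseen context.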

Cleaning and filtering data

The HC Corpora data was downloaded and unzipped, which created four folders with three .txt files in each. Only the data from the en_US folder is used.

Description of data

The en_US data is read in, and a summary of each file is shown.

## [1] 200.4242
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    47.0   157.0   231.7   331.0 40835.0
## Warning in readLines(FileNews): incomplete final line found on
## 'en_US.news.txt'
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2     111     186     203     270    5760
## Warning in readLines(FileTwitter): line 167155 appears to contain an
## embedded nul
## Warning in readLines(FileTwitter): line 268547 appears to contain an
## embedded nul
## Warning in readLines(FileTwitter): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(FileTwitter): line 1759032 appears to contain an
## embedded nul
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    37.0    64.0    68.8   100.0   213.0
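The warnings above ("incomplete final line" and "embedded nul") come from `readLines()`. One possible way to avoid them, sketched here under assumptions, is to read through a connection opened in binary mode and to skip embedded nuls. In text mode the "incomplete final line" warning may mean the read stopped before the real end of the file, so the line count is worth checking afterwards. The helper name `read_corpus_file` and the small temporary file are made up for the example; the real files are not touched.

```r
# Hypothetical helper: read a text file through a binary connection.
read_corpus_file <- function(path) {
  con <- file(path, open = "rb")  # binary mode: bytes are passed through as-is
  on.exit(close(con))
  # skipNul = TRUE drops embedded nul characters instead of warning about them
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small demo file standing in for one of the en_US files
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "a somewhat longer second line", "third"), demo_path)

demo_lines <- read_corpus_file(demo_path)
length(demo_lines)               # number of lines read
summary(nchar(demo_lines))       # distribution of characters per line
file.size(demo_path) / 1024^2    # file size in MiB
```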

The data is cleaned by removing extra whitespace, punctuation and numbers. Unigram, bigram and trigram models are then created.

## [1] 899288
## [1] 77259
## [1] 2360148
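The cleaning and n-gram steps described above can be illustrated with a small base R sketch. The report itself uses the tm and quanteda packages, so this is an assumption-laden stand-in rather than the code that produced the output: `clean_text`, `make_ngrams` and `count_ngrams` are hypothetical names, and punctuation is replaced by a space here, which splits contractions such as "isn't" into two tokens.

```r
# Lower-case, drop digits and punctuation, collapse repeated whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:digit:]]+", " ", x)
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Return all n-grams of one cleaned line as space-joined strings.
make_ngrams <- function(line, n) {
  tokens <- strsplit(line, " ", fixed = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Count n-grams over several lines, most frequent first.
# N-grams never cross a line boundary.
count_ngrams <- function(lines, n) {
  grams <- as.character(unlist(lapply(clean_text(lines), make_ngrams, n = n)))
  sort(table(grams), decreasing = TRUE)
}

# Made-up sample lines, not taken from the corpus
sample_lines <- c("The cat sat on the mat, twice!", "The cat sat down at 9.")
unigrams <- count_ngrams(sample_lines, 1)
bigrams  <- count_ngrams(sample_lines, 2)
trigrams <- count_ngrams(sample_lines, 3)
head(trigrams, 3)
```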

Plots

[Three histograms. For each one ggplot2 reported: `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.]

n-gram

This part builds the n-gram models for the whole corpus.

Let’s prepare the data first.

## Warning: Argument removeTwitter not used.
## Warnings in plot.window(...), plot.xy(xy, type, ...), axis(side = side, at = at, labels = labels, ...), box(...) and title(...):
## "max.words", "colors" and "scale" are not graphical parameters

## Warning in tm_map.SimpleCorpus(modi, stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(moddata, tolower): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(moddata, removeNumbers): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(moddata, removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(moddata, removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(moddata, removeWords, c("and", "the",
## "our", : transformation drops documents
## Warning in tm_map.SimpleCorpus(wrds, stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(wrds, content_transformer(tolower)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(wrds, removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(wrds, removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(wrds, removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(wrds, stemDocument): transformation drops
## documents


The wordcloud function gives the following error message, so execution of that line is skipped:

“no non-missing arguments to max; returning -Inf
Error in strwidth(words[i], cex = size[i], …) : invalid ‘cex’ value”