N-grams word prediction model
Exploratory Analysis
The goal here is to build a first simple model of the relationship between words. This is the first step in building a predictive text mining application. We start with simple models and then move on to more complicated modeling techniques.
Tasks to accomplish
Build a basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words. Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.
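The two tasks above can be sketched together in base R. This is a minimal illustration, not the model built in this report: the helper names (`tokenize`, `count_ngrams`, `predict_next`) and the toy corpus are assumptions made for the example, and unseen n-grams are handled with a simple back-off (drop to a shorter context when the longer one was never observed) rather than any particular smoothing method.

```r
# Split a sentence into lower-case word tokens (letters and apostrophes only).
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Count every n-gram of order n; returns a named integer vector.
count_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(sentences, function(s) {
    w <- tokenize(s)
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# models[[n]] holds the counts for n-grams of order n.
predict_next <- function(history, models) {
  w <- tokenize(history)
  # Try the longest usable context first, then back off to shorter ones.
  for (k in min(length(w), length(models) - 1):0) {
    counts <- models[[k + 1]]
    # With no context left, fall back to the most frequent single word.
    if (k == 0) return(names(counts)[which.max(counts)])
    prefix <- paste(tail(w, k), collapse = " ")
    hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(".* ", "", best))  # last word of the winning n-gram
    }
  }
}

# Toy corpus, invented for the example.
corpus <- c("i love new york", "i love new shoes",
            "i love new york pizza", "we live in new jersey")
models <- lapply(1:4, function(n) count_ngrams(corpus, n))

predict_next("they love new", models)  # trigram context "love new" is seen
predict_next("visit new", models)      # backs off to the bigram context "new"
predict_next("hello there", models)    # backs off to the unigram counts
```

Backing off keeps the model usable for word combinations that never appeared in the corpora, at the cost of ignoring part of the context.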
Cleaning and filtering data
The HC Corpora data was downloaded. When unzipped, it created four folders with three txt files in each of them. Only the data from the en_us folder is used here.
Description of data
The en_us data is read first, and then a summary of the data is shown.
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
## Loading required package: RColorBrewer
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
## [1] 200.4242
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     1.0    47.0   157.0   231.7   331.0 40835.0
## Warning in readLines(FileNews): incomplete final line found on
## 'en_US.news.txt'
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       2     111     186     203     270    5760
## Warning in readLines(FileTwitter): line 167155 appears to contain an
## embedded nul
## Warning in readLines(FileTwitter): line 268547 appears to contain an
## embedded nul
## Warning in readLines(FileTwitter): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(FileTwitter): line 1759032 appears to contain an
## embedded nul
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     2.0    37.0    64.0    68.8   100.0   213.0
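A read-and-summarise step of this kind can be sketched as follows. The code that produced the output above is not shown, so this is an assumption about its shape: it treats the summaries as characters per line, and it uses a small temporary file in place of the real en_US files so that it runs anywhere. Passing `skipNul = TRUE` and `warn = FALSE` to `readLines` is one way to avoid the "embedded nul" and "incomplete final line" warnings.

```r
# Write a tiny stand-in file so the sketch runs without the real corpus.
path <- tempfile(fileext = ".txt")
writeLines(c("first line of text",
             "a second, longer line of text",
             "short"), path)

# skipNul = TRUE skips embedded nul bytes instead of cutting the line short;
# warn = FALSE silences the missing-final-newline and embedded-nul warnings.
lines <- readLines(path, skipNul = TRUE, warn = FALSE)

size_mb <- file.info(path)$size / 1024^2  # file size in megabytes
line_lengths <- nchar(lines)              # characters per line

length(lines)          # number of lines read
summary(line_lengths)  # Min, quartiles, Mean and Max of the line lengths
```

For the real data, `path` would point at one of the three txt files in the en_us folder.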
The data is cleaned by removing extra whitespace, punctuation and numbers. We then create unigram, bigram and trigram models.
## [1] 899288
## [1] 77259
## [1] 2360148
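The cleaning steps described above can be illustrated in base R. This sketch is not the pipeline used in this report (which relies on tm transformations); `clean_text` and `make_ngrams` are hypothetical helpers written for the example, and the sample sentence is invented.

```r
# Lower-case the text, then remove numbers, punctuation and extra whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:digit:]]+", "", x)   # remove numbers
  x <- gsub("[[:punct:]]+", "", x)   # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  trimws(x)
}

# Build all n-grams of order n from one cleaned string.
make_ngrams <- function(x, n) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

cleaned <- clean_text("  The 2 quick, brown foxes   jumped over 10 fences!  ")
cleaned

unigrams <- make_ngrams(cleaned, 1)
bigrams  <- make_ngrams(cleaned, 2)
trigrams <- make_ngrams(cleaned, 3)
bigrams
```

Deleting punctuation outright joins words that were separated only by punctuation (for example "quick,brown" would become "quickbrown"), so replacing it with a space is a reasonable alternative when that matters.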
Histograms of the data are plotted next:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
N-gram
This part builds the n-gram models for the whole corpus.
Let’s take care of the data first.
## Package version: 1.5.2
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, stopwords
## The following object is masked from 'package:utils':
##
## View
## Warning: Argument removeTwitter not used.
## Warning in plot.window(...): "max.words" is not a graphical parameter
## Warning in plot.window(...): "colors" is not a graphical parameter
## Warning in plot.window(...): "scale" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "max.words" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "colors" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "scale" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "max.words" is
## not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "colors" is
## not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "scale" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "max.words" is
## not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "colors" is
## not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "scale" is not
## a graphical parameter
## Warning in box(...): "max.words" is not a graphical parameter
## Warning in box(...): "colors" is not a graphical parameter
## Warning in box(...): "scale" is not a graphical parameter
## Warning in title(...): "max.words" is not a graphical parameter
## Warning in title(...): "colors" is not a graphical parameter
## Warning in title(...): "scale" is not a graphical parameter
## Warning in tm_map.SimpleCorpus(modi, stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(moddata, tolower): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(moddata, removeNumbers): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(moddata, removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(moddata, removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(moddata, removeWords, c("and", "the",
## "our", : transformation drops documents
## Warning in tm_map.SimpleCorpus(wrds, stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(wrds, content_transformer(tolower)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(wrds, removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(wrds, removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(wrds, removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(wrds, stemDocument): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(wrds, stemDocument): transformation drops
## documents
The wordcloud function gives the following error message, so execution of that line is skipped:
“no non-missing arguments to max; returning -Inf Error in strwidth(words[i], cex = size[i], …) : invalid ‘cex’ value”
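The first half of that message is what base R prints when `max()` is given an empty vector. One plausible cause, which is only an assumption because the failing call is not shown, is that an empty or unusable frequency vector reached the plotting step. The sketch below reproduces the `-Inf` result and adds a hypothetical guard, `can_plot_cloud`, that could be checked before drawing.

```r
# max() over an empty vector warns and returns -Inf.
empty_max <- suppressWarnings(max(numeric(0)))
empty_max

# Hypothetical helper: TRUE only when there is something sensible to draw.
can_plot_cloud <- function(words, freq) {
  length(words) > 0 &&
    length(words) == length(freq) &&
    !anyNA(freq) &&
    all(is.finite(freq)) &&
    all(freq > 0)
}

# Invented frequencies, for illustration only.
freq <- c(data = 12, model = 7, word = 3)

can_plot_cloud(names(freq), freq)         # usable input
can_plot_cloud(character(0), numeric(0))  # empty input, skip the plot
```

With a check like this, the word cloud call would run only when the guard returns TRUE, and the report could print a short note otherwise.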