Text Mining: An N-Gram Model to Predict the Next Word

Introduction

The basic goal of the project is to build a predictive model of English text: if somebody types two words, the model predicts the next word.

The prediction is based on a text dataset consisting of text files with blog postings, news articles and Twitter messages. The data originally comes from a corpus called HC Corpora.

Usual steps for natural language processing

  1. import texts into R
  2. combine and structure the texts to be able to access them in a uniform manner (creating a corpus)
  3. tidy up the text, including preprocessing to obtain a convenient representation for later analysis. This step might involve text reformatting (conversion to lower case, whitespace removal), normalization (stopword removal, stemming) and tokenization (splitting lines of text into tokens)
  4. transform the preprocessed text into a structured format. This usually implies the creation of a so-called term-document matrix
  5. do whatever you have to do (sentiment, classification, prediction, etc)

Considerations

  1. What is the most appropriate text base for the prediction model? And how do I decide that? Is it wise to build one prediction model for a combined text body or is it better to split the prediction according to the type of text?

    Literature suggests that it enhances the prediction quality if different types of texts are mixed.

  2. Do I need the whole dataset or is a subset sufficient for a good enough model? Memory limitations, runtime performance, prediction accuracy and restricted development capacity should be considered.

    Literature suggests that the bigger the dataset, the better for the prediction model.

  3. How (much) do I tidy the text? Do I use stemming, or would this jeopardize the prediction? The same question applies to stopword removal. On the other hand, the cleaner the text body, the better the chances of an acceptable runtime. Do I remove numbers, convert them into text, or leave them be?

    Used in the first trial:

    • tolower
    • remove numbers
    • remove profanity words
    • remove punctuation
    • remove white space

    Under consideration for the final trial:

    • remove special characters
    • remove punctuation except single quote and dash
    • convert numbers to text or leave them as numbers (otherwise "3-year-old" becomes "-year-old")
    • remove stop words
    • stemming
    • remove sparse terms from DTM
  4. Do I remove sparse terms from the Document Term Matrix or do I treasure them for the most accurate prediction? Is a prediction on sparse terms worth something? The average length of an English word is 4.79 letters, and 80% of words are between 2 and 7 letters long (see reference). This gives me some confidence to restrict the maximum word length to 20 when constructing a DTM. For the first trial I do not remove sparse terms.

  5. What are available packages which might be relevant for this analysis?

    Used in the first trial:

    • tm: to create and clean up a text corpus
    • ngram: to create ngrams
    • data.table: to handle a lot of data efficiently
    • ggplot2: draw plots

    Under consideration for the final trial:

    • corpus: to create and clean up a text corpus
    • filehash: might solve memory/runtime issues
    • quanteda: text mining
    • SnowballC: stemming
    • tidytext: text mining
    • stringi: easy text summary
    • RWeka: to create ngrams
  6. What are the modelling approaches for predicting the next word? All literature points in the direction of n-gram models (see Wikipedia for more information). The basic goal is to build an n-gram model which predicts a word given the previous one, two, or maybe three words, based on combinations of words observed in the data set.

Additionally, the model needs to cover inputs that are not observed in the data set. One way to handle these cases is so-called smoothing methods, which smooth the probability distributions by assigning non-zero probabilities to unseen words or n-grams. The simplest is Laplace smoothing (assign a count of 1 to unseen n-grams); Katz's back-off model and Good-Turing frequency estimation seem to be more sophisticated, applicable models.
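To make this concrete, here is a minimal sketch (with illustrative counts, not the project data; the function names are my own) of the maximum-likelihood estimate of a trigram probability and its Laplace-smoothed variant:

mleProb     <- function(trigramCount, bigramCount) trigramCount / bigramCount                  ## P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
laplaceProb <- function(trigramCount, bigramCount, V) (trigramCount + 1) / (bigramCount + V)   ## add-one smoothing, V = vocabulary size

mleProb(2, 30)              ## a trigram seen 2 times after a prefix seen 30 times
laplaceProb(0, 30, 54217)   ## an unseen trigram still gets a small non-zero probability (V = 54,217 unigrams in the sample below)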

  7. How do I measure the quality of the prediction? Is the model good enough? Once I have figured out how to predict anything, I will measure the quality of the prediction by splitting the data into training, test and evaluation sets; this should give me an idea of the accuracy and the out-of-sample error when predicting words.
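A rough sketch of such a split (assumed 60/20/20 proportions, applied to the blogs, news and twitter character vectors loaded further below):

set.seed(1234)                                           ## for reproducibility
alltext <- c(blogs, news, twitter)                       ## all lines of text
idx <- sample(c("train", "test", "eval"), length(alltext),
              replace = TRUE, prob = c(0.6, 0.2, 0.2))   ## assumed 60/20/20 split
trainset <- alltext[idx == "train"]
testset  <- alltext[idx == "test"]
evalset  <- alltext[idx == "eval"]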

Load packages

library(tm)
library(ngram)
library(ggplot2)
library(data.table)

Summary of all data

In total the three source files hold about 4 million rows with about 100 million words.

Type      Filename             Number of rows   Number of words   Avg number of words per row
blogs     en_US.blogs.txt             899,288        37,334,131                          41.5
news      en_US.news.txt            1,010,242        34,372,530                          34.0
twitter   en_US.twitter.txt         2,360,148        30,373,543                          12.9

Summary of sample data

Load the first 10,000 records of each of the three sets, plus the profanity words to be excluded:

path <- "C:/Harriet/MOOC/DataScience Johns Hopkins/wd/wd10/en_US/"
file1 <- "en_US.blogs.txt"
file2 <- "en_US.news.txt"
file3 <- "en_US.twitter.txt"
file4 <- "full-list-of-bad-words_text-file_2018_07_30.txt"

con <- file(paste0(path,file1), open ="rt")                     ## open connection
blogs <- readLines(con, 10000, encoding = "UTF-8")              ## read the first 10,000 lines of text 
close(con)                                                      ## close the connection

con <- file(paste0(path,file2), open ="rt")   
news <- readLines(con, 10000, encoding = "UTF-8")                    
close(con)                                   

con <- file(paste0(path,file3), open ="rt")                     
twitter <- readLines(con, 10000, encoding = "UTF-8")                    
close(con)

profanitywords <- read.delim(paste0(path,file4),header=FALSE,skip=13,encoding = "UTF-8" )
blogsnrrows <- length(blogs)
newsnrrows <- length(news)
twitternrrows <- length(twitter)

blogsnrwords <- wordcount(blogs)
newsnrwords <- wordcount(news)
twitternrwords <- wordcount(twitter)

blogsnrwordsavg <- wordcount(blogs,count.function=mean)
newsnrwordsavg <- wordcount(news,count.function=mean)
twitternrwordsavg <- wordcount(twitter,count.function=mean)

In total the three sample files hold 30,000 rows (lines of text) with about 900,000 words, not even 1% of the source.

Type      Filename             Number of rows   Number of words   Avg number of words per row
blogs     en_US.blogs.txt              10,000           410,620                          41.0
news      en_US.news.txt               10,000           343,929                          34.3
twitter   en_US.twitter.txt            10,000           127,674                          12.8

Create and clean the text corpus

#corpus <- Corpus(VectorSource(dt$tmp)) # dt obsolete, the input is fed directly into the corpus
#corpus <- Corpus(VectorSource(rbind(blogs, news, twitter))) # a simple Corpus accepts neither the bigram nor the trigram tokenizer
corpus <- VCorpus(VectorSource(rbind(blogs, news, twitter))) # VCorpus does accept the bigram and trigram tokenizers, but is very slow
#corpus <- PCorpus(VectorSource(rbind(blogs, news, twitter))) # PCorpus might be the final solution for the complete data

#library(filehash) # package used to update/maintain permanent corpus database
#corpus <- PCorpus(DirSource("training", encoding="UTF-8",mode="text"), dbControl=list(dbName="corpus.db", dbType="DB1"))

#cleaning start
#inspect(corpus[1:5]) # initial state Corpus
inspect(corpus[[5]]) # initial state of fifth entry VCorpus
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 153
## 
## The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.
corpusClean = tm_map(corpus, content_transformer(tolower))
rm(corpus)
corpusClean = tm_map(corpusClean, removeWords, profanitywords$V1)
corpusClean = tm_map(corpusClean, removeNumbers) ## maybe translate numbers to words instead?
corpusClean = tm_map(corpusClean, removePunctuation)  
corpusClean = tm_map(corpusClean, stripWhitespace)

## remove special characters, URLs (https etc.)?
##corpusClean = tm_map(corpusClean, removeWords, stopwords("english"))
##corpusClean <- tm_map(corpusClean, stemDocument, "english")
##corpusClean = tm_map(corpusClean, removePunctuation, preserve_intra_word_contractions = TRUE, preserve_intra_word_dashes = TRUE) ## will also remove hashtags
##corpusClean <- tm_map(corpusClean, PlainTextDocument) # conversion needed for the DTM on a VCorpus if tolower is not wrapped in content_transformer, because that breaks the corpus structure
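## (hypothetical addition, not applied in this trial) strip URLs before removing punctuation:
##   removeURL   <- content_transformer(function(x) gsub("(https?://|www\\.)\\S+", " ", x))
##   corpusClean <- tm_map(corpusClean, removeURL)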

## NOTE TO SELF: all of this cleaning also needs to be applied to the user input in the Shiny app!

inspect(corpusClean[[5]]) # final state of fifth entry VCorpus after cleaning
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 145
## 
## the st louis plant had to close it would die of old age workers had been making cars there since the onset of mass automotive production in the s
rm(blogs)
rm(news)
rm(twitter)
rm(profanitywords)

Tokenization

A token is a meaningful unit of text to be used for further analysis. Tokenization is the process of splitting text into tokens. In this case we split the text body into unigrams (one word), bigrams (two words) and trigrams (three words).
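As a small standalone illustration (independent of the corpus pipeline below), the ngram package can tokenize a single string directly:

ng <- ngram("to be or not to be", n = 2)   ## build the bigrams of one sentence
get.ngrams(ng)                             ## returns the distinct bigrams, e.g. "to be", "be or", "or not", "not to"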

#Unigrams (no extra tokenizer needed as default for tm package)
unigramdtm <- TermDocumentMatrix(corpusClean, control = list(wordLengths = c(3, 20)))
#unigramdtm
#unigramdtm <- removeSparseTerms(unigramdtm, 0.99)
#unigramdtm   # see effects of removeSparseTerms

#Bigrams
bigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE) }
bigramdtm <- TermDocumentMatrix(corpusClean, control = list(wordLengths = c(3, 40),tokenize = bigramTokenizer))
#bigramdtm <- removeSparseTerms(bigramdtm, 0.99)
#bigramdtm

#Trigrams
trigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE) }
trigramdtm <- TermDocumentMatrix(corpusClean, control = list(wordLengths = c(3, 60),tokenize = trigramTokenizer))
#trigramdtm
#inspect(trigramdtm[1000:1005,100:105])
#trigramdtm <- removeSparseTerms(trigramdtm, 0.99)
#trigramdtm

rm(corpusClean)

As a result of the tokenization of the 30,000 sample documents (text lines) there are

  • 54,217 unigrams,
  • 399,296 bigrams and
  • 685,765 trigrams

Exploratory Analysis

How frequently do unigrams, bigrams and trigrams appear in the data set?

#Unigrams
unigramdtm <- sort(slam::row_sums(unigramdtm), decreasing=T)
unigramfreq <- data.table(tok = names(unigramdtm), freq = unigramdtm)

ggplot(unigramfreq[1:25,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = "steelblue") + theme_bw() +
     ggtitle("Frequency of unigrams: Top 25") +labs(x = "", y = "")

#Bigrams
bigramdtm <- sort(slam::row_sums(bigramdtm), decreasing=T)
bigramfreq <- data.table(tok = names(bigramdtm), freq = bigramdtm)

ggplot(bigramfreq[1:25,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = "darkgreen") + theme_bw() +
     ggtitle("Frequency of bigrams: Top 25") +labs(x = "", y = "")

#Trigrams
trigramRowSums <- sort(slam::row_sums(trigramdtm), decreasing=T)
trigramfreq <- data.table(tok = names(trigramRowSums), freq = trigramRowSums)

ggplot(trigramfreq[1:25,], aes(x = reorder(tok,freq), y = freq)) + coord_flip() +
     geom_bar(stat = "identity", fill = "coral") + theme_bw() +
     ggtitle("Frequency of trigrams: Top 25") +labs(x = "", y = "")

rm(unigramdtm)
rm(bigramdtm)
rm(trigramdtm)

Model

Prepare the first prediction model based on the frequency table of trigrams.

Split each token into prefix and rest

splitTokenTrigram = function(x){

    prefix = character(nrow(x))
    rest = character(nrow(x))
    tmp <- strsplit(x$tok, " ", fixed=TRUE)

    for(i in 1:nrow(x)){
        prefix[i] = paste(tmp[[i]][1],tmp[[i]][2])
        rest[i] = tmp[[i]][3]
    }
    x$prefix <- prefix
    x$rest <- rest
    #list(prefix=prefix, rest=rest)
    return(x)
}

trigramfreq <- splitTokenTrigram(trigramfreq)
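A vectorized variant (just a sketch, not used above) would avoid the explicit loop, which may matter once the full corpus is tokenized:

splitTokenTrigramVec = function(x){
    x$prefix <- sub("\\s+\\S+$", "", x$tok)   ## everything before the last space
    x$rest   <- sub("^.*\\s+", "", x$tok)     ## only the last word
    return(x)
}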

Input Output

Two words as input; candidates for the third word as output, sorted by descending frequency.

outputTrigram = function(x){
  subset(trigramfreq, prefix == x)
}

For example:

outputTrigram('case of')
##                       tok freq  prefix          rest
##  1:           case of the    2 case of           the
##  2:             case of a    1 case of             a
##  3:            case of an    1 case of            an
##  4:           case of any    1 case of           any
##  5:        case of anyone    1 case of        anyone
##  6:      case of argument    1 case of      argument
##  7:     case of baltimore    1 case of     baltimore
##  8: case of communicating    1 case of communicating
##  9:      case of computer    1 case of      computer
## 10:       case of culture    1 case of       culture
## 11: case of cyberprotests    1 case of cyberprotests
## 12: case of doubledipping    1 case of doubledipping
## 13:     case of instalove    1 case of     instalove
## 14:           case of jan    1 case of           jan
## 15:        case of london    1 case of        london
## 16:      case of military    1 case of      military
## 17:     case of mochizuki    1 case of     mochizuki
## 18:          case of more    1 case of          more
## 19:        case of oxford    1 case of        oxford
## 20:      case of pakistan    1 case of      pakistan
## 21:           case of red    1 case of           red
## 22:        case of romans    1 case of        romans
## 23:         case of russo    1 case of         russo
## 24:      case of swelling    1 case of      swelling
## 25:          case of this    1 case of          this
## 26:        case of trying    1 case of        trying
## 27:       case of vermeer    1 case of       vermeer
##                       tok freq  prefix          rest

Next steps

  • create a corpus of 4 million documents
  • create an advanced prediction model with a smoothing method (see the back-off sketch below)
  • split into training and test sets and assess accuracy
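One possible shape for that advanced model is a simple back-off predictor: look up the last two words of the input in the trigram table, fall back to the bigram table, and finally to the most frequent unigrams. The sketch below assumes that bigramfreq has been split into prefix and rest in the same way as trigramfreq (not done above) and that unigramfreq is still available; proper smoothing would still have to be added.

predictNextWord = function(input, n = 3){
    w   <- tolower(unlist(strsplit(trimws(input), "\\s+")))
    len <- length(w)
    if(len >= 2){                                            ## try the trigram table first
        cand <- trigramfreq[prefix == paste(w[len-1], w[len])]
        if(nrow(cand) > 0) return(head(cand[order(-freq)]$rest, n))
    }
    if(len >= 1){                                            ## back off to the bigram table
        cand <- bigramfreq[prefix == w[len]]                 ## assumes bigramfreq also has prefix/rest columns
        if(nrow(cand) > 0) return(head(cand[order(-freq)]$rest, n))
    }
    head(unigramfreq[order(-freq)]$tok, n)                   ## last resort: the most frequent unigrams
}

predictNextWord("case of")   ## e.g. "the" plus two more candidates, cf. the trigram table above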

References

Appendix

Code

Load all data

#con <- file(paste0(path,file1), open ="rb")                     ## open connection
con <- file(paste0(path,file1), open ="rt")                     ## open connection
blogs <- readLines(con,  encoding = "UTF-8")               
close(con)                                                      ## close the connection
#blogs <- iconv(blogs, from = "latin1", to = "UTF-8", sub="")    ## delete not UTF-8 characters

#con <- file(paste0(path,file2), open ="rb")
con <- file(paste0(path,file2), open ="rt")   
news <- readLines(con,  encoding = "UTF-8")                    
close(con)                                   
#news <- iconv(news, from = "latin1", to = "UTF-8", sub="")

#con <- file(paste0(path,file3), open ="rb")                    
con <- file(paste0(path,file3), open ="rt")   
twitter <- readLines(con,  encoding = "UTF-8")                    
close(con)
#twitter <- iconv(twitter, from = "latin1", to = "UTF-8", sub="") 

Summary of source data

blogsnrrows <- length(blogs)
newsnrrows <- length(news)
twitternrrows <- length(twitter)

blogsnrwords <- wordcount(blogs)
newsnrwords <- wordcount(news)
twitternrwords <- wordcount(twitter)

blogsnrwordsavg <- wordcount(blogs,count.function=mean)
newsnrwordsavg <- wordcount(news,count.function=mean)
twitternrwordsavg <- wordcount(twitter,count.function=mean)

Experiments with Blog Sample Corpus

Load the first 100,000 records of the blogs file:

path <- "C:/Harriet/MOOC/DataScience Johns Hopkins/wd/wd10/en_US/"
file1 <- "en_US.blogs.txt"

con <- file(paste0(path,file1), open ="rb")                     ## open connection
blogs <- readLines(con, 100000, encoding = "UTF-8")             ## read the first 100,000 lines of text 
close(con)                                                      ## close the connection
blogs <- iconv(blogs, from = "latin1", to = "UTF-8", sub="")    ## drop non-UTF-8 characters
corpus <- VCorpus(VectorSource(blogs))

Mind: the following exploration covers about 11% of the blog data (100,000 of 899,288 lines).

From the blogs sample file (28.5 MB) a corpus is created (422.2 MB) and cleaned (421.5 MB).

Based on the clean corpus a DTM is constructed for unigrams (54.7 MB) where sparse terms are removed (0.999 sparsity: 37 MB, 0.95 sparsity: 16 MB, 0.9 sparsity: 12 MB).

<<DocumentTermMatrix (documents: 100000, terms: 128858)>> Non-/sparse entries: 2621485/12883178515 Sparsity : 100% Maximal term length: 25 Weighting : term frequency (tf)

The DTM output shows that the 100,000 lines hold 128,858 terms (unigrams) which appear at least once.

If the sparsity limit is set to 0.999, any term that does not appear in at least 0.1% of the lines is removed. Only those terms remain which occur in at least 100 of the 100,000 lines; that leaves 3,126 terms.

<<DocumentTermMatrix (documents: 100000, terms: 3126)>> Non-/sparse entries: 2008942/310591058 Sparsity : 99% Maximal term length: 14 Weighting : term frequency (tf)

If the sparsity limit is set to 0.95, only 55 terms remain, namely those which occur in at least 5,000 (5%) of the 100,000 lines.

<<DocumentTermMatrix (documents: 100000, terms: 55)>> Non-/sparse entries: 648179/4851821 Sparsity : 88% Maximal term length: 7 Weighting : term frequency (tf)

If the sparsity limit is set to 0.9, only 18 (!) terms remain, namely those which occur in at least 10,000 (10%) of the 100,000 lines.

<<DocumentTermMatrix (documents: 100000, terms: 18)>> Non-/sparse entries: 385346/1414654 Sparsity : 79% Maximal term length: 5 Weighting : term frequency (tf)
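As a quick cross-check of the thresholds above, a small approximate helper (my own, not part of the tm package) maps a removeSparseTerms() sparsity value to the minimum number of documents a term must appear in to survive:

minDocs <- function(nDocs, sparsity) round(nDocs * (1 - sparsity))
minDocs(100000, c(0.999, 0.95, 0.9))   ## roughly 100, 5000 and 10000 documents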

Two tokenization methods are compared for bigrams: ngram vs. RWeka.

Version NGRAM:

bigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE) }
trigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE) }

Version RWEKA:

bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

  • NGRAM tokenization
bigramTokenizer <- function(x) { unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE) }
system.time(bigramdtm <- TermDocumentMatrix(corpusClean, control = list(wordLengths = c(3, 25),tokenize = bigramTokenizer)))
##       user      system     elapsed 
##     129.36        0.71      135.61 

Based on the clean corpus a DTM is constructed for bigrams with ngram (about 2.26 minutes, 156 MB, 1.2 million terms), from which sparse terms are removed (0.99 sparsity: 11.4 MB, 156 terms).

<<TermDocumentMatrix (terms: 1274683, documents: 100000)>> Non-/sparse entries: 3865569/127464434431 Sparsity : 100% Maximal term length: 25 Weighting : term frequency (tf)

<<TermDocumentMatrix (terms: 156, documents: 100000)>> Non-/sparse entries: 347168/15252832 Sparsity : 98% Maximal term length: 11 Weighting : term frequency (tf)

Do I need to double the maximum word length because two words are combined?

system.time(bigramdtm <- TermDocumentMatrix(corpusClean, control = list(wordLengths = c(3, 50), tokenize = bigramTokenizer)))
##       user      system     elapsed 
##     133.14        1.55      139.32

<<TermDocumentMatrix (terms: 1278944, documents: 100000)>> Non-/sparse entries: 3869867/127890530133 Sparsity : 100% Maximal term length: 50 Weighting : term frequency (tf)

  • RWEKA tokenization
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
system.time(bigramdtm <- TermDocumentMatrix(corpusClean, control = list(wordLengths = c(3, 25),tokenize = bigramTokenizer)))
##       user      system     elapsed 
##     195.54        1.86      197.19 

Based on the clean corpus a DTM is constructed for bigrams with RWeka (about 3.29 minutes, 156 MB, 1.2 million terms), from which sparse terms are removed (0.99 sparsity: 11.7 MB, 164 terms).

<<TermDocumentMatrix (terms: 1263453, documents: 100000)>> Non-/sparse entries: 3898423/126341401577 Sparsity : 100% Maximal term length: 25 Weighting : term frequency (tf)

<<TermDocumentMatrix (terms: 164, documents: 100000)>> Non-/sparse entries: 366515/16033485 Sparsity : 98% Maximal term length: 11 Weighting : term frequency (tf)

Software Environment

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252   
## [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Austria.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.12.2 ggplot2_3.2.0     ngram_3.0.4       tm_0.7-6         
## [5] NLP_0.2-0        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1       knitr_1.23       xml2_1.2.0       magrittr_1.5    
##  [5] tidyselect_0.2.5 munsell_0.5.0    colorspace_1.4-1 R6_2.4.0        
##  [9] rlang_0.4.0      dplyr_0.8.3      stringr_1.4.0    tools_3.6.1     
## [13] parallel_3.6.1   grid_3.6.1       gtable_0.3.0     xfun_0.8        
## [17] withr_2.1.2      htmltools_0.3.6  assertthat_0.2.1 yaml_2.2.0      
## [21] lazyeval_0.2.2   digest_0.6.20    tibble_2.1.3     crayon_1.3.4    
## [25] purrr_0.3.2      glue_1.3.1       evaluate_0.14    slam_0.1-45     
## [29] rmarkdown_1.14   labeling_0.3     stringi_1.4.3    pillar_1.4.2    
## [33] compiler_3.6.1   scales_1.0.0     pkgconfig_2.0.2

This analysis is part of the "Capstone Project" of the Data Science Specialization offered by Johns Hopkins University on Coursera.