Synopsis

The goal is to show how to approach the problem of predicting the next word a user will type, using text files as a training set. Sometimes we will need to predict a word that has not appeared in the training set. Files based on sample Twitter feeds, news, and blogs are loaded and analyzed using R's tm package. The final objective is to generate n-gram probability models that predict the next word so the user can enter it in a single movement.

Data Processing

We first tried brute-force counting of lines in R, which works but is slow:

options(warn=-1)   # suppress warnings about an incomplete final line
x <- 0
con <- file("en_US.blogs.txt", "r")
a <- readLines(con, 1)
while (length(a) == 1) {   # readLines returns character(0) at end of file
        a <- readLines(con, 1)
        x <- x + 1
}
close(con)
x
## [1] 899288
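A quicker pure-R alternative is to read the file in larger chunks and add up the chunk lengths. This is only a sketch, not part of the original analysis; the countLines name and the 10,000-line chunk size are arbitrary choices.

countLines <- function(path, chunkSize = 10000) {
    # Read the file chunkSize lines at a time and sum the lengths of the chunks.
    con <- file(path, "r")
    on.exit(close(con))
    n <- 0
    repeat {
        chunk <- readLines(con, chunkSize)
        n <- n + length(chunk)
        if (length(chunk) < chunkSize) break
    }
    n
}
countLines("en_US.blogs.txt")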

but an easier way is to use the command line (cat and wc are available on Windows through a Unix-style shell such as Git Bash):

Running cat en_US.twitter.txt | wc (and likewise for the other two files) gives the number of lines and words in each data set.

US twitter file is 2360148 lines and 30341028 words.

US news file is 1010242 lines and 34309642 words.

US blogs file is 899288 lines and 37272578 words.
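These counts can also be cross-checked in R. The snippet below is only a sketch: splitting on whitespace will not agree with wc exactly on every line, it reads the whole file into memory so it is slow on the full data sets, and skipNul = TRUE simply guards against embedded nul characters that readLines may otherwise warn about.

lines <- readLines("en_US.twitter.txt", skipNul = TRUE)
length(lines)                             # number of lines
sum(lengths(strsplit(lines, "\\s+")))     # approximate word count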

Now we are going to read the file, create a corpus, convert everything to lower case, remove all punctuation, remove numbers, remove stop words, and plot a histogram of what we found. Loading the whole Twitter file took about 30 minutes, so we'll just run this with the first 2360 lines (about 0.1% of the data).

aFile = readLines("en_US.twitter.txt", n = 2360)   # first 2360 lines only
 
library(tm)
## Loading required package: NLP
library(RWeka)
library(wordcloud)
## Loading required package: RColorBrewer
myCorpus = Corpus(VectorSource(aFile))   # one document per line of text

myCorpus = tm_map(myCorpus, tolower)                             # convert to lower case
myCorpus = tm_map(myCorpus, removePunctuation)
myCorpus = tm_map(myCorpus, removeNumbers)
myCorpus = tm_map(myCorpus, removeWords, stopwords("english"))   # drop English stop words
myCorpus = tm_map(myCorpus, PlainTextDocument)                   # restore the PlainTextDocument class after tolower

BigramTokenizer = function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))   # RWeka tokenizer that returns two-word phrases

myDTM = TermDocumentMatrix(myCorpus, control = list(tokenize = BigramTokenizer))
 
# We can also identify high-frequency terms (none reach these thresholds in this small sample).
findFreqTerms(myDTM, lowfreq = 50)
## character(0)
findFreqTerms(myDTM, lowfreq = 100)
## character(0)

dtm.matrix = as.matrix(myDTM)
words = rowSums(dtm.matrix)   # total frequency of each bigram across all lines
hist(words)

words = sort(words, decreasing = TRUE)
wordcloud(names(words), words, min.freq = 5)   # threshold lowered for the 0.1% sample; 75 would filter out every bigram here

head(sort(words, decreasing = TRUE))
## last night  cant wait  right now  dont know   just got    lets go 
##         15         14         14          9          9          9
tail(sort(words, decreasing = TRUE))
##  zombie magazine       zone block        zone game       zoom meant 
##                1                1                1                1 
## zooming overused       zutara ect 
##                1                1
options(warn=0)   # restore the default warning level

 

# Another way to do it is with the tau library; we will probably continue with the tm library.
#library(tau)
 
#bigrams = textcnt(aFile, n = 2, method = "string")
#bigrams = bigrams[order(bigrams, decreasing = TRUE)]
#trigrams = textcnt(aFile, n = 3, method = "string")
#trigrams = trigrams[order(trigrams, decreasing = TRUE)]

#TrigramTokenizer <- function(x) NGramTokenizer(x, 
#                                Weka_control(min = 3, max = 3))

#tdm2 <- TermDocumentMatrix(myCorpus, control = list(tokenize = TrigramTokenizer))

#inspect(tdm2)

As you can see from the histogram, there are relatively few high-frequency words.

You can see what they are in the word cloud too.

Results

It is straightforward to create bigrams and trigrams in R. The plan is to generate unigram, bigram, and trigram matrices. By summing frequency counts, we will generate a two-column table of unique n-grams and their frequencies. Then we can match a two-word string against the corresponding entries in the trigram table; if there is a match, we propose the highest-frequency completions to the user. Continuing with higher n, the proposed word is the last word of the matching n-gram. The difficulty will be in storing the tables wisely for fast retrieval and in deciding what to do with combinations that are not in the training set.
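As a rough illustration of that lookup step, the sketch below builds a trigram frequency table from the same small corpus and backs off to the bigram table when a two-word context has not been seen. The names TrigramTokenizer and predictNext are placeholders introduced here, not part of the analysis above, and the tables reuse the myCorpus and words objects created earlier.

library(tm)
library(RWeka)

# Trigram frequencies from the corpus built earlier (sketch only).
TrigramTokenizer = function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
triDTM = TermDocumentMatrix(myCorpus, control = list(tokenize = TrigramTokenizer))
trigrams = sort(rowSums(as.matrix(triDTM)), decreasing = TRUE)
bigrams = words   # bigram frequencies computed above

# Given the last two words typed, look for trigrams starting with them;
# if none exist, back off to bigrams starting with the last word alone.
predictNext = function(w1, w2, nSuggestions = 3) {
    hits = trigrams[grepl(paste0("^", w1, " ", w2, " "), names(trigrams))]
    if (length(hits) == 0)
        hits = bigrams[grepl(paste0("^", w2, " "), names(bigrams))]
    if (length(hits) == 0) return(character(0))
    # Keep the highest-frequency matches and return their last words.
    candidates = head(names(sort(hits, decreasing = TRUE)), nSuggestions)
    sapply(strsplit(candidates, " "), tail, 1)
}

predictNext("cant", "wait")

For real use the n-gram tables would be precomputed and stored in an indexed structure (for example a data.table keyed on the first n-1 words) so that lookups do not require regular-expression scans over every n-gram.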