Introduction

This is the first milestone report of the Capstone project of the Data Science Specialization from Johns Hopkins University on Coursera. It consists of a natural language processing exercise aimed at predicting the next word in a sentence. This report answers some questions about the characteristics of the input texts and the frequencies of tokens, 2-grams and 3-grams, and outlines the definition of an n-gram model.

Getting and cleaning the data

The data are downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

The following chunk of code reads the data and extracts some basic statistics of the files for the English case.

library(stringi)

fileblogs<-"./Data/final/en_US/en_US.blogs.txt"
con <- file(fileblogs, "r")
blogs<-readLines(con,encoding="UTF-8",skipNul=TRUE)
close(con)

filenews<-"./Data/final/en_US/en_US.news.txt"
con <- file(filenews, "r")
news<-readLines(con,encoding="UTF-8",skipNul=TRUE)
close(con)

filetwitter<-"./Data/final/en_US/en_US.twitter.txt"
con <- file(filetwitter, "r")
twitter<-readLines(con,encoding="UTF-8",skipNul=TRUE) 
close(con)

data.frame(filename=c(fileblogs,filenews,filetwitter),
           sizeMb=c(file.info(fileblogs)$size/1024/1024,
                    file.info(filenews)$size/1024/1024,
                    file.info(filetwitter)$size/1024/1024),
           numlines=c(length(blogs),length(news),length(twitter)),
           numwords=c(sum(stri_count_words(blogs)),
                      sum(stri_count_words(news)),
                      sum(stri_count_words(twitter))))
##                               filename   sizeMb numlines numwords
## 1   ./Data/final/en_US/en_US.blogs.txt 200.4242   899288 37546239
## 2    ./Data/final/en_US/en_US.news.txt 196.2775    77259  2674536
## 3 ./Data/final/en_US/en_US.twitter.txt 159.3641  2360148 30093413
mylines<-c(blogs,news,twitter)

Because the files are very large, the data will be sampled at 0.1%; it has been checked that larger samples do not significantly affect the resulting dictionary. The sampled lines are then loaded into a corpus using the tm package and cleaned of non-printable characters, punctuation and multiple spaces.

library(tm)
set.seed(100)

mylines <- sample(mylines, 0.001*length(mylines))

corpus <- SimpleCorpus(VectorSource(mylines),
                       control = list(language = "en_US"))
remover <- content_transformer(function(x, greppattern) gsub(greppattern, "", x))
corpus <- tm_map(corpus,remover,"[^[:print:]]") # Remove nonprintable
corpus <- tm_map(corpus, content_transformer(tolower)) # Lower case
corpus <- tm_map(corpus, removePunctuation) # Remove punctuation
corpus <- tm_map(corpus, stripWhitespace) # Remove multiple spaces

Exploratory analysis

We have created a function that extracts tokens from a random sample of the lines; its second argument sets the expected number of lines to sample.

tokenize <- function(lines,length){
    L=length(lines)
    consider=rbinom(L,1,length/L)
    rawwords=NULL
    for (i in 1:L){
        if(consider[i]){
            lines[i]<-paste(lines[i]," ")
            expr<-gregexpr("[^ ]+? +",lines[i]) # get "words" (may contain numbers) 
            for (j in 1:length(expr[[1]])){
                start=expr[[1]][j]
                stop=start+attr(expr[[1]],"match.length")[j]-2
                rawwords<-c(rawwords,substr(lines[i],start,stop))
            }
            # Fix some commonly misspelled words (mostly missing apostrophes)
            rawwords[rawwords=="doesnt"]<-"doesn't"
            rawwords[rawwords=="dont"]<-"don't"
            rawwords[rawwords=="arent"]<-"aren't"
            rawwords[rawwords=="isnt"]<-"isn't"
            rawwords[rawwords=="aint"]<-"ain't"
            rawwords[rawwords=="im"]<-"I'm"
            rawwords[rawwords=="youre"]<-"you're"
            rawwords[rawwords=="hes"]<-"he's"
            rawwords[rawwords=="shes"]<-"she's"
            rawwords[rawwords=="theyre"]<-"they're"
            rawwords[rawwords=="ive"]<-"I've"
            rawwords[rawwords=="youve"]<-"you've"
            rawwords[rawwords=="weve"]<-"we've"
            rawwords[rawwords=="youll"]<-"you'll"
            rawwords[rawwords=="wont"]<-"won't"
            rawwords[rawwords=="couldnt"]<-"couldn't"
            rawwords[rawwords=="u"]<-"you"
            rawwords[rawwords=="thats"]<-"that's"
            rawwords[rawwords=="whats"]<-"what's"
            rawwords[rawwords=="cant"]<-"can't"
            rawwords[rawwords=="lets"]<-"let's"
            rawwords[rawwords=="theres"]<-"there's"
            rawwords[rawwords=="havent"]<-"haven't"
        }
    }
    rawwords
}

We will also remove unwanted words (profanity filtering), looping over a list of swear words stored in the character vector swearwords:

for (token in swearwords) {
  corpus <- tm_map(corpus, remover, paste0("\\b", token, "\\b")) # match whole words only
}

tokens<-tokenize(content(corpus),length(corpus)) # content() returns the cleaned lines as a character vector

Now we will get the histogram of 1-grams. It turns out that fewer than ten words make up more than half of the dictionary. There are plenty of words that appear only once (sparse words). The histogram below shows only the 40 most frequent words.

termDocMat<-TermDocumentMatrix(corpus)
num<-sort(rowSums(as.matrix(termDocMat)), decreasing = TRUE)
histtokens<-data.frame(tokens=names(num),num=num)
#plot histogram
library(ggplot2)
ggplot(histtokens[1:40,], aes(reorder(tokens, -num), num)) + geom_col() + theme_bw() +
         xlab("tokens") + ylab("Freq") + ggtitle("Histogram of 1-grams") +
         theme(axis.text.x = element_text(angle = 45, size = 9, hjust = 1,vjust=1))

The 2-gram and 3-gram histograms are computed by scanning the tokenized corpus (the character vector tokens).

library(dplyr)

twograms<-data.frame(twograms="",word1=tokens,word2="")
threegrams<-data.frame(threegrams="",word1=tokens,word2="",word3="")
for (i in 1:(length(tokens)-2)){
  twograms$word2[i]=tokens[i+1]
  twograms$twograms[i]<-paste(tokens[i],tokens[i+1])
  threegrams$word2[i]=tokens[i+1]
  threegrams$word3[i]=tokens[i+2]
  threegrams$threegrams[i]<-paste(twograms$twograms[i],tokens[i+2])
}
twograms$twograms[length(tokens)-1]<-paste(tokens[length(tokens)-1],tokens[length(tokens)])
twograms$word2[length(tokens)-1]<-tokens[length(tokens)]
histtokens2<-twograms %>% group_by(twograms,word1,word2) %>% summarise(num2=n()) %>% arrange(desc(num2))
histtokens3<-threegrams %>% group_by(threegrams,word1,word2,word3) %>% summarise(num3=n()) %>% arrange(desc(num3))

#plot histograms
ggplot(histtokens2[1:40,], aes(reorder(twograms, -num2), num2)) + geom_col() + theme_bw() +
         xlab("two-grams") + ylab("Freq") + ggtitle("Histogram of 2-grams") +
         theme(axis.text.x = element_text(angle = 45, size = 9, hjust = 1,vjust=1))

ggplot(histtokens3[1:40,], aes(reorder(threegrams, -num3), num3)) + geom_col() + theme_bw() +
         xlab("three-grams") + ylab("Freq") + ggtitle("Histogram of 3-grams") +
         theme(axis.text.x = element_text(angle = 45, size = 9, hjust = 1,vjust=1))

The structures histtokens, histtokens2 and histtokens3 now contain the frequency-ordered dictionaries of 1-, 2- and 3-grams, with the individual words stored in separate columns for easy access by the n-gram model.
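
For instance, the most frequent 2-grams starting with a given word (here "of", chosen only as an example) can be retrieved directly from these columns:

# Illustrative lookup: 2-grams whose first word is "of", already ordered by frequency
head(histtokens2[histtokens2$word1 == "of", ])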

To conclude the exploratory analysis, we address some of the questions posed in the assignment:

# Number of tokens covering 50% of all instances in the sample
freq <- histtokens$num
fifty <- which(cumsum(freq) >= length(freq)/2)[1]
# Number of tokens covering 90% of all instances in the sample
ninety <- which(cumsum(freq) >= length(freq)*0.9)[1]

The numbers of tokens covering 50% and 90% of the tokens in the sample are 4 and 15, respectively.

To detect possible foreign-language words, we could use an a priori English dictionary and flag the scanned words that are not present in it. We could also load the German and French corpora provided for the exercise and flag the tokens of the English corpus that appear in either of them. A simple way (shown next) to detect at least some of these words (those with accents) is to look for accented characters from the extended ASCII range:

# Trying to detect some foreign languages
print(tokens[grep("[áéíóúâêîôûàèìòùüäçëïöÿåæßªºãðõµñ]",tokens)])
## [1] "cliché"   "hermé"    "prés"     "cliché"   "cliché"   "thoméses" "tioté"
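
As a sketch of the dictionary-based check mentioned above (assuming english_dictionary is a character vector of valid English words loaded beforehand from some reference word list, which is not part of this report):

# Hedged sketch: tokens absent from an a priori English word list
# (english_dictionary is a hypothetical character vector of known English words)
possible_foreign <- unique(tokens[!(tolower(tokens) %in% english_dictionary)])
head(possible_foreign)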

n-gram model

A practical n-gram model for text prediction needs to take into account that the probability of the next word is determined by the state given by the preceding, lower-order n-gram. This can be modelled as a Markov chain, where the transition to the next state (past words + new word) has a pre-defined probability. That probability can be estimated by dividing the frequency of the higher-order n-gram by the sum of the frequencies of all higher-order n-grams sharing the same prefix.
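
As an illustration, and assuming the 2-gram table histtokens2 built above, the transition probability of a word w following a word A could be estimated from the counts as follows (prob_next is only an illustrative name, not part of the final model):

# Hedged sketch: maximum-likelihood estimate of P(next word = w | previous word = A),
# i.e. the count of the 2-gram "A w" divided by the total count of all 2-grams starting with A
prob_next <- function(A, w){
  cand <- histtokens2[histtokens2$word1 == A, ]
  if (nrow(cand) == 0) return(NA)                  # A never observed as a 2-gram prefix
  sum(cand$num2[cand$word2 == w]) / sum(cand$num2)
}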

In other words, to predict the next word of a stream, we look up the last word(s) in the dictionary of n-grams of the next-higher order, and fall back to the lower-order dictionary if that n-gram is not found. For instance, if we have one word A, to predict the second one we look for the most probable 2-gram among those starting with A. Since the histogram is sorted by decreasing frequency, the first match is the most likely one.

It may happen that A does not exist in the 1-gram dictionary. To handle that, we will do two things: try to correct A to the closest dictionary word (using a string distance), and, if no sufficiently close match exists, replace it with the special token <unk>:

library(stringdist)

correct_word<-function(word){
  distmtx <- sapply(histtokens$tokens, function(x)
    stringdist(word, x, method = "lcs"))
  if (min(distmtx) <= nchar(word)/3) {
    pred_word <- histtokens$tokens[which.min(distmtx)]
  } else {
    pred_word <- "<unk>"
  }
  return(pred_word)
}

word<-"mistaqe"
print(paste("Instead of",word,"you must have meant",correct_word(word)))
## [1] "Instead of mistaqe you must have meant mistake"

Back to the prediction problem, if A happens to be <unk>, the most likely word to follow will be "the" (the most frequent 1-gram), and so on. If not, we will choose the first 2-gram in the histogram starting with A (say, from the ordered list A b1, A b2, etc.), hence the predicted word to follow would be b1.
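
A minimal sketch of this one-word prediction step, using the frequency-ordered tables built above (the function name predict_from_1gram is only illustrative):

# Hedged sketch: predict the most likely word to follow a single word A
predict_from_1gram <- function(A){
  if (A == "<unk>") return(as.character(histtokens$tokens[1]))   # most frequent 1-gram, i.e. "the"
  cand <- histtokens2$word2[histtokens2$word1 == A]              # 2-grams starting with A, sorted by frequency
  if (length(cand) > 0) {
    as.character(cand[1])                                        # most frequent 2-gram A b1 -> predict b1
  } else {
    as.character(histtokens$tokens[1])                           # back off to the most frequent 1-gram
  }
}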

For 2-grams (e.g. A B), the same logic will be applied to predict the third word, with the following hierarchy (a code sketch follows the list):

  1. First, if A is <unk>, repeat the 1-gram problem starting with B.
  2. Otherwise, search in the 3-gram dictionary for the likeliest elements starting with A B (unless B is <unk>, in which case C will be "the").
  3. If not found, C will be obtained from the 1-gram problem starting with B.
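
A hedged sketch of this hierarchy, reusing the illustrative predict_from_1gram() above together with the 3-gram table histtokens3 (predict_from_2gram is again only an illustrative name):

# Hedged sketch: back-off hierarchy for a two-word context (A, B)
predict_from_2gram <- function(A, B){
  if (A == "<unk>") return(predict_from_1gram(B))                # 1. repeat the 1-gram problem with B
  if (B == "<unk>") return(as.character(histtokens$tokens[1]))   # 2. C will be "the"
  cand <- histtokens3$word3[histtokens3$word1 == A & histtokens3$word2 == B]
  if (length(cand) > 0) {
    as.character(cand[1])                                        # 2. likeliest 3-gram "A B C"
  } else {
    predict_from_1gram(B)                                        # 3. back off to the 1-gram problem with B
  }
}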

The algorithm generalises recursively to higher-order n-grams, although the trade-off between accuracy and performance makes it advisable to stay with the 3-gram dictionary at most.

Next steps

The next steps in the project are to implement the prediction algorithm, to try to optimise the model, and to produce a presentation and a Shiny app that are both accurate and fast enough to be attractive to play with.