This is the first milestone report of the Capstone project of the Data Science Specialization offered by Johns Hopkins University on Coursera. It consists of a natural language processing exercise aimed at predicting the next word in a sentence. This report answers some questions about the characteristics of the input texts and the frequencies of tokens, 2-grams and 3-grams, and outlines an n-gram prediction model.
The data are downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
The following chunk of code reads the data and extracts some basic statistics of the files for the English case.
library(stringi)
# Read the three English files (blogs, news, twitter) line by line
fileblogs<-"./Data/final/en_US/en_US.blogs.txt"
con <- file(fileblogs, "r")
blogs<-readLines(con,encoding="UTF-8",skipNul=TRUE)
close(con)
filenews<-"./Data/final/en_US/en_US.news.txt"
con <- file(filenews, "r")
news<-readLines(con,encoding="UTF-8",skipNul=TRUE)
close(con)
filetwitter<-"./Data/final/en_US/en_US.twitter.txt"
con <- file(filetwitter, "r")
twitter<-readLines(con,encoding="UTF-8",skipNul=TRUE)
close(con)
# Summary statistics: file size (MB), number of lines and number of words
data.frame(filename=c(fileblogs,filenews,filetwitter),
sizeMb=c(file.info(fileblogs)$size/1024/1024,
file.info(filenews)$size/1024/1024,
file.info(filetwitter)$size/1024/1024),
numlines=c(length(blogs),length(news),length(twitter)),
numwords=c(sum(stri_count_words(blogs)),
sum(stri_count_words(news)),
sum(stri_count_words(twitter))))
## filename sizeMb numlines numwords
## 1 ./Data/final/en_US/en_US.blogs.txt 200.4242 899288 37546239
## 2 ./Data/final/en_US/en_US.news.txt 196.2775 77259 2674536
## 3 ./Data/final/en_US/en_US.twitter.txt 159.3641 2360148 30093413
mylines<-c(blogs,news,twitter) # combine the three sources into a single character vector
Because the files are very large, the data are sampled at 0.1%; we have checked that larger samples do not significantly change the resulting dictionary. The sampled lines are then loaded into a corpus with the tm library and cleaned of non-printable characters, punctuation and multiple spaces.
library(tm)
set.seed(100)                                      # for reproducible sampling
mylines <- sample(mylines, 0.001*length(mylines))  # keep 0.1% of the lines
corpus <- SimpleCorpus(VectorSource(mylines),
                       control = list(language = "en_US"))
remover <- content_transformer(function(x, greppattern) gsub(greppattern, "", x))
corpus <- tm_map(corpus,remover,"[^[:print:]]")        # Remove non-printable characters
corpus <- tm_map(corpus, content_transformer(tolower)) # Lower case
corpus <- tm_map(corpus, removePunctuation)            # Remove punctuation
corpus <- tm_map(corpus, stripWhitespace)              # Remove multiple spaces
We have written a function that extracts tokens from a random sample of the lines; its second argument sets the approximate number of lines to sample.
tokenize <- function(lines,length){
  L <- length(lines)
  consider <- rbinom(L, 1, length/L)  # keep each line with probability length/L (about 'length' lines in total)
  rawwords <- NULL
  for (i in 1:L){
    if (consider[i]){
      lines[i] <- paste(lines[i], " ")
      expr <- gregexpr("[^ ]+? +", lines[i]) # get "words" (may contain numbers)
      for (j in 1:length(expr[[1]])){
        start <- expr[[1]][j]
        stop <- start + attr(expr[[1]], "match.length")[j] - 2
        rawwords <- c(rawwords, substr(lines[i], start, stop))
      }
    }
  }
  # Restore some frequently misspelled words (apostrophes were stripped with the punctuation)
  rawwords[rawwords=="doesnt"] <- "doesn't"
  rawwords[rawwords=="dont"] <- "don't"
  rawwords[rawwords=="arent"] <- "aren't"
  rawwords[rawwords=="isnt"] <- "isn't"
  rawwords[rawwords=="aint"] <- "ain't"
  rawwords[rawwords=="im"] <- "I'm"
  rawwords[rawwords=="youre"] <- "you're"
  rawwords[rawwords=="hes"] <- "he's"
  rawwords[rawwords=="shes"] <- "she's"
  rawwords[rawwords=="theyre"] <- "they're"
  rawwords[rawwords=="ive"] <- "I've"
  rawwords[rawwords=="youve"] <- "you've"
  rawwords[rawwords=="weve"] <- "we've"
  rawwords[rawwords=="youll"] <- "you'll"
  rawwords[rawwords=="wont"] <- "won't"
  rawwords[rawwords=="couldnt"] <- "couldn't"
  rawwords[rawwords=="u"] <- "you"
  rawwords[rawwords=="thats"] <- "that's"
  rawwords[rawwords=="whats"] <- "what's"
  rawwords[rawwords=="cant"] <- "can't"
  rawwords[rawwords=="lets"] <- "let's"
  rawwords[rawwords=="theres"] <- "there's"
  rawwords[rawwords=="havent"] <- "haven't"
  rawwords
}
We will also remove unwanted words (profanity filtering); the character vector swearwords, which holds the list of terms to remove, is loaded separately and not shown here.
# Remove every profanity term from the corpus
for (token in swearwords) {
corpus <- tm_map(corpus,remover,token)
}
tokens<-tokenize(as.list(corpus),length(corpus))
Now we will compute the histogram of 1-grams. It turns out that fewer than ten words account for more than half of the dictionary, while plenty of words appear only once (sparse words). The histogram below shows only the 40 most frequent words.
termDocMat<-TermDocumentMatrix(corpus)
num<-sort(rowSums(as.matrix(termDocMat)), decreasing = TRUE) # token frequencies, most frequent first
histtokens<-data.frame(tokens=names(num),num=num)
#plot histogram
library(ggplot2)
ggplot(histtokens[1:40,], aes(reorder(tokens, -num), num)) + geom_col() + theme_bw() +
xlab("tokens") + ylab("Freq") + ggtitle("Histogram of 1-grams") +
theme(axis.text.x = element_text(angle = 45, size = 9, hjust = 1,vjust=1))
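As a quick check of the claims above, the following lines (a small sketch reusing the frequency vector num computed from the term-document matrix) look at the ten most frequent words and at the words appearing only once:
sum(num[1:10])   # total occurrences of the ten most frequent words
length(num)/2    # half the size of the dictionary, for comparison
sum(num == 1)    # how many words appear only once (sparse words)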
The 2-gram and 3-gram histograms are computed by scanning the tokenized corpus (the character vector tokens).
twograms<-data.frame(twograms="",word1=tokens,word2="",stringsAsFactors=FALSE)
threegrams<-data.frame(threegrams="",word1=tokens,word2="",word3="",stringsAsFactors=FALSE)
for (i in 1:(length(tokens)-2)){
twograms$word2[i]=tokens[i+1]
twograms$twograms[i]<-paste(tokens[i],tokens[i+1])
threegrams$word2[i]=tokens[i+1]
threegrams$word3[i]=tokens[i+2]
threegrams$threegrams[i]<-paste(twograms$twograms[i],tokens[i+2])
}
twograms$twograms[length(tokens)-1]<-paste(tokens[length(tokens)-1],tokens[length(tokens)])
twograms$word2[length(tokens)-1]<-tokens[length(tokens)]
library(dplyr)
histtokens2<-twograms %>% group_by(twograms,word1,word2) %>% summarise(num2=n()) %>% arrange(desc(num2))
histtokens3<-threegrams %>% group_by(threegrams,word1,word2,word3) %>% summarise(num3=n()) %>% arrange(desc(num3))
#plot histograms
ggplot(histtokens2[1:40,], aes(reorder(twograms, -num2), num2)) + geom_col() + theme_bw() +
xlab("two-grams") + ylab("Freq") + ggtitle("Histogram of 2-grams") +
theme(axis.text.x = element_text(angle = 45, size = 9, hjust = 1,vjust=1))
ggplot(histtokens3[1:40,], aes(reorder(threegrams, -num3), num3)) + geom_col() + theme_bw() +
xlab("three-grams") + ylab("Freq") + ggtitle("Histogram of 3-grams") +
theme(axis.text.x = element_text(angle = 45, size = 9, hjust = 1,vjust=1))
Now the structures histtokens, histtokens2 and histtokens3 contain the ordered dictionaries of n-grams of orders 1 to 3, with the individual words stored in separate columns for easy access by the n-gram model.
To conclude the exploratory analysis, we address some of the questions posed with the assignment:
# Number of tokens covering 50% of all instances in the sample
fifty<-1
while(sum(histtokens$num[1:fifty]) < length(histtokens$num)/2) fifty<-fifty+1
# Number of tokens covering 90% of all instances in the sample
ninety<-1
while(sum(histtokens$num[1:ninety]) < length(histtokens$num)*0.9) ninety<-ninety+1
The numbers of tokens covering 50% and 90% of the tokens in the sample are 4 and 15, respectively.
To detect possible foreign-language words, we could keep an a-priori English dictionary and flag scanned words that are not present in it. We could also load the other language corpora provided with the exercise and flag the tokens of the English corpus that appear in any of them. A simple way (shown next) to detect just some of these words (those with accents) is to look for characters in the extended ASCII range:
# Trying to detect some foreign languages
print(tokens[grep("[áéíóúâêîôûàèìòùüäçëïöÿåæßªºãðõµñ]",tokens)])
## [1] "cliché" "hermé" "prés" "cliché" "cliché" "thoméses" "tioté"
A practical n-gram model for text prediction needs to take into account that the probability of the next word depends on the words already seen, i.e. on the state captured by the lower-order n-grams. This can be modelled as a Markov chain, where the transition to the next state (previous words + new word) has a predefined probability. That probability can be estimated as the frequency of the corresponding higher-order n-gram divided by the sum of the frequencies of all higher-order n-grams sharing the same prefix.
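As an illustration, the transition probabilities for 2-grams can be estimated from the counts already computed; this is a minimal sketch based on the histtokens2 table and dplyr, where the column name prob is introduced here purely for illustration:
library(dplyr)
# P(word2 | word1) as the relative frequency of "word1 word2" among all 2-grams starting with word1
probs2 <- histtokens2 %>%
  ungroup() %>%
  group_by(word1) %>%
  mutate(prob = num2/sum(num2)) %>%
  arrange(word1, desc(prob)) %>%
  ungroup()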
In practice, to predict the next word of a stream, we look up the last observed words in the n-gram dictionary of the next-higher order, and fall back to the lower-order dictionary if that n-gram is not found. For instance, given one word A, to predict the second word we look for the most probable 2-gram among those starting with A. Since the histogram is ordered by decreasing frequency, the first match is the most likely continuation.
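A minimal sketch of this lookup, assuming the histtokens and histtokens2 tables built above (the function name predict_second is purely illustrative):
predict_second <- function(A){
  # 2-grams starting with A; histtokens2 is already ordered by decreasing frequency
  candidates <- histtokens2[histtokens2$word1 == A, ]
  if (nrow(candidates) > 0){
    as.character(candidates$word2[1])     # most likely continuation of A
  } else {
    as.character(histtokens$tokens[1])    # back off to the most frequent 1-gram
  }
}
predict_second("of")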
It may happen that A does not exist in the 1-gram dictionary. To handle that, we will do two things: first, try to correct a possible misspelling by replacing A with the closest word in the dictionary (using the stringdist package); if no sufficiently close word exists, map A to the special token <unk>.
library(stringdist)
correct_word<-function(word){
  # String (longest common subsequence) distance from 'word' to every token in the dictionary
  distmtx <- sapply(histtokens$tokens,function(x)
    stringdist(word,x,method="lcs"))
  if(min(distmtx)<=nchar(word)/3){
    pred_word<-histtokens$tokens[which.min(distmtx)] # close enough: assume a misspelling
  }else{
    pred_word<-"<unk>"                               # too far from any known token: mark as unknown
  }
  return(pred_word)
}
word<-"mistaqe"
print(paste("Instead of",word,"you must have meant",correct_word(word)))
## [1] "Instead of mistaqe you must have meant mistake"
Back to the prediction problem, if A turns out to be the <unk> token, we will simply propose the most frequent entries of the 1-gram histogram, and so on. If not, we will choose the first 2-gram in the histogram starting with A (say, from the ordered list A b1, A b2, etc.), so the predicted word to follow would be b1.
For 2-grams (e.g. A B), the same logic will be applied to predict the third word C, with the following hierarchy: if A is <unk>, repeat the 1-gram problem starting with B; otherwise look for the most frequent 3-gram starting with A B and, if none is found, fall back to the 2-grams starting with B (unless B is also <unk>, in which case C will be chosen among the most frequent 1-grams). The algorithm becomes recursive for higher-order n-grams, although the trade-off between accuracy and performance makes it advisable to use at most a 3-gram dictionary.
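A minimal sketch of this backoff for the third word, assuming the histtokens, histtokens2 and histtokens3 tables built above (the function name predict_third is purely illustrative):
predict_third <- function(A, B){
  # 3-grams starting with "A B"; the tables are already ordered by decreasing frequency
  cand3 <- histtokens3[histtokens3$word1 == A & histtokens3$word2 == B, ]
  if (nrow(cand3) > 0) return(as.character(cand3$word3[1]))
  # back off to the 2-grams starting with B
  cand2 <- histtokens2[histtokens2$word1 == B, ]
  if (nrow(cand2) > 0) return(as.character(cand2$word2[1]))
  # last resort: the most frequent 1-gram
  as.character(histtokens$tokens[1])
}
predict_third("one", "of")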
The next steps in the project will be to implement the prediction algorithm, to optimise the model, and to produce a presentation and a Shiny app that are accurate and fast enough to be attractive to play with.