Aixa Rodriguez Salan
14/November/2017
Swiftkey Next Word Prediction
The goal of this project is to create an application that predicts the most probable next word to be typed, using a Natural Language Processing model based on a large corpus of raw text sourced from blogs, news and Twitter. The project was set up in 4 stages:
The corpus for the NLP model consists of 3 text files collected from publicly available sources by a web crawler (blogs, news and Twitter), approximately 250MB each and 102.4 million words. Processing that amount of data requires more resources than my computer has, and more than the shinyapps.io server can read, so a 1% sample was taken, cleaned using the “gsub” function, and used as input to the next stage of the project to build the NGram models.
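# Read the blogs file through a binary connection, one element per line
con <- file("final/en_US/en_US.blogs.txt", open="rb")
blog <- readLines(con, encoding="UTF-8")
close(con)
rm(con)
# twitter and news are assumed to be read the same way (not shown); keep a 1% random sample of each source
c(sample(blog,length(blog)*0.01),sample(twitter,length(twitter)*0.01),sample(news,length(news)*0.01))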
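The cleaning step itself is not shown above. As a minimal sketch of what a “gsub”-based cleaner could look like, assuming the goal is lowercase text with only letters, apostrophes and single spaces: the helper names `clean_text` and `sample_fraction` and the specific rules are hypothetical, not the ones used in the project.

```r
# Hypothetical cleaning helper; the exact rules used in the project are not shown
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)  # drop URLs
  x <- gsub("[^a-z' ]", " ", x)                    # keep letters, apostrophes and spaces
  x <- gsub("\\s+", " ", x)                        # collapse repeated whitespace
  trimws(x)
}

# Hypothetical sampling helper: keep a fixed fraction of the lines
sample_fraction <- function(x, frac = 0.01) {
  sample(x, size = round(length(x) * frac))
}

set.seed(2017)  # only to make the illustration reproducible
raw_lines <- c("Check THIS out: http://example.com !!", "It's 5pm   already...")
cleaned <- clean_text(raw_lines)
cleaned
```

Sampling before cleaning keeps the expensive regular-expression passes limited to the 1% that is actually used.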
The “quanteda” library was used to tokenize the dataset. The frequency of each phrase was then tabulated, and “sapply” was used to split each phrase into its individual words, giving one data frame per ngram.
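library(quanteda)  # tokenize(); the ngrams argument is the older quanteda interface
library(dplyr)     # arrange()

# Build a frequency table of ng-grams with one column per word (X1..Xng)
ngram_set<-function(dt_set,ng=1){
  df_ngram<-tokenize(dt_set,ngrams=ng)
  df_ngram <- unlist(df_ngram)
  df_ngram <- table(df_ngram)            # count each distinct ngram
  df_ngram <- as.data.frame(df_ngram)
  colnames(df_ngram)<-c("term", "freq")
  df_ngram <- arrange(df_ngram, -freq)   # most frequent first
  if (ng>1) {
    # the words of an ngram are joined with "_"; split them back into separate columns
    df_ngram$term<-gsub("_", " ",df_ngram$term)
    df_ngram<-cbind(df_ngram,data.frame(t(sapply(df_ngram$term, function(x) strsplit(x, " ")[[1]]))))
    rownames(df_ngram)<-NULL
  }else
    df_ngram$X1<-df_ngram$term
  return(df_ngram)
}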
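To make the shape of the resulting data frames concrete, here is a small base-R illustration on two toy sentences. It is only a sketch: it does not use quanteda, and the sentences and the name `df_2gram` are made up for the example.

```r
# Base-R illustration only; the project itself tokenizes with quanteda
toy <- c("i love new york", "i love new shoes")
words <- strsplit(toy, " ")

# Build bigrams per sentence so that no bigram spans two sentences
bigrams <- unlist(lapply(words, function(w) paste(head(w, -1), tail(w, -1))))

# Count each distinct bigram and sort by decreasing frequency
df_2gram <- as.data.frame(table(bigrams), stringsAsFactors = FALSE)
colnames(df_2gram) <- c("term", "freq")
df_2gram <- df_2gram[order(-df_2gram$freq, df_2gram$term), ]

# Split each term into one column per word (X1, X2)
parts <- do.call(rbind, strsplit(df_2gram$term, " "))
df_2gram$X1 <- parts[, 1]
df_2gram$X2 <- parts[, 2]
rownames(df_2gram) <- NULL
df_2gram
```

Each row holds the full term, its frequency, and the individual words, which is what lets the prediction step match on the leading words and return the last one.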
1, 3 and 5 Grams Models
For each of the ngram tables (1 to 5), the prediction algorithm keeps the rows whose words match the typed words exactly and in the same order, leaving the last X column out of the filter because it holds the candidate word. With the rows obtained it calculates the percentage of the total matched frequency that each candidate represents.
# 1-gram case: candidates are non-stopword terms that start with txt[1] but differ from it.
# Uses filter() from dplyr, stri_length() from stringi and stopwords() from quanteda.
st_word<-filter(df_1gram,grepl(txt[1],X1) & txt[1]!=df_1gram$X1 & txt[1]==substr(df_1gram$X1,1,stri_length(txt[1])) & !(df_1gram$X1 %in% stopwords("en")))
if(nrow(st_word)!=0){
  sm_freq<-sum(st_word$freq)
  st_word$prctn<-round(st_word$freq*100/sm_freq)  # share of the matched frequency, in %
  # keep the sz most frequent candidates (or all of them when fewer than sz matched)
  st_word<-head(st_word[order(-st_word$freq),c("X1","prctn")],sz)
  names(st_word)<-c("term","prctn")
  return(st_word)
}
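The snippet above covers the 1-gram table only. A minimal sketch of how the same idea could extend to a higher-order table is given here, assuming the table has the X1..Xn columns described earlier. The helper `predict_from_ngram` and the toy data are hypothetical and written in base R, so this is not the project's actual function.

```r
# Toy trigram table in the same shape as the n-gram data frames
df_3gram <- data.frame(
  X1 = c("i", "i", "i", "you"),
  X2 = c("love", "love", "love", "love"),
  X3 = c("you", "new", "it", "me"),
  freq = c(6, 3, 1, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: match the typed words on X1..X(n-1), rank the last column
predict_from_ngram <- function(df, txt, sz = 3) {
  n <- length(txt) + 1
  keep <- rep(TRUE, nrow(df))
  for (i in seq_along(txt)) {
    keep <- keep & df[[paste0("X", i)]] == txt[i]  # exact match, same order
  }
  hits <- df[keep, , drop = FALSE]
  if (nrow(hits) == 0) return(NULL)                # nothing matched in this table
  hits$prctn <- round(hits$freq * 100 / sum(hits$freq))
  hits <- hits[order(-hits$freq), c(paste0("X", n), "prctn")]
  names(hits) <- c("term", "prctn")
  rownames(hits) <- NULL
  head(hits, sz)
}

res <- predict_from_ngram(df_3gram, c("i", "love"))
res
```

Returning `NULL` when nothing matches lets the caller fall through to a lower-order table.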
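The 5 results are then merged and a weighted score is calculated for each candidate, based on the number of terms matched: a match from a higher-order ngram counts more, with the weight dropping by a factor of 0.4 for each order below 5. The top 3 predictions are kept.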
n_merge$scr<-n_merge$N5*0.4^0+n_merge$N4*0.4^1+n_merge$N3*0.4^2+n_merge$N2*0.4^3+n_merge$N1*0.4^4
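As a worked illustration of that score, the sketch below applies the same weights to a made-up merged table. It assumes that columns N1 to N5 hold the percentage each candidate obtained from the corresponding ngram table, with 0 where the candidate was not found; the values are invented for the example.

```r
# Made-up merged table: one row per candidate, one column per ngram order
n_merge <- data.frame(
  term = c("you", "it", "new", "the"),
  N5 = c(50, 0, 0, 0),
  N4 = c(40, 60, 0, 0),
  N3 = c(30, 20, 50, 0),
  N2 = c(10, 10, 20, 60),
  N1 = c(1, 1, 1, 5),
  stringsAsFactors = FALSE
)

# Same weights as the formula above: 0.4^0 for N5 down to 0.4^4 for N1
w <- 0.4^(0:4)
n_merge$scr <- as.vector(as.matrix(n_merge[, c("N5", "N4", "N3", "N2", "N1")]) %*% w)

# Keep the 3 candidates with the highest score
top3 <- head(n_merge[order(-n_merge$scr), c("term", "scr")], 3)
top3
```

With these numbers "you" scores 50 + 16 + 4.8 + 0.64 + 0.0256 = 71.4656, so a candidate confirmed by the 5-gram table outranks one that is frequent only in the lower-order tables.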
The app is hosted on shinyapps.io at https://aixarodriguez.shinyapps.io/shynapp/.
The frontend is minimalistic, with only a textbox: just type in it and the app does the rest. The first load takes approximately 30 seconds while the ngram tables and the prediction functions are set up.