Vidur Nayyar
March 26, 2017
This application takes user input and predicts the next word or words in the sentence. It is aimed at mobile text keyboards, where completing words and sentences improves the experience of typing on a touchscreen.
In predicting the next word, the frequency (and therefore the probability) of words is key. N-gram models take textual data and produce term-frequency tables for 1-word, 2-word, and up to n-word sequences, which give the relative probability of each sequence's occurrence.
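As a rough illustration of how n-gram counts turn into probabilities, the following base-R sketch counts word sequences in a toy set of sentences. The helper `ngram_counts` and the toy data are hypothetical and are not part of the app, which builds its tables with `tm` and `RWeka`.

```r
# Hypothetical helper: count every n-word sequence in a character vector.
ngram_counts <- function(sentences, n) {
  grams <- unlist(lapply(sentences, function(s) {
    words <- strsplit(tolower(s), "\\s+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    # Paste each window of n consecutive words into one n-gram string
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  df <- data.frame(ngram = names(counts), freq = as.integer(counts),
                   stringsAsFactors = FALSE)
  df[order(df$freq, decreasing = TRUE), ]
}

# Toy data, invented for illustration only
toy <- c("i love new york", "i love new shoes", "i love you")
bi <- ngram_counts(toy, 2)
# Relative frequency acts as an estimate of each bigram's probability
bi$prob <- bi$freq / sum(bi$freq)
print(bi)
```

Sorting by `freq` puts the most likely sequences first, which is the ordering a predictor needs when it proposes candidates.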
A robust model needs substantial computing power to build in a reasonable amount of time, hence the Stupid Backoff model is used to cut processing time so that the predictor can run in mobile and web apps. The Stupid Backoff model looks for the most frequent trigrams that begin with the last words of the input, and backs off to bigrams and then unigrams when no match is found. This works surprisingly well for quick sentence prediction in a reasonable amount of compute time.
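The backoff idea can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the app's code: the count vectors are invented toy data, the function name `stupid_backoff` is hypothetical, and the discount `alpha` defaults to 0.4, the value commonly quoted for Stupid Backoff, whereas the app's own code applies 0.7 at the bigram level.

```r
# Toy counts, invented for illustration only
tri <- c("i love you" = 3, "i love new" = 2)
bi  <- c("i love" = 5, "love you" = 3, "love new" = 2, "love cats" = 1)
uni <- c("i" = 5, "love" = 6, "you" = 3, "new" = 2, "cats" = 1)

# Score candidate `word` following the two-word context (w1, w2).
stupid_backoff <- function(w1, w2, word, alpha = 0.4) {
  tri_key <- paste(w1, w2, word)
  bi_key  <- paste(w2, word)
  ctx     <- paste(w1, w2)
  # Trigram seen: relative frequency given the two-word context
  if (tri_key %in% names(tri) && ctx %in% names(bi)) {
    return(unname(tri[tri_key] / bi[ctx]))
  }
  # Otherwise back off to the bigram, discounted once
  if (bi_key %in% names(bi) && w2 %in% names(uni)) {
    return(unname(alpha * bi[bi_key] / uni[w2]))
  }
  # Otherwise back off to the unigram, discounted twice
  if (word %in% names(uni)) {
    return(unname(alpha^2 * uni[word] / sum(uni)))
  }
  0
}

stupid_backoff("i", "love", "you")   # trigram level
stupid_backoff("i", "love", "cats")  # bigram level
stupid_backoff("i", "love", "i")     # unigram level
```

The scores are not normalized probabilities, which is what keeps the method cheap: each lookup is a simple ratio of counts with no smoothing pass over the whole table.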
Data was cleaned and processed into a text corpus (code truncated for simplicity):
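library(tm); library(stringi)
# Read the raw Twitter text, skipping embedded NUL characters
t <- readLines('/twitter.txt', skipNul=TRUE)
# Draw a reproducible random sample of n lines (n, the sample size, is not defined in this excerpt)
set.seed(7); index <- sample(seq_along(t), n)
ttr <- t[index]
# Strip every character outside the printable ASCII range
ttr <- stri_replace_all_regex(ttr, "[^\x20-\x7E]", "")
# VCorpus rather than Corpus: recent tm versions ignore custom tokenizers on a SimpleCorpus
train <- VectorSource(ttr); corpusTrain <- VCorpus(train)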
This text corpus was used to produce term-document matrices of unigrams, bigrams, and trigrams. The model cleans the user input, searches the n-grams for it, and ranks the output by frequency.
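library(RWeka)  # provides NGramTokenizer and Weka_control
corpusTrain <- tm_map(corpusTrain, stripWhitespace)
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))  # three-word sequences only
trigrams <- TermDocumentMatrix(corpusTrain, control=list(tokenize=triTokenizer, bounds=list(global=c(3,Inf))))  # keep trigrams found in at least 3 documents
predictSB <- function(input, profanity, uniDF, biDF, triDF, maxResults) {
  input <- removePunctuation(input)
  input <- setdiff(input, stopwords)  # assumes stopwords is a character vector, e.g. tm's stopwords("en")
  input <- input[grepl('[[:alpha:]]', input)]  # keep only tokens that contain letters
}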
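The cleaning step can be tried on its own with a small base-R sketch. The helpers `clean_input` and `last_words` are hypothetical names introduced here, and the idea that the trigram and bigram lookups use the last two words and the last word of the input is an assumption; the excerpt does not show how the search keys are derived. Unlike `setdiff`, the sketch keeps repeated words so that word order is preserved.

```r
# Hypothetical helper: normalise raw text into a vector of words.
clean_input <- function(text, stop_words = character(0)) {
  text  <- tolower(text)
  text  <- gsub("[[:punct:]]", "", text)        # drop punctuation
  words <- strsplit(text, "\\s+")[[1]]          # split on runs of whitespace
  words <- words[grepl("[[:alpha:]]", words)]   # keep tokens containing letters
  words[!words %in% stop_words]                 # drop stop words, keep order
}

# Hypothetical helper: the last k words joined into one search key.
last_words <- function(words, k) paste(tail(words, k), collapse = " ")

w <- clean_input("Hello, I would   LOVE to 2 go!!")
print(w)
print(last_words(w, 2))  # assumed key for the trigram lookup
print(last_words(w, 1))  # assumed key for the bigram lookup
```

Filtering on `[[:alpha:]]` also removes the empty strings that `strsplit` produces for leading whitespace, so no separate trimming step is needed.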
The user input is then searched for in the n-grams. The matching data frames are weighted and the highest-ranked output is returned.
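# Exact-match each search key against the prefix column of its n-gram table
# (input2, input3, and ordereduni are not defined in this excerpt)
greptri <- grepl(paste0("^", input, "$"), triDF$bigram); indexedtri <- triDF[greptri, ]
grepbi <- grepl(paste0("^", input2, "$"), biDF$unigram); indexedbi <- biDF[grepbi, ]
grepuni <- grepl(paste0("^", input3, "$"), uniDF$unigram); indexeduni <- uniDF[grepuni, ]
# Trigram matches keep full weight; bigram matches are discounted by 0.7
indexedtri$s <- indexedtri$freq / sum(indexedtri$freq)
indexedbi$s <- 0.7 * indexedbi$freq / sum(indexedbi$freq)
names <- c(indexedtri$name, indexedbi$name, ordereduni$unigram)
score <- c(indexedtri$s, indexedbi$s, ordereduni$s)
predictWord <- data.frame(next_word=names, score=score, stringsAsFactors=FALSE)
predictWord <- predictWord[order(predictWord$score, decreasing=TRUE), ]
# Drop duplicate words (the best-scoring entry comes first), then keep the top results
final <- unique(predictWord$next_word)
final <- head(final, maxResults)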