Word Predict

Vidur Nayyar
March 26, 2017

This application takes user input and predicts the next word (or words) in the sentence. Its primary use case is mobile text keyboards, which improve the typing experience on touchscreens by completing words and sentences as the user types.

Introduction:

In predicting the next word, word frequency/probability is key. N-gram models take textual data and produce one-word, two-word, up to n-word term-frequency data structures, which inherently give the probability of a word's occurrence.
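
As a toy illustration, the RWeka tokenizer used later in this app emits such n-grams directly:

library(RWeka)
# Bigrams of a toy sentence; counting these terms over a corpus yields the
# term-frequency tables from which next-word probabilities are read off
NGramTokenizer("the cat sat on the mat", Weka_control(min = 2, max = 2))
# "the cat" "cat sat" "sat on" "on the" "the mat"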

A robust model takes powerful computing to compile in a reasonable amount of time, so the Stupid Backoff model is used to trim processing and compile times enough to work in mobile and web apps. Stupid Backoff retrieves the most frequent trigram that matches the input, and scales back to bigrams and unigrams when no match is found. This works surprisingly well for quick sentence prediction in a reasonable amount of compute time.
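
A minimal sketch of the Stupid Backoff score makes the scaling-back concrete. The helper sb_score and the toy count tables (tri, bi, uni) below are illustrative assumptions, not the app's data structures; the canonical discount from Brants et al. (2007) is 0.4, while this app's code later uses 0.7:

# Score a candidate next word w3 given context (w1, w2); counts are stored in
# named numeric vectors keyed by the n-gram string (an assumed toy layout)
sb_score <- function(w3, w1, w2, tri, bi, uni, alpha = 0.4) {
  t <- tri[paste(w1, w2, w3)]
  if (!is.na(t)) return(unname(t / bi[paste(w1, w2)]))  # trigram seen: relative freq
  b <- bi[paste(w2, w3)]
  if (!is.na(b)) return(unname(alpha * b / uni[w2]))    # back off to bigram
  unname(alpha^2 * uni[w3] / sum(uni))                  # back off to unigram
}
uni <- c(the = 12, cat = 10, sat = 6); bi <- c("the cat" = 5, "cat sat" = 3)
tri <- c("the cat sat" = 2)
sb_score("sat", "the", "cat", tri, bi, uni)  # 2/5 = 0.4, straight from the trigram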

Key Points:

  • Data from twitter/news/blogs formed a Text Corpus (10% sample).
  • Uses the tm (text mining), NLP, RWeka, and stringr/stringi packages.
  • The cleaned Text Corpus was tokenized into unigrams, bigrams, and trigrams.
  • The n-grams drive the prediction model, which outputs the next word.

Data was cleaned and processed into a Text Corpus (truncated for simplicity):

library(tm); library(stringi)
# Read the raw Twitter file and draw a reproducible 10% sample
t <- readLines('/twitter.txt', skipNul = TRUE)
set.seed(7); n <- floor(0.1 * length(t))   # sample size: 10% of the lines
index <- sample(1:length(t), n)
ttr <- t[index]
# Strip non-printable / non-ASCII characters
ttr <- stri_replace_all_regex(ttr, "[^\x20-\x7E]", "")
train <- VectorSource(ttr); corpusTrain <- Corpus(train)
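
The cleaning shown above is truncated; a plausible continuation with standard tm transformations (an assumption, since the author's exact steps are not shown) is:

# Normalize case and strip punctuation/numbers before tokenizing
corpusTrain <- tm_map(corpusTrain, content_transformer(tolower))
corpusTrain <- tm_map(corpusTrain, removePunctuation)
corpusTrain <- tm_map(corpusTrain, removeNumbers)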

How The Model Works:

The text corpus was used to produce Term-Document Matrices of unigrams, bigrams, and trigrams. The model cleans the user input, searches the n-grams for it, and ranks the output by frequency.

library(RWeka); corpusTrain <- tm_map(corpusTrain, stripWhitespace)
# Tokenize into trigrams; drop terms seen fewer than 3 times
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigrams <- TermDocumentMatrix(corpusTrain,
  control = list(tokenize = triTokenizer, bounds = list(global = c(3, Inf))))
predictSB <- function(input, profanity, uniDF, biDF, triDF, maxResults) {
  # Clean the user input: strip punctuation, stopwords, non-alphabetic tokens
  input <- removePunctuation(input); input <- setdiff(input, stopwords("en"))
  input <- input[grepl('[[:alpha:]]', input)]
}
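
The frequency data frames (uniDF, biDF, triDF) that the predictor searches are not built in this excerpt; one plausible construction from the trigram TDM, with column names matching the lookups in the next section, is:

# Collapse the TDM to term counts, then split each trigram into its leading
# bigram (the lookup key) and its final word (the predicted next word)
freqs <- sort(rowSums(as.matrix(trigrams)), decreasing = TRUE)
parts <- strsplit(names(freqs), " ")
triDF <- data.frame(bigram = sapply(parts, function(p) paste(p[1], p[2])),
                    name = sapply(parts, function(p) p[3]),
                    freq = as.numeric(freqs), stringsAsFactors = FALSE)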

Final Touches:

The user input is then searched against the n-gram data frames. The matching candidates are weighted, and the highest-ranked words are returned.
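
The lookup keys below are presumably derived from the cleaned input; a sketch of that derivation (userText is a hypothetical variable, and this step is not shown in the original) is:

tokens <- unlist(strsplit(tolower(userText), "\\s+"))  # hypothetical cleaned input
input <- paste(tail(tokens, 2), collapse = " ")        # trigram key: last two words
input2 <- tail(tokens, 1)                              # bigram key: last word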

# Match the trigram, bigram, and unigram tables against the lookup keys
# (`input3`, the unigram fallback key, is not defined in this excerpt)
greptri <- grepl(paste0("^", input, "$"), triDF$bigram); indexedtri <- triDF[greptri, ]
grepbi <- grepl(paste0("^", input2, "$"), biDF$unigram); indexedbi <- biDF[grepbi, ]
grepuni <- grepl(paste0("^", input3, "$"), uniDF$unigram); indexeduni <- uniDF[grepuni, ]

# Stupid-Backoff-style scoring: trigram matches keep their relative frequency;
# lower orders are discounted by 0.7 per step (the 0.7^2 unigram weight is assumed)
indexedtri$s <- indexedtri$freq / sum(indexedtri$freq)
indexedbi$s <- 0.7 * indexedbi$freq / sum(indexedbi$freq)
indexeduni$s <- 0.7 * 0.7 * indexeduni$freq / sum(uniDF$freq)

# Pool the candidates, rank by score, and return the top unique words
candidates <- c(indexedtri$name, indexedbi$name, indexeduni$unigram)
score <- c(indexedtri$s, indexedbi$s, indexeduni$s)
predictWord <- data.frame(next_word = candidates, score = score, stringsAsFactors = FALSE)
predictWord <- predictWord[order(predictWord$score, decreasing = TRUE), ]
final <- head(unique(predictWord$next_word), maxResults)
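
Wrapped inside predictSB(), an illustrative call (profanityList and the n-gram data frames are assumed to be loaded) might look like:

# Top 3 predicted next words for the phrase "thanks for the"
predictSB("thanks for the", profanityList, uniDF, biDF, triDF, maxResults = 3)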