Word Predictor

Lars Bungum
March 24th, 2017

Word predictor

  • Simple, easy-to-understand interface
  • Accepts a sentence of any length and predicts the next word
  • Gives a quick response, ensuring usability
  • First, build the model
  • Next, write a prediction algorithm that looks up the model
  • Finally, present the results in the interface (a minimal sketch follows below)
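
The original interface code is not shown in these slides; as a hedged illustration, a minimal single-file app could look like the sketch below. Everything here is an assumption: that the interface is a Shiny app, and that the typed words are joined with "_" to match the model's n-gram keys. predict is the routine defined later in these slides.

library(shiny)

# Hypothetical minimal UI: one text box in, one predicted word out
ui <- fluidPage(
  textInput("sentence", "Enter a sentence:"),
  textOutput("nextword")
)

server <- function(input, output) {
  output$nextword <- renderText({
    # Assumed input conversion: underscore-join the words, as in the model tables
    predict(gsub(" +", "_", trimws(input$sentence)))
  })
}

shinyApp(ui = ui, server = server)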

Next word predictor snapshot

[Screenshot of the running application]

Prediction model

# Drop the last word from an underscore-joined n-gram,
# e.g. "one_two_three" -> "one_two"
exclude_last_word <- function(ngram) {
  matches <- gregexpr("_", ngram)              # positions of all underscores
  matchlen <- length(matches[[1]])             # number of underscores
  substring(ngram, 1, matches[[1]][matchlen] - 1)  # cut before the last one
}
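
For example, quanteda concatenates n-gram tokens with an underscore, so:

exclude_last_word("one_two_three")   # "one_two"
exclude_last_word("of_the")          # "of"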

# MLE of the conditional bigram probability: C(w1_w2) / C(w1)
get_bigram_prob <- function(bigram) {
  ngram_minus_1 <- exclude_last_word(bigram)
  bigram_prob <- bi_freq[bigram] / freq[ngram_minus_1]
  unname(bigram_prob)
}

# MLE of the conditional trigram probability: C(w1_w2_w3) / C(w1_w2)
get_trigram_prob <- function(trigram) {
  ngram_minus_1 <- exclude_last_word(trigram)
  trigram_prob <- tri_freq[trigram] / bi_freq[ngram_minus_1]
  unname(trigram_prob)
}
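
Both functions are maximum-likelihood estimates computed as relative frequencies of the counts stored in the frequency tables, e.g. for trigrams:

$$P(w_3 \mid w_1, w_2) = \frac{C(w_1 w_2 w_3)}{C(w_1 w_2)}$$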

# Build the corpus (quanteda + readtext)
corpusObject <- corpus(readtext(corpuspath))
....
# Bigram document-feature matrix; rare n-grams are trimmed to keep the table small
bi_dictdfm <- dfm(corpusObject, ngrams = 2, removePunct = TRUE, removeNumbers = TRUE)
bi_dictdfm <- dfm_trim(bi_dictdfm, min_count = mincount)
bi_freq <- colSums(bi_dictdfm)   # bigram counts
bigram_probs <- log(sapply(names(bi_freq), get_bigram_prob))   # log-probabilities
....
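
The elided lines are not shown above; presumably the unigram and trigram tables are built the same way. A hedged sketch mirroring the bigram code (variable names chosen to match those the prediction functions reference):

# Unigram table
dictdfm <- dfm(corpusObject, removePunct = TRUE, removeNumbers = TRUE)
dictdfm <- dfm_trim(dictdfm, min_count = mincount)
freq <- colSums(dictdfm)
unigram_probs <- log(freq / sum(freq))

# Trigram table
tri_dictdfm <- dfm(corpusObject, ngrams = 3, removePunct = TRUE, removeNumbers = TRUE)
tri_dictdfm <- dfm_trim(tri_dictdfm, min_count = mincount)
tri_freq <- colSums(tri_dictdfm)
trigram_probs <- log(sapply(names(tri_freq), get_trigram_prob))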

Prediction Algorithm

# Return the index (named with the full n-gram) of the most probable
# n-gram in problist that starts with `ngram` + "_", or FALSE if none match
predict_ngram <- function(ngram, problist) {
  query <- paste("^", ngram, "_", sep = '')
  matches <- grep(query, names(problist))
  if (length(matches) > 0) {
    which.max(problist[matches])
  } else {
    FALSE
  }
}
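
A usage example (the table contents here are hypothetical):

# Suppose trigram_probs contains "one_of_the" and "one_of_those"
tripred <- predict_ngram("one_of", trigram_probs)
names(tripred)   # e.g. "one_of_the", the most probable completion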
predict <- function(ngram) {
  ngramlength <- nchar(ngram)
  if (ngramlength == 0) {
    # Empty input: return the most probable unigram
    return(names(which.max(unigram_probs)))
  }
  # Strip a trailing separator, if any
  if (substring(ngram, ngramlength) == "_") {
    ngram <- substr(ngram, 1, ngramlength - 1)
  }
  # Try the trigram model first
  tripred <- predict_ngram(ngram, trigram_probs)
  if (tripred) {
    pred <- names(tripred)
    # Keep only the predicted word (after the second underscore)
    secmatch <- gregexpr("_", pred)[[1]][2]
    pred <- substring(pred, secmatch + 1)
  }
....
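
The remainder of predict is elided. Judging from the back-off description in the conclusions, the continuation presumably resembles this sketch (a hypothetical reconstruction, not the original code):

  else {
    # Back off: keep only the last word the user entered
    seps <- gregexpr("_", ngram)[[1]]
    lastword <- if (seps[1] == -1) ngram else substring(ngram, seps[length(seps)] + 1)
    bipred <- predict_ngram(lastword, bigram_probs)
    if (bipred) {
      pred <- names(bipred)
      pred <- substring(pred, gregexpr("_", pred)[[1]][1] + 1)  # second word
    } else {
      pred <- names(which.max(unigram_probs))  # most frequent word overall
    }
  }
  pred
}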

Conclusions

  • A simple model was chosen to ensure responsiveness
  • Three tables are created, holding log-probabilities of n-grams up to order 3
  • The model looks for the longest matching n-gram and backs off successively to lower orders
  • No smoothing is implemented
  • In future work, a trie data structure will be used to save space, and the model will be smoothed to handle unknown words better
  • Taking longer-distance dependencies into account is necessary for accurate predictions. Hence, models like ANNs (RNN and LSTM models) or Conditional Random Fields will be considered
  • Prediction is split in two: a) a main routine that dispatches lookups by input length and b) a routine that queries the trigram, bigram, or unigram model
  • If a trigram is input, the trigram model is queried for entries starting with its first two words. The entry with the highest probability is selected, and the last word of that trigram is returned as the prediction
  • If no trigram starting with the first two tokens is found, the algorithm backs off to the bigram model, looking for bigrams starting with the last word entered by the user
  • If no bigram is found either, the most likely unigram is returned, i.e. the most frequent word in the corpus