Word Predictor

Lars Bungum
March 24th, 2017

Word predictor

  • Simple, easy-to-understand interface
  • Accepts a sentence of any length and predicts the next word
  • Gives a quick response, ensuring usability
  • First, build the model
  • Next, write a prediction algorithm that looks up the model
  • Finally, present the results in the interface (a minimal sketch follows below)
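
The original interface code is not shown in these slides; as a hedged illustration, a minimal single-file app could look like the sketch below. Everything here is an assumption: that the interface is a Shiny app, and that the typed words are joined with "_" to match the model's n-gram keys. predict is the routine defined later in these slides.

library(shiny)

# Hypothetical minimal UI: one text box in, one predicted word out
ui <- fluidPage(
  textInput("sentence", "Enter a sentence:"),
  textOutput("nextword")
)

server <- function(input, output) {
  output$nextword <- renderText({
    # Assumed input conversion: underscore-join the words, as in the model tables
    predict(gsub(" +", "_", trimws(input$sentence)))
  })
}

shinyApp(ui = ui, server = server)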

Next word predictor snapshot

[Screenshot of the running application]

Prediction model

# Drop the last word from an underscore-joined n-gram,
# e.g. "one_two_three" -> "one_two"
exclude_last_word <- function(ngram) {
  matches <- gregexpr("_", ngram)              # positions of all underscores
  matchlen <- length(matches[[1]])             # number of underscores
  substring(ngram, 1, matches[[1]][matchlen] - 1)  # cut before the last one
}
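
For example, quanteda concatenates n-gram tokens with an underscore, so:

exclude_last_word("one_two_three")   # "one_two"
exclude_last_word("of_the")          # "of"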

# MLE of the conditional bigram probability: C(w1_w2) / C(w1)
get_bigram_prob <- function(bigram) {
  ngram_minus_1 <- exclude_last_word(bigram)
  bigram_prob <- bi_freq[bigram] / freq[ngram_minus_1]
  unname(bigram_prob)
}

# MLE of the conditional trigram probability: C(w1_w2_w3) / C(w1_w2)
get_trigram_prob <- function(trigram) {
  ngram_minus_1 <- exclude_last_word(trigram)
  trigram_prob <- tri_freq[trigram] / bi_freq[ngram_minus_1]
  unname(trigram_prob)
}
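
Both functions are maximum-likelihood estimates computed as relative frequencies of the counts stored in the frequency tables, e.g. for trigrams:

$$P(w_3 \mid w_1, w_2) = \frac{C(w_1 w_2 w_3)}{C(w_1 w_2)}$$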

# Build the corpus (quanteda + readtext)
corpusObject <- corpus(readtext(corpuspath))
....
# Bigram document-feature matrix; rare n-grams are trimmed to keep the table small
bi_dictdfm <- dfm(corpusObject, ngrams = 2, removePunct = TRUE, removeNumbers = TRUE)
bi_dictdfm <- dfm_trim(bi_dictdfm, min_count = mincount)
bi_freq <- colSums(bi_dictdfm)   # bigram counts
bigram_probs <- log(sapply(names(bi_freq), get_bigram_prob))   # log-probabilities
....
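
The elided lines are not shown above; presumably the unigram and trigram tables are built the same way. A hedged sketch mirroring the bigram code (variable names chosen to match those the prediction functions reference):

# Unigram table
dictdfm <- dfm(corpusObject, removePunct = TRUE, removeNumbers = TRUE)
dictdfm <- dfm_trim(dictdfm, min_count = mincount)
freq <- colSums(dictdfm)
unigram_probs <- log(freq / sum(freq))

# Trigram table
tri_dictdfm <- dfm(corpusObject, ngrams = 3, removePunct = TRUE, removeNumbers = TRUE)
tri_dictdfm <- dfm_trim(tri_dictdfm, min_count = mincount)
tri_freq <- colSums(tri_dictdfm)
trigram_probs <- log(sapply(names(tri_freq), get_trigram_prob))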

Prediction Algorithm

# Return the index (named with the full n-gram) of the most probable
# n-gram in problist that starts with `ngram` + "_", or FALSE if none match
predict_ngram <- function(ngram, problist) {
  query <- paste("^", ngram, "_", sep = '')
  matches <- grep(query, names(problist))
  if (length(matches) > 0) {
    which.max(problist[matches])
  } else {
    FALSE
  }
}
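
A usage example (the table contents here are hypothetical):

# Suppose trigram_probs contains "one_of_the" and "one_of_those"
tripred <- predict_ngram("one_of", trigram_probs)
names(tripred)   # e.g. "one_of_the", the most probable completion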
predict <- function(ngram) {
  ngramlength <- nchar(ngram)
  if (ngramlength == 0) {
    # Empty input: return the most probable unigram
    return(names(which.max(unigram_probs)))
  }
  # Strip a trailing separator, if any
  if (substring(ngram, ngramlength) == "_") {
    ngram <- substr(ngram, 1, ngramlength - 1)
  }
  # Try the trigram model first
  tripred <- predict_ngram(ngram, trigram_probs)
  if (tripred) {
    pred <- names(tripred)
    # Keep only the predicted word (after the second underscore)
    secmatch <- gregexpr("_", pred)[[1]][2]
    pred <- substring(pred, secmatch + 1)
  }
....
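
The remainder of predict is elided. Judging from the back-off description in the conclusions, the continuation presumably resembles this sketch (a hypothetical reconstruction, not the original code):

  else {
    # Back off: keep only the last word the user entered
    seps <- gregexpr("_", ngram)[[1]]
    lastword <- if (seps[1] == -1) ngram else substring(ngram, seps[length(seps)] + 1)
    bipred <- predict_ngram(lastword, bigram_probs)
    if (bipred) {
      pred <- names(bipred)
      pred <- substring(pred, gregexpr("_", pred)[[1]][1] + 1)  # second word
    } else {
      pred <- names(which.max(unigram_probs))  # most frequent word overall
    }
  }
  pred
}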

Conclusions

  • A simple model was chosen to ensure responsiveness
  • Three tables are created, holding log-probabilities of n-grams up to order 3
  • The model looks for the longest matching n-gram and backs off successively to lower orders
  • No smoothing is implemented
  • In future work, a trie data structure will be used to save space, and the model will be smoothed to handle unknown words better
  • Taking longer-distance dependencies into account is necessary for accurate predictions. Hence, models like ANNs (RNN and LSTM models) or Conditional Random Fields will be considered
  • Prediction is split in two: a) a main routine that dispatches lookups by input length and b) a routine that queries the trigram, bigram, or unigram model
  • If a trigram is input, the trigram model is queried for entries starting with its first two words. The entry with the highest probability is selected, and the last word of that trigram is returned as the prediction
  • If no trigram starting with the first two tokens is found, the algorithm backs off to the bigram model, looking for bigrams starting with the last word entered by the user
  • If no bigram is found either, the most likely unigram is returned, i.e. the most frequent word in the corpus