predictiveText

Predictive Text

Goal: Create an algorithm to predict the next word in a series.

Constraints: Must run within a reasonable amount of time

Implementation - Creating N-grams

Read each line of the various data sources (the amount read is limited by available memory and time).

Remove all punctuation and stop words.

Process a combined corpus from all sources.

Process in batches to avoid freezing.

Use Kneser-Ney smoothing components.

Use the tm-based create_ngram_model function to create n-grams of orders 1 through 5.
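The batching and n-gram steps above can be sketched in base R as follows. This is a minimal illustration, not the project's actual create_ngram_model: the helper names make_ngrams, merge_counts and count_ngrams, and the batch_size parameter, are hypothetical. It assumes the lines have already been cleaned, and it stores raw counts only, without any Kneser-Ney smoothing.

```r
# Hypothetical helper: split one cleaned line into its n-grams of order n.
make_ngrams <- function(line, n) {
  words <- strsplit(trimws(line), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: add the counts in `new` to the running totals.
merge_counts <- function(total, new) {
  shared <- intersect(names(total), names(new))
  total[shared] <- total[shared] + new[shared]
  fresh <- setdiff(names(new), names(total))
  c(total, new[fresh])
}

# Count n-grams of orders 1..max_n, one batch of lines at a time, so that
# only one batch of raw n-grams is expanded in memory at once.
count_ngrams <- function(lines, max_n = 5, batch_size = 1000) {
  counts <- rep(list(integer(0)), max_n)
  batches <- split(lines, ceiling(seq_along(lines) / batch_size))
  for (batch in batches) {
    for (n in seq_len(max_n)) {
      grams <- unlist(lapply(batch, make_ngrams, n = n))
      if (length(grams) == 0) next
      tab <- table(grams)
      new <- setNames(as.integer(tab), names(tab))
      counts[[n]] <- merge_counts(counts[[n]], new)
    }
  }
  counts
}

# Usage: counts[[n]] is a named integer vector of n-gram counts.
counts <- count_ngrams(c("the cat sat", "the cat ran"),
                       max_n = 3, batch_size = 1)
counts[[2]]["the cat"]
```

In a real run the lines would more likely be read from a file connection in chunks, for example with readLines(con, n = batch_size), rather than held in memory all at once.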

Implementation - cleaning input

User input in the Shiny app needs to be sanitized in the same way as the original training text: make everything lower case and remove punctuation.

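library(tm)

# Clean raw text the same way for both the training corpus and user input.
enhanced_preprocess <- function(text) {
  text <- tolower(text)
  # Drop everything except letters, digits, whitespace, apostrophes and hyphens
  text <- gsub("[^[:alnum:][:space:]'-]", "", text)
  text <- removeNumbers(text)
  # Single pass that keeps contractions ("don't") and intra-word dashes
  text <- removePunctuation(text,
                            preserve_intra_word_contractions = TRUE,
                            preserve_intra_word_dashes = TRUE)
  text <- removeWords(text, stopwords("english"))
  text <- stripWhitespace(text)
  text <- trimws(text) # Trim leading and trailing whitespace (base R)
  return(text)
}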

Implementation - predicting text

Depending on the length of the user input, check it against the appropriate n-gram table. If the input is long enough, match its last four words against the 5-grams first. If no match is found, back off to lower-order n-grams.
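The backoff lookup can be sketched like this. It is an illustration under assumptions, not the app's actual code: backoff_lookup is a hypothetical name, and ngram_counts is assumed to be a list in which element n is a named integer vector of n-gram counts.

```r
# Hypothetical lookup: try the longest available context first, then back
# off to shorter contexts until some n-gram matches.
backoff_lookup <- function(input, ngram_counts) {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # An n-gram of order k + 1 needs the last k words as context.
  max_context <- min(length(ngram_counts) - 1, length(words))
  for (k in rev(seq_len(max_context))) {
    table_n <- ngram_counts[[k + 1]]
    if (length(table_n) == 0) next
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    hits <- table_n[startsWith(names(table_n), prefix)]
    if (length(hits) > 0) {
      return(data.frame(word = substring(names(hits), nchar(prefix) + 1),
                        count = as.integer(hits),
                        order = k + 1,
                        stringsAsFactors = FALSE))
    }
  }
  # No context matched: fall back to plain unigram counts.
  uni <- ngram_counts[[1]]
  data.frame(word = names(uni), count = as.integer(uni), order = 1,
             stringsAsFactors = FALSE)
}

# Usage with a toy model of orders 1 to 3 (made-up counts).
toy_counts <- list(
  c(the = 3L, cat = 2L, sat = 1L, ran = 1L),
  c("the cat" = 2L, "cat sat" = 1L, "cat ran" = 1L),
  c("the cat sat" = 1L, "the cat ran" = 1L)
)
backoff_lookup("I saw the cat", toy_counts)
```

This version stops at the first order that produces a match. A variant could instead gather matches from every order and leave the ranking to the scoring step.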

Collect all candidate next words from the n-gram matches, weighting them so that matches from longer n-grams get a better score.

Take the top 5 scoring words and return them to the user.
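One possible way to weight and rank the candidates is sketched below. The README does not state the exact scoring formula, so this is an assumption: each candidate's relative frequency within its n-gram order is multiplied by a fixed discount for every order it falls short of the longest one. This is a simple fixed-discount scheme standing in for the real weighting, not Kneser-Ney. The names score_candidates, discount, max_order and top_n are hypothetical.

```r
# Hypothetical scoring: `candidates` is a data frame with columns
# word, count and order (the n-gram order that produced the match).
score_candidates <- function(candidates, max_order = 5,
                             discount = 0.4, top_n = 5) {
  if (nrow(candidates) == 0) return(character(0))
  # Relative frequency of each candidate within its own n-gram order.
  totals <- vapply(split(candidates$count, candidates$order),
                   sum, numeric(1))
  rel_freq <- candidates$count /
    unname(totals[as.character(candidates$order)])
  # Penalize matches from shorter n-grams (assumed discount factor).
  score <- rel_freq * discount^(max_order - candidates$order)
  # Keep the best score per word, then return the top_n words.
  best <- vapply(split(score, candidates$word), max, numeric(1))
  head(names(sort(best, decreasing = TRUE)), top_n)
}

# Usage with made-up candidates gathered from 3-gram and 2-gram matches.
toy_candidates <- data.frame(
  word  = c("sat", "ran", "sat", "slept"),
  count = c(2, 1, 3, 1),
  order = c(3, 3, 2, 2),
  stringsAsFactors = FALSE
)
score_candidates(toy_candidates, max_order = 3)
```

Taking the maximum score per word means a word found by a long n-gram is not pushed down by a weaker match at a lower order. Summing the scores instead would be another reasonable choice.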