Predicting N-Grams from the SwiftKey Dataset

owilks
11/3/2019

Description

This algorithm pulls data from a corpus (a collection of written texts) provided by a team at TouchType/SwiftKey, now a subsidiary of Microsoft.

The corpus is stripped into its raw components, broken into documents, and then decomposed into n-grams; the resulting n-gram counts let us predict the “next word” in a sentence from a trailing subset of the phrase provided.

If no matching phrase can be found, the algorithm incorporates the word supplied by the user and updates the relevant probability distribution.
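As a minimal sketch of the core idea (using a toy lookup table rather than the real SwiftKey data), prediction amounts to matching the tail of the user's phrase against stored n-grams and sampling the next word in proportion to its observed frequency:

# Toy illustration of the lookup-and-sample idea (not the real SwiftKey table).
lookup <- data.frame(
  open = c("the end of", "the end of", "when it comes"),
  last = c("the", "a", "to"),
  frequency = c(49, 12, 42),
  stringsAsFactors = FALSE
)

predict_toy <- function(phrase, table) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  key   <- paste(tail(words, 3), collapse = " ")      # last three words
  hits  <- table[table$open == key, ]
  if (nrow(hits) == 0) return(NA_character_)          # no match: back off elsewhere
  sample(hits$last, size = 1, prob = hits$frequency)  # frequency-weighted draw
}

predict_toy("I was there at the end of", lookup)       # most likely "the"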

Tweaking the Underlying Data

A few key choices were made while trimming the corpus to maximize the accuracy of the underlying dataset:

  • Instead of removing expletives, converting them to a common keyword (“cnsrd”) to minimize data loss
  • Similarly, instead of deleting all numbers, converting single digits (0-9) and the ordinals 1st-9th to their alphabetic string equivalents, and replacing larger numbers with the keyword “numrep” for number-representative
  • Instead of removing standard punctuation (dashes, commas, apostrophes, etc.), replacing it with an underscore character “_” to minimize false-equivalence scenarios: e.g. “well” vs “we'll”
library(quanteda)   # corpus(), texts(), corpus_segment()
library(magrittr)   # %>%

# 'badp' is a single regex alternation of expletives, defined elsewhere in the
# full script (see the sketch after this function for one possible construction).

fullprocess <- function(text){
  # Sample ~1% of the documents to keep memory manageable
  sam <- sample(texts(text), size = floor(.01 * length(text)), replace = FALSE) %>% corpus()

  # Collapse runs of two or more periods (ellipses etc.) into a single sentence break
  texts(sam) <- gsub("\\.{2,}", ".", texts(sam))

  # Split documents into sentence-level segments on ".", "?" and "!"
  sam <- corpus_segment(sam, pattern = ".", valuetype = "fixed",
                        pattern_position = "after", extract_pattern = FALSE) %>%
    corpus_segment(pattern = "?", valuetype = "fixed",
                   pattern_position = "after", extract_pattern = FALSE) %>%
    corpus_segment(pattern = "!", valuetype = "fixed",
                   pattern_position = "after", extract_pattern = FALSE)

  # Replace remaining punctuation with "_" and lower-case everything
  texts(sam) <- gsub("[[:punct:]]", "_", texts(sam))
  texts(sam) <- tolower(texts(sam))

  # Mask expletives with a common keyword instead of dropping them
  texts(sam) <- gsub(badp, "cnsrd", texts(sam))

  # Spell out small ordinals and digits; everything larger becomes a keyword
  texts(sam) <- gsub("\\b1st\\b", "first",   texts(sam))
  texts(sam) <- gsub("\\b2nd\\b", "second",  texts(sam))
  texts(sam) <- gsub("\\b3rd\\b", "third",   texts(sam))
  texts(sam) <- gsub("\\b4th\\b", "fourth",  texts(sam))
  texts(sam) <- gsub("\\b5th\\b", "fifth",   texts(sam))
  texts(sam) <- gsub("\\b6th\\b", "sixth",   texts(sam))
  texts(sam) <- gsub("\\b7th\\b", "seventh", texts(sam))
  texts(sam) <- gsub("\\b8th\\b", "eighth",  texts(sam))
  texts(sam) <- gsub("\\b9th\\b", "ninth",   texts(sam))
  texts(sam) <- gsub("\\b(\\d)+th\\b", "nth", texts(sam))
  texts(sam) <- gsub("\\b1\\b", "one",   texts(sam))
  texts(sam) <- gsub("\\b2\\b", "two",   texts(sam))
  texts(sam) <- gsub("\\b3\\b", "three", texts(sam))
  texts(sam) <- gsub("\\b4\\b", "four",  texts(sam))
  texts(sam) <- gsub("\\b5\\b", "five",  texts(sam))
  texts(sam) <- gsub("\\b6\\b", "six",   texts(sam))
  texts(sam) <- gsub("\\b7\\b", "seven", texts(sam))
  texts(sam) <- gsub("\\b8\\b", "eight", texts(sam))
  texts(sam) <- gsub("\\b9\\b", "nine",  texts(sam))
  texts(sam) <- gsub("\\b0\\b", "zero",  texts(sam))
  texts(sam) <- gsub("\\b(\\d)+\\b", "numrep", texts(sam))

  return(sam)
}
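
The badp pattern referenced above is not shown in this snippet; a minimal sketch of how it could be built, assuming a plain-text profanity list (“badwords.txt” is a placeholder filename), looks like this:

# Hypothetical construction of the 'badp' expletive pattern used in fullprocess().
# "badwords.txt" is a placeholder: one expletive per line.
badwords <- readLines("badwords.txt")
badp     <- paste0("\\b(", paste(badwords, collapse = "|"), ")\\b")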

Subsetting Strategy & Accuracy

  • Ingesting all of the data would be too memory-expensive, so we chose to build the predictive model from a subset: the top 1% most common n-grams (bi-, tri-, and quad-grams) across the corpus.
  • Each of those sets of n-grams, with their frequency counts, defines a probability distribution used to “guess” the following word at random (this ensures more variability in the guessed results). A sample of the quad-gram table is shown below: feature is the full n-gram, frequency its count, open the leading words used for lookup, and last the word to be predicted.
                 feature frequency            open    last
1 thanks for the follow_        51 thanks for the  follow_
2        the rest of the        50    the rest of      the
3         the end of the        49     the end of      the
4       when it comes to        42  when it comes       to
5          at the end of        35     at the end       of
6        one of the most        34     one of the     most
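
A sketch of how a table like this could be produced with quanteda is shown below; build_ngram_table() and the exact cutoff are illustrative choices rather than the production code:

library(quanteda)

# Illustrative sketch: build an n-gram frequency table with 'open'/'last' columns.
# 'sam' is a cleaned corpus as returned by fullprocess(); 'top_prop' is the 1% cutoff.
build_ngram_table <- function(sam, n = 4, top_prop = 0.01) {
  ngrams <- tokens_ngrams(tokens(sam), n = n, concatenator = " ")
  d      <- dfm(ngrams)
  freqs  <- topfeatures(d, n = max(1L, floor(top_prop * nfeat(d))))

  data.frame(
    feature   = names(freqs),
    frequency = as.integer(freqs),
    # 'open' = everything but the final word; 'last' = the word to predict
    open = sub("\\s+\\S+$", "", names(freqs)),
    last = sub("^.*\\s", "", names(freqs)),
    stringsAsFactors = FALSE
  )
}

quadgrams <- build_ngram_table(sam)
head(quadgrams)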

Application Dynamics

Once the user enters a phrase into the input box, the app trims it down to the final three words and uses them to predict the next word.

If no match for that tri-gram can be found, the algorithm falls back to searching with the final two words (bi-gram), and so on.

Afterwards, the system is updated to include the “correct” word in the bi-gram/tri-gram search tables, with weighting that specifically favors user input over the existing components of the probability distribution.

predcomplex <- function(text, table){
  # 'table' columns: feature, frequency, open (lookup key), last (predicted word).
  # The caller is expected to pass a key that exists in the table; misses are
  # handled by the back-off logic described above.
  corr  <- "No"
  count <- 0

  # Offer up to three frequency-weighted guesses for the given key
  while(corr == "No" && count < 3){
    nextword <- sample(
      x    = table[table[,3] == text, 4],
      size = 1,
      prob = table[table[,3] == text, 2]
    )
    corr  <- readline(paste("Is your next word", nextword, "? Yes/No? "))
    count <- count + 1
  }

  if(corr == "Yes"){
    # Reward a confirmed guess: user feedback outweighs the corpus counts
    table[table[,3] == text, 2] <- table[table[,3] == text, 2] + 25
  } else {
    # No guess accepted: take the word directly from the user
    nextword <- readline("Sorry! Please tell us your next word! ")
  }

  # Return both the prediction and the re-weighted table so the update persists
  return(list(nextword = nextword, table = table))
}
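
The back-off described above is not part of predcomplex() itself; a minimal sketch of how it could wrap the function (predict_backoff() and the per-order list of tables are illustrative names, not the production code) might look like:

# Illustrative back-off wrapper: try the last 3 words, then 2, then 1.
# 'tables' is assumed to be a named list of n-gram tables, e.g.
# list(`3` = quadgrams, `2` = trigrams, `1` = bigrams), each with the
# same four columns used by predcomplex().
predict_backoff <- function(phrase, tables){
  words <- strsplit(tolower(phrase), "\\s+")[[1]]

  for(n in c(3, 2, 1)){
    if(length(words) < n) next
    key <- paste(tail(words, n), collapse = " ")
    tab <- tables[[as.character(n)]]
    if(any(tab[,3] == key)){
      return(predcomplex(key, tab))   # match found at this order
    }
  }
  # No n-gram matched: fall back to asking the user directly
  list(nextword = readline("Sorry! Please tell us your next word! "),
       table = NULL)
}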

Next Steps

  • Implementing longer n-grams and a more comprehensive search model would be a straightforward next step that should not require much additional memory.

  • Additionally, there are more sophisticated probability-distribution mechanics, e.g. the Chinese restaurant process, that could be implemented to increase accuracy without a severe increase in memory usage (see the sketch after this list).

  • Finally, a more robust user interface, one that offered users a choice of three potential predictions instead of one and had a richer user-feedback response tree, would better flesh out the call-and-response mechanics and make the app more intuitive.
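
As a rough illustration of the Chinese restaurant process idea (not something currently in the app), each new observation either re-uses an existing “table” with probability proportional to its count or opens a new one with probability proportional to a concentration parameter alpha:

# Toy Chinese restaurant process draw: 'counts' are the current "table" sizes.
# 'alpha' is the concentration parameter; higher alpha favours unseen words.
crp_draw <- function(counts, alpha = 1){
  probs <- c(counts, alpha)                        # existing tables + new table
  pick  <- sample(seq_along(probs), size = 1, prob = probs)
  if(pick > length(counts)) "new" else names(counts)[pick]
}

crp_draw(c(the = 49, a = 12, this = 5), alpha = 2)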

Thank you for taking the time to read this! Guess well!