Predicting N-Grams from the SwiftKey Dataset

owilks
11/3/2019

Description

This algorithm pulls data from a corpus (a collection of written texts) provided by a team at TouchType/SwiftKey, a subsidiary of Microsoft.

The corpus is stripped into its raw components, broken into documents, and then used to build our predictive algorithm from the breakdown of n-grams; this lets us predict the “next word” in a sentence using a subset of the phrase provided.

If no such phrase can be found, it incorporates data from the user and updates the relevant probability distribution.
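As a small illustration of the n-gram breakdown (a minimal sketch using quanteda's tokens_ngrams(); the sample sentence is hypothetical, not taken from the corpus):

    library(quanteda)

    # break a toy document into tokens, then into bi-grams and tri-grams
    sam <- tokens("thanks for the follow")
    tokens_ngrams(sam, n = 2, concatenator = " ")  # "thanks for" "for the" "the follow"
    tokens_ngrams(sam, n = 3, concatenator = " ")  # "thanks for the" "for the follow"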

Tweaking the Underlying Data

A few key choices were made while trimming the corpus to maximize the accuracy of the underlying dataset:

  • Instead of removing them, converting expletives to a common keyword to minimize data loss (see the samples below)
  • Similarly, instead of deleting all numbers, converting those in the 1-9 range to their alphabetic string equivalents and replacing larger numbers with the keyword “numrep” (for “number representative”)
  • Instead of removing them, replacing some standard punctuation (apostrophes, dashes, commas, etc.) with an underscore character “_” to minimize false-equivalence scenarios: e.g. “well” vs “we'll”
#Samples of the contraction and number adjustment strategy
        texts(sam)<-gsub("\\bisn't\\b","is not",texts(sam))

        texts(sam)<-gsub("\\b1st\\b","first",texts(sam))

        texts(sam)<-gsub("\\b4\\b","four",texts(sam))

Subsetting Strategy & Accuracy

  • Ingesting all of the data would be too memory-expensive, so we chose to use a subset of the most common n-grams (the top 1% of bi-, tri-, and quad-grams) across the corpus for our predictive model.
  • Each of those sets of n-grams, together with its frequency counts, defines a probability distribution used to “guess” the following word at random (this ensures more variability in the guessed results); a sketch of this step follows the sample below.
    rawB<-readLines(pathBlogs, encoding = "UTF-8", skipNul = TRUE, n=9000)
    rawT<-readLines(pathTwit, encoding = "UTF-8", skipNul = TRUE, n=7000)
    rawN<-readLines(pathNews, encoding = "UTF-8", skipNul = TRUE, n=3000)
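As a sketch of how the frequency tables and the weighted “guess” fit together (shown for tri-grams only; the open/close/count column names mirror the stat objects used in the search code below, but the cutoff and helper shown here are illustrative rather than the exact production code):

    library(quanteda)

    # build tri-gram frequency counts from the sampled lines
    toks <- tokens(corpus(c(rawB, rawT, rawN)))
    tri  <- tokens_ngrams(toks, n = 3, concatenator = " ")
    freq <- sort(table(unlist(as.list(tri))), decreasing = TRUE)

    # keep only the top 1% most common tri-grams
    freq <- freq[seq_len(ceiling(0.01 * length(freq)))]

    # split each tri-gram into its opening words and closing word
    stat3 <- data.frame(open  = sub(" \\S+$", "", names(freq)),
                        close = sub("^.* ", "", names(freq)),
                        count = as.integer(freq))

    # guess the next word at random, weighted by frequency
    prednext <- function(phrase, stat){
      cand <- stat[stat$open == phrase, ]
      sample(cand$close, 1, prob = cand$count)
    }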

Application Dynamics

Once the user enters input into the box, the app trims the phrase down to the final 3 words and uses them to predict the next word.

If no matching tri-gram can be found to create a prediction, then the algorithm searches for a prediction with the final 2 words (bi-gram), and so on.

Afterwards, the system is updated to include the “correct word” in the new bi-gram/tri-gram search function, weighted specifically to favor user input over the existing components of the probability distribution (a sketch of this update follows the search code below).

    if(count>2 & sum(stat4$open==phrase4)>0){
        nextword<-prednext(phrase4,stat4)          # quad-gram match on the final 3 words
    } else if(count>1 & sum(stat3$open==phrase3)>0){
        nextword<-prednext(phrase3,stat3)          # back off to the tri-gram table
    } else if(count>0 & sum(stat2$open==phrase2)>0){
        nextword<-prednext(phrase2,stat2)          # back off to the bi-gram table
    } else {
        nextword<-"We don't have an accurate prediction for this phrase! Try another!"
    }
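A rough sketch of the feedback update described above, assuming the stat tables carry open/close/count columns as in the earlier sketch; the helper name and the weight of 3 are illustrative, not the values used in the app:

    # add the observed "correct word" to the table, or upweight it if present,
    # so that user input is favored over the existing counts
    updatestat <- function(phrase, word, stat, weight = 3){
      hit <- stat$open == phrase & stat$close == word
      if(any(hit)){
        stat$count[hit] <- stat$count[hit] + weight
      } else {
        stat <- rbind(stat, data.frame(open = phrase, close = word, count = weight))
      }
      stat
    }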

Next Steps

  • Implementing longer n-grams and a more comprehensive search model would be a straightforward next step that shouldn't require too much additional memory.

  • Additionally, there are more sophisticated probability-distribution mechanics, e.g. the Chinese restaurant process, that could be implemented in a way that increases our accuracy without a severe increase in the amount of memory we need to use.

  • Finally, a more robust user interface that gave users the choice of one of three potential predictions instead of a single guess, together with a richer user-feedback response tree, would better flesh out the call-and-response mechanics of the app and make it more intuitive.

Thank you for taking the time to read this! Guess well!