Word Salad: Word Prediction Algorithm Teaching Tool

Jesse Sharp
April 17, 2016

Coursera Johns Hopkins Data Science Capstone Project with SwiftKey


Concept

Word Salad is intended to be a starting platform for introducing students to natural language processing through word prediction. It is designed to be used on a device with enough computing power to run R and a Shiny app.

Goals:

Increase student engagement by giving them a tangible starting point.
Stop reinventing the wheel, giving creative learners more of a chance to go further.
Be accessible to new students and offer a way to engage verbal learners with numbers.
Can be used to teach many topics depending on course emphasis.

The app is simple: it takes a phrase as input and returns a possible next word. Guesses are generated by three different methods using a database of N-grams.
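The slides do not say how the N-gram database was built, so the following is only a minimal sketch of one way to produce such a table in base R. The toy corpus, the function name `build_ngrams`, and the column names `gram` and `freq` are assumptions made for illustration.

```r
# Build a frequency table of n-grams from a character vector of sentences.
# The corpus and the column names `gram` and `freq` are hypothetical.
build_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    # Lower-case, then split on anything that is not a letter or apostrophe
    words <- strsplit(tolower(line), "[^a-z']+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)   # most frequent first
  data.frame(gram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

corpus <- c("I like green salad", "I like green tea", "You like green salad")
biGrams  <- build_ngrams(corpus, 2)
triGrams <- build_ngrams(corpus, 3)

biGrams$gram[1]   # "like green", the most frequent bi-gram in the toy corpus
```

Sorting by decreasing frequency up front means a later lookup can simply take the first few matching rows.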

Prediction Approach - Back-off and Good-Turing Smoothing

First, Stupid Backoff is used to generate a list of possible words. Then a final prediction is made using:

Maximum Likelihood Estimate
Longest Word
Crude Good-Turing Smoothing
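As a rough illustration of how the three selection rules differ, here is a sketch that applies each of them to the same toy candidate list. The candidate table, the column names `word` and `freq`, and all function names are assumptions. The Good-Turing step is a simplified textbook form, computed over the candidate list only, and may not match what the app actually does.

```r
# Toy candidate list, as a back-off search might return it (hypothetical data).
candidates <- data.frame(
  word = c("salad", "tea", "vegetables", "beans"),
  freq = c(3L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)

# 1. Maximum likelihood estimate: relative frequency, pick the largest.
pick_mle <- function(cand) {
  p <- cand$freq / sum(cand$freq)
  cand$word[which.max(p)]
}

# 2. Longest word: pick the candidate with the most characters.
pick_longest <- function(cand) {
  cand$word[which.max(nchar(cand$word))]
}

# 3. Crude Good-Turing: replace each count k with (k + 1) * N[k + 1] / N[k],
#    where N[k] is the number of candidates seen exactly k times.
#    When N[k + 1] is zero the raw count is kept (a simplification).
good_turing_counts <- function(freq) {
  n_of <- function(k) sum(freq == k)
  vapply(freq, function(k) {
    n_next <- n_of(k + 1)
    if (n_next == 0) as.numeric(k) else (k + 1) * n_next / n_of(k)
  }, numeric(1))
}

pick_good_turing <- function(cand) {
  cand$word[which.max(good_turing_counts(cand$freq))]
}

pick_mle(candidates)          # "salad"
pick_longest(candidates)      # "vegetables"
pick_good_turing(candidates)  # "salad"
```

On this toy list the adjusted counts are 3, 3, 1, 1, so the smoothed ranking ties "salad" with "tea" and `which.max` keeps the first. That kind of instability with small counts is one reason the method is labelled crude.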

To illustrate the back-off step: check each N-gram table (quad-grams, then tri-grams, then bi-grams) for an “exact” match based on the length of the query, shortening the query as needed, as shown below. Then select the most frequently occurring, or “most probable”, match, and we have a backoff algorithm with a maximum likelihood estimate.

## Excerpt from the search function body: x is assumed to hold the last
## three words of the input, nwords the number of candidates to return.
## Quad-Grams
    if (length(x) == 3) {
        query <- paste(x, collapse = " ")
        idx <- which(substr(quadGrams[, gram], 1, nchar(query)) == query)
        if (length(idx) > 0) return(head(quadGrams[idx, ], nwords))
    }
## Tri-Grams (no quad-gram match: back off to the last two words)
    query <- paste(x[2], x[3])
    idx <- which(substr(triGrams[, gram], 1, nchar(query)) == query)
    if (length(idx) > 0) return(head(triGrams[idx, ], nwords))
## Bi-Grams (no tri-gram match: back off to the last word)
    query <- x[3]
    idx <- which(substr(biGrams[, gram], 1, nchar(query)) == query)
    if (length(idx) > 0) return(head(biGrams[idx, ], nwords))
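For readers who want to run the idea end to end, the sketch below wraps the same back-off search in a self-contained function. It is not the app's code: the toy tables, the function name `backoff_candidates`, the `word` column, and the trailing-space match are all assumptions made for illustration.

```r
# Toy n-gram tables (hypothetical contents).
quadGrams <- data.frame(gram = c("i like green salad", "i like green tea"),
                        freq = c(2L, 1L), stringsAsFactors = FALSE)
triGrams  <- data.frame(gram = c("like green salad", "eat green beans"),
                        freq = c(3L, 1L), stringsAsFactors = FALSE)
biGrams   <- data.frame(gram = c("green salad", "green tea", "red wine"),
                        freq = c(5L, 4L, 2L), stringsAsFactors = FALSE)

backoff_candidates <- function(phrase, nwords = 10) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- tail(words, 3)                        # at most three words of context
  tables <- list(biGrams, triGrams, quadGrams)   # indexed by context length
  while (length(words) > 0) {
    tbl <- tables[[length(words)]]
    # Trailing space so that "green" does not match "greenhouse ..."
    query <- paste0(paste(words, collapse = " "), " ")
    hits <- tbl[startsWith(tbl$gram, query), , drop = FALSE]
    if (nrow(hits) > 0) {
      hits$word <- sub(".* ", "", hits$gram)     # last word of each n-gram
      return(head(hits[order(-hits$freq), ], nwords))
    }
    words <- words[-1]                           # back off: drop the oldest word
  }
  NULL                                           # nothing matched at any order
}

backoff_candidates("I like green")$word[1]   # "salad" (quad-gram match)
backoff_candidates("we eat green")$word[1]   # "beans" (backs off to tri-grams)
```

Taking the first row of the returned table corresponds to the maximum likelihood pick; the full table is the kind of candidate list the other selection methods would work from.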

Application UI and Predictions

Predictions

The image shows the main page for the app. A default phrase is loaded and dynamically returns the prediction when the app is first run. The side panel gives information about each of the prediction methods.

In addition, the student or user can use the Recipe tab to see the entire list (up to 10) of possible next words that were generated. Finally, the Learn More tab supplies links to relevant references and information.

The concept calls for a companion “cookbook” of lessons in text data preparation, text prediction and complete R code used in creating the tool.


Conclusion and Credits

Simple text prediction problems are a great starting point for students to explore many aspects of text mining, algorithms, statistics and data science. This app provides a framework for building a dynamic and accessible teaching tool.

Launch Word Salad

Thanks

Coursera, Johns Hopkins University and SwiftKey
R Project, RStudio and the R community
Fellow Data Science Certificate students, especially the Capstone Class!