Jesse Sharp
April 17, 2016
Word Salad is intended to be a starting platform for introducing students to natural language processing through word prediction. It is designed to be used on a device with enough computing power to run R and a Shiny app.
Goals:
The app is simple: it takes a phrase as input and returns a possible next word. Guesses are generated by three different methods using a database of N-grams.
First, Stupid Backoff is used to generate a list of possible words. Then a final prediction is made using:
For example, check each N-gram file for an “exact” match, choosing the file by the length of the query and shortening the query as needed, as shown below. Selecting the most frequently occurring (“most probable”) match then gives a backoff algorithm with a maximum likelihood estimate.
## Quad-Grams
if (length(x) == 3) {
  idx <- which(substr(quadGrams[, gram], 1, nchar(query)) == query)
  if (length(idx) > 0) {
    res <- head(quadGrams[idx, ], nwords)
    return(res)
  }
  ## Tri-Grams: back off to the last two words
  query <- paste(x[2], x[3])
  idx <- which(substr(triGrams[, gram], 1, nchar(query)) == query)
  if (length(idx) > 0) {
    res <- head(triGrams[idx, ], nwords)
    return(res)
  }
  ## Bi-Grams: back off to the last word
  query <- x[3]
  idx <- which(substr(biGrams[, gram], 1, nchar(query)) == query)
  if (length(idx) > 0) {
    res <- head(biGrams[idx, ], nwords)
    return(res)
  }
}
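The backoff lookup can also be tried in isolation on toy data. The following is a minimal sketch, not the app's actual code: the table contents, the `freq` column, and the helper names `lookup` and `predict_next` are all hypothetical, and plain data frames are assumed in place of whatever storage the app uses. It appends a space to the query so that a query ending in "the" does not match an n-gram containing "then".

```r
# Toy n-gram tables (hypothetical data): one n-gram and its count per row.
quadGrams <- data.frame(
  gram = c("thanks for the info", "thanks for the help"),
  freq = c(2, 5),
  stringsAsFactors = FALSE
)
triGrams <- data.frame(
  gram = c("for the first", "for the help", "of the year"),
  freq = c(9, 5, 4),
  stringsAsFactors = FALSE
)
biGrams <- data.frame(
  gram = c("the same", "the first", "salad bar"),
  freq = c(20, 9, 1),
  stringsAsFactors = FALSE
)

# Return up to nwords rows whose n-gram starts with the query plus a space,
# ordered from most to least frequent.
lookup <- function(grams, query, nwords) {
  prefix <- paste0(query, " ")
  hits <- grams[startsWith(grams$gram, prefix), , drop = FALSE]
  hits <- hits[order(-hits$freq), , drop = FALSE]
  head(hits, nwords)
}

# Back off from the longest context (3 words) to the shortest (1 word)
# and stop at the first table that yields a match.
predict_next <- function(phrase, nwords = 10) {
  x <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  tables <- list(quadGrams, triGrams, biGrams)
  for (i in seq_along(tables)) {
    n_context <- 4 - i                      # 3, 2, then 1 context words
    if (length(x) < n_context) next
    query <- paste(tail(x, n_context), collapse = " ")
    res <- lookup(tables[[i]], query, nwords)
    if (nrow(res) > 0) {
      # The last word of each matching n-gram is a candidate next word.
      res$word <- sub(".* ", "", res$gram)
      return(res)
    }
  }
  NULL  # no match at any order
}

predict_next("thanks for the")$word[1]   # "help"  (found in the quad-grams)
predict_next("waiting for the")$word[1]  # "first" (backed off to tri-grams)
```

Taking the first row is the maximum likelihood choice within the matched order. The published Stupid Backoff scheme additionally multiplies scores from lower orders by a fixed factor (0.4 in the original paper) so candidates from different orders can be ranked together; this sketch omits that step.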
The image shows the main page for the app. A default phrase is loaded when the app is first run, and its prediction is returned dynamically. The side panel gives information about each of the prediction methods.
In addition, the student or user can use the Recipe tab to see the entire list (up to 10) of possible next words that were generated. Finally, the Learn More tab supplies links to relevant references and information.
The concept calls for a companion “cookbook” of lessons in text data preparation and text prediction, together with the complete R code used in creating the tool.
Simple text prediction problems are a great launching place for students to explore many aspects of text mining, algorithms, statistics and data science. This app provides a framework for building a dynamic and accessible teaching tool.
Launch Word Salad
Thanks