Next Word Predictor

Leigh Goodenough

Next Word Predictor in R

As the capstone project of the Johns Hopkins Data Science Specialisation, I have developed a language model that predicts the next word when a string of words is provided as input.

This model has been deployed in Shiny and can be accessed via: https://leighgoodenough.shinyapps.io/wordpredict/

The app works simply: type text into the input box and predicted next words are generated automatically. Between 1 and 10 predictions can be requested.
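For illustration, a minimal sketch of such an interface is given below. The input IDs, layout and the predict_next_words() helper are all placeholders, not the deployed app's actual code.

library(shiny)

# Minimal sketch of the interface (IDs and layout are illustrative)
ui <- fluidPage(
  textInput("user_text", "Enter some text:"),
  sliderInput("n_preds", "Number of predictions:", min = 1, max = 10, value = 3),
  tableOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderTable({
    req(input$user_text)
    # predict_next_words() stands in for the app's prediction routine
    data.frame(prediction = predict_next_words(input$user_text, input$n_preds))
  })
}

shinyApp(ui, server)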

Initial versions of the language model used the Twitter, news and blog data provided as part of the Capstone course content.

To increase the prevalence of well-formed English in the corpus, several classic novels from Project Gutenberg were added.

Final corpus composition was:

  • 5,000 randomly selected lines of Twitter content.

  • 10,000 randomly selected lines of news content.

  • 20,000 randomly selected lines of blog content.

  • Full texts of The Picture of Dorian Gray, Heart of Darkness and Jane Eyre, with prefaces included but with Project Gutenberg metadata deleted.
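As a sketch, the random sampling step might look like the following. The file names follow the Capstone dataset's conventions but, like the seed value, are assumptions here.

set.seed(1234)   # reproducible sampling (seed value illustrative)
twitter <- readLines("en_US.twitter.txt", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    skipNul = TRUE)
blogs   <- readLines("en_US.blogs.txt",   skipNul = TRUE)

corpus_lines <- c(sample(twitter, 5000),
                  sample(news,   10000),
                  sample(blogs,  20000))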

Language Model Development

The tm library was used to transform the corpora, with help from the textreg library to switch between VCorpus and data-frame formats as required.
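A representative tm cleaning pipeline is sketched below. The exact transformations applied in development may have differed, and the conversion back to a data frame is shown here with base R rather than textreg.

library(tm)

# Illustrative cleaning pipeline on the sampled lines
vcorpus <- VCorpus(VectorSource(corpus_lines))
vcorpus <- tm_map(vcorpus, content_transformer(tolower))
vcorpus <- tm_map(vcorpus, removePunctuation)
vcorpus <- tm_map(vcorpus, removeNumbers)
vcorpus <- tm_map(vcorpus, stripWhitespace)

# Back to a plain data frame for later steps
corpus_df <- data.frame(text = sapply(vcorpus, as.character),
                        stringsAsFactors = FALSE)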

The RWeka library was used for tokenisation and to create n-grams from the corpus. This was particularly important in developing the list of words used as responses in the prediction model.
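For example, trigrams can be extracted as below. The frequency table shown is one way a response-word list could be derived; it is a sketch rather than the project's exact code.

library(RWeka)

# Tokenise the cleaned text into trigrams; Weka_control sets the n-gram range
trigrams <- NGramTokenizer(corpus_df$text, Weka_control(min = 3, max = 3))

# Frequency table, e.g. as a starting point for the response-word list
trigram_freq <- sort(table(trigrams), decreasing = TRUE)
head(trigram_freq, 10)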

The kgrams library was used to develop the language model itself. This proved succinct: once the corpus had been finalised, the model could be built with two lines of code:

freqs_corpus <- kgram_freqs(corpus, N = 5, verbose = FALSE)
kn_corpus <- language_model(freqs_corpus, "kn", D = 0.9)

The meaning of each element of this code is explained on the next slide.
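As a hedged sketch of how the finished model can be queried, the snippet below ranks candidate words using kgrams' probability() function and its %|% (word-given-context) operator. The word_list vector and the helper's name (matching the placeholder in the earlier interface sketch) are assumptions, not the app's actual code.

library(kgrams)

# Rank candidate continuations of `context` by their Kneser-Ney probability.
# `word_list` is the response-word list built earlier (assumed to be a
# plain character vector).
predict_next_words <- function(context, n = 3) {
  probs <- sapply(word_list,
                  function(w) probability(w %|% context, model = kn_corpus))
  names(sort(probs, decreasing = TRUE))[seq_len(n)]
}

predict_next_words("to be or not", n = 5)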

Language Model Refinement & Deployment

  • The Kneser-Ney method (“kn”) was used to smooth the language model. Based on the research I undertook, Kneser-Ney is widely regarded as a strong balance of accuracy and speed: Bayesian approaches can be more accurate but are slower, while other comparably fast methods have not matched Kneser-Ney's performance in published comparisons.

It was found through development testing that:

  • Working with n-grams up to order 5 was appropriate (set via ‘N = 5’ in the code).

  • A discounting rate of 0.9 produced better next-word predictions than other values (set via ‘D = 0.9’ in the code).

The corpus and predicted-word list were saved as RDS files and loaded into the Shiny app at start-up. The corpus was subsetted to 30,000 lines for use in the Shiny app, so that preparing the language model and responding to the user's first input did not take too long at start-up.
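A sketch of this persistence step, with illustrative file names, rebuilding the model with the same two lines shown earlier:

# During development: save the prepared objects (file names illustrative)
saveRDS(corpus_lines, "corpus.rds")
saveRDS(word_list, "word_list.rds")

# At Shiny app start-up: reload and rebuild the model from 30,000 lines
corpus_lines <- readRDS("corpus.rds")
word_list    <- readRDS("word_list.rds")
freqs_corpus <- kgram_freqs(corpus_lines[1:30000], N = 5, verbose = FALSE)
kn_corpus    <- language_model(freqs_corpus, "kn", D = 0.9)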

Performance Testing

The language model was evaluated by measuring its perplexity. The kgrams documentation includes code for testing a model's perplexity, which was adapted to compare candidate models during development:

# Grid of discount values to test
D_grid <- seq(from = 0.05, to = 0.95, by = 0.1)

# Test-set perplexity for a given discount D and n-gram order N
FUN_c1 <- function(D, N) {
  param(kn_test1, "N") <- N
  param(kn_test1, "D") <- D
  perplexity(corpus_test1, model = kn_test1)
}
P_grid_c1 <- lapply(2:5, function(N) sapply(D_grid, FUN_c1, N = N))

# Plot perplexity against discount rate, one line per n-gram order
plot(D_grid, P_grid_c1[[1]], type = "n", xlab = "D", ylab = "Perplexity", ylim = c(0, 350))
lines(D_grid, P_grid_c1[[1]], col = "red"); lines(D_grid, P_grid_c1[[2]], col = "chartreuse")
lines(D_grid, P_grid_c1[[3]], col = "blue"); lines(D_grid, P_grid_c1[[4]], col = "black")

# Minimum perplexity achieved by each n-gram order
sapply(c("2-gram" = 1, "3-gram" = 2, "4-gram" = 3, "5-gram" = 4),
       function(N) min(P_grid_c1[[N]]))

The first code block produces a graph of model perplexity for 2- to 5-gram models across a range of discounting rates. The second code block shows the minimum perplexity achieved by each n-gram order. These diagnostics informed the selection of the final model for deployment.