Next word prediction app

Remko Logemann

Capstone project

Objective: develop a text prediction model and deliver it as a Shiny app. The app predicts the next word in a sentence based on the preceding words provided by the user.

Presentation highlights:

  • Data product
  • Method, algorithm and data

Data product

The app can be found here


  • The user enters part of a sentence and clicks the predict button
  • The words used for the prediction are shown in red; the predicted next word is shown in blue

Method, algorithm and data

  • The Capstone dataset is used to develop the app
  • In total, 10% of the data was randomly sampled and split into:
    • Training set (80%)
    • Test set (20%)
  • The data was cleaned: text was lower-cased, and extra whitespace, punctuation, symbols, Twitter handles, etc. were removed
  • Next, n-grams (n = 2, 3, 4) were created. Example 3-grams for the sentence "a simple sample sentence here":
    • a simple sample
    • simple sample sentence
    • sample sentence here
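
The cleaning and n-gram steps above can be sketched as follows. This is a minimal illustration in Python, not the code behind the app; the function names clean_text and ngrams and the exact regular expressions are assumptions made for this example.

```python
import re
from collections import Counter

def clean_text(text):
    """Lower-case and strip handles, punctuation, symbols and extra whitespace."""
    text = text.lower()
    text = re.sub(r"@\w+", " ", text)         # drop Twitter handles such as @user
    text = re.sub(r"[^a-z' ]+", " ", text)    # drop digits, punctuation and symbols (apostrophes kept)
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = clean_text("A simple   sample sentence here!").split()
trigram_counts = Counter(ngrams(tokens, 3))  # frequency table for the 3-grams
```

For the sample sentence, ngrams(tokens, 3) yields the three 3-grams listed above, each with a count of 1 in trigram_counts.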

Method, algorithm and data (2)

  • The prediction model combines the n-gram model with a stupid back-off method:

    • A frequency table is created for each n-gram order.
    • Search the 4-gram table for entries whose first 3 words match the last 3 words of the input. Of all matches, select the most frequent 4-gram.
    • If no match is found, search the 3-gram table using the last 2 words.
    • If there is still no match, search the 2-gram table using the last word.
    • If nothing matches at all, predict the most frequent 1-gram (single word).
  • Various optimizations, such as frequency cut-offs, are applied to balance dataset size, speed and predictive performance.
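
The back-off lookup described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the app's implementation: build_tables, predict_next and the min_count parameter are hypothetical names, the input is assumed to be already cleaned, and the training data is assumed to be non-empty. It follows the first-match procedure from the slide and does not apply back-off score weights.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4, min_count=1):
    """Map each (n-1)-word context to a Counter of next words, for n = 2..max_n."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    unigrams = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    # Frequency cut-off (assumed form): drop n-grams seen fewer than min_count times.
    pruned = {}
    for n, table in tables.items():
        pruned[n] = {}
        for context, nexts in table.items():
            kept = Counter({w: c for w, c in nexts.items() if c >= min_count})
            if kept:
                pruned[n][context] = kept
    return pruned, unigrams

def predict_next(text, tables, unigrams, max_n=4):
    """Try the longest context first, then back off to shorter ones."""
    tokens = text.split()
    for n in range(max_n, 1, -1):
        if len(tokens) < n - 1:
            continue  # not enough input words for this n-gram order
        context = tuple(tokens[-(n - 1):])
        nexts = tables[n].get(context)
        if nexts:
            return nexts.most_common(1)[0][0]
    return unigrams.most_common(1)[0][0]  # fall back to the top 1-gram

corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat sat on the mat",
    "a dog sat here",
]
tables, unigrams = build_tables(corpus)
prediction = predict_next("cat sat on the", tables, unigrams)
```

Raising min_count shrinks the tables at the cost of losing rare n-grams, which is one way to trade size and speed against prediction quality.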