What's Next?

Kuba
December 27, 2020

Next-Word Guesser App

Subject

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking, and a whole range of other activities. But typing on mobile devices can be a serious pain.

The aim of this project is to develop an application that guesses the next word for a phrase the user has entered.

At a glance

  • the model computes the probability of the most likely next word given the first few letters/words of a text
  • the training data set, containing text examples, is provided by the SwiftKey company
  • text mining and Natural Language Processing are done with well-known R packages
  • both the model size and runtime are minimized to provide a reasonable user experience

Intermediate exploratory analysis includes:

  • data cleaning and tokenization (separating the data into smaller units such as words or phrases)
  • visualization of word/n-gram frequencies (illustrated in the sketch below)
    • an n-gram is a contiguous sequence of n words
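
A minimal sketch of this kind of frequency visualization, assuming the quanteda package (the post does not name the exact packages used) and a small illustrative character vector `lines` of raw text:

```r
# Tokenize raw text lines and plot the top-10 word frequencies;
# the input lines and the package choice are illustrative.
library(quanteda)

lines <- c("typing on mobile devices can be a pain",
           "mobile devices are used for email and banking")

toks <- tokens_tolower(tokens(lines, remove_punct = TRUE))
freq <- sort(colSums(dfm(toks)), decreasing = TRUE)
barplot(head(freq, 10), las = 2, main = "Top-10 word frequencies")
```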

Data Summary

The original data is obtained from three sources: blogs, news, and Twitter. Within each file (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt), every line is an extract from a single post, article, or tweet.

  • In total
    • 4 269 678 lines
    • 102 080 204 words
  • Longest line
    • overall: in en_US.blogs.txt, with 40 833 characters
    • en_US.twitter.txt: 140 characters, as expected (Twitter's character limit at the time)
  • Average line length
    • 68.68 characters in the case of en_US.twitter.txt
    • around 200 characters for the other two sources
  • Sample for building the model
    • 15% of lines from each source (blogs/ news/ twitter)
    • 640 451 lines; 15 294 010 words; ~1.5M tokenized sentences
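
A sketch of how such a per-source sample could be drawn; the file paths and the random seed are illustrative, as the post does not show its sampling code:

```r
# Summarize each corpus and draw a 15% random sample of its lines.
set.seed(2020)
files <- c("data/en_US.blogs.txt",
           "data/en_US.news.txt",
           "data/en_US.twitter.txt")

sample_lines <- unlist(lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  cat(f, ":", length(lines), "lines, longest line",
      max(nchar(lines)), "characters\n")
  sample(lines, size = round(0.15 * length(lines)))
}))
```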

How it Works

Model

  • the data sample is cleaned by converting to lowercase and removing punctuation, links, Twitter hashtags, extra white space, numbers, special characters, etc.
    • profanities and stop words are filtered out
  • 1-grams to 5-grams are generated
  • frequency tables for each unique n-gram are constructed (a sketch of the pipeline follows this list)
    • only n-grams with frequency > 1 are kept (for speed reasons)
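
A minimal sketch of this pipeline, again assuming quanteda; the profanity list and the exact cleaning options of the actual app are not shown in the post, so both are placeholders here. `sample_lines` is the sample drawn in the sketch above.

```r
# Clean a sample of lines and build 1- to 5-gram frequency tables.
library(quanteda)

profanity_words <- c("badword1", "badword2")  # placeholder list

clean_tokens <- function(lines) {
  toks <- tokens(lines,
                 remove_punct   = TRUE,
                 remove_url     = TRUE,
                 remove_numbers = TRUE,
                 remove_symbols = TRUE)
  toks <- tokens_tolower(toks)
  # stop-word and profanity filtering
  tokens_remove(toks, pattern = c(stopwords("en"), profanity_words))
}

build_freq_tables <- function(toks, max_n = 5) {
  lapply(seq_len(max_n), function(n) {
    grams <- tokens_ngrams(toks, n = n, concatenator = " ")
    freq  <- colSums(dfm(grams))
    sort(freq[freq > 1], decreasing = TRUE)  # keep frequency > 1 only
  })
}

freq_tables <- build_freq_tables(clean_tokens(sample_lines))
```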

Algorithm

  • the query input is pre-processed and tokenized, resulting in a sequence of words
  • starting from the last 5 (or fewer) words of the query, the Stupid Backoff algorithm is applied recursively: if the current n-word context yields no match, the search backs off to the last (n-1) words (a scoring sketch follows this list)
    • it is inexpensive to compute, and its accuracy approaches that of more sophisticated models when very large text corpora are used
  • all matching n-grams are aggregated and sorted by score in descending order
  • the top 10 (or fewer) options are suggested
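
A minimal sketch of Stupid Backoff scoring against the frequency tables built above (`freq_tables[[n]]` maps "w1 ... wn" strings to counts). The 0.4 discount factor follows Brants et al. (2007); the constant used in the actual app is not stated in the post:

```r
# Score candidate next words with Stupid Backoff; return the top 10.
predict_next <- function(query_words, freq_tables, lambda = 0.4) {
  scores <- numeric(0)
  n_max  <- min(length(query_words) + 1, 5)
  if (n_max < 2) return(scores)             # need at least one query word
  for (n in n_max:2) {
    context <- paste(tail(query_words, n - 1), collapse = " ")
    context_count <- freq_tables[[n - 1]][context]
    if (is.na(context_count)) next          # unseen context: back off
    tab  <- freq_tables[[n]]
    # n-grams whose first (n-1) words equal the context
    hits <- tab[startsWith(names(tab), paste0(context, " "))]
    words <- sub(".* ", "", names(hits))    # last word of each match
    s     <- lambda^(n_max - n) * hits / context_count
    for (i in seq_along(words)) {
      w <- words[i]
      # keep the best score seen so far for each candidate word
      if (!(w %in% names(scores)) || s[i] > scores[[w]]) scores[w] <- s[i]
    }
  }
  head(sort(scores, decreasing = TRUE), 10)
}

predict_next(c("can", "be", "a", "serious"), freq_tables)
```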

"What's Next?" App

The Shiny application What's Next? is deployed on RStudio's Shiny server.

Using the app is intuitive and easy (a minimal Shiny skeleton of such an app is sketched after the list):

  • enter letter(s), word(s), or a phrase in the input field
    • only the English language is supported so far
    • the more words are entered, the more accurate the prediction is
      • prediction uses up to 5 words
  • predicted next letters/words (up to 10 options) appear dynamically below
  • select a suggestion with the up/down arrow keys and Enter,
    • OR trigger autocomplete with the right arrow key
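
A minimal sketch of the kind of Shiny skeleton behind such an app (illustrative, not the deployed code). It reuses the `predict_next()` function and `freq_tables` from the sketches above, and renders suggestions as a plain table; the app's keyboard-navigable suggestion list would require custom JavaScript not shown here.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("What's Next?"),
  textInput("query", "Type a phrase:",
            placeholder = "only English is supported"),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    req(nzchar(trimws(input$query)))
    words  <- tolower(strsplit(trimws(input$query), "\\s+")[[1]])
    scores <- predict_next(tail(words, 5), freq_tables)  # last 5 words only
    data.frame(word = names(scores), score = as.numeric(scores))
  })
}

shinyApp(ui, server)
```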