What's Next?

Kuba
December 27, 2020

Next-Word Guesser App

Subject

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking, and a whole range of other activities. But typing on mobile devices can be a serious pain.

The aim of this project is to develop an application that guesses the next word for a phrase the user has entered.

At a glance

  • the model computes the probability of the most likely next word given the first few letters/words of a text
  • the training data set, containing text examples, is provided by the SwiftKey company
  • text mining and Natural Language Processing are done with well-known R packages
  • both the model size and runtime are minimized to provide a reasonable user experience

Intermediate exploratory analysis includes:

  • data cleaning and tokenization (separating the data into smaller units such as words or phrases)
  • visualization of word/n-gram frequencies (illustrated in the sketch below)
    • an n-gram is a contiguous sequence of n words
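
A minimal sketch of this kind of frequency visualization, assuming the quanteda package (the post does not name the exact packages used) and a small illustrative character vector `lines` of raw text:

```r
# Tokenize raw text lines and plot the top-10 word frequencies;
# the input lines and the package choice are illustrative.
library(quanteda)

lines <- c("typing on mobile devices can be a pain",
           "mobile devices are used for email and banking")

toks <- tokens_tolower(tokens(lines, remove_punct = TRUE))
freq <- sort(colSums(dfm(toks)), decreasing = TRUE)
barplot(head(freq, 10), las = 2, main = "Top-10 word frequencies")
```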

Data Summary

The original data is obtained from three sources: blogs, news, and Twitter. Within each file (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt), every line is an extract from a single post, article, or tweet.

  • In total
    • 4 269 678 lines
    • 102 080 204 words
  • Longest line
    • overall: in en_US.blogs.txt, with 40 833 characters
    • en_US.twitter.txt: 140 characters, as expected (Twitter's character limit at the time)
  • Average line length
    • 68.68 characters in the case of en_US.twitter.txt
    • around 200 characters for the other two sources
  • Sample for building the model
    • 15% of lines from each source (blogs/ news/ twitter)
    • 640 451 lines; 15 294 010 words; ~1.5M tokenized sentences
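
A sketch of how such a per-source sample could be drawn; the file paths and the random seed are illustrative, as the post does not show its sampling code:

```r
# Summarize each corpus and draw a 15% random sample of its lines.
set.seed(2020)
files <- c("data/en_US.blogs.txt",
           "data/en_US.news.txt",
           "data/en_US.twitter.txt")

sample_lines <- unlist(lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  cat(f, ":", length(lines), "lines, longest line",
      max(nchar(lines)), "characters\n")
  sample(lines, size = round(0.15 * length(lines)))
}))
```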

How it Works

Model

  • the data sample is cleaned by converting to lowercase and removing punctuation, links, Twitter hashtags, extra white space, numbers, special characters, etc.
    • profanities and stop words are filtered out
  • 1-grams to 5-grams are generated
  • frequency tables for each unique n-gram are constructed (a sketch of the pipeline follows this list)
    • only n-grams with frequency > 1 are kept (for speed reasons)
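
A minimal sketch of this pipeline, again assuming quanteda; the profanity list and the exact cleaning options of the actual app are not shown in the post, so both are placeholders here. `sample_lines` is the sample drawn in the sketch above.

```r
# Clean a sample of lines and build 1- to 5-gram frequency tables.
library(quanteda)

profanity_words <- c("badword1", "badword2")  # placeholder list

clean_tokens <- function(lines) {
  toks <- tokens(lines,
                 remove_punct   = TRUE,
                 remove_url     = TRUE,
                 remove_numbers = TRUE,
                 remove_symbols = TRUE)
  toks <- tokens_tolower(toks)
  # stop-word and profanity filtering
  tokens_remove(toks, pattern = c(stopwords("en"), profanity_words))
}

build_freq_tables <- function(toks, max_n = 5) {
  lapply(seq_len(max_n), function(n) {
    grams <- tokens_ngrams(toks, n = n, concatenator = " ")
    freq  <- colSums(dfm(grams))
    sort(freq[freq > 1], decreasing = TRUE)  # keep frequency > 1 only
  })
}

freq_tables <- build_freq_tables(clean_tokens(sample_lines))
```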

Algorithm

  • the query input is pre-processed and tokenized, resulting in a sequence of words
  • starting from the last 5 (or fewer) words of the query, the Stupid Backoff algorithm is applied recursively: if the current n-word context yields no match, the search backs off to the last (n-1) words (a scoring sketch follows this list)
    • it is inexpensive to compute, and its accuracy approaches that of more sophisticated models when very large text corpora are used
  • all matching n-grams are aggregated and sorted by score in descending order
  • the top 10 (or fewer) options are suggested
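
A minimal sketch of Stupid Backoff scoring against the frequency tables built above (`freq_tables[[n]]` maps "w1 ... wn" strings to counts). The 0.4 discount factor follows Brants et al. (2007); the constant used in the actual app is not stated in the post:

```r
# Score candidate next words with Stupid Backoff; return the top 10.
predict_next <- function(query_words, freq_tables, lambda = 0.4) {
  scores <- numeric(0)
  n_max  <- min(length(query_words) + 1, 5)
  if (n_max < 2) return(scores)             # need at least one query word
  for (n in n_max:2) {
    context <- paste(tail(query_words, n - 1), collapse = " ")
    context_count <- freq_tables[[n - 1]][context]
    if (is.na(context_count)) next          # unseen context: back off
    tab  <- freq_tables[[n]]
    # n-grams whose first (n-1) words equal the context
    hits <- tab[startsWith(names(tab), paste0(context, " "))]
    words <- sub(".* ", "", names(hits))    # last word of each match
    s     <- lambda^(n_max - n) * hits / context_count
    for (i in seq_along(words)) {
      w <- words[i]
      # keep the best score seen so far for each candidate word
      if (!(w %in% names(scores)) || s[i] > scores[[w]]) scores[w] <- s[i]
    }
  }
  head(sort(scores, decreasing = TRUE), 10)
}

predict_next(c("can", "be", "a", "serious"), freq_tables)
```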

"What's Next?" App

The Shiny application What's Next? is deployed on RStudio's Shiny server.

Using the app is intuitive and easy (a minimal Shiny skeleton of such an app is sketched after the list):

  • enter letter(s), word(s), or a phrase in the input field
    • only the English language is supported so far
    • the more words are entered, the more accurate the prediction is
      • prediction uses up to 5 words
  • predicted next letters/words (up to 10 options) appear dynamically below
  • select a suggestion with the up/down arrow keys and Enter,
    • OR trigger autocomplete with the right arrow key
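
A minimal sketch of the kind of Shiny skeleton behind such an app (illustrative, not the deployed code). It reuses the `predict_next()` function and `freq_tables` from the sketches above, and renders suggestions as a plain table; the app's keyboard-navigable suggestion list would require custom JavaScript not shown here.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("What's Next?"),
  textInput("query", "Type a phrase:",
            placeholder = "only English is supported"),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    req(nzchar(trimws(input$query)))
    words  <- tolower(strsplit(trimws(input$query), "\\s+")[[1]])
    scores <- predict_next(tail(words, 5), freq_tables)  # last 5 words only
    data.frame(word = names(scores), score = as.numeric(scores))
  })
}

shinyApp(ui, server)
```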