Next Word Predictor

Sentil Pillai
April 2015

Data Science Specialization SwiftKey Capstone

Natural Language Processing (NLP) Final Project

Introduction and Objectives

The SwiftKey keyboard has been installed on over 250 million handheld (Android, iOS) devices. It gives its user base a text input experience they love while also saving them time.

The objective of this project is to emulate SwiftKey's NLP, text mining and text prediction models with our own.

The tasks for the project are

  • Acquire and clean the data, and produce N-gram data tables.
  • Explore and analyze the data, and create a predictive algorithm.
  • Build a Shiny application that predicts the next single word from a given text phrase.

Tidying data

Three corpora of data (blogs, news and tweets) were provided by SwiftKey for this project, in four locales (German, Finnish, Russian, US English).

  • The US English files were chosen as the training datasets.
  • The three corpora combined were 583 MB in size and contained 4.2 million lines of text and 102 million words.
  • Cleaned the data of binary (non-printable) characters, emoticons, foreign words, numbers, extra white space and most punctuation.
  • All words were converted to lowercase and tokenized.
  • Combined adjacent words to create 2-, 3-, 4- and 5-word N-gram data tables.
  • The last word of each combination was labeled as predict, and the remaining words were combined and labeled as phrase in the data table.
  • The frequency of each phrase per predict value was calculated using the stylo package and stored as frequency (see the sketch after this list).
  • Profane words found in the predict values of the tables were masked as '#%!@?'.
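
The following is a minimal sketch of how such an N-gram table could be built. It assumes a data.table-based workflow with the phrase, predict and frequency columns described above; the helper name and example text are illustrative, and this is not the project's actual stylo-based code.

    library(data.table)

    build_ngram_table <- function(words, n) {
      # Slide a window of length n over the tokenized, lowercased word vector
      idx   <- seq_len(length(words) - n + 1)
      grams <- vapply(idx, function(i) paste(words[i:(i + n - 1)], collapse = " "),
                      character(1))
      dt <- data.table(gram = grams)
      # Split each n-gram: the last word becomes 'predict', the rest 'phrase'
      dt[, predict := sub("^.* ", "", gram)]
      dt[, phrase  := sub(" [^ ]+$", "", gram)]
      # Count how often each predict value follows each phrase
      dt[, .(frequency = .N), by = .(phrase, predict)]
    }

    words    <- tolower(unlist(strsplit("this is a test and this is only a test", " ")))
    trigrams <- build_ngram_table(words, 3)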

Algorithm and Prediction model

The approach was to create a predictive model small enough (less than 100 MB) to load onto the Shiny server, respond quickly (less than 100 ms) and still give a good prediction of the next word.

  • Low-frequency phrases were dropped from the N-gram data tables.
  • The frequency values were converted to integers to reduce memory use.
  • Only the top three predictions by frequency for each phrase were kept.
  • A back-off strategy was implemented using a scoring system applied to the N-gram data tables (see the sketch after this list).
    • A lambda weight was added to the frequency value and stored as score.
    • The highest scores belong to high-order N-grams with high frequencies.
  • The entered text phrase is searched for in all the N-gram data tables.
    • All matching rows are combined and ordered in descending order of score.
    • The predict value with the highest score is returned as the single-word prediction.
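
Below is a minimal sketch of the scoring and lookup steps. It assumes the N-gram tables are kept in a list keyed by N-gram order, with the phrase, predict and frequency columns described above; the lambda weights and function names are illustrative, not the values or code of the deployed model.

    library(data.table)

    # 'tables' is a named list such as list(`2` = bigrams, `3` = trigrams, ...)
    # The order-dependent weight makes high-order N-grams outrank low-order ones.
    score_tables <- function(tables, lambda = c(`2` = 0, `3` = 10, `4` = 20, `5` = 30)) {
      for (n in names(tables)) {
        tables[[n]][, score := as.integer(frequency) + lambda[[n]]]
      }
      tables
    }

    predict_next <- function(text, tables) {
      words <- tolower(unlist(strsplit(trimws(text), "\\s+")))
      hits <- rbindlist(lapply(names(tables), function(n) {
        k <- as.integer(n) - 1                        # phrase length for this table
        if (length(words) < k) return(NULL)
        key <- paste(tail(words, k), collapse = " ")  # last k words of the input
        tables[[n]][phrase == key]
      }), fill = TRUE)
      if (nrow(hits) == 0) return(NA_character_)
      hits[order(-score)][1, predict]                 # highest score wins
    }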

Application illustration

The Shiny app at https://sentilpillai.shinyapps.io/next/ has a simple interface: enter a text phrase in the text box and click the 'Predict next word' button. The single-word prediction is displayed in red next to it. Explore the other tabs in the right panel; they display additional information about the prediction model.
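
A minimal sketch of such an interface in Shiny is shown below; the input and output names are illustrative, and predict_next / scored_tables refer to the hypothetical helpers sketched in the previous sections, not the deployed app's code.

    library(shiny)

    ui <- fluidPage(
      textInput("phrase", "Enter a text phrase"),
      actionButton("go", "Predict next word"),
      # Single-word result shown in red next to the button
      span(textOutput("nextword", inline = TRUE), style = "color:red")
    )

    server <- function(input, output) {
      output$nextword <- renderText({
        input$go                                  # re-run when the button is clicked
        # 'scored_tables' would be the list produced by score_tables() above
        isolate(predict_next(input$phrase, scored_tables))
      })
    }

    shinyApp(ui, server)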