Next Word Prediction

Ralston Fonseca
05 November 2018

Overview

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. Typing on mobile devices can be a serious pain. SwiftKey, the corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. This capstone project applies data science to natural language processing.

The Goal: The goal of the project is to create a Shiny web application with a predictive model. Because mobile devices have limited resources, we chose the algorithm for both accuracy and speed.

Data

The corpus was taken from the site provided in the instructions. This data was not sufficient, so an additional corpus was also used to tokenise and create the n-grams. The data was cleaned before the n-grams were built:

  • Case: All words were converted to lower case.
  • Punctuation: Stripped.
  • Stopwords, URLs, emojis, numbers: Removed.
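The cleaning steps above can be sketched as follows. This is a minimal Python illustration, not the code used in the project (which was presumably written in R for the Shiny app). The function name `clean_text`, the regular expressions and the tiny `STOPWORDS` set are all assumptions made for the example; a real run would use a full stopword list.

```python
import re
import string
from typing import List

# Hypothetical, deliberately tiny stopword list (assumption for this sketch).
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "on"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NUMBER_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> List[str]:
    """Return the cleaned, lower-cased tokens of `text`."""
    text = text.lower()                 # case: everything to lower case
    text = URL_RE.sub(" ", text)        # URLs: removed before punctuation is stripped
    text = NUMBER_RE.sub(" ", text)     # numbers: removed
    text = text.translate(PUNCT_TABLE)  # punctuation: stripped
    # Keep only ASCII alphabetic tokens; this also drops emojis and other symbols.
    tokens = [t for t in text.split() if t.isascii() and t.isalpha()]
    # Stopwords: removed last, once the tokens are in their final form.
    return [t for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    print(clean_text("Check https://example.com NOW, the 3 cats are 😀 happy!"))
    # ['check', 'now', 'cats', 'are', 'happy']
```

URLs are removed before punctuation so that a stripped URL does not leave fragments such as `httpsexamplecom` behind as tokens.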

NOTE: Every effort was made to remove profanity. Given the volume of data, some may remain; please excuse any that does. No offence to any person or community is intended.

The Algorithm

The high-level steps are as follows:

  • The corpus was tokenised and used to create the n-grams that are searched for the next word. Because an external corpus was also used, the word frequencies may vary, and so will the results.
  • Latent Dirichlet Allocation (LDA), a popular topic-modelling method, was used to group the words by topic.
  • The predicted words are listed in order, based on the topic grouping and the n-gram frequency.
  • If no match is found among the trigrams, the bigrams are searched, and so on.
  • If no match is found among the bigrams either, the previous words in the input string are used. This may not give the best results, but it works around the limitation.
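The n-gram lookup with fallback described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the project's implementation: the names `build_ngram_tables` and `predict_next` are hypothetical, the LDA topic weighting is left out entirely, and the last fallback here is plain unigram frequency, which may differ from how the application handles the case where no bigram matches.

```python
from collections import Counter, defaultdict


def build_ngram_tables(tokens, max_n=3):
    """Map each context tuple (0 to max_n - 1 words) to a Counter of next words."""
    tables = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[tuple(context)][word] += 1
    return tables


def predict_next(tables, words, top_k=6, max_n=3):
    """Return up to top_k candidate next words, trying the longest context first."""
    suggestions = []
    # size = number of trailing input words used as context: 2 (trigram), 1 (bigram), 0 (unigram).
    for size in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - size:])
        for word, _count in tables.get(context, Counter()).most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == top_k:
                return suggestions
    return suggestions


if __name__ == "__main__":
    corpus = ("i love new york i love new shoes "
              "i love new york city i like new ideas").split()
    tables = build_ngram_tables(corpus)
    print(predict_next(tables, ["i", "love", "new"]))
```

Candidates from longer contexts are ranked ahead of those from shorter ones, and the shorter contexts are only used to fill the list up to `top_k` (six by default, matching the number of words the application displays).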

A look at the web application.

The application shows the predicted next word for the text entered in the left sidebar, along with a list of the top 6 candidate words (if that many exist). It is simple and user-friendly.

[Figure: screenshot of the web application]

How to access this application?

The application is hosted on shinyapps.io. Use the following link to access it: https://demo-shiny-apps.shinyapps.io/capstone_shinyapp/