Word Prediction App

Data Science Specialization

d_elia

Introduction

This presentation illustrates the main features of a word prediction app developed for the Capstone project of the Data Science Specialization course offered by Johns Hopkins University (JHU) and Coursera in partnership with SwiftKey.

The objective of the Capstone project is to build a word prediction app and demonstrate how data science can be applied in the area of natural language processing.

The Data

The data used to develop this application come from a corpus called HC Corpora. More details on the corpus can be found HERE.

A small sample of English text from blogs, Twitter, and news articles published on the web has been used to develop this application. The sample text was converted to lower case before being processed. Numbers, punctuation, extra whitespace, non-ASCII characters and profanity were also removed.
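The cleaning steps described above can be sketched roughly as follows. This is a minimal Python illustration, not the App's own code; the function name clean_text and the tiny PROFANITY set are hypothetical placeholders, since the actual word list is not given here.

```python
import re

# Hypothetical placeholder list; the App's real profanity list is not shown in this deck.
PROFANITY = {"darn", "heck"}

def clean_text(text, profanity=PROFANITY):
    """Lower-case text and strip numbers, punctuation, non-ASCII
    characters, profanity and extra whitespace."""
    text = text.lower()
    # Drop any character that cannot be encoded as ASCII.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Delete everything that is not a lower-case letter or whitespace
    # (this removes digits and punctuation in one pass).
    text = re.sub(r"[^a-z\s]", "", text)
    # split() collapses runs of whitespace; filter out profanity tokens.
    words = [w for w in text.split() if w not in profanity]
    return " ".join(words)
```

For example, clean_text("Hello, World! 123 café darn it") yields "hello world caf it" under these assumptions.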

The Algorithm

The App uses frequency tables of uni-grams (single words), bi-grams (two-word sequences), tri-grams (three-word sequences) and quadri-grams (four-word sequences) to predict the most likely next word.
When the user types or pastes text into the application, the last 3 words are searched for in the quadri-grams, then the last 2 words in the tri-grams, then the last word in the bi-grams, with the uni-grams as a final fallback.
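As a rough illustration of what such frequency tables look like, the sketch below counts n-grams of each order in a toy token list. It is written in Python for illustration only; the helper name ngram_counts and the sample sentence are assumptions, not part of the App.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus standing in for the cleaned sample text.
tokens = "the cat sat on the mat the cat ran".split()

# One table per n-gram order: 1 = uni-grams ... 4 = quadri-grams.
tables = {n: ngram_counts(tokens, n) for n in (1, 2, 3, 4)}
```

Here tables[2][("the", "cat")] is 2, because the bi-gram "the cat" occurs twice in the toy text.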

If a match for the last 3 words of the text is found, the App displays the 4th word of the most frequent matching quadri-gram.
Otherwise, if a match for the last 2 words of the text is found, the App displays the 3rd word of the most frequent matching tri-gram.
Otherwise, if a match for the last word of the text is found, the App displays the 2nd word of the most frequent matching bi-gram.
If no match is found, the App displays the most frequent uni-gram.
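The fallback rules above can be sketched as a simple back-off lookup. The Python below is a minimal illustration under assumed names (build_tables, predict_next) and a toy corpus; it is not the App's implementation.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=4):
    """For each n, map every (n-1)-word context to a Counter of following words."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[n][tuple(context)][word] += 1
    return tables

def predict_next(text, tables, max_n=4):
    """Try the longest context first, then back off to shorter ones."""
    words = text.lower().split()
    for n in range(max_n, 0, -1):
        k = n - 1  # number of context words needed for an n-gram
        if len(words) < k:
            continue
        context = tuple(words[len(words) - k:])  # empty tuple when k == 0
        followers = tables[n].get(context)
        if followers:
            # Last word of the most frequent matching n-gram.
            return followers.most_common(1)[0][0]
    return None  # only reached when the tables are empty

tokens = "the cat sat on the mat the cat sat on the sofa the dog ran".split()
tables = build_tables(tokens)
```

With this toy corpus, predict_next("a cat sat", tables) finds no quadri-gram match, backs off to the tri-gram context "cat sat", and returns "on"; an unseen word such as "zebra" falls through to the most frequent uni-gram, "the".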

Word Prediction - The Application

The App has been built using the Shiny package and is hosted on shinyapps.io. The App can be accessed at the following link.
This Shiny App accepts an input text, a word or a sentence, and returns the most likely next word.
The left sidebar panel consists of a text input widget where the user can type or paste text.
Below the typed text, a submit button widget is available to send the sentence to the server for processing.
The main panel of the application consists of two widgets used to display the prediction results:

1. The first widget repeats the sentence entered by the user.

2. The second widget displays the predicted next word.

Performance Notes

The App takes around 10 seconds to load.

The predicted next word is displayed on-the-fly with almost no waiting time after clicking the submit button.

To keep the App responsive, a tradeoff between accuracy and efficiency has been made. The App achieves an accuracy of about 20% with a computation time of around 3 seconds.
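How the accuracy figure was measured is not stated here. One common approach, sketched below in Python purely as an assumption, is to check on held-out sentences how often the top prediction equals the word that actually comes next. The function name top1_accuracy and its predict argument are hypothetical.

```python
def top1_accuracy(predict, sentences):
    """Fraction of positions where predict(prefix) equals the actual next word.

    `predict` is any function that takes a text prefix and returns one word.
    """
    hits = total = 0
    for sentence in sentences:
        words = sentence.lower().split()
        # Use every proper prefix of the sentence as a test case.
        for i in range(1, len(words)):
            total += 1
            if predict(" ".join(words[:i])) == words[i]:
                hits += 1
    return hits / total if total else 0.0
```

Any predictor with that interface can be passed in, so the same function could compare different n-gram orders or table sizes.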