Auto Word Predictor

Sanyam Jain
20th April, 2020

Capstone Project
Data Science Specialisation
Johns Hopkins University

Introduction

The aim of this project is to present a Shiny web app that predicts the next word based on the text a user has entered.
The user can choose to see up to three of the most likely next words using a drop-down.
The application uses an n-gram model to predict the next word.
Click here to access the app.

Methodology

Data Ingestion:

The text used to train the predictive model is a sample of 800,000 lines drawn from the large corpus of blogs, news and Twitter data provided by SwiftKey.

Data Cleaning and Preparation:

Removing web and email addresses.
Removing hashtags and Twitter handles from the tweets.
Removing ordinal numbers.
Removing profane words.
Removing punctuation and extra whitespace.
The cleaned text corpus is then converted to n-grams of up to four words (4-grams).
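
The cleaning and n-gram steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the app's own code (the app is a Shiny application). The function names `clean_text` and `count_ngrams`, the regular expressions, the lower-casing step and the placeholder `PROFANE_WORDS` list are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder list; the profanity list actually used is not given.
PROFANE_WORDS = {"darn", "heck"}

def clean_text(line):
    """Apply the listed cleaning steps to one line and return its tokens."""
    text = line.lower()                                 # assumption: case is folded
    text = re.sub(r"\S+@\S+", " ", text)                # email addresses
    text = re.sub(r"(https?://|www\.)\S+", " ", text)   # web addresses
    text = re.sub(r"[#@]\w+", " ", text)                # hashtags and Twitter handles
    text = re.sub(r"\b\d+(st|nd|rd|th)\b", " ", text)   # ordinal numbers
    text = re.sub(r"[^a-z0-9'\s]", " ", text)           # punctuation
    tokens = [t.strip("'") for t in text.split()]       # split() drops extra whitespace
    return [t for t in tokens if t and t not in PROFANE_WORDS]

def count_ngrams(lines, max_n=4):
    """Count every n-gram of length 1..max_n across the cleaned lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean_text(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

counts = count_ngrams(["The cat sat on the mat.", "The cat ran!"])
print(counts[2][("the", "cat")])
```

Email addresses are stripped before handles here so that the `@` in an address is not mistaken for the start of a Twitter handle.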

Model / Algorithm

When the user enters text, the algorithm iterates from the longest n-gram (4-gram) to the shortest (2-gram) to find a match.
Unigrams are left out as a trade-off in favour of performance.
The predicted next word is taken from the longest, most frequent matching n-gram.
The algorithm uses a simple back-off strategy.
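
A minimal sketch of this kind of back-off lookup is shown below, again in Python for illustration only. The names `build_model` and `predict_next`, the dictionary-based model layout and the toy training sentences are assumptions, not details taken from the app.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=4):
    """Map each context (the first n-1 words of an n-gram) to next-word counts."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):           # unigrams are skipped
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, text, k=1, max_n=4):
    """Return up to k next words from the longest matching context."""
    tokens = text.lower().split()
    for n in range(max_n, 1, -1):               # 4-gram, then 3-gram, then 2-gram
        if len(tokens) < n - 1:
            continue                            # not enough words typed for this order
        context = tuple(tokens[len(tokens) - (n - 1):])
        if context in model:
            # Most frequent continuations of the longest match come first.
            return [word for word, _ in model[context].most_common(k)]
    return []                                   # no match at any order

# Toy training data, made up for the example.
model = build_model([
    "i love new york",
    "i love new shoes",
    "i love new york city",
    "we love old york",
])
print(predict_next(model, "they really love new", k=2))
```

In this sketch, if the longest matching context has fewer than `k` continuations, the list is simply shorter; it does not top up the result from shorter n-grams.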

Using the Application

Start typing in the text box provided. The text box headed 'Input' shows the text you are entering.
Select the number of most likely words to predict using the drop-down (default value is 1).
The text box or boxes headed 'Prediction' show the most likely next-word predictions produced by the algorithm.