Auto Word Predictor

Sanyam Jain
20th April, 2020

Capstone Project
Data Science Specialisation
Johns Hopkins University

Introduction

  • The aim of this project is to present a Shiny web app that predicts the next word based on the text a user has entered.

  • The user can request up to three of the most likely next words using a drop-down.

  • The application uses an n-gram model to predict the next word.

  • Click here to access the app.

Methodology

Data Ingestion:

  • The text used to train the predictive model is a sample of 800,000 lines drawn from the large corpus of blog, news and Twitter data provided by SwiftKey.

Data Cleaning and Preparation:

  • Removing web and email addresses.
  • Removing hashtags and Twitter handles from the tweets.
  • Removing ordinal numbers.
  • Removing profane words.
  • Removing punctuation and extra whitespace.
  • The cleaned text corpus is then converted to n-grams of up to four words.
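
The cleaning and n-gram steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the code behind the app: the function names `clean_text` and `count_ngrams`, the regular expressions, the lowercasing step, the decision to keep apostrophes, and the placeholder `PROFANITY` set are all assumptions, since the slides do not specify them.

```python
import re
from collections import Counter

# Hypothetical placeholder; the slides do not say which profanity list was used.
PROFANITY = {"darn", "heck"}

def clean_text(line):
    """Apply the cleaning steps listed above to one line of text."""
    text = line.lower()                                   # assumption: case is folded
    text = re.sub(r"\S+@\S+", " ", text)                  # email addresses (before handles)
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)   # web addresses
    text = re.sub(r"[#@]\w+", " ", text)                  # hashtags and Twitter handles
    text = re.sub(r"\b\d+(?:st|nd|rd|th)\b", " ", text)   # ordinal numbers such as 21st
    text = re.sub(r"[^a-z0-9'\s]", " ", text)             # punctuation and other symbols
    words = [w for w in text.split() if w not in PROFANITY]
    return " ".join(words)                                # split/join collapses whitespace

def count_ngrams(lines, max_n=4):
    """Count 2-grams up to max_n-grams; returns {n: Counter of word tuples}."""
    counts = {n: Counter() for n in range(2, max_n + 1)}
    for line in lines:
        tokens = clean_text(line).split()
        for n in counts:
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts
```

Calling `count_ngrams(["I love data science", "I love data"])` would, for example, count the 2-gram `("i", "love")` twice and the 4-gram `("i", "love", "data", "science")` once.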

Model / Algorithm

  • When the user enters text, the algorithm iterates from the longest n-gram (4-gram) to the shortest (2-gram) to find a match.
  • Unigrams are left out as a trade-off in favour of performance.
  • The predicted next word is taken from the most frequent n-gram of the longest matching order.
  • The algorithm makes use of a simple back-off strategy.
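
A minimal sketch of the back-off lookup described above, in Python for illustration. The function name `predict_next`, the `TOY_COUNTS` table and its numbers are made up for the example, and the sketch assumes that all predictions come from the longest order that has any match; the slides do not say whether the app tops up the list from shorter n-grams.

```python
from collections import Counter

# Hypothetical toy frequency table: {n: Counter mapping word tuples to counts}.
TOY_COUNTS = {
    2: Counter({("of", "the"): 5, ("of", "a"): 3, ("the", "day"): 2}),
    3: Counter({("one", "of", "the"): 4, ("one", "of", "my"): 1}),
    4: Counter({("is", "one", "of", "the"): 2}),
}

def predict_next(text, counts, k=1, max_n=4):
    """Return up to k next-word candidates using simple back-off."""
    tokens = text.lower().split()
    for n in range(max_n, 1, -1):            # try 4-grams, then 3-grams, then 2-grams
        if len(tokens) < n - 1:
            continue                         # not enough words typed for this order
        context = tuple(tokens[-(n - 1):])   # last n-1 words of the input
        candidates = Counter({
            gram[-1]: freq
            for gram, freq in counts.get(n, {}).items()
            if gram[:-1] == context
        })
        if candidates:                       # longest matching order wins
            return [word for word, _ in candidates.most_common(k)]
    return []                                # no match; unigrams are not used

print(predict_next("this is one of", TOY_COUNTS))    # matched at the 4-gram level
print(predict_next("some of", TOY_COUNTS, k=2))      # backs off to 2-grams
```

The linear scan over each table keeps the sketch short; a deployed app would more likely index the n-grams by their context so that each lookup is a single table access.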

Using the Application

  • Start typing in the text box provided; the box with the heading 'Input' shows the text you have entered.
  • Select the number of most likely words to predict using the drop-down (the default value is 1).
  • The text box or boxes with the heading 'Prediction' show the most likely next words according to the algorithm.