The Word Predictor

Somnath Maji
February 24, 2018

Part of Capstone Project of Data Science Specialization

Introduction - The Word Predictor

This project attempts to predict the next word(s) as a user types. Users are welcome to visit the project website. As users type in the text box, the predicted next word(s) are displayed on the right.

This is the capstone project of the Data Science Specialization, in collaboration with SwiftKey.
The purpose of this project is to build a data product that predicts the next word(s) from the text entered by the user.
The project uses data from HC Corpora.

Data Processing & Analysis

Three data sets were created, one each for blogs, news and Twitter.
The data was cleaned by removing non-standard characters, profanity, URLs, punctuation, numbers and extra whitespace, and by converting the text to lower case.
Some exploratory analysis was carried out.
N-gram files (biGram, triGram and quadGram) were created by tokenizing the cleaned sample corpus file.
A small sample size was used to get acceptable performance.
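
The processing steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code. The function names, the regular expressions, the random seed and the one-word PROFANITY placeholder list are all assumptions made for the example.

```python
import random
import re
from collections import Counter

# Hypothetical placeholder; a real profanity list would be much longer.
PROFANITY = {"darn"}

def sample_lines(lines, fraction, seed=42):
    """Keep each line with probability `fraction` (e.g. 0.005 for 0.5%)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def clean(text):
    """Lower-case the text and strip URLs, digits, punctuation and profanity."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = text.replace("'", "")                        # "don't" -> "dont"
    text = re.sub(r"[^a-z\s]", " ", text)               # drop anything that is not a-z
    words = [w for w in text.split() if w not in PROFANITY]
    return " ".join(words)                              # also collapses whitespace

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def ngram_counts(lines, n):
    """Count the n-grams found in the cleaned lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(clean(line).split(), n))
    return counts
```

Calling ngram_counts(sample, 2), ngram_counts(sample, 3) and ngram_counts(sample, 4) on a sampled list of lines would give frequency tables comparable to the biGram, triGram and quadGram files mentioned above.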

Prediction Algorithm

Katz's back-off algorithm is used to predict the next word, based on a training dataset.

Overall, the calculations involved in this algorithm are inexpensive, and the predictions are quite accurate if the training set is large.
Essentially, if an n-gram has been seen more than k times in training, the conditional probability of a word given its history is proportional to the maximum likelihood estimate of that n-gram (discounted, so that some probability mass is reserved for unseen n-grams). Otherwise, the conditional probability is the back-off conditional probability of the (n - 1)-gram, scaled by a back-off weight.
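
To make the back-off idea concrete, here is a small Python sketch restricted to bigrams backing off to unigrams. It is a simplification under stated assumptions, not the application's implementation: the threshold k is taken as 0, a fixed discount d = 0.75 stands in for the Good-Turing discounting used in the full model, and all function names are hypothetical. The application itself works with n-grams up to quadGrams.

```python
from collections import Counter

def train(tokens):
    """Count unigrams and bigrams in a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def backoff_prob(word, prev, unigrams, bigrams, d=0.75):
    """P(word | prev) with bigram-to-unigram back-off (k = 0, fixed discount d)."""
    total = sum(unigrams.values())
    # Words observed after `prev` in training, with their counts.
    seen = {w: c for (p, w), c in bigrams.items() if p == prev}
    history_count = sum(seen.values())
    if history_count == 0:
        # The history was never observed: use the unigram estimate directly.
        return unigrams[word] / total
    if word in seen:
        # Seen bigram: discounted maximum likelihood estimate.
        return d * seen[word] / history_count
    # Unseen bigram: share the left-over mass (1 - d) among the words
    # never observed after `prev`, in proportion to their unigram frequency.
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen) / total
    if unseen_mass == 0:
        return 0.0
    alpha = (1.0 - d) / unseen_mass
    return alpha * unigrams[word] / total

def predict_next(prev, unigrams, bigrams, n=1):
    """Return the n most probable next words after `prev`."""
    ranked = sorted(unigrams,
                    key=lambda w: backoff_prob(w, prev, unigrams, bigrams),
                    reverse=True)
    return ranked[:n]

unigrams, bigrams = train("the cat sat on the mat the cat ate".split())
print(predict_next("the", unigrams, bigrams, n=2))
```

With this toy corpus, "cat" and "mat" are the only words seen after "the", so they receive the discounted bigram estimates, and the remaining probability mass is spread over the other words through the back-off weight. The probabilities for a given history sum to 1 over the vocabulary, which is the property the discounting is meant to preserve.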

For a fuller explanation of the model, see the Wikipedia article: https://en.wikipedia.org/wiki/Katz%27s_back-off_model

Shiny Application

The Word Predictor Shiny application is available at: https://myshiny-project.shinyapps.io/WordPredictor/

The application accepts the following user inputs:
- Phrase or text from the user
- Number of words to be predicted
- A check box to switch off the display of the entered text
Once a phrase or text is entered, the predicted word(s) are shown on the right-hand side.
The 'About' tab contains the user manual for the application.

Conclusion & Facts

Challenges faced during the preparation of this data product:

The RAM of the laptop was unable to handle the large size of the data.
A small sample was taken instead, representing 0.5% of the actual dataset.
As a result, the application may not reflect the true potential of its predictive mechanism.
Issues arose during deployment of the app to Shiny; these were later resolved by looking at the error messages in the Shiny log.
Further issues arose after deployment, as the app screen was getting disabled within a few seconds.

Thank you for trying this data product.