Next Word Prediction

This presentation was created as the final step of the Capstone project for the Data Science Specialization offered through Coursera / Johns Hopkins.

Author: Mohamed Rizwan

Date: 24/01/2020

The goal of the project

  • Build a Shiny application that predicts the next word from the text a user enters

Processing of the data

  • To build the prediction algorithm, data cleaning is performed on a sample drawn from the raw data
  • An additional data set, “bad-words.csv”, taken from www.kaggle.com, is used to remove profanity from the data
  • Unigrams, bigrams and trigrams are created with the ngram package, and adjusted counts and probabilities are calculated from the smoothed n-grams
  • The Good-Turing algorithm is used to create smoothed n-grams with adjusted counts and probabilities, along with probabilities for unseen n-grams
  • The processed data are saved as .rds and .R files for the Shiny application (a sketch of these steps follows this list)
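A minimal sketch of these processing steps in R is shown below. It assumes a character vector `sample_text` drawn from the raw data and a "bad-words.csv" file with one word per line; the object and file names, and the simple Good-Turing helper, are illustrative assumptions rather than the project's exact code.

    library(ngram)

    # Clean the sampled text: lower-case and drop numbers, symbols and punctuation
    clean_text <- tolower(sample_text)
    clean_text <- gsub("[^a-z' ]", " ", clean_text)
    clean_text <- gsub("\\s+", " ", clean_text)

    # Remove profanity using the bad-words.csv list from www.kaggle.com
    bad_words <- read.csv("bad-words.csv", header = FALSE, stringsAsFactors = FALSE)[, 1]
    tokens <- unlist(strsplit(clean_text, " "))
    tokens <- tokens[tokens != "" & !(tokens %in% bad_words)]
    corpus <- paste(tokens, collapse = " ")

    # Build unigram, bigram and trigram frequency tables with the ngram package
    uni <- get.phrasetable(ngram(corpus, n = 1))
    bi  <- get.phrasetable(ngram(corpus, n = 2))
    tri <- get.phrasetable(ngram(corpus, n = 3))

    # Good-Turing adjusted counts: c* = (c + 1) * N_{c+1} / N_c,
    # where N_c is the number of n-grams seen exactly c times
    good_turing <- function(freq) {
      Nc <- table(freq)                              # frequency of frequencies
      as.numeric(sapply(freq, function(c) {
        n_c1 <- Nc[as.character(c + 1)]
        if (is.na(n_c1)) return(c)                   # keep raw count where N_{c+1} is empty
        (c + 1) * n_c1 / Nc[as.character(c)]
      }))
    }

    tri$adj_count <- good_turing(tri$freq)
    tri$prob      <- tri$adj_count / sum(tri$freq)       # discounted probabilities
    p_unseen_tri  <- sum(tri$freq == 1) / sum(tri$freq)  # mass reserved for unseen trigrams

    # Save the processed tables for the Shiny application
    saveRDS(list(uni = uni, bi = bi, tri = tri), "ngrams.rds")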

The 'Prediction Model' algorithm

  • An n-gram model predicts the next word from the previous one or two words and handles unseen n-grams
  • The prediction model is based on the Katz back-off algorithm with Good-Turing smoothing
  • Trigrams are tried first, taking into account the last two words the user has provided
  • If no match is found, bigrams are used, taking the last word of the user input into account
  • If there is still no match, unigrams are used next
  • When no match is found at any level, the application reports that no match was found (a simplified sketch of the back-off follows this list)
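A simplified sketch of the back-off lookup is given below. It assumes the `uni`, `bi` and `tri` phrasetables saved earlier (each with `ngrams` and `prop` columns) and simply picks the highest-probability continuation at each level; a full Katz back-off would additionally redistribute the Good-Turing discounted mass across the lower-order models.

    # Try trigrams (last two words), then bigrams (last word), then unigrams;
    # return "UNK" when nothing matches at all.
    predict_next <- function(input, uni, bi, tri) {
      words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
      n <- length(words)

      pick <- function(tbl, prefix) {
        # phrasetables store n-grams as space-separated strings
        hits <- tbl[startsWith(tbl$ngrams, paste0(prefix, " ")), ]
        if (nrow(hits) == 0) return(NULL)
        best <- hits$ngrams[which.max(hits$prop)]
        tail(strsplit(trimws(best), " ")[[1]], 1)    # last word of the best match
      }

      out <- NULL
      if (n >= 2) out <- pick(tri, paste(tail(words, 2), collapse = " "))
      if (is.null(out) && n >= 1) out <- pick(bi, tail(words, 1))
      if (is.null(out)) out <- trimws(uni$ngrams[which.max(uni$prop)])
      if (is.null(out) || length(out) == 0 || out == "") "UNK" else out
    }

    # Example: predict_next("thank you for", uni, bi, tri)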

The Shiny Application

  • The app is titled “Johns Hopkins University Data Science Capstone Project 2020”
  • A navigation bar and a sidebar are present under the title
  • The navigation bar shows the “User Interface” and “About the application” sections
  • The User Interface section consists of a sidebar with a text box for input
  • The main panel shows the entered words and the sequence of predicted words
  • The user types a single word or one or more sentences in the text box provided
  • Abbreviations, numbers, symbols and punctuation are removed by the model before predicting the next word
  • When no match is found, the application returns “UNK”, meaning an unknown word (a minimal layout sketch follows this list)
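The layout described above might be wired up roughly as follows. The `predict_next()` function and the `ngrams.rds` file are carried over from the earlier sketches and are assumptions, not the application's actual source code.

    library(shiny)

    # Assumes the processed data and prediction function from the earlier sketches
    ngrams <- readRDS("ngrams.rds")

    ui <- navbarPage(
      "Johns Hopkins University Data Science Capstone Project 2020",
      tabPanel("User Interface",
        sidebarLayout(
          sidebarPanel(
            textInput("text", "Enter a word or sentence:")
          ),
          mainPanel(
            h4("Entered words"),               textOutput("entered"),
            h4("Sequence of predicted words"), textOutput("predicted")
          )
        )
      ),
      tabPanel("About the application",
        p("Next word prediction using Katz back-off with Good-Turing smoothing.")
      )
    )

    server <- function(input, output) {
      output$entered   <- renderText(input$text)
      output$predicted <- renderText({
        if (nchar(trimws(input$text)) == 0) return("")
        predict_next(input$text, ngrams$uni, ngrams$bi, ngrams$tri)
      })
    }

    shinyApp(ui, server)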