DATA SCIENCE CAPSTONE PROJECT: Word Predictor

Nithin
21 May, 2021

INTRODUCTION

The main purpose of this application is to predict the next word after the user has entered an input phrase.
This model was created for the final Data Science Capstone project by Johns Hopkins University. The SwiftKey corpus data used for building this model was downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Key features of this application include:

USER INTERFACE

In the application, there is a white box titled 'Please Enter Input' where users can enter their phrases.
After entering the input, press the SUBMIT button, then wait a few seconds for the application to predict the next word.

A bar chart is also plotted, showing the words most likely to follow the user input, with the predicted word highlighted in green.

[Figure: interface]

Points to note:

- Please bear in mind that some prediction errors are likely, as the model was built on a small sample of the SwiftKey corpus.
- Please do not enter numbers (e.g. 5, 125) as the last words of the input phrase.

CREATING THE MODEL

Creating the dataset used for the model:

1) The entire SwiftKey corpus (containing Twitter, news and blogs files) was downloaded from the link given in the Introduction.
2) The data was then sampled into a smaller set (8% of the total).
3) The sampled data was then cleaned by:
- Removing non-ASCII characters, punctuation, numbers, brackets, extra white space and other special characters.
- Filtering profanity by removing bad words and other degrading words.
4) The cleaned data was then tokenized into n-grams (i.e. unigrams, bigrams, trigrams and quadgrams), and each set was stored as a separate dataset.
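
The steps above can be sketched in Python as follows. This is only an illustrative sketch, not the code used to build the model: the function names, the fixed random seed, the placeholder profanity list and the lower-casing of the corpus are all assumptions made for the example.

```python
import random
import re
from collections import Counter

# Hypothetical placeholder; the actual profanity list is not given in the source.
PROFANITY = {"badword"}

def sample_lines(lines, fraction=0.08, seed=42):
    """Keep roughly `fraction` of the lines, chosen at random."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def clean_text(text):
    """Drop non-ASCII characters, keep only letters, and filter profanity."""
    # Lower-casing here is an assumption; the source only states it for user input.
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Replace punctuation, numbers, brackets and other symbols with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    words = [w for w in text.split() if w not in PROFANITY]
    # split/join also collapses any extra white space.
    return " ".join(words)

def ngram_counts(lines, n):
    """Count every run of n consecutive words across the cleaned lines."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts
```

Calling ngram_counts with n set to 1, 2, 3 and 4 on the cleaned sample would give the four separate datasets described in step 4.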

WORKING OF MODEL

STEPS OF PREDICTION ARE:
- The user input is first converted to lower case, and punctuation, numbers and extra white space are removed from it.
- The model predicts the next word based on the length of the input phrase.
- For example, if the user input has more than 3 words, the model first takes the last 3 words and searches the quadgrams dataset for a suitable next word. If no match is found, the last two words of the input are compared against the trigrams dataset. If there is still no match, the last word of the input is compared against the bigrams dataset. If that also fails, the model returns the most common words from the unigrams dataset.
- The model sorts the candidate words in descending order of probability and returns the most probable one as the predicted word. A bar chart showing the other likely words is also returned.
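
The back-off search described in the steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's actual implementation: build_tables, predict and the top_k parameter are hypothetical names, candidates are ranked by raw counts rather than any smoothed probability, and the bar chart is left out.

```python
import re
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4):
    """Map each context (tuple of preceding words) to a Counter of next words.

    tables[4] plays the role of the quadgrams dataset, tables[3] the trigrams,
    tables[2] the bigrams, and tables[1] (empty context) the unigrams.
    """
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                *context, nxt = words[i:i + n]
                tables[n][tuple(context)][nxt] += 1
    return tables

def predict(phrase, tables, top_k=5):
    """Return up to top_k candidate next words, most frequent first."""
    # Clean the input: lower case, then strip punctuation, numbers, extra spaces.
    words = re.sub(r"[^a-z\s]", " ", phrase.lower()).split()
    max_n = max(tables)
    # Start with the longest usable context and back off one word at a time.
    for n in range(min(max_n, len(words) + 1), 0, -1):
        context = tuple(words[len(words) - (n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            # most_common sorts in descending order of count.
            return [w for w, _ in candidates.most_common(top_k)]
    return []
```

The first element of the returned list corresponds to the predicted word, and the remaining elements are the other likely words that a bar chart could display.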