The accuracy of the model is only about 9%. However, it is a good learning exercise
Deep learning algorithms can produce more powerful language models for NLP and speech recognition, with reported accuracies of up to 99%. Hence, I strongly recommend using deep learning techniques
Processing of the data
To build the prediction algorithm, data cleaning is performed on a sample drawn from the raw data
Additional data set: “bad-words.csv” is taken from www.kaggle.com and used to remove profanity from the data
Unigrams, bigrams and trigrams are created with the ngram package, and adjusted counts and probabilities are calculated from the smoothed n-grams
The Good-Turing algorithm is used to create smoothed n-grams with smoothed counts and probabilities, along with the probabilities of unseen n-grams
The processed data is saved as .rds and .r files for the Shiny application
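The counting and smoothing steps above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the project's code, which uses the R ngram package. The function names `ngram_counts` and `good_turing` and the cutoff `k` are hypothetical, and the sketch applies the basic Good-Turing formula c* = (c + 1) * N_{c+1} / N_c, leaving a count unchanged when no n-gram has count c + 1.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (stored as a tuple of words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def good_turing(counts, k=5):
    """Return (adjusted_counts, p_unseen) using basic Good-Turing smoothing.

    counts: mapping from n-gram tuple to raw count.
    k:      only counts <= k are adjusted (an assumed cutoff, not from the source).
    """
    freq_of_freq = Counter(counts.values())  # N_c: how many n-grams occur c times
    total = sum(counts.values())             # N: total number of n-gram tokens
    adjusted = {}
    for gram, c in counts.items():
        n_c = freq_of_freq[c]
        n_c1 = freq_of_freq.get(c + 1, 0)
        if c <= k and n_c1 > 0:
            adjusted[gram] = (c + 1) * n_c1 / n_c
        else:
            adjusted[gram] = float(c)        # no N_{c+1} available: keep raw count
    # Probability mass reserved for n-grams never seen in the sample: N_1 / N
    p_unseen = freq_of_freq.get(1, 0) / total if total else 0.0
    return adjusted, p_unseen


if __name__ == "__main__":
    bigrams = ngram_counts("a b a b c".split(), 2)
    smoothed, p0 = good_turing(bigrams)
    print(smoothed, p0)
```

A production version would also smooth the frequency-of-frequency values themselves, since N_{c+1} is often zero for large counts.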
The 'Prediction Model' algorithm
An n-gram model predicts the next word from the previous one or two words and backs off to shorter contexts to handle unseen n-grams
The prediction model is based on the Katz back-off algorithm with Good-Turing smoothing
The trigram model is used first. It takes into account the last two words that the user has provided
If no match is found, the bigram model is used. It takes only the last word of the user input into account
If there is still no match, the unigram model is used next
If no match is found at any level, the application reports that no match was found
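The back-off order described above can be sketched as follows. This is a simplified illustration in Python, not the application's R code: it only shows the trigram, bigram, unigram fallback and picks the highest-count continuation, omitting the discounting and back-off weights of full Katz back-off. The names `best_continuation` and `predict_next` and the table layout (tuples of words mapped to counts) are assumptions made for the example.

```python
def best_continuation(table, context):
    """Return the most frequent last word among n-grams starting with `context`.

    table:   mapping from n-gram tuple to (smoothed) count.
    context: tuple of preceding words; empty for the unigram table.
    """
    k = len(context)
    matches = {gram[-1]: count for gram, count in table.items() if gram[:k] == context}
    return max(matches, key=matches.get) if matches else None


def predict_next(text, trigrams, bigrams, unigrams, unknown="UNK"):
    """Try the trigram table, then the bigram table, then the unigram table."""
    words = tuple(text.lower().split())
    for table, n_context in ((trigrams, 2), (bigrams, 1), (unigrams, 0)):
        if len(words) < n_context:
            continue  # not enough input words for this n-gram order
        context = words[len(words) - n_context:]
        word = best_continuation(table, context)
        if word is not None:
            return word
    return unknown  # nothing matched at any level


if __name__ == "__main__":
    tri = {("i", "love", "you"): 3, ("i", "love", "cats"): 1}
    bi = {("love", "you"): 3, ("like", "cats"): 2}
    uni = {("the",): 10, ("you",): 4}
    print(predict_next("I love", tri, bi, uni))
    print(predict_next("we like", tri, bi, uni))
```

Because the unigram table matches any input, the unknown marker is returned only when every table is empty. An implementation that should report "no match" more often would need a count or probability threshold at each level.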
The Shiny Application
The app is titled “Johns Hopkins University Data Science Capstone Project 2020”
A navigation bar and a sidebar appear under the title
The navigation bar shows the “User Interface” and “About the application” sections
The User Interface section consists of a sidebar with a text box for entering text
The main panel shows the “entered words” and the “sequence of predicted words”
The user types a single word or one or more sentences in the box provided
Abbreviations, numbers, symbols and punctuation are removed by the model before it predicts the next word
When no match is found, the application returns “UNK”, which means unknown word
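The input clean-up mentioned above can be sketched as a small function. This is a hypothetical Python illustration, not the application's R code. It assumes the input is lowercased and that apostrophes inside words are kept, neither of which is stated in the source, and it does not attempt abbreviation removal. The name `clean_input` is made up for the example.

```python
import re


def clean_input(text):
    """Lowercase the text and keep only alphabetic words.

    Digits, symbols and punctuation are dropped; an apostrophe is kept
    only when it sits between letters, as in "don't".
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


if __name__ == "__main__":
    print(clean_input("Hello, World! 123 don't #stop"))
```

The resulting word list can then be passed to the prediction step, which uses only the last one or two words.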