Mahendra Kumar lal
20-May-2019
Coursera Data Science Specialization Capstone Project
The data came from HC Corpora, in three files (Blogs, News and Twitter).
After loading the data, a sample was created, cleaned and prepared for use as a text corpus. The text was converted to lower case, and punctuation, links, extra whitespace, numbers and profanity were removed.
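The cleaning steps listed above can be sketched as follows. This is a minimal Python illustration, not the project's actual code (which is not shown here). The profanity list is a hypothetical placeholder, and keeping apostrophes inside words is an assumption; the pattern also drops non-ASCII letters, which is a simplification.

```python
import re

# Hypothetical placeholder list; the source does not say which profanity list was used.
PROFANITY = {"darn", "heck"}

def clean_text(text, banned=PROFANITY):
    """Lower-case the text and strip links, numbers, punctuation, banned words and extra whitespace."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # links first, while "://" is still intact
    text = re.sub(r"\d+", " ", text)                    # numbers
    text = re.sub(r"[^a-z'\s]", " ", text)              # punctuation and symbols (apostrophes kept: assumption)
    words = [w.strip("'") for w in text.split()]        # drop stray quotes around words
    words = [w for w in words if w and w not in banned]
    return " ".join(words)                              # split/join collapses repeated whitespace

print(clean_text("Visit https://example.com NOW, it's 100% darn good!"))
# prints: visit now it's good
```

Links are removed before punctuation so that the URL is still recognizable when its pattern is applied.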
The sample text was tokenized into n-grams to construct the predictive models. (Tokenization is the process of breaking a stream of text into words or phrases. An n-gram is a contiguous sequence of n items from a given sequence of text.)
The n-gram files or data.frames (unigram, bigram, trigram and quadgram) are tables of word-sequence frequencies, which the algorithm uses to predict the next word from the text entered by the user.
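To make the idea of n-gram frequency tables concrete, here is a small Python sketch that counts 1- to 4-grams over already cleaned lines. The function names `ngrams` and `build_tables` and the two sample sentences are illustrative assumptions; the project itself stores these tables as R data.frames.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_tables(lines, max_n=4):
    """Count n-gram frequencies for n = 1..max_n over already cleaned lines."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = line.split()                 # whitespace tokenization
        for n in tables:
            tables[n].update(ngrams(tokens, n))
    return tables

# Toy corpus, for illustration only
sample = ["the cat sat on the mat", "the cat ate the fish"]
tables = build_tables(sample)
print(tables[2].most_common(1))
# prints: [(('the', 'cat'), 2)]
```

N-grams are counted per line, so no sequence spans two separate lines of the sample.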
Capture input text, including all preceding words in the phrase
Iteratively traverse n-grams (longest to shortest) for matches
On a match, use the longest, most frequent n-gram
Last word in the matching n-gram is the predicted next word
If no match in {4, 3, 2}-grams, fall back to randomly selecting one of the most frequently occurring 1-grams (i.e. a common word)
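The steps above amount to a simple back-off search, which can be sketched in Python as below. This is an illustration under stated assumptions, not the app's actual R code: the name `predict_next`, the `top_k` cutoff for the random fallback, and the toy tables are all hypothetical, and the unigram table is assumed to be non-empty.

```python
import random
from collections import Counter

def predict_next(text, tables, max_n=4, top_k=3, rng=random):
    """Back off from the longest n-gram order to the shortest and return the predicted next word."""
    words = text.lower().split()
    for n in range(max_n, 1, -1):                 # 4-grams, then 3-grams, then 2-grams
        prefix = tuple(words[-(n - 1):])
        if len(prefix) < n - 1:                   # too few preceding words for this order
            continue
        matches = Counter({g: c for g, c in tables.get(n, {}).items()
                           if g[:-1] == prefix})
        if matches:
            best, _ = matches.most_common(1)[0]   # most frequent matching n-gram
            return best[-1]                       # its last word is the prediction
    # No match at any order: pick at random among the top_k most frequent single words
    common = [g[0] for g, _ in tables[1].most_common(top_k)]
    return rng.choice(common)

# Toy frequency tables keyed by n-gram order (illustrative values only)
tables = {
    1: Counter({("the",): 4, ("cat",): 2, ("sat",): 1}),
    2: Counter({("the", "cat"): 2, ("the", "mat"): 1, ("cat", "sat"): 1}),
    3: Counter({("the", "cat", "sat"): 1}),
    4: Counter(),
}
print(predict_next("I saw the cat", tables))  # prints: sat  (trigram match)
print(predict_next("feed the", tables))       # prints: cat  (bigram match)
```

The linear scan over each table keeps the sketch short; a real application would more likely index the tables by prefix so that each lookup is constant time.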
The Shiny application predicts the next likely word in a sentence.
The user enters text in an input box, and in a second box the application returns the most probable next word.
The predicted word is obtained from the n-gram tables by comparing the input against the frequencies of the tokenized 2-, 3- and 4-gram sequences.
As the user types, the field with the predicted next word refreshes instantaneously, and the prediction is offered for the user to choose.
text-predictor interactively performs word/phrase completion!
Application link: https://mahenlal.shinyapps.io/text-predictor/