Gangathren Pillay
20 April 2016
Data Scientist Specialization Capstone Project
The Next Word Prediction Shiny App predicts text using concepts from Natural Language Processing, and works much like the predictive keyboards found on smartphones. The reference data used to train the prediction model comprises text from blogs, tweets and news.
The key aspect of the algorithm is its use of n-grams. An n-gram is a contiguous sequence of n items from a given sequence of text. Before the n-grams were created (unigrams, bigrams and trigrams for this study), the sample text corpus was cleaned by removing numbers and punctuation, converting the text to lowercase and stripping extra whitespace. The cleaned text was then tokenized.
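The cleaning and n-gram steps described above can be sketched roughly as follows. This is a minimal Python illustration, not the app's own code: the function names (clean_and_tokenize, ngrams, build_counts) are hypothetical, and the regular expressions assume plain ASCII English text, so apostrophes are treated as punctuation and split contractions.

```python
import re
from collections import Counter

def clean_and_tokenize(text):
    """Lowercase, drop numbers and punctuation, then split into word tokens."""
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)    # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation (assumes ASCII English)
    return text.split()                    # split() also discards extra whitespace

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_counts(lines, max_n=3):
    """Count unigrams, bigrams and trigrams over an iterable of text lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean_and_tokenize(line)
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts

if __name__ == "__main__":
    sample = ["The cat sat on the mat.", "The cat ran 2 miles!"]
    counts = build_counts(sample)
    print(counts[2].most_common(3))
```

N-grams are built per line here so that no n-gram spans two unrelated sentences; that is a design choice of this sketch rather than something the summary states.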
The algorithm matches word combinations of one to three words against the n-gram database and returns predictions for the most likely next words.
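One way such a lookup could work is sketched below in Python. It is an assumption-laden illustration rather than the app's actual method: the summary does not say how misses are handled, so this sketch simply backs off from the longest available context to shorter ones, and the names build_model and predict_next are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=3):
    """Map each context (a tuple of 0 to max_n-1 words) to counts of the word that follows."""
    model = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, nxt = tokens[i:i + n]
            model[tuple(context)][nxt] += 1
    return model

def predict_next(model, words, k=3, max_n=3):
    """Try the longest context first, then back off to shorter ones (assumed strategy)."""
    for size in range(min(max_n - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - size:])
        if context in model:
            return [w for w, _ in model[context].most_common(k)]
    return []

if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat ran".split()
    model = build_model(tokens)
    print(predict_next(model, ["on", "the"]))  # trigram context is found
    print(predict_next(model, ["zebra"], k=1)) # unseen word, falls back to unigrams
```

With a trigram model the context used for a prediction is at most the last two words typed; when that context was never seen, the sketch falls back to the last word alone and finally to the most frequent single words.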