Priyadharshini D
9 October 2021
Coursera Data Science Specialization
Capstone Project
Johns Hopkins University
Recent advances in mobile technology allow people to stay connected on the go. However, without any word prediction, typing text quickly and accurately on a smartphone keypad is difficult.
Against this backdrop, the purpose of the Data Science Capstone project was to develop a data product that uses natural language processing to predict the next word a user may want to type. A Shiny application serves as the final product of this project.
The Next Word Predict app is located at:
The source code files can be found on GitHub:
An N-gram language model was used to create the prediction algorithm. This approach is based on the Markov assumption, which states that each word in a string of text depends only on a fixed number of preceding words. A set of four N-gram models was built for this project: unigram, bigram, trigram and quadgram models.
Maximum-likelihood estimation with Good-Turing smoothing is used to estimate the probability of candidate next words from this dictionary of N-grams. The prediction algorithm selects candidates from the appropriately ranked N-grams based on the length of the text input. This mechanism follows the Stupid Backoff algorithm, a simple, highly efficient and inexpensive method proposed by Thorsten Brants et al. (2007).
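As a rough illustration of the Stupid Backoff idea, the R sketch below scores a candidate next word, backing off to a shorter context with a fixed penalty whenever the longer n-gram is unseen. The data structure ngram_counts and the function name are hypothetical stand-ins; this is not the project's actual code.

# Illustrative Stupid Backoff scoring (hypothetical data structures).
# ngram_counts is a list of named count vectors keyed by n-gram order,
# e.g. ngram_counts[["2"]]["of the"] is the count of the bigram "of the".
stupid_backoff_score <- function(word, context, ngram_counts, alpha = 0.4) {
  # context: preceding words in order (oldest first)
  n <- length(context)
  if (n == 0) {
    cnt <- ngram_counts[["1"]][word]
    if (is.na(cnt)) cnt <- 0
    return(cnt / sum(ngram_counts[["1"]]))     # unigram relative frequency
  }
  history <- paste(context, collapse = " ")
  num <- ngram_counts[[as.character(n + 1)]][paste(history, word)]
  den <- ngram_counts[[as.character(n)]][history]
  if (!is.na(num) && !is.na(den) && den > 0) {
    num / den                                  # observed n-gram: relative frequency
  } else {
    # unseen n-gram: back off to a shorter context, discounted by alpha (0.4,
    # the value suggested by Brants et al., 2007)
    alpha * stupid_backoff_score(word, context[-1], ngram_counts, alpha)
  }
}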
Next Word Predict is a Shiny app that uses a text prediction algorithm to predict the next word(s) based on text entered by a user.
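A minimal Shiny skeleton with this structure might look as follows; the widget names and the predict_next_word() function are hypothetical stand-ins rather than the app's actual code.

# Minimal Shiny skeleton (illustrative only; names are assumptions)
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Predict"),
  textInput("user_text", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$user_text)
    predict_next_word(input$user_text)  # hypothetical prediction function
  })
}

shinyApp(ui = ui, server = server)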
The application suggests the next word in a sentence using an n-gram model. An n-gram is a contiguous sequence of n words from a given sequence of text.
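For instance, the bigrams and trigrams of a short sentence can be listed in base R as a toy illustration:

# Toy example: n-grams are contiguous word sequences
tokens <- c("thanks", "for", "the", "ride")
bigrams <- paste(head(tokens, -1), tokens[-1])
# "thanks for" "for the" "the ride"
trigrams <- paste(head(tokens, -2), tokens[2:(length(tokens) - 1)], tokens[-(1:2)])
# "thanks for the" "for the ride"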
The text used to build the predictive text model came from a large corpus of blog, news and Twitter data. N-grams were extracted from the corpus and used to build the model.
Various methods were explored to improve speed and accuracy using natural language processing and text mining techniques.
The predictive text model was built from a sample of 800,000 lines extracted from the large corpus of blog, news and Twitter data.
The sample data was then cleaned and tokenized using the tm package together with a number of regular expressions applied via the gsub function. As part of the cleaning process, the text was converted to lowercase, and non-ASCII characters, URLs, email addresses, Twitter handles, hashtags, ordinal numbers, profane words, punctuation and extra whitespace were removed. The data was then split into tokens (n-grams).
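A condensed sketch of these cleaning steps is shown below. The object names sample_lines and profanity_words are assumptions, and the regular expressions are illustrative rather than the exact patterns used in the project.

# Illustrative cleaning pipeline with tm and gsub (not the project's exact code)
library(tm)

clean_text <- function(x) {
  x <- tolower(x)
  x <- iconv(x, "latin1", "ASCII", sub = "")     # drop non-ASCII characters
  x <- gsub("http\\S+|www\\.\\S+", " ", x)       # URLs
  x <- gsub("\\S+@\\S+", " ", x)                 # email addresses
  x <- gsub("@\\w+|#\\w+", " ", x)               # Twitter handles and hashtags
  x <- gsub("\\b\\d+(st|nd|rd|th)\\b", " ", x)   # ordinal numbers
  x
}

corpus <- VCorpus(VectorSource(sample_lines))           # sample_lines: sampled text lines
corpus <- tm_map(corpus, content_transformer(clean_text))
corpus <- tm_map(corpus, removeWords, profanity_words)  # profanity_words: word list to filter
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)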
As text is submitted by the user, the algorithm looks for a match by iterating from the longest n-gram (the 4-gram) down to the shortest (the 2-gram). The longest, most frequently matching n-gram is used to predict the next word. The algorithm applies a simple back-off strategy, as sketched below.
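The lookup could be sketched as follows; ngram_tables (a list of data frames with history, next_word and count columns) and the fallback word are hypothetical assumptions, not the app's actual implementation.

# Sketch of the longest-match lookup with a simple back-off
# (ngram_tables and its structure are assumptions, not the app's actual code)
predict_next_word <- function(input_text, ngram_tables) {
  words <- strsplit(trimws(tolower(input_text)), "\\s+")[[1]]
  for (n in 3:1) {                      # try the quadgram, trigram, then bigram table
    if (length(words) < n) next
    history <- paste(tail(words, n), collapse = " ")
    tbl <- ngram_tables[[n + 1]]        # data frame: history, next_word, count
    hits <- tbl[tbl$history == history, ]
    if (nrow(hits) > 0) {
      return(hits$next_word[which.max(hits$count)])
    }
  }
  "the"                                 # fall back to a common unigram
}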