Nilosree Sengupta
12th October, 2020
Coursera Data Science Specialization
Capstone Project
Johns Hopkins University
This project involves Natural Language Processing. The critical task is to take a user's input phrase (group of words) and to output a predicted next word.
Project deliverables:
The next word prediction model uses the principles of “tidy data” applied to text mining in R. Key model steps:
Benefits: easy to read code; uses “pipes”; fast processing of training data; able to sample up to 25% of original corpus; relatively small output repos
The predictive text model was built from a sample of 800,000 lines extracted from the large corpus of blogs, news and Twitter data.
The sample data was then tokenized and cleaned using the tm package and a number of regular expressions applied with the gsub function. As part of the cleaning process, the data was converted to lowercase and stripped of non-ASCII characters, URLs, email addresses, Twitter handles, hashtags, ordinal numbers, profane words, punctuation and extra whitespace. The data was then split into tokens (n-grams).
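The cleaning steps listed above can be sketched as follows. This is a minimal illustration in Python rather than the project's R code (which used tm and gsub); the regular expressions, the placeholder PROFANITY set, the choice to keep apostrophes inside words, and the function names clean_line and ngrams are all assumptions made for the example.

```python
import re

# Hypothetical placeholder; the source does not say which profanity list was used.
PROFANITY = {"darn", "heck"}

def clean_line(text):
    """Apply the cleaning steps in order and return a list of word tokens."""
    text = text.lower()
    # Drop non-ASCII characters.
    text = text.encode("ascii", "ignore").decode("ascii")
    # URLs, then email addresses (before handles, so "me@x.org" goes as a unit).
    text = re.sub(r"(https?://|www\.)\S+", " ", text)
    text = re.sub(r"\S+@\S+", " ", text)
    # Twitter handles and hashtags.
    text = re.sub(r"[@#]\w+", " ", text)
    # Ordinal numbers such as 1st, 22nd, 3rd, 4th.
    text = re.sub(r"\b\d+(st|nd|rd|th)\b", " ", text)
    # Punctuation; apostrophes are kept so contractions survive (an assumption).
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    # split() also collapses runs of whitespace.
    words = [w.strip("'") for w in text.split()]
    return [w for w in words if w and w not in PROFANITY]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For example, clean_line followed by ngrams(tokens, 2) yields the bigrams of one cleaned line; a line shorter than n words yields an empty list.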
As the user enters text, the algorithm iterates from the longest n-gram (4-gram) to the shortest (2-gram) to find a match. The predicted next word is taken from the longest, most frequent matching n-gram. This is a simple back-off strategy.
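The back-off lookup described above can be sketched as follows. This is a minimal illustration in Python, not the project's R implementation; the nested-dictionary layout and the names build_model and predict_next are assumptions, and the sketch returns only the single most frequent continuation.

```python
from collections import Counter, defaultdict

def build_model(token_lines, max_n=4):
    """Count, for every n from 2 to max_n, which word follows each (n-1)-word context."""
    model = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for tokens in token_lines:
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[n][context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, phrase_tokens, max_n=4):
    """Try the longest context first and back off to shorter ones until a match is found."""
    for n in range(max_n, 1, -1):
        if len(phrase_tokens) < n - 1:
            continue  # phrase is too short for this n-gram order
        context = tuple(phrase_tokens[-(n - 1):])
        candidates = model[n].get(context)
        if candidates:
            # Most frequent continuation of the longest matching context.
            return candidates.most_common(1)[0][0]
    return None  # no n-gram of any order matched
```

With a toy corpus in which "brand new" is followed once by "car" while "new" alone is most often followed by "york", the phrase "brand new" predicts "car": the longer match takes priority over the more frequent shorter one.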
The next word prediction app provides a simple user interface to the next word prediction model.
Key Features:
Key Benefits:
The Shiny prediction app is located at:
The source code files can be found on GitHub: