Vimal Simha
12-Aug-2020
Data Science Capstone Project offered by Johns Hopkins University through Coursera
Aim is to build a model to predict the next word in a sentence using Natural Language Processing (NLP) techniques
Develop a Shiny App as a user interface for the model
Training data are collected from publicly available sources (blogs, news articles and Twitter feeds) using a web crawler and can be downloaded here.
Data are cleaned to remove punctuation, extra whitespace, numbers and profanities, then tokenised into words.
Contiguous sequences of words (n-grams) are extracted, their frequencies are calculated, and frequently occurring n-grams are indexed and saved.
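The cleaning step can be sketched roughly as follows. This is a minimal Python illustration, not the app's own code: the function name `clean_and_tokenise`, the placeholder `PROFANITIES` set and the choice to keep apostrophes inside words are all assumptions made for the example.

```python
import re

# Hypothetical placeholder list; a real profanity list would be loaded from a file.
PROFANITIES = {"darn", "heck"}

def clean_and_tokenise(text, banned=PROFANITIES):
    """Lower-case, strip numbers and punctuation, collapse whitespace, drop banned words."""
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)      # remove numbers
    text = re.sub(r"[^a-z'\s]", " ", text)   # remove punctuation, keep apostrophes (assumption)
    tokens = text.split()                    # split() also collapses extra whitespace
    tokens = [t.strip("'") for t in tokens]  # trim stray leading/trailing apostrophes
    return [t for t in tokens if t and t not in banned]
```

For example, `clean_and_tokenise("Heck,  I can't wait for 2021!!")` yields `['i', "can't", 'wait', 'for']`. This sketch only keeps ASCII letters, so accented characters would also be dropped.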
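A rough sketch of n-gram extraction and frequency filtering, in Python. The helper names `count_ngrams` and `frequent_ngrams` and the `min_count` cut-off are hypothetical; the source does not state which threshold the app uses.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n adjacent tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def frequent_ngrams(counts, min_count=2):
    """Keep only n-grams seen at least min_count times (hypothetical cut-off)."""
    return {gram: c for gram, c in counts.items() if c >= min_count}

tokens = "the cat sat on the mat and the cat slept".split()
bigrams = count_ngrams(tokens, 2)
kept = frequent_ngrams(bigrams, min_count=2)
```

Here only the bigram `('the', 'cat')` occurs twice, so it is the only entry in `kept`. Dropping rare n-grams keeps the saved index small, which matters for the responsiveness of an interactive app.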
The next word prediction is based on the Katz back-off model.
The last three words are used to predict the next word.
If there is no match above a likelihood threshold, the number of words considered is progressively reduced.
If no match is found, the algorithm returns the most commonly used single word.
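The back-off logic described above can be sketched as follows. This is a simplified lookup in Python, not a full Katz back-off implementation: it omits the discounting and probability redistribution of the Katz model and simply uses relative frequencies. The names `build_tables` and `predict_next` and the default `threshold` value are assumptions for illustration.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_context=3):
    """Map each context of 0..max_context words to a Counter of following words."""
    tables = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for k in range(max_context + 1):
            if i - k >= 0:
                tables[tuple(tokens[i - k:i])][word] += 1
    return tables

def predict_next(words, tables, max_context=3, threshold=0.3):
    """Back off from the longest context until a candidate clears the threshold."""
    for k in range(min(max_context, len(words)), 0, -1):
        context = tuple(words[len(words) - k:])
        followers = tables.get(context)
        if followers:
            word, count = followers.most_common(1)[0]
            # Relative frequency of the top candidate given this context.
            if count / sum(followers.values()) >= threshold:
                return word
    # No context matched: fall back to the most common single word.
    return tables[()].most_common(1)[0][0]

corpus = "the cat sat on the mat the cat ate the fish".split()
tables = build_tables(corpus)
```

With this toy corpus, `predict_next(["big", "on", "the"], tables)` finds no match for the three-word context, backs off to `("on", "the")` and returns `"mat"`; an unseen word such as `["xyz"]` falls through to the most common single word, `"the"`.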
The model can be extended to include more words, correlations and sentiment analysis, but at the cost of lower speed and higher computational expense.
The interactive Shiny App can be accessed at https://vimalsimha.shinyapps.io/wordpredictor/
Code for the app is provided at https://github.com/vimalsimha/nlpwordpredictor