Coursera JHU Data Science Specialization Capstone Project

Rajaram R
2021-02-07

This project uses Natural Language Processing (NLP) to predict the next word when a user enters a word or a sequence of words.

The data for this project is provided by SwiftKey. It comes in 4 languages and from 3 sources (blogs, news and Twitter).

Deliverables:

The next-word prediction model applies the principles of the text mining infrastructure in R. The prediction model steps are as follows:

Input: data from the HC Corpora corpus
A 20% sample is used because of the large file size and limited processing power
Clean training data: apply basic cleaning techniques (remove URLs, special characters, email addresses, profanity, ordinals and extra white space), then build 2-word, 3-word and 4-word n-grams stored in a data table
Sort the n-gram data by frequency
The n-gram functions use Katz's back-off model to predict the next word
Katz's back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It does this by backing off to progressively shorter histories under certain conditions, so that the model with the most reliable information about a given history is the one used for the prediction.
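
The steps above can be sketched in a few lines of base R. This is only an illustrative sketch, not the project's actual code: the toy corpus and the function names (sample_lines, clean_text, ngram_table, predict_next) are assumptions made for the example, plain data frames stand in for the data table, profanity filtering is left out, and the lookup is a simplified back-off that picks the most frequent continuation without Katz's discounting of probabilities.

```r
# Hypothetical toy corpus standing in for the blogs/news/twitter text.
corpus <- c(
  "Thank you for the follow! Visit http://example.com",
  "thank you for the help",
  "Thank you for your time, mail me at someone@example.com",
  "see you for the party"
)

# Draw a random fraction of the lines (the project samples 20%).
sample_lines <- function(lines, fraction = 0.2, seed = 1234) {
  set.seed(seed)
  lines[sample.int(length(lines), max(1, floor(length(lines) * fraction)))]
}

# Basic cleaning: lower-case, then strip URLs, emails, ordinals,
# special characters and extra white space.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)
  x <- gsub("\\b[0-9]+(st|nd|rd|th)\\b", " ", x, perl = TRUE)
  x <- gsub("[^a-z' ]", " ", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

# Count n-grams and return them sorted by decreasing frequency,
# split into the history (first n - 1 words) and the final word.
ngram_table <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    idx <- seq_len(length(w) - n + 1)
    vapply(idx, function(i) paste(w[i:(i + n - 1)], collapse = " "), character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(
    history = sub(" \\S+$", "", names(counts), perl = TRUE),
    word = sub("^.* ", "", names(counts), perl = TRUE),
    freq = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Simplified back-off: try the longest history first (4-gram table),
# then fall back to the 3-gram and 2-gram tables.
predict_next <- function(input, tables) {
  words <- strsplit(clean_text(input), " ", fixed = TRUE)[[1]]
  for (n in sort(as.integer(names(tables)), decreasing = TRUE)) {
    k <- n - 1
    if (length(words) < k) next
    history <- paste(tail(words, k), collapse = " ")
    tab <- tables[[as.character(n)]]
    hits <- tab[tab$history == history, ]
    if (nrow(hits) > 0) return(hits$word[1])  # rows are sorted by frequency
  }
  NA_character_  # no table has seen this history
}

cleaned <- clean_text(corpus)  # on real data: clean_text(sample_lines(lines))
tables <- lapply(c("2" = 2, "3" = 3, "4" = 4), function(n) ngram_table(cleaned, n))
print(predict_next("Thank you for", tables))
```

With this toy corpus, "Thank you for" is matched in the 4-gram table, while an input such as "we wait for" has no 4-gram or 3-gram match and falls back to the 2-gram table. A full Katz implementation would also discount the observed counts and redistribute the left-over probability mass to the shorter histories.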

The next-word prediction app provides a simple user interface where the user can enter words and see the predicted next word.

Features:

Data Table

Text Mining in R

Katz's back-off model

Shiny App

GitHub source code repository