Nikhil Prakash
May-19
The goal of this project is to create an application that predicts the next word in a phrase or sentence. It demonstrates the ability to process and analyze large volumes of unstructured text using text mining techniques: cleaning, sampling, and tokenization. As a final deliverable, we develop an algorithm that predicts the next word in a provided text, similar to the predictive text functions found on modern smartphones.
Below is the list of topics discussed on the following slides:
The data came from HC Corpora in three files (blogs, news, and Twitter) and was provided by SwiftKey.
The major tasks involved in this project were:
– Obtain the data, understand the problem, and clean the data accordingly.
– Perform exploratory analysis.
– Tokenize words and apply a predictive algorithm.
– Create an interactive application using Shiny.
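The cleaning, sampling, and tokenization steps listed above can be sketched roughly as follows. This is a minimal Python illustration, not the project's actual code: the function names `clean_line`, `sample_lines`, and `tokenize`, the specific cleaning rules (lowercasing, dropping URLs and non-letter characters), and the fixed seed are all assumptions made for the example.

```python
import random
import re


def clean_line(line):
    """Lowercase, drop URLs and non-letter characters, collapse whitespace."""
    line = line.lower()
    line = re.sub(r"https?://\S+", " ", line)   # remove URLs (assumed rule)
    line = re.sub(r"[^a-z' ]+", " ", line)      # keep letters, apostrophes, spaces
    return re.sub(r"\s+", " ", line).strip()


def sample_lines(lines, fraction, seed=42):
    """Keep a reproducible random subset of the lines (hypothetical helper)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def tokenize(line):
    """Split a cleaned line into word tokens."""
    return clean_line(line).split()
```

For example, `tokenize("Check THIS out: http://example.com now!!")` yields the four tokens `check`, `this`, `out`, `now`. Sampling before cleaning keeps memory use manageable on large corpora, at the cost of missing rarer phrases.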
NLP (N-Gram dictionary)
– For initial exploration, the analyst needs to build a dictionary of unigrams, bigrams, trigrams, and four-grams, collectively called n-grams.
– Unigrams are one-word phrases, bigrams are two-word phrases, trigrams are three-word phrases, and four-grams are four-word phrases.
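To make the n-gram dictionary concrete, here is a small Python sketch that counts unigrams through four-grams from tokenized sentences. The names `ngrams` and `build_dictionary` are hypothetical, and the input is assumed to be lists of already cleaned tokens.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_dictionary(sentences, max_n=4):
    """Count n-grams of each size from 1 to max_n across all sentences."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts
```

With the tokens of "the cat sat on the mat", `ngrams(tokens, 2)` gives five bigrams, starting with `("the", "cat")`, and the unigram `("the",)` is counted twice.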
The application uses text documents collected from blogs, news articles, and Twitter to statistically model language patterns. N-grams were used to predict the next word.
The 'PredictNextWord' Shiny app is a basic application that demonstrates how the prediction model works. It supports English only.
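One common way to predict a next word from n-gram counts is to look up the longest matching context and fall back to shorter ones when it has not been seen. The Python sketch below illustrates that idea; the backoff strategy, the names `train` and `predict_next`, and the choice of the single most frequent continuation are assumptions for illustration, since the slides do not state which prediction scheme the app uses.

```python
from collections import Counter, defaultdict


def train(sentences, max_n=4):
    """Map each context (tuple of preceding words) to counts of the next word."""
    model = defaultdict(Counter)
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    return model


def predict_next(model, text, max_n=4):
    """Try the longest context first, then back off to shorter ones."""
    tokens = text.lower().split()
    for k in range(max_n - 1, -1, -1):
        context = tuple(tokens[-k:]) if k else ()
        # Skip contexts longer than the input, and contexts never seen in training.
        if len(context) == k and context in model:
            return model[context].most_common(1)[0][0]
    return None  # empty model: nothing to predict
```

Given training sentences where "i love" is followed by "data" more often than anything else, `predict_next(model, "i love")` returns `"data"`. An unseen context such as "they love" backs off to the single-word context "love".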
Areas of improvement:
– UI design of the app.
– Input data validation.
– Increase sample size for more relevant predictions.
– A feedback loop so the model can learn from earlier predictions.
Conclusion:
– This project involved a lot of research in data pre-processing, text modeling, and NLP.
– All the skills gained throughout the specialization were used in this project.
– The entire specialization was fun to learn and required a great deal of research, which definitely increased my knowledge.