G Srinithin
7-23-2020
This is the capstone project of the Data Science Specialization
offered by Johns Hopkins University (JHU) on Coursera.
The main aim of the capstone is to build a model that
predicts the next word after a given word or phrase.
The dataset consists of three text files,
named twitter, blogs, and news.
These are in four different languages.
The entire model is built on the English dataset,
and a Shiny app was also developed to demonstrate how the
model works.
I removed punctuation and profanity.
Later I also removed numbers, then performed some
exploratory analysis on the text and examined the sentiment of
the lines to get a sense of the words used.
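The cleaning steps above can be sketched roughly as follows. This is a minimal Python illustration, not the code used in the project; the function name `clean_line` and the tiny `PROFANITY` list are hypothetical placeholders, and keeping apostrophes inside words is an assumption.

```python
import re

# Hypothetical placeholder list; a real profanity list would be much longer.
PROFANITY = {"darn", "heck"}

def clean_line(line, banned=PROFANITY):
    """Lowercase a line, drop numbers and punctuation, then drop banned words."""
    line = line.lower()
    line = re.sub(r"[0-9]+", " ", line)      # remove numbers
    line = re.sub(r"[^a-z'\s]", " ", line)   # remove punctuation, keep apostrophes
    words = [w.strip("'") for w in line.split()]
    return [w for w in words if w and w not in banned]

print(clean_line("Well, darn! I can't wait for 2020."))
```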
Later I used the cleaned dataset to build n-grams
of size 2 to 5. In each n-gram the last word is split off
into a separate column named “predict”.
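One way to picture the n-gram table is as pairs of a prefix and the word to predict. The sketch below is an assumption about the layout, not the project's actual code; `ngram_rows` is a hypothetical helper name.

```python
def ngram_rows(words, n):
    """Return (prefix, predict) pairs for every n-gram in a list of words."""
    rows = []
    for i in range(len(words) - n + 1):
        gram = words[i:i + n]
        # Everything but the last word is the prefix; the last word is "predict".
        rows.append((" ".join(gram[:-1]), gram[-1]))
    return rows

words = ["thanks", "for", "the", "follow"]
for n in range(2, 6):  # n-grams of size 2 to 5
    print(n, ngram_rows(words, n))
```

A line shorter than n words simply produces no rows for that n.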
Then I did some more exploratory analysis to find the most frequent n-grams,
and I also made a word cloud of them.
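Finding the most frequent n-grams comes down to counting. A small sketch using Python's standard `collections.Counter` is shown here; the helper name `top_ngrams` and the sample lines are made up for illustration.

```python
from collections import Counter

def top_ngrams(lines, n, k=3):
    """Count every n-gram across tokenized lines and return the k most common."""
    counts = Counter()
    for words in lines:
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(k)

lines = [["i", "love", "you"], ["i", "love", "it"], ["love", "you"]]
print(top_ngrams(lines, 2, k=2))
```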
I merged the n-grams from all three datasets into one CSV file.
I used the concept of the “Katz's back-off model” to predict the
next word.
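The back-off idea is: look up the longest available context first, and if it was never seen, drop the first word and try a shorter one. The sketch below shows only this fallback step with raw counts. It is a simplification, not full Katz back-off, which also discounts the counts and redistributes probability mass to the shorter contexts. The names `build_table` and `predict` are hypothetical.

```python
from collections import Counter, defaultdict

def build_table(lines, max_n=5):
    """Map every prefix of 1 to max_n - 1 words to a Counter of next words."""
    table = defaultdict(Counter)
    for words in lines:
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                prefix = " ".join(words[i:i + n - 1])
                table[prefix][words[i + n - 1]] += 1
    return table

def predict(table, query, k=1, max_n=5):
    """Return up to k next words, backing off to shorter contexts as needed."""
    context = query.lower().split()[-(max_n - 1):]
    while context:
        key = " ".join(context)
        if key in table:
            return [w for w, _ in table[key].most_common(k)]
        context = context[1:]  # back off: drop the first word and retry
    return []

lines = [["see", "you", "later"], ["thank", "you", "very", "much"],
         ["see", "you", "soon"], ["see", "you", "later"]]
table = build_table(lines)
print(predict(table, "hope to see you"))
```

Here the 4-word and 3-word contexts are unseen, so the lookup falls back to "see you".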
I used a “Markov chain transition matrix” to improve
the accuracy of the model.
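A first-order Markov chain transition matrix holds, for each word, the probability of every possible next word. How it was combined with the back-off lookup is not described here, so the sketch below only shows how such a matrix could be built from word pairs; `transition_matrix` and `most_likely_next` are hypothetical names.

```python
def transition_matrix(lines):
    """Build a row-normalized word-to-next-word matrix from tokenized lines."""
    vocab = sorted({w for words in lines for w in words})
    index = {w: i for i, w in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in vocab]
    for words in lines:
        for a, b in zip(words, words[1:]):
            counts[index[a]][index[b]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Words that never have a successor keep an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return vocab, matrix

def most_likely_next(vocab, matrix, word):
    """Return the highest-probability next word, or None if there is none."""
    if word not in vocab:
        return None
    row = matrix[vocab.index(word)]
    best = max(range(len(row)), key=row.__getitem__)
    return vocab[best] if row[best] > 0 else None

lines = [["i", "am", "here"], ["i", "am", "late"], ["i", "was", "here"]]
vocab, matrix = transition_matrix(lines)
print(most_likely_next(vocab, matrix, "i"))
```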
Later on I started building my data products.
The Shiny app takes a query (a word or phrase) typed into a
text field, and you select how many words you want it to
predict on a scale of 1 to 5.
It then shows the predicted words in a text field.
The link to the working model is “shiny app”.
The concepts were all learned from the links provided in the course and from the internet.