PRESENTATION ON CAPSTONE

G Srinithin
7-23-2020

INTRODUCTION

  • This is the capstone of the Data Science Specialization
    offered by JHU on Coursera.

  • The main aim of the capstone is to build a model that
    predicts the next word for a given word or phrase.

  • The dataset consists of three text files, named
    twitter, blogs and news.
    These are provided in four different languages.

  • The entire model is built on the English dataset,
    and a Shiny app was also developed to demonstrate
    how the model works.

DATA PREPROCESSING

  • I removed punctuation and profane words.

  • I then removed numbers as well, performed some
    exploratory analysis on the text, and examined the sentiment
    of the lines to get a sense of the words used.

  • Next, I used the cleaned dataset to build n-grams of
    size 2 to 5, with the last word of each n-gram split off
    into a separate column named “predict”.

  • I then did further exploratory analysis to find the most frequent n-grams,
    and also made a word cloud of them.
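
The cleaning and n-gram steps above can be sketched roughly as follows. This is a minimal Python illustration, not the code used in the project: the function names (clean_line, ngram_rows, build_table) and the tiny PROFANITY set are hypothetical, and dropping apostrophes before stripping punctuation is an assumption about how contractions were handled.

```python
import re
from collections import Counter

# Hypothetical stand-in for a profanity list; the list actually used is not given.
PROFANITY = {"darn", "heck"}

def clean_line(line):
    """Lowercase a line, strip punctuation and numbers, drop listed profane words."""
    line = line.lower().replace("'", "")      # assumption: "don't" becomes "dont"
    line = re.sub(r"[^a-z\s]", " ", line)     # removes punctuation and digits
    return [word for word in line.split() if word not in PROFANITY]

def ngram_rows(tokens, n):
    """Yield (prefix, predict) pairs: the first n-1 words and the last word."""
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        yield " ".join(gram[:-1]), gram[-1]

def build_table(lines, n):
    """Count how often each (prefix, predict) pair occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngram_rows(clean_line(line), n))
    return counts

sample = ["I love 2 eat pizza!", "I love pizza, darn it."]
bigrams = build_table(sample, 2)
trigrams = build_table(sample, 3)
```

Keeping the last word in its own "predict" field means a later lookup only has to match the prefix and then rank the candidate words by count.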

DATA MODELING

  • I merged the n-grams from all three datasets into one CSV file.

  • I used Katz's back-off model to predict the
    next word.

  • I used a Markov chain transition matrix to improve
    the accuracy of the model.

  • Later on, I started building my data products.
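
As a rough illustration of the back-off idea, the sketch below looks up the longest matching prefix first and falls back to shorter ones when nothing is found. It is a simplified count-based back-off and deliberately omits the discounting that a full Katz model applies, so it should not be read as the project's implementation. The names train, predict and transition_row are hypothetical. The transition_row helper shows how one row of a first-order Markov transition matrix can be read off the bigram counts.

```python
from collections import Counter, defaultdict

def train(sentences, max_n=3):
    """Map each prefix (tuple of up to max_n-1 words) to counts of the next word."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                prefix = tuple(tokens[i:i + n - 1])
                tables[prefix][tokens[i + n - 1]] += 1
    return tables

def predict(tables, query, k=3, max_n=3):
    """Return up to k guesses, backing off from the longest prefix to no prefix."""
    tokens = query.lower().split()
    guesses = []
    for size in range(min(max_n - 1, len(tokens)), -1, -1):
        prefix = tuple(tokens[len(tokens) - size:])
        for word, _ in tables.get(prefix, Counter()).most_common():
            if word not in guesses:
                guesses.append(word)
            if len(guesses) == k:
                return guesses
    return guesses

def transition_row(tables, word):
    """One row of a first-order Markov transition matrix: P(next | word)."""
    row = tables.get((word,), Counter())
    total = sum(row.values())
    return {nxt: count / total for nxt, count in row.items()} if total else {}

corpus = ["i love pizza", "i love pizza", "i love pasta"]
tables = train(corpus)
print(predict(tables, "i love", k=2))
print(predict(tables, "they love", k=1))
print(transition_row(tables, "love"))
```

The k argument mirrors the idea of asking for between 1 and 5 predictions; an unseen query still gets an answer because the empty prefix holds plain word frequencies.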

APP DESCRIPTION

  • The Shiny app takes a query (a word or phrase) typed into
    a text field, along with the number of words to predict,
    selected on a scale of 1 to 5.
    It then displays the predicted words in a text field.

The link to the working model is “shiny app”.

The concepts were all learned from the links provided and from the internet.

THANK YOU