Next Word Prediction

This presentation was created as the final step of the Capstone project for the Data Science Specialization offered through Coursera / Johns Hopkins.

Author: Mohamed Rizwan

Date: 24/01/2020

The goal of the project

  • Build a Shiny application that predicts the next word from the text a user enters

Processing of the data

  • To build the prediction algorithm, data cleaning is performed on a sample drawn from the raw data
  • An additional data set, “bad-words.csv”, taken from www.kaggle.com, is used to remove profanity from the data
  • Unigrams, bigrams and trigrams are created with the ngram package, and adjusted counts and probabilities are calculated from the smoothed n-grams
  • The Good-Turing algorithm is used to create smoothed n-grams with adjusted counts and probabilities, along with probabilities for unseen n-grams
  • The processed data are saved as .rds and .R files for the Shiny application (a sketch of these steps follows this list)
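A minimal sketch of these processing steps in R is shown below. It assumes a character vector `sample_text` drawn from the raw data and a "bad-words.csv" file with one word per line; the object and file names, and the simple Good-Turing helper, are illustrative assumptions rather than the project's exact code.

    library(ngram)

    # Clean the sampled text: lower-case and drop numbers, symbols and punctuation
    clean_text <- tolower(sample_text)
    clean_text <- gsub("[^a-z' ]", " ", clean_text)
    clean_text <- gsub("\\s+", " ", clean_text)

    # Remove profanity using the bad-words.csv list from www.kaggle.com
    bad_words <- read.csv("bad-words.csv", header = FALSE, stringsAsFactors = FALSE)[, 1]
    tokens <- unlist(strsplit(clean_text, " "))
    tokens <- tokens[tokens != "" & !(tokens %in% bad_words)]
    corpus <- paste(tokens, collapse = " ")

    # Build unigram, bigram and trigram frequency tables with the ngram package
    uni <- get.phrasetable(ngram(corpus, n = 1))
    bi  <- get.phrasetable(ngram(corpus, n = 2))
    tri <- get.phrasetable(ngram(corpus, n = 3))

    # Good-Turing adjusted counts: c* = (c + 1) * N_{c+1} / N_c,
    # where N_c is the number of n-grams seen exactly c times
    good_turing <- function(freq) {
      Nc <- table(freq)                              # frequency of frequencies
      as.numeric(sapply(freq, function(c) {
        n_c1 <- Nc[as.character(c + 1)]
        if (is.na(n_c1)) return(c)                   # keep raw count where N_{c+1} is empty
        (c + 1) * n_c1 / Nc[as.character(c)]
      }))
    }

    tri$adj_count <- good_turing(tri$freq)
    tri$prob      <- tri$adj_count / sum(tri$freq)       # discounted probabilities
    p_unseen_tri  <- sum(tri$freq == 1) / sum(tri$freq)  # mass reserved for unseen trigrams

    # Save the processed tables for the Shiny application
    saveRDS(list(uni = uni, bi = bi, tri = tri), "ngrams.rds")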

The 'Prediction Model' algorithm

  • An n-gram model predicts the next word from the previous one or two words and handles unseen n-grams
  • The prediction model is based on the Katz back-off algorithm with Good-Turing smoothing
  • Trigrams are tried first, taking into account the last two words the user has provided
  • If no match is found, bigrams are used, taking the last word of the user input into account
  • If there is still no match, unigrams are used next
  • When no match is found at any level, the application reports that no match was found (a simplified sketch of the back-off follows this list)
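A simplified sketch of the back-off lookup is given below. It assumes the `uni`, `bi` and `tri` phrasetables saved earlier (each with `ngrams` and `prop` columns) and simply picks the highest-probability continuation at each level; a full Katz back-off would additionally redistribute the Good-Turing discounted mass across the lower-order models.

    # Try trigrams (last two words), then bigrams (last word), then unigrams;
    # return "UNK" when nothing matches at all.
    predict_next <- function(input, uni, bi, tri) {
      words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
      n <- length(words)

      pick <- function(tbl, prefix) {
        # phrasetables store n-grams as space-separated strings
        hits <- tbl[startsWith(tbl$ngrams, paste0(prefix, " ")), ]
        if (nrow(hits) == 0) return(NULL)
        best <- hits$ngrams[which.max(hits$prop)]
        tail(strsplit(trimws(best), " ")[[1]], 1)    # last word of the best match
      }

      out <- NULL
      if (n >= 2) out <- pick(tri, paste(tail(words, 2), collapse = " "))
      if (is.null(out) && n >= 1) out <- pick(bi, tail(words, 1))
      if (is.null(out)) out <- trimws(uni$ngrams[which.max(uni$prop)])
      if (is.null(out) || length(out) == 0 || out == "") "UNK" else out
    }

    # Example: predict_next("thank you for", uni, bi, tri)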

The Shiny Application

  • The app is titled “Johns Hopkins University Data Science Capstone Project 2020”
  • A navigation bar and a sidebar are present under the title
  • The navigation bar shows the “User Interface” and “About the application” sections
  • The User Interface section consists of a sidebar with a text box for input
  • The main panel shows the entered words and the sequence of predicted words
  • The user types a single word or one or more sentences in the text box provided
  • Abbreviations, numbers, symbols and punctuation are removed by the model before predicting the next word
  • When no match is found, the application returns “UNK”, meaning an unknown word (a minimal layout sketch follows this list)
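The layout described above might be wired up roughly as follows. The `predict_next()` function and the `ngrams.rds` file are carried over from the earlier sketches and are assumptions, not the application's actual source code.

    library(shiny)

    # Assumes the processed data and prediction function from the earlier sketches
    ngrams <- readRDS("ngrams.rds")

    ui <- navbarPage(
      "Johns Hopkins University Data Science Capstone Project 2020",
      tabPanel("User Interface",
        sidebarLayout(
          sidebarPanel(
            textInput("text", "Enter a word or sentence:")
          ),
          mainPanel(
            h4("Entered words"),               textOutput("entered"),
            h4("Sequence of predicted words"), textOutput("predicted")
          )
        )
      ),
      tabPanel("About the application",
        p("Next word prediction using Katz back-off with Good-Turing smoothing.")
      )
    )

    server <- function(input, output) {
      output$entered   <- renderText(input$text)
      output$predicted <- renderText({
        if (nchar(trimws(input$text)) == 0) return("")
        predict_next(input$text, ngrams$uni, ngrams$bi, ngrams$tri)
      })
    }

    shinyApp(ui, server)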