Aug 6, 2021

About the project

This is a pitch deck for the JHU Coursera Capstone project, in which students develop an app that predicts the next word from the user's input, similar to the typing suggestions on mobile devices.

  • The dataset for building the prediction model was provided with the project instructions.
  • The prediction is based on n-gram and backoff models.
  • The app must be hosted on Shiny.io, accept a “user input”, and display the “prediction”.

Preliminary EDA

  • Skew toward any single source is reduced by combining data from blog, news and Twitter sources.
  • The sample size was chosen to balance accuracy and speed.
    • 10% of the data was sampled from each source.
  • Profanity and stop words are removed.
  • Data is NOT stemmed.
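
The sampling and cleaning steps above can be sketched roughly as follows. This is a minimal Python illustration, not the project's actual code: the function names, the fixed seed, and the tiny `STOP_WORDS` and `PROFANITY` sets are all hypothetical stand-ins for the real word lists.

```python
import random
import re

# Hypothetical, tiny stand-ins for the real word lists (assumptions for illustration).
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and"}
PROFANITY = {"darn"}


def sample_lines(lines, fraction=0.10, seed=42):
    """Return a random sample of roughly `fraction` of the lines."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(lines), max(1, round(len(lines) * fraction)))
    return rng.sample(lines, k)


def clean(line):
    """Lowercase and tokenize a line, then drop stop words and profanity.

    Tokens are kept in their original form: no stemming is applied.
    """
    tokens = re.findall(r"[a-z']+", line.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in PROFANITY]
```

Applying `sample_lines` to each source separately (blog, news, Twitter) and then passing every sampled line through `clean` would give the token lists from which the n-gram tables are built.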

For further details: Milestone Report

Prediction model & the APP

  1. The user enters one or more words.
  2. The word(s) are matched against the tokenized n-gram tables. The lookup proceeds from the highest-order n-grams (quadgrams at most) down to the lowest (bigrams at minimum).
  3. The match with the highest probability is selected as the prediction.
  4. The prediction is displayed on the user interface.
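
The lookup in steps 2 and 3 can be sketched as a simple backoff over count tables. The source does not say which backoff variant the app uses, so this is only one plausible reading: it takes the most frequent continuation at the longest context that matches. The names `build_tables` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_tables(tokens, max_n=4):
    """Map each n in 2..max_n to {context tuple: Counter of next words}."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            tables[n][context][next_word] += 1
    return tables


def predict_next(words, tables, max_n=4):
    """Try the longest context first and back off to shorter ones.

    Returns the most frequent continuation of the first context that
    matches, or None if no table contains a match.
    """
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough input words for this n-gram order
        context = tuple(words[-(n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None
```

Within a single context, the highest count is also the highest relative frequency, so picking the top count matches step 3. For the prediction to line up with the tables, the user input would presumably need the same cleaning as the training data.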

“Next Word Prediction App” on Shiny.io

Next steps

  • Extend the prediction model to predict more than one word.
  • Further optimize the data size and the cleaning process for speed and accuracy.
  • Explore techniques other than the n-gram and backoff models.