Predict Next Word - Capstone project

Vishakha Mujoo
25-April-2015

Introduction

The purpose of this project was to build an app that predicts the next word

  • Three types of data were provided - Twitter, News, Blogs
  • Appropriate data cleaning and subsetting techniques were applied to finalize training data
  • A Katz back-off predictive algorithm was used
  • It can be used on mobile platforms and in web applications, saving the user time and improving the user experience

LINK : https://mujoo.shinyapps.io/nextwordAPP/

Data Handling and Cleaning

  • A subset of the original data was randomly selected from all three sources and merged into one
  • Data cleaning involved converting to lower case and removing punctuation, numbers, and non-printable characters
  • Four sets of N-grams (word combinations of 4, 3, 2, and 1 words) were then created
  • After calculating cumulative frequencies, these four N-gram sets were sorted and saved
  • Low-frequency n-grams were then filtered out to reduce the size of the sets and improve performance
  • The four n-gram objects were saved as .RData files, which is the reason for the fast performance and short wait time
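
The cleaning and counting steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the R code behind the app: the function names (clean_text, count_ngrams, build_tables) and the min_count threshold are assumptions made for the example, and the sample lines are made up.

```python
import re
from collections import Counter

def clean_text(text):
    """Lower-case the text and delete everything except a-z and whitespace
    (this removes punctuation, digits, and non-printable characters)."""
    return re.sub(r"[^a-z\s]", "", text.lower()).split()

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def build_tables(lines, max_n=4, min_count=2):
    """Build one frequency table per n-gram order (1..max_n).

    min_count is a hypothetical cut-off: n-grams seen fewer times are dropped.
    Each table is a list of (ngram, count) sorted by descending count.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean_text(line)
        for n in counts:
            counts[n].update(count_ngrams(tokens, n))
    return {
        n: sorted(
            ((gram, c) for gram, c in table.items() if c >= min_count),
            key=lambda item: (-item[1], item[0]),
        )
        for n, table in counts.items()
    }

# Made-up sample lines standing in for the Twitter/News/Blogs subset
sample = ["I love R!", "I love R, 2 much.", "I love tea"]
tables = build_tables(sample)
print(tables[2])  # [(('i', 'love'), 3), (('love', 'r'), 2)]
```

In this sketch the sorted tables would then be written to disk once, so that the app only has to load them at start-up instead of recomputing them.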

Word Prediction Algorithm

Steps

  • Load the four compressed .RData files, which contain sorted N-grams with cumulative frequencies
  • Clean the user-specified sequence of words using the same technique that was used to clean the training data
  • Based on the number of words specified by the user, extract the last 3, 2, or 1 word(s)
  • First try the 4-gram, then back off to the 3-gram, and so on
  • If no match is found, use the most frequent word from the 1-gram as the next word
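
The lookup steps above can be sketched as below. This is a minimal Python illustration, not the app's R code: the names (clean, predict_next) and the toy tables are hypothetical, and the tables are assumed to be lists of (ngram, count) already sorted by descending count. It shows only the plain fall-through from longer to shorter n-grams described in the steps; the discounting and back-off weights of full Katz back-off are not modeled.

```python
import re

def clean(text):
    """Same cleaning as assumed for the training data: lower-case, keep a-z only."""
    return re.sub(r"[^a-z\s]", "", text.lower()).split()

def predict_next(phrase, tables, max_n=4):
    """Return (predicted_word, n), where n is the n-gram order that matched.

    tables[n] is a list of (ngram_tuple, count) sorted by descending count.
    """
    tokens = clean(phrase)
    # Start with the longest context the input allows (at most max_n - 1 words)
    for n in range(min(max_n, len(tokens) + 1), 1, -1):
        context = tuple(tokens[-(n - 1):])
        for gram, _count in tables.get(n, []):
            if gram[:-1] == context:
                return gram[-1], n   # first hit is the most frequent one
    # No match at any order: fall back to the most frequent single word
    return tables[1][0][0][0], 1

# Toy tables with made-up counts
tables = {
    4: [(("i", "love", "r", "much"), 2)],
    3: [(("i", "love", "r"), 3)],
    2: [(("i", "love"), 5), (("love", "r"), 3), (("love", "tea"), 2)],
    1: [(("the",), 9), (("i",), 5)],
}

print(predict_next("Well, I love", tables))  # ('r', 3)
print(predict_next("they love", tables))     # ('r', 2)
print(predict_next("xyz", tables))           # ('the', 1)
```

Returning the matched order alongside the word makes it easy to report which n-gram produced the prediction.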

Shiny Application

  • The user enters a sequence of words in the text box and then clicks the “next word” button
  • The predicted next word appears, along with a note on which n-gram was used for the prediction
  • The sentence entered by the user is also displayed in the Shiny GUI
