Next Word Prediction

Johann Posch
December 2014

Coursera Data Science Capstone Project

  • A web application that assists a user typing text by predicting the next word for a partial sentence.

Text Cleaning and Preparation

  • a text collection (news, blogs, twitter) is used to train a model
  • raw text is cleaned, prepared and partitioned for iterative or parallel processing
    • load and merge the original text (Blog, News, Twitter)
    • parse each line and find dates, $ values, numbers and ','
    • remove all non-printable characters, convert to lower case, trim each line
    • split line into sentences
    • save pre-processed text
    • split pre-processed text into training and testing set
    • split training set into N partitions
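
The preparation steps above could be sketched roughly as follows. This is a minimal illustration in Python, not the project's code. The function names (`clean_line`, `split_sentences`, `train_test_split`, `partition`), the naive sentence-splitting rule, the 25% test fraction and the round-robin partitioning are all assumptions made for the example. The tagging of dates, $ values and numbers is left out.

```python
import random
import re
import string

PRINTABLE = set(string.printable)

def clean_line(line):
    """Drop non-printable characters, lower-case, and trim one line."""
    kept = "".join(ch for ch in line if ch in PRINTABLE)
    return kept.lower().strip()

def split_sentences(line):
    """Split a cleaned line after '.', '!' or '?' (a naive, assumed rule)."""
    parts = re.split(r"(?<=[.!?])\s+", line)
    return [p for p in parts if p]

def train_test_split(sentences, test_fraction=0.25, seed=1):
    """Shuffle a copy of the sentences and split it into training and testing sets."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def partition(items, n):
    """Distribute items round-robin into n partitions."""
    return [items[i::n] for i in range(n)]

# Hypothetical two-line corpus standing in for the merged blog/news/twitter text.
raw = ["  Hello World!  How are you?\x00 ", "It costs $5.  Fine."]
sentences = [s for line in raw for s in split_sentences(clean_line(line))]
train, test = train_test_split(sentences)
parts = partition(train, 2)
```

Partitioning the training set this way lets each partition be processed in its own step, or handed to a separate worker.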

Building a Predictive Model

  • a model is constructed to predict the next word for a phrase
  • a phrase is composed of a prefix followed by a word
  • formally, the model predicts word Y given prefix X, where
    • Y .. is the next word (e.g. 'time')
    • X .. is zero or more words (e.g. 'at the first')
  • the probability of Y given X is estimated as:
    • P(Y | X) = count of X followed by Y / count of X
  • N-grams of length 1..4 are used to build a Markov chain
  • R data tables are used for fast in-memory lookup
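
As a rough illustration of the counting behind such a model, the sketch below estimates P(Y | X) as the number of times prefix X is followed by word Y, divided by the number of times X is followed by any word. It is written in Python with plain dictionaries rather than R data tables; the names `count_ngrams` and `probability` and the toy sentences are assumptions for the example, not the project's code.

```python
from collections import Counter, defaultdict

MAX_ORDER = 4  # n-grams of length 1..4, i.e. prefixes of 0..3 words

def count_ngrams(sentences, max_order=MAX_ORDER):
    """Return {prefix tuple: Counter of next words} for all prefixes of 0..max_order-1 words."""
    table = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            # k is the prefix length; it cannot exceed the words seen so far.
            for k in range(min(i, max_order - 1) + 1):
                table[tuple(words[i - k:i])][word] += 1
    return table

def probability(table, prefix, word):
    """Estimate P(word | prefix) from the counts; 0.0 if the prefix was never seen."""
    followers = table.get(tuple(prefix))
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())

# Toy training data (hypothetical).
table = count_ngrams(["at the first time", "at the first sight", "at the first time"])
p_time = probability(table, ("at", "the", "first"), "time")  # 2 of 3 continuations
p_the = probability(table, (), "the")                        # 3 of 12 words
```

Keying the table by prefix keeps each lookup a single dictionary access, which plays the same role as a keyed in-memory table.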

Training and Testing the Model

  • Training

    • all sentences of a training set partition are used in a step
    • the model is trained over training set partitions until a stop criterion is reached
    • phrases with low probability are trimmed to keep size reasonable
  • Prediction

    • for a given prefix (X),
      • the highest-order n-grams are examined first
      • the top N predictions (e.g. three) are used
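
The training and prediction steps above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the project's implementation: the function names, the `min_prob` trimming threshold and the use of `max_steps` as the stop criterion are hypothetical. Prediction falls back from the longest matching prefix to shorter ones, which is one simple way to examine the highest-order n-grams first.

```python
from collections import Counter, defaultdict

def update(table, sentences, max_order=4):
    """Add the n-gram counts of one partition to the table (one training step)."""
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for k in range(min(i, max_order - 1) + 1):
                table[tuple(words[i - k:i])][word] += 1

def trim(table, min_prob=0.05):
    """Drop next-word entries whose conditional probability is below min_prob."""
    for prefix in list(table):
        followers = table[prefix]
        total = sum(followers.values())
        kept = Counter({w: c for w, c in followers.items() if c / total >= min_prob})
        if kept:
            table[prefix] = kept
        else:
            del table[prefix]

def train(partitions, max_steps=10):
    """Train on one partition per step; max_steps stands in for the stop criterion."""
    table = defaultdict(Counter)
    for step, part in enumerate(partitions):
        if step >= max_steps:
            break
        update(table, part)
        trim(table)
    return table

def predict(table, phrase, top_n=3, max_order=4):
    """Return up to top_n next words, trying the longest matching prefix first."""
    words = phrase.lower().split()
    for k in range(min(len(words), max_order - 1), -1, -1):
        followers = table.get(tuple(words[len(words) - k:]))
        if followers:
            return [w for w, _ in followers.most_common(top_n)]
    return []

# Toy partitions (hypothetical).
partitions = [["at the first time", "at the first sight"],
              ["at the first time", "in the end"]]
model = train(partitions)
top = predict(model, "at the first")   # ['time', 'sight']
fallback = predict(model, "on the")    # backs off to the prefix ('the',)
```

Trimming after every step is a simplification: it discards counts, so probabilities computed later are approximate.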

Web Application / Future Work

  • the web application:
    • has a text box to enter phrase (partial sentence)
    • shows the top N predicted words for the entered phrase
    • shows a plot with top N predicted words
    • shows help text to guide the user
  • future work:
    • code was architected with parallel and distributed processing in mind (e.g. on a Spark cluster with SparkR)
    • modify code to run on Spark (SparkR)
    • explore customer-specific models (e.g. trained on the works of Shakespeare)
    • explore higher N-grams and minimal semantic analysis
  • Application
  • Presentation
  • Acknowledgement

    I sincerely thank the Data Science team at Johns Hopkins, especially Jeff Leek, Roger Peng and Brian Caffo, as well as the Coursera team, for this excellent specialization course series. For me, it has made a new career opportunity possible!