Word Prediction

kojava@yahoo.com
December 2014

Data Science Specialization Capstone Project

Introduction

  • The goal is to build a natural language processing engine around a predictive model trained on English text fragments and words.
  • As the user enters one or more words, the model should predict the next word the user is going to enter.
  • The data set used for the model comes from SwiftKey and consists of large, unstructured English text collections from blogs, news, and Twitter.

Data Analysis and Preprocessing

  • Due to memory constraints, approximately 5% of the data is sampled and tokenized to construct the text corpus used in the N-gram (sequence of N words) model.
  • Transformations on the text corpus include removing numbers, punctuation, and profanities, converting to lowercase, and eliminating words with a frequency count of less than 3.
  • The corpus consists of 4-grams, tri-grams, and bi-grams, where the Nth word is the response variable and the preceding N-1 words are the predictors.
  • The data is preprocessed into the form required by the N-gram model.
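
The preprocessing steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's actual code: the names `tokenize`, `build_ngrams`, and `PROFANITIES` are hypothetical, the profanity list is a one-word stand-in, and dropping rare words before forming N-grams is just one possible reading of the frequency filter.

```python
import re
from collections import Counter

# Hypothetical stand-in; the source does not say which profanity list was used.
PROFANITIES = {"darn"}
MIN_COUNT = 3  # words seen fewer than 3 times are eliminated


def tokenize(line):
    """Lowercase a line, strip numbers and punctuation, drop profanities."""
    line = line.lower().replace("'", "")   # keep contractions as one token
    line = re.sub(r"[^a-z\s]", " ", line)  # removes digits and punctuation
    return [w for w in line.split() if w not in PROFANITIES]


def build_ngrams(lines, n_values=(2, 3, 4), min_count=MIN_COUNT):
    """Return {n: Counter of n-gram tuples} built from the cleaned lines."""
    token_lines = [tokenize(line) for line in lines]
    word_counts = Counter(w for tokens in token_lines for w in tokens)
    counts = {n: Counter() for n in n_values}
    for tokens in token_lines:
        # Assumption: rare words are removed before the n-grams are formed.
        kept = [w for w in tokens if word_counts[w] >= min_count]
        for n in n_values:
            for i in range(len(kept) - n + 1):
                counts[n][tuple(kept[i:i + n])] += 1
    return counts
```

In each resulting tuple, the last element plays the role of the response word and the elements before it are the predictors.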

Prediction Model

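  • An N-gram model is used to estimate the probability of the Nth word from the occurrences of the last N-1 words of the input text, where C(·) is the number of times the words occur together as a sequence in the corpus: \[ P(N^{\text{th}}\ \text{word} \mid N-1\ \text{words}) = \frac{C(N^{\text{th}}\ \text{word},\ N-1\ \text{words})}{C(N-1\ \text{words})} \]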
  • The algorithm is depicted in the figure below. [Figure: prediction algorithm]
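
As a rough illustration of the estimate above, the Python sketch below turns N-gram counts into next-word probabilities. The function names `build_model` and `predict` are hypothetical, and falling back from the longest available context to shorter ones is an assumption, not something taken from the figure. C(N-1 words) is computed as the total count of N-grams that start with those N-1 words.

```python
from collections import Counter, defaultdict


def build_model(ngram_counts):
    """Map each (N-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for ngram, count in ngram_counts.items():
        model[ngram[:-1]][ngram[-1]] += count
    return model


def predict(model, text, max_order=4, top_k=4):
    """Return up to top_k (word, probability) pairs for the next word.

    Tries the longest context first (3 words for a 4-gram) and falls back
    to shorter contexts when nothing matches (an assumed strategy).
    """
    words = text.lower().split()
    for n in range(max_order, 1, -1):
        if len(words) < n - 1:
            continue  # not enough input words for this order
        context = tuple(words[-(n - 1):])
        followers = model.get(context)
        if followers:
            total = sum(followers.values())  # C(N-1 words)
            return [(w, c / total) for w, c in followers.most_common(top_k)]
    return []  # no context matched at any order
```

With `top_k=4`, the first pair would correspond to the main prediction and the remaining pairs to up to three alternatives.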

Shiny Application

  • Please wait a few seconds for the app to load. The word prediction will be shown in blue text. Three other top predictions are shown below if they are available.
  • A simple N-gram model has a poor prediction rate because it does not take the contextual elements of a sentence into account. A further improvement would be to incorporate contextual analysis into the model.