Word Predictor

Gregorio Ambrosio Cestero
July 2016

A text prediction app that takes a phrase (multiple words) typed into a text box and outputs a prediction of the next word.

Word Predictor, yet another Data Science Capstone Project by Gregorio Ambrosio

Description

Word Predictor

Word Predictor is a Shiny application that runs on shinyapps.io. It takes a phrase (multiple words) typed into a text box and predicts the most likely next word, based on frequently occurring phrases (n-grams).

How it works

  • The user writes a phrase of any length
  • The application extracts the last two words
  • The application runs a prediction model with both words as input
  • Finally, the application shows the predicted word
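
The extraction step above can be sketched as follows. This is a minimal Python illustration, not the app's actual Shiny (R) code. The function name `last_two_words` and the tokenization rule (lowercase runs of letters and apostrophes) are assumptions made for the example.

```python
import re
from typing import List


def last_two_words(phrase: str) -> List[str]:
    """Return up to the last two word tokens of a phrase, lowercased."""
    # Assumption: a "word" is a run of letters or apostrophes.
    # The real app may tokenize differently.
    tokens = re.findall(r"[a-z']+", phrase.lower())
    # Slicing is safe for short input: it returns fewer than two
    # tokens when the phrase has fewer than two words.
    return tokens[-2:]


print(last_two_words("Thanks for the"))  # ['for', 'the']
```

Phrases with fewer than two words yield a shorter list, so a model consuming this output would need to handle one-word or empty contexts.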

The prediction model is based on a Naive Bayes classifier.

Under the hood

The app is based on Naive Bayes classifiers, a family of classifiers built on Bayes' theorem. They are known for producing simple yet well-performing models, especially in document classification and disease prediction. The Naive Bayes model makes a simplifying conditional independence assumption: given a class (positive or negative), the words are conditionally independent of each other. This assumption does not affect accuracy in text classification by much, but it makes very fast classification algorithms applicable to the problem.
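
To make the independence assumption concrete, here is a small Python sketch of a Naive Bayes text classifier on toy data. It is an illustration only, not the app's code. The training documents, the function names `train` and `classify`, and the use of add-one (Laplace) smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (label, list_of_words). Returns the counts used for scoring."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(words, class_counts, word_counts, vocab):
    """Return the label with the highest log P(class) + sum of log P(word | class)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        n_words = sum(word_counts[label].values())
        for w in words:
            # Independence assumption: each word contributes its own
            # log P(word | class) term, regardless of the other words.
            # Add-one smoothing (an assumption here) avoids log(0).
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration.
toy_docs = [
    ("positive", ["good", "great", "fun"]),
    ("positive", ["great", "movie"]),
    ("negative", ["bad", "boring"]),
    ("negative", ["bad", "movie"]),
]
model = train(toy_docs)
print(classify(["great", "fun"], *model))  # positive
```

Because each word adds one independent term to the score, classification cost grows linearly with the number of words, which is where the speed comes from.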

The algorithm

  • The Coursera SwiftKey dataset is downloaded and some summary statistics are collected.
  • Because the data set is very large, it is randomly sampled down to a smaller data set for the analysis.
  • The sample is preprocessed and tokenized to extract contiguous sequences of words (n-grams). Specifically, unigrams, bigrams and trigrams are extracted.
  • The n-grams are used to calculate probabilities.
  • An n-gram-based probabilistic model is built to predict the next word from the previous 1, 2 or 3 words.
  • Finally, the prediction model is exposed through the Shiny app, as the project goal requires.
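
The steps above can be sketched end to end in a few lines of Python. This is a hedged illustration rather than the project's R implementation: the toy corpus, the function names, the relative-frequency probability estimate, and the back-off from trigrams to bigrams to unigrams are assumptions, since the source does not say how the three n-gram orders are combined.

```python
import random
import re
from collections import Counter, defaultdict


def tokenize(line):
    # Assumption: lowercase runs of letters and apostrophes count as words.
    return re.findall(r"[a-z']+", line.lower())


def sample_lines(lines, fraction, seed=42):
    """Keep each line with the given probability (random down-sampling)."""
    rng = random.Random(seed)
    return [ln for ln in lines if rng.random() < fraction]


def build_model(lines, max_n=3):
    """model[k] maps a k-word context tuple to a Counter of following words.

    k = 0 holds unigram counts, k = 1 bigram counts, k = 2 trigram counts.
    """
    model = {k: defaultdict(Counter) for k in range(max_n)}
    for line in lines:
        tokens = tokenize(line)
        for i, word in enumerate(tokens):
            for k in range(max_n):
                if i - k >= 0:
                    model[k][tuple(tokens[i - k:i])][word] += 1
    return model


def predict_next(model, phrase):
    """Return (word, probability) using the longest context that was seen."""
    tokens = tokenize(phrase)
    for k in sorted(model, reverse=True):  # try the longest context first
        if k > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - k:])
        candidates = model[k].get(context)
        if candidates:
            word, count = candidates.most_common(1)[0]
            # Relative frequency: count(context, word) / count(context)
            return word, count / sum(candidates.values())
    return None


# Toy corpus, invented for illustration; fraction=1.0 keeps every line.
corpus = [
    "thanks for the help",
    "thanks for the ride",
    "thanks for the help again",
    "see you at the party",
]
model = build_model(sample_lines(corpus, fraction=1.0))
word, prob = predict_next(model, "Many thanks for the")
print(word)  # help
```

With only unigrams, bigrams and trigrams stored, the longest usable context in this sketch is two words, which matches the app extracting the last two words of the input.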

Model building procedure for pattern classification