Coursera Data Science Specialization: Capstone Project

C. Euler
2017-03-19

Executive Summary

This presentation outlines the methodology behind the submitted next-word prediction model.

  • A Shiny UI is used to enter a search string.
  • A Bayesian algorithm predicts the most likely next word based on a large database of text.
  • The predicted word is printed on screen.

Corpus

The corpus used for modeling is based on 5% of the Twitter, news and blog corpora available from the Coursera assignment page. The following steps were carried out to prepare it:

  • Remove punctuation and numbers
  • Convert all letters to lower case
  • Remove stop words (e.g., “a”, “and”, “the”, …)
  • Remove extra whitespace
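
The cleaning steps above can be sketched in a few lines. This is only an illustration in Python, not the code used for the submitted model: the function name `clean_text` and the short stop-word list are assumptions made for the example, and the sketch keeps ASCII letters only, which is a simplification.

```python
import re

# Hypothetical, deliberately short stop-word list (assumption: the list
# actually used for the model is not given in this presentation).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}


def clean_text(text, stop_words=STOP_WORDS):
    """Apply the four preparation steps to a single string."""
    # 1. Remove punctuation and numbers (everything except ASCII letters
    #    and whitespace is dropped).
    text = re.sub(r"[^A-Za-z\s]", "", text)
    # 2. Convert all letters to lower case.
    text = text.lower()
    # 3. Remove stop words.
    tokens = [t for t in text.split() if t not in stop_words]
    # 4. Remove extra whitespace: split() already discarded runs of
    #    spaces, so joining with single spaces normalizes the result.
    return " ".join(tokens)
```

Calling `clean_text` on each line of the sampled corpus would yield the normalized text from which n-grams are then counted.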

Model

The model is based on Bayes' theorem, which combines prior knowledge about different aspects of a problem to obtain a solution. Specifically, it gives the probability of an event B given that A has occurred, \( P(B|A) \), in terms of the reverse, \( P(A|B) \), and the separate probabilities \( P(A) \) and \( P(B) \):

\( P(B|A) = \frac{P(A|B)\cdot P(B)}{P(A)} \).

In this context, \( A \) is the occurrence of a specific n-gram, \( B \) is that of a specific word and, thus, \( B|A \) is the occurrence of that word given that n-gram, i.e. as the word that follows it.
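
To make this concrete, the following Python sketch estimates \( P(B|A) \) from counts and returns the most likely next word. It assumes plain relative-frequency estimates, in which case Bayes' formula reduces to the number of times \( A \) is followed by \( B \) divided by the number of times \( A \) is followed by any word. The function names `train` and `predict` are hypothetical, and the submitted model may differ in its details (for example in smoothing or in how unseen n-grams are handled).

```python
from collections import Counter, defaultdict


def train(tokens, n=2):
    """Count, for each context of n-1 words, which words follow it."""
    pair_counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])   # A: the preceding words
        next_word = tokens[i + n - 1]          # B: the word that follows
        pair_counts[context][next_word] += 1
    return pair_counts


def predict(pair_counts, context):
    """Return (word, probability) maximizing the estimate of P(B|A).

    Returns (None, 0.0) if the context was never seen in training.
    """
    candidates = pair_counts.get(tuple(context))
    if not candidates:
        return None, 0.0
    total = sum(candidates.values())             # count of A with any follower
    word, count = candidates.most_common(1)[0]   # most frequent follower B
    return word, count / total                   # count(A, B) / count(A)
```

For instance, after `model = train("i love data science i love data i love r".split())`, the call `predict(model, ["love"])` picks "data", because "love" is followed by "data" in two of its three occurrences.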

Shiny App

To use the app, simply type in a word or phrase. The model's prediction is printed in blue.

The Shiny app is available at shinyapps.io.

The underlying code is available on GitHub.