Coursera Data Science Specialization: Capstone Project

C. Euler
2017-03-19

Executive Summary

This presentation outlines the methodology behind the submitted next-word prediction model.

A Shiny UI is used to enter a search string.
A Bayesian algorithm predicts the most likely next word based on a large database of text.
The predicted word is printed on screen.

Corpus

The corpus used for modeling is based on 5% of the Twitter, news, and blog corpora available from the Coursera assignment page. The following steps were carried out to prepare it:

Remove punctuation and numbers
Convert all letters to lower case
Remove stop words (e.g., “a”, “and”, “the”, …)
Remove extra whitespace
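
The preparation steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the code used for the submitted model; the function name `clean_text` and the abbreviated `STOP_WORDS` set are assumptions made for the example, since the actual stop-word list is not given here.

```python
import re

# Hypothetical, abbreviated stop-word list (assumption: the real list is longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}

def clean_text(text):
    """Apply the four preparation steps to a single string."""
    # 1. Remove punctuation and numbers: keep only ASCII letters and whitespace.
    #    Note that this simple pattern also drops non-ASCII letters.
    text = re.sub(r"[^A-Za-z\s]", "", text)
    # 2. Convert all letters to lower case.
    text = text.lower()
    # 3. Remove stop words.
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    # 4. Remove extra whitespace: split() plus a single-space join collapses it.
    return " ".join(tokens)

print(clean_text("Hello, World! 123 the cat and a dog."))  # prints "hello world cat dog"
```

Lower-casing before the stop-word filter ensures that capitalized stop words such as "The" are matched as well.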

Model

The model is based on Bayes' theorem, which relates a conditional probability to its reverse. Specifically, it gives the probability of an event B given that A has occurred, \( P(B|A) \), in terms of the reverse conditional probability \( P(A|B) \) and the separate probabilities \( P(A) \) and \( P(B) \):

\( P(B|A) = \frac{P(A|B)\cdot P(B)}{P(A)} \).

In this context, \( A \) is the occurrence of a specific n-gram and \( B \) is the occurrence of a specific word; thus, \( P(B|A) \) is the probability that the word occurs given that n-gram.
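
As a rough illustration of how the formula can be evaluated from counts, the sketch below treats \( A \) as a single preceding word and \( B \) as the word that follows it. It is an assumption-laden toy example, not the submitted implementation: the function names `build_counts`, `posterior`, and `predict_next` are hypothetical, and the context is limited to one word for brevity.

```python
from collections import Counter

def build_counts(tokens):
    """Count (context, next word) pairs and their marginals."""
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    context_counts = Counter(a for a, _ in pairs)  # how often A is the context
    word_counts = Counter(b for _, b in pairs)     # how often B is the next word
    return pair_counts, context_counts, word_counts, len(pairs)

def posterior(context, word, counts):
    """P(B|A) = P(A|B) * P(B) / P(A), with each term estimated from counts."""
    pair_counts, context_counts, word_counts, total = counts
    if pair_counts[(context, word)] == 0:
        return 0.0
    p_a_given_b = pair_counts[(context, word)] / word_counts[word]
    p_b = word_counts[word] / total
    p_a = context_counts[context] / total
    return p_a_given_b * p_b / p_a

def predict_next(context, counts):
    """Return the most probable next word, or None for an unseen context."""
    candidates = sorted(b for (a, b) in counts[0] if a == context)
    if not candidates:
        return None
    return max(candidates, key=lambda b: posterior(context, b, counts))

counts = build_counts("new york city new york times new jersey".split())
print(predict_next("new", counts))  # prints "york"
```

With count-based estimates, the Bayes expression reduces to the number of times the pair occurs divided by the number of times the context occurs, so "york" (two of the three continuations of "new") is chosen over "jersey".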

Shiny App

To use the app, simply type in a word or phrase. The model's prediction is printed in blue.

The Shiny app is available at shinyapps.io.

The underlying code is available on GitHub.