Data Science Capstone Project

Dinara Mukhtarova

App description

A Shiny app that takes a phrase (multiple words) as input and, once the user clicks “Submit”, predicts the next word. I used English texts from blogs, news and Twitter to compute n-gram frequencies. Based on those frequencies, the app finds the most likely continuation of the phrase.

How it works

  1. User inputs a phrase and presses “Submit”.
  2. App splits the phrase into words and looks at the last one to three of them.
  3. App looks in the database for phrases that match.
  4. App finds all possible predicted words and prints the most likely one.
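
Step 2 above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual code: the function names `last_words` and `candidate_contexts` and the tokenization rule (lowercase letters and apostrophes only) are assumptions made for the example.

```python
import re

def last_words(phrase, max_words=3):
    """Lowercase the phrase, split it into words, and keep the last max_words."""
    # Assumed tokenization: runs of letters and apostrophes count as words.
    words = re.findall(r"[a-z']+", phrase.lower())
    return words[-max_words:]

def candidate_contexts(phrase, max_words=3):
    """Return the contexts to look up, from the longest to the shortest."""
    words = last_words(phrase, max_words)
    return [tuple(words[i:]) for i in range(len(words))]

print(candidate_contexts("Thanks for the follow"))
```

Each context would then be matched against the stored n-grams, starting with the longest one.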

My prediction algorithm

The algorithm I'm using is called Stupid Backoff. This algorithm assigns a score to every candidate word as follows:

\[ S(w_i|w^{i-1}_{i-k+1}) = \begin{cases} \frac{freq(w^i_{i-k+1})}{freq(w^{i-1}_{i-k+1})} & \quad \text{if } freq(w^i_{i-k+1}) > 0\\ \alpha S(w_i|w^{i-1}_{i-k+2}) & \quad \text{otherwise}\\ \end{cases} \]

Here, we're using \( \alpha = 0.4 \).

Stupid Backoff is inexpensive to compute in a distributed environment, and its prediction quality is high when it is trained on large amounts of data.
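
To make the scoring rule concrete, here is a small Python sketch of it. It is an illustration under stated assumptions rather than the app's implementation: `build_counts` and `score` are hypothetical names, the counts live in an in-memory `Counter` instead of a database, and the base case (unigram frequency divided by the total number of words) is assumed, since the formula above does not spell it out.

```python
from collections import Counter

ALPHA = 0.4  # back-off factor, as stated above

def build_counts(tokens, max_order=4):
    """Count every n-gram of order 1..max_order in a list of tokens."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def score(word, context, counts, total_words):
    """Stupid Backoff score S(word | context)."""
    context = tuple(context)
    factor = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            # The n-gram was observed: use its relative frequency.
            return factor * counts[ngram] / counts[context]
        # Otherwise drop the earliest context word and apply the penalty.
        context = context[1:]
        factor *= ALPHA
    # Assumed base case: unigram relative frequency.
    return factor * counts[(word,)] / total_words

tokens = "the cat sat on the mat the cat ran".split()
counts = build_counts(tokens)
print(score("sat", ("the", "cat"), counts, len(tokens)))  # observed trigram
print(score("sat", ("mat", "cat"), counts, len(tokens)))  # backs off once
```

The scores are not normalized probabilities, so they are only meaningful for ranking candidate words against each other.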

Handling unobserved n-grams

If a user inputs a phrase that has no matches in the database, the app returns the word “and” (since that phrase is possibly a name of something). If they enter an empty string (or a string that has no words in it), the app returns the word “the” (since it is the most common word in English and fits well at the beginning of a phrase).
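
The two fallback rules can be sketched like this in Python. The `predict` function and the `table` dictionary (mapping a context tuple to its best next word) are hypothetical stand-ins for the app's database lookup, and the tokenization rule is an assumption.

```python
import re

EMPTY_INPUT_WORD = "the"  # returned when the input contains no words
NO_MATCH_WORD = "and"     # returned when no stored phrase matches

def predict(phrase, table):
    """Return the predicted next word, applying both fallback rules."""
    words = re.findall(r"[a-z']+", phrase.lower())
    if not words:
        return EMPTY_INPUT_WORD
    # Try the last three words, then the last two, then the last one.
    for n in (3, 2, 1):
        context = tuple(words[-n:])
        if len(context) == n and context in table:
            return table[context]
    return NO_MATCH_WORD

table = {("new", "york"): "city", ("york",): "times"}
print(predict("I love New York", table))
print(predict("", table))
print(predict("zzz qqq", table))
```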
