Word Prediction Application

MOHAMMAD SHADAN
23-DEC-2016

Coursera - Data Science Capstone Project
(using Stupid Backoff)

Steps Involved in Creating the Application

Creating n-grams (n = 1, 2, 3, 4 and 5) from a random one percent sample of three English text files (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) from the Coursera dataset
Creating functions to calculate relative frequencies (Maximum Likelihood Estimate) and the Stupid Backoff score, and to predict the next word based on the maximum score
Creating the Shiny app (ui.R and server.R) and implementing the Stupid Backoff algorithm using the above functions
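
The first step can be illustrated with a minimal Python sketch. This is not the R code behind the application: the function names (sample_lines, tokenize, count_ngrams), the tokenization rule and the toy corpus are all assumptions made for illustration, and the sampling keeps each line with probability 0.01 rather than drawing exactly one percent of the lines.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def tokenize(line):
    """Lowercase a line and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", line.lower())


def count_ngrams(lines, max_n=5):
    """Return {n: Counter of n-gram tuples -> frequency} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = tokenize(line)
        for n in range(1, max_n + 1):
            # Slide a window of width n over the tokens of this line
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus (made up, not the Coursera data); too small to sample from,
# so it is counted in full.
toy_lines = ["A cup of coffee", "a cup of tea", "A cup of coffee please"]
ngram_counts = count_ngrams(toy_lines)
```

With real data, the lines of the three text files would first be passed through sample_lines and the result given to count_ngrams.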

Description of the Stupid Backoff Algorithm

To score a candidate word that follows a sequence of words, the algorithm first looks for the word together with its context at the highest n-gram level. If that n-gram was not observed, it backs off to the (n-1)-gram by dropping the earliest context word and multiplies the resulting score by 0.4 (alpha), repeating until a match is found.

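Mathematically, \( Score(w_i \mid w_{i-k}^{i-1}) = \begin{cases} \frac {freq(w_{i-k}^{i})} {freq(w_{i-k}^{i-1})} & \text{if } freq(w_{i-k}^{i}) > 0 \\ 0.4 \, Score(w_i \mid w_{i-k+1}^{i-1}) & \text{otherwise} \end{cases} \)

The recursion ends at the unigram level, where the score is the relative frequency \( \frac {freq(w_i)} {N} \), with \( N \) the total number of words in the sample.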
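
As a worked example with hypothetical counts (not taken from the Coursera data): suppose \( freq(\text{cup of coffee}) = 2 \) and \( freq(\text{cup of}) = 3 \). The score of "coffee" after "cup of" is then \( \frac{2}{3} \approx 0.67 \). If instead the input is "pot of" and the trigram "pot of coffee" was never observed, but \( freq(\text{of coffee}) = 2 \) and \( freq(\text{of}) = 3 \), the score backs off one level and becomes \( 0.4 \times \frac{2}{3} \approx 0.27 \).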

Stupid Backoff is computationally inexpensive because it needs no discounting or normalization, and its accuracy is good
Stupid Backoff uses relative frequencies (scores) rather than normalized probabilities
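
The scoring rule described above can be sketched in Python as follows. This is an illustrative sketch, not the application's R implementation: the function name stupid_backoff_score, the single Counter holding n-grams of every order, and the toy counts are assumptions.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor from the description above


def stupid_backoff_score(word, context, counts, total_unigrams):
    """Score `word` following `context` (a tuple of preceding words).

    `counts` maps word tuples of any length to their frequencies.
    """
    penalty = 1.0
    context = tuple(context)
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            # Relative frequency of the n-gram given its context
            return penalty * counts[ngram] / counts[context]
        # Not observed: drop the earliest context word and apply alpha
        context = context[1:]
        penalty *= ALPHA
    # No context left: fall back to the unigram relative frequency
    return penalty * counts[(word,)] / total_unigrams


# Hypothetical counts for illustration only
toy_counts = Counter({
    ("cup", "of", "coffee"): 2, ("cup", "of", "tea"): 1,
    ("cup", "of"): 3, ("of", "coffee"): 2, ("of", "tea"): 1,
    ("a",): 3, ("cup",): 3, ("of",): 3, ("coffee",): 2, ("tea",): 1,
})
TOTAL = 12  # sum of the unigram counts above

seen = stupid_backoff_score("coffee", ("cup", "of"), toy_counts, TOTAL)
backed_off = stupid_backoff_score("coffee", ("pot", "of"), toy_counts, TOTAL)
```

Here seen uses the trigram directly (2/3), while backed_off drops "pot" and scores the bigram "of coffee" with one factor of 0.4.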

Shiny Application - Word Predictor

About the Application

https://mohammadshadan.shinyapps.io/wordpredictor/
The user enters text in the box and the application displays the most probable next word below “Predicted Word Is …”, e.g. entering “Cup of” predicts “coffee”
If nothing is entered in the text box, the application displays the word “the”, as it is the most frequent unigram

Logic behind Prediction

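The user inputs a sequence of words; the application takes the last four words (or fewer, if fewer were entered) and checks the n-grams for a match, backing off from the last four words down to the last one
All candidate words are gathered and the word with the maximum score is displayed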
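
The prediction logic can be sketched in Python as below. It is a simplified illustration under assumptions, not the code in server.R: the names build_counts and predict_next_word, the tokenization rule and the toy corpus are hypothetical, and the linear scan over all n-grams is chosen for brevity rather than speed.

```python
import re
from collections import Counter

ALPHA = 0.4       # backoff factor
MAX_CONTEXT = 4   # 5-grams are the largest, so at most 4 context words


def build_counts(lines, max_n=MAX_CONTEXT + 1):
    """Count all n-grams (n = 1..max_n) in one Counter keyed by word tuples."""
    counts = Counter()
    for line in lines:
        tokens = re.findall(r"[a-z']+", line.lower())
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict_next_word(text, counts):
    """Return the candidate word with the highest Stupid Backoff score."""
    tokens = re.findall(r"[a-z']+", text.lower())
    unigrams = {k[0]: v for k, v in counts.items() if len(k) == 1}
    total = sum(unigrams.values())
    scores = {}
    penalty = 1.0
    # Try the last 4 words first, then the last 3, 2 and 1
    for size in range(min(MAX_CONTEXT, len(tokens)), 0, -1):
        context = tuple(tokens[-size:])
        if counts[context] > 0:
            for ngram, freq in counts.items():
                if len(ngram) == size + 1 and ngram[:-1] == context:
                    # setdefault keeps the score from the longest matching context
                    scores.setdefault(ngram[-1], penalty * freq / counts[context])
        penalty *= ALPHA
    # Unigram level: every known word is a candidate (covers empty input)
    for word, freq in unigrams.items():
        scores.setdefault(word, penalty * freq / total)
    return max(scores, key=scores.get)


# Toy corpus (made up, not the Coursera data)
toy_counts = build_counts([
    "the cup of coffee is on the table",
    "the cup of coffee is hot",
    "the cup of tea is cold",
])
prediction = predict_next_word("Cup of", toy_counts)
fallback = predict_next_word("", toy_counts)
```

On this toy corpus, prediction is "coffee" and fallback is "the", mirroring the behaviour described for the application; with the real sample the results depend on the observed n-gram counts.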