MOHAMMAD SHADAN
23-DEC-2016
Coursera - Data Science Capstone Project
(using Stupid Backoff)
Creating n-grams (n = 1, 2, 3, 4 and 5) from a random one percent sample of three English text files (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) from the Coursera dataset
Creating functions to calculate relative frequencies (Maximum Likelihood Estimate) and the Stupid Backoff score, and to predict the next word based on the maximum score
Creating a Shiny app (ui.R and server.R) that implements the Stupid Backoff algorithm using the above functions
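The n-gram step can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the project's own code. The helper names (`sample_lines`, `tokenize`, `count_ngrams`), the whitespace tokenizer, and the per-line random sampling are all assumptions; the tiny in-memory corpus stands in for the three Coursera files.

```python
import random
from collections import Counter

def sample_lines(lines, fraction=0.01, seed=0):
    """Keep each line independently with probability `fraction` (assumed sampling scheme)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def tokenize(line):
    """Lowercase and split on whitespace (a deliberately simple, assumed tokenizer)."""
    return line.lower().split()

def count_ngrams(lines, max_n=5):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = tokenize(line)
        for n in range(1, max_n + 1):
            # Slide a window of width n over the tokens of this line.
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus standing in for the sampled text files.
corpus = ["the cat sat on the mat", "the cat ate"]
counts = count_ngrams(corpus, max_n=3)
```

Here n-grams are counted within each line, so no n-gram spans a line boundary. With a real file, `sample_lines(open(path), fraction=0.01)` would keep roughly one percent of the lines.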
To score a candidate word that could follow a sentence, the algorithm first looks for the word together with its context at the highest n-gram level. If no n-gram of that size is found, it backs off to the (n-1)-gram and multiplies that score by 0.4 (alpha), repeating the step for shorter contexts as needed.
Mathematically, for a context of the k preceding words, \( Score(w_i \mid w_{i-k}^{i-1}) = \begin{cases} \frac {freq(w_{i-k}^{i})} {freq(w_{i-k}^{i-1})} & \text{if } freq(w_{i-k}^{i}) > 0 \\ 0.4 \cdot Score(w_i \mid w_{i-(k-1)}^{i-1}) & \text{otherwise} \end{cases} \)
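The scoring rule can be sketched as a short recursive function. This is an illustrative Python sketch, not the project's own code: the names `build_counts`, `score` and `predict` are hypothetical, and the unigram base case (relative frequency of the word among all tokens) is an assumption, since the formula above does not spell out where the recursion ends.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor from the text

def build_counts(tokens, max_n=3):
    """Count every n-gram (as a tuple) for n = 1..max_n in one Counter."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def score(word, context, counts, total):
    """Stupid Backoff score of `word` after the tuple `context`."""
    if not context:
        # Assumed base case: unigram relative frequency.
        return counts[(word,)] / total
    ngram = context + (word,)
    if counts[ngram] > 0:
        # The full n-gram was seen: use its relative frequency.
        return counts[ngram] / counts[context]
    # Otherwise drop the oldest context word and discount by ALPHA.
    return ALPHA * score(word, context[1:], counts, total)

def predict(context, counts, vocab, total):
    """Return the vocabulary word with the highest score."""
    return max(sorted(vocab), key=lambda w: score(w, context, counts, total))

tokens = "the cat sat on the mat the cat ate".split()
counts = build_counts(tokens, max_n=3)
vocab = set(tokens)
total = len(tokens)

seen = score("cat", ("the",), counts, total)          # bigram "the cat" was seen
backed = score("ate", ("sat", "the"), counts, total)  # backs off twice, to the unigram
best = predict(("the",), counts, vocab, total)
```

In this toy corpus "the cat" occurs twice out of three occurrences of "the", so `seen` is 2/3. The word "ate" never follows "sat the" or "the", so `backed` is 0.4 × 0.4 × 1/9. The scores are not probabilities and need not sum to one, which is why the method is called a score rather than an estimate.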
Logic behind Prediction