This presentation summarizes the Coursera Data Science Specialization capstone project. The purpose of the project is to build a Shiny app that predicts the next word in a sentence.
1. Predictive model structure
The predictive model consists of five data structures: four hashtables built from n-grams (one each for 2-grams through 5-grams) and a word-frequency table.
N-gram structures:
The keys are (n-1)-grams (each n-gram with its last word removed). Ex: the key hash(“going_to_the”) for the 4-gram “going_to_the_cinema”.
The value for each key is a list, sorted by probability, of the words that could complete the n-gram (the words removed when building the keys). Ex: Key=“going_to_the”, Value=[<“cinema”,0.3>, <“market”,0.25>, <“bed”,0.25>, <“room”,0.2>].
Word-frequency table: ordered table with the frequency and the probability of appearance of the most used words (both kinds of structure are sketched below).
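As an illustration, here is a minimal sketch of these two structures in R, assuming an R environment serves as the hashtable; the data and object names (four_gram, word_freq) are hypothetical, not taken from the project:

```r
# One n-gram hashtable: keys are underscore-joined (n-1)-grams, values are
# candidate completions sorted by probability (hypothetical example data).
four_gram <- new.env(hash = TRUE)
assign("going_to_the",
       data.frame(word = c("cinema", "market", "bed", "room"),
                  prob = c(0.30, 0.25, 0.25, 0.20)),
       envir = four_gram)

# Word-frequency table: most used words first (illustrative probabilities).
word_freq <- data.frame(word = c("the", "to", "and"),
                        prob = c(0.060, 0.035, 0.030))

# Lookup: O(1) hash access, then the first row is the most probable word.
get("going_to_the", envir = four_gram)$word[1]  # "cinema"
```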
2. Predictive model methodology
The model is based on a backoff scheme.
First, we look up the key formed by the last four words of the input (the (n-1)-gram for n = 5) in the 5-gram structure. If the key exists, we return the most likely word.
If the key does not exist, we perform the same query, with the last three words, in the 4-gram structure.
This step is repeated, dropping one word from the key each time, until the 2-gram structure has been queried.
If no key was found in any n-gram structure, we return the most frequent word in the word-frequency table. The whole loop is sketched below.
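A minimal sketch of this backoff loop, assuming the n-gram hashtables are collected in a named list keyed by n (all names here, such as predict_next_word and ngram_tables, are illustrative, not the project's):

```r
# Backoff prediction: try the 5-gram table first, fall back to shorter
# n-grams, and finally to the overall most frequent word.
predict_next_word <- function(phrase, ngram_tables, word_freq) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  for (n in 5:2) {
    k <- n - 1                                   # key length: last (n-1) words
    if (length(words) < k) next
    key <- paste(tail(words, k), collapse = "_")
    tbl <- ngram_tables[[as.character(n)]]
    if (exists(key, envir = tbl, inherits = FALSE)) {
      return(get(key, envir = tbl)$word[1])      # list is sorted: first = most likely
    }
  }
  word_freq$word[1]                              # no key matched anywhere
}
```

For example, predict_next_word("I am going to the", ngram_tables, word_freq) would first probe the 5-gram table with the key “am_going_to_the” before backing off to shorter keys.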
3. Predictive model performance
Each time the model queries one of the n-gram structures, the cost is:
O(1) to find a value in a hashtable.
O(1) to find the most probable word in the value list (the list is sorted by probability, so the first element is returned).
In the worst case, this is done once per n-gram structure: 4 × O(1) = O(1).
In the worst case, we also fall back to the word-frequency table: O(1), since the table is ordered and its first entry is the most frequent word.
Hence, the cost of the predictive algorithm is O(1).
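As a quick, informal sanity check of the O(1) lookup claim, a hash lookup in an R environment should take about the same time whether the table holds one key or a hundred thousand (the key counts below are arbitrary):

```r
small <- new.env(hash = TRUE)
large <- new.env(hash = TRUE)
assign("going_to_the", "cinema", envir = small)
for (i in seq_len(100000)) assign(paste0("key_", i), "x", envir = large)
assign("going_to_the", "cinema", envir = large)

# Timings should be comparable despite the large difference in table size.
system.time(for (i in 1:10000) get("going_to_the", envir = small))
system.time(for (i in 1:10000) get("going_to_the", envir = large))
```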