Shiqi Yang
1/21/2020
This is a presentation for the Coursera Data Science Capstone. The objective of the capstone is to build a smart typing application that makes typing easier by predicting the next word from the words already entered, much like the keyboards built by SwiftKey.
To build the next-word prediction model, three data sets (Twitter, news and blogs) were used for training. Several data cleaning and sampling steps were applied to produce the final training data set. Following a natural language processing approach, word combinations commonly known as N-grams were then built from the training data, and a predictive algorithm was applied to them to predict the next word. Finally, a Shiny application was developed that incorporates this model to predict the next word.
We used the Twitter, news and blogs data sets to train the language model. We took a sample from each of the three data sets and combined the samples into a single data set.
We removed numbers, punctuation, symbols and non-printable characters from the combined data.
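The cleaning step can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the code used in the app; the function name `clean_text` is hypothetical, and treating every character other than an ASCII letter or whitespace as removable is an assumption about what "symbols and non-printable characters" covers.

```python
import re

def clean_text(text: str) -> str:
    """Remove numbers, punctuation, symbols and non-printable characters."""
    # Assumption: anything that is not an ASCII letter or whitespace is
    # replaced by a space, so neighbouring words are not glued together.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Collapse the runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("Hello, world! 123 #tag\x00")
print(cleaned)
```

Note that replacing punctuation with a space splits contractions such as "don't" into two tokens; deleting apostrophes instead would be an equally reasonable choice.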
After cleaning, we created five sets of word combinations with their respective frequencies: penta-grams (5-word phrases), tetra-grams (4-word phrases), tri-grams (3-word phrases), bi-grams (2-word phrases) and uni-grams (single words).
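As a rough illustration of how such frequency tables can be built, the sketch below counts every N-gram of order 1 to 5 in a toy sentence. It is a Python sketch under stated assumptions, not the app's own code: the names `ngram_counts` and `tables` are hypothetical, and splitting on whitespace stands in for whatever tokenizer was actually used.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus; the real tables are built from the sampled training data.
tokens = "the cat sat on the mat and the cat slept".split()

# One frequency table per order: 1 = uni-grams ... 5 = penta-grams.
tables = {n: ngram_counts(tokens, n) for n in range(1, 6)}

print(tables[2].most_common(1))
```

Keeping one table per order makes the later lookup simple: the last N-1 typed words select the candidates in the table of order N.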
In the Stupid Backoff model, the backoff factor alpha is heuristically set to a fixed value of 0.4 to reduce complexity. Each time we back off to a shorter N-gram, we multiply the score by 0.4.
The algorithm matches the last 4 words typed against the 5-gram model, finds the 5-grams that complete those 4 words, and calculates their scores.
If no match is found, or fewer than 4 records are returned, the app backs off: it takes the last 3 words typed, searches for the 4-grams that complete those 3 words, and calculates their scores.
If no match is found, or fewer than 3 records are found, it backs off to bigrams and, as a last resort, to unigrams.
After all the calculations, the ten words with the highest scores are returned.
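The back-off procedure described above can be sketched roughly as follows. This is a simplified Python illustration, not the app's implementation. The assumptions are: the names `build_tables` and `predict` are hypothetical; a candidate's score is its N-gram count divided by the count of its context, multiplied by 0.4 once for every back-off step; and the per-level record thresholds are replaced by a single rule that keeps backing off until `top_k` candidates have been collected.

```python
from collections import Counter

ALPHA = 0.4  # fixed back-off factor

def build_tables(tokens, max_n=5):
    """One frequency table per N-gram order, from 1 up to max_n."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1)
    }

def predict(tables, typed, top_k=10, max_n=5):
    """Return up to top_k (word, score) pairs, best first."""
    words = typed.split()
    total = sum(tables[1].values())  # denominator for unigram scores
    scores = {}
    penalty = 1.0
    for n in range(max_n, 0, -1):
        if n - 1 > len(words):
            continue  # fewer typed words than this order needs
        context = tuple(words[len(words) - (n - 1):])
        denom = tables[n - 1][context] if n > 1 else total
        if denom > 0:
            # Linear scan for clarity; a real app would index by context.
            for gram, count in tables[n].items():
                # Keep the score from the highest order that proposed a word.
                if gram[:-1] == context and gram[-1] not in scores:
                    scores[gram[-1]] = penalty * count / denom
        if len(scores) >= top_k:
            break
        penalty *= ALPHA  # each back-off step multiplies the score by 0.4
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

tokens = "the cat sat on the mat and the cat slept".split()
tables = build_tables(tokens)
result = predict(tables, "the cat", top_k=3)
print(result)
```

In this toy example the two trigram completions of "the cat" each score 0.5, and the third slot is filled by the most frequent unigram, discounted by 0.4 twice.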
The top 10 predictions appear as the user types, with no additional steps required from the user!