Shiqi Yang
1/21/2020
This is a presentation for the Coursera Data Science Capstone. The objective of the capstone is to build a smart typing application that makes typing easier by predicting the next word from the words already entered, similar to the technology used by SwiftKey.
To build the next-word predictive model, three data sets (Twitter, news, and blogs) were used to train the model. Various data cleaning and sampling steps were applied to produce the final training data set. Using a natural language processing approach, word combinations commonly known as N-grams were then created from the training data, and a predictive algorithm was applied to predict the next word. Finally, a Shiny application incorporating this predictive model was developed to predict the next word.
We used the Twitter, news, and blogs datasets to train the language model. We took a sample from each of the three datasets and combined them into a single training dataset, as sketched below.
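A minimal R sketch of this sampling step is shown below; the file names and the 1% sampling rate are assumptions for illustration, not the exact values used in the project.

```r
set.seed(1234)  # for reproducible sampling

# Read each raw corpus file (file names assumed to match the SwiftKey dataset)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)

# Take a random sample of lines from each source (1% is an assumed rate)
sample_lines <- function(x, rate = 0.01) {
  x[sample(length(x), ceiling(length(x) * rate))]
}

# Combine the three samples into one training dataset
training <- c(sample_lines(twitter), sample_lines(news), sample_lines(blogs))
```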
We removed numbers, punctuation, symbols, and non-printable characters from the combined data (see the cleaning sketch below).
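A hedged base-R sketch of that cleaning step follows; the exact regular expressions used in the project may differ.

```r
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop non-printable / non-ASCII characters
  x <- gsub("[0-9]+", " ", x)                            # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)                      # remove punctuation and symbols
  x <- gsub("\\s+", " ", x)                              # collapse repeated whitespace
  tolower(trimws(x))                                     # lowercase and trim
}

training_clean <- clean_text(training)
```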
After cleaning, we created five sets of word combinations with their respective frequencies: penta-grams (5-word phrases), tetra-grams (4-word phrases), tri-grams (3-word phrases), bi-grams (2-word phrases), and uni-grams (single words).
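One common way to build such N-gram frequency tables in R is with the quanteda package; the sketch below is an assumption about the tooling, shown for the bi-gram case and reusable for n = 1 through 5.

```r
library(quanteda)

# Tokenize the cleaned training text into words
toks <- tokens(training_clean, what = "word")

# Build an n-gram frequency table for a given n (2 = bi-grams, ..., 5 = penta-grams)
ngram_freq <- function(toks, n) {
  ng  <- tokens_ngrams(toks, n = n, concatenator = " ")
  frq <- colSums(dfm(ng))
  sort(frq, decreasing = TRUE)
}

bigrams <- ngram_freq(toks, 2)
head(bigrams, 10)   # the ten most frequent two-word phrases
```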
The top 10 predictions appear as the user types, with no additional steps required from the user.
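A minimal Shiny sketch of that interaction is given below; `predict_next_word()` is a hypothetical helper standing in for the actual N-gram lookup used in the application.

```r
library(shiny)

# Hypothetical lookup: returns up to 10 candidate next words for the typed phrase.
# In the real app this would query the N-gram frequency tables (backing off from
# penta-grams down to uni-grams); here it returns a placeholder list.
predict_next_word <- function(phrase, n = 10) {
  head(c("the", "to", "and", "a", "of", "in", "is", "for", "that", "it"), n)
}

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  tableOutput("predictions")
)

server <- function(input, output) {
  # Predictions update reactively as the user types, with no extra button press
  output$predictions <- renderTable({
    data.frame(next_word = predict_next_word(input$phrase))
  })
}

shinyApp(ui, server)
```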