Arouna Mopa
April 16, 2016
Data Science Capstone Project - Coursera/JHU
This project used an N-gram model to build an algorithm that suggests the next word, given text entered by the user as input. The data source used in this project contains three types of text: Twitter, news, and blogs. After cleaning and subsetting the data to build the training set, an N-gram model was created and a predictive algorithm (Katz back-off) was applied to predict the next word.
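The report does not show the training code. As a rough illustration only, the Python sketch below (the deployed product itself is a Shiny application, so this is a stand-in, not the project's code) builds the kind of N-gram count tables such a model relies on; the corpus string and function names are made up for the example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (tuple of n consecutive words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy stand-in for the cleaned Twitter/news/blogs training sample.
tokens = "good morning everybody good morning folks good evening everybody".split()

bigrams = ngram_counts(tokens, 2)   # one-word history plus next word
trigrams = ngram_counts(tokens, 3)  # two-word history plus next word
```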
The final data product was published as a Shiny application. This application can be found online using the links below:
In an N-gram model, a phrase contains N words, and the first N-1 words form the history of the last word. For example, in a 2-gram model the history contains one word; in a 3-gram model it contains two.
The next word is predicted by computing the conditional probability of that word given its history, estimated as the number of times the full N-gram occurred divided by the number of times the history occurred:
\( P(\text{everybody} \mid \text{Good morning}) = \frac{\#(\text{Good morning everybody})}{\#(\text{Good morning})} \)
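Continuing the toy sketch above, this estimate is just a ratio of two counts. The helper below is a hypothetical illustration of that ratio, not project code:

```python
def mle_probability(history, word, full_counts, hist_counts):
    """P(word | history) = count(history + word) / count(history)."""
    if hist_counts[history] == 0:
        return 0.0
    return full_counts[history + (word,)] / hist_counts[history]

# With the toy counts above, "good morning" occurs twice and
# "good morning everybody" once, so this returns 0.5.
p = mle_probability(("good", "morning"), "everybody", trigrams, bigrams)
```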
The left panel on the home page has an input box where the user can enter text.
As the content of the input box changes, the suggested next words are displayed in the right panel.
The application can suggest up to five next words, ranked in decreasing order of their probability of occurrence.
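A minimal sketch of the lookup that could drive such a suggestion panel, continuing the toy counts above: note that the deployed app is a Shiny application, and its Katz back-off also discounts counts and redistributes probability mass, which this simplified raw-count back-off omits.

```python
def top_k_next_words(history, trigrams, bigrams, k=5):
    """Rank candidate next words for a history, most frequent first."""
    hist = tuple(history)[-2:]
    # Try the longest history first (trigram level).
    candidates = Counter({ng[-1]: c for ng, c in trigrams.items() if ng[:-1] == hist})
    if not candidates:
        # Back off to a one-word history (bigram level).
        candidates = Counter({ng[-1]: c for ng, c in bigrams.items() if ng[:-1] == hist[-1:]})
    return [word for word, _ in candidates.most_common(k)]

# With the toy counts above: suggestions for "good morning".
top_k_next_words(("good", "morning"), trigrams, bigrams)  # ['everybody', 'folks']
```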