Ghazal Pasha
7/20/2018
The goal of this project is to build a model that predicts the next word.
Corpora collected from Twitter, blogs, and news are used to build the predictive text model.
A Shiny application is built to demonstrate this prediction model.
You can find the application here:
The model is built with natural language processing methods, in the following steps:
First, a sample of the dataset is taken. The data is cleaned and tokenized, and N-grams are built and counted by how frequently they occur.
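As a rough illustration of this step, the following Python sketch cleans and tokenizes a few sample lines and counts their N-grams. It is not the code behind the Shiny app. The function names `tokenize` and `build_ngrams`, the cleaning rule (lowercase, letters and apostrophes only), and the three sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes.

    This cleaning rule is an assumption; the real app may clean differently.
    """
    return re.findall(r"[a-z']+", text.lower())


def build_ngrams(lines, n):
    """Count every n-gram (stored as a tuple of words) across the lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        # Slide a window of width n over the tokens of this line.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Made-up sample standing in for the sampled Twitter, blog, and news text.
sample = ["I love New York.", "I love new shoes!", "We love New York"]
bigrams = build_ngrams(sample, 2)
trigrams = build_ngrams(sample, 3)

# Most frequent bigram and its count in the sample.
print(bigrams.most_common(1))
```

Storing each N-gram as a tuple keyed in a `Counter` keeps the frequency lookup simple: the last word of the tuple is the candidate next word, and the words before it are the context.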
Next, the input phrase is read and tokenized. The input is compared with the N-grams according to the length of the phrase. For example, a two-word phrase is compared with trigrams at the first level and with bigrams at the next level.
The most frequent next words are added to the list of suggestions. If a new phrase is encountered, the most common words are suggested.
The suggestion list is sorted by probability, and the top three suggestions for the next word are printed to the output.
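The lookup described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the app's actual implementation. The function `predict_next`, the layout of `ngram_tables` (a dict mapping N to a counter of N-gram tuples), the toy counts, and the `common_words` list are hypothetical. The probability used here is the relative frequency of a word within its N-gram level, which is an assumption because the description does not say how probability is estimated.

```python
from collections import Counter


def predict_next(phrase, ngram_tables, common_words, k=3):
    """Return up to k suggested next words for the phrase.

    Longer N-grams are tried first, then shorter ones. If nothing
    matches, the most common words are returned instead.
    """
    tokens = phrase.lower().split()
    candidates = {}  # word -> estimated probability (assumed estimate)
    for n in sorted(ngram_tables, reverse=True):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # phrase is too short for this N-gram level
        matches = Counter()
        for gram, count in ngram_tables[n].items():
            if gram[:-1] == context:
                matches[gram[-1]] += count
        total = sum(matches.values())
        for word, count in matches.items():
            # Keep the estimate from the longest matching context.
            candidates.setdefault(word, count / total)
    if not candidates:
        return list(common_words[:k])
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Toy counts, made up for illustration only.
tables = {
    2: Counter({("love", "new"): 3, ("love", "you"): 1,
                ("new", "york"): 2, ("new", "shoes"): 1}),
    3: Counter({("love", "new", "york"): 2, ("love", "new", "shoes"): 1}),
}
common = ["the", "to", "and"]

print(predict_next("I love new", tables, common))
print(predict_next("zebra", tables, common))
```

With these toy counts, the phrase "I love new" matches at the trigram level, while the unseen word "zebra" matches nothing and falls back to the common-word list.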
To use the app, start typing in English and you will see the top three suggestions.