This presentation describes the main considerations in building a system that predicts the next word from a preceding sentence. Predictions are made in the context of the texts used to train the model.
The system has two parts:
Building the model
Predicting a word
The reference used to build this app is: https://www.coursera.org/learn/probabilistic-models-in-nlp/home/week/3
In this part, 1% of the twitter.txt file is loaded to train the models. The sample is kept this small because the shinyapps.io server has memory limitations.
The next step is to identify profanity words and add them to the stop_word list.
The third step is to extract all tokens (by word) from the training data. In this step a ‘closed vocabulary’ is created.
The last step is to build the bi-gram and tri-gram models.
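The model-building steps above can be sketched roughly as follows. This is a minimal Python illustration, not the app's actual code: the function names, the regular-expression tokenizer, the sample lines, and the placeholder stop word are all assumptions, and the choice to drop any n-gram that contains a stop word is one possible reading of how the stop_word list is applied.

```python
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", line.lower())


def build_vocabulary(lines, stop_words):
    """Closed vocabulary: every training token that is not a stop word."""
    return {tok for line in lines for tok in tokenize(line)} - set(stop_words)


def count_ngrams(lines, n, vocabulary):
    """Count the n-grams whose tokens all belong to the closed vocabulary."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if all(tok in vocabulary for tok in gram):
                counts[gram] += 1
    return counts


# Hypothetical sample standing in for the 1% of twitter.txt.
sample_lines = [
    "I love this song",
    "I love this darn movie",
    "I love that song",
]
stop_words = {"darn"}  # placeholder for the profanity list

vocabulary = build_vocabulary(sample_lines, stop_words)
bigrams = count_ngrams(sample_lines, 2, vocabulary)
trigrams = count_ngrams(sample_lines, 3, vocabulary)
```

Keeping the counts in plain dictionaries keyed by token tuples makes the later lookups straightforward, at the cost of holding every n-gram in memory.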
To predict a word, conditional probability is used. That is:
P(word | previous_sentence) = count(previous_sentence,word) / count(previous_sentence)
The count(previous_sentence) is calculated only once, using the n-gram model (bi-gram): the n-gram table is filtered by ‘ngram’ to find the number of times ‘previous_sentence’ occurs.
To calculate count(previous_sentence,word), the n-gram-plus-one model (tri-gram) is used. In this case it is necessary to find the tri-grams that start with the previous sentence and then calculate the conditional probability for each of them.
Once all the probabilities are calculated, the four highest are selected, and the suggested words that correspond to those probabilities are listed in the UI.
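The prediction step can be illustrated with a small Python sketch. It is not the app's implementation: `suggest_words`, the use of the last two words as the ‘previous sentence’, and the counts below are all hypothetical, chosen only to show the bi-gram lookup, the tri-gram filter, and the selection of the four most probable words.

```python
from collections import Counter


def suggest_words(previous, bigrams, trigrams, top_n=4):
    """Return up to top_n (word, probability) pairs that follow `previous`."""
    # Assumption: the context is the last two words of the input.
    context = tuple(previous.lower().split()[-2:])
    # count(previous_sentence), looked up once in the bi-gram counts.
    context_count = bigrams.get(context, 0)
    if context_count == 0:
        return []
    # count(previous_sentence, word) / count(previous_sentence) for every
    # tri-gram that starts with the context.
    probabilities = {
        gram[2]: count / context_count
        for gram, count in trigrams.items()
        if gram[:2] == context
    }
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


# Hypothetical counts, for illustration only.
bigrams = Counter({("i", "love"): 4, ("love", "this"): 2})
trigrams = Counter({
    ("i", "love", "this"): 2,
    ("i", "love", "that"): 1,
    ("i", "love", "you"): 1,
    ("love", "this", "song"): 2,
})

result = suggest_words("I love", bigrams, trigrams)
print(result)
```

With these counts, "this" ranks first for the input "I love" with probability 2/4 = 0.5. An unseen context returns an empty list, since the sketch does no smoothing or back-off.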
The system currently has a performance issue when it builds the models using 70% of the data from each file. Building the model with a lot of data might take 10 minutes.
When you open the page, please be patient while the system starts building the models. You can check the build progress in the bottom-right corner. Once the models are built, a message is shown and the application is ready to suggest a word.
The interface is very simple: there is only a text field where the user enters the ‘previous sentence’, and the system does the rest. The system shows the four most probable words above the text field and, on the right, a table with the underlying probability information.