Natural Language Processing (NLP)

Farid Tayari

5/14/2020

NLP: Predicting the Next Word

This application suggests the next word using one of six predictive models that the user can choose from. The models are based on two methods: Katz Back-off and Linear Interpolation.

Train set

The train set used in this application is a small subset (adjustable by the user from 0.2% to 20%) of three data sets:

These text files can be downloaded here

Predictive models

After the n-grams are built, many word combinations turn out to have zero frequency in the train set. Consequently, these word combinations would receive zero probability in the model. Two of the methods used to fix this problem are:

The Katz Back-off method estimates the probability of an n-gram that was not observed in the train set from the lower-order n-gram.
The Linear Interpolation method estimates the probability as a weighted average of the n-gram and all lower-order n-grams. For example, Trigram Linear Interpolation uses a weighted average of the Trigram, Bigram, and Unigram probabilities.
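The Trigram Linear Interpolation example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the application's own code: the helper names, the toy corpus, and the weights (0.6, 0.3, 0.1) are all hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def interpolated_trigram_prob(w1, w2, w3, uni, bi, tri, weights=(0.6, 0.3, 0.1)):
    """P(w3 | w1, w2) as a weighted average of trigram, bigram, and unigram estimates."""
    l3, l2, l1 = weights  # hypothetical weights; they should sum to 1
    total = sum(uni.values())
    # Each estimate is a relative frequency; a missing context contributes 0.
    p_uni = uni[(w3,)] / total if total else 0.0
    p_bi = bi[(w2, w3)] / uni[(w2,)] if uni[(w2,)] else 0.0
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# Toy corpus (an assumption, not the application's data).
tokens = "the cat sat on the mat the cat ate".split()
uni, bi, tri = (ngram_counts(tokens, n) for n in (1, 2, 3))

seen = interpolated_trigram_prob("the", "cat", "sat", uni, bi, tri)
unseen = interpolated_trigram_prob("the", "cat", "on", uni, bi, tri)
```

The trigram "the cat on" never occurs in the toy corpus, yet `unseen` is still greater than zero because the unigram term contributes, which is the point of the method.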

More detailed information about these methods can be found in:
Speech and Language Processing by Daniel Jurafsky and James H. Martin

Steps to run the model

The input text must contain at least two words for the Bigram models, three words for the Trigram models, and four words for the 4-gram models.

The model displays a plot of the predicted words together with their probabilities.

The user can choose the weights and discount factors (gamma) used in the models. However, because the train set is fairly small, changing the weights will not change the predictions significantly.
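To illustrate how a discount factor enters a back-off model, here is a minimal bigram sketch in Python. It is a simplified variant under stated assumptions, not the application's implementation: a fixed absolute discount `gamma` stands in for the Good-Turing discounting of the original Katz formulation, and the function name, toy corpus, and default `gamma=0.5` are hypothetical.

```python
from collections import Counter

def katz_bigram_prob(prev, word, uni, bi, gamma=0.5):
    """P(word | prev) with an absolute discount gamma and back-off to unigrams."""
    total = sum(uni.values())
    # Number of times prev was observed as a bigram context.
    c_prev = sum(c for (p, _), c in bi.items() if p == prev)
    if c_prev == 0:
        return uni[word] / total  # unknown context: plain unigram estimate
    c_bi = bi[(prev, word)]
    if c_bi > 0:
        return (c_bi - gamma) / c_prev  # discounted estimate for a seen bigram
    seen = {w for (p, w) in bi if p == prev}
    alpha = gamma * len(seen) / c_prev  # probability mass freed by discounting
    unseen_mass = sum(uni[w] for w in uni if w not in seen) / total
    # Share the freed mass among unseen words in proportion to unigram frequency.
    return alpha * (uni[word] / total) / unseen_mass if unseen_mass else 0.0

# Toy corpus (an assumption, not the application's data).
tokens = "the cat sat on the mat the cat ate".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))

p_seen = katz_bigram_prob("cat", "sat", uni, bi)
p_unseen = katz_bigram_prob("cat", "the", uni, bi)
p_total = sum(katz_bigram_prob("cat", w, uni, bi) for w in uni)
```

A larger `gamma` takes more probability away from observed bigrams and hands it to unobserved ones; in this sketch the probabilities over the vocabulary still sum to 1 for any context that was seen in the corpus.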

The user can also set the proportion of the data set that is used as the train set. However, the application runs on a public server that does not provide the computational power required for higher train set proportions, so the proportion has to be kept low (around 5% for a 60 MB text corpus). As a result, the accuracy of the predictions might be low.
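Drawing a train set at a given proportion can be sketched as a random sample of corpus lines. The following Python snippet is an assumption about how such sampling might look, not the application's code; `sample_lines`, the fixed seed, and the synthetic corpus are hypothetical.

```python
import random

def sample_lines(lines, ratio, seed=42):
    """Keep each line independently with probability `ratio`."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < ratio]

# Synthetic stand-in for a text corpus read line by line.
corpus = [f"sentence number {i}" for i in range(10_000)]
train = sample_lines(corpus, ratio=0.05)
```

Sampling line by line keeps memory use proportional to the sample rather than to the full corpus when the lines are streamed from a file, which matters on a server with limited resources.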

The user can also upload a new corpus. However, because of processing and memory limitations, the maximum size of an uploaded file is 50 MB.

References

The application is accessible here and on GitHub