This project lies within the field of Natural Language Processing (NLP), a subfield of artificial intelligence concerned with the understanding of language by machines.
The goal of this project is to develop an NLP tool that predicts the next word of a phrase given some context.
The training data provided consists of a compilation of text from three sources: Twitter, blogs, and news. The main characteristics of the data are the following:
| Source  | Word count | Line count | File size (MB) |
|---------|-----------:|-----------:|---------------:|
| Twitter | 30,373,543 | 2,360,148  | 334.48         |
| Blog    | 37,334,131 | 899,288    | 267.76         |
| News    | 2,643,969  | 77,259     | 20.73          |
N-gram language models and Stupid Backoff
Models that assign probabilities to sequences of n words (n-grams) are called n-gram models.
The goal is to compute the probability of a new word given some history. E.g.: \( \small P(word|history) = P(exam|he~studied~and~passed~the) \)
To simplify, we use the n-gram assumption, by which the probability of the next word depends only on the preceding n−1 words. E.g.:
\( \small n = 2 \rightarrow P(exam|he~studied~and~passed~the) \approx P(exam|the) \)
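In practice, these conditional probabilities are estimated from relative frequencies in the training corpus; for the bigram example above:

\( \small P(exam|the) \approx \frac{count(the~exam)}{count(the)} \)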
We used Stupid Backoff to score candidate words, using up to 3-grams when available (for more information, see Brants et al., 2007):
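\( \small S(w_i \mid w_{i-2}~w_{i-1}) = \begin{cases} \frac{count(w_{i-2}~w_{i-1}~w_i)}{count(w_{i-2}~w_{i-1})} & \text{if } count(w_{i-2}~w_{i-1}~w_i) > 0 \\ \alpha \cdot S(w_i \mid w_{i-1}) & \text{otherwise} \end{cases} \)

The recursion ends at the unigram relative frequency \( \small S(w_i) = \frac{count(w_i)}{N} \), with \( \small \alpha = 0.4 \) as recommended by Brants et al. The sketch below illustrates this scoring scheme in R, assuming the n-gram counts are kept in named numeric vectors keyed by space-joined n-grams; the table names, helper function, and toy counts are illustrative, not the application's actual code.

```r
# Illustrative sketch of Stupid Backoff scoring (Brants et al., 2007).
# Assumes counts stored as named numeric vectors keyed by space-joined n-grams.
count_of <- function(tbl, key) if (key %in% names(tbl)) tbl[[key]] else 0

sb_score <- function(word, context, n1, n2, n3, alpha = 0.4) {
  context <- tail(context, 2)                    # a 3-gram model sees the last 2 words
  if (length(context) == 2) {
    tri  <- paste(c(context, word), collapse = " ")
    hist <- paste(context, collapse = " ")
    if (count_of(n3, tri) > 0) return(count_of(n3, tri) / count_of(n2, hist))
    return(alpha * sb_score(word, context[2], n1, n2, n3, alpha))   # back off to 2-gram
  }
  if (length(context) == 1) {
    bi <- paste(c(context, word), collapse = " ")
    if (count_of(n2, bi) > 0) return(count_of(n2, bi) / count_of(n1, context))
    return(alpha * sb_score(word, character(0), n1, n2, n3, alpha)) # back off to 1-gram
  }
  count_of(n1, word) / sum(n1)                   # unigram relative frequency
}

# Toy counts built from the sentence "he studied and passed the exam"
n1 <- c(he = 1, studied = 1, and = 1, passed = 1, the = 1, exam = 1)
n2 <- c("passed the" = 1, "the exam" = 1)
n3 <- c("passed the exam" = 1)
sb_score("exam", c("passed", "the"), n1, n2, n3)  # 1.0 (observed 3-gram)
sb_score("exam", c("failed", "the"), n1, n2, n3)  # 0.4 (backs off to "the exam")
```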
The Shiny application
We developed the following Shiny application based on this algorithm:
Then, just type some text in the input bar and click the Predict button. The three most likely words, together with the single most probable one, are displayed below.
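As a rough illustration of this interface, a minimal Shiny skeleton could look like the sketch below; the `predict_next_word()` stub stands in for the Stupid Backoff predictor and is not the application's actual code.

```r
library(shiny)

# Placeholder standing in for the Stupid Backoff predictor described above
predict_next_word <- function(text) c("exam", "test", "course")

ui <- fluidPage(
  textInput("phrase", "Type some text:"),
  actionButton("go", "Predict"),
  h4("Top 3 candidates"), textOutput("top3"),
  h4("Most likely word"), textOutput("best")
)

server <- function(input, output) {
  candidates <- eventReactive(input$go, predict_next_word(input$phrase))
  output$top3 <- renderText(paste(candidates(), collapse = ", "))
  output$best <- renderText(candidates()[1])
}

shinyApp(ui, server)
```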
Thorsten Brants et al. 2007. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858–867.
Accessed from: https://www.aclweb.org/anthology/D07-1090.pdf