This project is the capstone project for the Data Science Specialization from Johns Hopkins University. In this project, I demonstrate how to build a simple model that predicts the next word based on the previous word or words.
You can access the app at: https://siyangni.shinyapps.io/ngram_prediction/
If you are interested in how I preprocessed the data for the n-gram algorithm, here is a short tutorial I wrote: RPubs - Milestone Report
N-gram models are a simple type of language model that assigns probabilities to sequences of words, and they are a common approach to language modeling. The “n” in n-gram stands for the number of words in the sequence and can be any positive integer. For example, a 1-gram is a single word, a 2-gram is a two-word sequence, and a 3-gram is a three-word sequence (a minimal code sketch follows the examples below):
2-gram: “This is” or “is a”
3-gram: “is a great”
4-gram: “She stood up slowly”
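To make this concrete, here is a minimal sketch of n-gram extraction in base R. The function and variable names are my own illustration, not the app's actual preprocessing code (that is covered in the Milestone Report):

```r
# Minimal n-gram extractor: slide a window of length n over the words.
ngrams <- function(text, n) {
  words <- unlist(strsplit(tolower(text), "\\s+"))
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngrams("She stood up slowly", 2)
#> [1] "she stood"  "stood up"   "up slowly"
```

Counting how often each extracted n-gram appears in a corpus is what gives the model its probabilities.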
The app runs well and is simple to navigate. You can toggle the number of contextual words and enter your own sentences/words on the left-hand side; the right-hand side shows the three most likely predicted continuations and also tells you how many contextual words were actually used for the prediction.
It will tell you that no match was found when you enter words/phrases that are not in the training data, as shown below. Given the limitations of the platform (Shiny's free tier only accepts very small apps), the training vocabulary is limited. However, as long as the words/sentences you enter have been seen in training, the app does what it is designed to do.
The App UI
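The "fall back to fewer contextual words when the full phrase is unseen" behavior described above is a classic backoff strategy. Here is a hypothetical sketch of how such a lookup could work; the function names and the `tables` structure are my own assumptions, not the deployed app's code:

```r
# Backoff lookup over pre-computed n-gram frequency tables.
# `tables` is assumed to be a list of data frames, one per context length,
# each with columns `prefix`, `word`, and `count`,
# e.g. tables[[2]] might hold prefix = "stood up", word = "slowly", count = 12.
predict_next <- function(phrase, tables, max_context = 3, top_k = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  words <- words[words != ""]
  if (length(words) == 0)
    return(list(context_used = 0, predictions = character(0)))

  # Try the longest available context first, then back off to shorter ones.
  for (n in seq(min(max_context, length(words)), 1)) {
    prefix <- paste(tail(words, n), collapse = " ")
    hits <- tables[[n]][tables[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$count), ]
      return(list(context_used = n,
                  predictions = head(hits$word, top_k)))
    }
  }
  list(context_used = 0, predictions = character(0))  # no match found
}
```

Reporting `context_used` back to the user is what lets the app display how many contextual words actually drove the prediction.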
I hope this course and my little demonstration take you into the world of Natural Language Processing (NLP). NLP itself deserves a two-semester course, but this is a good point of departure.
The n-gram model is clever and extremely computationally cheap. It is one of the traditional language modeling algorithms that emphasize analytic visibility (meaning we can mathematically derive how it works) and computational efficiency. However, even if we dramatically increase the training set size and the n-gram order, the model's performance plateaus quickly.
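To make "analytic visibility" concrete: the textbook maximum-likelihood estimate behind an n-gram model is nothing more than a ratio of corpus counts,

$$
P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1} \cdots w_i)}{C(w_{i-n+1} \cdots w_{i-1})}
$$

where $C(\cdot)$ counts how many times a word sequence occurs in the training corpus. Smoothing and backoff adjust this ratio for unseen sequences, but the arithmetic stays fully inspectable.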
Nowadays, predictive language models rely on deep learning, and some architectures I recommend you learn are:
Recurrent Neural Networks (especially LSTMs)
Autoencoders
Transformers
Lastly, my personal advice is to use Python for anything NLP, because you'll very likely need to use TensorFlow or PyTorch as you go on in your learning.