The goal of this project is to build an N-gram model that predicts the next word of an incomplete sentence, trained on three text files containing collections of:
- Blogs
- News Articles
- Tweets
Click here to download the three datasets above.
Before constructing the model, the text corpus needed to be cleaned: punctuation, stop words, and unnecessary whitespace were removed, and all text was converted to lowercase. With a clean corpus, we can examine which single words appear most frequently before diving into N-grams; the word cloud below illustrates this.
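As a rough illustration of this cleaning step, the Python sketch below lowercases the text, strips punctuation and extra whitespace, and drops a small stop-word list. The project itself likely used R packages for this, and the stop-word set here is only a placeholder, not the list actually used.

```python
import re

# Minimal placeholder stop-word list; a real pipeline would use a fuller set.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def clean_text(text: str) -> list[str]:
    """Lowercase, strip punctuation and extra whitespace, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and digits
    tokens = text.split()                  # split() also collapses whitespace
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_text("The quick, brown fox JUMPS over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```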
A detailed article outlines the process of exploring the cleaned data, focusing on the most frequently occurring unigrams, bigrams, and trigrams in the datasets. Bar charts were constructed to visualize these frequencies.
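For illustration, the sketch below counts unigram, bigram, and trigram frequencies from a tokenized corpus; the helper name and the use of `collections.Counter` are my own choices, not necessarily how the project produced its charts.

```python
from collections import Counter

def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count all n-grams (as tuples of words) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "one ring to rule them all one ring to find them".split()

for n in (1, 2, 3):
    print(f"top {n}-grams:", ngram_counts(tokens, n).most_common(3))
# top 1-grams: [(('one',), 2), (('ring',), 2), (('to',), 2)]
# top 2-grams: [(('one', 'ring'), 2), (('ring', 'to'), 2), (('to', 'rule'), 1)]
# top 3-grams: [(('one', 'ring', 'to'), 2), (('ring', 'to', 'rule'), 1), ...]
```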
The web application, built in Shiny, presents a predictive model that guesses the most likely next word of an incomplete sentence. Users can also toggle between bar charts of the most frequently occurring N-grams; unigrams, bigrams, and trigrams are currently supported.
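The prediction logic itself is not spelled out in this post. A common approach for this kind of app is a simple back-off over the N-gram tables, as in the hypothetical Python sketch below; the Shiny app may well implement something more refined (e.g. weighted stupid backoff or Kneser-Ney smoothing).

```python
from collections import Counter

def predict_next(tokens: list[str], bigrams: Counter, trigrams: Counter):
    """Guess the next word by backing off from trigrams to bigrams.

    Hypothetical illustration only: look up the last two words in the
    trigram table, fall back to the last word in the bigram table, and
    return the most frequent continuation found (None if nothing matches).
    """
    if len(tokens) >= 2:
        candidates = Counter({g[2]: c for g, c in trigrams.items()
                              if g[:2] == tuple(tokens[-2:])})
        if candidates:
            return candidates.most_common(1)[0][0]
    if tokens:
        candidates = Counter({g[1]: c for g, c in bigrams.items()
                              if g[0] == tokens[-1]})
        if candidates:
            return candidates.most_common(1)[0][0]
    return None

corpus = "one ring to rule them all one ring to find them".split()
bigrams = Counter(tuple(corpus[i:i + 2]) for i in range(len(corpus) - 1))
trigrams = Counter(tuple(corpus[i:i + 3]) for i in range(len(corpus) - 2))

print(predict_next("one ring".split(), bigrams, trigrams))  # 'to'
```

In practice the back-off would typically down-weight lower-order matches rather than treating them equally, but the sketch captures the basic idea of falling back to shorter contexts when a longer one has never been seen.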