The goal of this project is to build an N-gram model that predicts the next word of an incomplete sentence, trained on three text files containing collections of:
- Blogs
- News Articles
- Tweets
Click here to download the three datasets above.
Before constructing the model, the text corpus needed to be cleaned: punctuation, stop words, and unnecessary whitespace were removed, and all text was converted to lowercase. With a clean corpus, we can examine which single words appear most frequently before diving into N-grams; the word cloud below illustrates this.
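As a rough illustration of this cleaning step, the Python sketch below lowercases the text, strips punctuation and extra whitespace, and drops a small stop-word list. The project itself likely used R packages for this, and the stop-word set here is only a placeholder, not the list actually used.

```python
import re

# Minimal placeholder stop-word list; a real pipeline would use a fuller set.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def clean_text(text: str) -> list[str]:
    """Lowercase, strip punctuation and extra whitespace, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and digits
    tokens = text.split()                  # split() also collapses whitespace
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_text("The quick, brown fox JUMPS over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```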
A detailed article outlines the process of exploring the cleaned data, focusing on the most frequently occurring unigrams, bigrams, and trigrams in the datasets. Bar charts were constructed to visualize these frequencies.
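For illustration, the sketch below counts unigram, bigram, and trigram frequencies from a tokenized corpus; the helper name and the use of `collections.Counter` are my own choices, not necessarily how the project produced its charts.

```python
from collections import Counter

def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count all n-grams (as tuples of words) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "one ring to rule them all one ring to find them".split()

for n in (1, 2, 3):
    print(f"top {n}-grams:", ngram_counts(tokens, n).most_common(3))
# top 1-grams: [(('one',), 2), (('ring',), 2), (('to',), 2)]
# top 2-grams: [(('one', 'ring'), 2), (('ring', 'to'), 2), (('to', 'rule'), 1)]
# top 3-grams: [(('one', 'ring', 'to'), 2), (('ring', 'to', 'rule'), 1), ...]
```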
The web application, built in Shiny, presents a predictive model that guesses the most likely next word of an incomplete sentence. Users can also toggle between bar charts of the most frequently occurring N-grams; unigrams, bigrams, and trigrams are currently supported.
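The prediction logic itself is not spelled out in this post. A common approach for this kind of app is a simple back-off over the N-gram tables, as in the hypothetical Python sketch below; the Shiny app may well implement something more refined (e.g. weighted stupid backoff or Kneser-Ney smoothing).

```python
from collections import Counter

def predict_next(tokens: list[str], bigrams: Counter, trigrams: Counter):
    """Guess the next word by backing off from trigrams to bigrams.

    Hypothetical illustration only: look up the last two words in the
    trigram table, fall back to the last word in the bigram table, and
    return the most frequent continuation found (None if nothing matches).
    """
    if len(tokens) >= 2:
        candidates = Counter({g[2]: c for g, c in trigrams.items()
                              if g[:2] == tuple(tokens[-2:])})
        if candidates:
            return candidates.most_common(1)[0][0]
    if tokens:
        candidates = Counter({g[1]: c for g, c in bigrams.items()
                              if g[0] == tokens[-1]})
        if candidates:
            return candidates.most_common(1)[0][0]
    return None

corpus = "one ring to rule them all one ring to find them".split()
bigrams = Counter(tuple(corpus[i:i + 2]) for i in range(len(corpus) - 1))
trigrams = Counter(tuple(corpus[i:i + 3]) for i in range(len(corpus) - 2))

print(predict_next("one ring".split(), bigrams, trigrams))  # 'to'
```

In practice the back-off would typically down-weight lower-order matches rather than treating them equally, but the sketch captures the basic idea of falling back to shorter contexts when a longer one has never been seen.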