Text Prediction Final Project

2022-12-30

Overview

This application was developed as part of the Data Science specialization. Steps required in this project included:

- getting and cleaning data
- exploratory data analysis
- data modeling
- algorithm implementation

Shiny Application

This application suggests the next word in a sentence using an n-gram model with the back-off method: it looks for the longest n-gram that matches the words typed so far and, when none is found, backs off to a shorter one. An n-gram is a contiguous sequence of n words from a given sequence of text.
The text used to build the predictive text model came from a large corpus of blogs, news and Twitter data. N-grams were extracted from the corpus and then used to build the model.
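
To make the idea concrete, here is a minimal Python sketch of n-gram extraction and back-off prediction. It is an illustration only, not the app's own code (the app is built with R and Shiny). The function names, the toy corpus and the choice of trigrams as the highest order are assumptions, and the sketch simply picks the most frequent continuation at the longest matching order, without the discounting or scoring that some back-off variants apply.

```python
from collections import Counter, defaultdict


def extract_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_model(sentences, max_n=3):
    """Map each context (the first n-1 words of an n-gram) to next-word counts."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for gram in extract_ngrams(tokens, n):
                model[gram[:-1]][gram[-1]] += 1
    return model


def predict_next(model, text, max_n=3):
    """Try the longest context first, then back off to shorter ones."""
    tokens = text.lower().split()
    for n in range(max_n, 0, -1):
        # The context is the last n-1 words typed; for n == 1 it is empty.
        context = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
        if len(context) == n - 1 and context in model:
            return model[context].most_common(1)[0][0]
    return None


# Toy corpus (hypothetical); the real model was built from a much larger sample.
corpus = [
    "i love new york",
    "i love new shoes",
    "i love new york pizza",
    "we love old movies",
]
model = build_model(corpus)
print(predict_next(model, "i love new"))  # trigram context ("love", "new") is found
print(predict_next(model, "brand new"))   # unseen trigram context, backs off to ("new",)
```

In the second call the two-word context was never seen, so the lookup falls back to the single word "new" and still returns a suggestion. With no matching context at all, the sketch falls back to the most frequent single word.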

The Shiny Application

The predictive text model was built from a sample of 800,000 lines extracted from the large corpus of blogs, news and Twitter data.
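
The source does not say how the 800,000 lines were chosen. As one possible approach, the Python sketch below draws a uniform random sample of k lines in a single pass (reservoir sampling), which avoids loading the whole corpus into memory. The function name, the seed value and the use of random sampling at all are assumptions made for illustration.

```python
import random


def sample_lines(lines, k, seed=2022):
    """Reservoir-sample k lines from an iterable in one pass."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


# Usage with a stand-in corpus; a real run would iterate over the corpus files.
fake_corpus = (f"line {i}" for i in range(10_000))
sample = sample_lines(fake_corpus, k=100)
print(len(sample))
```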

The sample data was then cleaned and tokenized using the tm package and a number of regular expressions applied with the gsub function. As part of the cleaning process the data was converted to lowercase, and all non-ASCII characters, URLs, email addresses, Twitter handles, hashtags, ordinal numbers, profane words, punctuation and extra whitespace were removed. The data was then split into tokens (n-grams).
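
The cleaning steps listed above can be sketched as a chain of regular-expression substitutions. The version below is written in Python with re.sub as a rough analogue of R's gsub. It is not the app's actual code. The patterns, the order of the steps and the tiny placeholder profanity list are all assumptions.

```python
import re

# Hypothetical placeholder list; the source does not say which profanity list was used.
PROFANE_WORDS = {"darn", "heck"}


def clean_text(text, profane_words=PROFANE_WORDS):
    """Apply the cleaning steps in order and return a space-separated string."""
    text = text.lower()
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII characters
    text = re.sub(r"(https?://|www\.)\S+", " ", text)       # URLs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)               # email addresses (before handles)
    text = re.sub(r"[@#]\w+", " ", text)                    # Twitter handles and hashtags
    text = re.sub(r"\b\d+(st|nd|rd|th)\b", " ", text)       # ordinal numbers such as 21st
    text = text.replace("'", "")                            # join contractions: don't -> dont
    text = re.sub(r"[^a-z0-9\s]", " ", text)                # remaining punctuation
    # split() collapses runs of whitespace; profane words are dropped here.
    words = [w for w in text.split() if w not in profane_words]
    return " ".join(words)


raw = "Visit https://example.com NOW!! Mail me@example.com, thanks @john #blessed on the 21st. Darn good ☕"
print(clean_text(raw))
```

Order matters in this sketch: email addresses are removed before handles so that the part after the @ sign is not mistaken for a Twitter handle, and punctuation is stripped last so that the URL and email patterns can still match.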

User Interface

When the app detects that you have finished typing one or more words, it shows the top prediction for the next word. There are instructions on the left-hand side and an “About” page for the app on the right-hand side.