Data Science Capstone Project - Text Prediction

Wei Hao Khoong
17/6/2018

This presentation gives a brief walkthrough of an application that predicts the next word, built as the capstone project for the Coursera Data Science Specialization created by Johns Hopkins University (JHU) in cooperation with SwiftKey.

[Image: SwiftKey, Bloomberg & Coursera logos]

Objective

The main goal of this capstone project is to build a Shiny application that predicts the next word a user is likely to type. The work was divided into tasks including data cleansing, exploratory analysis and the creation of a predictive model. All text data used to build the word-frequency dictionary behind the predictions comes from the HC Corpora corpus.

Methods & Models

After importing the required data and drawing a sample from the HC Corpora data, the sample was cleaned by converting it to lowercase and removing punctuation, links, excess white space, numbers and other special characters.
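The cleaning steps above could be sketched roughly as follows. This is an illustrative Python sketch, not the code used in the application; the function name `clean_text` and the exact regular expressions are assumptions chosen for the example.

```python
import re


def clean_text(text: str) -> str:
    """Normalize raw text: lowercase, then drop links, digits and symbols.

    Hypothetical helper for illustration; the real app's rules may differ.
    """
    text = text.lower()
    # Remove links (anything starting with http://, https:// or www.)
    text = re.sub(r"(https?://|www\.)\S+", " ", text)
    # Drop apostrophes so contractions stay one token ("it's" -> "its")
    text = re.sub(r"['’]", "", text)
    # Replace numbers, punctuation and other special characters with spaces
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of white space and trim the ends
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Visit https://example.com NOW!!  It's 2018..."))
```

Links are removed before punctuation so that a URL is dropped as a whole rather than being split into stray word fragments.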

The cleaned sample was then tokenized (tokenization is the process of demarcating, and possibly classifying, sections of a string of input characters) into n-grams (an n-gram is a contiguous sequence of n items from a given sequence of text or speech).
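As a minimal illustration of tokenizing text into n-grams, the Python sketch below splits already-cleaned text on white space and slides a window of n words across the tokens. The helper names `tokenize` and `ngrams` are hypothetical, and white-space splitting is an assumption about how tokens are delimited.

```python
def tokenize(text: str) -> list[str]:
    """Split cleaned text into word tokens on white space."""
    return text.split()


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("the quick brown fox")
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sequence shorter than n simply yields no n-grams, so short inputs need no special handling.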

In particular, aggregated bi-, tri- and quadgram term frequency matrices are imported into dictionaries that store the frequencies. The resulting data frames are then used to predict the next word from the text entered by the user of the application and the frequencies in the n-gram tables.
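One way such frequency tables could drive a prediction is sketched below in Python. The sketch assumes a simple back-off strategy: try the longest available context (quadgram table) first, then fall back to trigrams and bigrams, returning the most frequent continuation. The back-off rule and the names `build_tables` and `predict_next` are assumptions for illustration; the slides do not specify how the application combines the tables.

```python
from collections import Counter, defaultdict


def build_tables(tokens, max_n=4):
    """For n = 2..max_n, map each (n-1)-word context to a Counter of next words."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            tables[n][context][next_word] += 1
    return tables


def predict_next(tables, text):
    """Return the most frequent next word, backing off to shorter contexts."""
    words = text.lower().split()
    for n in sorted(tables, reverse=True):   # quadgram, then trigram, then bigram
        if len(words) < n - 1:
            continue                         # not enough words for this context
        context = tuple(words[-(n - 1):])
        candidates = tables[n].get(context)  # .get avoids creating empty entries
        if candidates:
            return candidates.most_common(1)[0][0]
    return None                              # no table knows this context


corpus = "i love new york i love new music i love new york".split()
tables = build_tables(corpus)
print(predict_next(tables, "I love new"))
print(predict_next(tables, "really love"))
```

Returning `None` for an unseen context keeps the sketch short; a real application would more likely fall back to a common default word.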

Usage Of The Application

Users enter text in the “Enter your text here:” text box, and the predicted next word in the sequence is displayed in the section “The predicted next word:”. The app also displays the text originally entered below the predicted word.

[Figure: Interface screenshot]

References & Useful Links

[1] n-gram. Wikipedia. (https://en.wikipedia.org/wiki/N-gram)

[2] Package 'shiny'. Version 1.1.0. CRAN, 17 May 2018.

To learn more about the Coursera Data Science Specialization by JHU: https://www.coursera.org/specializations/jhu-data-science