Coursera Data Science Capstone Project

Areddy Anuradha Chowdary

This presentation will briefly but comprehensively pitch an application for predicting the next word.

The application is the capstone project for the Coursera Data Science specialization, taught by professors at Johns Hopkins University in cooperation with SwiftKey.

The Objective

The main goal of this capstone project is to build a Shiny application that can predict the next word.

This exercise was divided into seven subtasks, including data cleansing, exploratory analysis and the creation of a predictive model.

All text data used to build the frequency dictionaries, and thus to predict the next word, comes from a corpus called HC Corpora.

All text mining and natural language processing was done with a variety of well-known R packages.

The Applied Methods & Models

After creating a data sample from the HC Corpora data, this sample was cleaned by converting it to lowercase and removing punctuation, links, white space, numbers and all kinds of special characters. The cleaned sample was then tokenized into so-called n-grams.
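The cleaning and tokenization steps can be sketched as follows. The project itself used R packages, so this Python version is only an illustration under assumptions: the function names `clean_text` and `ngrams` are hypothetical, and the regular expressions are one plausible reading of "links" and "special characters", not the rules actually applied.

```python
import re

def clean_text(text):
    """Lowercase the text and strip links, numbers, punctuation and extra white space."""
    text = text.lower()
    # Assumption: a "link" is anything starting with http(s):// or www.
    text = re.sub(r"(https?://|www\.)\S+", " ", text)
    # Replace everything that is not a letter or white space
    # (numbers, punctuation, special characters) with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of white space into a single space.
    return re.sub(r"\s+", " ", text).strip()

def ngrams(text, n):
    """Return all contiguous n-word sequences of the cleaned text as tuples."""
    words = clean_text(text).split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

cleaned = clean_text("Visit http://example.com NOW, at 10am!!")
bigrams = ngrams("The quick brown fox.", 2)
```

Note that this simple rule also splits contractions ("don't" becomes "don t"); a real pipeline may want to treat apostrophes separately.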

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. (Source)

The aggregated bi-, tri- and quadgram term frequency matrices were then converted into frequency dictionaries.
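As a rough illustration of what such a frequency dictionary contains, the sketch below counts n-grams in a tiny made-up token list. It is written in Python rather than R, and the function name `frequency_dictionary` and the sample sentence are assumptions for the example only.

```python
from collections import Counter

def frequency_dictionary(tokens, n):
    """Count every contiguous n-gram and return (ngram, count) rows, most frequent first."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common()

# Toy input standing in for the cleaned corpus sample.
tokens = "i love new york and i love new shoes".split()
bigram_freq = frequency_dictionary(tokens, 2)

for ngram, count in bigram_freq:
    print(" ".join(ngram), count)
```

Each row pairs an n-gram with its count, which is the same information a two-column data.frame of terms and frequencies would hold.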

The resulting data.frames are used to predict the next word from the text entered by the user and the frequencies in the underlying n-gram tables.
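One common way to combine several n-gram tables is to look up the longest available context first and fall back to shorter ones when there is no match. The slides do not say which lookup strategy the application uses, so the back-off order below is an assumption, as are the names `build_tables` and `predict_next` and the toy corpus. The sketch is in Python, not the R used in the project.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=4):
    """Map each (n-1)-word prefix to a Counter of the words that followed it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in tables:
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            tables[n][prefix][next_word] += 1
    return tables

def predict_next(text, tables):
    """Try the quadgram table first, then back off to trigrams and bigrams."""
    words = text.lower().split()
    for n in sorted(tables, reverse=True):
        prefix = tuple(words[-(n - 1):])
        # Skip this table if the input is too short to form a full prefix.
        if len(prefix) == n - 1 and prefix in tables[n]:
            return tables[n][prefix].most_common(1)[0][0]
    return None  # no n-gram matches the input

# Toy corpus standing in for the HC Corpora sample.
tokens = "i love new york and i love new shoes and i love new york".split()
tables = build_tables(tokens)

print(predict_next("I love new", tables))  # matched in the quadgram table
print(predict_next("brand new", tables))   # falls back to the bigram table
```

With this toy corpus both calls return "york", because "york" follows "new" more often than "shoes" does at every context length.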

The Usage Of The Application

The user interface of this application was designed with a mobile-first approach. As the user enters text (1), the field with the predicted next word (2) refreshes instantly, and the whole text input (3) is displayed as well.

Application Screenshot

Additional Information