Coursera Data Science Capstone Project

Sandra Ezidiegwu

This presentation describes the functionality and usefulness of an application built for next-word prediction.

The application was built as part of the capstone project for the Coursera Data Science Specialization, taught by professors at Johns Hopkins University in cooperation with SwiftKey.

Objective

Build an algorithm that predicts the next word following a word or phrase entered by the application user. Building the prediction algorithm involved the following steps:
- Exploratory Analysis
- Data Cleaning
- Creation of Frequency Dictionaries (uni-, bi-, tri-, and quad-grams)
- Creation of the Prediction Algorithm
- Creation of a Shiny App in RStudio to demonstrate the prediction algorithm

Process Description

Data Sampling and Cleaning: The data was randomly sampled, then cleaned by converting it to lowercase and applying regular expressions to remove punctuation, special characters, etc.
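The sampling and cleaning step could look roughly like the base R sketch below. The helper name clean_text, the sample size, the example lines, and the exact regular expressions are illustrative assumptions, not the application's actual code; in particular, dropping digits is an assumption of this sketch.

```r
# Hypothetical helper: lowercases text and strips everything except
# letters, apostrophes, and spaces. Dropping digits is an assumption.
clean_text <- function(x) {
  x <- tolower(x)                  # convert to lowercase
  x <- gsub("[^a-z' ]+", " ", x)   # punctuation, digits, special characters -> space
  x <- gsub(" +", " ", x)          # collapse repeated spaces
  trimws(x)                        # drop leading and trailing spaces
}

# Made-up input lines standing in for the raw text data.
raw_lines <- c("Hello, World!!",
               "It's 5 o'clock -- time to go...",
               "R & Shiny: #fun")

set.seed(42)                             # make the random sample reproducible
sampled <- sample(raw_lines, size = 2)   # random sample (size is an assumption)
cleaned <- clean_text(sampled)
```

Replacing unwanted characters with a space, rather than deleting them, keeps words that were separated only by punctuation from being glued together.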
Corpus and N-Grams: The cleaned sample was converted to a vector corpus and then tokenized into uni-, bi-, tri-, and quad-grams using the tau package in R.
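To illustrate what n-gram tokenization produces, here is a minimal base R sketch. It is a stand-in for the idea only and does not use or reproduce the tau package's interface; the helper name make_ngrams and the example sentence are assumptions.

```r
# Hypothetical helper: returns all word n-grams of one cleaned string.
make_ngrams <- function(text, n) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]                  # ignore empty tokens
  if (length(words) < n) return(character(0))    # text too short for this n
  starts <- seq_len(length(words) - n + 1)       # start index of each n-gram
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

text <- "the quick brown fox jumps"
unigrams  <- make_ngrams(text, 1)
bigrams   <- make_ngrams(text, 2)
trigrams  <- make_ngrams(text, 3)
quadgrams <- make_ngrams(text, 4)
```

A sentence of w words yields w - n + 1 n-grams, so the five-word example gives four bigrams and two quad-grams.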
Frequency Dictionary and Prediction: A frequency matrix was created for each n-gram order and converted into a frequency dictionary stored as a data frame. These data frames were then used to predict the next word.
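One way the frequency dictionaries could drive a prediction is sketched below in base R. The source does not describe the lookup strategy, so the back-off rule used here (try the longest matching context first, then shorten it, then fall back to the most frequent single word) is an assumption, as are all function names and the toy corpus.

```r
# All names here (ngrams_of, build_dictionary, predict_next) are hypothetical.

# Word n-grams of one cleaned string.
ngrams_of <- function(text, n) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency dictionary: one row per distinct n-gram, split into its
# leading words (prefix) and its final word (next_word).
build_dictionary <- function(texts, n) {
  grams  <- unlist(lapply(texts, ngrams_of, n = n))
  counts <- table(grams)
  data.frame(prefix    = sub(" ?[^ ]+$", "", names(counts)),
             next_word = sub("^.* ", "", names(counts)),
             freq      = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Assumed back-off lookup: longest context first, then shorter ones.
predict_next <- function(input, dicts) {
  words <- strsplit(tolower(trimws(input)), " +")[[1]]
  if (length(words) > 0) {
    for (k in seq(min(length(words), 3), 1)) {
      context <- paste(tail(words, k), collapse = " ")
      d    <- dicts[[k + 1]]               # (k + 1)-gram dictionary
      hits <- d[d$prefix == context, ]
      if (nrow(hits) > 0) return(hits$next_word[which.max(hits$freq)])
    }
  }
  # Nothing matched: fall back to the most frequent single word.
  dicts[[1]]$next_word[which.max(dicts[[1]]$freq)]
}

# Toy corpus (made up) and dictionaries for n = 1 to 4.
corpus <- c("i love data science", "i love r",
            "i love data", "data science is fun")
dicts <- lapply(1:4, function(n) build_dictionary(corpus, n))

p1 <- predict_next("i love", dicts)       # trigram match
p2 <- predict_next("I love data", dicts)  # quad-gram match
p3 <- predict_next("we love", dicts)      # backs off to the bigram dictionary
```

Storing the prefix and the final word in separate columns turns prediction into a simple row filter followed by picking the highest frequency.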

Application How-To

To use the application, enter a word or phrase in the text box; the application will then try to predict the next word. The prediction is shown in blue.

Additional Details

The next-word prediction application can be accessed here: Shiny App
The application code can be found on GitHub
To learn more about the Coursera Data Science Specialization, visit Coursera