Coursera Data Science Capstone Project

Sandra Ezidiegwu

This presentation describes the functionality and usefulness of the application built for next word predictions.

The application was built as part of the capstone project for the Coursera Data Science Specialization, taught by professors of Johns Hopkins University in cooperation with SwiftKey.

Objective

  1. Build an algorithm that predicts the next word after a word or phrase entered by the application user. The prediction algorithm was built in the following steps:

    • Exploratory Analysis
    • Data Cleaning
    • Creation of Frequency Dictionaries, e.g. uni/bi/tri/quad-grams
    • Algorithm Creation
  2. Create a Shiny app, using RStudio, to demonstrate the prediction algorithm

Process Description

  1. Data Sampling and Cleaning: The data was randomly sampled, then cleaned by converting it to lowercase and applying regex functions to remove punctuation, special characters, etc.

  2. Corpus and N-Grams: The cleaned sample was converted to a vector corpus and then tokenized into uni-, bi-, tri-, and quadgrams using the tau package in R.

  3. Frequency Dictionary and Prediction: A frequency matrix was created for each n-gram size and converted into a frequency dictionary. The resulting data frames were used to predict the next word.
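
The three steps above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the application's R code: the function names (clean, build_tables, predict_next), the sample sentences, and the lookup strategy (try the longest matching context first, then fall back to shorter ones) are all assumptions made for illustration, since the slides do not say how the frequency dictionaries are queried. Random sampling is omitted.

```python
import re
from collections import Counter, defaultdict

def clean(text):
    """Lowercase the text and replace anything that is not a letter with a space."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]+", " ", text)
    return text.split()

def build_tables(lines, max_n=4):
    """For n = 1..max_n, map each (n-1)-word context to a Counter of next words."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                tables[n][tuple(context)][word] += 1
    return tables

def predict_next(tables, phrase, max_n=4):
    """Assumed lookup: try the longest context first, then shorter ones."""
    tokens = clean(phrase)
    for n in range(max_n, 0, -1):
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # the phrase is too short for this n-gram size
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # only reached when the tables are empty

# Hypothetical sample text standing in for the sampled corpus.
sample = [
    "Thanks for the follow!",
    "Thanks for the help.",
    "thanks for the follow back",
    "See you at the park",
]
tables = build_tables(sample)
print(predict_next(tables, "Thanks for the"))
print(predict_next(tables, "at the"))
```

With this sample, the first call finds a quadgram match and the second finds a trigram match. A phrase whose words never appear in the sample falls through to the unigram table and returns its most frequent word.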

Application How-To

To use the application, enter a word or phrase in the text box. The application will then try to predict the next word and show the result in blue.

Additional Details

  • The next word prediction application can be accessed here: Shiny App

  • The application code is available on GitHub

  • To learn more about the Coursera Data Science Specialization course, visit Coursera