Coursera Data Science Capstone Project

Ghida Ibrahim

This presentation describes the capstone project completed as part of the Coursera Data Science Specialization, offered by Johns Hopkins University in partnership with SwiftKey.

Objective & Steps

The goal of this capstone project is to develop a text prediction app that predicts the next word from the words already typed. The steps involved are:

  • Sampling a large corpus of English text drawn from news articles, tweets and blogs
  • Loading and cleaning sampled data
  • Tokenizing the sampled data into unigrams, bigrams, trigrams and quadgrams, and storing these n-grams with their frequencies in data frames
  • Building a prediction model from these data frames
  • Building a Shiny app that uses the prediction model to predict the next word
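
The cleaning and tokenizing steps listed above can be sketched in a few lines. The project itself was written in R; the Python below is only a language-neutral illustration, and the function names `clean` and `ngram_counts`, the regular expressions and the two sample sentences are all assumptions made for this example rather than the project's actual code.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text and strip links, digits, punctuation and extra white space."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop links
    text = re.sub(r"[^a-z\s]", " ", text)               # drop digits, punctuation, special characters
    return re.sub(r"\s+", " ", text).strip()            # collapse white space

def ngram_counts(lines, n):
    """Count every run of n consecutive words across the cleaned lines."""
    counts = Counter()
    for line in lines:
        words = clean(line).split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical two-line sample standing in for the news, tweet and blog data.
sample = ["I love data science!", "I love R, see http://example.com 2 times"]
bigrams = ngram_counts(sample, 2)
quadgrams = ngram_counts(sample, 4)
```

Each `Counter` plays the role of one frequency data frame: the key is the n-gram and the value is how often it occurred in the sample.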

The Prediction Model

  • After creating a data sample from the HC Corpora data, the sample was cleaned by converting it to lowercase and removing punctuation, links, extra white space, numbers and special characters. The cleaned sample was then tokenized into n-grams.

  • The aggregated bigram, trigram and quadgram term frequency matrices were then converted into frequency dictionaries.

  • The resulting data frames are used to predict the next word from the text entered by the user and the frequencies stored in the underlying n-gram tables.
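
One common way to turn such frequency dictionaries into a prediction is a simple backoff lookup: try the longest context first (the last three words against the quadgrams), then fall back to shorter contexts until a match is found. The slides do not state the exact rule the model uses, so the backoff order, the function names `build_tables` and `predict_next`, and the toy corpus below are assumptions made for illustration. The sketch is in Python although the project itself was written in R.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4):
    """For n = 2..max_n, map each (n-1)-word context to a Counter of following words."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[n][context][words[i + n - 1]] += 1
    return tables

def predict_next(text, tables, max_n=4):
    """Return the most frequent follower of the longest matching context, or None."""
    words = text.lower().split()
    for n in range(max_n, 1, -1):          # quadgrams, then trigrams, then bigrams
        if len(words) < n - 1:
            continue                       # not enough typed words for this context length
        context = tuple(words[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None                            # no n-gram matched the input

# Hypothetical toy corpus standing in for the cleaned sample.
corpus = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow back",
    "look at the sky",
]
tables = build_tables(corpus)
print(predict_next("thanks for the", tables))
```

With this toy corpus, the three-word context "thanks for the" is followed by "follow" twice and by "help" once, so the quadgram table already settles the prediction and no backoff is needed.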

The Shiny App

The app can be found here: https://ghida.shinyapps.io/Shiny_App_Capstone_Project/

  • The UI.R file contains a shinyUI function that displays a text input box, where the user types words, and a text output box, where the predicted next word is shown
  • The Server.R file loads the transformed bigram and trigram data frames and sources the prediction model. It contains a shinyServer function that reads the user input and computes the predicted next word by applying the sourced prediction model

Additional Information