Coursera Data Science Capstone Project

Ghida Ibrahim

This is the presentation of the data science capstone project completed as part of the Coursera Data Science Specialization, offered by Johns Hopkins University in partnership with SwiftKey.

Objective & Steps

The goal of this capstone project is to develop a text prediction app that predicts the next word based on the words already typed. The steps involved are:

Sampling a large corpus of English text drawn from news articles, tweets and blogs
Loading and cleaning the sampled data
Tokenizing the sampled data into n-grams (unigrams, bigrams, trigrams and quadgrams) and converting these tokens and their frequencies into data frames
Building a prediction model from the resulting data frames
Building a Shiny app that uses the prediction model to predict the next word
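
The cleaning and tokenizing steps above can be sketched in base R as follows. This is only an illustrative sketch, not the project's actual code: the three sample lines, the function names (clean_text, make_ngrams, ngram_table) and the column names (ngram, freq) are assumptions made for the example, and the real project may rely on dedicated text-mining packages instead.

```r
# Hypothetical sample standing in for the sampled news, tweets and blogs data.
sample_text <- c(
  "Thanks for the follow! See you at 10 http://example.com",
  "See you at the game tonight.",
  "See you at the office"
)

# Clean: lowercase, then drop links, numbers, punctuation and extra white space.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("http\\S+|www\\.\\S+", " ", x, perl = TRUE)  # links
  x <- gsub("[^a-z ]+", " ", x)                          # anything but letters
  x <- gsub(" +", " ", x)                                # repeated spaces
  trimws(x)
}

# Split one cleaned line into its overlapping n-grams.
make_ngrams <- function(line, n) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams over all lines and return a data frame sorted by frequency.
ngram_table <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts),
             freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

cleaned  <- clean_text(sample_text)
bigrams  <- ngram_table(cleaned, 2)
trigrams <- ngram_table(cleaned, 3)
head(trigrams)
```

Note that stripping punctuation joins neighbouring sentences, so some n-grams in this sketch span a sentence boundary.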

The Prediction Model

After a data sample was created from the HC Corpora data, it was cleaned by converting it to lowercase and removing punctuation, links, extra white space, numbers and special characters. The cleaned sample was then tokenized into n-grams.
The aggregated bigram, trigram and quadgram term frequency matrices were then converted into frequency dictionaries stored as data frames.
The resulting data frames are used to predict the next word by matching the text entered by the user of the app against the n-gram tables and their frequencies.
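
One way such a frequency lookup could work is sketched below in base R. The tables, their column names (ngram, freq) and the predict_next function are hypothetical, and the rule of trying the longest matching context first and then falling back to a shorter one is an assumption made for illustration; the description above does not state how the actual model combines the tables.

```r
# Hypothetical frequency tables; the real ones are built from the sampled corpus.
bigrams <- data.frame(
  ngram = c("you at", "at the", "at home"),
  freq  = c(5L, 9L, 4L),
  stringsAsFactors = FALSE
)
trigrams <- data.frame(
  ngram = c("see you at", "see you soon", "you at the"),
  freq  = c(3L, 7L, 2L),
  stringsAsFactors = FALSE
)

# Return the most frequent continuation of the input, or NA if nothing matches.
# `tables` is a list of n-gram data frames named by their order ("2", "3", ...).
predict_next <- function(input, tables) {
  words <- strsplit(trimws(tolower(input)), " +")[[1]]
  # Assumed strategy: try the longest context first, then shorter ones.
  for (n in sort(as.integer(names(tables)), decreasing = TRUE)) {
    k <- n - 1                       # context words needed for an n-gram table
    if (length(words) < k) next
    prefix <- paste(utils::tail(words, k), collapse = " ")
    tab <- tables[[as.character(n)]]
    hits <- tab[startsWith(tab$ngram, paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      best <- hits$ngram[which.max(hits$freq)]
      return(sub(".* ", "", best))   # keep only the last word of the n-gram
    }
  }
  NA_character_
}

tables <- list("2" = bigrams, "3" = trigrams)
predict_next("See you", tables)      # matched in the trigram table
predict_next("meet me at", tables)   # no trigram match, falls back to bigrams
```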

The Shiny App

The app can be found here: https://ghida.shinyapps.io/Shiny_App_Capstone_Project/

The UI.R file contains a shinyUI function that defines a text input box, where the user types words, and a text output box, where the predicted next word is displayed.
The Server.R file imports the transformed bigram and trigram data frames and sources the prediction model. It contains a shinyServer function that reads the user input and computes the predicted next word by applying the sourced prediction model.
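
A minimal sketch of this input/output wiring is shown below. It is not the app's actual code: it uses a single-file layout with fluidPage rather than separate UI.R and Server.R files, the input and output ids (user_text, next_word) are made up for the example, and predict_next is a placeholder standing in for the sourced prediction model.

```r
library(shiny)

# Placeholder predictor (assumption); the real app sources its prediction model.
predict_next <- function(text) {
  if (!nzchar(trimws(text))) "" else "the"
}

# UI: a text input box for the typed words and a text output for the prediction.
ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("user_text", "Type some words:"),
  textOutput("next_word")
)

# Server: recompute the prediction whenever the input text changes.
server <- function(input, output) {
  output$next_word <- renderText(predict_next(input$user_text))
}

# shinyApp(ui, server)  # uncomment to launch the app locally
```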

Additional Information

The first milestone report, which includes an exploratory analysis of the English datasets, can be found here: http://www.rpubs.com/gibrahim/217948
The Shiny app can be found here: https://ghida.shinyapps.io/Shiny_App_Capstone_Project/
The associated code, including the prediction model and the UI.R and server.R files, can be found here: https://github.com/ghida87/Capstone