Capstone Data Science Project

Bowen Liu
July 3th, 2016

Objectives

  • This presentation is for the Capstone Project under Coursera Data Science specialization.

  • The project builds a shiny application to predict the next word given input text.

  • The following slides will briefly walk you through the text prediction pipeline.

Approaches & Modeling

  • Pipeline: data cleansing => exploratory analysis => prediction modeling => data product
  • Data source: twitter, blogs, and news texts provided by SwiftKey.
  • Data cleansing: after sampling(50%) texts, the corpus were created and transformed in R tm package to lowercase, stem, and remove punctuation, white space, numbers, and profanity words. Then the cleansed corpus were tokenized into unigram, bigram, trigram, and quargram.
  • Explorartory analysis: milestone link
  • Prediction modeling: the ngrams dataframes were built along with laplace smoothing probabilities. The prediction is to query the dataframe to find the words with highest probabilites. Besides, stupid backoff is used in case of unknown words.
  • Data product: the application is hosted in Shiny.io.

Shiny Application

The application is shown as follows. The text input will be extracted to get the priors based on the ngram modes. The default is bigram so the last word from the input will be extracted. Then the modeling will output the top three prediction words with their probabilities. You can switch ngram modes among bigram, trigram, quargram, and stupid backoff.

Project Links

note: please open the link in new tab or window