15/12/2019

About the Capstone Project

  • This project is part of the tenth and final course of the Coursera Data Science Specialization. It analyzes the structure of several large text files and uses that analysis to build a model that predicts the next word a user will type.
  • Contents
    • Text data analysis: analysis of the corpus to understand the relationship of words and word pairs
    • Predictive modeling: build basic n-gram models and develop algorithms to facilitate text prediction
    • Shiny app development: produce a web-based Shiny app to predict next words

Modeling

  1. Getting and cleaning the data: profanity was first removed and words tokenized
  2. Exploratory data analysis: the frequencies of words and word pairs were calculated
  3. Modeling: 2-gram to 7-gram models were built to facilitate word prediction
  4. Prediction model:
     • Katz’s back-off model was used to predict the next word
     • The model iterates from the 7-gram down to the 2-gram tables, matching on the last n-1 words of the input
     • If no n-gram matches, the most frequent word, ‘the’, is returned
     • To improve efficiency, word pairs that appear fewer than 5 times in the corpus were removed
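
The back-off lookup described above can be sketched in a few lines. This is a minimal Python illustration, not the project's own code: the function names (build_ngram_tables, predict_next) are hypothetical, the pruning threshold is applied to every n-gram order as an assumption, and the sketch omits the discounting and back-off weights of a full Katz model, keeping only the "longest matching context wins" step.

```python
from collections import Counter, defaultdict


def build_ngram_tables(tokens, max_n=7, min_count=5):
    """For each n in 2..max_n, map an (n-1)-word context to a Counter of next words."""
    tables = {}
    for n in range(2, max_n + 1):
        counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        table = defaultdict(Counter)
        for gram, count in counts.items():
            if count >= min_count:  # prune rare n-grams (assumed to apply to all orders)
                table[gram[:-1]][gram[-1]] = count
        tables[n] = table
    return tables


def predict_next(tables, phrase, fallback="the"):
    """Try the longest context first and back off to shorter ones."""
    words = phrase.lower().split()
    for n in sorted(tables, reverse=True):
        if len(words) < n - 1:
            continue  # not enough input words for this order
        context = tuple(words[-(n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return fallback  # unseen context: return the most frequent word


# Toy corpus, purely illustrative
tokens = ("thank you very much " * 3 + "thank you very little").split()
tables = build_ngram_tables(tokens, max_n=3, min_count=2)
print(predict_next(tables, "thank you very"))  # much
print(predict_next(tables, "zebra"))           # the
```

In the toy run, "little" follows "you very" only once, so it is pruned and "much" is the only remaining candidate; the unknown word "zebra" matches nothing and falls through to the default.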

Results

  • The data analysis and model building write-ups were submitted to the Coursera platform
  • The Shiny app for prediction is available at: https://rafmesal.shinyapps.io/predictorApp/
  • The app takes in the following inputs:
    1. a word or phrase that the user inputs
    2. “# words to predict”, the number of words to predict, selected by the user
  • The predicted next words are listed in order of frequency, most frequent first
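
The ordering of the output can be illustrated with a short Python sketch. The function name top_predictions and the counts are made up for illustration; the app itself may rank candidates differently.

```python
from collections import Counter


def top_predictions(followers, k=3):
    """Return up to k candidate words, most frequent first."""
    return [word for word, _ in followers.most_common(k)]


# Hypothetical counts of words seen after some context
followers = Counter({"you": 120, "the": 45, "for": 80})
print(top_predictions(followers, k=2))  # ['you', 'for']
```

Here k plays the role of the “# words to predict” input.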

Working App

Type a word or phrase and see results in seconds!