Coursera Data Science Capstone: Course Project

Rich Huebner
September 17, 2018

Overview

To use the app, go here to try it!

  • Predicts the next word as the user types a sentence
  • Works much like the predictive keyboards on most smartphones today, such as those built on SwiftKey technology

How To Use the App

  • First, you will be asked to enter the first few words of a sentence.
  • As you type, the next predicted word will be displayed for you.
  • The method used to make the prediction is also displayed.

Getting & Cleaning the Data

  • A sample of each of the three data sets (blogs, Twitter, and news) was imported into R, and the samples were merged into one corpus.
  • Next, I cleaned the data: converted it to lowercase, removed extra white space, and removed punctuation and numbers.
  • The N-grams are then created (quadgrams, trigrams, and bigrams). N-grams are sequences of N consecutive words; the most frequent ones are common word combinations (e.g., “the way”, “new york”, “i like to”).
  • Next, the frequency tables are extracted from the N-grams and sorted in descending order.
  • Lastly, the N-gram objects are saved as compressed R files (.RData files).
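
The cleaning and counting steps above can be sketched in a few lines. This is a minimal Python illustration, not the R code used for the app: the function names (clean_text, ngram_counts) and the sample sentences are hypothetical, and it assumes ASCII text in which every non-letter character is treated as a separator (so contractions such as "don't" are split).

```python
import re
from collections import Counter

def clean_text(text):
    """Lowercase, drop digits and punctuation, collapse runs of white space."""
    text = text.lower()
    # Assumption: keep ASCII letters only; everything else becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def ngram_counts(lines, n):
    """Count n-grams over all lines; an n-gram never spans two lines."""
    counts = Counter()
    for line in lines:
        words = clean_text(line).split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical stand-in for the merged blogs/Twitter/news sample.
sample = [
    "I like to visit New York!",
    "New York is the way to go in 2018.",
    "I like to read.",
]

bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)

# Frequency table sorted in descending order of count.
bigram_table = bigrams.most_common()
```

A Counter already acts as the frequency table; most_common() returns it sorted by descending count, which corresponds to the sorting step in the list above.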

Underlying Algorithm

  • N-gram model with backoff
  • The algorithm checks whether the highest-order N-gram (n = 4) has been seen. If not, it backs off to a lower-order model (n = 3, then n = 2).
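
The backoff idea can be illustrated as follows. This is a minimal Python sketch under stated assumptions, not the app's actual implementation: build_model, predict_next, and the toy corpus are hypothetical, and it uses plain backoff on raw counts (the most frequent follower wins) with no discounting or smoothing.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=4):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                model[n][context][words[i + n - 1]] += 1
    return model

def predict_next(model, text, max_n=4):
    """Return (word, method): try the longest context first, then back off."""
    words = text.lower().split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough typed words for this order
        context = tuple(words[-(n - 1):])
        followers = model[n].get(context)
        if followers:
            return followers.most_common(1)[0][0], f"{n}-gram"
    return None, "no match"

# Hypothetical toy corpus, already cleaned.
corpus = [
    "i like to read books",
    "i like to read books",
    "i like to travel",
    "we want to read papers",
]
model = build_model(corpus)

seen_quadgram = predict_next(model, "i like to")    # context seen at n = 4
backed_off = predict_next(model, "they want to")    # falls back to n = 3
bigram_only = predict_next(model, "hello to")       # falls back to n = 2
unknown = predict_next(model, "zebra")              # nothing matches
```

The second element of each result names the order of the model that produced the prediction, which is one way to expose the method of prediction alongside the predicted word.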

Further Exploration

  • Further work
    1. Explore different algorithms like Naive Bayes.
    2. Find ways to speed up the preprocessing, for example through parallelization.
    3. Explore text mining with social-emotional learning data
    4. Explore text mining with student discussion board data