Coursera Data Science Capstone: Course Project

Rich Huebner
September 17, 2018

To try the app, go here!

Predicts the next word as the user types a sentence
Similar to the next-word prediction on most smartphone keyboards today, such as SwiftKey

A sample of each data set was imported into R from three sources (blogs, Twitter, and news), and the samples were then merged into one.
Next, I cleaned the data: converted it to lowercase, stripped extra white space, and removed punctuation and numbers.
The n-grams are then created (quadgrams, trigrams, and bigrams). N-grams are sequences of n consecutive words; frequently appearing ones include “the way”, “new york”, and “i like to”.
Next, the frequency tables are extracted from the n-grams and sorted in descending order of frequency.
Lastly, the n-gram objects are saved as compressed R data files (.RData files).
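
The preparation steps above can be sketched in base R as follows. This is a minimal illustration, not the code behind the app: the helper names clean_text and ngram_table, the two sample sentences, and the output file name ngrams.RData are all assumptions made for the example, and the actual project may well have used a text mining package instead.

```r
# Hypothetical helpers for illustration; base R only.

# Lowercase the text, drop punctuation and numbers, collapse extra white space.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", "", x)
  x <- gsub("[[:digit:]]+", "", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Build a frequency table of n-word sequences, sorted by descending frequency.
ngram_table <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts,
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Made-up sample lines standing in for the merged blogs/Twitter/news sample.
sample_lines <- c("I like to read, and I like to write!",
                  "New York is 2 hours away; I like to travel.")
cleaned <- clean_text(sample_lines)

bigrams   <- ngram_table(cleaned, 2)
trigrams  <- ngram_table(cleaned, 3)
quadgrams <- ngram_table(cleaned, 4)

# Save the tables in one compressed .RData file (written to a temp directory here).
out_file <- file.path(tempdir(), "ngrams.RData")
save(bigrams, trigrams, quadgrams, file = out_file)
```

With these two sample lines, the most frequent trigram is “i like to”. The saved tables can be restored later with load(out_file).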

N-Gram model with Backoff
The algorithm checks whether the highest-order (n = 4) n-gram has been seen. If not, it backs off to a lower-order model (n = 3, then n = 2).
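
A rough sketch of this backoff lookup in base R is shown below. The function name predict_next and the toy frequency tables (with made-up counts) are assumptions for illustration only. The sketch simply returns the last word of the most frequent matching entry and applies no probability discounting, which may differ from what the app does.

```r
# Toy frequency tables with made-up counts, for illustration only.
quadgrams <- data.frame(ngram = c("thanks for the follow", "thanks for the support"),
                        freq = c(5L, 2L), stringsAsFactors = FALSE)
trigrams  <- data.frame(ngram = c("i like to", "one of the"),
                        freq = c(9L, 7L), stringsAsFactors = FALSE)
bigrams   <- data.frame(ngram = c("new york", "the way"),
                        freq = c(12L, 8L), stringsAsFactors = FALSE)

# Hypothetical helper: try the highest-order table first, then back off.
predict_next <- function(input, tables) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  for (tbl in tables) {
    if (nrow(tbl) == 0) next
    # Order of this table (4, 3, or 2), read off its first entry.
    n <- length(strsplit(tbl$ngram[1], " ", fixed = TRUE)[[1]])
    k <- n - 1                       # number of context words needed
    if (length(words) < k) next      # not enough context: back off
    prefix <- paste0(paste(utils::tail(words, k), collapse = " "), " ")
    hits <- tbl[startsWith(tbl$ngram, prefix), ]
    if (nrow(hits) > 0) {
      best <- hits$ngram[which.max(hits$freq)]
      return(utils::tail(strsplit(best, " ", fixed = TRUE)[[1]], 1))
    }
  }
  NA_character_                      # no table contains the context
}

tables <- list(quadgrams, trigrams, bigrams)
predict_next("Thanks for the", tables)   # matched in the quadgram table
predict_next("I like", tables)           # too short for n = 4, uses trigrams
predict_next("Welcome to New", tables)   # backs off all the way to bigrams
```

Ordering the list from highest to lowest order is what makes the loop a backoff: a lower-order table is consulted only when every higher-order table fails to match.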

Further work
1. Explore different algorithms like Naive Bayes.
2. Find ways of making the preprocessing faster, for example through parallelization.
3. Explore text mining with social-emotional learning data.
4. Explore text mining with student discussion board data.