The app works similarly to the way most smartphone keyboards work today, such as those built on SwiftKey technology.
How To Use the App
First, you will be asked to enter the first few words of a sentence.
As you type, the next predicted word will be displayed for you.
The method used to make the prediction is also displayed.
Getting & Cleaning the Data
A sample of the data was imported into R from three sources (blogs, Twitter, and news) and then merged into one data set.
Next, I cleaned the data by converting it to lowercase, stripping extra white space, and removing punctuation and numbers.
The n-grams are then created (quadgram, trigram, and bigram). N-grams are sequences of n consecutive words; the most frequent ones are common word combinations (e.g., “the way”, “new york”, “i like to”).
Next, the frequency tables are extracted from the n-grams and sorted in descending order.
Lastly, the n-gram objects are saved as R compressed files (.RData files).
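The steps above can be sketched in base R as follows. This is a minimal illustration, not the app's actual code: the sample sentences, the function names (clean_text, make_ngrams, ngram_freq), and the output file name are all hypothetical, and the white-space step is assumed to collapse extra spaces rather than remove every space, since the words must stay separated to form n-grams.

```r
# Hypothetical sample standing in for the merged blogs/Twitter/news text.
sample_text <- c(
  "I like to read the news in New York.",
  "I like to walk   the way home, 2 times a day!",
  "The way I like to travel is by train."
)

# Clean: lowercase, drop punctuation and digits, collapse extra white space.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]|[[:digit:]]", " ", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Build all n-grams of size n from one cleaned line.
make_ngrams <- function(line, n) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams across all lines and sort by descending frequency.
ngram_freq <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

cleaned   <- clean_text(sample_text)
bigrams   <- ngram_freq(cleaned, 2)
trigrams  <- ngram_freq(cleaned, 3)
quadgrams <- ngram_freq(cleaned, 4)

# Save the tables as a compressed .RData file (illustrative file name).
save(bigrams, trigrams, quadgrams,
     file = file.path(tempdir(), "ngrams.RData"))
```

On this toy sample, head(trigrams) puts “i like to” in the first row, because it is the only trigram that occurs in all three lines. On the full data set a dedicated text-mining package would likely be faster than these base R loops.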
Underlying Algorithm
N-gram model with backoff
The algorithm checks whether the highest-order (n = 4) n-gram has been seen. If not, it backs off to a lower-order model (n = 3, then n = 2).
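A minimal sketch of this backoff lookup in base R is shown below. The toy frequency tables and the function names (lookup_next, predict_next) are hypothetical. The sketch assumes simple backoff with no discounting or score weighting: it returns the most frequent match from the highest-order table that has one, together with the name of the model that produced it.

```r
# Toy frequency tables (hypothetical data).
quadgrams <- data.frame(ngram = c("i like to read", "i like to walk"),
                        freq = c(3L, 1L), stringsAsFactors = FALSE)
trigrams  <- data.frame(ngram = c("like to read", "want to go"),
                        freq = c(4L, 2L), stringsAsFactors = FALSE)
bigrams   <- data.frame(ngram = c("new york", "to read"),
                        freq = c(5L, 4L), stringsAsFactors = FALSE)

# Return the last word of the most frequent n-gram starting with `prefix`,
# or NA if no n-gram in the table starts with it.
lookup_next <- function(prefix, tbl) {
  hits <- tbl[startsWith(tbl$ngram, paste0(prefix, " ")), , drop = FALSE]
  if (nrow(hits) == 0) return(NA_character_)
  best <- hits$ngram[which.max(hits$freq)]
  words <- strsplit(best, " ", fixed = TRUE)[[1]]
  words[length(words)]
}

# Try the quadgram table first, then back off to trigrams, then bigrams.
predict_next <- function(input, models) {
  words <- strsplit(trimws(tolower(input)), "[[:space:]]+")[[1]]
  orders <- c(quadgram = 4, trigram = 3, bigram = 2)
  for (name in names(orders)) {
    k <- orders[[name]] - 1          # context words this model needs
    if (length(words) < k) next      # input too short for this order
    prefix <- paste(utils::tail(words, k), collapse = " ")
    word <- lookup_next(prefix, models[[name]])
    if (!is.na(word)) return(list(word = word, method = name))
  }
  list(word = NA_character_, method = "none")
}

models <- list(quadgram = quadgrams, trigram = trigrams, bigram = bigrams)
predict_next("I like to", models)    # found in the quadgram table
predict_next("We want to", models)   # backs off to the trigram table
predict_next("in New", models)       # backs off to the bigram table
```

Returning the model name alongside the word is one way to support displaying the method of prediction in the app.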
Further Exploration
Further work
Explore different algorithms, such as Naive Bayes
Find ways to make the preprocessing faster, for example through parallelization
Explore text mining with social-emotional learning data
Explore text mining with student discussion board data