Devi
Aug 23 2015
Coursera Data Science Specialization : Capstone Swiftkey Project
The goal of this project is to “Create a Shiny app that accepts as input a phrase (multiple words) in a text box input and outputs a prediction of the next word.” The data is taken from a corpus called HC Corpora. The training dataset was downloaded from the Capstone Dataset.
The model was built from 500K randomly sampled lines from blog postings, news articles, and Twitter feeds. A modified Katz back-off model was developed using n-word sequences (n-grams) ranging from 2 to 6 words. Frequent n-grams were identified and used to calculate probabilities. Numbers, punctuation, capitalization, and profanity were removed. In addition to the next word, this application displays a prediction data table and a word cloud.
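As an illustration of the cleaning and n-gram counting described above, here is a minimal Python sketch. It is not the app's actual code: the function names `clean_line` and `count_ngrams` are hypothetical, the `PROFANITY` set is a placeholder (the source does not say which word list was used), and the regular expression keeps ASCII letters only, which is an assumption.

```python
import re
from collections import Counter

# Placeholder list (assumption): the real profanity list is not given in the source.
PROFANITY = {"darn", "heck"}

def clean_line(line):
    """Lowercase the line, drop digits and punctuation, remove listed profanity."""
    line = line.lower()
    # Replace everything except ASCII letters and whitespace with a space.
    # Note: this also splits contractions such as "don't" into "don" and "t".
    line = re.sub(r"[^a-z\s]", " ", line)
    return [word for word in line.split() if word not in PROFANITY]

def count_ngrams(lines, n):
    """Count every run of n consecutive tokens across the cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = clean_line(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

sample = ["I love New York 2015!", "I love new shoes, darn it."]
bigrams = count_ngrams(sample, 2)
```

Calling `count_ngrams` once for each n from 2 to 6 would give one frequency table per n-gram order; low-frequency entries could then be dropped to keep only the frequent n-grams.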
(1.) Process text input from user (separate/tokenize into n words)
(2.) Search (n+1)-gram frequency table for matches
(3.) Calculate probabilities of each match (frequency/total)
(4.) If no matches, search the next lower-order n-gram table
(5.) If no match in 2-gram table, use most frequent 1-grams
(6.) Return word with the highest probability score (0-1, 1=best)
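The steps above can be sketched in Python as follows. This is a simplified back-off under stated assumptions, not the app's implementation: it omits the discounting of a full Katz back-off model, the names `build_tables` and `predict_next` are hypothetical, and the toy corpus uses n-grams up to 3 words rather than 6 to keep the example short.

```python
from collections import Counter

def build_tables(tokens, max_n=3):
    """Return {n: Counter of n-gram tuples} for n = 1..max_n."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1)
    }

def predict_next(phrase, tables):
    """Return (word, score), where score = match frequency / total of all matches."""
    words = phrase.lower().split()            # step 1: tokenize the input
    max_n = max(tables)
    # Steps 2 and 4: start at the highest usable order, then back off.
    for n in range(min(len(words) + 1, max_n), 1, -1):
        context = tuple(words[-(n - 1):])     # last n-1 words of the input
        matches = {g[-1]: c for g, c in tables[n].items() if g[:-1] == context}
        if matches:
            total = sum(matches.values())     # step 3: frequency / total
            best = max(matches, key=matches.get)
            return best, matches[best] / total    # step 6
    # Step 5: nothing matched down to the 2-gram table, use the top 1-gram.
    unigrams = tables[1]
    (word,), count = unigrams.most_common(1)[0]
    return word, count / sum(unigrams.values())

corpus = "the cat sat on the mat the cat ate the fish".split()
tables = build_tables(corpus, max_n=3)
```

With this toy corpus, `predict_next("on the", tables)` matches in the 3-gram table, `predict_next("dog the", tables)` backs off to the 2-gram table, and `predict_next("zebra", tables)` falls through to the most frequent 1-gram.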
Enter a word or phrase in English in the text box, or select one of the phrases from the Quizzes. Then click the “Predict” button. The best single next-word prediction is displayed on the “Prediction Results” tab or the “Quiz Prediction Results” tab, depending on the selection made in the app's sidebar panel. Note: the app may take a few seconds to load initially.
Special thanks to Jeff Leek, PhD, Roger D. Peng, PhD, and Brian Caffo, PhD, of the Coursera Data Science faculty.
ShinyApp Server: https://devi.shinyapps.io/NWPShinyApp
Slide Deck: http://rpubs.com/Devi/