The objective of the App is to implement a predictive model that suggests one or more words coherent with the sentence entered by the user. The Capstone data set includes blogs, news, and Twitter text from HC Corpora. After data cleansing, sampling, and subsetting, all of the data is gathered into a data frame.
Applying NLP and text-mining techniques, sets of word combinations (N-grams) are created. These N-grams are the main support for the Katz Back-Off algorithm that predicts the next word. Some adaptations and heuristics were developed specifically to enhance this Shiny application.
The data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip .
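A minimal sketch of the download step, assuming the archive is saved to the working directory; the local file and folder names are illustrative:

```r
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip <- "Coursera-SwiftKey.zip"

if (!file.exists(zip)) {
  download.file(url, destfile = zip, mode = "wb")  # binary mode for the zip archive
}
unzip(zip, exdir = "data")  # extracts the en_US blogs, news, and twitter text files
```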
Because the original dataset is very large, a subset of the data was sampled from the three sources (blogs, news, and Twitter) and then merged into a single corpus using the tm package function VCorpus().
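A sketch of how the sampling and merging might look with the tm package; the file paths and the 1% sampling fraction are assumptions rather than values stated here:

```r
library(tm)

set.seed(1234)
read_sample <- function(path, frac = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = ceiling(length(lines) * frac))  # keep a small random subset
}

blogs   <- read_sample("data/final/en_US/en_US.blogs.txt")
news    <- read_sample("data/final/en_US/en_US.news.txt")
twitter <- read_sample("data/final/en_US/en_US.twitter.txt")

# Merge the three samples into one tm corpus
corpus <- VCorpus(VectorSource(c(blogs, news, twitter)))
```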
Data cleansing is then performed by converting all words to lowercase and removing URLs, punctuation, numbers, colons, quotes, non-ASCII characters, repeated words, extra whitespace, and stopwords. The corresponding N-grams (bigrams and trigrams) are then created using the R package RWeka and its function NGramTokenizer().
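The cleaning and tokenisation could be sketched as follows, continuing from the `corpus` object above; the exact transformation order and the URL pattern are assumptions:

```r
library(tm)
library(RWeka)

to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

corpus <- tm_map(corpus, to_space, "http[[:alnum:][:punct:]]*")        # drop URLs
corpus <- tm_map(corpus, content_transformer(function(x)
  iconv(x, to = "ASCII", sub = " ")))                                   # drop non-ASCII
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# RWeka tokenizers for bigrams and trigrams
bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
```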
Next, term-count tables are extracted from the N-grams and sorted by frequency in descending order. Finally, the N-gram objects are saved as compressed R files.
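A sketch of building and saving the frequency tables from the tokenizers above; the output file names are illustrative:

```r
text <- unlist(lapply(corpus, as.character))  # back to a plain character vector

bigram_freq  <- sort(table(bigram_tok(text)),  decreasing = TRUE)
trigram_freq <- sort(table(trigram_tok(text)), decreasing = TRUE)

bigram_df  <- data.frame(ngram = names(bigram_freq),  freq = as.integer(bigram_freq))
trigram_df <- data.frame(ngram = names(trigram_freq), freq = as.integer(trigram_freq))

saveRDS(bigram_df,  "bigram.rds")   # saveRDS writes compressed files by default
saveRDS(trigram_df, "trigram.rds")
```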
The compressed datasets containing the frequency-sorted N-grams are loaded first. The user then chooses whether to predict the next one word or the next two words and enters a single word in the text box. The input word is cleaned in the same way as the training data before the prediction is made. If no matching bigram is found, the word is searched for in the trigrams; if neither a bigram nor a trigram is found, the app backs off to the most common word with the highest frequency, shown in the Frequent Words tab.
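A sketch of the look-up and back-off logic described above for the single-next-word case; the function `predict_next`, its arguments, and the `top_words` vector of frequency-sorted unigrams are illustrative, not the app's actual code:

```r
last_token <- function(ngram) {
  parts <- strsplit(ngram, " ", fixed = TRUE)[[1]]
  parts[length(parts)]
}

predict_next <- function(input, bigram_df, trigram_df, top_words) {
  input  <- tolower(trimws(input))
  prefix <- paste0(input, " ")

  # 1. Look for a bigram starting with the input word; the tables are
  #    frequency-sorted, so the first match is the most frequent one
  hit <- bigram_df$ngram[startsWith(bigram_df$ngram, prefix)]
  if (length(hit) > 0) return(last_token(hit[1]))

  # 2. Otherwise search the trigram table for the input word and return
  #    the word that follows it
  hit <- trigram_df$ngram[grepl(prefix, trigram_df$ngram, fixed = TRUE)]
  if (length(hit) > 0) {
    toks <- strsplit(hit[1], " ", fixed = TRUE)[[1]]
    pos  <- match(input, toks)
    if (!is.na(pos) && pos < length(toks)) return(toks[pos + 1])
  }

  # 3. Back off to the most frequent word overall (the "Frequent Words" tab)
  top_words[1]
}

# Example call (assumed objects): predict_next("thanks", bigram_df, trigram_df, top_words)
```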
Name Of App : Predict Your Next Word
Shiny App Url : https://unsashinyapps.shinyapps.io/predictnextword/