April 24, 2016
Sentence Predictor
- Text completion is becoming a commonplace feature in web-enabled applications. It makes once-tedious text entry more convenient by predicting the next word from the preceding words. By examining large bodies of text, probability-based algorithms can predict the next word given the previous words in a phrase or sentence (a simple count-based estimate is shown below).
- The business value lies not only in word prediction and customer convenience, but also in predicting words or products based on what someone has already clicked on.
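As a concrete illustration of the count-based estimate mentioned above, the probability of a next word can be approximated from n-gram counts. This is the standard maximum-likelihood formula, not one quoted from the application itself:

$$
P(w_3 \mid w_1, w_2) \approx \frac{\mathrm{count}(w_1\,w_2\,w_3)}{\mathrm{count}(w_1\,w_2)}
$$

For example, if "thanks for the" appeared 50 times in the corpus and "thanks for" appeared 200 times, the estimated probability that "the" follows "thanks for" would be 0.25.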
How It Does It
- Collect a corpus of text from sources around the web:
  + Twitter
  + Blogs
  + News Articles (AP)
- Clean the text to remove numbers, punctuation, and *!%@$ (profanity).
- Use R's tm text mining package to break the text into n-grams (the cleaning and tokenization steps are sketched in code after this list)
- N-grams are groupings of words:
  + unigram = a single word
  + bigram = a word pair
  + trigram = a three-word combination
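Putting the collection, cleaning, and tokenization steps together, here is a minimal sketch of how this could look in R. The file names, sample sizes, and profanity list are placeholders, and RWeka's NGramTokenizer is one common companion to tm for n-gram tokenization; the original application's exact code may differ.

```r
library(tm)
library(RWeka)

# Read a sample of each source.  File names and sample sizes are
# placeholders for the Twitter, blog, and news corpora described above.
lines <- c(
  readLines("en_US.twitter.txt", n = 10000, skipNul = TRUE),
  readLines("en_US.blogs.txt",   n = 10000, skipNul = TRUE),
  readLines("en_US.news.txt",    n = 10000, skipNul = TRUE)
)

# Hypothetical profanity list; any word list can be substituted.
profanity <- readLines("profanity.txt")

# Clean the text: lower-case it, then drop numbers, punctuation, and profanity.
corpus <- VCorpus(VectorSource(lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, profanity)
corpus <- tm_map(corpus, stripWhitespace)

# Break the cleaned text into n-grams and count how often each one occurs.
ngram_freq <- function(corpus, n) {
  tok <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tok))
  sort(slam::row_sums(tdm), decreasing = TRUE)
}

uni_freq <- ngram_freq(corpus, 1)   # single words
bi_freq  <- ngram_freq(corpus, 2)   # word pairs
tri_freq <- ngram_freq(corpus, 3)   # three-word combinations

head(tri_freq, 5)  # the most frequent trigrams in the sample
```

The resulting unigram, bigram, and trigram frequency tables are what the backoff model below queries.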
Algorithm
- Backoff Model
- Take the last two words of the phrase or sentence and find all trigrams that begin with that bigram; each matching trigram's final word is a candidate next word
- For each matching trigram, take its ending bigram (the last input word plus the candidate next word)
- For each of those bigrams, find its frequency in the bigram table
- Convert these counts into probabilities, then add the probabilities from each step together
- Weights are applied to the trigram, bigram, and unigram probabilities to improve accuracy (a sketch of this weighted backoff appears below)
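Here is a minimal sketch of the weighted backoff lookup described above, assuming the uni_freq, bi_freq, and tri_freq tables from the earlier sketch. The weights are illustrative only; the values and exact scoring used in the deployed app are not documented here.

```r
# Score the words that follow `prefix` in an n-gram frequency table,
# converting raw counts to weighted relative frequencies.
score_candidates <- function(freqs, prefix, weight) {
  hits <- freqs[startsWith(names(freqs), paste0(prefix, " "))]
  if (length(hits) == 0) return(numeric(0))
  candidates <- substring(names(hits), nchar(prefix) + 2)  # word(s) after the prefix
  setNames(weight * hits / sum(hits), candidates)
}

# Predict the next word of a phrase (assumes the phrase has at least two words).
predict_next <- function(phrase, weights = c(tri = 0.6, bi = 0.3, uni = 0.1)) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  last2 <- paste(tail(words, 2), collapse = " ")  # last two words -> trigram lookup
  last1 <- tail(words, 1)                         # last word      -> bigram lookup

  tri_scores <- score_candidates(tri_freq, last2, weights[["tri"]])
  bi_scores  <- score_candidates(bi_freq,  last1, weights[["bi"]])
  uni_scores <- weights[["uni"]] * head(uni_freq, 5) / sum(uni_freq)

  # Add the weighted probabilities for each candidate across the trigram,
  # bigram, and unigram levels, then return the top suggestions.
  all_scores <- c(tri_scores, bi_scores, uni_scores)
  combined   <- tapply(all_scores, names(all_scores), sum)
  head(sort(combined, decreasing = TRUE), 3)
}

predict_next("thanks for the")  # returns the top three candidate next words
```

Because the trigram evidence carries the largest weight, a match on the last two words dominates; the bigram and unigram terms mainly break ties and cover phrases the trigram table has never seen.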
The Application
- There are other smoothing algorithms that can improve prediction quality (Good-Turing, Kneser-Ney)
- Other techniques, such as pruning low-frequency n-grams and precomputing the lookup tables, can improve processing speed
- Note: at the time of writing, shinyapps.io was experiencing difficulties. Please use the alternate URL if the URL above does not work.