WordPredict.Rmd

Tinniam V Ganesh
27 Jul 2015

Ingest the data

This presentation highlights the steps in creating a Word Predict Shiny App.

The steps taken were:

  • Ingest the data from the Tweets, Blogs and News
  • Sample 10% of the data and split it into training and test sets
  • Store the training and test sets as separate files (see the sketch below)
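
A minimal sketch of the ingest and sampling step is shown below. The input file names (en_US.twitter.txt etc.) and the 80/20 train/test split are assumptions, as the text does not give them.

```r
# Sketch of ingest + 10% sample; file names and the 80/20 split are assumptions
set.seed(1234)
tweets <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs  <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news   <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)

# Take a 10% sample of each source
sampleLines <- function(x, pct = 0.10) x[sample(length(x), round(pct * length(x)))]
sampled <- c(sampleLines(tweets), sampleLines(blogs), sampleLines(news))

# Split into training and test sets and store as separate files
idx <- sample(length(sampled), round(0.8 * length(sampled)))
writeLines(sampled[idx],  "train.txt")
writeLines(sampled[-idx], "test.txt")
```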

Create and clean the Corpus

  • Create a Corpus from the tweets, blogs and news items
  • Clean the Corpus to remove punctuation, special characters, stopwords, etc.
  • Remove profanity from the training and test sets (see the sketch below)
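
A sketch of this step with the tm package might look like the following; the profanity word list (profanity.txt) is an assumed input file.

```r
library(tm)

corpus <- VCorpus(VectorSource(readLines("train.txt")))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)           # punctuation and special characters
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, readLines("profanity.txt"))  # assumed word list
corpus <- tm_map(corpus, stripWhitespace)
```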

Create N-grams

  1. Use the RWeka package to create n-grams
  2. Remove sparse terms
  3. Convert to a data frame and compute the frequency of each n-gram
  4. Use Markov chains to calculate the conditional probability P(C|AB) = Count(ABC)/Count(AB)
  5. Apply a smoothing algorithm for the case where the count of the (n-1)-gram is 0
  6. Arrange the counts in descending order of conditional probability
  7. Write the term, next word and conditional probability to a CSV file (see the sketch below)
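
For the trigram case, these steps could look like the sketch below; `bigramCount`, a named vector of Count(AB) built the same way from a bigram term-document matrix, is an assumed helper.

```r
library(tm)
library(RWeka)

# Tokenize the cleaned corpus into trigrams and drop sparse terms
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = triTokenizer))
tdm <- removeSparseTerms(tdm, 0.99)

# Frequency of each trigram as a data frame
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
trigrams <- data.frame(term = names(freq), count = freq,
                       row.names = NULL, stringsAsFactors = FALSE)

# P(C|AB) = Count(ABC) / Count(AB)
trigrams$prefix   <- sub("\\s+\\S+$", "", trigrams$term)  # "A B"
trigrams$nextword <- sub("^.*\\s", "", trigrams$term)     # "C"
# bigramCount: assumed named vector of Count(AB) from a bigram TDM
trigrams$prob <- trigrams$count / bigramCount[trigrams$prefix]

# Sort by conditional probability and write term, next word, probability
trigrams <- trigrams[order(-trigrams$prob), ]
write.csv(trigrams[, c("prefix", "nextword", "prob")],
          "trigram.csv", row.names = FALSE)
```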

Katz backoff algorithm

The backoff algorithm, given a phrase such as “This is so”, is as follows (a code sketch follows the list):

  1. Start with the quadgram table for the given phrase, e.g. “This is so”. If there are 10 next words, stop.
  2. Otherwise, sum the probabilities of the next words found in the quadgram table, Pq
  3. Compute alpha = 1 - Pq
  4. Search for the last 2 words, “is so”, in the trigram table.
  5. Multiply the trigram probabilities Pt by alpha: Pt' = alpha * Pt
  6. If the total number of quadgram and trigram next words is 10, stop.
  7. Otherwise compute a new alpha = 1 - Pt'
  8. Continue in the same way with the bigram and unigram tables
  9. Store only the (n-1)-gram, next word and conditional probability as CSV files
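
A simplified sketch of this backoff lookup is given below; `quad`, `tri` and `bi` are assumed data frames with columns prefix, nextword and prob, read from the CSV files.

```r
# Simplified Katz-style backoff over the precomputed n-gram tables
backoff <- function(phrase, quad, tri, bi, n = 10) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 3)

  # 1. Look up the last 3 words in the quadgram table
  hits <- quad[quad$prefix == paste(words, collapse = " "), ]
  if (nrow(hits) >= n) return(head(hits[order(-hits$prob), ], n))

  # 2-5. Back off to the trigram table, discounting by alpha = 1 - Pq
  alpha <- 1 - sum(hits$prob)
  tr <- tri[tri$prefix == paste(tail(words, 2), collapse = " "), ]
  tr$prob <- alpha * tr$prob
  hits <- rbind(hits, tr[!tr$nextword %in% hits$nextword, ])
  if (nrow(hits) >= n) return(head(hits[order(-hits$prob), ], n))

  # 7-8. Back off again to the bigram table with a new alpha
  alpha <- 1 - sum(hits$prob)
  b <- bi[bi$prefix == tail(words, 1), ]
  b$prob <- alpha * b$prob
  hits <- rbind(hits, b[!b$nextword %in% hits$nextword, ])
  head(hits[order(-hits$prob), ], n)
}
```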

The Next Word Shiny app

  1. Read all the CSV files; each contains the (n-1)-gram, the next word and the conditional probability
  2. Read the last 3 words of the typed phrase.
  3. Search in the n-gram table and back off to the (n-1)-gram table, e.g. search the quadgram table and back off to the trigram table, etc.
  4. Display the top 10 next words in a table when the user presses the Submit button (see the sketch below)
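
A minimal Shiny sketch of the app might look like the following; the CSV file names and the `backoff()` function from the previous sketch are assumptions, not the app's actual code.

```r
library(shiny)

# Assumed file names for the precomputed n-gram tables
quad <- read.csv("quadgram.csv", stringsAsFactors = FALSE)
tri  <- read.csv("trigram.csv",  stringsAsFactors = FALSE)
bi   <- read.csv("bigram.csv",   stringsAsFactors = FALSE)

ui <- fluidPage(
  textInput("phrase", "Enter a phrase"),
  submitButton("Submit"),
  tableOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderTable({
    if (is.null(input$phrase) || input$phrase == "") return(NULL)
    backoff(input$phrase, quad, tri, bi)  # top 10 next words
  })
}

shinyApp(ui, server)
```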