Tinniam V Ganesh
6 Aug 2015

Create and clean the Corpus

This presentation highlights the steps in creating a Word Predict Shiny App

The steps taken were:

  • Ingest the data from the Tweets, Blogs and News
  • Sample 15% of the data and split it into training and test sets
  • Store as separate files
  • Create a Corpus from the tweets, blogs and news items
  • Clean the Corpus to remove punctuation, special characters, stopwords, etc.
  • Remove profanity from the training and test set
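The sampling and cleaning steps above can be sketched as follows. This is a minimal Python illustration, not the code behind the app (which was built in R). The stopword and profanity sets, the 80/20 training/test share, and the function names are all assumptions made for the example.

```python
import random
import re

# Hypothetical stand-ins for the real stopword and profanity lists.
STOPWORDS = {"the", "a", "is", "of"}
PROFANITY = {"darn"}

def sample_and_split(lines, fraction=0.15, train_share=0.8, seed=42):
    """Sample a fraction of the lines, then split the sample into training and test sets.

    The 0.8 training share is an assumption; the slides do not state the ratio.
    """
    rng = random.Random(seed)
    k = int(len(lines) * fraction)
    sample = rng.sample(lines, k)
    cut = int(len(sample) * train_share)
    return sample[:cut], sample[cut:]

def clean(line):
    """Lowercase, replace anything that is not a letter or whitespace, then drop stopwords and profanity."""
    line = re.sub(r"[^a-z\s]", " ", line.lower())
    return [w for w in line.split() if w not in STOPWORDS and w not in PROFANITY]
```

For example, `clean("This is SO cool!!! #darn")` keeps only the tokens `this`, `so` and `cool`.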

Create N-grams

  1. Use the RWeka package to create quadgrams, trigrams, bigrams and unigrams
  2. Remove sparse terms
  3. Convert to a data frame and compute frequency of n-gram
  4. Use Markov chains to calculate the Maximum Likelihood estimate P(C|AB) = count(ABC)/count(AB)
  5. Apply the smoothing algorithm where the count of the (n-1)-gram is 0
  6. Arrange the counts in descending order of conditional probability
  7. Write the term, next word and conditional probability to a CSV file
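The counting and maximum likelihood steps above can be illustrated with a short Python sketch. It is not the RWeka-based code used for the app. The function names and the CSV column names are hypothetical, and the context count is taken as the sum of the n-gram counts that share the same first n-1 words.

```python
import csv
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_table(tokens, n):
    """Return rows (context, next_word, probability), highest probability first."""
    full = Counter(ngrams(tokens, n))
    context = Counter()
    for gram, c in full.items():
        context[gram[:-1]] += c          # count(AB) accumulated from count(ABC)
    rows = [(" ".join(g[:-1]), g[-1], c / context[g[:-1]]) for g, c in full.items()]
    rows.sort(key=lambda r: -r[2])
    return rows

def write_csv(rows, path):
    """Write the table to a CSV file; the column names are assumptions."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["context", "next_word", "prob"])
        writer.writerows(rows)
```

With the tokens `a b c a b d a b c`, the context `a b` occurs three times and is followed by `c` twice, so P(c|a b) = 2/3 and P(d|a b) = 1/3.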

Use Laplace Add-1 smoothing

  1. For previous terms whose count is 0, perform Laplace Add-1 smoothing

P_add-1(C|AB) = (count(ABC) + 1)/(count(AB) + V), where V is the size of the vocabulary

This method steals probability mass from existing terms and provides it to terms whose count is 0
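A small worked example may make the effect concrete. The Python sketch below assumes a vocabulary of V = 4 words and a context AB seen 3 times, followed by one word twice, another word once, and the other two words never. The numbers and the function name are made up for illustration.

```python
def add_one(count_abc, count_ab, vocab_size):
    """Laplace Add-1 estimate: (count(ABC) + 1) / (count(AB) + V)."""
    return (count_abc + 1) / (count_ab + vocab_size)

# Counts of each of the 4 vocabulary words after the context AB (illustrative values).
counts = [2, 1, 0, 0]
smoothed = [add_one(c, sum(counts), len(counts)) for c in counts]
```

The most frequent word drops from an unsmoothed 2/3 to 3/7, the two unseen words each receive 1/7, and the smoothed probabilities still sum to 1.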

Katz backoff algorithm

Given the phrase “This is so”, the backoff algorithm finds the 10 most likely next words as follows:

  1. Sum the probabilities (Pi) of the quadgram entries for “This is so”, i.e. Pq = sum(Pi)
  2. Compute alpha = 1 - Pq
  3. Search the trigram table (Pj) for the last two words, “is so”, and compute Pt = sum(Pj)
  4. Multiply by alpha: Pt' = alpha * Pt
  5. If fewer than 10 words have been found, continue in the same way with the bigram and unigram tables
  6. Store only the (n-1)-gram, next word and conditional probability as CSV files.
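The backoff steps above can be sketched in Python as follows. This is one possible reading of the procedure, not the app's actual code. It assumes that the leftover mass alpha is compounded at each lower level, that each stored context's probabilities sum to less than 1, and that a word already found at a higher order keeps its higher-order probability. The table layout and function name are hypothetical.

```python
def backoff_predict(words, tables, k=10):
    """Return up to k (next_word, probability) pairs, backing off from quadgram to unigram.

    tables maps the n-gram order (4, 3, 2, 1) to
    {context string: [(next_word, probability), ...]}; the unigram context is "".
    """
    results = {}
    alpha = 1.0
    for order in (4, 3, 2, 1):
        context = " ".join(words[-(order - 1):]) if order > 1 else ""
        rows = tables.get(order, {}).get(context, [])
        for word, p in rows:
            results.setdefault(word, alpha * p)   # keep the higher-order value if present
        if len(results) >= k:
            break
        alpha *= 1.0 - sum(p for _, p in rows)    # mass left over for lower orders
    return sorted(results.items(), key=lambda kv: -kv[1])[:k]
```

If the quadgram entries for “this is so” sum to 0.7, then alpha = 0.3 and a trigram entry for “is so” with probability 0.5 contributes 0.15.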

The Next Word Shiny app

  1. Read all the CSV files. These files contain the (n-1)-gram, the next word and the probability
  2. Read the word(s) entered by the user. If more than 3 words are entered, use only the last 3
  3. Search the n-gram table and back off to the (n-1)-gram table, e.g. search the quadgram table, then back off to the trigram table, and so on
  4. When the user presses the submit button or hits Enter, display the top 10 words in a table along with their conditional probabilities
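The lookup performed by the app can be sketched as below. This is a simplified Python illustration, not the Shiny code. The CSV column names (`context`, `next_word`, `prob`) and the function names are assumptions, the tables are parsed from text rather than files to keep the example self-contained, and the alpha weighting between levels is left out for brevity.

```python
import csv
import io

def load_table(csv_text):
    """Parse CSV text with columns context,next_word,prob into {context: [(word, prob), ...]}."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table.setdefault(row["context"], []).append((row["next_word"], float(row["prob"])))
    return table

def predict(text, tables, k=10):
    """tables holds the quadgram, trigram and bigram tables, in that order."""
    words = text.lower().split()[-3:]            # use only the last 3 words
    found = {}
    for size, table in zip((3, 2, 1), tables):
        if len(words) < size:
            continue                             # not enough input for this table
        context = " ".join(words[-size:])
        for word, p in table.get(context, []):
            found.setdefault(word, p)
        if len(found) >= k:
            break
    return sorted(found.items(), key=lambda kv: -kv[1])[:k]
```

Given the input "I think this is so", only "this is so" is searched in the quadgram table, and the search continues with "is so" and then "so" if fewer than k words have been found.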
               Thank You!