# NextWordPredict

Tinniam V Ganesh
6 Aug 2015

### Create and clean the Corpus

This presentation highlights the steps in creating a Word Predict Shiny App.

The steps taken were:

• Ingest the data from the Tweets, Blogs and News
• Sample 15% of the data and split it into training and test sets
• Store as separate files
• Create a Corpus from the tweets, blogs and news items
• Clean the Corpus to remove punctuation, special characters, stopwords etc
• Remove profanity from the training and test set
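
The sampling and cleaning steps above can be sketched in Python as follows. This is a minimal illustration, not the code behind the app: the stopword and profanity lists, the 80/20 train/test split and the function names `sample_and_split` and `clean` are all assumptions, since the slides do not give them.

```python
import random
import re

# Hypothetical stand-ins: the real stopword and profanity lists are not given.
STOPWORDS = {"the", "a", "is", "of", "and", "to"}
PROFANITY = {"darn"}

def sample_and_split(lines, fraction=0.15, train_share=0.8, seed=42):
    """Sample a fraction of the lines, then split the sample into train and test."""
    rng = random.Random(seed)
    k = max(1, round(len(lines) * fraction))
    sample = rng.sample(lines, k)
    cut = int(len(sample) * train_share)  # assumed 80/20 split
    return sample[:cut], sample[cut:]

def clean(line):
    """Lowercase, strip punctuation and digits, drop stopwords and profanity."""
    text = re.sub(r"[^a-z\s]", " ", line.lower())
    return [w for w in text.split() if w not in STOPWORDS and w not in PROFANITY]

lines = [f"Sample tweet number {i}, it is the best!" for i in range(100)]
train, test = sample_and_split(lines)
print(len(train), len(test))
print(clean("This is the darn BEST day, of 2015!"))
```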

### Create N-grams

1. Use the package RWeka to create Quadgrams, Trigrams, Bigrams and Unigrams
2. Remove sparse terms
3. Convert to a data frame and compute frequency of n-gram
4. Use a Markov chain model to calculate the Maximum Likelihood Estimate P(C|AB) = count(ABC)/count(AB)
5. Apply a smoothing algorithm where the count of the n-1 gram is 0
6. Arrange the n-grams in descending order of conditional probability
7. Write the term, next word and conditional probability to a CSV file
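
As an illustration of steps 3 to 7, the sketch below counts trigrams in a toy token list, computes P(C|AB) = count(ABC)/count(AB), sorts by probability and writes the rows as CSV. It is a Python sketch under stated assumptions rather than the RWeka-based code used for the app; the helper names `ngrams` and `mle_table` and the CSV column names are hypothetical, and count(AB) is taken over contexts that are followed by a word.

```python
import csv
import io
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_table(tokens, n):
    """Return rows (term, next_word, P(next_word | term)), highest probability first."""
    full = Counter(ngrams(tokens, n))
    context = Counter()
    for gram, count in full.items():
        context[gram[:-1]] += count  # count(AB), summed over all next words
    rows = [(" ".join(g[:-1]), g[-1], c / context[g[:-1]]) for g, c in full.items()]
    return sorted(rows, key=lambda r: r[2], reverse=True)

tokens = "this is so cool this is so fun this is it".split()
rows = mle_table(tokens, 3)

# Write to an in-memory buffer; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["term", "next_word", "probability"])  # assumed column names
writer.writerows(rows)
print(buffer.getvalue())
```

In this toy corpus "this is" occurs three times as a context and is followed by "so" twice, so P(so | this is) = 2/3.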

1. For previous terms whose count is 0, perform Laplace Add-1 smoothing

P_add-1(C|AB) = (count(ABC) + 1)/(count(AB) + V), where V is the vocabulary size

This method steals probability mass from existing terms and provides it to terms whose count is 0
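
A small numeric sketch of Add-1 smoothing, reading the numerator as the count of the full n-gram ABC and V as the vocabulary size. The function name `add_one` and the toy counts are assumptions chosen for illustration.

```python
def add_one(count_abc, count_ab, vocab_size):
    """Laplace Add-1 estimate of P(C | AB)."""
    return (count_abc + 1) / (count_ab + vocab_size)

V = 5                        # toy vocabulary size
seen = add_one(2, 3, V)      # n-gram seen twice; the unsmoothed estimate is 2/3
unseen = add_one(0, 3, V)    # n-gram never seen; the unsmoothed estimate is 0
no_context = add_one(0, 0, V)  # context never seen: falls back to 1/V

print(seen, unseen, no_context)
```

The seen n-gram drops from 2/3 to 3/8 while the unseen one rises from 0 to 1/8, which is the transfer of probability mass described above.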

### Katz backoff algorithm

Given a phrase such as “This is so”, the backoff algorithm for finding 10 next words is as follows:

1. Sum the probabilities (Pi) of the next words found for “This is so” in the quadgram table, i.e. Pq = sum(Pi)
2. Compute alpha = 1 - Pq
3. Search the trigram table (Pj) for the last two words, “is so”, and compute Pt = sum(Pj)
4. Multiply by alpha: Pt' = alpha * Pt
5. If fewer than 10 words have been found, continue in the same way with the bigram and unigram tables
6. Store only the n-1 gram, next word and conditional probability as CSV files.
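
The steps above can be sketched roughly as follows. This follows the simplified procedure on this slide, not full Katz backoff with discounting, and it assumes each table stores only some of the possible next words, so that the stored probabilities for a context sum to less than 1 and alpha is non-zero. The `tables` layout, the function name `backoff_candidates` and the toy probabilities are hypothetical.

```python
def backoff_candidates(words, tables, top_n=10):
    """Collect next-word candidates, backing off from long to short contexts.

    tables maps context length -> {context: [(next_word, prob), ...]}.
    """
    results = {}
    alpha = 1.0  # probability mass still available to lower-order tables
    for k in range(min(len(words), max(tables)), -1, -1):
        context = " ".join(words[len(words) - k:])  # last k words ("" when k == 0)
        rows = tables.get(k, {}).get(context, [])
        for word, p in rows:
            # Keep the score from the highest-order table that knows the word.
            results.setdefault(word, alpha * p)
        # Mass left over after this level is passed down to the next one.
        alpha *= max(0.0, 1.0 - sum(p for _, p in rows))
        if len(results) >= top_n:
            break
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

tables = {
    3: {"this is so": [("cool", 0.5), ("much", 0.3)]},
    2: {"is so": [("cool", 0.4), ("good", 0.4)]},
    1: {"so": [("far", 0.5)]},
    0: {"": [("the", 0.1)]},
}
candidates = backoff_candidates(["this", "is", "so"], tables, top_n=3)
print(candidates)
```

Here the quadgram table supplies "cool" and "much", leaving alpha of about 0.2, so "good" from the trigram table is scored at roughly 0.2 * 0.4.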

### The Next Word Shiny app

1. Read all the CSV files. Each file contains the n-1 gram, the next word and the conditional probability
2. Read the word(s) input. If more than 3 words are entered, use only the last 3 words
3. Search in the n-gram table and back off to the n-1 gram table, e.g. search in the quadgram table, then back off to the trigram table, etc.
4. When the user presses the submit button or hits enter, display the top 10 words in a table along with their conditional probabilities
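
The lookup flow of the app can be illustrated with the Python sketch below; the app itself is a Shiny app, so this only mirrors the logic. The CSV contents, the column names `context`, `next_word` and `prob`, and the helpers `load_table` and `predict` are assumptions, since the real files are not shown.

```python
import csv
import io

# Hypothetical CSV contents standing in for the files the app reads.
QUADGRAM_CSV = """context,next_word,prob
this is so,cool,0.5
this is so,much,0.3
"""
TRIGRAM_CSV = """context,next_word,prob
is so,good,0.4
"""

def load_table(csv_text):
    """Parse CSV text into {context: [(next_word, prob), ...]}."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table.setdefault(row["context"], []).append((row["next_word"], float(row["prob"])))
    return table

# Keyed by context length: 3 words for quadgrams, 2 for trigrams.
TABLES = {3: load_table(QUADGRAM_CSV), 2: load_table(TRIGRAM_CSV)}

def predict(text, tables=TABLES, top_n=10):
    """Use at most the last 3 words; try the longest context first, then back off."""
    words = text.lower().split()[-3:]
    for k in range(len(words), 0, -1):
        rows = tables.get(k, {}).get(" ".join(words[-k:]), [])
        if rows:
            return sorted(rows, key=lambda r: r[1], reverse=True)[:top_n]
    return []

print(predict("I think this is so"))  # found in the quadgram table
print(predict("that is so"))          # backs off to the trigram table
```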

Thank You!