Tinniam V Ganesh

6 Aug 2015

This presentation highlights the steps in creating a Word Predict Shiny App

The steps taken were

- Ingest the data from the Tweets, Blogs and News
- Sample 15% of the data and split it into training and test sets
- Store as separate files
- Create a Corpus from the tweets, blogs and news items
- Clean the Corpus to remove punctuation, special characters, stopwords etc
- Remove profanity from the training and test set
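The sampling and cleaning steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the code behind the app (which was built in R). The 80/20 train/test ratio, the tiny stopword list and the placeholder profanity list are assumptions made for the example; the slides only state the 15% sample.

```python
import random
import re

# Illustrative word lists (assumptions, not the lists used by the app)
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to"}
PROFANITY = {"darn"}

def sample_and_split(lines, sample_frac=0.15, train_frac=0.8, seed=42):
    """Draw a random sample of the lines and split it into train and test."""
    rng = random.Random(seed)
    k = round(len(lines) * sample_frac)
    sampled = rng.sample(lines, k)
    cut = round(len(sampled) * train_frac)  # train_frac is an assumed ratio
    return sampled[:cut], sampled[cut:]

def clean(line):
    """Lower-case, strip punctuation/special characters, drop unwanted words."""
    line = re.sub(r"[^a-z\s]", " ", line.lower())
    return [w for w in line.split()
            if w not in STOPWORDS and w not in PROFANITY]

lines = [f"line {i}" for i in range(100)]
train, test = sample_and_split(lines)
print(len(train), len(test))
print(clean("The darn cat, is HERE!"))
```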

- Use the package RWeka to create Quadgrams, Trigrams, Bigrams and Unigrams
- Remove sparse terms
- Convert to a data frame and compute the frequency of each n-gram
- Use the Markov assumption to calculate the Maximum Likelihood estimate P(C|AB) = count(ABC)/count(AB)
- Apply smoothing where the count of the (n-1)-gram is 0
- Arrange the rows in descending order of conditional probability
- Write the term, the next word and the conditional probability to a CSV file
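As a rough illustration of the Maximum Likelihood step, the Python sketch below counts n-grams and (n-1)-grams in a token list and computes P(next | prefix) = count(n-gram)/count(prefix). The function names and the toy sentence are assumptions for the example; the app itself used RWeka in R for tokenization.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all consecutive n-word tuples in the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_table(tokens, n):
    """Rows of (prefix, next_word, probability), highest probability first."""
    full = Counter(ngrams(tokens, n))        # e.g. count(ABC)
    prefix = Counter(ngrams(tokens, n - 1))  # e.g. count(AB)
    rows = [(" ".join(g[:-1]), g[-1], c / prefix[g[:-1]])
            for g, c in full.items()]
    return sorted(rows, key=lambda r: -r[2])

tokens = "a b c a b d a b c".split()
for row in mle_table(tokens, 3):
    print(row)
```

Here "a b" occurs three times and is followed by "c" twice, so the row for ("a b", "c") carries probability 2/3.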

- For terms whose count is 0, perform Laplace Add-1 smoothing

Padd-1(C|AB) = (count(ABC) + 1)/(count(AB) + V), where V is the vocabulary size

This method steals probability mass from existing terms and provides it to terms whose count is 0
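A small worked example may make the effect concrete. The Python sketch below uses made-up counts (a 5-word vocabulary, a prefix seen 3 times) purely for illustration; none of these numbers come from the app's data.

```python
def laplace_prob(count_abc, count_ab, vocab_size):
    """Add-1 estimate: (count(ABC) + 1) / (count(AB) + V)."""
    return (count_abc + 1) / (count_ab + vocab_size)

# Assumed toy counts: prefix "a b" seen 3 times, followed by "c" twice and "d" once
vocab = ["a", "b", "c", "d", "e"]
counts = {"c": 2, "d": 1}
count_ab = 3

for word in vocab:
    c = counts.get(word, 0)
    mle = c / count_ab
    smoothed = laplace_prob(c, count_ab, len(vocab))
    print(f"{word}: MLE={mle:.3f}  add-1={smoothed:.3f}")
```

With these counts the estimate for "c" drops from 2/3 to 3/8, while each unseen word rises from 0 to 1/8, and the smoothed values still sum to 1 over the vocabulary.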

Given the phrase “This is so”, the backoff algorithm for finding the 10 next words is as follows

- Sum the probabilities (Pi) for “This is so” in the quadgram table, i.e. Pq = sum(Pi)
- Compute alpha = 1 - Pq
- Search the trigram table (Pj) for “is so” and compute Pt = sum(Pj)
- Multiply by alpha: Pt' = alpha * Pt
- If fewer than 10 words have been found, continue in the same way with the bigram and unigram tables
- Store only the n-1 gram, next word and conditional probability as CSV files.
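The backoff steps can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the app's implementation: the `tables` layout is hypothetical, backing off is assumed to shorten the context to its most recent words, and the leftover mass alpha is assumed to carry over multiplicatively from one level to the next.

```python
def backoff_predict(words, tables, top_k=10):
    """tables[n] maps an n-word context tuple to {next_word: probability}."""
    results = {}
    alpha = 1.0  # probability mass still available at the current level
    for n in range(min(len(words), max(tables)), -1, -1):
        context = tuple(words[len(words) - n:])  # last n words of the input
        candidates = tables.get(n, {}).get(context, {})
        level_mass = 0.0
        for word, p in candidates.items():
            level_mass += p
            if word not in results:          # keep the higher-order estimate
                results[word] = alpha * p
        if len(results) >= top_k:
            break
        alpha *= (1.0 - level_mass)          # assumed carry-over of leftover mass
    return sorted(results.items(), key=lambda kv: -kv[1])[:top_k]

# Toy tables with made-up probabilities
tables = {
    3: {("this", "is", "so"): {"cool": 0.5, "good": 0.25}},
    2: {("is", "so"): {"cool": 0.4, "much": 0.4}},
    1: {("so",): {"far": 0.5}},
}
print(backoff_predict(["this", "is", "so"], tables, top_k=3))
```

In this toy run the quadgram level supplies two words with total mass 0.75, so alpha is 0.25 and the trigram candidate "much" is scored as 0.25 * 0.4.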

- Read all the CSV files. These CSV files contain n-1 gram, next word and Probability
- Read the word(s) entered. If more than 3 words are entered, use only the last 3.
- Search the n-gram table and back off to the (n-1)-gram table, e.g. search the quadgram table, then back off to the trigram table, etc.
- When the user presses the submit button or hits enter, display the top 10 words in a table along with their conditional probabilities
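The lookup flow in the app can be illustrated with the Python sketch below. The CSV column names (`prefix`, `next_word`, `prob`) and the in-memory files are assumptions for the example, and the sketch simply returns the rows from the longest matching context rather than reproducing the app's Shiny code.

```python
import csv
import io

def load_table(csv_file):
    """Read rows of prefix, next_word, prob into {prefix: [(word, prob), ...]}."""
    table = {}
    for row in csv.DictReader(csv_file):
        table.setdefault(row["prefix"], []).append(
            (row["next_word"], float(row["prob"])))
    return table

def context_from_input(text, max_words=3):
    """Keep only the last max_words words of what the user typed."""
    return text.lower().split()[-max_words:]

def lookup(text, tables, top_k=10):
    """Try the longest context first, then back off to shorter ones."""
    words = context_from_input(text)
    for n in range(len(words), 0, -1):
        rows = tables.get(n, {}).get(" ".join(words[-n:]))
        if rows:
            return sorted(rows, key=lambda r: -r[1])[:top_k]
    return []

# In-memory stand-ins for the CSV files (made-up rows)
quad_csv = io.StringIO("prefix,next_word,prob\nthis is so,cool,0.5\nthis is so,good,0.25\n")
tri_csv = io.StringIO("prefix,next_word,prob\nis so,much,0.4\n")
tables = {3: load_table(quad_csv), 2: load_table(tri_csv)}

print(lookup("I think this is so", tables))
print(lookup("it is so", tables))
```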

Thank You!