Text Predictor

Ricardo Rios
April 15th, 2016

Data Science Capstone

Johns Hopkins University | Coursera

Introduction

Text Predictor is a Shiny application that uses Markov chains to predict the next word in a sequence of words, based on the corpus known as HC Corpora.
If the sequence of words given to Text Predictor was not seen in the corpus, it falls back on the Stupid Backoff model, which uses the following scheme:

\[ S\left(w_{i}|w_{i-n+1}^{i-1}\right)=\begin{cases} {\frac{f\left(w_{i-n+1}^{i}\right)}{f\left(w_{i-n+1}^{i-1}\right)}} & \textrm{if }f\left(w_{i-n+1}^{i}\right)>0\\ {\alpha}S{\left(w_{i}|w_{i-n+2}^{i-1}\right)} & \textrm{otherwise} \end{cases} \]

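Here \( f(\cdot) \) is the frequency count in the training data, and the context is the sequence of the \( n-1 \) preceding words:

\[ w_{i-n+1}^{i-1}=w_{i-n+1}w_{i-n+2}\ldots w_{i-1} \]

The recursion ends at the unigram score, where \( N \) is the size of the training corpus:

\[ S(w_{i})=\frac{f\left(w_{i}\right)}{N} \]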

The value of \( \alpha \) was set to 0.4. According to [1], Stupid Backoff works well with very large language models.
To build the language model, we sampled 1%, 1%, and 0.5% of the blogs, news, and Twitter datasets, respectively, to form the training set.
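The scoring scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the application's actual code: the function names `count_ngrams`, `score`, and `predict` are hypothetical, and the n-gram counts are assumed to live in a single in-memory `Counter` keyed by word tuples.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor, as stated in the text


def count_ngrams(tokens, max_n=3):
    """Count every n-gram of order 1..max_n; keys are tuples of words."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def score(word, context, counts, total, alpha=ALPHA):
    """Stupid Backoff score S(word | context); context is a tuple of words."""
    factor = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            # Seen n-gram: relative frequency f(context + word) / f(context).
            return factor * counts[ngram] / counts[context]
        # Unseen n-gram: drop the oldest context word and apply the penalty.
        context = context[1:]
        factor *= alpha
    # Base case: unigram relative frequency f(word) / N.
    return factor * counts[(word,)] / total


def predict(context, counts, total, vocab):
    """Return the word in vocab with the highest score after context."""
    return max(vocab, key=lambda w: score(w, context, counts, total))


# Toy corpus, for illustration only.
tokens = "we need to go we need to eat we want to go".split()
counts = count_ngrams(tokens)
total = len(tokens)
vocab = sorted(set(tokens))
print(predict(("we", "need"), counts, total, vocab))
```

Note that the scores are not probabilities: they are not normalized, which is what makes the scheme cheap to compute on large count tables.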
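One way to draw such a sample is to keep each line independently with the given probability. The sketch below assumes this per-line Bernoulli sampling; the text does not say how the sample was actually drawn, and the `split_sample` helper and the `RATES` labels are hypothetical names introduced for illustration.

```python
import random

# Sampling rates from the text; the keys are illustrative labels only.
RATES = {"blogs": 0.01, "news": 0.01, "twitter": 0.005}


def split_sample(lines, rate, seed=0):
    """Keep each line for training with probability `rate`.

    Returns (train, held_out); the held-out lines remain available
    for evaluation because they never enter the language model.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, held_out = [], []
    for line in lines:
        if rng.random() < rate:
            train.append(line)
        else:
            held_out.append(line)
    return train, held_out
```

With this approach the sample size is only approximately the requested fraction of the data, which is usually acceptable for corpora of this size.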

[1] Brants, T., Popat, A. C., Xu, P., Och, F. J., and Dean, J. (2007). Large language models in machine translation. In EMNLP/CoNLL 2007.

Performance of the natural language model

To measure the performance of the estimated model, we randomly chose 200 sequences of 1-grams, 2-grams, and 3-grams from the data that was not used to build the language model.
We then applied the estimated model to predict the next word for each sequence in the test set.
The percentage of correctly predicted words is shown below:

Unigram	Bigram	Trigram
0.036	0.054	0.035
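The evaluation described above amounts to counting exact matches between the predicted and the actual next word. The following sketch assumes the test set is a list of `(context, next_word)` pairs and that `predict_next` is any function mapping a context to a single word; both names are hypothetical and do not come from the application.

```python
def accuracy(test_cases, predict_next):
    """Fraction of (context, next_word) pairs predicted exactly."""
    if not test_cases:
        return 0.0
    hits = sum(1 for context, target in test_cases
               if predict_next(context) == target)
    return hits / len(test_cases)


# Toy predictor that looks only at the last word of the context.
lookup = {"need": "to", "to": "go"}
cases = [(("we", "need"), "to"), (("need", "to"), "go"), (("to",), "eat")]
print(accuracy(cases, lambda ctx: lookup.get(ctx[-1])))
```

Exact-match accuracy on the single top prediction is a strict measure, since any other plausible continuation counts as a miss.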

How to use Text Predictor

Open https://ricardoues.shinyapps.io/text_predictor in your web browser.
Type something in the text input, for example “We need”, followed by a space.
The predicted word appears below the input; press the submit button to add it to your text.
Because of RAM limits on shinyapps.io, the app does not load the complete set of n-grams, so some word combinations may produce no prediction.