TEXT PREDICTOR APPLICATION: A FIRST APPROACH

Martin Pons
12/13/2014

INTRODUCTION

Text prediction has useful applications:
Spelling correction, speech recognition…
We present here an application that takes an input phrase and predicts the next word.

THE SOFTWARE

The application has been developed in R
R is free, flexible statistical software with many libraries capable of handling text data
Libraries specifically suited to text mining tasks have been used (tm, RWeka)
The app has been built with the shiny library

THE ALGORITHM

A variant of the n-gram algorithm has been developed.
How does it work?
- The algorithm predicts the next word from its probability conditional on the previous n words.
- Naive approach: it assumes that the predicted word is independent of the rest of the words in the text (except the previous n words) -> simple algorithm with relatively low computational cost
- Joint and conditional frequencies are taken as estimators of these probabilities
- Variation: back-off model. By default, the next word is predicted from the previous 3 words. If a frequency threshold is not met, the previous 2 words are used instead, and so on.
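The back-off idea above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the app's actual code: the frequency tables, their counts, the `predict_next` name, the `threshold` default, and the `fallback` word are all hypothetical, and the threshold is assumed to apply to the count of the best candidate. Within one context, picking the largest count is the same as picking the largest conditional frequency count(context, word) / count(context).

```r
# Toy frequency tables, one per context length (3, 2 and 1 preceding words).
# All counts are invented for illustration only.
freq <- list(
  `3` = data.frame(context = c("the end of", "the end of", "is going to"),
                   nxt = c("the", "this", "be"),
                   n = c(12, 2, 9), stringsAsFactors = FALSE),
  `2` = data.frame(context = c("end of", "going to", "going to"),
                   nxt = c("the", "be", "the"),
                   n = c(20, 15, 6), stringsAsFactors = FALSE),
  `1` = data.frame(context = c("of", "to", "to"),
                   nxt = c("the", "be", "the"),
                   n = c(50, 30, 25), stringsAsFactors = FALSE)
)

# Hypothetical helper: try the longest context first and back off to
# shorter ones while the best candidate's count is below the threshold.
predict_next <- function(phrase, freq, threshold = 5, fallback = "the") {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  for (k in 3:1) {
    if (length(words) < k) next
    context <- paste(tail(words, k), collapse = " ")
    tab <- freq[[as.character(k)]]
    hits <- tab[tab$context == context, ]
    if (nrow(hits) > 0 && max(hits$n) >= threshold) {
      return(hits$nxt[which.max(hits$n)])
    }
  }
  # No context of any length met the threshold.
  fallback
}

predict_next("at the end of", freq)
predict_next("I am going to", freq)
```

The second call finds no match for the three-word context "am going to", so it backs off to the two-word context "going to".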

THE DATA

The algorithm was trained on three corpora from three different sources: blogs, Twitter and news
The model was trained on tokenized versions of these corpora. Joint frequency tables were obtained:

  word1 word2  word3 word4 freq       rel
1   the   end     of   the 3388 0.0008199
2   the  rest     of   the 2992 0.0007240
3    at   the    end    of 2486 0.0006016
4    is going     to    be 2367 0.0005728
5    is   one     of   the 1886 0.0004564
6    in   the middle    of 1873 0.0004532
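A table with this layout could be produced from a token vector roughly as shown below. This is a minimal base R sketch rather than the tm/RWeka pipeline used for the app; the `ngram_table` function and the example sentence are hypothetical, and `rel` is assumed to be `freq` divided by the total number of 4-grams.

```r
# Hypothetical helper: count every run of n consecutive tokens.
ngram_table <- function(tokens, n = 4) {
  stopifnot(length(tokens) >= n)
  starts <- seq_len(length(tokens) - n + 1)
  # One column per position in the n-gram: word1, word2, ...
  cols <- lapply(seq_len(n), function(i) tokens[starts + i - 1])
  names(cols) <- paste0("word", seq_len(n))
  grams <- as.data.frame(cols, stringsAsFactors = FALSE)
  grams$freq <- 1
  # Sum the counts of identical n-grams.
  tab <- aggregate(freq ~ ., data = grams, FUN = sum)
  # Relative frequency (assumed definition: share of all n-grams).
  tab$rel <- tab$freq / sum(tab$freq)
  tab <- tab[order(-tab$freq), ]
  rownames(tab) <- NULL
  tab
}

tokens <- strsplit("at the end of the day at the end of the road", " ")[[1]]
tab <- ngram_table(tokens)
head(tab)
```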

PERFORMANCE

Frequency tables were restructured as trees (lists of lists in R); thanks to this, the computational cost (in terms of user waiting time) is minimal.
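One way to picture such a tree is a nested list keyed by word, where a lookup follows one branch per context word instead of scanning a whole table. The sketch below is an assumption about the structure, not the app's code: `add_ngram` and `lookup_next` are hypothetical names, and the row "the end of this" with count 100 is invented so that one context has two candidates (the other counts come from the table above).

```r
# Insert one n-gram and its count into a nested list keyed by word.
add_ngram <- function(tree, words, count) {
  if (length(words) == 1) {
    tree[[words]] <- count
    return(tree)
  }
  child <- tree[[words[1]]]
  if (is.null(child)) child <- list()
  tree[[words[1]]] <- add_ngram(child, words[-1], count)
  tree
}

# Walk down the tree along the context words and return the candidate
# next words, most frequent first (NULL if the context is unknown).
# The context must be one word shorter than the stored n-grams.
lookup_next <- function(tree, context) {
  node <- tree
  for (w in context) {
    node <- node[[w]]
    if (is.null(node)) return(NULL)
  }
  counts <- unlist(node)
  names(sort(counts, decreasing = TRUE))
}

grams <- list(
  list(c("the", "end", "of", "the"), 3388),
  list(c("the", "end", "of", "this"), 100),   # invented row
  list(c("the", "rest", "of", "the"), 2992),
  list(c("is", "going", "to", "be"), 2367)
)
tree <- list()
for (g in grams) tree <- add_ngram(tree, g[[1]], g[[2]])

lookup_next(tree, c("the", "end", "of"))
```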
Prediction:
- The algorithm was evaluated on separate test corpora from the same sources. The actual next word is among the algorithm's three highest-frequency predictions approximately 26% of the time.
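A top-three accuracy of this kind could be computed along the following lines. The sketch assumes a predictor function that returns up to three candidate words for a context; `top3_accuracy`, `toy_predict`, and the toy data are hypothetical and are not the evaluation code or test corpora behind the 26% figure.

```r
# Share of test cases whose actual next word is among the candidates.
top3_accuracy <- function(contexts, actual, predict_top3) {
  hits <- vapply(seq_along(contexts), function(i) {
    actual[i] %in% predict_top3(contexts[i])
  }, logical(1))
  mean(hits)
}

# Toy predictor backed by a fixed lookup table (invented candidates).
toy_model <- list(
  "the end of"  = c("the", "a", "this"),
  "is going to" = c("be", "have", "take")
)
toy_predict <- function(ctx) {
  cand <- toy_model[[ctx]]
  if (is.null(cand)) character(0) else cand
}

contexts <- c("the end of", "is going to", "is going to", "one of my")
actual   <- c("the", "happen", "be", "friends")
acc <- top3_accuracy(contexts, actual, toy_predict)
acc
```

Here two of the four actual words appear among the candidates, so `acc` is 0.5.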

THE APPLICATION

An application with a simple user interface has been developed. This is how it works:

1- The user types a phrase

2- The user clicks the “Predict” button

3- The application returns the most likely word predicted by the algorithm
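The three steps above map naturally onto a small shiny app. The sketch below is a guess at the general shape, not the deployed application: the input and output ids, the labels, and the `predict_next_word` placeholder (which always answers "the") are assumptions made for illustration.

```r
library(shiny)

# Placeholder predictor; a real app would call the trained model here.
predict_next_word <- function(phrase) {
  if (nzchar(trimws(phrase))) "the" else ""
}

ui <- fluidPage(
  titlePanel("Text predictor"),
  textInput("phrase", "Type a phrase:"),   # step 1
  actionButton("predict", "Predict"),      # step 2
  textOutput("prediction")                 # step 3
)

server <- function(input, output) {
  # Recompute only when the "Predict" button is clicked.
  predicted <- eventReactive(input$predict, predict_next_word(input$phrase))
  output$prediction <- renderText(predicted())
}

app <- shinyApp(ui, server)
# Launch interactively with: runApp(app)
```

Using `eventReactive` ties the prediction to the button click, so typing alone does not trigger a new prediction.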
