TEXT PREDICTOR APPLICATION: A FIRST APPROACH

Martin Pons
12/13/2014

INTRODUCTION

  • Text prediction has useful applications
  • Spelling correction, speech recognition…
  • We present here an application that takes an input phrase and predicts the next word

THE SOFTWARE

  • The application has been developed in R
  • R is free, flexible statistical software with many libraries capable of handling text data
  • Libraries suited to text mining tasks have been used (tm, RWeka)
  • The app itself has been built with the shiny library

THE ALGORITHM

  • A variant of the n-gram algorithm has been developed.

  • How does it work?

    • The algorithm predicts the next word from its conditional probability given the previous n words.
    • Naive approach: it assumes that the predicted word is independent of the rest of the text, except for the previous n words -> simple algorithm and relatively low computational cost
    • Joint and conditional frequencies are taken as estimators of these probabilities
    • Variation, back-off model: by default, the next word is predicted from the previous 3 words. If a frequency threshold is not met, a 2-word model is used, and so on.
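
The back-off rule described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the application's actual code: the toy tables, the function name predict_backoff and the min_freq threshold value are all hypothetical, and the threshold is applied here to the frequency of each candidate word.

```r
# Toy frequency tables keyed by context length (hypothetical data).
# "context" holds the previous n words, "nxt" the candidate next word.
freq_tables <- list(
  `3` = data.frame(context = c("at the end", "at the end"),
                   nxt     = c("of", "and"),
                   freq    = c(50, 2),
                   stringsAsFactors = FALSE),
  `2` = data.frame(context = c("the end", "going to"),
                   nxt     = c("of", "be"),
                   freq    = c(120, 90),
                   stringsAsFactors = FALSE),
  `1` = data.frame(context = c("end", "to"),
                   nxt     = c("of", "be"),
                   freq    = c(300, 400),
                   stringsAsFactors = FALSE)
)

# Try the longest context first; back off to shorter contexts whenever
# no candidate reaches the (assumed) frequency threshold.
predict_backoff <- function(phrase, tables, max_n = 3, min_freq = 5) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[words != ""]
  if (length(words) == 0) return(NA_character_)
  for (n in seq(min(max_n, length(words)), 1)) {
    context <- paste(tail(words, n), collapse = " ")
    tab <- tables[[as.character(n)]]
    hits <- tab[tab$context == context & tab$freq >= min_freq, ]
    if (nrow(hits) > 0) {
      # For a fixed context, the highest joint frequency is also the
      # highest estimated conditional probability.
      return(hits$nxt[which.max(hits$freq)])
    }
  }
  NA_character_  # no table contains a usable context
}

predict_backoff("we met at the end", freq_tables)  # uses the 3-word context
predict_backoff("I am going to", freq_tables)      # backs off to 2 words
```

Starting from the longest context keeps the most specific evidence when it is available; the shorter tables only act as a fallback.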

THE DATA

  • The algorithm was trained on three corpora from three different sources: blogs, Twitter and news

  • The algorithm was trained on tokenized versions of these corpora. Joint frequency tables were obtained.

  word1 word2  word3 word4 freq       rel
1   the   end     of   the 3388 0.0008199
2   the  rest     of   the 2992 0.0007240
3    at   the    end    of 2486 0.0006016
4    is going     to    be 2367 0.0005728
5    is   one     of   the 1886 0.0004564
6    in   the middle    of 1873 0.0004532
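
A table with this layout can be produced from a vector of tokens roughly as follows. The sketch uses base R only, whereas the application relied on tm and RWeka; the function name ngram_table and the sample sentence are assumptions made for illustration.

```r
# Count all n-grams in a character vector of tokens (base R only).
ngram_table <- function(tokens, n = 4) {
  stopifnot(length(tokens) >= n)
  starts <- seq_len(length(tokens) - n + 1)
  # Column k holds the k-th word of every n-gram.
  cols <- lapply(seq_len(n), function(k) tokens[starts + k - 1])
  names(cols) <- paste0("word", seq_len(n))
  grams <- as.data.frame(cols, stringsAsFactors = FALSE)
  grams$freq <- 1
  # Joint frequency: sum the counter over identical n-grams.
  tab <- aggregate(freq ~ ., data = grams, FUN = sum)
  # Relative frequency over all n-grams in the sample.
  tab$rel <- tab$freq / sum(tab$freq)
  tab <- tab[order(-tab$freq), ]
  rownames(tab) <- NULL
  tab
}

# Hypothetical sample text, tokenized on anything that is not a letter.
text <- "the end of the day the end of the day the end of the road"
tokens <- strsplit(tolower(text), "[^a-z']+")[[1]]
tokens <- tokens[tokens != ""]

tab <- ngram_table(tokens, n = 4)
head(tab)
```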

PERFORMANCE

  • Frequency tables were restructured as trees (lists of lists in R); thanks to this, the computational cost (in terms of user waiting time) is minimal.

  • Prediction:

    • The algorithm was evaluated on separate test corpora from the same sources. The actual next word is among the three highest-frequency predictions approximately 26% of the time.
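
The two points above, the tree structure and the top-three evaluation, can be sketched together. The layout below (one nested list per context word, with candidate frequencies under a reserved ".next" entry) is an assumption about how such a tree could look, not the application's actual structure. The helper names are hypothetical, the "is going to have" row and the test n-grams are invented, and the remaining frequencies come from the table shown earlier.

```r
# Insert one n-gram into a nested list: context words are the path,
# and the last word is stored with its frequency under ".next".
add_ngram <- function(tree, words, freq) {
  if (length(words) == 1) {
    nxt <- tree[[".next"]]
    if (is.null(nxt)) nxt <- numeric(0)
    nxt[words] <- freq
    tree[[".next"]] <- nxt
    return(tree)
  }
  child <- tree[[words[1]]]
  if (is.null(child)) child <- list()
  tree[[words[1]]] <- add_ngram(child, words[-1], freq)
  tree
}

# Walk the tree along the context and return the k most frequent next words.
top_candidates <- function(tree, context, k = 3) {
  node <- tree
  for (w in context) {
    node <- node[[w]]
    if (is.null(node)) return(character(0))
  }
  nxt <- node[[".next"]]
  if (is.null(nxt)) return(character(0))
  head(names(sort(nxt, decreasing = TRUE)), k)
}

# Share of test n-grams whose last word is among the top three candidates.
top3_hit_rate <- function(tree, test_ngrams) {
  hits <- vapply(test_ngrams, function(g) {
    n <- length(g)
    g[n] %in% top_candidates(tree, g[-n], k = 3)
  }, logical(1))
  mean(hits)
}

ngrams <- list(
  list(words = c("the", "end", "of", "the"),  freq = 3388),
  list(words = c("at", "the", "end", "of"),   freq = 2486),
  list(words = c("is", "going", "to", "be"),  freq = 2367),
  list(words = c("is", "going", "to", "have"), freq = 100)  # hypothetical row
)
tree <- list()
for (g in ngrams) tree <- add_ngram(tree, g$words, g$freq)

top_candidates(tree, c("is", "going", "to"))

# Hypothetical held-out n-grams: two hits, two misses.
test_ngrams <- list(
  c("the", "end", "of", "the"),
  c("is", "going", "to", "be"),
  c("at", "the", "end", "with"),
  c("on", "the", "other", "hand")
)
top3_hit_rate(tree, test_ngrams)
```

A lookup visits one list level per context word, so its cost depends on the context length rather than on the number of rows in the frequency table.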

THE APPLICATION

An application with a simple user interface has been developed. This is how it works:

1- The user types a phrase

2- The user clicks the “Predict” button

3- The application returns the most likely word predicted by the algorithm
