Language Prediction

wandervogel

Introduction to the Language Prediction Task

Nowadays, everyone who types e-mails or messages on a smartphone or tablet comes into contact with language prediction. The predicted word is sometimes useful, i.e. the prediction is correct, but often it is not the word one has in mind, so the prediction models leave room for improvement.

The goal of this project is to implement an app that predicts the next word in an English sentence, based on a statistical model for Natural Language Processing.

The given training data consists of texts from news articles, blogs, and Twitter tweets (https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). The size of the training data is as follows:

  source     number of lines   number of words
  news                77,259         2,643,969
  blogs              899,288        37,334,131
  twitter          2,360,148        30,373,543

The Steps towards the Prediction Model

To create a statistical model that predicts the next word given one, two, or three words of a sentence, the following steps are taken on the given training set:

  • Cleaning the data
  • Deleting lines that contain profane words
  • Creating dictionaries of the most frequently used words as well as of the most frequent two- and three-word sequences (bi-grams and tri-grams); a minimal counting sketch follows this list
  • Calculating the probabilities of the N-grams, taking the Markov assumption into account (i.e. the next word can be predicted from only a short history of preceding words)
  • Implementing models that predict the following word, given one or two words
  • Implementing a model that combines these probabilities using Katz's back-off model (a simplified prediction sketch appears below the last table)
  • Designing a method that converts an input of some words into a form the model can use
  • Designing and implementing a Shiny app to demonstrate the abilities of the prediction model
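
The counting and probability steps can be illustrated with a short sketch. This is a minimal base-R illustration, not the app's actual code: the vector lines, the helper names, and the regular expressions are all assumptions made for the example.

  # Minimal sketch: cleaning, N-gram counting, and MLE probabilities.
  # `lines` is assumed to hold the raw corpus, one string per element.
  clean_line <- function(x) {
    x <- tolower(x)
    x <- gsub("[^a-z' ]", " ", x)      # keep only letters and apostrophes
    gsub("\\s+", " ", trimws(x))       # collapse runs of whitespace
  }

  count_ngrams <- function(lines, n) {
    grams <- unlist(lapply(strsplit(clean_line(lines), " ", fixed = TRUE), function(w) {
      w <- w[nzchar(w)]
      if (length(w) < n) return(character(0))
      vapply(seq_len(length(w) - n + 1),
             function(i) paste(w[i:(i + n - 1)], collapse = " "), character(1))
    }))
    sort(table(grams), decreasing = TRUE)   # frequency table, most frequent first
  }

  uni <- count_ngrams(lines, 1)
  bi  <- count_ngrams(lines, 2)
  tri <- count_ngrams(lines, 3)

  # Markov assumption: P(w3 | w1, w2) is estimated as count(w1 w2 w3) / count(w1 w2)
  p_mle <- function(trigram) unname(tri[trigram] / bi[sub(" \\S+$", "", trigram)])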

Coverage of the Whole Text

One idea to increase the prediction speed is to restrict the number of words or N-grams in the dictionary. The following plots show that a large part of the whole text is already covered by a comparatively small number of mono-grams, bi-grams, or tri-grams.

[Plots: cumulative coverage of the whole text as a function of the number of most frequent mono-grams, bi-grams, and tri-grams]
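
The coverage curves can be computed along the following lines, reusing the uni table from the sketch after the step list (again an illustrative sketch, not the app's actual code):

  # Fraction of all word instances covered by the k most frequent words;
  # the same idea applies to the bi-gram and tri-gram tables.
  coverage <- cumsum(as.numeric(uni)) / sum(uni)
  which(coverage >= 0.9)[1]   # number of words needed to cover 90% of the text
  plot(coverage, type = "l",
       xlab = "Number of most frequent words",
       ylab = "Fraction of text covered")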

The Prediction App

To increase the prediction speed, and because the Shiny app otherwise runs out of memory, the dictionary on which the model is based is restricted to N-grams whose frequency is greater than four. That means that all N-grams occurring four times or fewer in the whole text are removed from the dictionary.
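
Under the same assumptions as in the sketches above, this pruning is a one-line filter on each count table:

  # Keep only N-grams that occur more than four times in the corpus.
  uni_pruned <- uni[uni > 4]
  bi_pruned  <- bi[bi > 4]
  tri_pruned <- tri[tri > 4]
  length(tri_pruned)   # number of distinct tri-grams that survive the cut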

The numbers of N-grams used are as follows:

  type of N-gram   number of distinct N-grams
  mono-grams                          202,646
  bi-grams                          1,186,405
  tri-grams                         1,384,621
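
The prediction step itself, including the conversion of the user input, might look roughly as follows. This sketch backs off from tri-grams to bi-grams to the most frequent single words; it is a simplified stand-in for Katz's back-off (the discounting of the probabilities is omitted) and reuses the hypothetical helpers and pruned tables from the sketches above:

  # Normalise the input, keep at most the last two words, and back off
  # from tri-grams to bi-grams to the overall most frequent words.
  predict_next <- function(input, k = 3) {
    w <- strsplit(clean_line(input), " ", fixed = TRUE)[[1]]
    w <- tail(w[nzchar(w)], 2)
    pick <- function(tab, prefix) {
      m <- tab[startsWith(names(tab), paste0(prefix, " "))]
      sub(".* ", "", names(sort(m, decreasing = TRUE)))  # last word of each match
    }
    cand <- character(0)
    if (length(w) == 2) cand <- pick(tri_pruned, paste(w, collapse = " "))
    if (length(cand) < k && length(w) >= 1) cand <- c(cand, pick(bi_pruned, tail(w, 1)))
    cand <- c(cand, names(uni_pruned))   # final fallback: most frequent words
    head(unique(cand), k)
  }

  predict_next("I would like to")   # the k most likely continuations by count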

The result can be tested in the app here: https://sora725-2.shinyapps.io/language_prediction_app/