Predictive keyboard. Data Science Capstone.

José Antonio González Prieto
10 Dec. 2014

The project

The final objective of the project is to develop a predictive keyboard that helps the user write English text by trying to “anticipate the next word”.

Summary of the data sources used to develop the project:

  • Blogs: 260,564,320 bytes, 899,288 lines and 38,222,304 words.
  • News: 261,759,048 bytes, 1,010,242 lines and 35,710,849 words.
  • Twitter: 316,037,344 bytes, 2,360,148 lines and 30,433,509 words.

The Natural Language Processing and Analysis

  • N.L.P. processing

    • Remove punctuation and numbers.
    • Strip extra whitespace and convert to lower case.
  • N.L.P. analysis

    • Number of words, characters and sentences per text and per source (blogs, news and Twitter).
    • Number of words and characters per sentence, by source.
    • Number of verbs, adjectives, adverbs and nouns per sentence, by source.
    • Most frequent words, 2-grams, 3-grams and 4-grams.
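
As an illustration of the processing and n-gram counting steps listed above, here is a minimal Python sketch. It is not the code used in the project: the function names `clean` and `ngram_counts` are hypothetical, and it assumes plain ASCII English text (any character that is not a-z or whitespace is deleted).

```python
import re
from collections import Counter

def clean(text):
    """Lower-case, delete punctuation and digits, collapse whitespace."""
    text = text.lower()
    # Assumption: keep only a-z and whitespace; everything else is deleted.
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def ngram_counts(sentences, n):
    """Count the n-grams found inside each cleaned sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = clean(sentence).split()
        # zip over n shifted copies of the token list yields the n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

bigrams = ngram_counts(["I like tea.", "I like coffee!"], 2)
print(bigrams.most_common(1))
```

Counting inside each sentence separately avoids creating n-grams that span a sentence boundary.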

The predictive model

Combination of short and long distance models:

  • Ngram model (Markov)
    • Short distance model.
    • Based on the weighted probabilities of the 4-grams, 3-grams and 2-grams that end the sentence.
    • Prob = w4 * Prob4gram + w3 * Prob3gram + w2 * Prob2gram
  • Grammar model
    • Long distance model.
    • Based on how often the most common verbs, adverbs, adjectives and nouns appear together in the same sentence.
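
The weighted sum of the n-gram model can be sketched as follows. This is only a sketch under stated assumptions: probabilities are estimated as plain relative frequencies, the helper names (`build_counts`, `ngram_prob`, `interpolated_prob`) are hypothetical, and the weights 0.5, 0.3 and 0.2 are illustrative values, not the ones used in the application.

```python
from collections import Counter

def build_counts(tokens, max_n=4):
    """Count every n-gram of order 1..max_n in a token list."""
    return {n: Counter(zip(*(tokens[i:] for i in range(n))))
            for n in range(1, max_n + 1)}

def ngram_prob(counts, context, word):
    """Relative-frequency estimate of P(word | context); 0.0 if context is unseen."""
    context = tuple(context)
    n = len(context) + 1
    denom = counts[n - 1][context]
    return counts[n][context + (word,)] / denom if denom else 0.0

def interpolated_prob(counts, history, word, weights):
    """Prob = w4 * Prob4gram + w3 * Prob3gram + w2 * Prob2gram."""
    total = 0.0
    for n, w in weights.items():
        context = history[-(n - 1):]
        if len(context) == n - 1:  # skip orders the history is too short for
            total += w * ngram_prob(counts, context, word)
    return total

tokens = "the cat sat on the mat the cat ran".split()
counts = build_counts(tokens)
weights = {4: 0.5, 3: 0.3, 2: 0.2}  # assumed values for illustration
history = ["mat", "the", "cat"]
for candidate in ("ran", "sat"):
    print(candidate, interpolated_prob(counts, history, candidate, weights))
```

In this toy corpus, "ran" scores higher than "sat" because it is the only word supported by the 4-gram context.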

The application

  • Selector for the number of words to predict.
  • Model balance selector that sets the relative weight of the n-gram and grammar models.
  • Selectors for the weight applied to each n-gram model (2-grams, 3-grams and 4-grams).
  • Selector for the number of words in the sentence that the long distance model takes into account.
  • Table and plot view of the word candidates.
  • See: https://jagprieto.shinyapps.io/predictive_keyboard/
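
One plausible way for the model balance selector to combine both models is a linear blend of their normalized scores. The sketch below assumes exactly that; the application may combine them differently, and the names `normalize`, `blend`, `balance` and `top_k` are hypothetical.

```python
def normalize(scores):
    """Scale scores so they sum to 1 (left unchanged if they sum to 0)."""
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()} if total else dict(scores)

def blend(ngram_scores, grammar_scores, balance, top_k=3):
    """balance=1.0 uses only the n-gram model, 0.0 only the grammar model."""
    ng, gr = normalize(ngram_scores), normalize(grammar_scores)
    combined = {
        word: balance * ng.get(word, 0.0) + (1 - balance) * gr.get(word, 0.0)
        for word in set(ng) | set(gr)
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

ngram_scores = {"ran": 0.75, "sat": 0.25}   # made-up n-gram probabilities
grammar_scores = {"sat": 3, "slept": 1}     # made-up co-occurrence counts
print(blend(ngram_scores, grammar_scores, balance=0.5, top_k=2))
```

Here `top_k` plays the role of the "number of words to predict" setting.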