Predictive keyboard. Data Science Capstone.

José Antonio González Prieto
10 Dec. 2014

The project

The final objective of the project is to develop a predictive keyboard that helps the user write English text by trying to “anticipate the next word”.

Summary of the data sources used to develop the project:

Blogs: 260,564,320 bytes, 899,288 lines and 38,222,304 words.
News: 261,759,048 bytes, 1,010,242 lines and 35,710,849 words.
Twitter: 316,037,344 bytes, 2,360,148 lines and 30,433,509 words.

The Natural Language Processing and Analysis

N.L.P. processing
- Remove punctuation and numbers.
- Strip extra whitespace and convert to lower case.
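
The two processing steps above can be sketched in a few lines. This is only an illustration, not the project's actual code: the function name `clean_text` is hypothetical, and the sketch assumes plain ASCII English text, so accented letters would be dropped along with punctuation.

```python
import re

def clean_text(line: str) -> str:
    """Lowercase a line, delete punctuation and digits, collapse whitespace."""
    line = line.lower()
    # Assumption: anything that is not a-z or whitespace counts as
    # punctuation or a number and is deleted outright.
    line = re.sub(r"[^a-z\s]", "", line)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", line).strip()

if __name__ == "__main__":
    print(clean_text("  Hello, World!  It's 2014... "))
```

Deleting the apostrophe turns "It's" into "its"; whether contractions should be kept is a design choice that the slides do not specify.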
N.L.P. analysis
- Number of words, characters and sentences by text and source (blogs, news and Twitter).
- Number of words and characters per sentence, by source.
- Number of verbs, adjectives, adverbs and nouns per sentence, by source.
- Most frequent words, 2-grams, 3-grams and 4-grams.
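
As a rough illustration of the last analysis item, the most frequent n-grams can be obtained by counting sliding windows of n tokens over each cleaned sentence. The helper names `ngrams` and `count_ngrams` and the three sample sentences are invented for this sketch; they are not taken from the project.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(sentences, n):
    """Count n-grams over already-cleaned, space-separated sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return counts

if __name__ == "__main__":
    # Toy corpus, for illustration only.
    sample = ["i want to go home", "i want to sleep", "i want to go out"]
    for n in (1, 2, 3, 4):
        print(n, count_ngrams(sample, n).most_common(2))
```

N-grams are counted within a sentence only, so no n-gram spans a sentence boundary.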

The predictive model

Combination of short and long distance models:

Ngram model (Markov)
- Short distance model.
- Based on the weighted probabilities of the 4-grams, 3-grams and 2-grams in the sentence.
- Prob = w4 * Prob4gram + w3 * Prob3gram + w2 * Prob2gram
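
The weighted sum above might be computed as in the following sketch. It assumes each n-gram probability is a simple relative frequency (the count of the n-gram divided by the count of its context); the slides do not say how the probabilities are estimated. The function names, the toy corpus and the default weights 0.5, 0.3 and 0.2 are all illustrative assumptions.

```python
from collections import Counter

def build_counts(sentences, max_n=4):
    """Count all 1-grams up to max_n-grams in space-separated sentences."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in counts:
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def ngram_prob(counts, context, word):
    """Relative-frequency estimate of P(word | context)."""
    context = tuple(context)
    seen = counts[len(context)][context]
    if seen == 0:
        return 0.0
    return counts[len(context) + 1][context + (word,)] / seen

def interpolated_prob(counts, history, word, w4=0.5, w3=0.3, w2=0.2):
    """Prob = w4 * Prob4gram + w3 * Prob3gram + w2 * Prob2gram."""
    prob = 0.0
    for n, weight in ((4, w4), (3, w3), (2, w2)):
        if len(history) >= n - 1:  # skip orders the history is too short for
            prob += weight * ngram_prob(counts, history[-(n - 1):], word)
    return prob

if __name__ == "__main__":
    corpus = ["i want to go home", "i want to sleep", "i want to go out"]
    counts = build_counts(corpus)
    print(interpolated_prob(counts, ["i", "want", "to"], "go"))
```

In this toy corpus "go" follows "i want to", "want to" and "to" in two cases out of three, so with weights that sum to 1 the combined probability is also 2/3.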
Grammar model
- Long distance model.
- Based on the number of times that the most common verbs, adverbs, adjectives and nouns appear together in the same sentence.
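
One possible reading of the grammar model is a sentence-level co-occurrence table: for each pair of frequent content words, count the sentences that contain both, then score a candidate by how often it has appeared alongside the content words already typed. The sketch below follows that reading. It is an assumption about the approach, not the project's code, and it takes the set of content words as given instead of running a part-of-speech tagger.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences, content_words):
    """Count, per unordered word pair, the sentences containing both words."""
    counts = Counter()
    for sentence in sentences:
        present = sorted(set(sentence.split()) & content_words)
        for pair in combinations(present, 2):  # pairs are alphabetically ordered
            counts[pair] += 1
    return counts

def grammar_score(counts, history, candidate, content_words):
    """Sum the co-occurrence counts between the candidate and the history."""
    score = 0
    for word in set(history) & content_words:
        if word != candidate:
            score += counts[tuple(sorted((word, candidate)))]
    return score

if __name__ == "__main__":
    # Toy data: in practice the content words would be the most common
    # verbs, adverbs, adjectives and nouns found by the NLP analysis.
    sentences = ["the dog chased the ball", "the dog loves the ball",
                 "the cat chased the mouse"]
    content = {"dog", "cat", "ball", "mouse", "chased", "loves"}
    table = cooccurrence_counts(sentences, content)
    history = ["the", "dog", "chased", "the"]
    for candidate in ("ball", "mouse"):
        print(candidate, grammar_score(table, history, candidate, content))
```

Because the score ignores word order and distance, it can pick up context from anywhere in the sentence, which is what makes it a long distance complement to the n-gram model.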

The application

Selector for the number of words to predict.
Model balance selector that sets the relative weight of the n-gram and grammar models.
Selectors for the weight applied to each n-gram model (4-gram, 3-gram and 2-gram).
Selector for the number of words in the sentence that the long distance model takes into account.
Table and plot view of the word candidates.
See: https://jagprieto.shinyapps.io/predictive_keyboard/
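
The slides do not state how the model balance selector combines the two models. A minimal sketch, assuming a linear blend of normalised scores controlled by a single `balance` value between 0 and 1, could look like this; the function names and the example scores are hypothetical.

```python
def normalise(scores):
    """Scale a word -> score mapping so that the values sum to 1."""
    total = sum(scores.values())
    return {w: (v / total if total else 0.0) for w, v in scores.items()}

def blend_scores(ngram_scores, grammar_scores, balance=0.5):
    """balance=1 uses only the n-gram model, balance=0 only the grammar model."""
    ng, gr = normalise(ngram_scores), normalise(grammar_scores)
    return {w: balance * ng.get(w, 0.0) + (1 - balance) * gr.get(w, 0.0)
            for w in set(ng) | set(gr)}

def top_candidates(blended, k=3):
    """Return the k best words, breaking ties alphabetically."""
    return sorted(blended, key=lambda w: (-blended[w], w))[:k]

if __name__ == "__main__":
    # Made-up scores for two candidate words.
    ngram_scores = {"ball": 2, "mouse": 2}
    grammar_scores = {"ball": 3, "mouse": 1}
    blended = blend_scores(ngram_scores, grammar_scores, balance=0.5)
    print(top_candidates(blended, k=2), blended)
```

Normalising each model's scores first keeps the balance value meaningful even though n-gram probabilities and co-occurrence counts live on different scales.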