Data Science Capstone

The goal of this project was to create a Shiny application that predicts the next word of an (incomplete) phrase typed by the user. The result can be found here:
https://archnae.shinyapps.io/DataScienceCapstone/

The implementation consists of three steps:

  • building the language model
  • training this language model
  • running the trained model

The language model

This project used a 4-gram language model with Kneser-Ney smoothing.
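
For reference, a sketch of the interpolated Kneser-Ney formula in its bigram form (the 4-gram model applies the same recursion three levels deeper): each observed count is discounted by a fixed amount d, and the freed mass is redistributed to a "continuation" probability that measures in how many contexts a word appears.

    P_{KN}(w_i \mid w_{i-1})
      = \frac{\max\bigl(c(w_{i-1} w_i) - d,\, 0\bigr)}{c(w_{i-1})}
      + \lambda(w_{i-1})\, P_{cont}(w_i)

    P_{cont}(w_i) = \frac{\bigl|\{\, w' : c(w' w_i) > 0 \,\}\bigr|}{\bigl|\{\, (w', w'') : c(w' w'') > 0 \,\}\bigr|},
    \qquad
    \lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\, \bigl|\{\, w' : c(w_{i-1} w') > 0 \,\}\bigr|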

[Figure: 4-gram word cloud]

The next word is predicted as the last word of the most probable 4-gram whose first three words match the last three words of the phrase typed so far.
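
As a concrete illustration, the lookup could work as in the minimal sketch below. The table layout (a data frame with columns w1–w4 and prob) and the function name are assumptions made for this sketch, not the app's actual code, and back-off to lower-order N-grams is omitted.

    # Sketch of the 4-gram lookup; assumes a data frame `ngrams4`
    # with word columns w1..w4 and a probability column `prob`.
    predict_next <- function(phrase, ngrams4) {
      words <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
      n <- length(words)
      if (n < 3) return(NA_character_)     # back-off not shown here
      last3 <- words[(n - 2):n]
      hits <- ngrams4[ngrams4$w1 == last3[1] &
                      ngrams4$w2 == last3[2] &
                      ngrams4$w3 == last3[3], ]
      if (nrow(hits) == 0) return(NA_character_)
      hits$w4[which.max(hits$prob)]        # most probable continuation
    }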

The training data

The N-gram probabilities were calculated using 200,000 blog records provided by SwiftKey. The data were cleaned before use (see the sketch after this list):

  • all non-alphanumeric characters were removed.
  • all contractions (e.g. “I've”) were replaced with their full forms (“I have”).
  • all numbers were replaced with a single special token.
  • all obscene words were replaced with another special token.
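
A rough sketch of these steps in base R follows. The token names (<num>, <bad>), the abbreviated contraction table, and the placeholder word list are illustrative, not the ones used in the project; the steps are reordered slightly so the special tokens survive the punctuation removal.

    clean_text <- function(x, bad_words = c("placeholderword")) {
      x <- tolower(x)
      x <- gsub("i've", "i have", x, fixed = TRUE)     # expand contractions
      x <- gsub("can't", "cannot", x, fixed = TRUE)    # (full table needed)
      x <- gsub("[0-9]+", " <num> ", x)                # numbers -> token
      x <- gsub(paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b"),
                " <bad> ", x)                          # profanity -> token
      x <- gsub("[^a-z<> ]", " ", x)                   # drop non-alphanumerics
      gsub("\\s+", " ", trimws(x))                     # squeeze whitespace
    }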

Calculated probabilities are saved in R data files to be used for prediction.
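
A minimal sketch of this hand-off, assuming RDS files (the project may equally use save()/load() with .RData files):

    saveRDS(ngrams4, "ngrams4.rds")       # at the end of training
    ngrams4 <- readRDS("ngrams4.rds")     # inside the Shiny app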

The running application

It uses the probability data saved in the previous step to predict the next word of a partially typed phrase, and is available at:

https://archnae.shinyapps.io/DataScienceCapstone/
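
A minimal Shiny skeleton in the spirit of the app is sketched below; the widget names are invented here, and predict_next() from the lookup sketch above is assumed to be defined. This is not the published source.

    library(shiny)

    ui <- fluidPage(
      textInput("phrase", "Type a phrase:"),
      textOutput("next_word")
    )

    server <- function(input, output) {
      ngrams4 <- readRDS("ngrams4.rds")   # probabilities from training
      output$next_word <- renderText({
        predict_next(input$phrase, ngrams4)
      })
    }

    shinyApp(ui, server)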

Possible improvements

Using this app has shown that the 4-gram model often degrades into a tri- or even bi-gram model because of the omnipresent articles (“a” and “the”) and other overly common “noise” words. While staying within the N-gram framework, the algorithm could hopefully be improved by the following (see the sketch after the list):

  • combining articles with following nouns into a single token (“a-lot”);
  • combining phrasal verbs with following prepositions (“get-to”);
  • combining pronouns with following modal verbs (“I-have”).
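
As an illustration of the first idea, a pre-tokenisation pass could glue articles to the following word before the N-grams are counted; the hyphenated token names are invented for this sketch.

    merge_articles <- function(x) {
      gsub("\\b(a|an|the) (\\w+)", "\\1-\\2", x)
    }
    merge_articles("i read a lot the other day")
    # [1] "i read a-lot the-other day"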