wandervogel
Nowadays everyone who types emails or messages on a smartphone or tablet comes into contact with language prediction. The predicted word is sometimes the intended one, but often it is not the word the user had in mind, so prediction models still have room for improvement.
The goal of this project is to implement an app which predicts the next word in an English sentence based on a statistical model for Natural Language Processing.
The given training data consists of texts from news, blogs and Twitter: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The size of the training data is as follows:
Source     Number of lines   Number of words
news                 77259           2643969
blogs               899288          37334131
twitter            2360148          30373543
To create a statistical model that predicts the next word given one, two or three words of a sentence, using the given training set, the following steps are taken:
One idea to increase the prediction speed is to restrict the number of words or N-grams in the dictionary. The following plots show that a large part of the whole text can be covered by a comparatively small number of mono-grams, bi-grams or tri-grams.
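The coverage idea can be illustrated with a minimal Python sketch. It is not the code behind the app; the helper names `ngrams` and `types_needed_for_coverage` and the toy sentence are assumptions made for illustration. It counts how many of the most frequent N-grams are needed to cover a given share of all N-gram occurrences.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def types_needed_for_coverage(counts, target):
    """Smallest number of most frequent n-gram types whose
    occurrences add up to at least `target` (0..1) of all occurrences."""
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)

# Toy corpus for illustration only (not the SwiftKey data).
tokens = "the cat sat on the mat and the cat ate the fish".split()
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

# Number of distinct mono-grams needed to cover half of the toy text.
print(types_needed_for_coverage(unigram_counts, 0.5))
```

In the toy text, two of the eight distinct words already account for half of all word occurrences, which is the same effect the plots describe for the full corpus.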
To increase the prediction speed, and because the Shiny app otherwise runs out of memory, the dictionary on which the model is based is restricted to N-grams whose frequency is greater than four. That means that rare N-grams, including all those found only once or twice in the whole text, are removed from the dictionary.
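A frequency cut-off of this kind could look as follows in Python. This is a sketch under assumptions: the function name `prune`, the `threshold` parameter and the example counts are hypothetical and do not come from the app.

```python
from collections import Counter

def prune(counts, threshold):
    """Keep only n-grams whose frequency is strictly greater than `threshold`."""
    return Counter({gram: freq for gram, freq in counts.items() if freq > threshold})

# Made-up bi-gram counts for illustration.
counts = Counter({
    ("of", "the"): 9,
    ("in", "the"): 5,
    ("the", "cat"): 4,
    ("cat", "sat"): 2,
    ("sat", "on"): 1,
})

pruned = prune(counts, threshold=4)
print(len(counts), "->", len(pruned))
```

With a threshold of four, only the bi-grams seen at least five times remain, so the dictionary shrinks while the most frequent entries are kept.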
The numbers of N-grams used are as follows:
Type of N-grams   Number of different N-grams
mono-grams                             202646
bi-grams                              1186405
tri-grams                             1384621
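How a dictionary of bi-grams and tri-grams can be turned into a next-word prediction is sketched below in Python. The text does not say how the model combines the different N-gram orders, so the fallback from a two-word context to a one-word context is an assumption, as are the names `build_table` and `predict_next` and the example sentences.

```python
from collections import Counter, defaultdict

def build_table(sentences, max_n=3):
    """Map each context of 1..max_n-1 words to a Counter of following words."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                table[context][tokens[i + n - 1]] += 1
    return table

def predict_next(table, text, max_context=2):
    """Return the most frequent follower of the longest known context,
    trying shorter contexts if the longer one is unknown (assumed strategy)."""
    tokens = text.lower().split()
    for k in range(min(max_context, len(tokens)), 0, -1):
        context = tuple(tokens[-k:])
        if context in table:
            return table[context].most_common(1)[0][0]
    return None  # no known context at all

# Made-up training sentences for illustration.
sentences = [
    "I like green tea",
    "I like green apples",
    "I like green tea a lot",
    "you like black tea",
]
table = build_table(sentences)
print(predict_next(table, "I like green"))
```

Because this sketch only stores N-grams up to tri-grams, at most the last two words of the input serve as context; longer input is simply truncated to its final words.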
The result can be tested in the app here: https://sora725-2.shinyapps.io/language_prediction_app/