Created by Dimitrios Apostolopoulos May 20th, 2015
This presentation briefly describes a Shiny application that predicts the next word following the user's input. It was created as the final part of the Data Science Specialization offered by JHU and Coursera.
About the text corpus
The text corpus used to build the algorithm is provided by SwiftKey and can be downloaded from this link. It consists of three text files containing data from blogs, Twitter and news feeds. The corpus is multilingual, but only the English files were used for this project.
To improve the algorithm's performance, we did not use the whole corpus. Instead, we took a random sample of 100,000 rows from each text file.
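The sampling step can be illustrated with a minimal Python sketch. It is not the code behind the application (which is a Shiny app); the function name `sample_lines`, the fixed seed and the in-memory list of rows are assumptions made for illustration only.

```python
import random


def sample_lines(lines, k=100_000, seed=2015):
    """Return up to k rows chosen uniformly at random, without replacement.

    `seed` is an illustrative assumption that makes the sample reproducible.
    """
    lines = list(lines)
    if len(lines) <= k:
        # Fewer rows than requested: keep them all.
        return lines
    rng = random.Random(seed)
    return rng.sample(lines, k)
```

Applied to each of the three files in turn, this would yield at most 300,000 rows in total for the later steps.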
The Algorithm
Corpus creation
Merge the three text files into one.
Clean the data (remove non-ASCII characters, convert all letters to lowercase, remove URLs, remove punctuation marks, strip extra whitespace, remove profanity).
Tokenize the corpus into n-grams of four, three and two words.
Build a frequency table for each n-gram length.
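The corpus creation steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the application's own code: the names `clean`, `ngrams` and `build_tables`, the URL pattern, and the one-word `PROFANITY` placeholder are all hypothetical, and a real profanity list would have to be supplied separately.

```python
import re
import string
from collections import Counter

# Hypothetical placeholder; a real word list would be loaded from a file.
PROFANITY = {"badword"}

URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean(text):
    """Apply the cleaning steps in order and return a list of word tokens."""
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII
    text = text.lower()                                    # lowercase
    text = URL_PATTERN.sub(" ", text)                      # remove URLs
    text = text.translate(PUNCTUATION_TABLE)               # remove punctuation
    tokens = text.split()                                  # strip whitespace
    return [t for t in tokens if t not in PROFANITY]       # remove profanity


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_tables(lines, orders=(4, 3, 2)):
    """Count n-gram frequencies for each requested n-gram length."""
    tables = {n: Counter() for n in orders}
    for line in lines:
        tokens = clean(line)
        for n in orders:
            tables[n].update(ngrams(tokens, n))
    return tables
```

URLs are removed before punctuation here because stripping punctuation first would break the `://` and `.` that the pattern relies on.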
The algorithm makes predictions from the user's input. It first looks for a match in the 4-gram table, then falls back to the 3-gram table and finally to the 2-gram table. If no prediction is found, it returns an appropriate message.
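The fallback from 4-grams to 3-grams to 2-grams can be sketched as below. This is a minimal Python illustration, not the application's implementation: the function name `predict` is hypothetical, and the sketch assumes that each table is a `Counter` mapping word tuples to counts and that a match means the last n-1 input words equal the first n-1 words of an n-gram.

```python
from collections import Counter


def predict(tables, text, top=3):
    """Return up to `top` (word, count) pairs, trying the longest n-grams first.

    `tables` maps an n-gram length n to a Counter of n-word tuples.
    An empty list means no table produced a match.
    """
    words = text.lower().split()
    for n in sorted(tables, reverse=True):      # e.g. 4, then 3, then 2
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue                            # input too short for this table
        candidates = Counter()
        for gram, count in tables[n].items():
            if gram[:-1] == context:
                candidates[gram[-1]] += count
        if candidates:
            return candidates.most_common(top)
    return []
```

A caller would show a "no prediction" message when the returned list is empty. The linear scan over each table keeps the sketch short; indexing the tables by context would make lookups faster.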
The Application
The user types text in the input box and then clicks the Submit button.
The output panel shows the top three predictions and a bar chart of the frequency of each prediction.