Julia Breitenbruch
12th July, 2016
The main objective of this project is to build a Shiny app that suggests the next word to type after the user has entered a phrase, based on probabilities.
The dataset used to build the prediction algorithm comes from a corpus called HC Corpora (www.corpora.heliohost.org).
It contains data from three text files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
After creating a sample from the HC Corpora data, I cleaned it by converting words to lower case and removing punctuation, numbers and special characters.
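The cleaning step can be sketched in base R as follows. This is only an illustration: the helper name `clean_text` is hypothetical, and the original code may have used a text-mining package instead of plain regular expressions.

```r
# Hypothetical helper (not from the original code): lower-case the text and
# drop everything except the letters a-z and spaces.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)  # delete punctuation, digits and special characters
  x <- gsub(" +", " ", x)      # collapse runs of spaces left behind by the deletions
  trimws(x)                    # strip leading and trailing whitespace
}

clean_text("Hello, World! It's 2016 :-)")
# "hello world its"
```

Deleting characters rather than replacing them with spaces keeps contractions such as "it's" together as a single token ("its").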
The next step was to tokenize the text corpus into so-called n-grams, namely bigrams, trigrams and quadgrams, using the RWeka library.
Definition (Wikipedia): In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
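To make the definition concrete, here is a minimal base-R sketch that splits a cleaned string into word n-grams. The project itself used the RWeka library for this step; the helper `make_ngrams` below is a hypothetical stand-in, not the RWeka API.

```r
# Hypothetical helper: return all contiguous word n-grams of a string.
make_ngrams <- function(text, n) {
  words <- strsplit(text, " +")[[1]]
  words <- words[words != ""]                  # drop empty tokens
  if (length(words) < n) return(character(0))  # text too short for this n
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("thanks for the follow", 2)
# "thanks for" "for the" "the follow"
make_ngrams("thanks for the follow", 4)
# "thanks for the follow"
```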
After generating frequency tables of the bigrams, trigrams and quadgrams (each arranged in descending order of frequency), I combined them into a single data set and split each n-gram into single words (w1, w2, w3, w4). For bigrams and trigrams, the missing “leading” words were set to “NA”.
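The combined table described above might look like the following base-R sketch. The helper name `ngram_table` and the toy input are assumptions for illustration; unlike the original, it also sorts all n-gram orders together in one pass.

```r
# Hypothetical helper: count n-grams, sort by descending frequency and split
# each one into right-aligned columns w1..w4 (shorter n-grams get leading NAs).
ngram_table <- function(ngrams) {
  counts <- sort(table(ngrams), decreasing = TRUE)
  parts <- strsplit(names(counts), " ")
  padded <- t(vapply(parts,
                     function(p) c(rep(NA_character_, 4 - length(p)), p),
                     character(4)))
  data.frame(w1 = padded[, 1], w2 = padded[, 2],
             w3 = padded[, 3], w4 = padded[, 4],
             freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy input: three copies of a bigram, two of a trigram, one quadgram.
tab <- ngram_table(c("of the", "of the", "one of the",
                     "of the", "one of the", "is one of the"))
tab
```

With this layout, w4 is always the word to predict and w1 to w3 hold whatever context the n-gram provides.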
How the algorithm starts: If the user's input phrase has more than three words, only the last three words are taken into account. If the input consists of only one or two words, it is padded with two or one “leading” “NA”, respectively.
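A minimal sketch of this input preparation, assuming the phrase is split on spaces (the helper name `last_three` is hypothetical):

```r
# Hypothetical helper: keep the last three words of the phrase and
# left-pad with NA when fewer than three words were entered.
last_three <- function(phrase) {
  words <- strsplit(trimws(tolower(phrase)), " +")[[1]]
  words <- words[words != ""]
  words <- tail(words, 3)
  c(rep(NA_character_, 3 - length(words)), words)
}

last_three("I would like to go")
# "like" "to" "go"
last_three("thank you")
# NA "thank" "you"
```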
What the algorithm does: It searches for matches. If the input consists of three or more words, it starts with the quadgrams, checks whether their first three words match the input and, if so, returns the fourth word (ranked by frequency). If there is no match, it backs off to the trigrams (i.e. checks whether w2 and w3 match). If there is still no match, it backs off to the bigrams, and if that fails as well, it returns the most frequent unigrams as the next-word prediction. The same applies to inputs of fewer than three words.
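The back-off search can be sketched as below. Everything here is illustrative: the toy table, its frequencies, the fallback list `top_unigrams` and the function `predict_next` are assumptions, and this sketch returns suggestions only from the first n-gram order that matches, which may differ from the app's actual behaviour.

```r
# Toy n-gram table in the w1..w4 layout (made-up frequencies).
ngrams <- data.frame(
  w1 = c("thanks", NA, NA, NA),
  w2 = c("for", "for", NA, NA),
  w3 = c("the", "the", "the", "of"),
  w4 = c("follow", "first", "same", "the"),
  freq = c(5L, 9L, 20L, 30L),
  stringsAsFactors = FALSE
)
top_unigrams <- c("the", "to", "and", "a")  # assumed fallback list

# Hypothetical back-off lookup. `input` is c(w1, w2, w3), NA-padded on the left.
predict_next <- function(input, table, fallback, k = 4) {
  cols <- c("w1", "w2", "w3")
  for (skip in 0:2) {            # 0 = quadgrams, 1 = trigrams, 2 = bigrams
    ctx <- input[(skip + 1):3]
    if (anyNA(ctx)) next         # input too short for this n-gram order
    hit <- rep(TRUE, nrow(table))
    for (j in 1:3) {
      col <- table[[cols[j]]]
      # Skipped leading columns must be NA; the rest must equal the input word.
      matches <- if (j <= skip) is.na(col) else (!is.na(col) & col == input[j])
      hit <- hit & matches
    }
    if (any(hit)) {
      rows <- table[hit, ]
      rows <- rows[order(rows$freq, decreasing = TRUE), ]
      return(head(rows$w4, k))   # up to k suggestions, most frequent first
    }
  }
  head(fallback, k)              # no n-gram matched: most frequent unigrams
}

predict_next(c("thanks", "for", "the"), ngrams, top_unigrams)   # quadgram match
predict_next(c("waiting", "for", "the"), ngrams, top_unigrams)  # backs off to trigram
predict_next(c(NA, NA, "of"), ngrams, top_unigrams)             # bigram only
predict_next(c(NA, NA, "zebra"), ngrams, top_unigrams)          # unigram fallback
```

Requiring the skipped leading columns to be NA ensures that, for example, a trigram lookup only matches rows that were built from trigrams.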
The app is hosted on https://jbreiten73.shinyapps.io/capstone
The app is easy to use and largely self-explanatory. Note, however, that the prediction is not finished until the phrase you have entered appears completely in the “What you have entered” field!
The app gives up to four suggestions for the next word.
The code for the application can be found in my GitHub repository https://github.com/jbreiten73/DataScienceCapstone