Coursera Data Science Capstone Project

Julia Breitenbruch
12th July, 2016

Synopsis

The main objective of this project is to build a Shiny app that suggests the next word to type after the user has entered a phrase, based on probabilities.

The underlying dataset used to create the prediction algorithm comes from a corpus called HC Corpora (www.corpora.heliohost.org).

It contains data from three text files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.

Prediction Algorithm

  • After creating a sample from the HC Corpora data, I cleaned it by converting words to lower case and removing punctuation, numbers and special characters.

  • The next step was to tokenize the text corpus into so-called n-grams, namely bigrams, trigrams and quadgrams, using the RWeka library.

  • Definition (Wikipedia): In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
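
The cleaning and tokenization steps above can be sketched as follows. This is a minimal Python illustration, not the RWeka-based code used for the project; the function names `clean` and `ngrams` and the sample sentence are assumptions made for the example, and the regular expression is only one plausible reading of "removing punctuation, numbers and special characters".

```python
import re
from collections import Counter

def clean(text):
    """Lower-case the text, drop everything except the letters a-z, and split into words."""
    text = text.lower()
    # Replace runs of punctuation, digits and special characters with a space.
    text = re.sub(r"[^a-z\s]+", " ", text)
    return text.split()

def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text standing in for the HC Corpora sample.
sample = "The cat sat on the mat. The cat sat down, 2 times!"
tokens = clean(sample)

# Frequency tables for bigrams, trigrams and quadgrams.
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))
quadgram_counts = Counter(ngrams(tokens, 4))

# The most frequent bigrams, in descending order of frequency.
print(bigram_counts.most_common(3))
```

Note that this simple version lets n-grams run across sentence boundaries (for example "mat the"); whether the original implementation does the same is not stated in the slides.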

Prediction Algorithm (continued)

  • After generating frequency tables of the bigrams, trigrams and quadgrams (each arranged in descending order of frequency), I combined them into a single data set and split each n-gram into single words (w1, w2, w3, w4). For bigrams and trigrams, the “leading” words were set to “NA”.

  • How the algorithm starts: If the user's input phrase exceeds three words, only the last three words are taken into account. If the input consists of only one or two words, it is padded with two or one “leading” “NA” values, respectively.

  • What the algorithm does: It searches for matches. If the input string consists of three or more words, it starts with the quadgrams, checking whether the first three words (w1, w2, w3) match the input and, if so, returning the fourth word, chosen by frequency. If there is no match, it backs off to trigrams (i.e. it checks whether w2 and w3 match); if there is still no match, it backs off to bigrams; and if there is still no match, it returns the most frequent unigrams as the next-word prediction. The same procedure applies to input strings of fewer than three words.
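
The back-off search described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the app's actual R code: instead of the single “NA”-padded data set, it keeps one lookup table per n-gram order, which yields the same longest-match-first behaviour. The names `build_tables` and `predict` and the toy token list are hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=4):
    """Return {n: {context: Counter of next words}} for n = 1..max_n.

    The context is the first n-1 words of each n-gram; for unigrams it is ().
    """
    tables = {}
    for n in range(1, max_n + 1):
        table = defaultdict(Counter)
        for i in range(len(tokens) - n + 1):
            *context, next_word = tokens[i:i + n]
            table[tuple(context)][next_word] += 1
        tables[n] = table
    return tables

def predict(phrase, tables, max_n=4):
    """Suggest the next word, backing off from quadgrams down to unigrams."""
    words = phrase.lower().split()
    context_len = max_n - 1
    # Only the last three words (for max_n=4) are taken into account.
    words = words[-context_len:] if context_len else []
    # Try the longest available context first, then drop the leading word.
    for k in range(len(words), -1, -1):
        context = tuple(words[len(words) - k:])
        followers = tables[k + 1].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # only reached if the tables are empty

# Hypothetical toy corpus, already cleaned and tokenized.
tokens = "the cat sat on the mat the cat sat down".split()
tables = build_tables(tokens)

print(predict("dog sat on the", tables))  # quadgram match on "sat on the"
print(predict("a big cat", tables))       # backs off to the bigram context "cat"
print(predict("zebra", tables))           # backs off to the most frequent unigram
```

With this toy corpus the three calls return "mat", "sat" and "the" respectively. A real app would probably return several candidates (for example `most_common(3)`) rather than a single word.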

About the Application