Milestone Report

In this report I lay out the first steps taken towards creating a predictive algorithm that determines the next word following a sequence of given words. As a first step we explore the data sets provided by the Coursera Team, which will be the basic input for the algorithm. For this exploratory analysis I make use of the quanteda package in R, which allows large texts to be handled efficiently.

Exploratory analysis

The data can be found here and includes datasets in three languages: French, German and English. However, we will address only the English-language data. The English data directory contains three text files made up of sentences taken from blogs, news and Twitter. The details of the files, including their length in sentences and their token counts, can be seen in the table below.
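As a rough sketch, the counts in that table could be produced with quanteda roughly as follows; the file paths and the way the files are read are assumptions about the local setup, not the exact code used for the table.

library(quanteda)

# Assumed locations of the three English files in the Coursera download.
files <- c(blogs   = "final/en_US/en_US.blogs.txt",
           news    = "final/en_US/en_US.news.txt",
           twitter = "final/en_US/en_US.twitter.txt")

# Read each file (one blog post / news snippet / tweet per line) into one document.
texts <- sapply(files, function(f)
  paste(readLines(f, encoding = "UTF-8", skipNul = TRUE), collapse = "\n"))

# Build a corpus with one document per source and summarise it.
corp <- corpus(texts)
summary(corp)          # types, tokens and sentences per document
ntoken(tokens(corp))   # raw token counts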

We can also have a look at the similarity between the three text documents. Generally they are quite similar: news and blogs are the most alike at about 96% similarity, news and Twitter are the least similar at only around 84%, and Twitter and blogs lie in between at about 93%. This is displayed in the dot chart below.
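A minimal sketch of how such pairwise similarities could be computed is shown below; the use of cosine similarity on a document-feature matrix is an assumption, since the report does not state which measure was used, and in recent quanteda versions textstat_simil() lives in the quanteda.textstats package.

library(quanteda)
library(quanteda.textstats)

# Document-feature matrix over the three source documents built above.
dfmat <- dfm(tokens(corp, remove_punct = TRUE))

# Pairwise document similarities (cosine shown here as one common choice).
sim <- textstat_simil(dfmat, margin = "documents", method = "cosine")
as.matrix(sim)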

We can also inspect the most frequent words in the data sets, as displayed in the table below and in a word cloud diagram. As expected, it is mostly stopwords such as “the” and “of” that are most frequent across all datasets.

##     the      to     and       a      of       i      in     for      is 
## 4765870 2754522 2414928 2382313 2005499 1653831 1646113 1099443 1075200 
##    that 
## 1041886
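The frequency table above corresponds to what topfeatures() reports on the document-feature matrix, and the word cloud can be drawn with textplot_wordcloud(), which in recent quanteda versions is provided by the quanteda.textplots package; the exact plotting options are assumptions.

library(quanteda.textplots)

# Ten most frequent features across the three documents combined.
topfeatures(dfmat, n = 10)

# Word cloud of the most frequent terms.
textplot_wordcloud(dfmat, max_words = 100)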

Sampling and Cleaning the data

For any further analysis we will take a sample of around 5% of each data set and clean it. The steps taken are:

In this process the data is also tokenized and combined into one dataset. This forms the basis for constructing the n-grams. I build bigrams, trigrams and quadgrams; below we can observe a chart of the most frequent ones.
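The sampling, tokenization and n-gram construction can be sketched with quanteda as follows, reusing the files vector from above; the 5% sampling rate matches the text, but the specific cleaning options (lower-casing, removing punctuation, numbers, symbols and URLs) are assumptions rather than the exact steps taken here.

set.seed(1234)

# Keep roughly 5% of the lines from each source.
sample_lines <- function(x, p = 0.05) x[as.logical(rbinom(length(x), 1, p))]
lines_all <- unlist(lapply(files, function(f)
  sample_lines(readLines(f, encoding = "UTF-8", skipNul = TRUE))))

# Tokenize the combined sample; the cleaning options here are assumed.
toks <- tokens_tolower(tokens(lines_all,
                              remove_punct   = TRUE,
                              remove_numbers = TRUE,
                              remove_symbols = TRUE,
                              remove_url     = TRUE))

# Build bigrams, trigrams and quadgrams and inspect the most frequent ones.
ngrams <- lapply(2:4, function(n) tokens_ngrams(toks, n = n, concatenator = " "))
lapply(ngrams, function(ng) topfeatures(dfm(ng), n = 5))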

The trigrams and bigrams each have distinct leaders, namely “one of the”, “a lot of” and “thanks for the” among the trigrams, and “of the” and “in the” among the bigrams. These are substantially more frequent than the next most frequent tri- and bigrams.

There are no clear leaders among the quadgrams as there are for the trigrams and bigrams. The most frequent are “the end of the”, “thanks for the follow” (which probably comes from the Twitter data set) and “for the first time”; however, they are about as frequent as “at the end of” and “the rest of the”.

Observations

In the sample analysed above it is clear that stopwords are by far the most common and are necessary for this prediction task. The sample also appears to be heavily influenced by the Twitter data set, as “thanks for the” and “thanks for the follow”, which probably stem from the same sentences, occur as frequent trigrams and quadgrams. It therefore seems that the blog data would probably be a better choice as a basis for the prediction. The news data might be affected by a large number of individual names but could also be a good basis.

Plan

The next steps are:

  • use the n-grams to estimate probabilities and investigate backoff models for unseen n-grams (see the sketch after this list)

  • use the n-grams and probabilities as the basis for a prediction function

  • write a prediction function that stores its data efficiently to minimize calculation time

  • analyze the accuracy of the prediction function
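To make the backoff idea in the first bullet concrete, below is a very rough sketch of a greedy backoff lookup over the n-gram count tables; the table format, the function name predict_next and the restriction to a three-word context are assumptions, and a full stupid-backoff or Katz model would additionally weight candidates coming from shorter contexts rather than always stopping at the longest matching one.

# Assumed data structure: a list of named count vectors, one per n-gram order,
# with space-separated n-grams as names, e.g. ngram_tables[["3"]]["one of the"].
predict_next <- function(prefix, ngram_tables, unigram_counts) {
  words <- tail(strsplit(tolower(prefix), "\\s+")[[1]], 3)
  for (k in rev(seq_len(length(words)))) {        # longest available context first
    context <- paste(tail(words, k), collapse = " ")
    counts  <- ngram_tables[[as.character(k + 1)]]
    hits    <- counts[startsWith(names(counts), paste0(context, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]        # most frequent matching n-gram
      return(sub(".*\\s", "", best))              # return its final word
    }
    # context unseen at this order: back off to a shorter context
  }
  names(unigram_counts)[which.max(unigram_counts)]  # fall back to the top unigram
}

For example, predict_next("thanks for the", ngram_tables, uni_counts) should return "follow" if that quadgram dominates the sample as it does above.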