Executive Summary

This report has two main aims:

  1. To demonstrate some key features of the data and the approach taken to data cleansing.
  2. To outline the approach that will be taken to building the prediction algorithm.

Data Exploration

The data set consists of three files: news, Twitter and blogs.

The table below summarises each data source in its raw form.

                 newsTable   twitterTable   blogsTable
size (MB)            205.8          167.1        210.2
#lines           1,010,242      2,360,148      899,288
#words          34,726,303     30,093,369   37,546,249
#unique words      333,177        486,658      396,194
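
As a rough illustration, summary statistics like those above could be computed per file with base R along the following lines (the file name in the comment is an assumed example, not necessarily the path used for this report).

```r
# Sketch: summary statistics for one raw text file (file name assumed).
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(strsplit(lines, "\\s+"))
  data.frame(
    size_mb  = file.size(path) / 1024^2,   # size on disk in megabytes
    n_lines  = length(lines),
    n_words  = length(words),
    n_unique = length(unique(words))
  )
}

# summarise_file("en_US.news.txt")
```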

Since the files are large, using the entire data set would make analysis unwieldy. Therefore a sample (10% of each file) was constructed, as sketched below.
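
A minimal sketch of that sampling step is shown here; the vector names (newsLines, twitterLines, blogsLines) are assumptions for illustration.

```r
# Sketch of the 10% sampling step; the input vectors are assumed to hold the raw lines.
set.seed(1234)  # make the sample reproducible
sample_lines <- function(lines, frac = 0.1) {
  lines[sample(length(lines), size = floor(frac * length(lines)))]
}

newsSample    <- sample_lines(newsLines)
twitterSample <- sample_lines(twitterLines)
blogsSample   <- sample_lines(blogsLines)
```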

The three data sources were then combined and converted to a corpus to allow text manipulation. To standardise the text, a number of operations were performed; one possible pipeline is sketched below.
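
The sketch uses the tm package; the operations shown are typical choices rather than necessarily the exact set applied here (stopword removal in particular is revisited under Next Steps).

```r
# Sketch of a tm cleaning pipeline; the transformations shown are assumptions.
library(tm)

corpus <- VCorpus(VectorSource(c(newsSample, twitterSample, blogsSample)))
corpus <- tm_map(corpus, content_transformer(tolower))      # lower-case everything
corpus <- tm_map(corpus, removePunctuation)                 # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                     # strip digits
corpus <- tm_map(corpus, removeWords, stopwords("english")) # drop English stopwords
corpus <- tm_map(corpus, stripWhitespace)                   # collapse extra spaces
```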

The resulting corpus contained 44,081 words in total, 24,801 of them unique.

Tokenisation

The corpus was then tokenised using the RWeka package. The top 10 n-grams for n = 1 to 4 were calculated and are graphed below.
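
One way the n-gram counts could be produced with RWeka and tm is sketched below; the helper function and its name are illustrative assumptions rather than the exact code used.

```r
# Sketch: count the most frequent n-grams in the corpus with RWeka.
library(RWeka)
library(tm)

ngram_freq <- function(corpus, n, top = 10) {
  tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
  freq <- sort(slam::row_sums(tdm), decreasing = TRUE)  # total count per n-gram
  head(freq, top)
}

top_unigrams  <- ngram_freq(corpus, 1)
top_bigrams   <- ngram_freq(corpus, 2)
top_trigrams  <- ngram_freq(corpus, 3)
top_quadgrams <- ngram_freq(corpus, 4)
```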

Next Steps

Before moving on to the prediction algorithm I would like to explore further whether removing stopwords is the right approach. I would also like to investigate whether stemming (reducing words to their roots) can help optimise the corpus for prediction.

For the prediction algorithm I intend to create a library of n-grams from the data. This library will be used to find the three most likely candidates for the next word. To do this I will use n-grams up to n = 4 (quadgrams): the higher the n, the greater the potential predictive power of the algorithm, but both the size and the speed of the model are adversely affected.
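
A minimal sketch of the intended lookup is given below, assuming the n-gram library is stored as a data frame with prefix, word and count columns; that structure, and the function name, are assumptions rather than decisions already made.

```r
# Sketch: return the k most frequent completions for the longest matching prefix,
# backing off to shorter prefixes when no match is found.
predict_next <- function(ngrams, last_words, k = 3) {
  for (len in seq(min(3, length(last_words)), 1)) {
    prefix <- paste(tail(last_words, len), collapse = " ")
    hits <- ngrams[ngrams$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(hits$count, decreasing = TRUE), ]
      return(head(hits$word, k))
    }
  }
  character(0)  # no candidates found
}
```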

Once the algorithm is working I will build a Shiny application to run it in and optimise its performance within the app.