This report has two main aims:
The data set comprised three files:
The table below summarises each data source in its raw form.
| Metric | newsTable | twitterTable | blogsTable |
|---|---|---|---|
| size (MB) | 205.8 | 167.1 | 210.2 |
| #lines | 1,010,242 | 2,360,148 | 899,288 |
| #words | 34,726,303 | 30,093,369 | 37,546,249 |
| #unique words | 333,177 | 486,658 | 396,194 |
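A sketch of how these summary figures could be computed for a single file is shown below; the file name and the use of the stringi package are assumptions rather than the method actually used.

```r
library(stringi)

# Compute size, line, word, and unique-word counts for one file
# (the file name is a placeholder)
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(stri_extract_all_words(lines))
  data.frame(
    size_mb  = round(file.size(path) / 1e6, 1),
    n_lines  = length(lines),
    n_words  = length(words),
    n_unique = length(unique(tolower(words)))
  )
}

summarise_file("en_US.news.txt")
```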
Since the files are large, using the entire data set would make analysis unwieldy, so a sample consisting of 10% of each file was constructed.
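The sampling step could be performed along the following lines; the file names, the random seed, and the use of rbinom to flag lines are illustrative assumptions, not necessarily the approach used here.

```r
set.seed(1234)  # fixed seed so the sample is reproducible

# Keep roughly 10% of the lines in a file (file paths are placeholders)
sample_lines <- function(path, fraction = 0.10) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), size = 1, prob = fraction) == 1]
}

blogs_sample   <- sample_lines("en_US.blogs.txt")
news_sample    <- sample_lines("en_US.news.txt")
twitter_sample <- sample_lines("en_US.twitter.txt")
```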
The three data sources were then combined and converted to a corpus to allow text manipulation. In order to standardise the text, a number of operations were performed:
The resulting corpus contained 44,081 words in total, of which 24,801 were unique.
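As an illustration of the kind of standardisation pipeline described above, the sketch below builds a corpus with the tm package from the sampled vectors in the earlier sketch; the specific transformations shown (lower-casing and removal of punctuation, numbers, and extra whitespace) are assumptions and may not match the operations actually applied.

```r
library(tm)

# Combine the three samples and build a corpus, then apply common
# standardisation steps (the exact steps used in this report may differ)
combined <- c(blogs_sample, news_sample, twitter_sample)
corpus   <- VCorpus(VectorSource(combined))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
```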
The corpus was then tokenised using the RWeka package. The ten most frequent n-grams (for n = 1 to 4) were calculated and are graphed below.
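A minimal sketch of the RWeka tokenisation step for bigrams is shown below; the same pattern extends to n = 1 to 4 by changing the Weka_control bounds. The tokenizer name and the use of a TermDocumentMatrix are assumptions about the implementation.

```r
library(tm)
library(RWeka)

# Tokenise the corpus into bigrams; adjust min/max for other values of n
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)  # the ten most frequent bigrams
```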
Before moving on to the prediction algorithm I would like to explore further whether removing stopwords is the right approach. I would also like to investigate whether stemming (reducing words to their roots) can help optimise the corpus for prediction.
For the prediction algorithm I intend to create a library of n-grams from the data. This library will be used to find the three most likely candidates for the next word. To do this I will use n-grams up to quadgrams: the higher the n, the greater the potential predictive power of the algorithm, but both the size of the model and the speed of prediction are adversely affected.
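A rough sketch of how such a lookup might work is given below, assuming a hypothetical pre-computed frequency table `ngram_freq` with prefix, word, and count columns; this simple backoff loop is illustrative only and not the final algorithm.

```r
# Return up to `top_n` candidate next words for a phrase, backing off from the
# longest prefix (three words, for quadgrams) to shorter ones if no match.
# `ngram_freq` is a hypothetical data frame with columns prefix, word, count.
predict_next <- function(phrase, ngram_freq, top_n = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  for (n in min(3, length(words)):1) {
    prefix  <- paste(tail(words, n), collapse = " ")
    matches <- ngram_freq[ngram_freq$prefix == prefix, ]
    if (nrow(matches) > 0) {
      matches <- matches[order(-matches$count), ]
      return(head(matches$word, top_n))
    }
  }
  character(0)  # no candidates found
}
```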
Once the algorithm is working I will build a Shiny application to host it and optimise its performance within the app.
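A minimal skeleton of what such a Shiny app could look like is sketched below; it assumes the hypothetical `predict_next()` helper and `ngram_freq` table from the previous sketch and is not the final application.

```r
library(shiny)

# Minimal UI: a text box for the phrase and a panel showing the suggestions
ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderPrint({
    req(input$phrase)
    predict_next(input$phrase, ngram_freq)  # helper and table from the sketch above
  })
}

shinyApp(ui = ui, server = server)
```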