This report has two main aims:
The data set comprised three files:
The table below summarises each data source in its raw form.
| Metric | newsTable | twitterTable | blogsTable |
|---|---|---|---|
| size (MB) | 205.8 | 167.1 | 210.2 |
| #lines | 1,010,242 | 2,360,148 | 899,288 |
| #words | 34,726,303 | 30,093,369 | 37,546,249 |
| #unique words | 333,177 | 486,658 | 396,194 |
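A sketch of how these summary figures could be computed for a single file is shown below; the file name and the use of the stringi package are assumptions rather than the method actually used.

```r
library(stringi)

# Compute size, line, word, and unique-word counts for one file
# (the file name is a placeholder)
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(stri_extract_all_words(lines))
  data.frame(
    size_mb  = round(file.size(path) / 1e6, 1),
    n_lines  = length(lines),
    n_words  = length(words),
    n_unique = length(unique(tolower(words)))
  )
}

summarise_file("en_US.news.txt")
```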
Since the files are large, using the entire data set would make analysis unwieldy, so a sample consisting of 10% of each file was constructed.
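The sampling step could be performed along the following lines; the file names, the random seed, and the use of rbinom to flag lines are illustrative assumptions, not necessarily the approach used here.

```r
set.seed(1234)  # fixed seed so the sample is reproducible

# Keep roughly 10% of the lines in a file (file paths are placeholders)
sample_lines <- function(path, fraction = 0.10) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), size = 1, prob = fraction) == 1]
}

blogs_sample   <- sample_lines("en_US.blogs.txt")
news_sample    <- sample_lines("en_US.news.txt")
twitter_sample <- sample_lines("en_US.twitter.txt")
```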
The three data sources were then combined and converted to a corpus to allow text manipulation. In order to standardise the text, a number of operations were performed:
The resulting corpus contained 44,081 words in total, of which 24,801 were unique.
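As an illustration of the kind of standardisation pipeline described above, the sketch below builds a corpus with the tm package from the sampled vectors in the earlier sketch; the specific transformations shown (lower-casing and removal of punctuation, numbers, and extra whitespace) are assumptions and may not match the operations actually applied.

```r
library(tm)

# Combine the three samples and build a corpus, then apply common
# standardisation steps (the exact steps used in this report may differ)
combined <- c(blogs_sample, news_sample, twitter_sample)
corpus   <- VCorpus(VectorSource(combined))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
```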
The corpus was then tokenised using the RWeka package. The ten most frequent n-grams (for n = 1 to 4) were calculated and are graphed below.
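A minimal sketch of the RWeka tokenisation step for bigrams is shown below; the same pattern extends to n = 1 to 4 by changing the Weka_control bounds. The tokenizer name and the use of a TermDocumentMatrix are assumptions about the implementation.

```r
library(tm)
library(RWeka)

# Tokenise the corpus into bigrams; adjust min/max for other values of n
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(freq, 10)  # the ten most frequent bigrams
```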
Before moving on to the prediction algorithm I would like to explore further whether removing stopwords is the right approach. I would also like to investigate whether stemming (reducing words to their roots) can help optimise the corpus for prediction.
For the prediction algorithm I intend to create a library of n-grams from the data. This library will be used to find the three most likely candidates for the next word. To do this I will use n-grams up to quadgrams: the higher the n, the greater the potential predictive power of the algorithm, but both the size of the model and the speed of prediction are adversely affected.
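A rough sketch of how such a lookup might work is given below, assuming a hypothetical pre-computed frequency table `ngram_freq` with prefix, word, and count columns; this simple backoff loop is illustrative only and not the final algorithm.

```r
# Return up to `top_n` candidate next words for a phrase, backing off from the
# longest prefix (three words, for quadgrams) to shorter ones if no match.
# `ngram_freq` is a hypothetical data frame with columns prefix, word, count.
predict_next <- function(phrase, ngram_freq, top_n = 3) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  for (n in min(3, length(words)):1) {
    prefix  <- paste(tail(words, n), collapse = " ")
    matches <- ngram_freq[ngram_freq$prefix == prefix, ]
    if (nrow(matches) > 0) {
      matches <- matches[order(-matches$count), ]
      return(head(matches$word, top_n))
    }
  }
  character(0)  # no candidates found
}
```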
Once the algorithm is working I will build a Shiny application to host it and optimise its performance within the app.
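A minimal skeleton of what such a Shiny app could look like is sketched below; it assumes the hypothetical `predict_next()` helper and `ngram_freq` table from the previous sketch and is not the final application.

```r
library(shiny)

# Minimal UI: a text box for the phrase and a panel showing the suggestions
ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderPrint({
    req(input$phrase)
    predict_next(input$phrase, ngram_freq)  # helper and table from the sketch above
  })
}

shinyApp(ui = ui, server = server)
```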