The final goal is to build a predictive text algorithm which sits behind a user interface (a Shiny app), takes user input and attempts to predict the next word to be typed by the user.
This milestone report presents preliminary analysis findings and our first summaries of, and interactions with, the data sets provided, which will later be used to train and shape the predictive algorithm.
As a starting point, we have been supplied with three text files; the respective types of text they contain are listed below, along with links to where they can be downloaded.
The rest of this document details the exploratory analysis we have undertaken and the results produced so far; we present the findings in both tabular and graphical format to aid an initial exploratory understanding of the data sets supplied.
Below we present a table detailing the total number of lines, words and characters found in each data set, together with simple means for words per line, characters per line and characters per word; a sketch of how these figures might be computed follows the table.
| Datasource | Number of Words | Number of Lines | Number of Characters | Mean Words per Line | Mean Characters per Line | Mean Characters per Word |
|---|---|---|---|---|---|---|
| Twitter | 30373543 | 2360148 | 162384825 | 12.86934 | 68.80281 | 5.346259 |
| News | 2643969 | 77259 | 15683768 | 34.22215 | 203.00247 | 5.931903 |
| Blogs | 37334131 | 899288 | 208361438 | 41.51521 | 231.69601 | 5.580991 |
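As an illustrative sketch only (the file names below, e.g. `en_US.twitter.txt`, and the simple whitespace tokenisation are assumptions rather than a record of the exact processing used), the figures above could be computed along these lines in R:

```r
# Sketch: line, word and character counts per data source.
# File names are assumed; adjust the paths to the local copies of the data.
files <- c(Twitter = "en_US.twitter.txt",
           News    = "en_US.news.txt",
           Blogs   = "en_US.blogs.txt")

count_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words_per_line <- sapply(strsplit(lines, "\\s+"), length)  # crude whitespace tokeniser
  chars_per_line <- nchar(lines)
  data.frame(Words            = sum(words_per_line),
             Lines            = length(lines),
             Characters       = sum(chars_per_line),
             MeanWordsPerLine = mean(words_per_line),
             MeanCharsPerLine = mean(chars_per_line),
             MeanCharsPerWord = sum(chars_per_line) / sum(words_per_line))
}

do.call(rbind, lapply(files, count_stats))
```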
Below we give visual plots of the tabular data presented above.
The first plots the mean characters per word for each of the data sets.
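For illustration, such a plot could be drawn with base R from the values in the table above (a sketch, not necessarily the plotting code actually used):

```r
# Mean characters per word, taken from the table above (rounded)
mean_chars_per_word <- c(Twitter = 5.35, News = 5.93, Blogs = 5.58)
barplot(mean_chars_per_word,
        main = "Mean Characters per Word by Data Source",
        ylab = "Mean characters per word",
        col  = "steelblue")
```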
Below we present summary statistics for the per-line word and character counts in each of the data sets provided (blogs, news and Twitter respectively).
This gives us a good feel for how the values for both words and characters are distributed in each data set; a sketch of how these summaries can be reproduced follows the tables.
| Blogs - Summary of Words per Line | Blogs - Summary of Characters per Line |
|---|---|
| Min. : 1.00 | Min. : 1.0 |
| 1st Qu.: 9.00 | 1st Qu.: 47.0 |
| Median : 28.00 | Median : 157.0 |
| Mean : 41.52 | Mean : 231.7 |
| 3rd Qu.: 59.00 | 3rd Qu.: 331.0 |
| Max. :6630.00 | Max. :40835.0 |
| News - Summary of Words per Line | News - Summary of Characters per Line |
|---|---|
| Min. : 1.00 | Min. : 2 |
| 1st Qu.: 19.00 | 1st Qu.: 111 |
| Median : 31.00 | Median : 186 |
| Mean : 34.22 | Mean : 203 |
| 3rd Qu.: 45.00 | 3rd Qu.: 270 |
| Max. :1031.00 | Max. :5760 |
| Twitter - Summary of Words per Line | Twitter - Summary of Characters per Line |
|---|---|
| Min. : 1.00 | Min. : 2.0 |
| 1st Qu.: 7.00 | 1st Qu.: 37.0 |
| Median :12.00 | Median : 64.0 |
| Mean :12.87 | Mean : 68.8 |
| 3rd Qu.:18.00 | 3rd Qu.:100.0 |
| Max. :47.00 | Max. :213.0 |
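The tables above follow the format of R's `summary()` output; a minimal sketch of how they could be reproduced for one data set (reusing the assumed file name from the earlier sketch):

```r
# Per-line summaries for the blogs file (path is an assumption)
blog_lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
summary(sapply(strsplit(blog_lines, "\\s+"), length))  # words per line
summary(nchar(blog_lines))                              # characters per line
```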
It is a little hard to compare the dispersion of the three data sets from the tables alone, so below we display boxplots of the per-line word and character counts, which give a good visual for comparing the tabular data above across the data sets.
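A sketch of how such boxplots might be produced in base R (file names are again assumptions; a log scale is used here to tame the long upper tails):

```r
# Read each source and compute words per line (file names are assumptions)
twitter_lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news_lines    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
blog_lines    <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)

word_counts <- list(
  Twitter = sapply(strsplit(twitter_lines, "\\s+"), length),
  News    = sapply(strsplit(news_lines,    "\\s+"), length),
  Blogs   = sapply(strsplit(blog_lines,    "\\s+"), length)
)

# Side-by-side boxplots of words per line, one box per data source
boxplot(word_counts, log = "y",
        main = "Words per Line by Data Source",
        ylab = "Words per line (log scale)")
```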
Aggregate / Totals
From a totals perspective, the Twitter data set has the most lines of data (roughly 2.36 million), the blogs data set sits in the middle with roughly 900 thousand, and the news data set has significantly fewer (roughly 77 thousand).
The blogs data set had the most words per line (record), followed by the news and then the Twitter data sets.
Means of Word and Character counts.
As expected, the Twitter data set had the shortest lines in terms of both words and characters per line, since Twitter imposes a character limit on posts (historically 140 characters, later raised to 280).
The blog and news data sets had some records which were very long (the blogs data set had a maximum word count of 6630 in a single line); some of these outliers should be inspected further to see what they are.
All three data sets had similar mean characters per word (from roughly 5.35 for Twitter through to 5.93 for News).