The final goal is to build a predictive text algorithm which sits behind a user interface (a Shiny app), takes user input and attempts to predict the next word to be typed by the user.
This milestone report presents preliminary analysis findings and our first summaries of, and interactions with, the data sets provided, which will later be used to train and shape the predictive algorithm.
As a starting point, we have been supplied with three text files; the respective types of text they contain are listed below, along with links to where they can be downloaded.
The rest of this document details the exploratory analysis we have undertaken and the results produced so far; we present the findings in both tabular and graphical format to aid an initial exploratory understanding of the data sets supplied.
Below we present a table detailing the total number of lines, words and characters found in each data set, together with simple means for words per line, characters per line and characters per word; a sketch of how these figures might be computed follows the table.
| Datasource | Number of Words | Number of Lines | Number of Characters | Mean Words per Line | Mean Characters per Line | Mean Characters per Word |
|---|---|---|---|---|---|---|
| Twitter | 30373543 | 2360148 | 162384825 | 12.86934 | 68.80281 | 5.346259 |
| News | 2643969 | 77259 | 15683768 | 34.22215 | 203.00247 | 5.931903 |
| Blogs | 37334131 | 899288 | 208361438 | 41.51521 | 231.69601 | 5.580991 |
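As an illustrative sketch only (the file names below, e.g. `en_US.twitter.txt`, and the simple whitespace tokenisation are assumptions rather than a record of the exact processing used), the figures above could be computed along these lines in R:

```r
# Sketch: line, word and character counts per data source.
# File names are assumed; adjust the paths to the local copies of the data.
files <- c(Twitter = "en_US.twitter.txt",
           News    = "en_US.news.txt",
           Blogs   = "en_US.blogs.txt")

count_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words_per_line <- sapply(strsplit(lines, "\\s+"), length)  # crude whitespace tokeniser
  chars_per_line <- nchar(lines)
  data.frame(Words            = sum(words_per_line),
             Lines            = length(lines),
             Characters       = sum(chars_per_line),
             MeanWordsPerLine = mean(words_per_line),
             MeanCharsPerLine = mean(chars_per_line),
             MeanCharsPerWord = sum(chars_per_line) / sum(words_per_line))
}

do.call(rbind, lapply(files, count_stats))
```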
Below we give visual plots of the tabular data presented above.
The first plots the mean characters per word for each of the data sets.
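For illustration, such a plot could be drawn with base R from the values in the table above (a sketch, not necessarily the plotting code actually used):

```r
# Mean characters per word, taken from the table above (rounded)
mean_chars_per_word <- c(Twitter = 5.35, News = 5.93, Blogs = 5.58)
barplot(mean_chars_per_word,
        main = "Mean Characters per Word by Data Source",
        ylab = "Mean characters per word",
        col  = "steelblue")
```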
Below we present summary statistics for the per-line word and character counts in each of the data sets provided (blogs, news and Twitter respectively).
This gives us a good feel for how the values for both words and characters are distributed in each data set; a sketch of how these summaries can be reproduced follows the tables.
| Blogs - Summary of Words per Line | Blogs - Summary of Characters per Line |
|---|---|
| Min. : 1.00 | Min. : 1.0 |
| 1st Qu.: 9.00 | 1st Qu.: 47.0 |
| Median : 28.00 | Median : 157.0 |
| Mean : 41.52 | Mean : 231.7 |
| 3rd Qu.: 59.00 | 3rd Qu.: 331.0 |
| Max. :6630.00 | Max. :40835.0 |
| News - Summary of Words per Line | News - Summary of Characters per Line |
|---|---|
| Min. : 1.00 | Min. : 2 |
| 1st Qu.: 19.00 | 1st Qu.: 111 |
| Median : 31.00 | Median : 186 |
| Mean : 34.22 | Mean : 203 |
| 3rd Qu.: 45.00 | 3rd Qu.: 270 |
| Max. :1031.00 | Max. :5760 |
| Twitter - Summary of Words per Line | Twitter - Summary of Characters per Line |
|---|---|
| Min. : 1.00 | Min. : 2.0 |
| 1st Qu.: 7.00 | 1st Qu.: 37.0 |
| Median :12.00 | Median : 64.0 |
| Mean :12.87 | Mean : 68.8 |
| 3rd Qu.:18.00 | 3rd Qu.:100.0 |
| Max. :47.00 | Max. :213.0 |
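The tables above follow the format of R's `summary()` output; a minimal sketch of how they could be reproduced for one data set (reusing the assumed file name from the earlier sketch):

```r
# Per-line summaries for the blogs file (path is an assumption)
blog_lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
summary(sapply(strsplit(blog_lines, "\\s+"), length))  # words per line
summary(nchar(blog_lines))                              # characters per line
```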
It is a little hard to compare the dispersion of the three data sets from the tables alone, so below we display boxplots of the per-line word and character counts, which give a good visual for comparing the tabular data above across the data sets.
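A sketch of how such boxplots might be produced in base R (file names are again assumptions; a log scale is used here to tame the long upper tails):

```r
# Read each source and compute words per line (file names are assumptions)
twitter_lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news_lines    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
blog_lines    <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)

word_counts <- list(
  Twitter = sapply(strsplit(twitter_lines, "\\s+"), length),
  News    = sapply(strsplit(news_lines,    "\\s+"), length),
  Blogs   = sapply(strsplit(blog_lines,    "\\s+"), length)
)

# Side-by-side boxplots of words per line, one box per data source
boxplot(word_counts, log = "y",
        main = "Words per Line by Data Source",
        ylab = "Words per line (log scale)")
```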
Aggregate / Totals
From a totals perspective, the Twitter data set has the most lines of data (roughly 2.36 million), the blogs data set sits in the middle with roughly 900 thousand, and the news data set has significantly fewer (roughly 77 thousand).
The blogs data set had the most words per line (record), followed by the news and then the Twitter data sets.
Means of Word and Character counts.
As expected, the Twitter data set had the shortest lines in terms of both words and characters per line, since Twitter imposes a character limit on posts (historically 140 characters, later raised to 280).
The blog and news data sets had some records which were very long (the blogs data set had a maximum word count of 6630 in a single line); some of these outliers should be inspected further to see what they are.
All three data sets had similar mean characters per word (from roughly 5.35 for Twitter through to 5.93 for News).