The objective of this report is to perform exploratory analysis on the Project Dataset.
The project dataset consists of 3 files (blogs, twitter and news) per language, containing text obtained from each source in 4 languages (English, German, Russian and Finnish). The words and phrases will be used to design and train a predictive texting model in the given language. For the purposes of this report and the course project, only the English files are used.
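There are 3 txt files (blogs, news and twitter) with reported sizes of 120, 120 and 136 respectively (the listing gives no unit, and these values are too small to be byte counts for files holding millions of words). The word counts are 44m, 3m and 37m, totalling 84m. The line counts are 899k, 77k and 2.3m. That works out to roughly 49 words per line for blogs and 41 for news, against about 16 for twitter, so blogs and news have much longer lines than twitter.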
After removing non-ASCII content, 83m tokens remain from the 3 files (about 0.9m were removed), of which 11.5m are empty white-space tokens. This leaves 71.5m words, numbers and alphanumeric combinations.
Of the 71.5m words/combinations, the 11 most frequently used words make up just under 20% of usage. The top 150 words cover 50%, whereas the top 100000 cover 90%.
We will now use this dataset to build a model for predictive texting. The method is based on analysing word combinations to identify the most commonly used ones. The main focus will be a 2-gram model, but the model will also accommodate 3-word phrases.
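To make the 2-gram idea concrete, here is a minimal Python sketch (not the report's own code, which is not shown). It counts which word follows which and suggests the most frequent followers. The function names `train_bigrams` and `predict_next`, the lower-casing, the whitespace tokenisation and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, how often each other word follows it."""
    following = defaultdict(Counter)
    for sentence in sentences:
        # Assumption: lower-case and split on whitespace; a real model
        # would need more careful tokenisation.
        words = sentence.lower().split()
        for first, second in zip(words, words[1:]):
            following[first][second] += 1
    return following


def predict_next(following, word, k=3):
    """Return up to k most frequent followers of `word` (empty if unseen)."""
    counts = following.get(word.lower(), Counter())
    return [candidate for candidate, _ in counts.most_common(k)]


# Toy corpus, invented purely for this example.
toy_corpus = [
    "I want to go home",
    "I want to sleep",
    "I want a coffee",
    "you want to go out",
]

model = train_bigrams(toy_corpus)
suggestions = predict_next(model, "want", k=1)
print(suggestions)
```

A 3-word extension could follow the same pattern by keying the table on the previous two words instead of one, falling back to the 2-gram table when the pair has not been seen.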
The first step is to list the files and look at their sizes, word counts and line counts.
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
##   file_names unlist.size.
## 1      blogs          120
## 2       news          120
## 3    twitter          136
##   file_names wordcount linecount
## 1      blogs  43937582    899288
## 2       news   3137816     77259
## 3    twitter  36947512   2360148
## 4      Total  84022910   3336695
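As a rough illustration of how such counts can be produced and compared, the Python sketch below counts lines and whitespace-separated words in a text, then derives the average words per line from the figures in the table above. It is not the report's own code; the function name `count_text` is hypothetical, and treating any run of non-whitespace characters as a word is an assumption about how `wordcount` was defined.

```python
def count_text(text):
    """Return (line_count, word_count) for a block of text.

    Assumption: a "word" is any run of non-whitespace characters.
    """
    lines = text.splitlines()
    words = sum(len(line.split()) for line in lines)
    return len(lines), words


# Counts copied from the summary table above: (wordcount, linecount).
reported = {
    "blogs": (43937582, 899288),
    "news": (3137816, 77259),
    "twitter": (36947512, 2360148),
}

# Average line length per source, in words.
words_per_line = {name: w / n for name, (w, n) in reported.items()}

for name, ratio in words_per_line.items():
    print(f"{name}: {ratio:.1f} words per line")
```

On the reported counts this gives roughly 49 words per line for blogs, 41 for news and 16 for twitter.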
The next step is to look more closely at the frequency of (English) words in each of the files. All 3 files are combined into one corpus for ease of analysis. Because of the nature of texting and blogging, words containing numbers are kept in case they are popular shorthand.
The output below shows [1] the number of words remaining and [2] the number of words removed.
## [1] 83081749
## [1] 941161
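The two counts above add up to the 84022910 total words reported earlier, which suggests that whole tokens were dropped rather than individual characters. The Python sketch below shows one way such a filter could work; it is an assumption, not the report's actual method. The function name `split_ascii_tokens`, the choice to split on single spaces (which is what produces empty tokens) and the sample string are all hypothetical.

```python
def split_ascii_tokens(text):
    """Split on single spaces and drop tokens containing non-ASCII characters.

    Returns (kept_tokens, removed_count, empty_count). Splitting on a single
    space means consecutive spaces yield empty-string tokens, which are kept
    and counted separately.
    """
    tokens = text.split(" ")
    kept = [t for t in tokens if t.isascii()]
    removed = len(tokens) - len(kept)
    empty = kept.count("")
    return kept, removed, empty


# Invented sample: two non-ASCII words and one double space.
kept, removed, empty = split_ascii_tokens("caf\u00e9 is  open na\u00efve 24h")
print(kept, removed, empty)

# Consistency check on the figures reported in this document.
remaining_reported = 83081749
removed_reported = 941161
empty_reported = 11560554
total_reported = remaining_reported + removed_reported
usable_words = remaining_reported - empty_reported
print(total_reported, usable_words)
```

Note that the alphanumeric token `24h` survives the filter, in line with the decision to keep words containing numbers.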
## Loading required package: NLP
##    US_EN.ASCII     Freq
## 1              11560554
## 2          the  2644358
## 3           to  1895039
## 4            I  1651448
## 5            a  1507983
## 6          and  1507581
## 7           of  1279227
## 8           in   964643
## 9          you   802892
## 10          is   786907
##    US_EN.ASCII    Freq percentage cum_percentage
## 2          the 2644358 0.03697307     0.03697307
## 3           to 1895039 0.02649619     0.06346926
## 4            I 1651448 0.02309033     0.08655959
## 5            a 1507983 0.02108442     0.10764401
## 6          and 1507581 0.02107880     0.12872281
## 7           of 1279227 0.01788599     0.14660880
## 8           in  964643 0.01348751     0.16009631
## 9          you  802892 0.01122593     0.17132224
## 10          is  786907 0.01100243     0.18232467
## 11         for  742367 0.01037968     0.19270435
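The `percentage` and `cum_percentage` columns can be reproduced from the frequencies alone. The Python sketch below does so for the 11 words listed, using as denominator the remaining token count minus the empty tokens (83081749 - 11560554), which is the value that matches the printed percentages. The helper names `coverage_table` and `words_needed` are hypothetical, and this is an illustration rather than the report's own code.

```python
from itertools import accumulate

# Frequencies copied from the table above.
top_words = [
    ("the", 2644358), ("to", 1895039), ("I", 1651448), ("a", 1507983),
    ("and", 1507581), ("of", 1279227), ("in", 964643), ("you", 802892),
    ("is", 786907), ("for", 742367),
]

# Tokens remaining after dropping the empty-string tokens.
total_words = 83081749 - 11560554


def coverage_table(counts, total):
    """Return rows of (word, freq, share, cumulative_share)."""
    shares = [freq / total for _, freq in counts]
    cumulative = list(accumulate(shares))
    return [(w, f, s, c) for (w, f), s, c in zip(counts, shares, cumulative)]


def words_needed(counts, total, target):
    """Smallest number of top words whose combined share reaches `target`.

    Returns None if the supplied list never reaches the target.
    """
    running = 0
    for rank, (_, freq) in enumerate(counts, start=1):
        running += freq
        if running / total >= target:
            return rank
    return None


table = coverage_table(top_words, total_words)
for word, freq, share, cum in table:
    print(f"{word:>4} {freq:>8} {share:.8f} {cum:.8f}")
```

Given the full frequency table rather than just these rows, `words_needed(all_counts, total_words, 0.5)` would return the number of words required to cover half of all usage.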