## Loading the data

Let's load the data and look at some basic metadata for each file. This metadata is mainly how many sentences each dataset has and how much memory the underlying R object is using.
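
The exact loading code is not shown in this report, but a minimal sketch along these lines would produce the warnings and the summary below (`filePathTwitter` appears in the warnings; the other path variables and the exact file locations are assumptions):

```r
# Rough sketch of the loading step; paths other than filePathTwitter are assumed.
library(tibble)

filePathBlogs   <- "final/en_US/en_US.blogs.txt"    # assumed location
filePathNews    <- "final/en_US/en_US.news.txt"     # assumed location
filePathTwitter <- "final/en_US/en_US.twitter.txt"  # assumed location

blogs   <- readLines(filePathBlogs,   encoding = "utf-8")
news    <- readLines(filePathNews,    encoding = "utf-8")
twitter <- readLines(filePathTwitter, encoding = "utf-8")

# Summary: number of lines and memory footprint (in Mb) of each object
tibble(
  fileName = c("en_US.blogs", "en_US.news", "en_US.twitter"),
  numLines = c(length(blogs), length(news), length(twitter)),
  Mb       = c(as.numeric(object.size(blogs)),
               as.numeric(object.size(news)),
               as.numeric(object.size(twitter))) / 1024^2
)
```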

## Warning in readLines(filePathTwitter, encoding = "utf-8"): line 167155 appears
## to contain an embedded nul
## Warning in readLines(filePathTwitter, encoding = "utf-8"): line 268547 appears
## to contain an embedded nul
## Warning in readLines(filePathTwitter, encoding = "utf-8"): line 1274086 appears
## to contain an embedded nul
## Warning in readLines(filePathTwitter, encoding = "utf-8"): line 1759032 appears
## to contain an embedded nul
## # A tibble: 3 × 3
##   fileName      numLines    Mb
##   <chr>            <int> <dbl>
## 1 en_US.blogs     899288  268.
## 2 en_US.news     1010242  270.
## 3 en_US.twitter  2360148  334.

## Word distribution

From the above we clearly see that the amount of data, namely sentences, is in the order of hundreds of thousands. Furthermore, the biggest dataset at our disposal is the Twitter one, with over 2 million sentences, more than twice the number of sentences of either of the other two datasets. Yet the amount of memory it uses is clearly less than double. Why could that be? It can be explained by the nature of the sentences: content posted on Twitter is usually briefer and more concise than the lengthy, fancy lines written for news or a blog.
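
A quick way to check that intuition, assuming the `blogs`, `news` and `twitter` character vectors from the loading sketch above, is to compare the average number of characters per line:

```r
# Mean characters per line for each dataset; Twitter lines should be markedly shorter.
sapply(list(blogs = blogs, news = news, twitter = twitter),
       function(x) mean(nchar(x)))
```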

Let's do some more exploration and see how the words are distributed. Let's start easy with one simple wordcloud for each dataset, and then a more general one with the concatenation of all three. With these graphics we can see how the words are distributed for each dataset and whether there is some discrepancy between collections.
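
One possible way to build those wordclouds, sketched here with the wordcloud package and a hypothetical `plotWordcloud` helper (the sampling rate and the tokenisation are assumptions, not the report's actual code):

```r
library(wordcloud)

# Hypothetical helper: draw a wordcloud from a character vector of lines,
# sampling the lines to keep the computation manageable.
plotWordcloud <- function(lines, sampleSize = 20000, maxWords = 100) {
  sampled <- sample(lines, min(sampleSize, length(lines)))
  words   <- unlist(strsplit(tolower(sampled), "[^a-z']+"))
  words   <- words[nchar(words) > 0]
  freq    <- sort(table(words), decreasing = TRUE)
  wordcloud(names(freq), as.numeric(freq),
            max.words = maxWords, random.order = FALSE)
}

plotWordcloud(blogs)
plotWordcloud(news)
plotWordcloud(twitter)
plotWordcloud(c(blogs, news, twitter))
```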

#### Blogs

#### News

#### Twitter

#### All combined

Interesting: the same words tend to appear in all datasets, and even in the combination of them. These words are "the", "and", "of", "a", "to", "in", and so on. So, as a first guess, we would expect to see a lot of n-grams containing these words.
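
That expectation could later be checked with a quick bigram count, sketched here with the tidytext package (the sampling and the `bigrams` name are assumptions):

```r
library(dplyr)
library(tibble)
library(tidytext)

# Sample the combined data to keep the bigram count fast; sample size is arbitrary.
set.seed(1234)
allLines <- sample(c(blogs, news, twitter), 50000)

bigrams <- tibble(text = allLines) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)

head(bigrams, 10)  # expected to be dominated by "of the", "in the", "to the", ...
```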

Even though the wordcloud is useful, let's also make a standard barplot with the top 25 most used words, along with the number of times each appears in the combined dataset.
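
A possible version of that barplot, sketched with tidytext and ggplot2 (the `wordCounts` table and the tokenisation are assumptions about how the counts were obtained):

```r
library(dplyr)
library(ggplot2)
library(tibble)
library(tidytext)

# Word frequencies over the combined dataset (memory hungry; one could also
# work on a sample of the lines instead).
wordCounts <- tibble(text = c(blogs, news, twitter)) %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

wordCounts %>%
  slice_head(n = 25) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Count",
       title = "Top 25 most frequent words in the combined dataset")
```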

This is pretty useful, but finally, let's see how many words we really need. That is, let's see how many unique words we need to cover 50%, 80%, 90%, 95% and 99% of all word occurrences in the text.
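
The coverage table below can be computed along these lines (a sketch that assumes the `wordCounts` frequency table from the barplot step, sorted by decreasing count):

```r
library(dplyr)
library(tibble)

coverageTable <- function(wordCounts, targets = c(50, 80, 90, 95, 99, 100)) {
  # Cumulative share of all word occurrences covered by the top-k words
  cumCoverage <- cumsum(wordCounts$n) / sum(wordCounts$n) * 100
  tibble(
    CoveragePercentage = targets,
    numWords           = sapply(targets, function(p) which(cumCoverage >= p)[1]),
    wordPercentage     = numWords / nrow(wordCounts)
  )
}

coverageTable(wordCounts)
```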

## # A tibble: 6 × 3
##   CoveragePercentage numWords wordPercentage
##                <dbl>    <int>          <dbl>
## 1                 50      155       0.000219
## 2                 80     2543       0.00360 
## 3                 90     7924       0.0112  
## 4                 95    19120       0.0271  
## 5                 99   119457       0.169   
## 6                100   706664       1

Wow, that is pretty amazing: with only ~17% of the unique words we can cover around 99% of all word occurrences in the text. We can even drop that number to ~3% if we only need 95% coverage.