The word count was computed for each line, and summary statistics and density plots were produced to show how the word counts are distributed:
Summary statistics of word counts:
Blogs:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00    8.00   28.00   41.38   59.00 6629.00
News:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0    19.0    31.0    34.1    45.0  1030.0
Twitter:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    7.00   12.00   12.61   18.00   47.00
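Summaries like the ones above could be produced along the following lines. This is a minimal sketch in base R, not the code actually used for this report: `count_words` is a hypothetical helper, the input is a toy character vector rather than the real blog, news, or Twitter files, and a "word" is assumed to be any run of non-whitespace characters.

```r
# Hypothetical helper: number of whitespace-separated words in each line.
count_words <- function(lines) {
  trimmed <- trimws(lines)
  # Split on runs of whitespace; an empty line yields zero words.
  vapply(strsplit(trimmed, "\\s+"), function(w) sum(nzchar(w)), integer(1))
}

# Toy input standing in for one of the corpus files (assumption).
example_lines <- c("the quick brown fox", "", "hello   world", "one")

word_counts <- count_words(example_lines)

# Six-number summary in the same layout as the tables above.
wc_summary <- summary(word_counts)
print(wc_summary)

# Kernel density estimate of the word counts; plot(wc_density) would draw it.
wc_density <- density(word_counts)
```

Under this definition, an empty line counts as zero words, which is consistent with the minimum of 0 reported for the blog and news sources.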
The exploratory analysis shows that the blog and news sources have a long upper tail (maxima of 6,629 and 1,030 words per line, respectively), but that most of their lines have word counts fairly close to the median: half of the blog lines contain between 8 and 59 words, and half of the news lines between 19 and 45.
It’s also interesting that the Twitter word count density shows a sawtooth pattern. Twitter has by far the largest number of lines but the smallest range of word counts (1 to 47). I’ll investigate why this pattern appears if time permits; otherwise I’m not too concerned, given that the overall shape of the density curve is what I expected to see.
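One possible cause of the sawtooth, which this report has not confirmed, is that word counts are integers: with a very large number of lines, the default bandwidth chosen by `density()` becomes narrower than the spacing between adjacent integers, so each integer gets its own bump. The sketch below illustrates the idea on synthetic Poisson data (an assumption, not the real Twitter word counts); `count_peaks` is a hypothetical helper that counts local maxima in the estimated curve.

```r
set.seed(1)

# Synthetic integer "word counts" in a narrow range (NOT the real data).
fake_counts <- rpois(200000, lambda = 12)

# Default bandwidth versus a bandwidth four times wider.
d_default <- density(fake_counts)
d_smooth  <- density(fake_counts, adjust = 4)

# Hypothetical helper: count local maxima, a rough measure of "teeth".
count_peaks <- function(y) sum(diff(sign(diff(y))) == -2)

peaks_default <- count_peaks(d_default$y)
peaks_smooth  <- count_peaks(d_smooth$y)

cat("bandwidth (default):", d_default$bw, "peaks:", peaks_default, "\n")
cat("bandwidth (adjust = 4):", d_smooth$bw, "peaks:", peaks_smooth, "\n")
```

If the same effect is behind the Twitter plot, increasing `adjust` (or plotting a histogram with one bin per integer) should remove the teeth while leaving the overall shape unchanged.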