Basic summaries of the news file

First, here are the number of lines, words and characters in the news file, along with an example line.

I will explore this file to obtain the word frequency distribution and find the most common words. To do that, I take a 10% sample of the text for an initial exploratory analysis. Keep in mind that this file is drawn from news sources.

## [1] "Lines: 1010242"
## [1] "Words: 34372598"
## [1] "Characters: 203791405"
## [1] "Example: Of course, Paul was 20 as a ro ..."
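Counts like the ones above can be obtained along the following lines. This is a minimal base-R sketch, not necessarily the code used for this report: the helper names `summarize_lines` and `sample_lines`, the whitespace-based word split, the 30-character preview and the seed are all assumptions made for illustration, and the file name in the final comment is hypothetical.

```r
# Hypothetical helper: summarise a character vector holding one line per element.
summarize_lines <- function(lines, preview_chars = 30) {
  # Split every line on runs of whitespace and drop the empty tokens
  # that appear when a line starts with whitespace.
  tokens <- unlist(strsplit(lines, "[[:space:]]+"))
  tokens <- tokens[nzchar(tokens)]
  list(
    lines      = length(lines),
    words      = length(tokens),
    characters = sum(nchar(lines)),
    # Preview of the first line, truncated to `preview_chars` characters.
    example    = paste(substr(lines[1], 1, preview_chars), "...")
  )
}

# Hypothetical helper: draw a random fraction of the lines (10% by default).
sample_lines <- function(lines, fraction = 0.10, seed = 1) {
  set.seed(seed)  # fixed seed so the sample is reproducible
  n <- ceiling(length(lines) * fraction)
  lines[sample.int(length(lines), n)]
}

# Possible usage, assuming the news text sits in a file called "news.txt":
# news   <- readLines("news.txt", encoding = "UTF-8", skipNul = TRUE)
# stats  <- summarize_lines(news)
# sample <- sample_lines(news, fraction = 0.10)
```

Splitting on whitespace is the simplest definition of a word; a tokenizer that strips punctuation would give somewhat different word counts.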

Basic charts of the news file

First, I kept only the words that appear more than 5 times. The first chart shows the frequency histogram for the 95% of words with the lowest occurrence. The second is a word cloud of the top 1%; the most common word there is ‘said’, which is not surprising because this file consists of news.

## [1] "Most common word: said"
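The filtering described above can be sketched as follows. This is a base-R illustration under stated assumptions rather than the report's actual code: the helpers `term_counts` and `split_by_quantile` are hypothetical, tokens are lowercased ASCII words, and stop-word removal is left out for brevity.

```r
# Hypothetical helper: count how often each term occurs in a set of lines.
term_counts <- function(lines, min_count = 5) {
  # Lowercase, then split on anything that is not a letter or an apostrophe
  # (assumption: only ASCII letters are treated as part of a word).
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]
  counts <- table(tokens)
  # Keep only the terms that appear more than `min_count` times.
  counts <- counts[counts > min_count]
  df <- data.frame(term = names(counts),
                   term_count = as.integer(counts),
                   stringsAsFactors = FALSE)
  # Sort ascending, so the most common terms end up at the bottom.
  df[order(df$term_count), , drop = FALSE]
}

# Hypothetical helper: split the terms at a quantile of their counts,
# e.g. prob = 0.95 for the histogram and prob = 0.99 for the word cloud.
split_by_quantile <- function(df, prob) {
  cutoff <- quantile(df$term_count, probs = prob, names = FALSE)
  list(lower = df[df$term_count <= cutoff, , drop = FALSE],
       upper = df[df$term_count >  cutoff, , drop = FALSE])
}
```

With these helpers, `tail(term_counts(lines), 1)$term` would give the most common word, and the `lower` and `upper` parts returned by `split_by_quantile` could feed a histogram and a word cloud respectively.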

Basic summaries of the twitter file

Repeating the same process applied to the news file, we get:

## [1] "Lines: 2360148"
## [1] "Words: 30373832"
## [1] "Characters: 162385042"
## [1] "Example: just wanted to thank you & ask ..."

Basic charts of the twitter file

The most common words are just, like, get, love and good. The word cloud also shows expressions widely used on social networks, such as haha and lol.

term  term_count
get        11070
like       12195
just       14883

Basic summaries of the blogs file

Repeating the same process applied to the news file, we get:

## [1] "Lines: 899288"
## [1] "Words: 37334441"
## [1] "Characters: 208361438"
## [1] "Example: The bruschetta however, missed ..."

Basic charts of the blogs file

In the blogs file, the most frequent words are one, will, just, like and can, which are commonly used when writing about experiences and stories.

term  term_count
just        9936
will       11337
one        12398