The following report shows the amazing world of Data Science in action. In particular, it demonstrates some of the basic functions and features that developers and users can benefit from when they need to analyze the nature and content of text documents. This analysis reviews three text files (en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt), which were downloaded from the Coursera website. Since the files are huge, only a sample of 10,000 randomly selected records per file was used for this report. We hope you enjoy it!
Total lines found in file en_US.news.txt:
## [1] "1,010,242"
Total lines found in file en_US.twitter.txt:
## [1] "2,360,148"
Total lines found in file en_US.blogs.txt:
## [1] "899,288"
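The line totals above can be reproduced in many ways. The following is a minimal Python sketch, not the code used for this report: the helper names `count_lines` and `format_count` are hypothetical, and the UTF-8 encoding is an assumption. Counts may differ slightly from the figures above depending on how a given tool treats unusual line endings or control characters.

```python
from pathlib import Path


def count_lines(path, encoding="utf-8"):
    """Count the lines of a text file by streaming it, so the whole
    file never has to fit in memory."""
    # errors="ignore" skips undecodable bytes (assumption: the files are UTF-8).
    with open(path, "r", encoding=encoding, errors="ignore") as handle:
        return sum(1 for _ in handle)


def format_count(n):
    """Format an integer with thousands separators, e.g. 899288 -> '899,288'."""
    return f"{n:,}"


if __name__ == "__main__":
    # File names taken from the report; files that are missing are skipped.
    for name in ("en_US.news.txt", "en_US.twitter.txt", "en_US.blogs.txt"):
        if Path(name).exists():
            print(f"Total lines found in file {name}: {format_count(count_lines(name))}")
```

Streaming the file line by line keeps memory use flat, which matters for files of this size.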
A sample file with 10,000 randomly selected records was created for each text file. A word cloud was then generated for each of the samples (see below).
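The report does not say how the random selection was carried out. One way to draw 10,000 records from a file too large to load comfortably is reservoir sampling, sketched below in Python. This is an illustrative assumption rather than the method actually used: the function name `sample_lines`, the `seed` parameter, and the UTF-8 encoding are all hypothetical choices.

```python
import random


def sample_lines(path, k=10_000, seed=None, encoding="utf-8"):
    """Return k lines chosen uniformly at random from a file in a single
    pass (reservoir sampling). If the file has fewer than k lines,
    every line is returned."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    reservoir = []
    with open(path, "r", encoding=encoding, errors="ignore") as handle:
        for i, line in enumerate(handle):
            line = line.rstrip("\n")
            if i < k:
                # Fill the reservoir with the first k lines.
                reservoir.append(line)
            else:
                # Keep the new line with probability k / (i + 1).
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = line
    return reservoir
```

A call such as `sample_lines("en_US.blogs.txt", k=10_000, seed=42)` would return one sample per file. Fixing the seed is useful in a report like this one, because it lets readers regenerate the same word clouds and tables.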
Interpreting the usBlogs word cloud: The larger a word is drawn, the more often it occurs in the sample, so the biggest words mark the most common topics discussed in blogs. Also notice people’s interest in life, work, home, book…
Interpreting the usNews word cloud: Again, the biggest words mark the most recurring topics in the news. Notice that the word "said" is the most used term, which makes sense since news articles normally report on what others have said and done.
Interpreting the usTwitter word cloud: Here too, the biggest words mark the most recurring topics on Twitter. An important thing to notice in this word cloud is how often people refer to timing: time, weekend, tomorrow, tonight… It is also interesting to note the expressions of people’s sentiment: love, thanks, happy, hope, feel…
Top 10 terms in the usBlogs sample:
##   like   time    get    new    now    see   know   make    day   good
##    622    556    462    370    369    356    355    348    324    324
Top 10 terms in the usNews sample:
##   said    new   year   last   like people  years   time  state    get
##   2016    588    513    434    422    392    351    345    340    328
Top 10 terms in the usTwitter sample:
##   like    get   good   love    day  great    now   know thanks    new
##    484    472    433    392    359    348    344    327    316    313
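Frequency tables like the ones above can be built by tokenizing each sampled record, dropping very common function words, and counting what remains. The Python sketch below illustrates the idea under stated assumptions: the report does not describe its preprocessing, so the tiny `STOPWORDS` set, the tokenizing pattern, and the function name `top_terms` are hypothetical, and the counts it produces on the real samples need not match the tables exactly.

```python
import re
from collections import Counter

# A deliberately small, illustrative stop word list (an assumption, not
# the list used for the report).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "i",
    "you", "that", "for", "on", "with", "this", "was", "be", "are",
    "at", "as", "but", "have", "not",
}

# Words made of letters, optionally with one internal apostrophe ("don't").
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def top_terms(lines, n=10, stopwords=STOPWORDS):
    """Return the n most frequent terms as (term, count) pairs."""
    counts = Counter()
    for line in lines:
        tokens = TOKEN_PATTERN.findall(line.lower())
        # Skip stop words and single-letter tokens.
        counts.update(t for t in tokens if t not in stopwords and len(t) > 1)
    return counts.most_common(n)


if __name__ == "__main__":
    demo = ["I like the new day", "You like it", "A new time"]
    for term, count in top_terms(demo, n=2):
        print(term, count)
```

The same counts can drive a word cloud, since the size of each word there is simply a function of its frequency.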
Notice that the term "like" appears in all three tables as one of the most used words. From this we conclude that analysing the content of sources such as news sites, blogs and Twitter makes it possible to gauge how people feel about different topics.
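The overlap between the three tables can be checked directly with a set intersection. The short Python sketch below uses the counts printed above. The assignment of each table to a source (blogs, news, Twitter, in that order) is an assumption based on the order of the word clouds and on the terms each table contains.

```python
# Top 10 term counts copied from the three tables in this report.
blogs = {"like": 622, "time": 556, "get": 462, "new": 370, "now": 369,
         "see": 356, "know": 355, "make": 348, "day": 324, "good": 324}
news = {"said": 2016, "new": 588, "year": 513, "last": 434, "like": 422,
        "people": 392, "years": 351, "time": 345, "state": 340, "get": 328}
twitter = {"like": 484, "get": 472, "good": 433, "love": 392, "day": 359,
           "great": 348, "now": 344, "know": 327, "thanks": 316, "new": 313}

# Terms that appear in the top 10 of all three samples.
shared = sorted(set(blogs) & set(news) & set(twitter))
print(shared)  # prints ['get', 'like', 'new']
```

Besides "like", the terms "get" and "new" are also shared by all three samples.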