Executive Summary

The following report shows Data Science in action. In particular, it demonstrates some of the basic functions and features that developers and users can benefit from when they need to analyze the nature and content of text documents. In this analysis, three text files are reviewed (en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt). They were downloaded from the Coursera website. Since the files are huge, only a sample of 10,000 randomly selected records per file was used for this report. We hope you enjoy it!


Data Processing.

The following actions were taken before generating the results:

  1. Download the zip file from Coursera
  2. Read the 3 files from the local drive.
  3. Remove:
    1. Commonly used words that do not add value to the analysis (e.g., the, of, also…)
    2. Non-ASCII characters
    3. Non-printable characters
    4. Punctuation
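
The removal steps above can be sketched in a few lines. The report does not show its own code, so the following is only a minimal Python illustration: the function name `clean_line`, the tiny `STOP_WORDS` set, and the lowercasing step are assumptions, not the report's actual implementation.

```python
import string

# Hypothetical, deliberately tiny stop-word list; the report does not say which list it used.
STOP_WORDS = {"the", "of", "also", "a", "an", "and", "to", "in", "is"}

def clean_line(line):
    """Return the lowercase tokens of `line` after the removal steps listed above."""
    # Drop non-ASCII characters.
    text = line.encode("ascii", errors="ignore").decode("ascii")
    # Replace non-printable characters (tabs, control codes) with spaces.
    text = "".join(ch if ch.isprintable() else " " for ch in text)
    # Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Lowercase (an assumption), split on whitespace, and drop stop words.
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]
```

Non-printable characters are replaced with spaces rather than deleted so that two words separated only by a tab are not fused into one token.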

Determine how many lines are in each text file.

Total lines found in file en_US.news.txt:

## [1] "1,010,242"

Total lines found in file en_US.twitter.txt:

## [1] "2,360,148"

Total lines found in file en_US.blogs.txt:

## [1] "899,288"
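
Line counts like those above can be obtained without loading a whole file into memory. This is a hedged Python sketch, not the code behind the report: `count_lines` and `format_count` are hypothetical helper names, and the UTF-8 encoding with ignored decoding errors is an assumption.

```python
def count_lines(path):
    """Count lines in a text file by streaming it one line at a time."""
    # errors="ignore" skips bytes that are not valid UTF-8 (an assumption about the files).
    with open(path, "r", encoding="utf-8", errors="ignore") as handle:
        return sum(1 for _ in handle)

def format_count(n):
    """Format an integer with thousands separators, e.g. 1010242 -> '1,010,242'."""
    return f"{n:,}"
```

Different tools can disagree slightly on a count when a file contains unusual line terminators or embedded control characters, so a small difference from the figures above would not necessarily indicate an error.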

Reports.

Showing basic features of the files.

Word Clouds for the three files.


A sample file with 10,000 randomly selected records was created for each text file. A word cloud was then generated for each of the samples (see below).
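
The sampling step can be illustrated as follows. This is a minimal Python sketch under stated assumptions: the name `sample_lines`, the fixed seed, and sampling without replacement are illustrative choices, since the report does not describe how its samples were drawn.

```python
import random

def sample_lines(lines, k=10_000, seed=42):
    """Return k lines drawn at random without replacement (all lines if fewer than k)."""
    # The seed is an assumption, used only so that the sample is reproducible.
    rng = random.Random(seed)
    return rng.sample(lines, min(k, len(lines)))
```

Fixing the seed means the same records are selected on every run, which keeps the word clouds and term counts stable between runs.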


US Blogs - word cloud.

Interpreting the usBlogs word cloud: the larger a word appears, the more often it occurs in the sampled blogs, so the largest words indicate the most common topics. Also notice people’s interest in life, work, home, book…


US News - word cloud.

Interpreting the usNews word cloud: the larger a word appears, the more often it occurs in the sampled news, so the largest words indicate the most recurring topics. Also notice that the word "said" is the most used term in the news, which makes sense since news articles normally report on others' statements and actions.


US Twitter - word cloud.

Interpreting the usTwitter word cloud: the larger a word appears, the more often it occurs in the sampled tweets, so the largest words indicate the most recurring topics. An important thing to notice in this word cloud is how often people refer to timing, such as: time, weekend, tomorrow, tonight… It is also interesting to note the expressions of people’s sentiment in this chart: love, thanks, happy, hope, feel…


Summary: most used terms.


Top 10 terms (words) used in US Blogs.
## like time  get  new  now  see know make  day good 
##  622  556  462  370  369  356  355  348  324  324
Top 10 terms (words) used in US News.
##   said    new   year   last   like people  years   time  state    get 
##   2016    588    513    434    422    392    351    345    340    328
Top 10 terms (words) used in US Twitter.
##   like    get   good   love    day  great    now   know thanks    new 
##    484    472    433    392    359    348    344    327    316    313
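
Tables like the ones above boil down to counting tokens and keeping the most frequent ones. The following Python sketch shows the idea. `top_terms` is a hypothetical helper, and it counts every occurrence of a term, which may differ from how the report's counts were produced.

```python
from collections import Counter

def top_terms(token_lists, n=10):
    """Return the n most frequent terms across an iterable of token lists."""
    counts = Counter()
    for tokens in token_lists:
        # Each occurrence counts, so a term repeated within one record is counted each time.
        counts.update(tokens)
    return counts.most_common(n)
```

For example, `top_terms([["like", "time"], ["like", "get"], ["like", "time"]], n=2)` returns `[("like", 3), ("time", 2)]`.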

Observation about the summary above:

Notice that the term "like" appears in all three tables as one of the most used words. From this we conclude that analysing the content of sources such as News, Blogs and Twitter can help determine how people feel about different topics.
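
The overlap between the three top-10 tables can be checked directly with a set intersection. The lists below are copied from the tables in this report. The variable names are illustrative only.

```python
# Top 10 terms per source, copied from the summary tables above.
blogs = ["like", "time", "get", "new", "now", "see", "know", "make", "day", "good"]
news = ["said", "new", "year", "last", "like", "people", "years", "time", "state", "get"]
twitter = ["like", "get", "good", "love", "day", "great", "now", "know", "thanks", "new"]

# Terms that appear in all three top-10 lists.
shared = set(blogs) & set(news) & set(twitter)
print(sorted(shared))  # ['get', 'like', 'new']
```

Besides "like", the terms "get" and "new" are also shared by all three lists.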


Bar chart to show the 10 most used terms in the three files.