Executive Summary

This is part one of the final project. I explore the three US text sets: news, blogs, and tweets.

Exploration goals

  1. Do basic summaries of the three files: word counts, line counts, and basic data tables.
  2. Make basic plots, such as histograms, to illustrate features of the data.
  3. Write the report in a brief, concise style that a non-data-scientist manager can appreciate.

Load Libraries

library(tidyverse)
library(tidytext)
library(tidyr)
library(tm)
library(stringr)
library(ngram)
library(scales)

Load Data

These files are extremely large, so I read in only the first 10,000 lines of each file for exploration.
After reading in each dataset I put it into a data frame with a column identifying its source, then row-bind the three data frames into one.
The text column is created as a factor, which is not useful for such a large set, so I convert it to character instead.

# Read in sample of full files
sample_size = 10000
path <- "~/Analytics Course/09_Capstone_project/en_US/"
us_blogs <- read_lines(paste0(path,"en_US.blogs.txt"), n_max = sample_size)
us_news <- read_lines(paste0(path,"en_US.news.txt"), n_max = sample_size)
us_twitter <- read_lines(paste0(path,"en_US.twitter.txt"), n_max = sample_size)

blogs_df <- data.frame(source = "us_blogs", line_number = 1:sample_size, 
                       text = us_blogs)
news_df <- data.frame(source = "us_news", line_number = 1:sample_size, 
                                  text = us_news)
twitter_df <- data.frame(source = "us_twitter", line_number = 1:sample_size, 
                         text = us_twitter)

Bind it all together

all_us_df <- rbind(blogs_df, news_df, twitter_df)
all_us_df$text <- as.character(all_us_df$text)  
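
As an aside, the factor conversion above can be avoided altogether by passing stringsAsFactors = FALSE when building the data frames (R versions before 4.0 convert character columns to factors by default). A minimal sketch for the blogs data frame:

# Sketch only: build the data frame without factor conversion
blogs_df <- data.frame(source = "us_blogs", line_number = 1:sample_size,
                       text = us_blogs, stringsAsFactors = FALSE)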

Remove text, such as punctuation, that doesn’t contribute to understanding

# This section removes text that does not contribute to meaning
us_text <- all_us_df$text
us_text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", us_text) # remove retweet entities
us_text <- gsub("@\\w+", "", us_text) # remove at people
us_text <- gsub("[ \t]{2,}", "", us_text) # remove unnecessary tabs
us_text <- gsub("^\\s+|\\s+$", "", us_text) # remove unnecessary spaces
us_text <- gsub('https://','',us_text) # removes https://
us_text <- gsub('http://','',us_text) # removes http://
us_text <- gsub('[^[:graph:]]', ' ', us_text) # replaces non-graphical characters with spaces
us_text <- gsub('[[:punct:]]', '', us_text) # removes punctuation 
us_text <- gsub('[[:cntrl:]]', '', us_text) # removes control characters
us_text <- gsub('\\d+', '', us_text) # removes numbers
us_text <- tolower(us_text) # makes all letters lowercase
us_text <- us_text[!is.na(us_text)] # remove NAs

us_w_clean_text <- cbind(all_us_df, clean_text = us_text)
us_df <- us_w_clean_text %>%
    select(source, line_number, clean_text) %>%
    rename(text = clean_text)
us_df$text <- as.character(us_df$text)

Break dataset into individual words

This process uses the tidytext unnest_tokens() function. Note: because the jittered plot later in the report introduces randomness, setting a random seed first would make those results fully reproducible.

us_unnested <- us_df %>% 
    unnest_tokens(word, text)
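
As a quick illustration, separate from the analysis itself, unnest_tokens() lowercases tokens and strips punctuation by default, producing one output row per word. A toy example (the input string here is made up purely for demonstration):

# Toy example only: one line of text becomes one lowercased row per word
tibble(text = "Time flies; TIME flies!") %>%
    unnest_tokens(word, text)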

Count of unique words from each source

us_unnested %>%
    group_by(source) %>%
    summarise(unique_words = n_distinct(word))
## # A tibble: 3 x 2
##       source unique_words
##       <fctr>        <int>
## 1   us_blogs        30872
## 2    us_news        30625
## 3 us_twitter        15452

It is interesting that us_blogs and us_news have almost the same number of unique words.

Even from this sample you can see that there are a number of very common words, called stopwords, that need to be excluded since they don’t provide much context.
Examples of stopwords are ‘be’, ‘or’, and ‘of’.

The 30,000 rows of text turn into ~868k individual words.

nrow(us_unnested)
## [1] 868300

Remove stopwords

The tidytext package comes with a dataset of English stopwords. These words provide little context and can be removed so that only the more meaningful words are retained. Removal is done with an anti-join. I added a few stopwords of my own to the list: ‘rt’ and the numbers 1 through 10.
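
For reference, the built-in stop_words data frame can be inspected directly; it combines several stopword lexicons in a two-column table (word, lexicon). A quick peek, not shown in the original output:

# Peek at tidytext's built-in stopword list and its lexicons
head(stop_words)
count(stop_words, lexicon)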

my_stopwords <- data_frame( word = c("rt", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"), lexicon = "MINE")

stop_words <- rbind(stop_words, my_stopwords)

us_no_stopwords <- us_unnested %>%
  anti_join(stop_words)
## Joining, by = "word"
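
As an optional sanity check, separate from the original output, a semi-join against the stopword list should now return zero rows:

# Optional check: count any stopword tokens that remain (expected: 0)
us_no_stopwords %>%
    semi_join(stop_words, by = "word") %>%
    nrow()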

Count of unique words from each source after removing stopwords

us_no_stopwords %>%
    group_by(source) %>%
    summarise(unique_words = n_distinct(word))
## # A tibble: 3 x 2
##       source unique_words
##       <fctr>        <int>
## 1   us_blogs        30224
## 2    us_news        29996
## 3 us_twitter        14871

Not a big change: stopwords occur very frequently, but there are only a small number of distinct stopwords.

What proportion of the word occurrences were stopwords?

1 - (nrow(us_no_stopwords)/nrow(us_unnested))
## [1] 0.5845526

This indicates that stopwords account for about 58% of the word occurrences, leaving roughly 42% of the words for analysis.

Most often used words in the whole dataset

us_no_stopwords %>%
  count(word, sort = TRUE) 
## # A tibble: 50,966 x 2
##      word     n
##     <chr> <int>
##  1   time  1853
##  2 people  1322
##  3     im  1319
##  4    day  1197
##  5   love  1055
##  6   dont   981
##  7   home   743
##  8   life   671
##  9 school   613
## 10   week   607
## # ... with 50,956 more rows

Chart the top overall usage

I select only words that were used more than 500 times.

us_no_stopwords %>%
  count(word, sort = TRUE) %>%
  filter(n > 500) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Top ten words per source

us_ranked_top_10 <- us_no_stopwords %>%
    group_by(source) %>%
    count(word, sort=TRUE) %>%
    arrange(source, -n) %>%
    top_n(10)
## Selecting by n
us_ranked_top_10[1:10, ]
## # A tibble: 10 x 3
## # Groups:   source [1]
##      source   word     n
##      <fctr>  <chr> <int>
##  1 us_blogs   time  1018
##  2 us_blogs people   639
##  3 us_blogs    day   539
##  4 us_blogs   love   516
##  5 us_blogs     im   478
##  6 us_blogs   life   403
##  7 us_blogs  world   325
##  8 us_blogs   dont   317
##  9 us_blogs    god   308
## 10 us_blogs    don   305
us_ranked_top_10[11:20, ]
## # A tibble: 10 x 3
## # Groups:   source [1]
##     source    word     n
##     <fctr>   <chr> <int>
##  1 us_news    time   514
##  2 us_news  people   485
##  3 us_news percent   370
##  4 us_news  school   351
##  5 us_news    city   346
##  6 us_news    home   332
##  7 us_news million   325
##  8 us_news    game   298
##  9 us_news  county   293
## 10 us_news     day   284
us_ranked_top_10[21:30, ]
## # A tibble: 10 x 3
## # Groups:   source [1]
##        source    word     n
##        <fctr>   <chr> <int>
##  1 us_twitter      im   644
##  2 us_twitter    love   435
##  3 us_twitter    dont   386
##  4 us_twitter     day   374
##  5 us_twitter    time   321
##  6 us_twitter     lol   260
##  7 us_twitter  follow   225
##  8 us_twitter   happy   207
##  9 us_twitter  people   198
## 10 us_twitter tonight   194

Note that the words ‘time’, ‘day’, and ‘people’ are in the top ten of each dataset.

Reshape data to get the proportion of words per source

We can use spread() and gather() from the tidyr package to reshape the data frame into exactly the form needed for plotting and comparing the three sources.
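
As a tiny illustration, separate from the analysis, spread() turns a key column into one column per key value, and gather() reverses the operation (the toy tibble below is made up for demonstration):

# Toy example only: widen by source, then return to long form
toy <- tibble(source = c("a", "a", "b"),
              word   = c("x", "y", "x"),
              prop   = c(0.5, 0.5, 1.0))
toy %>% spread(source, prop)        # one column per source, NA where a word is missing
toy %>% spread(source, prop) %>%
    gather(source, prop, -word)     # back to long form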

Get the count of each unique word per source

us_ranked_complete <- us_no_stopwords %>%
    group_by(source) %>%
    count(word, sort=TRUE)
head(us_ranked_complete)
## # A tibble: 6 x 3
## # Groups:   source [3]
##       source   word     n
##       <fctr>  <chr> <int>
## 1   us_blogs   time  1018
## 2 us_twitter     im   644
## 3   us_blogs people   639
## 4   us_blogs    day   539
## 5   us_blogs   love   516
## 6    us_news   time   514

Final Re-shaping

I am going to treat the us_news source as having the most formal use of English. In this next part we’ll see how close blogs and tweets come to formal English. Note that I leave the us_news proportion as its own column while gathering blogs and tweets into a single column.

us_gathered <- us_ranked_complete %>%
  mutate(proportion = n / sum(n)) %>% # Calculate each word's proportion within its source
  select(-n) %>% # remove the raw count
  spread(source, proportion) %>% # Spread out so the baseline source (us_news) gets its own column
  select(word, us_news, everything()) %>% # Re-order to put us_news in the 2nd column
  gather(source, proportion, us_blogs:us_twitter) # Gather so the proportions for us_blogs and us_twitter can be compared to us_news
head(us_gathered)
## # A tibble: 6 x 4
##       word      us_news   source   proportion
##      <chr>        <dbl>    <chr>        <dbl>
## 1        à           NA us_blogs 2.580212e-05
## 2        â 6.491567e-06 us_blogs           NA
## 3       aa 1.947470e-05 us_blogs 4.515372e-05
## 4      aaa 3.894940e-05 us_blogs           NA
## 5     aaaa 6.491567e-06 us_blogs           NA
## 6 aaaaaand           NA us_blogs 6.450531e-06

There’s a lot going on here but this is the format you want for the next chart.

Plot correlation of words in blogs and tweets, compared to news

This plot compares words used in us_news with words in us_blogs and us_twitter. Words that fall further from the correlation line are more distinctive to that particular data source relative to us_news. For instance, the word ‘blog’ occurs more frequently in us_blogs than in us_news, and the word ‘twitter’ appears more frequently in the us_twitter data than in us_news.
I like this plot because you can compare how distinctive each data source’s vocabulary is relative to the others.

ggplot(us_gathered, aes(x = proportion, y = us_news, color = abs(us_news - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~source, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "us_news", x = NULL)
## Warning: Removed 79272 rows containing missing values (geom_point).
## Warning: Removed 79272 rows containing missing values (geom_text).

You can see that the blogs panel is denser along the correlation line than the tweets panel, which suggests that tweets are a less formal form of communication than blogs.

What are the actual correlations?

US News to US Blogs

cor.test(data = us_gathered[us_gathered$source == "us_blogs",],
         ~ proportion + us_news)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and us_news
## t = 120.59, df = 14397, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7006603 0.7169143
## sample estimates:
##       cor 
## 0.7088814

US News to US Tweets

cor.test(data = us_gathered[us_gathered$source == "us_twitter",],
         ~ proportion + us_news)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and us_news
## t = 54.163, df = 8259, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4958722 0.5277011
## sample estimates:
##       cor 
## 0.5119624
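
For reference, the same two correlations can be computed in a single dplyr pipe; this sketch reproduces the estimates above (without the confidence intervals):

# Sketch: correlation of each source's word proportions with us_news
us_gathered %>%
    group_by(source) %>%
    summarise(correlation = cor(proportion, us_news, use = "complete.obs"))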

From this we see that word frequencies in blogs correlate with news at about 0.71, while tweets correlate at about 0.51, so blog vocabulary is noticeably closer to formal news vocabulary than tweet vocabulary is.

Final comments

I hope you enjoyed this exploratory analysis of three types of written communication originating in the US.

In this project I have:
1. Produced basic summaries of the three files: word counts, line counts, and basic data tables.
2. Made basic plots (a bar chart of the most frequent words and scatterplots comparing sources) to illustrate features of the data.
3. Written the report in a brief, concise style that a non-data-scientist manager can appreciate.

End