This is part one of the final project. I will explore the three US English text sets: news, blogs, and tweets.
library(tidyverse)
library(tidytext)
library(tidyr)
library(tm)
library(stringr)
library(ngram)
library(scales)
These files are extremely large, so I read in only the first 10,000 lines of each file for exploration.
After reading in each file, I put it into a data frame with a column identifying its source, then row-bind the three data frames into a single data frame.
The text column was created as a factor, which is not useful for a set this large, so I convert it to character class instead.
# Read in sample of full files
sample_size <- 10000
path <- "~/Analytics Course/09_Capstone_project/en_US/"
us_blogs <- read_lines(paste0(path,"en_US.blogs.txt"), n_max = sample_size)
us_news <- read_lines(paste0(path,"en_US.news.txt"), n_max = sample_size)
us_twitter <- read_lines(paste0(path,"en_US.twitter.txt"), n_max = sample_size)
blogs_df <- data.frame(source = "us_blogs", line_number = 1:sample_size,
text = us_blogs)
news_df <- data.frame(source = "us_news", line_number = 1:sample_size,
text = us_news)
twitter_df <- data.frame(source = "us_twitter", line_number = 1:sample_size,
text = us_twitter)
all_us_df <- rbind(blogs_df, news_df, twitter_df)
all_us_df$text <- as.character(all_us_df$text)
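As an aside, a minimal alternative (assuming a pre-4.0 version of R, where data.frame() turns character columns into factors by default) is to pass stringsAsFactors = FALSE when building each data frame, which makes the conversion step above unnecessary:
blogs_df <- data.frame(source = "us_blogs", line_number = 1:sample_size,
text = us_blogs, stringsAsFactors = FALSE) # text stays character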
# This section removes noise from the text: Twitter artifacts, URL prefixes, punctuation, numbers, and extra whitespace
us_text <- all_us_df$text
us_text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", us_text) # remove retweet entities
us_text <- gsub("@\\w+", "", us_text) # remove at people
us_text <- gsub("[ \t]{2,}", "", us_text) # remove unnecessary tabs
us_text <- gsub("^\\s+|\\s+$", "", us_text) # remove unnecessary spaces
us_text <- gsub('https://','',us_text) # removes https://
us_text <- gsub('http://','',us_text) # removes http://
us_text <- gsub('[^[:graph:]]', ' ', us_text) # replace non-printable characters with spaces
us_text <- gsub('[[:punct:]]', '', us_text) # removes punctuation
us_text <- gsub('[[:cntrl:]]', '', us_text) # removes control characters
us_text <- gsub('\\d+', '', us_text) # removes numbers
us_text <- tolower(us_text) # makes all letters lowercase
us_text <- us_text[!is.na(us_text)] # remove NAs
us_w_clean_text <- cbind(all_us_df, clean_text = us_text)
us_df <- us_w_clean_text %>%
select(source, line_number, clean_text) %>%
rename(text = clean_text)
us_df$text <- as.character(us_df$text)
This process uses the tidytext unnest_tokens() function. Note: since I’m going to introduce randomization later (the jittered plot), I’ll set the random seed for reproducibility.
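The seed call itself does not appear in the chunks below, so here is a minimal sketch of what I mean (the value 1234 is arbitrary and not necessarily the seed used in the original run):
set.seed(1234) # fixes the randomization that geom_jitter() uses in the plot further down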
us_unnested <- us_df %>%
unnest_tokens(word, text)
us_unnested %>%
group_by(source) %>%
summarise(unique_words = n_distinct(word))
## # A tibble: 3 x 2
## source unique_words
## <fctr> <int>
## 1 us_blogs 30872
## 2 us_news 30625
## 3 us_twitter 15452
It is interesting that us_blogs and us_news have almost the same number of unique words.
Even from this sample you can see a number of very common words, called stopwords, that need to be excluded because they provide little context.
Examples of stopwords are ‘be’, ‘or’, and ‘of’.
The 30,000 rows of text turn into ~868k individual words.
nrow(us_unnested)
## [1] 868300
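For a per-source breakdown of total word tokens (as opposed to the unique-word counts above), a quick sketch would be the following; the output is not part of the original run:
us_unnested %>%
count(source) # number of word tokens contributed by each source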
The tidytext package comes with a dataset of English stopwords. These words provide little context and can be removed so that only the more meaningful words are retained. Removal is done with an anti-join. I added a few stopwords of my own to eliminate ‘rt’ and the numbers 1 through 10.
my_stopwords <- data_frame( word = c("rt", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"), lexicon = "MINE")
stop_words <- rbind(stop_words, my_stopwords)
us_no_stopwords <- us_unnested %>%
anti_join(stop_words)
## Joining, by = "word"
us_no_stopwords %>%
group_by(source) %>%
summarise(unique_words = n_distinct(word))
## # A tibble: 3 x 2
## source unique_words
## <fctr> <int>
## 1 us_blogs 30224
## 2 us_news 29996
## 3 us_twitter 14871
Not a big change in unique words, because stopwords are used very frequently but account for only a small share of the distinct vocabulary.
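To show just how short the stopword list is compared with the full vocabulary, a quick check (output omitted) would be:
nrow(stop_words) # a little over a thousand entries, including my 11 additions, versus ~51k distinct words in the sample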
1 - (nrow(us_no_stopwords)/nrow(us_unnested))
## [1] 0.5845526
This indicates that stopwords make up about 58% of the word tokens, so only about 42% (1 - 0.585) remain for analysis.
us_no_stopwords %>%
count(word, sort = TRUE)
## # A tibble: 50,966 x 2
## word n
## <chr> <int>
## 1 time 1853
## 2 people 1322
## 3 im 1319
## 4 day 1197
## 5 love 1055
## 6 dont 981
## 7 home 743
## 8 life 671
## 9 school 613
## 10 week 607
## # ... with 50,956 more rows
For the bar chart below, I keep only the words that were used more than 500 times.
us_no_stopwords %>%
count(word, sort = TRUE) %>%
filter(n > 500) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
us_ranked_top_10 <- us_no_stopwords %>%
group_by(source) %>%
count(word, sort=TRUE) %>%
arrange(source, -n) %>%
top_n(10)
## Selecting by n
us_ranked_top_10[1:10, ]
## # A tibble: 10 x 3
## # Groups: source [1]
## source word n
## <fctr> <chr> <int>
## 1 us_blogs time 1018
## 2 us_blogs people 639
## 3 us_blogs day 539
## 4 us_blogs love 516
## 5 us_blogs im 478
## 6 us_blogs life 403
## 7 us_blogs world 325
## 8 us_blogs dont 317
## 9 us_blogs god 308
## 10 us_blogs don 305
us_ranked_top_10[11:20, ]
## # A tibble: 10 x 3
## # Groups: source [1]
## source word n
## <fctr> <chr> <int>
## 1 us_news time 514
## 2 us_news people 485
## 3 us_news percent 370
## 4 us_news school 351
## 5 us_news city 346
## 6 us_news home 332
## 7 us_news million 325
## 8 us_news game 298
## 9 us_news county 293
## 10 us_news day 284
us_ranked_top_10[21:30, ]
## # A tibble: 10 x 3
## # Groups: source [1]
## source word n
## <fctr> <chr> <int>
## 1 us_twitter im 644
## 2 us_twitter love 435
## 3 us_twitter dont 386
## 4 us_twitter day 374
## 5 us_twitter time 321
## 6 us_twitter lol 260
## 7 us_twitter follow 225
## 8 us_twitter happy 207
## 9 us_twitter people 198
## 10 us_twitter tonight 194
Note that the words ‘time’, ‘day’, and ‘people’ appear in the top ten of all three datasets.
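A quick way to confirm that overlap programmatically from the ranked table above (a sketch; output omitted):
us_ranked_top_10 %>%
group_by(word) %>%
filter(n_distinct(source) == 3) %>%
pull(word) %>%
unique() # words that appear in all three top-ten lists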
We can use spread() and gather() from the tidyr package to reshape the data frame into just the form we need for plotting and comparing the three sources of data.
us_ranked_complete <- us_no_stopwords %>%
group_by(source) %>%
count(word, sort=TRUE)
head(us_ranked_complete)
## # A tibble: 6 x 3
## # Groups: source [3]
## source word n
## <fctr> <chr> <int>
## 1 us_blogs time 1018
## 2 us_twitter im 644
## 3 us_blogs people 639
## 4 us_blogs day 539
## 5 us_blogs love 516
## 6 us_news time 514
I am going to treat the us_news source as the most formal use of English. In this next part we’ll see how close blogs and tweets come to that formal English. Note how I keep the us_news proportions as their own column while gathering blogs and tweets into a single column.
us_gathered <- us_ranked_complete %>%
mutate(proportion = n / sum(n)) %>% # Calculate the proportion
select(-n) %>% # remove the count
spread(source, proportion) %>% # Spread out so I can select the base work (us_news)
select(word, us_news, everything()) %>% # Re-order to put us_news in 2nd column
gather(source, proportion, us_blogs:us_twitter) # Gather so that the us_blogs and us_twitter proportions can be compared to us_news
head(us_gathered)
## # A tibble: 6 x 4
## word us_news source proportion
## <chr> <dbl> <chr> <dbl>
## 1 à NA us_blogs 2.580212e-05
## 2 â 6.491567e-06 us_blogs NA
## 3 aa 1.947470e-05 us_blogs 4.515372e-05
## 4 aaa 3.894940e-05 us_blogs NA
## 5 aaaa 6.491567e-06 us_blogs NA
## 6 aaaaaand NA us_blogs 6.450531e-06
There’s a lot going on here, but this is the format we need for the next chart.
This plot compares word frequencies in us_news with those in us_blogs and us_twitter. Words that fall further from the diagonal are more distinctive of that particular data source relative to us_news. For instance, the word ‘blog’ occurs more frequently in us_blogs than in us_news, and the word ‘twitter’ appears more frequently in the us_twitter data than in us_news.
I like this plot because it lets you compare how distinctive each data source’s vocabulary is relative to the others.
ggplot(us_gathered, aes(x = proportion, y = us_news, color = abs(us_news - proportion))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
facet_wrap(~source, ncol = 2) +
theme(legend.position="none") +
labs(y = "us_news", x = NULL)
## Warning: Removed 79272 rows containing missing values (geom_point).
## Warning: Removed 79272 rows containing missing values (geom_text).
You can see that the blogs panel is denser along the diagonal than the tweets panel, which suggests that tweets are a less formal form of communication than blogs. The correlation tests below quantify this.
cor.test(data = us_gathered[us_gathered$source == "us_blogs",],
~ proportion + us_news)
##
## Pearson's product-moment correlation
##
## data: proportion and us_news
## t = 120.59, df = 14397, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7006603 0.7169143
## sample estimates:
## cor
## 0.7088814
cor.test(data = us_gathered[us_gathered$source == "us_twitter",],
~ proportion + us_news)
##
## Pearson's product-moment correlation
##
## data: proportion and us_news
## t = 54.163, df = 8259, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4958722 0.5277011
## sample estimates:
## cor
## 0.5119624
From this we see that the word-usage proportions in blogs correlate with those in news at about 0.71, while tweets correlate at about 0.51, so blog vocabulary usage is noticeably closer to news than tweet usage is.
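If we wanted the literal share of overlapping vocabulary rather than the correlation of usage proportions, a sketch along these lines would do it (the vocab() helper is mine, not part of the analysis above):
# Fraction of each source's unique words that also occur in us_news.
vocab <- function(src) unique(us_no_stopwords$word[us_no_stopwords$source == src])
mean(vocab("us_blogs") %in% vocab("us_news"))
mean(vocab("us_twitter") %in% vocab("us_news"))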
I hope you enjoyed this exploratory analysis of three types of written communication originating in the US.
In this project I have:
1. Produced basic summaries of the three files: word counts, line counts, and basic data tables.
2. Made basic plots, such as bar charts and scatterplots, to illustrate features of the data.
3. Written in a brief, concise style that a non-data-scientist manager can appreciate.