Summary

This is the Week 2 Milestone Report for the Coursera Data Science Capstone course. Its aim is to explore the SwiftKey data sets before building a prediction algorithm and a Shiny app. Using tidytext as its main text-mining library, the report summarizes the main features of the training data, such as the most frequent content words in each type of text (blogs, news, Twitter) and the most frequent bigrams across the three genres.

Load and Prepare Data

We read in the data in a “tidy” format using the vroom library. At this stage we want to keep the information about the text type (blogs, news, Twitter), which we extract from the file name.

## read in blogs, twitter, and news data
library(vroom)
library(stringr)
library(dplyr)

list_of_files <- list.files(path = "./final/en_US/", recursive = TRUE, pattern = "\\.txt$", full.names = TRUE)
## read each file line by line; "FileName" records which file each line came from
docs.df <- vroom(list_of_files, id = "FileName", delim = "\n", col_names = FALSE)
## strip the path and extension so that only the type (blogs/news/twitter) is left
docs.df$FileName <- str_remove_all(docs.df$FileName, "^\\./final/en_US//en_US\\.")
docs.df$FileName <- str_remove_all(docs.df$FileName, "\\.txt$")
docs.tidy <- docs.df %>% rename(text = X1, type = FileName) %>% mutate(type = factor(type))

Summary and Line Counts

This summary gives a general idea of our data set: it comprises more than 3 million lines, most of them coming from Twitter (where, however, the lines contain fewer words, as we shall see below).

docs.tidy
## Warning: One or more parsing issues, see `problems()` for details
## # A tibble: 3,322,964 × 2
##    type  text                                                                   
##    <fct> <chr>                                                                  
##  1 blogs In the years thereafter, most of the Oil fields and platforms were nam…
##  2 blogs We love you Mr. Brown.                                                 
##  3 blogs Chad has been awesome with the kids and holding down the fort while I …
##  4 blogs so anyways, i am going to share some home decor inspiration that i hav…
##  5 blogs With graduation season right around the corner, Nancy has whipped up a…
##  6 blogs If you have an alternative argument, let's hear it! :)                 
##  7 blogs If I were a bear,                                                      
##  8 blogs Other friends have similar stories, of how they were treated brusquely…
##  9 blogs Although our beloved Cantab can’t claim the international recognition …
## 10 blogs Peter Schiff: Hard to tell. It will look pretty bad for most Americans…
## # … with 3,322,954 more rows
summary(docs.tidy)
##       type             text          
##  blogs  : 681062   Length:3322964    
##  news   : 628797   Class :character  
##  twitter:2013105   Mode  :character
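
The same per-type line counts can also be obtained directly with dplyr; a minimal sketch (not part of the chunk above):

## count lines per text type
docs.tidy %>% count(type, name = "lines")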

Sample data

As our training set, we shall use only a part (1/5) of the data set. We add doc IDs to the data frame so that we can count the number of tokens per document after tokenization.

set.seed(1234)
training <- docs.tidy %>% group_by(type) %>% sample_frac(0.2) %>% mutate(doc = row_number(), .after = type) %>% ungroup()
training
## # A tibble: 664,592 × 3
##    type    doc text                                                             
##    <fct> <int> <chr>                                                            
##  1 blogs     1 "Fast Triggering Speed Up to 12 frames per second."              
##  2 blogs     2 "I should have advertised this recipe on TV."                    
##  3 blogs     3 "— My foot is about one inch long (about 2½ cm). I probably have…
##  4 blogs     4 "You listening to Gotye again?"                                  
##  5 blogs     5 "One of the things we tend to shy away from when we \"eat health…
##  6 blogs     6 "And like every great writer, having dedicated so much of her li…
##  7 blogs     7 "it reminded me of many of alexandria's beautiful and thoughtful…
##  8 blogs     8 "Thank you very much for visiting this blog. It has been close t…
##  9 blogs     9 "While I enjoyed the mudbugs…I must say that I feel like I got t…
## 10 blogs    10 "Mix everything together and put into a large greased baking tin…
## # … with 664,582 more rows

General properties

We can have a look at the general properties of our training set. On average, tweets contain fewer words than blogs and news; blogs and news are comparable in terms of mean and median, but news have a smaller interquartile range (IQR).

## tokenize data
library(tidytext)
words.training <- training %>% unnest_tokens(word, text) 

## calculate document length for each type 
doc_count <- words.training %>% group_by(type, doc) %>%
  summarize(n = n()) %>% select(-doc)

## summary stat by group
doc_count %>%
  group_by(type) %>% 
  summarize(min = min(n),
            q1 = quantile(n, 0.25),
            median = median(n),
            mean = mean(n),
            q3 = quantile(n, 0.75),
            max = max(n))
## # A tibble: 3 × 7
##   type      min    q1 median  mean    q3   max
##   <fct>   <int> <dbl>  <dbl> <dbl> <dbl> <int>
## 1 blogs       1     9     28  54.6    60 44380
## 2 news        1    19     32  53.2    46 16921
## 3 twitter     1     7     12  15.1    18 36156
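
The same pattern can be seen in a plot; a minimal sketch with ggplot2 (a boxplot of tokens per document on a log scale, assuming doc_count from the chunk above; this figure is not part of the original output):

library(ggplot2)
## distribution of document lengths per type; the log scale tames the long tail
doc_count %>% ggplot(aes(x = type, y = n, fill = type)) +
  geom_boxplot(show.legend = FALSE) +
  scale_y_log10() +
  labs(x = NULL, y = "tokens per document (log scale)")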

Unigrams Analysis

Before plotting the most frequent words in our corpus, we want to remove stop words and non-word characters. The barplots show the most frequent tokens per genre (type); the wordcloud shows the most frequent tokens across all three data sets.

word.counts <- words.training %>% select(-doc) %>% group_by(type) %>% count(word, sort = TRUE) %>% ungroup()
## remove stop-words
data("stop_words")
word.content <- word.counts %>% anti_join(stop_words, by = "word")
## remove non-word characters
non.word <- grepl("[^a-zA-Z']", word.content$word)
word.clean <- word.content[!non.word, ]

library(ggplot2)
word.clean %>% group_by(type) %>% arrange(desc(n)) %>% top_n(15) %>% ungroup() %>%
  ggplot(aes(x = reorder_within(word, n, type), y = n, fill = type)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~type, scales = "free") +
  ylab(NULL) + xlab("most frequent content words") +
  scale_x_reordered() + coord_flip() +
  theme(axis.text.x = element_blank())

library(wordcloud)
wordcloud(words = word.clean$word, freq = word.clean$n, 
          max.words = 100, random.order = TRUE, 
          rot.per = .1, vfont=c("serif","plain"))

Bigrams Tokenization and Count

To build a prediction algorithm, we shall need to analyze combinations of words, such as bigrams or trigrams. The code below finds the most common bigrams in our corpus (this time, we do not remove stop words, as they need to be predicted too).

bigrams.training <- training %>% unnest_tokens(bigram, text, token = "ngrams", n=2)
bigrams.count <- bigrams.training %>% count(bigram, sort = TRUE)
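
The same tokenization step extends naturally to trigrams; a minimal sketch (these counts are not used in the plot below):

## tokenize into trigrams and count them, analogous to the bigram step above
trigrams.training <- training %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
trigrams.count <- trigrams.training %>% count(trigram, sort = TRUE)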

Bigrams Network Analysis

We can visualize our most common bigrams (mainly function words) using the igraph and ggraph libraries. The arrows indicate the most common connections in our data, which gives us an initial idea of the types of connections we need to predict.

library(igraph)
library(tidyr)
library(ggraph)
bigram_graph <- bigrams.count %>%
  filter(n > 20000) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  graph_from_data_frame()

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n),
                 arrow = grid::arrow(type = "closed", length = unit(2, "mm")),
                 end_cap = circle(2, "mm"), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()