Coursera Data Science Capstone Course

Week 2 Assignment

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.

  2. Create a basic report of summary statistics about the data sets.

  3. Report any interesting findings that you amassed so far.

  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.


# packages
library(feather)
library(readr)
library(tidytext)
library(tidyr)
library(dplyr)
library(ggplot2)
library(stringr)

# global settings
data("stop_words")   # tidytext stop word list
n <- 20000           # sample size per source
set.seed(143)

# read data (the raw text files were converted to feather once, for faster loading in the .Rmd)
blogs <- read_feather("~/Data_Science/Capstone/blogs.feather")
news <- read_feather("~/Data_Science/Capstone/news.feather")
tweets <- read_feather("~/Data_Science/Capstone/tweets.feather")

# profanity list from Shutterstock's "List of Dirty, Naughty, Obscene, and Otherwise Bad Words" GitHub repo
profanity <- read_csv("~/Data_Science/Capstone/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words-master/en", 
                      col_names = FALSE)
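
For reference, the one-time conversion from the raw text files to feather could look roughly like the sketch below. The en_US.blogs.txt file name is an assumption based on the SwiftKey corpus naming, and convert_to_feather is an illustrative helper; only the feather files are read in this report.

convert_to_feather <- function(txt_path, feather_path) {
    # read_lines() keeps one document per vector element; wrapping the vector in a
    # tibble with a single column named X1 matches the structure used above
    lines <- read_lines(txt_path)
    write_feather(tibble::tibble(X1 = lines), feather_path)
}
convert_to_feather("~/Data_Science/Capstone/en_US.blogs.txt",
                   "~/Data_Science/Capstone/blogs.feather")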

Summary of Data Sets

Below is a summary of the data sets. We look at the number of rows from each data source to get a sense of their relative sizes. We then compute summary statistics on the character and word counts of each observation within each source. This helps us understand how uniform (or not) the data sets are. We would like a sense of both the within-source uniformity and the between-source uniformity; this summary serves the former, while subsequent sections begin to explore the latter.

Obviously, even without looking at the statistics, the twitter data set is going to be quite different from the other two, given the character limit Twitter itself imposes on tweets. Still, the statistics below bear this out. One question to consider is whether these differences are accretive to the eventual predictive model, or whether they add variance that will confuse it. In this respect, it is important to consider the overall purpose of the text prediction engine. If it were primarily for mobile texting, the twitter set would likely be a very good source of data, perhaps even better than the blogs and news sets. If the overall objective is a model for general use, then sampling from disparate sources makes sense.


bind_rows(
  mutate(tweets, source = "twitter"),
  mutate(news, source = "news"),
  mutate(blogs, source = "blogs")) %>% 
    mutate(n_char = nchar(X1),                    # characters per observation
           n_word = str_count(X1, '\\w+')) %>%    # words per observation
    group_by(source) %>% 
    summarise(nrow        = formatC(n(), format = "d", big.mark = ","),
              nchar_avg   = round(mean(n_char), 2),
              nchar_med   = median(n_char),
              nchar_sd    = round(sd(n_char), 2),
              nchar_sum   = formatC(sum(n_char), format = "d", big.mark = ","),
              word_avg    = round(mean(n_word), 2),
              word_median = median(n_word),
              word_sum    = formatC(sum(n_word), format = "d", big.mark = ","),
              word_sd     = round(sd(n_word), 2)) %>% 
    knitr::kable()
| source  | nrow      | nchar_avg | nchar_med | nchar_sd | nchar_sum   | word_avg | word_median | word_sum   | word_sd |
|---------|-----------|-----------|-----------|----------|-------------|----------|-------------|------------|---------|
| blogs   | 878,689   | 235.42    | 156       | 329.75   | 206,857,428 | 43.60    | 29          | 38,309,620 | 60.51   |
| news    | 1,000,107 | 203.20    | 185       | 144.60   | 203,225,578 | 35.62    | 32          | 35,624,454 | 25.36   |
| twitter | 2,342,938 | 69.19     | 64        | 52.00    | 162,112,737 | 13.23    | 12          | 31,003,501 | 9.82    |

Word-Level Analysis

We start with a word-level analysis, using simple counts to capture word frequency via plots.

word_tweets <- tweets %>% 
    anti_join(profanity) %>%    # drops rows whose entire text matches a profanity entry (joins by X1)
    sample_n(n) %>%             # random sample of n = 20,000 rows
    unnest_tokens(word, X1)     # tokenize: one lowercase word per row
word_news <- news %>% 
    anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(word, X1)
word_blogs <- blogs %>% 
    anti_join(profanity) %>% 
    sample_n(n) %>%  
    unnest_tokens(word, X1)


# collect the three tokenized sources in a named list for use with lapply
word_list <- list(tweets = word_tweets, 
                  news = word_news, 
                  blogs = word_blogs)

# count word frequencies, most frequent first
word_count <- function(words){ words %>% 
        count(word, sort = TRUE) }

word_freq_list <- lapply(word_list, word_count)

# bar chart of the 10 most frequent words in a frequency table
quick_word_plot <- function(words){ words %>% 
        head(10) %>% 
        ggplot(aes(x = reorder(word, n), y = n, fill = factor(word))) + 
        geom_bar(stat = "identity") + 
        labs(title = "Word Frequency", subtitle = "for 10 Most Used", x = "word") +
        theme(legend.position = "none")}

lapply(word_freq_list, quick_word_plot)
[Plots: frequency of the 10 most used words, one panel each for tweets, news, and blogs.]

We can do a quick check to see how the ten most frequent words overlap across the three data sets.

intersect(word_freq_list$tweets[1:10,1], word_freq_list$news[1:10,1])
## # A tibble: 8 × 1
##    word
##   <chr>
## 1   the
## 2    to
## 3   and
## 4     a
## 5    of
## 6    in
## 7   for
## 8    is
intersect(word_freq_list$tweets[1:10,1], word_freq_list$blogs[1:10,1])
## # A tibble: 8 × 1
##    word
##   <chr>
## 1   the
## 2   and
## 3    to
## 4     a
## 5    of
## 6     i
## 7    in
## 8    is
intersect(word_freq_list$news[1:10,1], word_freq_list$blogs[1:10,1])
## # A tibble: 8 × 1
##    word
##   <chr>
## 1   the
## 2   and
## 3    to
## 4     a
## 5    of
## 6    in
## 7  that
## 8    is
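
We could also check which of these top-ten words are shared by all three sources at once; a small sketch using base R's Reduce() over the same top-ten slices:

# words appearing in the top ten of all three sources
Reduce(intersect, list(word_freq_list$tweets[1:10, 1],
                       word_freq_list$news[1:10, 1],
                       word_freq_list$blogs[1:10, 1]))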

How many words would it take to capture 95% of our sample?

# How many words make up 95% of our sample?
cum_prop <- function(words){ words %>% 
        mutate(prop = n / sum(n),                    # each word's share of all tokens
               cumsum = round(cumsum(prop), 2)) %>%  # running total, rounded to 2 decimals
        arrange(desc(n)) }
word_cum_list <- lapply(word_freq_list, cum_prop)
# index of the first word at which the rounded cumulative proportion reaches 0.95
lapply(word_cum_list, function(wrds){ which.max(wrds$cumsum == .95) })
## $tweets
## [1] 9729
## 
## $news
## [1] 15235
## 
## $blogs
## [1] 12866
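
One caveat on the lookup above: which.max(cumsum == .95) depends on the rounded cumulative proportion landing exactly on 0.95. A slightly more robust sketch (coverage_n is a hypothetical helper, not part of the report's pipeline) counts the rows needed to reach at least 95% coverage:

coverage_n <- function(freqs, target = 0.95) {
    # exact cumulative proportion over the frequency-sorted counts
    cum <- cumsum(freqs$n) / sum(freqs$n)
    # first row at or past the coverage target
    which(cum >= target)[1]
}
lapply(word_freq_list, coverage_n)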

Bi-Grams

We repeat the same analysis for bi-grams. Note that removing stop words would be counterproductive here, since a predictive text model will need to predict stop words.

bi_tweets <- tweets %>% 
        anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(ngram, X1, token = "ngrams", n = 2)
## Joining, by = "X1"
bi_news <- news %>% 
        anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(ngram, X1, token = "ngrams", n = 2)
## Joining, by = "X1"
bi_blogs <- blogs %>% 
        anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(ngram, X1, token = "ngrams", n = 2)
## Joining, by = "X1"
# collect the three bi-gram sources in a list to ease computation with lapply
bi_list <- list(tweets = bi_tweets, news = bi_news, blogs = bi_blogs)

# count n-gram frequencies, most frequent first (stop words deliberately kept)
n_count <- function(words){ words %>% 
        #anti_join(stop_words) %>% 
        count(ngram, sort = TRUE) }
bi_freq_list <- lapply(bi_list, n_count)

# bar chart of the 10 most frequent n-grams
quick_n_plot <- function(words){ words %>% 
        head(10) %>% 
        ggplot(aes(x=reorder(ngram,n), y = n, fill = factor(ngram))) + 
        geom_bar(stat = "identity") + 
        labs(title = "N-Gram Frequency", subtitle = "for 10 Most Used", x = "ngram") +
        theme(legend.position = "none",
              axis.text.x = element_text(angle = 45, hjust = 1))}
lapply(bi_freq_list, quick_n_plot)
[Plots: frequency of the 10 most used bi-grams, one panel each for tweets, news, and blogs.]

#how many bigrams make up 95% of our sample
bi_cum_list <- lapply(bi_freq_list, cum_prop )
lapply(bi_cum_list, function(wrds){which.max(wrds$cumsum==.95)})
## $tweets
## [1] 133013
## 
## $news
## [1] 322269
## 
## $blogs
## [1] 344941

A quick look now at where these three sources overlap.

intersect(bi_freq_list$tweets[1:10,1], bi_freq_list$news[1:10,1])
## # A tibble: 7 × 1
##     ngram
##     <chr>
## 1  of the
## 2  in the
## 3  to the
## 4 for the
## 5  on the
## 6  at the
## 7   to be
intersect(bi_freq_list$tweets[1:10,1], bi_freq_list$blogs[1:10,1])
## # A tibble: 6 × 1
##     ngram
##     <chr>
## 1  of the
## 2  in the
## 3  to the
## 4  on the
## 5   to be
## 6 for the
intersect(bi_freq_list$news[1:10,1], bi_freq_list$blogs[1:10,1])
## # A tibble: 7 × 1
##     ngram
##     <chr>
## 1  of the
## 2  in the
## 3  to the
## 4  on the
## 5   to be
## 6 and the
## 7 for the

Tri-Grams

tri_tweets <- tweets %>% 
        anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(ngram, X1, token = "ngrams", n = 3)
## Joining, by = "X1"
tri_news <- news %>% 
        anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(ngram, X1, token = "ngrams", n = 3)
## Joining, by = "X1"
tri_blogs <- blogs %>% 
        anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(ngram, X1, token = "ngrams", n = 3)
## Joining, by = "X1"
tri_list <- list(tweets= tri_tweets, news = tri_news, blogs = tri_blogs)


tri_freq_list <- lapply(tri_list, n_count)

lapply(tri_freq_list, quick_n_plot)
[Plots: frequency of the 10 most used tri-grams, one panel each for tweets, news, and blogs.]

A look at the overlapping tri-grams. These results suggest that blogs and news have a lot of overlap, whereas twitter brings something new to the data sets.

intersect(tri_freq_list$tweets[1:10,1], tri_freq_list$news[1:10,1])
## # A tibble: 2 × 1
##         ngram
##         <chr>
## 1    a lot of
## 2 going to be
intersect(tri_freq_list$tweets[1:10,1],tri_freq_list$blogs[1:10,1])
## # A tibble: 1 × 1
##      ngram
##      <chr>
## 1 a lot of
intersect(tri_freq_list$news[1:10,1],tri_freq_list$blogs[1:10,1])
## # A tibble: 5 × 1
##         ngram
##         <chr>
## 1  one of the
## 2    a lot of
## 3     to be a
## 4  as well as
## 5 some of the

How many tri-grams would we need to capture 95% of the sample? Again, we take the cumulative proportion and see how many records it takes to cover 95% of the content.

tri_cum_list <- lapply(tri_freq_list, cum_prop )
lapply(tri_cum_list, function(wrds){which.max(wrds$cumsum==.95)})
## $tweets
## [1] 216007
## 
## $news
## [1] 570018
## 
## $blogs
## [1] 671661

Conclusions and Next Steps

Next, I plan to investigate the use of Markov chains in an n-gram predictive model. The data sets appear to be sufficiently varied to build a predictive model with a comprehensive n-gram dictionary at its base.
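
To make that plan concrete, below is a minimal sketch of how a Markov-chain-style lookup could work, built from the bi-gram counts already computed above. The predict_next helper and the choice of the blogs table are illustrative assumptions rather than the final design; the eventual model would combine all three sources and back off across n-gram orders.

# split each bi-gram into (first word, second word) and estimate P(w2 | w1) from counts
bigram_model <- bi_freq_list$blogs %>%                   # columns: ngram, n
    separate(ngram, into = c("w1", "w2"), sep = " ") %>% 
    group_by(w1) %>% 
    mutate(prob = n / sum(n)) %>% 
    ungroup()

# return the k most frequent observed continuations of the last typed word
predict_next <- function(word, model = bigram_model, k = 3) {
    model %>% 
        filter(w1 == tolower(word)) %>% 
        arrange(desc(prob)) %>% 
        head(k) %>% 
        pull(w2)
}

predict_next("going")    # "to" is a plausible top candidate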