Coursera Data Science Capstone Course

Week 2 Assignment

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.

  2. Create a basic report of summary statistics about the data sets.

  3. Report any interesting findings that you amassed so far.

  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.


# packages
library(feather)
library(readr)
library(tidytext)
library(tidyr)
library(dplyr)
library(ggplot2)
library(stringr)

# global settings
data("stop_words")   # tidytext stop word list
n <- 20000           # sample size per source
set.seed(143)

# read data (the raw text files were converted to feather once, for faster loading in the .Rmd)
blogs <- read_feather("~/Data_Science/Capstone/blogs.feather")
news <- read_feather("~/Data_Science/Capstone/news.feather")
tweets <- read_feather("~/Data_Science/Capstone/tweets.feather")

# profanity list from Shutterstock's "List of Dirty, Naughty, Obscene, and Otherwise Bad Words" GitHub repo
profanity <- read_csv("~/Data_Science/Capstone/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words-master/en", 
                      col_names = FALSE)
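
For reference, the one-time conversion from the raw text files to feather could look roughly like the sketch below. The en_US.blogs.txt file name is an assumption based on the SwiftKey corpus naming, and convert_to_feather is an illustrative helper; only the feather files are read in this report.

convert_to_feather <- function(txt_path, feather_path) {
    # read_lines() keeps one document per vector element; wrapping the vector in a
    # tibble with a single column named X1 matches the structure used above
    lines <- read_lines(txt_path)
    write_feather(tibble::tibble(X1 = lines), feather_path)
}
convert_to_feather("~/Data_Science/Capstone/en_US.blogs.txt",
                   "~/Data_Science/Capstone/blogs.feather")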

Summary of Data Sets

Below is a summary of the data sets. We look at the number of rows from each data source to get a sense of their relative sizes. We then compute summary statistics on the character and word counts of each observation within each source. This helps us understand how uniform (or not) the data sets are. We would like a sense of both the within-source uniformity and the between-source uniformity; this summary serves the former, while subsequent sections begin to explore the latter.

Obviously, even without looking at the statistics, the twitter data set is going to be quite different from the other two, given the character limit Twitter itself imposes on tweets. Still, the statistics below bear this out. One question to consider is whether these differences are accretive to the eventual predictive model, or whether they add variance that will confuse it. In this respect, it is important to consider the overall purpose of the text prediction engine. If it were primarily for mobile texting, the twitter set would likely be a very good source of data, perhaps even better than the blogs and news sets. If the overall objective is a model for general use, then sampling from disparate sources makes sense.


bind_rows(
  mutate(tweets, source = "twitter"),
  mutate(news, source = "news"),
  mutate(blogs, source = "blogs")) %>% 
    mutate(n_char = nchar(X1),                    # characters per observation
           n_word = str_count(X1, '\\w+')) %>%    # words per observation
    group_by(source) %>% 
    summarise(nrow        = formatC(n(), format = "d", big.mark = ","),
              nchar_avg   = round(mean(n_char), 2),
              nchar_med   = median(n_char),
              nchar_sd    = round(sd(n_char), 2),
              nchar_sum   = formatC(sum(n_char), format = "d", big.mark = ","),
              word_avg    = round(mean(n_word), 2),
              word_median = median(n_word),
              word_sum    = formatC(sum(n_word), format = "d", big.mark = ","),
              word_sd     = round(sd(n_word), 2)) %>% 
    knitr::kable()
| source  | nrow      | nchar_avg | nchar_med | nchar_sd | nchar_sum   | word_avg | word_median | word_sum   | word_sd |
|---------|-----------|-----------|-----------|----------|-------------|----------|-------------|------------|---------|
| blogs   | 878,689   | 235.42    | 156       | 329.75   | 206,857,428 | 43.60    | 29          | 38,309,620 | 60.51   |
| news    | 1,000,107 | 203.20    | 185       | 144.60   | 203,225,578 | 35.62    | 32          | 35,624,454 | 25.36   |
| twitter | 2,342,938 | 69.19     | 64        | 52.00    | 162,112,737 | 13.23    | 12          | 31,003,501 | 9.82    |

Word-Level Analysis

We start with a word-level analysis, using simple counts to capture word frequency via plots.

word_tweets <- tweets %>% 
    anti_join(profanity) %>%    # drops rows whose entire text matches a profanity entry (joins by X1)
    sample_n(n) %>%             # random sample of n = 20,000 rows
    unnest_tokens(word, X1)     # tokenize: one lowercase word per row
word_news <- news %>% 
    anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(word, X1)
word_blogs <- blogs %>% 
    anti_join(profanity) %>% 
    sample_n(n) %>%  
    unnest_tokens(word, X1)


# collect the three tokenized sources in a named list for use with lapply
word_list <- list(tweets = word_tweets, 
                  news = word_news, 
                  blogs = word_blogs)

# count word frequencies, most frequent first
word_count <- function(words){ words %>% 
        count(word, sort = TRUE) }

word_freq_list <- lapply(word_list, word_count)

# bar chart of the 10 most frequent words in a frequency table
quick_word_plot <- function(words){ words %>% 
        head(10) %>% 
        ggplot(aes(x = reorder(word, n), y = n, fill = factor(word))) + 
        geom_bar(stat = "identity") + 
        labs(title = "Word Frequency", subtitle = "for 10 Most Used", x = "word") +
        theme(legend.position = "none")}

lapply(word_freq_list, quick_word_plot)
[Plots: frequency of the 10 most used words, one panel each for tweets, news, and blogs.]

We can do a quick check to see how the ten most frequent words overlap across the three data sets.

intersect(word_freq_list$tweets[1:10,1], word_freq_list$news[1:10,1])
## # A tibble: 8 × 1
##    word
##   <chr>
## 1   the
## 2    to
## 3   and
## 4     a
## 5    of
## 6    in
## 7   for
## 8    is
intersect(word_freq_list$tweets[1:10,1], word_freq_list$blogs[1:10,1])
## # A tibble: 8 × 1
##    word
##   <chr>
## 1   the
## 2   and
## 3    to
## 4     a
## 5    of
## 6     i
## 7    in
## 8    is
intersect(word_freq_list$news[1:10,1], word_freq_list$blogs[1:10,1])
## # A tibble: 8 × 1
##    word
##   <chr>
## 1   the
## 2   and
## 3    to
## 4     a
## 5    of
## 6    in
## 7  that
## 8    is
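
We could also check which of these top-ten words are shared by all three sources at once; a small sketch using base R's Reduce() over the same top-ten slices:

# words appearing in the top ten of all three sources
Reduce(intersect, list(word_freq_list$tweets[1:10, 1],
                       word_freq_list$news[1:10, 1],
                       word_freq_list$blogs[1:10, 1]))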

How many words would it take to capture 95% of our sample?

# How many words make up 95% of our sample?
cum_prop <- function(words){ words %>% 
        mutate(prop = n / sum(n),                    # each word's share of all tokens
               cumsum = round(cumsum(prop), 2)) %>%  # running total, rounded to 2 decimals
        arrange(desc(n)) }
word_cum_list <- lapply(word_freq_list, cum_prop)
# index of the first word at which the rounded cumulative proportion reaches 0.95
lapply(word_cum_list, function(wrds){ which.max(wrds$cumsum == .95) })
## $tweets
## [1] 9729
## 
## $news
## [1] 15235
## 
## $blogs
## [1] 12866
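
One caveat on the lookup above: which.max(cumsum == .95) depends on the rounded cumulative proportion landing exactly on 0.95. A slightly more robust sketch (coverage_n is a hypothetical helper, not part of the report's pipeline) counts the rows needed to reach at least 95% coverage:

coverage_n <- function(freqs, target = 0.95) {
    # exact cumulative proportion over the frequency-sorted counts
    cum <- cumsum(freqs$n) / sum(freqs$n)
    # first row at or past the coverage target
    which(cum >= target)[1]
}
lapply(word_freq_list, coverage_n)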

Bi-Grams

We repeat the same analysis for bi-grams. Note that removing stop words would be counterproductive here, since a predictive text model will need to predict stop words.

bi_tweets <- tweets %>% 
        anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(ngram, X1, token = "ngrams", n = 2)
## Joining, by = "X1"
bi_news <- news %>% 
        anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(ngram, X1, token = "ngrams", n = 2)
## Joining, by = "X1"
bi_blogs <- blogs %>% 
        anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(ngram, X1, token = "ngrams", n = 2)
## Joining, by = "X1"
# collect the three bi-gram sources in a list to ease computation with lapply
bi_list <- list(tweets = bi_tweets, news = bi_news, blogs = bi_blogs)

# count n-gram frequencies, most frequent first (stop words deliberately kept)
n_count <- function(words){ words %>% 
        #anti_join(stop_words) %>% 
        count(ngram, sort = TRUE) }
bi_freq_list <- lapply(bi_list, n_count)

# bar chart of the 10 most frequent n-grams
quick_n_plot <- function(words){ words %>% 
        head(10) %>% 
        ggplot(aes(x=reorder(ngram,n), y = n, fill = factor(ngram))) + 
        geom_bar(stat = "identity") + 
        labs(title = "N-Gram Frequency", subtitle = "for 10 Most Used", x = "ngram") +
        theme(legend.position = "none",
              axis.text.x = element_text(angle = 45, hjust = 1))}
lapply(bi_freq_list, quick_n_plot)
[Plots: frequency of the 10 most used bi-grams, one panel each for tweets, news, and blogs.]

#how many bigrams make up 95% of our sample
bi_cum_list <- lapply(bi_freq_list, cum_prop )
lapply(bi_cum_list, function(wrds){which.max(wrds$cumsum==.95)})
## $tweets
## [1] 133013
## 
## $news
## [1] 322269
## 
## $blogs
## [1] 344941

A quick look now at where these three sources overlap.

intersect(bi_freq_list$tweets[1:10,1], bi_freq_list$news[1:10,1])
## # A tibble: 7 × 1
##     ngram
##     <chr>
## 1  of the
## 2  in the
## 3  to the
## 4 for the
## 5  on the
## 6  at the
## 7   to be
intersect(bi_freq_list$tweets[1:10,1], bi_freq_list$blogs[1:10,1])
## # A tibble: 6 × 1
##     ngram
##     <chr>
## 1  of the
## 2  in the
## 3  to the
## 4  on the
## 5   to be
## 6 for the
intersect(bi_freq_list$news[1:10,1], bi_freq_list$blogs[1:10,1])
## # A tibble: 7 × 1
##     ngram
##     <chr>
## 1  of the
## 2  in the
## 3  to the
## 4  on the
## 5   to be
## 6 and the
## 7 for the

Tri-Grams

tri_tweets <- tweets %>% 
        anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(ngram, X1, token = "ngrams", n = 3)
## Joining, by = "X1"
tri_news <- news %>% 
        anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(ngram, X1, token = "ngrams", n = 3)
## Joining, by = "X1"
tri_blogs <- blogs %>% 
        anti_join(profanity) %>% 
    sample_n(n) %>% 
    unnest_tokens(ngram, X1, token = "ngrams", n = 3)
## Joining, by = "X1"
tri_list <- list(tweets= tri_tweets, news = tri_news, blogs = tri_blogs)


tri_freq_list <- lapply(tri_list, n_count)

lapply(tri_freq_list, quick_n_plot)
[Plots: frequency of the 10 most used tri-grams, one panel each for tweets, news, and blogs.]

A look at the overlapping tri-grams. These results suggest that blogs and news have a lot of overlap, whereas twitter brings something new to the data sets.

intersect(tri_freq_list$tweets[1:10,1], tri_freq_list$news[1:10,1])
## # A tibble: 2 × 1
##         ngram
##         <chr>
## 1    a lot of
## 2 going to be
intersect(tri_freq_list$tweets[1:10,1],tri_freq_list$blogs[1:10,1])
## # A tibble: 1 × 1
##      ngram
##      <chr>
## 1 a lot of
intersect(tri_freq_list$news[1:10,1],tri_freq_list$blogs[1:10,1])
## # A tibble: 5 × 1
##         ngram
##         <chr>
## 1  one of the
## 2    a lot of
## 3     to be a
## 4  as well as
## 5 some of the

How many tri-grams would we need to capture 95% of the sample? Again, we take the cumulative proportion and see how many records it takes to cover 95% of the content.

tri_cum_list <- lapply(tri_freq_list, cum_prop )
lapply(tri_cum_list, function(wrds){which.max(wrds$cumsum==.95)})
## $tweets
## [1] 216007
## 
## $news
## [1] 570018
## 
## $blogs
## [1] 671661

Conclusions and Next Steps

Next, I plan to investigate the use of Markov chains in an n-gram predictive model. The data sets appear to be sufficiently varied to build a predictive model with a comprehensive n-gram dictionary at its base.
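
To make that plan concrete, below is a minimal sketch of how a Markov-chain-style lookup could work, built from the bi-gram counts already computed above. The predict_next helper and the choice of the blogs table are illustrative assumptions rather than the final design; the eventual model would combine all three sources and back off across n-gram orders.

# split each bi-gram into (first word, second word) and estimate P(w2 | w1) from counts
bigram_model <- bi_freq_list$blogs %>%                   # columns: ngram, n
    separate(ngram, into = c("w1", "w2"), sep = " ") %>% 
    group_by(w1) %>% 
    mutate(prob = n / sum(n)) %>% 
    ungroup()

# return the k most frequent observed continuations of the last typed word
predict_next <- function(word, model = bigram_model, k = 3) {
    model %>% 
        filter(w1 == tolower(word)) %>% 
        arrange(desc(prob)) %>% 
        head(k) %>% 
        pull(w2)
}

predict_next("going")    # "to" is a plausible top candidate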