Executive Summary

This milestone report presents a basic summary of the data and describes the major features of the blog, news, and Twitter text from the United States. The data sets were downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

Whether stopwords are kept depends on the purpose of the analysis: with stopwords, the results reflect day-to-day sentences; without stopwords, the most frequent terms identify themes and topics.

Load Data

# Read the raw English (US) text files as UTF-8, skipping embedded nulls
blog <- readLines("./en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
tweet <- readLines("./en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Exploratory Analysis

Summary table - Word Counts and Line Counts

##   Data_Source Word_Counts Line_Counts
## 1       Blogs    37334131      899288
## 2        News     2643969       77259
## 3     Twitter    30373583     2360148
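
The table above can be reproduced roughly as follows. This is a minimal sketch: the word-count rule (stri_count_words() from the stringi package) is an assumption, as any reasonable tokenization gives comparable figures.

library(stringi)

# Word and line counts for one source (assumes the blog/news/tweet vectors read above)
count_summary <- function(txt, source_name) {
  data.frame(Data_Source = source_name,
             Word_Counts = sum(stri_count_words(txt)),
             Line_Counts = length(txt))
}

rbind(count_summary(blog,  "Blogs"),
      count_summary(news,  "News"),
      count_summary(tweet, "Twitter"))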

What are ‘stopwords’?

##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
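
The 174 words listed above are a standard English stopword list. quanteda ships such a list, so a sketch like the following prints an equivalent set (the exact source used for this report is an assumption):

library(quanteda)

# Built-in English stopword list; the words shown above come from a list like this
head(stopwords("english"), 20)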

Sample the data and create a corpus from each source

library(quanteda)

set.seed(123)

# Sampling: 10% of blogs, 100% of news, 5% of tweets
sample.blog <- sample(blog, round(length(blog)*0.1), replace = FALSE)
corpus.blog <- corpus(sample.blog)

sample.news <- sample(news, round(length(news)*1), replace = FALSE)
corpus.news <- corpus(sample.news)

sample.tweet <- sample(tweet, round(length(tweet)*0.05), replace = FALSE)
corpus.tweet <- corpus(sample.tweet)

# Pre-process: convert all text to lower case
# (toLower() is from older quanteda releases; char_tolower() is the current equivalent)
corpus.blog <- toLower(corpus.blog)
corpus.news <- toLower(corpus.news)
corpus.tweet <- toLower(corpus.tweet)

Tokenize words - Blog
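
The console messages below come from building a document-feature matrix and removing stopword features. A rough equivalent under the current quanteda API is sketched here; the original run used an older quanteda release, so the function names and log format differ slightly. The same steps, with matching messages, are repeated for the News and Twitter corpora further down.

# Sketch: tokenize the blog corpus and build a dfm with the 174 stopwords removed
toks.blog <- tokens(corpus.blog, remove_punct = TRUE, remove_numbers = TRUE)
dfm.blog  <- dfm(toks.blog)
dfm.blog  <- dfm_remove(dfm.blog, pattern = stopwords("english"))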

## 
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 89,929 documents
##    ... indexing features: 104,551 feature types
##    ... created a 89929 x 104552 sparse dfm
##    ... complete. 
## Elapsed time: 11.03 seconds.
## removed 174 features, from 174 supplied (glob) feature types
## removed 591,399 features, from 174 supplied (glob) feature types
## removed 2,242,283 features, from 174 supplied (glob) feature types

Plot Charts - Blog
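
The frequency charts for the blog sample (not reproduced here) can be drawn directly from the dfm. A minimal sketch, assuming the dfm.blog object from the previous step; the News and Twitter charts follow the same pattern.

# Top 20 most frequent blog words after stopword removal
top.blog <- topfeatures(dfm.blog, n = 20)
barplot(rev(top.blog), horiz = TRUE, las = 1,
        main = "Most Frequent Words - Blogs", xlab = "Frequency")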

Tokenize words - News

## 
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 77,259 documents
##    ... indexing features: 92,147 feature types
##    ... created a 77259 x 92148 sparse dfm
##    ... complete. 
## Elapsed time: 6.22 seconds.
## removed 172 features, from 174 supplied (glob) feature types
## removed 442,785 features, from 174 supplied (glob) feature types
## removed 1,565,095 features, from 174 supplied (glob) feature types

Plot Charts - News

Tokenize words - Twitter

## 
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 118,007 documents
##    ... indexing features: 68,296 feature types
##    ... created a 118007 x 68297 sparse dfm
##    ... complete. 
## Elapsed time: 5.34 seconds.
## removed 173 features, from 174 supplied (glob) feature types
## removed 262,965 features, from 174 supplied (glob) feature types
## removed 799,857 features, from 174 supplied (glob) feature types

Plot Charts - Twitter

Combine texts from all sources and analyze 3-grams without removing stopwords

##                  trigram.top.all.dec.order.
## one_of_the                              152
## a_lot_of                                143
## going_to_be                              86
## to_be_a                                  79
## i_want_to                                75
## some_of_the                              75
## be_able_to                               73
## out_of_the                               69
## as_well_as                               68
## it_was_a                                 67
## the_end_of                               63
## part_of_the                              54
## thanks_for_the                           54
## according_to_the                         51
## a_couple_of                              49
## all_of_the                               49
## most_of_the                              48
## the_rest_of                              48
## the_fact_that                            47
## you_want_to                              46
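
The trigram counts above can be obtained along these lines. This is a sketch using the current quanteda API (tokens_ngrams()); the exact size of the combined sample behind the figures above is not restated here.

# Sketch: combine the three samples and count trigrams, keeping stopwords
all.text <- c(sample.blog, sample.news, sample.tweet)
toks.all <- tokens(char_tolower(all.text), remove_punct = TRUE, remove_numbers = TRUE)
dfm.tri  <- dfm(tokens_ngrams(toks.all, n = 3))
topfeatures(dfm.tri, n = 20)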

Plan for Prediction Algorithm and Shiny App

The Shiny app that I plan to create will show the next possible words that are most strongly associated with the words entered by the user. If possible, I will leverage the tokens and n-gram counts built during this exploratory analysis.
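
As an illustration of the planned approach (not the final algorithm), a next-word lookup over the trigram counts might look like this. The predict_next() helper is hypothetical, and dfm.tri is carried over from the trigram sketch above.

# Hypothetical next-word lookup from trigram counts (sketch only)
trigram.freq <- topfeatures(dfm.tri, n = nfeat(dfm.tri))

predict_next <- function(w1, w2, freq = trigram.freq, n = 3) {
  prefix  <- paste(w1, w2, "", sep = "_")           # e.g. "one_of_"
  matches <- freq[startsWith(names(freq), prefix)]
  if (length(matches) == 0) return(character(0))
  hits <- head(sort(matches, decreasing = TRUE), n)
  substring(names(hits), nchar(prefix) + 1)         # strip the two-word prefix
}

predict_next("one", "of")   # expected to suggest "the" among the top candidates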