Executive Summary

This milestone report presents a basic summary of the data and describes the major features of the blog, news, and Twitter text from the United States. The data sets were downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

Whether stopwords are kept depends on the purpose of the analysis: with stopwords, the results reflect day-to-day sentences; without stopwords, the most frequent terms identify themes and topics.

Load Data

# Read the raw English (US) text files as UTF-8, skipping embedded nulls
blog <- readLines("./en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
tweet <- readLines("./en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Exploratory Analysis

Summary table - Word Counts and Line Counts

##   Data_Source Word_Counts Line_Counts
## 1       Blogs    37334131      899288
## 2        News     2643969       77259
## 3     Twitter    30373583     2360148
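
The table above can be reproduced roughly as follows. This is a minimal sketch: the word-count rule (stri_count_words() from the stringi package) is an assumption, as any reasonable tokenization gives comparable figures.

library(stringi)

# Word and line counts for one source (assumes the blog/news/tweet vectors read above)
count_summary <- function(txt, source_name) {
  data.frame(Data_Source = source_name,
             Word_Counts = sum(stri_count_words(txt)),
             Line_Counts = length(txt))
}

rbind(count_summary(blog,  "Blogs"),
      count_summary(news,  "News"),
      count_summary(tweet, "Twitter"))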

What are ‘stopwords’?

##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
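
The 174 words listed above are a standard English stopword list. quanteda ships such a list, so a sketch like the following prints an equivalent set (the exact source used for this report is an assumption):

library(quanteda)

# Built-in English stopword list; the words shown above come from a list like this
head(stopwords("english"), 20)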

Sample the data and create a corpus from each source

library(quanteda)

set.seed(123)

# Sampling: 10% of blogs, 100% of news, 5% of tweets
sample.blog <- sample(blog, round(length(blog)*0.1), replace = FALSE)
corpus.blog <- corpus(sample.blog)

sample.news <- sample(news, round(length(news)*1), replace = FALSE)
corpus.news <- corpus(sample.news)

sample.tweet <- sample(tweet, round(length(tweet)*0.05), replace = FALSE)
corpus.tweet <- corpus(sample.tweet)

# Pre-process: convert all text to lower case
# (toLower() is from older quanteda releases; char_tolower() is the current equivalent)
corpus.blog <- toLower(corpus.blog)
corpus.news <- toLower(corpus.news)
corpus.tweet <- toLower(corpus.tweet)

Tokenize words - Blog
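
The console messages below come from building a document-feature matrix and removing stopword features. A rough equivalent under the current quanteda API is sketched here; the original run used an older quanteda release, so the function names and log format differ slightly. The same steps, with matching messages, are repeated for the News and Twitter corpora further down.

# Sketch: tokenize the blog corpus and build a dfm with the 174 stopwords removed
toks.blog <- tokens(corpus.blog, remove_punct = TRUE, remove_numbers = TRUE)
dfm.blog  <- dfm(toks.blog)
dfm.blog  <- dfm_remove(dfm.blog, pattern = stopwords("english"))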

## 
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 89,929 documents
##    ... indexing features: 104,551 feature types
##    ... created a 89929 x 104552 sparse dfm
##    ... complete. 
## Elapsed time: 11.03 seconds.
## removed 174 features, from 174 supplied (glob) feature types
## removed 591,399 features, from 174 supplied (glob) feature types
## removed 2,242,283 features, from 174 supplied (glob) feature types

Plot Charts - Blog
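
The frequency charts for the blog sample (not reproduced here) can be drawn directly from the dfm. A minimal sketch, assuming the dfm.blog object from the previous step; the News and Twitter charts follow the same pattern.

# Top 20 most frequent blog words after stopword removal
top.blog <- topfeatures(dfm.blog, n = 20)
barplot(rev(top.blog), horiz = TRUE, las = 1,
        main = "Most Frequent Words - Blogs", xlab = "Frequency")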

Tokenize words - News

## 
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 77,259 documents
##    ... indexing features: 92,147 feature types
##    ... created a 77259 x 92148 sparse dfm
##    ... complete. 
## Elapsed time: 6.22 seconds.
## removed 172 features, from 174 supplied (glob) feature types
## removed 442,785 features, from 174 supplied (glob) feature types
## removed 1,565,095 features, from 174 supplied (glob) feature types

Plot Charts - News

Tokenize words - Twitter

## 
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 118,007 documents
##    ... indexing features: 68,296 feature types
##    ... created a 118007 x 68297 sparse dfm
##    ... complete. 
## Elapsed time: 5.34 seconds.
## removed 173 features, from 174 supplied (glob) feature types
## removed 262,965 features, from 174 supplied (glob) feature types
## removed 799,857 features, from 174 supplied (glob) feature types

Plot Charts - Twitter

Combine texts from all sources and analyze 3-grams without removing stopwords

##                  trigram.top.all.dec.order.
## one_of_the                              152
## a_lot_of                                143
## going_to_be                              86
## to_be_a                                  79
## i_want_to                                75
## some_of_the                              75
## be_able_to                               73
## out_of_the                               69
## as_well_as                               68
## it_was_a                                 67
## the_end_of                               63
## part_of_the                              54
## thanks_for_the                           54
## according_to_the                         51
## a_couple_of                              49
## all_of_the                               49
## most_of_the                              48
## the_rest_of                              48
## the_fact_that                            47
## you_want_to                              46
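
The trigram counts above can be obtained along these lines. This is a sketch using the current quanteda API (tokens_ngrams()); the exact size of the combined sample behind the figures above is not restated here.

# Sketch: combine the three samples and count trigrams, keeping stopwords
all.text <- c(sample.blog, sample.news, sample.tweet)
toks.all <- tokens(char_tolower(all.text), remove_punct = TRUE, remove_numbers = TRUE)
dfm.tri  <- dfm(tokens_ngrams(toks.all, n = 3))
topfeatures(dfm.tri, n = 20)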

Plan for Prediction Algorithm and Shiny App

The Shiny app that I plan to create will show the next possible words that are most strongly associated with the words entered by the user. If possible, I will leverage the tokens and n-gram counts built during this exploratory analysis.
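
As an illustration of the planned approach (not the final algorithm), a next-word lookup over the trigram counts might look like this. The predict_next() helper is hypothetical, and dfm.tri is carried over from the trigram sketch above.

# Hypothetical next-word lookup from trigram counts (sketch only)
trigram.freq <- topfeatures(dfm.tri, n = nfeat(dfm.tri))

predict_next <- function(w1, w2, freq = trigram.freq, n = 3) {
  prefix  <- paste(w1, w2, "", sep = "_")           # e.g. "one_of_"
  matches <- freq[startsWith(names(freq), prefix)]
  if (length(matches) == 0) return(character(0))
  hits <- head(sort(matches, decreasing = TRUE), n)
  substring(names(hits), nchar(prefix) + 1)         # strip the two-word prefix
}

predict_next("one", "of")   # expected to suggest "the" among the top candidates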