This milestone report presents a basic summary of the data and describes the major features of the English-language (en_US) blog, news, and Twitter data sets. The data were downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
Whether stopwords are kept depends on the purpose of the analysis: with stopwords retained, the results reflect everyday sentence patterns, whereas with stopwords removed, the most frequent terms are better suited to identifying themes and topics.
# Read each en_US file line by line, preserving UTF-8 and skipping embedded nulls
blog <- readLines("./en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
tweet <- readLines("./en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
##   Data_Source Word_Counts Line_Counts
## 1       Blogs    37334131      899288
## 2        News     2643969       77259
## 3     Twitter    30373583     2360148
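The code that produced this summary is not shown; below is a minimal sketch of how the counts could be reproduced, assuming `stringi::stri_count_words()` as the word counter (the original counting method is not stated).

library(stringi)

# Line counts are the lengths of the character vectors;
# word counts sum stringi's per-line word counts (assumed method)
data.frame(
  Data_Source = c("Blogs", "News", "Twitter"),
  Word_Counts = c(sum(stri_count_words(blog)),
                  sum(stri_count_words(news)),
                  sum(stri_count_words(tweet))),
  Line_Counts = c(length(blog), length(news), length(tweet))
)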
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
library(quanteda)  # provides corpus() and the dfm tools used below

set.seed(123)
# Sampling: 10% of the blog lines, all of the news lines, and 5% of the tweets
sample.blog <- sample(blog, round(length(blog) * 0.1), replace = FALSE)
corpus.blog <- corpus(sample.blog)
sample.news <- sample(news, length(news), replace = FALSE)
corpus.news <- corpus(sample.news)
sample.tweet <- sample(tweet, round(length(tweet) * 0.05), replace = FALSE)
corpus.tweet <- corpus(sample.tweet)
# Pre-process: convert all text to lower case
# (toLower() comes from an older quanteda release; current versions
#  provide char_tolower()/tokens_tolower() instead)
corpus.blog <- toLower(corpus.blog)
corpus.news <- toLower(corpus.news)
corpus.tweet <- toLower(corpus.tweet)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 89,929 documents
## ... indexing features: 104,551 feature types
## ... created a 89929 x 104552 sparse dfm
## ... complete.
## Elapsed time: 11.03 seconds.
## removed 174 features, from 174 supplied (glob) feature types
## removed 591,399 features, from 174 supplied (glob) feature types
## removed 2,242,283 features, from 174 supplied (glob) feature types
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 77,259 documents
## ... indexing features: 92,147 feature types
## ... created a 77259 x 92148 sparse dfm
## ... complete.
## Elapsed time: 6.22 seconds.
## removed 172 features, from 174 supplied (glob) feature types
## removed 442,785 features, from 174 supplied (glob) feature types
## removed 1,565,095 features, from 174 supplied (glob) feature types
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 118,007 documents
## ... indexing features: 68,296 feature types
## ... created a 118007 x 68297 sparse dfm
## ... complete.
## Elapsed time: 5.34 seconds.
## removed 173 features, from 174 supplied (glob) feature types
## removed 262,965 features, from 174 supplied (glob) feature types
## removed 799,857 features, from 174 supplied (glob) feature types
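The verbose logs above were printed by dfm() in an older quanteda release; the code that built the document-feature matrices is not shown. Below is a minimal sketch of an equivalent pipeline in the current quanteda API, using the blog corpus as an example (object names and the exact options are assumptions):

library(quanteda)

# Tokenize and lower-case (removing punctuation is an assumption)
toks.blog <- tokens(corpus.blog, remove_punct = TRUE)
toks.blog <- tokens_tolower(toks.blog)

# Unigram dfm with the 174 English stopwords removed (for topic exploration)
dfm.uni <- dfm_remove(dfm(toks.blog), stopwords("english"))

# Bigram and trigram dfms keep stopwords, since they matter for prediction
dfm.bi  <- dfm(tokens_ngrams(toks.blog, n = 2))
dfm.tri <- dfm(tokens_ngrams(toks.blog, n = 3))

The table below lists the 20 most frequent trigrams found in the samples.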
## trigram.top.all.dec.order.
## one_of_the 152
## a_lot_of 143
## going_to_be 86
## to_be_a 79
## i_want_to 75
## some_of_the 75
## be_able_to 73
## out_of_the 69
## as_well_as 68
## it_was_a 67
## the_end_of 63
## part_of_the 54
## thanks_for_the 54
## according_to_the 51
## a_couple_of 49
## all_of_the 49
## most_of_the 48
## the_rest_of 48
## the_fact_that 47
## you_want_to 46
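A ranking like the one above can be extracted with quanteda's topfeatures(); the sketch below assumes the trigram dfm from the previous snippet (the original object names are not shown):

# Top 20 trigrams by total frequency, in decreasing order
trigram.top <- topfeatures(dfm.tri, n = 20)
data.frame(trigram.top.all.dec.order. = trigram.top)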
The Shiny app I plan to build will display the next words most strongly associated with the words entered by the user, as illustrated in the sketch below. Where possible, I will reuse the tokens and n-gram counts produced by this exploratory analysis.
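As a rough illustration of the planned approach (not the final design), a hypothetical lookup against the trigram counts might work as follows; the function name and the absence of backoff and smoothing are simplifications:

# Given the last two words typed, return the most frequent third words
# among the trigram features (a sketch; a real predictor would back off
# to bigrams/unigrams when no trigram matches)
predict.next.word <- function(w1, w2, trigram.counts, n = 3) {
  prefix <- paste(w1, w2, "", sep = "_")          # e.g. "one_of_"
  hits <- trigram.counts[startsWith(names(trigram.counts), prefix)]
  if (length(hits) == 0) return(character(0))
  hits <- sort(hits, decreasing = TRUE)
  # Strip the two-word prefix, keeping only the predicted third word
  substring(names(head(hits, n)), nchar(prefix) + 1)
}

# Example with the counts shown above:
# predict.next.word("one", "of", trigram.top)   # likely "the"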