This document summarizes the textual data used in the Johns Hopkins Data Science Capstone Project offered through Coursera.
Text from US news articles, tweets, and blog posts is analyzed.
The data were downloaded directly from the Coursera website, per the course instructions. A copy of the data can be obtained by clicking here.
The zip file contained four folders, each holding text from a different language; for the purposes of this analysis only the English text will be considered.
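If the data need to be re-downloaded, a sketch along the following lines would recreate the directory listing below; the URL here is a placeholder and should be replaced with the link given in the course materials.
data_url <- "https://example.com/Coursera-SwiftKey.zip"  # placeholder URL, not the real link
download.file(data_url, destfile = "Coursera-SwiftKey.zip")
unzip("Coursera-SwiftKey.zip")  # extracts into a "final/" directory
list.files("final")             # one folder per language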
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
The English folder contains text files for blogs, news, and Twitter.
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
## Warning in scan("final/en_US/en_US.twitter.txt", what = "character", sep =
## "\n"): embedded nul(s) found in input
str(en_US.blogs)
## chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." ...
str(en_US.twitter)
## chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
str(en_US.news)
## chr [1:1010242] "He wasn't home alone, apparently." ...
The following output comes from a user-defined function that summarizes the number of characters per line of text.
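The function itself is not displayed; it presumably wraps summary() around nchar(), along the lines of this sketch.
# Plausible sketch of the helper; the original definition is not shown.
summarize.lines <- function(text) {
  summary(nchar(text))  # distribution of characters per line
}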
summarize.lines( en_US.twitter )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 37.00 64.00 68.68 100.00 140.00
summarize.lines( en_US.blogs )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40830
summarize.lines( en_US.news )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 110.0 185.0 201.2 268.0 11380.0
Here is a deeper set of statistics, computed with the stringi package.
stri_stats_general( en_US.twitter )
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096031 134082634
stri_stats_general( en_US.blogs )
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general( en_US.news )
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
words_twitter <- stri_count_words(en_US.twitter)
summary( words_twitter )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
qplot( words_twitter )
words_blogs <- stri_count_words(en_US.blogs)
summary( words_blogs )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
qplot( words_blogs )
words_news <- stri_count_words(en_US.news)
summary( words_news )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.41 46.00 1796.00
qplot( words_news )
The data summaries above show how large the text files are. In the absence of a big data technology such as SparkR or Hadoop, we will use random sampling to speed up computation without losing the statistical properties of the data needed for further exploratory analysis.
Next, the data are sampled (100 lines from each file) so that further analysis is more efficient while remaining representative. The sampled data are saved for future retrieval (code not displayed).
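The sampling step is not displayed, but a minimal sketch, assuming 100 random lines per file written into the en_US_sample directory read below (the seed is an arbitrary choice), looks like this.
# Sketch of the undisplayed sampling step.
set.seed(42)  # arbitrary seed, for reproducibility only
sample_dir <- "~/Documents/coursera-data-science-capstone/en_US_sample"
dir.create(sample_dir, showWarnings = FALSE)
writeLines(sample(en_US.blogs, 100), file.path(sample_dir, "blogs.txt"))
writeLines(sample(en_US.news, 100), file.path(sample_dir, "news.txt"))
writeLines(sample(en_US.twitter, 100), file.path(sample_dir, "twitter.txt"))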
ds <- DirSource("~/Documents/coursera-data-science-capstone/en_US_sample")
corpus <- Corpus(ds)
summary(corpus)
## Length Class Mode
## blogs.txt 2 PlainTextDocument list
## news.txt 2 PlainTextDocument list
## twitter.txt 2 PlainTextDocument list
The following code cleans the corpus by removing punctuation, removing numbers, converting to lowercase, removing common English ‘stopwords’, and collapsing any remaining extra whitespace.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
After the corpus has been cleaned it is saved; this cleaned corpus is the basis of all subsequent analysis.
writeCorpus(corpus, path = "~/Documents/coursera-data-science-capstone/Clean Corpus")
The following code tokenizes the corpus into single words, pairs of words, and triples of words, often referred to as n-grams, where n denotes the number of words grouped together. Thus tokenizing pairs of words produces 2-grams.
cleantext <- data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)
# NGramTokenizer() and Weka_control() come from the RWeka package
onetoken <- NGramTokenizer(cleantext, Weka_control(min = 1, max = 1))
bitoken <- NGramTokenizer(cleantext, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tritoken <- NGramTokenizer(cleantext, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
bitritoken <- c(bitoken, tritoken)  # combine bigram and trigram tokens into one vector
The following charts plot the most common tokens, bitokens, and tritokens.
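The plotting code is not displayed; a sketch along these lines, using table() and ggplot2, would produce comparable charts (the top-20 cutoff is an arbitrary choice).
# Sketch of the undisplayed frequency plots; the top-20 cutoff is arbitrary.
plot_top_tokens <- function(tokens, n = 20) {
  freq <- sort(table(tokens), decreasing = TRUE)[1:n]
  df <- data.frame(token = names(freq), count = as.integer(freq))
  ggplot(df, aes(x = reorder(token, count), y = count)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(x = NULL, y = "frequency")
}
plot_top_tokens(onetoken)
plot_top_tokens(bitoken)
plot_top_tokens(tritoken)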
Next, a document-term matrix is created, sparse terms are removed, and a hierarchical clustering is performed.
dtm <- DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 38983)>>
## Non-/sparse entries: 57199/59750
## Sparsity : 51%
## Maximal term length: 95
## Weighting : term frequency (tf)
dtm <- removeSparseTerms(dtm, 0.5)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 12940)>>
## Non-/sparse entries: 31156/7664
## Sparsity : 20%
## Maximal term length: 18
## Weighting : term frequency (tf)
dtmClust <- hclust(dist(dtm), method = "ward.D")  # the "ward" method has been renamed to "ward.D"
plot(dtmClust)
A k-means clustering of the documents was created as well.
dtmKMeans <- kmeans(dtm, 2)
dtmKMeans$cluster
## blogs.txt news.txt twitter.txt
## 1 1 2
The algorithmic development of the Shiny app will use an n-gram language model with a backoff strategy to ensure an acceptable level of predictive accuracy. Care will be taken to ensure that any string of words not found in the training data still receives a default set of predictions.
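As a rough illustration of the intended backoff logic (the table, column, and function names here are placeholders, not final design decisions), the prediction step might look like the sketch below.
# Rough sketch of the planned backoff lookup. trigram_freq, bigram_freq and
# unigram_freq are hypothetical data frames of n-gram counts, each with
# columns `prefix`, `word` and `count`, built from the tokenized corpus.
predict_next_word <- function(prefix_words, n = 3) {
  prefix2 <- paste(tail(prefix_words, 2), collapse = " ")
  prefix1 <- tail(prefix_words, 1)
  hits <- trigram_freq[trigram_freq$prefix == prefix2, ]
  if (nrow(hits) == 0) hits <- bigram_freq[bigram_freq$prefix == prefix1, ]
  if (nrow(hits) == 0) hits <- unigram_freq  # default predictions for unseen strings
  head(hits$word[order(-hits$count)], n)
}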