Executive Summary

This document summarizes the textual data used in the Johns Hopkins Data Science Capstone Project offered through Coursera.

Textual data from US news stories, tweets, and blog posts are analyzed.

Learnings from this Analysis

  • As a whole, the Twitter data contains the most lines of text, followed by news and blogs, which may be a function of the prevalence of each medium
  • Tweets are limited in length (at most 140 characters), whereas blog and news entries show highly right-skewed distributions, with some lines containing far more words and characters
  • The most common 1-gram is ‘said’ and the most common 2-gram is ‘new york’; the most common 3-grams appear in the frequency plots below
  • In terms of style and content, blog and news writing are more similar to each other, and both are distinctly different from writing on Twitter

The Organization of this Document is as Follows

  • Loading in the Raw Data
  • Summarizing the Raw Data
  • Sampling the Raw Data
  • Processing the Sampled Data
  • Further Analysis of Processed Data

Loading Raw Data and Summarizing Files

Loading the raw data

The data were downloaded directly from the Coursera website, per the course instructions, and a copy can be obtained from the same site.

The zip file contained four folders, each pertaining to text from a different language; for the purposes of this analysis only the English text will be considered.
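The directory listing below was presumably produced with list.files(); a minimal sketch, where the "final" directory name is inferred from the path in the scan() warning further down:

list.files("final")   # one folder per language/locale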

## [1] "de_DE" "en_US" "fi_FI" "ru_RU"

This English text contains text entries for blogs, news, and Twitter.

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
## Warning in scan("final/en_US/en_US.twitter.txt", what = "character", sep =
## "\n"): embedded nul(s) found in input

Summarizing Files

How many lines of text are in each file?
str(en_US.blogs)
##  chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." ...
str(en_US.twitter)
##  chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
str(en_US.news)
##  chr [1:1010242] "He wasn't home alone, apparently." ...
How many characters are in each line of text from each source (e.g. news, blogs, tweets)?

The following output comes from a user-generated function that summarizes the number of characters per line of text.
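That function is not shown in this report; a functionally equivalent sketch that reproduces the six-number summaries below:

# summarize the number of characters per line of a character vector
summarize.lines <- function(x) summary(nchar(x))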

summarize.lines( en_US.twitter )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   37.00   64.00   68.68  100.00  140.00
summarize.lines( en_US.blogs )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40830
summarize.lines( en_US.news )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   110.0   185.0   201.2   268.0 11380.0

Here is a deeper set of statistics:

stri_stats_general( en_US.twitter )
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096031   134082634
stri_stats_general( en_US.blogs )
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
stri_stats_general( en_US.news )
##       Lines LinesNEmpty       Chars CharsNWhite 
##     1010242     1010242   203223154   169860866
How many words are in each file type (e.g. blogs, news, tweets)?
words_twitter   <- stri_count_words(en_US.twitter)
summary( words_twitter )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00
qplot(   words_twitter )
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

words_blogs   <- stri_count_words(en_US.blogs)
summary( words_blogs )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
qplot(   words_blogs )
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

words_news   <- stri_count_words(en_US.news)
summary( words_news )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.41   46.00 1796.00
qplot(   words_news )
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Sampling and Processing Raw Data

The data summaries above show how large the text files are. In the absence of a big data technology such as SparkR or Hadoop, we will use random sampling to increase computational speed without losing the properties of the data needed for further exploratory analysis.

Sampling Data

Next, the data are sampled (100 lines from each source) so that further analysis runs efficiently while remaining representative. The sampled data are saved for future retrieval (code not displayed).
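A minimal sketch of sampling code that would produce the three files read into the corpus below (the seed is an arbitrary choice):

# draw 100 random lines from each source and write them to the sample directory
set.seed(1234)   # arbitrary seed for reproducibility
sample_dir <- "~/Documents/coursera-data-science-capstone/en_US_sample"
writeLines(sample(en_US.blogs,   100), file.path(sample_dir, "blogs.txt"))
writeLines(sample(en_US.news,    100), file.path(sample_dir, "news.txt"))
writeLines(sample(en_US.twitter, 100), file.path(sample_dir, "twitter.txt"))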

Creating a Corpus of the Sampled Data for Processing

ds <- DirSource("~/Documents/coursera-data-science-capstone/en_US_sample")
corpus <- Corpus(ds)
summary(corpus)
##             Length Class             Mode
## blogs.txt   2      PlainTextDocument list
## news.txt    2      PlainTextDocument list
## twitter.txt 2      PlainTextDocument list

Processing Data

Cleaning the Corpus

The following code cleans the corpus by removing punctuation, removing numbers, transforming to lowercase, removing common ‘stopwords’, and removing any remaining white space.

corpus <- tm_map(corpus, removePunctuation) 
corpus <- tm_map(corpus, removeNumbers)   
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

After the corpus has been cleaned it is saved; this cleaned corpus is the basis of all further analysis.

writeCorpus(corpus, path = "~/Documents/coursera-data-science-capstone/Clean Corpus")

Tokenizing the Corpus for Deeper Analysis

The following code tokenizes the corpus into single words, pairs of words, and triples of words, often referred to as n-grams, where n denotes the number of words grouped together. Tokenizing pairs of words, for example, produces 2-grams.

cleantext <- data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)
onetoken <- NGramTokenizer(cleantext, Weka_control(min = 1, max = 1))
bitoken <- NGramTokenizer(cleantext, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tritoken <- NGramTokenizer(cleantext, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
bitritoken <- c(bitoken, tritoken)   # combine the 2-gram and 3-gram tokens into one vector

Saving the Tokens (code not displayed)
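The saving code is not displayed; one possible way to persist the token vectors for later retrieval (the tokens subdirectory name is an assumption):

# save each token vector as an .rds file for later retrieval
token_dir <- "~/Documents/coursera-data-science-capstone/tokens"   # assumed directory
saveRDS(onetoken,   file.path(token_dir, "onetoken.rds"))
saveRDS(bitoken,    file.path(token_dir, "bitoken.rds"))
saveRDS(tritoken,   file.path(token_dir, "tritoken.rds"))
saveRDS(bitritoken, file.path(token_dir, "bitritoken.rds"))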

Further Exploratory Analysis on Sampled Data

Frequency Analysis of Tokens

The following charts plot the most common tokens, bitokens, and tritokens.

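The plotting code for these charts is not shown; a minimal sketch of how one such frequency chart could be built from the 1-gram tokens created above (ggplot2 assumed; the top-20 cutoff is an arbitrary choice):

library(ggplot2)
# tabulate the 1-gram tokens and keep the 20 most frequent
onetoken_freq <- sort(table(onetoken), decreasing = TRUE)[1:20]
freq_df <- data.frame(token = names(onetoken_freq),
                      count = as.integer(onetoken_freq))
# horizontal bar chart of the most common 1-grams
ggplot(freq_df, aes(x = reorder(token, count), y = count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "1-gram", y = "Frequency")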

Word Cloud of Cleaned Corpus
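The word cloud image is not reproduced here; a minimal sketch of how it could be generated from the cleaned corpus with the wordcloud package (the max.words cutoff is an arbitrary choice):

library(wordcloud)
# term frequencies across the cleaned corpus
tdm <- TermDocumentMatrix(corpus)
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
# plot the most frequent terms, largest first
wordcloud(names(term_freq), term_freq, max.words = 100, random.order = FALSE)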

Statistical Clustering of Documents

A document-term matrix was created, sparse terms were removed, and a hierarchical clustering was performed.

dtm <- DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 38983)>>
## Non-/sparse entries: 57199/59750
## Sparsity           : 51%
## Maximal term length: 95
## Weighting          : term frequency (tf)
dtm <- removeSparseTerms(dtm, 0.5)
dtm
## <<DocumentTermMatrix (documents: 3, terms: 12940)>>
## Non-/sparse entries: 31156/7664
## Sparsity           : 20%
## Maximal term length: 18
## Weighting          : term frequency (tf)
dtmClust <- hclust(dist(dtm), method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
plot(dtmClust)

A k-means clustering of the documents was also created.

dtmKMeans <- kmeans(dtm, 2)
dtmKMeans$cluster
##   blogs.txt    news.txt twitter.txt 
##           1           1           2

Algorithmic Development and Plans for Shiny App

The algorithmic development of the Shiny app will use an n-gram model tuned to an acceptable level of predictive accuracy. A backoff model will be used, and care will be taken to ensure that any string of words not found in the training data falls back to a default set of predictions.
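A conceptual sketch of such a backoff lookup (not the final app code; the lookup tables tri_next, bi_next, and top_unigrams are hypothetical objects, not built in this report):

# tri_next: named vector mapping 2-word prefixes to their most frequent next word
# bi_next:  named vector mapping 1-word prefixes to their most frequent next word
# top_unigrams: overall most frequent words, used as the default predictions
predict_next <- function(phrase, tri_next, bi_next, top_unigrams) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  key2 <- paste(words, collapse = " ")   # last two words
  key1 <- tail(words, 1)                 # last word
  if (length(words) == 2 && key2 %in% names(tri_next)) return(unname(tri_next[key2]))
  if (length(key1) == 1 && key1 %in% names(bi_next)) return(unname(bi_next[key1]))
  top_unigrams[1]   # default when the observed words are not in the training data
}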