# Milestone Assignment for the Capstone Project in the Coursera ‘Data Science Specialization’ (Johns Hopkins University)

---
title: "CapstoneProject-Milestone.Rmd"
author: "Mark A. Jack"
date: "March 24, 2016"
output: html_document
---

## Executive Summary

This report provides an initial exploratory analysis of the three text data files provided for the NLP capstone project. Several libraries are loaded. The creation of a corpus of documents from the three text data files relies mostly on the library ‘quanteda’. It allows the corpus of documents to be tokenized quickly, removing text features such as punctuation, numbers and white space and converting all words to lowercase. The processing time for the complete text data is considerable; thus, a corpus is only created from a sample of the documents. Unigrams, bigrams and trigrams are generated via ‘quanteda’s’ document-feature matrix (dfm) format. A dfm allows for quick and easy analysis of the most frequently occurring n-grams.

## Environment
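The code shown in this report relies mainly on ‘quanteda’; file reading, sampling and plotting use base R. A minimal sketch of the setup, assuming the package is already installed (any additional packages loaded in the original analysis are not shown here):

# Load 'quanteda' for corpus construction, tokenization and dfm creation
library(quanteda)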

## Load and Prepare Data for Tokenization

The blogs, news and twitter data files (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) are loaded via ‘readLines’, and the number of lines in each text file is counted.
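A minimal sketch of this loading step, assuming the three files sit in the working directory; the line counts are stored in nl_blogs, nl_news and nl_twitter, names that reappear in the sampling code below (skipNul is an assumption to guard against embedded nulls in the raw files):

# Read the raw text files line by line (paths are assumed)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Count and print the number of lines in each file
nl_blogs   <- length(blogs);   print(nl_blogs)
nl_news    <- length(news);    print(nl_news)
nl_twitter <- length(twitter); print(nl_twitter)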

## [1] 899288
## [1] 1010242
## [1] 2360148

Each text is collapsed into a single stream of characters so that the number of characters in each data file can be counted after removing the white space (" ") that separates individual words and lines of text:
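The exact call is not echoed in the report; a sketch of one way to obtain these counts from the character vectors read above (paste() collapses each file into a single string before the spaces are stripped):

# Collapse each file into one string, strip the spaces, and count the characters
print(nchar(gsub(" ", "", paste(blogs,   collapse = ""))))
print(nchar(gsub(" ", "", paste(news,    collapse = ""))))
print(nchar(gsub(" ", "", paste(twitter, collapse = ""))))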

## [1] 207723792
## [1] 204233400
## [1] 164456321

## Tokenization of the Data

A corpus is created from samples of each of the three text documents (blogs, news and twitter). For quick processing, a small sample size of 1% of each original document is selected.

# Create one corpus of text using the library 'quanteda'
require(quanteda)
sampleSize <- 0.01   # sample 1% of the lines in each file
set.seed(1234)       # make the sampling reproducible
# nl_blogs, nl_news and nl_twitter are the line counts from above;
# the fractional sample sizes are truncated to integers by sample()
blogs.sample <- sample(blogs, nl_blogs*sampleSize)
news.sample <- sample(news, nl_news*sampleSize)
twitter.sample <- sample(twitter, nl_twitter*sampleSize)
doc.sample <- c(blogs.sample, news.sample, twitter.sample)
doc.corpus <- corpus(doc.sample)   # build the quanteda corpus from the combined sample

The object sizes [in bytes] are printed to show how the object size changes from the full blogs data, to the sample taken from the blogs file, to the combined sample created from the blogs, news and twitter samples, and finally to the corpus created from that combined sample:
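A sketch of the calls that would print these values; object.size() from base R is assumed, since the original code is not echoed:

print(object.size(blogs))        # full blogs data
print(object.size(blogs.sample)) # 1% sample of the blogs
print(object.size(doc.sample))   # combined blogs, news and twitter sample
print(object.size(doc.corpus))   # quanteda corpus built from the combined sample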

## 260564320 bytes
## 2622656 bytes
## 8449456 bytes
## 11158072 bytes

A summary of the corpus, showing the first five documents, is displayed below:
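The display of five documents suggests a call along the following lines (the n argument is an assumption based on the output shown):

summary(doc.corpus, n = 5)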

## Corpus consisting of 42695 documents, showing 5 documents.
## 
##   Text Types Tokens Sentences
##  text1     3      5         2
##  text2    20     26         4
##  text3    47     62         4
##  text4   104    170         9
##  text5     9     10         1
## 
## Source:  /Users/markjack/Desktop/* on x86_64 by markjack
## Created: Fri Mar 25 00:52:54 2016
## Notes:

The corpus is tokenized: the text is converted to lowercase; profanity, numbers, punctuation, hyphens, separators, Twitter symbols and English stop words are removed; and the remaining words are stemmed. A list of words of profanity is downloaded from the website: http://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool. In the same step, unigrams, bigrams and trigrams are generated using ‘quanteda’s’ ‘dfm’ tool, a document-feature matrix.

# Open the profanity list once, read it, and close that same connection
con <- file("/Users/markjack/capstone-project/Terms-to-Block.csv", encoding = "UTF-8")
profanity <- readLines(con, encoding = "UTF-8")
close(con)
#
# Unigram dfm: remove profanity and English stop words, strip punctuation, numbers, Twitter symbols, separators and hyphens, and stem words
unigrams.dfm <- dfm(doc.corpus, ngrams = 1, ignoredFeatures = c(profanity, stopwords("english")),
                    removePunct = TRUE, removeNumbers = TRUE, 
                    removeTwitter = TRUE, removeSeparators = TRUE, removeHyphens = TRUE, stem = TRUE)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 42,695 documents
##    ... indexing features: 53,046 feature types
##    ... removed 173 features, from 894 supplied (glob) feature types
##    ... stemming features (English), trimmed 16078 feature variants
##    ... created a 42695 x 36795 sparse dfm
##    ... complete. 
## Elapsed time: 4.275 seconds.
# Bigram dfm: same cleaning and stemming, applied to 2-word sequences
bigrams.dfm <- dfm(doc.corpus, ngrams = 2, ignoredFeatures = c(profanity, stopwords("english")),
                    removePunct = TRUE, removeNumbers = TRUE, 
                    removeTwitter = TRUE, removeSeparators = TRUE, removeHyphens = TRUE, stem = TRUE)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 42,695 documents
##    ... indexing features: 444,694 feature types
##    ... removed 237,935 features, from 894 supplied (glob) feature types
##    ... stemming features (English), trimmed 5457 feature variants
##    ... created a 42695 x 201302 sparse dfm
##    ... complete. 
## Elapsed time: 21.049 seconds.
# Trigram dfm: same cleaning and stemming, applied to 3-word sequences
trigrams.dfm <- dfm(doc.corpus, ngrams = 3, ignoredFeatures = c(profanity, stopwords("english")),
                    removePunct = TRUE, removeNumbers = TRUE, 
                    removeTwitter = TRUE, removeSeparators = TRUE, removeHyphens = TRUE, stem = TRUE)
## Creating a dfm from a corpus ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 42,695 documents
##    ... indexing features: 779,437 feature types
##    ... removed 665,056 features, from 894 supplied (glob) feature types
##    ... stemming features (English), trimmed 181 feature variants
##    ... created a 42695 x 114200 sparse dfm
##    ... complete. 
## Elapsed time: 33.544 seconds.

## Exploratory Data Analysis

With the ‘topfeatures’ call applied to each of the unigram, bigram and trigram dfms, we obtain the 25 most frequent features in each set of n-grams. In three bar plots, we show the number of occurrences of each of the most common words or 2- or 3-word combinations as horizontal bars.

Unigrams:

# Create bar plots of 25 most frequent features in unigrams, bigrams and trigrams:
par(mar=c(4,4,4,2))
par(mfrow = c(1,1))
barplot(topfeatures(unigrams.dfm, 25), horiz=TRUE, las=1)

#dev.copy(png, file = "/Users/markjack/capstone-project/barplot_uni.png")
#dev.off()

Bigrams:

par(mar=c(4,8,4,2))
par(mfrow = c(1,1))
barplot(topfeatures(bigrams.dfm, 25), horiz=TRUE, las=1)

#dev.copy(png, file = "/Users/markjack/capstone-project/barplot_bi.png")
#dev.off()

Trigrams:

par(mar=c(4,12,4,2))
par(mfrow = c(1,1))
barplot(topfeatures(trigrams.dfm, 25), horiz=TRUE, las=1)

#dev.copy(png, file = "/Users/markjack/capstone-project/barplot_tri.png")
#dev.off()

We may further reduce the unigrams, bigrams and trigrams by removing infrequent features, e.g. n-grams that occur no more than 10 times:

unigrams.freq0 <- colSums(unigrams.dfm)
unigrams.freq <- sort(unigrams.freq0, decreasing=TRUE) 
#
bigrams.freq0 <- colSums(bigrams.dfm)
bigrams.freq <- sort(bigrams.freq0, decreasing=TRUE) 
#
trigrams.freq0 <- colSums(trigrams.dfm)
trigrams.freq <- sort(trigrams.freq0, decreasing=TRUE) 
#
#-------------------------------------------------------------------------
frequency <- 10
#
unigrams.most <- as.numeric()
for (i in 1:length(unigrams.freq)) { 
  if (unigrams.freq[i] > frequency) {
    unigrams.most  <- c(unigrams.most, unigrams.freq[i]) }
}
length(unigrams.most)
## [1] 5542
#
bigrams.most <- as.numeric()
for (i in 1:length(bigrams.freq)) { 
  if (bigrams.freq[i] > frequency) {
    bigrams.most  <- c(bigrams.most, bigrams.freq[i]) }
}
length(bigrams.most)
## [1] 750
#
trigrams.most <- as.numeric()
for (i in 1:length(trigrams.freq)) { 
  if (trigrams.freq[i] > frequency) {
    trigrams.most  <- c(trigrams.most, trigrams.freq[i]) }
}
length(trigrams.most)
## [1] 15
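Since the frequency vectors are already sorted and named, the same filtering can be written without loops; a vectorized equivalent that yields identical results and keeps the names needed for the data frames below:

# Keep only features occurring more than 'frequency' times (vectorized equivalent of the loops)
unigrams.most <- unigrams.freq[unigrams.freq > frequency]
bigrams.most  <- bigrams.freq[bigrams.freq > frequency]
trigrams.most <- trigrams.freq[trigrams.freq > frequency]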

Create data frames, select the 25 most frequent occurrences in the sorted lists of unigrams, bigrams and trigrams, and label the columns:

unigrams_most <- data.frame(unigrams.most)
unigrams_most[, 1] <- as.character(names(unigrams.most))
unigrams_most[, 2] <- as.numeric(unigrams.most)
unigrams_most0 <- unigrams_most[1:25,]
colnames(unigrams_most0) <- c("Word","Frequency")
row.names(unigrams_most0) <- NULL
head(unigrams_most0)
##   Word Frequency
## 1  one      3308
## 2 will      3204
## 3  get      3125
## 4 said      3046
## 5 just      3031
## 6 like      2983
#
bigrams_most <- data.frame(bigrams.most)
bigrams_most[, 1] <- as.character(names(bigrams.most))
bigrams_most[, 2] <- as.numeric(bigrams.most)
bigrams_most0 <- bigrams_most[1:25,]
colnames(bigrams_most0) <- c("Word","Frequency")
row.names(bigrams_most0) <- NULL
head(bigrams_most0)
##          Word Frequency
## 1   right_now       259
## 2   last_year       231
## 3    year_old       219
## 4    new_york       209
## 5  last_night       160
## 6 high_school       156
#
trigrams_most <- data.frame(trigrams.most)
trigrams_most[, 1] <- as.character(names(trigrams.most))
trigrams_most[, 2] <- as.numeric(trigrams.most)
trigrams_most0 <- trigrams_most[1:25,]
colnames(trigrams_most0) <- c("Word","Frequency")
row.names(trigrams_most0) <- NULL
head(trigrams_most0)
##                 Word Frequency
## 1         new_york_c        30
## 2      cinco_de_mayo        23
## 3        let_us_know        23
## 4     happy_new_year        20
## 5      two_years_ago        18
## 6 happy_mother's_day        18
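
As a closing note, the frequency tables can also be built in a single step; a sketch of an equivalent construction for the unigrams (the bigram and trigram tables follow the same pattern):

# Build the top-25 unigram table directly from the named frequency vector
unigrams_most0 <- data.frame(Word = names(unigrams.most)[1:25],
                             Frequency = as.numeric(unigrams.most)[1:25],
                             stringsAsFactors = FALSE)
head(unigrams_most0)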