---
title: "CapstoneProject-Milestone.Rmd"
author: "Mark A. Jack"
date: "March 24, 2016"
output: html_document
---
This report provides an initial exploratory analysis of the three data files provided for the NLP capstone project. Several libraries are loaded. The creation of a corpus of documents from the three text data files relies mostly on the library 'quanteda'. It allows us to quickly tokenize the corpus of documents, removing text features such as punctuation, numbers and white space and converting words to lowercase. The processing time for the complete text data is considerable, so a corpus is created only from a sample of the documents. Unigrams, bigrams and trigrams are generated via quanteda's document-feature matrix (dfm) format. A dfm allows for quick and easy analysis of the most frequently occurring ngrams.

## Environment
The blogs, news and twitter data files (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) are loaded via 'readLines', and the number of lines in each text file is counted.
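The loading step is not shown here; a minimal sketch of it (the file paths, encoding and skipNul settings are assumptions) is:

# Read the raw text files line by line
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Count the number of lines per file
nl_blogs <- length(blogs)
nl_news <- length(news)
nl_twitter <- length(twitter)
nl_blogs; nl_news; nl_twitter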
## [1] 899288
## [1] 1010242
## [1] 2360148
Each text is collapsed into a stream of characters to count the number of characters in each data file after removing the white space (" ") that separates individual words and lines of text:
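One way such counts can be obtained (a sketch; whether it reproduces the exact figures below depends on how white space was handled in the original run):

# Remove the spaces within each line and sum the character counts per file
sum(nchar(gsub(" ", "", blogs)))
sum(nchar(gsub(" ", "", news)))
sum(nchar(gsub(" ", "", twitter)))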
## [1] 207723792
## [1] 204233400
## [1] 164456321
A corpus is created from samples of each of the three text documents blogs, news and twitter. For quick processing, a small sample of 1% of each original document is selected.
# Create one corpus of text using the library 'quanteda'
require(quanteda)
sampleSize <- 0.01
set.seed(1234)
# Draw a 1% random sample of lines from each file (nl_* hold the line counts from above)
blogs.sample <- sample(blogs, floor(nl_blogs*sampleSize))
news.sample <- sample(news, floor(nl_news*sampleSize))
twitter.sample <- sample(twitter, floor(nl_twitter*sampleSize))
# Combine the three samples and build a single quanteda corpus
doc.sample <- c(blogs.sample, news.sample, twitter.sample)
doc.corpus <- corpus(doc.sample)
The object sizes [in bytes] are printed to show the change in object size from the full blogs data, to the sample taken from the blogs file, to the complete sample created from the blogs, news and twitter samples, and finally to the corpus created from the sample:
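A sketch of how these sizes can be printed with object.size() (the exact calls are an assumption):

# Memory footprint of the full blogs text, the blogs sample, the combined sample and the corpus
print(object.size(blogs))
print(object.size(blogs.sample))
print(object.size(doc.sample))
print(object.size(doc.corpus))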
## 260564320 bytes
## 2622656 bytes
## 8449456 bytes
## 11158072 bytes
A summary of the corpus of documents is shown below:
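The summary can be produced with quanteda's summary method on the corpus; limiting the listing to five documents is an assumption based on the output:

# Summarize the corpus, showing the first five documents
summary(doc.corpus, n = 5)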
## Corpus consisting of 42695 documents, showing 5 documents.
##
## Text Types Tokens Sentences
## text1 3 5 2
## text2 20 26 4
## text3 47 62 4
## text4 104 170 9
## text5 9 10 1
##
## Source: /Users/markjack/Desktop/* on x86_64 by markjack
## Created: Fri Mar 25 00:52:54 2016
## Notes:
The corpus is tokenized: the text is transformed by removing profanity, converting words to lowercase, removing numbers, punctuation, hyphens, separators, Twitter symbols and English stop words, and stemming the remaining words. A list of profanity terms is downloaded from the website: http://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool. In the same step, unigrams, bigrams and trigrams are generated using quanteda's 'dfm', a document-feature matrix.
# Read the list of profanity terms via a single connection and close it afterwards
con <- file("/Users/markjack/capstone-project/Terms-to-Block.csv", encoding = "UTF-8")
profanity <- readLines(con)
close(con)
#
unigrams.dfm <- dfm(doc.corpus, ngrams = 1,
                    ignoredFeatures = c(profanity, stopwords("english")),
                    removePunct = TRUE, removeNumbers = TRUE, removeTwitter = TRUE,
                    removeSeparators = TRUE, removeHyphens = TRUE, stem = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## Warning: closing unused connection 6 (/Users/markjack/capstone-project/
## Terms-to-Block.csv)
##
## ... indexing documents: 42,695 documents
## ... indexing features:
## Warning: closing unused connection 5 (/Users/markjack/capstone-project/
## en_US.twitter.txt)
## 53,046 feature types
## ... removed 173 features, from 894 supplied (glob) feature types
## ... stemming features (English), trimmed 16078 feature variants
## ... created a 42695 x 36795 sparse dfm
## ... complete.
## Elapsed time: 4.275 seconds.
#
bigrams.dfm <- dfm(doc.corpus, ngrams = 2,
                   ignoredFeatures = c(profanity, stopwords("english")),
                   removePunct = TRUE, removeNumbers = TRUE, removeTwitter = TRUE,
                   removeSeparators = TRUE, removeHyphens = TRUE, stem = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 42,695 documents
## ... indexing features: 444,694 feature types
## ... removed 237,935 features, from 894 supplied (glob) feature types
## ... stemming features (English), trimmed 5457 feature variants
## ... created a 42695 x 201302 sparse dfm
## ... complete.
## Elapsed time: 21.049 seconds.
#
trigrams.dfm <- dfm(doc.corpus, ngrams = 3,
                    ignoredFeatures = c(profanity, stopwords("english")),
                    removePunct = TRUE, removeNumbers = TRUE, removeTwitter = TRUE,
                    removeSeparators = TRUE, removeHyphens = TRUE, stem = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 42,695 documents
## ... indexing features: 779,437 feature types
## ... removed 665,056 features, from 894 supplied (glob) feature types
## ... stemming features (English), trimmed 181 feature variants
## ... created a 42695 x 114200 sparse dfm
## ... complete.
## Elapsed time: 33.544 seconds.
With the 'topfeatures' call on each of the unigram, bigram and trigram dfms, we obtain the 25 most frequent features in each set of ngrams. In three bar plots, we show the number of occurrences of each of the most common words or 2- or 3-word combinations as horizontal bars.
Unigrams:
# Create bar plots of 25 most frequent features in unigrams, bigrams and trigrams:
par(mar=c(4,4,4,2))
par(mfrow = c(1,1))
barplot(topfeatures(unigrams.dfm, 25), horiz=TRUE, las=1)
#dev.copy(png, file = "/Users/markjack/capstone-project/barplot_uni.png")
#dev.off()
Bigrams:
par(mar=c(4,8,4,2))
par(mfrow = c(1,1))
barplot(topfeatures(bigrams.dfm, 25), horiz=TRUE, las=1)
#dev.copy(png, file = "/Users/markjack/capstone-project/barplot_bi.png")
#dev.off()
Trigrams:
par(mar=c(4,12,4,2))
par(mfrow = c(1,1))
barplot(topfeatures(trigrams.dfm, 25), horiz=TRUE, las=1)
#dev.copy(png, file = "/Users/markjack/capstone-project/barplot_tri.png")
#dev.off()
We may further process the unigrams, bigrams and trigrams by removing infrequent entries, e.g. ngrams that occur 10 times or fewer:
unigrams.freq0 <- colSums(unigrams.dfm)
unigrams.freq <- sort(unigrams.freq0, decreasing=TRUE)
#
bigrams.freq0 <- colSums(bigrams.dfm)
bigrams.freq <- sort(bigrams.freq0, decreasing=TRUE)
#
trigrams.freq0 <- colSums(trigrams.dfm)
trigrams.freq <- sort(trigrams.freq0, decreasing=TRUE)
#
#-------------------------------------------------------------------------
frequency <- 10
#
unigrams.most <- as.numeric()
for (i in 1:length(unigrams.freq)) {
if (unigrams.freq[i] > frequency) {
unigrams.most <- c(unigrams.most, unigrams.freq[i]) }
}
length(unigrams.most)
## [1] 5542
#
bigrams.most <- as.numeric()
for (i in 1:length(bigrams.freq)) {
if (bigrams.freq[i] > frequency) {
bigrams.most <- c(bigrams.most, bigrams.freq[i]) }
}
length(bigrams.most)
## [1] 750
#
trigrams.most <- as.numeric()
for (i in 1:length(trigrams.freq)) {
if (trigrams.freq[i] > frequency) {
trigrams.most <- c(trigrams.most, trigrams.freq[i]) }
}
length(trigrams.most)
## [1] 15
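Because the frequency vectors are already sorted and named, the same filtering can also be written without explicit loops; a vectorized equivalent would be:

# Keep only ngrams occurring more than 'frequency' times; element names are preserved
unigrams.most <- unigrams.freq[unigrams.freq > frequency]
bigrams.most <- bigrams.freq[bigrams.freq > frequency]
trigrams.most <- trigrams.freq[trigrams.freq > frequency]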
Create data frames, select the 25 most frequent occurrences in the sorted lists of unigrams, bigrams and trigrams, and label the columns:
unigrams_most <- data.frame(unigrams.most)
unigrams_most[, 1] <- as.character(names(unigrams.most))
unigrams_most[, 2] <- as.numeric(unigrams.most)
unigrams_most0 <- unigrams_most[1:25,]
colnames(unigrams_most0) <- c("Word","Frequency")
row.names(unigrams_most0) <- NULL
head(unigrams_most0)
## Word Frequency
## 1 one 3308
## 2 will 3204
## 3 get 3125
## 4 said 3046
## 5 just 3031
## 6 like 2983
#
bigrams_most <- data.frame(bigrams.most)
bigrams_most[, 1] <- as.character(names(bigrams.most))
bigrams_most[, 2] <- as.numeric(bigrams.most)
bigrams_most0 <- bigrams_most[1:25,]
colnames(bigrams_most0) <- c("Word","Frequency")
row.names(bigrams_most0) <- NULL
head(bigrams_most0)
## Word Frequency
## 1 right_now 259
## 2 last_year 231
## 3 year_old 219
## 4 new_york 209
## 5 last_night 160
## 6 high_school 156
#
trigrams_most <- data.frame(trigrams.most)
trigrams_most[, 1] <- as.character(names(trigrams.most))
trigrams_most[, 2] <- as.numeric(trigrams.most)
trigrams_most0 <- trigrams_most[1:25,]
colnames(trigrams_most0) <- c("Word","Frequency")
row.names(trigrams_most0) <- NULL
head(trigrams_most0)
## Word Frequency
## 1 new_york_c 30
## 2 cinco_de_mayo 23
## 3 let_us_know 23
## 4 happy_new_year 20
## 5 two_years_ago 18
## 6 happy_mother's_day 18