title: "CapstoneProject.Rmd"
author: "Mark A. Jack"
date: "Nov. 11, 2016"
This report provides an initial exploratory analysis of the three text data files provided for the NLP capstone project. Several libraries are loaded. The creation of a corpus of documents from the three text files relies mostly on the library 'quanteda'. It makes it possible to quickly tokenize the corpus of documents, removing text features such as punctuation, numbers and extra white space, and lowercasing words. The processing time for the complete text data is considerable, so a corpus is created only from a sample of the documents. Unigrams, bigrams, trigrams and quadgrams are generated via quanteda's document-feature matrix (dfm), which allows for quick and easy analysis of the most frequently occurring ngrams.
The blogs, news and twitter data files are loaded via the call 'readLines' (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) and the number of lines in each text file is counted.
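The reading step itself is not echoed in the report. A minimal sketch of how it might look, assuming the three files sit in the working directory (the variable names nl_blogs, nl_news and nl_twitter are reused in the sampling code further below):
# Sketch (not the echoed code): read the three files and count their lines
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
nl_blogs   <- length(blogs);   nl_blogs
nl_news    <- length(news);    nl_news
nl_twitter <- length(twitter); nl_twitter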
## [1] 899288
## [1] 1010242
## [1] 2360148
We then collapse each text into a single stream of characters and count the number of characters in each data file after removing white space:
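Again only the counts are shown; one plausible way to obtain them (a sketch, not the original code) is:
# Sketch: character counts per file after stripping white space
sum(nchar(gsub("[[:space:]]+", "", blogs)))
sum(nchar(gsub("[[:space:]]+", "", news)))
sum(nchar(gsub("[[:space:]]+", "", twitter)))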
## [1] 207723792
## [1] 204233400
## [1] 164456321
A corpus is created from samples of each of the three text documents (blogs, news and twitter). For quick processing, a small sample size of 1% of each original document is selected. The key R library used here to accelerate the creation and handling of the corpus from a group of text documents is 'quanteda'.
# Create one corpus of text using the library 'quanteda'
require(quanteda)
sampleSize <- 0.01
set.seed(1234)   # for reproducible sampling
blogs.sample <- sample(blogs, nl_blogs*sampleSize)
news.sample <- sample(news, nl_news*sampleSize)
twitter.sample <- sample(twitter, nl_twitter*sampleSize)
doc.sample <- c(blogs.sample, news.sample, twitter.sample)
#
# Pad each sampled document with a leading and trailing space
doc.sample <- gsub("^", " ", doc.sample)
doc.sample <- gsub("$", " ", doc.sample)
#doc.sample <- gsub("[;,\\.\\_\\?\\!\\“\\”]", " ", doc.sample)
doc.corpus <- corpus(doc.sample)
The object sizes (in bytes) are printed to show the change in object size for the full blogs data, the sample taken from the blogs file, the combined sample created from the blogs, news and twitter samples, and finally the size of the corpus created from that combined sample:
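The printed sizes below were most likely produced with utils::object.size; a sketch of the corresponding calls:
# Sketch: object sizes of the raw blogs text, the blogs sample,
# the combined sample, and the corpus built from it
print(object.size(blogs))
print(object.size(blogs.sample))
print(object.size(doc.sample))
print(object.size(doc.corpus))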
## 260564320 bytes
## 2622656 bytes
## 8540624 bytes
## 11248776 bytes
A summary of the corpus of documents is available here:
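In quanteda this summary is produced with the corpus summary method, e.g. (a sketch; the argument limiting the display to five documents may differ between quanteda versions):
summary(doc.corpus, 5)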
## Corpus consisting of 42695 documents, showing 5 documents.
##
## Text Types Tokens Sentences
## text1 3 5 2
## text2 20 26 4
## text3 47 62 4
## text4 104 170 9
## text5 9 10 1
##
## Source: /Users/markjack/Desktop/Coursera Capstone Project/shiny apps/* on x86_64 by markjack
## Created: Mon Nov 21 02:07:27 2016
## Notes:
The corpus is tokenized: the text is transformed by removing profanity, lowercasing words, and removing numbers, punctuation, hyphens, separators and Twitter symbols; word stemming is not applied (stem = FALSE). A list of profanity terms is downloaded from the website: http://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool. English stopwords and contractions (e.g. terms like 'I'm', 'won't', 'Harry's') are kept in the text corpus, both to prevent the calculated probabilities for word combinations (ngrams) from being skewed by artefacts such as the single letters 's' or 't', and to stay close to how a human user would typically type words into a word prediction application. In the same step, unigrams, bigrams, trigrams and quadgrams are generated using quanteda's 'dfm', a document-feature matrix.
# Download list of words of profanity from Website
# http://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/
con <- file("Terms-to-Block.csv", encoding = "UTF-8")
profanity <- readLines(con, encoding = "UTF-8")
close(con)
#
# Tokenization: Create corpus of unigrams, bigrams, trigrams
#doc.tokens <- tokenize(toLower(doc.corpus), removePunct = TRUE, removeNumbers = TRUE,
# removeTwitter = TRUE, removeSeparators = TRUE, removeHyphens = TRUE)
#doc.nosw <- removeFeatures(doc.tokens, stopwords("english"))
#
unigrams.dfm <- dfm(doc.corpus, ngrams = 1, ignoredFeatures = c(profanity),
toLower = TRUE, removePunct = TRUE, removeNumbers = TRUE, concatenator = " ",
removeTwitter = TRUE, removeSeparators = TRUE, removeHyphens = TRUE, stem = FALSE)
## Creating a dfm from a corpus ...
##
## ... lowercasing
##
## ... tokenizing
##
## ... indexing documents: 42,695 documents
##
## ... indexing features:
## 53,046 feature types
##
## ...
## removed 0 features, from 720 supplied (glob) feature types
## ... created a 42695 x 53046 sparse dfm
## ... complete.
## Elapsed time: 2.31 seconds.
#
bigrams.dfm <- dfm(doc.corpus, ngrams = 2, ignoredFeatures = c(profanity),
toLower = TRUE, removePunct = TRUE, removeNumbers = TRUE, concatenator = " ",
removeTwitter = TRUE, removeSeparators = TRUE, removeHyphens = TRUE, stem = FALSE)
## Creating a dfm from a corpus ...
##
## ... lowercasing
##
## ... tokenizing
##
## ... indexing documents: 42,695 documents
##
## ... indexing features:
## 444,694 feature types
##
## ...
## removed 0 features, from 720 supplied (glob) feature types
## ... created a 42695 x 444694 sparse dfm
## ... complete.
## Elapsed time: 28.6 seconds.
#
trigrams.dfm <- dfm(doc.corpus, ngrams = 3, ignoredFeatures = c(profanity),
toLower = TRUE, removePunct = TRUE, removeNumbers = TRUE, concatenator = " ",
removeTwitter = TRUE, removeSeparators = TRUE, removeHyphens = TRUE, stem = FALSE)
## Creating a dfm from a corpus ...
##
## ... lowercasing
##
## ... tokenizing
##
## ... indexing documents: 42,695 documents
##
## ... indexing features:
## 779,437 feature types
##
## ...
## removed 0 features, from 720 supplied (glob) feature types
## ... created a 42695 x 779437 sparse dfm
## ... complete.
## Elapsed time: 40 seconds.
#
quadgrams.dfm <- dfm(doc.corpus, ngrams = 4, ignoredFeatures = c(profanity),
toLower = TRUE, removePunct = TRUE, removeNumbers = TRUE, concatenator = " ",
removeTwitter = TRUE, removeSeparators = TRUE, removeHyphens = TRUE, stem = FALSE)
## Creating a dfm from a corpus ...
##
## ... lowercasing
##
## ... tokenizing
##
## ... indexing documents: 42,695 documents
##
## ... indexing features:
## 858,242 feature types
##
## ...
## removed 0 features, from 720 supplied (glob) feature types
## ... created a 42695 x 858242 sparse dfm
## ... complete.
## Elapsed time: 45.6 seconds.
#
object.size(unigrams.dfm)
## 16308472 bytes
object.size(bigrams.dfm)
## 44960608 bytes
object.size(trigrams.dfm)
## 73020352 bytes
object.size(quadgrams.dfm)
## 84419032 bytes
With the 'topfeatures' call on each of the unigram, bigram, trigram and quadgram dfms, we obtain the 25 most frequent features in each set of ngrams. In four bar plots, we show the number of occurrences of each of the most common words or 2-, 3- or 4-word combinations (unigrams, bigrams, trigrams, quadgrams) as horizontal bars.
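For reference, the frequency tables behind the bar plots can be obtained with quanteda's topfeatures(), e.g. (a sketch):
# 25 most frequent features per dfm
topfeatures(unigrams.dfm, 25)
topfeatures(bigrams.dfm, 25)
topfeatures(trigrams.dfm, 25)
topfeatures(quadgrams.dfm, 25)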
We may further process the unigrams, bigrams, trigrams and quadgrams by removing infrequent entries, i.e. ngrams that do not occur more than twice:
head(unigrams.dfm, 3)
## Document-feature matrix of: 4,269 documents, 14,712 features.
## (showing first 3 documents and first 6 features)
## features
## docs babs i don't know maybe they're
## text1 1 0 0 0 0 0
## text2 0 3 1 1 1 1
## text3 0 0 0 0 3 0
head(bigrams.dfm, 3)
## Document-feature matrix of: 4,269 documents, 64,238 features.
## (showing first 3 documents and first 6 features)
## features
## docs i don't don't know know maybe maybe they're they're getting
## text1 0 0 0 0 0
## text2 1 1 1 1 1
## text3 0 0 0 0 0
## features
## docs getting too
## text1 0
## text2 1
## text3 0
head(trigrams.dfm, 3)
## Document-feature matrix of: 4,269 documents, 85,531 features.
## (showing first 3 documents and first 6 features)
## features
## docs i don't know don't know maybe know maybe they're
## text1 0 0 0
## text2 1 1 1
## text3 0 0 0
## features
## docs maybe they're getting they're getting too getting too much
## text1 0 0 0
## text2 1 1 1
## text3 0 0 0
head(quadgrams.dfm, 3)
## Document-feature matrix of: 4,269 documents, 85,673 features.
## (showing first 3 documents and first 6 features)
## features
## docs i don't know maybe don't know maybe they're
## text1 0 0
## text2 1 1
## text3 0 0
## features
## docs know maybe they're getting maybe they're getting too
## text1 0 0
## text2 1 1
## text3 0 0
## features
## docs they're getting too much getting too much sun
## text1 0 0
## text2 1 1
## text3 0 0
#
# Total count of each feature across all documents, sorted in decreasing order
unigrams.freq0 <- colSums(unigrams.dfm)
unigrams.freq <- sort(unigrams.freq0, decreasing=TRUE)
#
bigrams.freq0 <- colSums(bigrams.dfm)
bigrams.freq <- sort(bigrams.freq0, decreasing=TRUE)
#
trigrams.freq0 <- colSums(trigrams.dfm)
trigrams.freq <- sort(trigrams.freq0, decreasing=TRUE)
#
quadgrams.freq0 <- colSums(quadgrams.dfm)
quadgrams.freq <- sort(quadgrams.freq0, decreasing=TRUE)
#
#-------------------------------------------------------------------------
frequency <- 2
#
unigrams.most <- as.numeric()
for (i in 1:length(unigrams.freq)) {
if (unigrams.freq[i] > frequency) {
unigrams.most <- c(unigrams.most, unigrams.freq[i]) }
}
length(unigrams.most)
## [1] 4102
#
bigrams.most <- as.numeric()
for (i in 1:length(bigrams.freq)) {
if (bigrams.freq[i] > frequency) {
bigrams.most <- c(bigrams.most, bigrams.freq[i]) }
}
length(bigrams.most)
## [1] 4052
#
trigrams.most <- as.numeric()
for (i in 1:length(trigrams.freq)) {
if (trigrams.freq[i] > frequency) {
trigrams.most <- c(trigrams.most, trigrams.freq[i]) }
}
length(trigrams.most)
## [1] 796
#
quadgrams.most <- as.numeric()
for (i in 1:length(quadgrams.freq)) {
if (quadgrams.freq[i] > frequency) {
quadgrams.most <- c(quadgrams.most, quadgrams.freq[i]) }
}
length(quadgrams.most)
## [1] 67
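Because the frequency vectors are named and already sorted, the same filtering can also be written without a loop; a vectorized equivalent (a sketch) is:
# Keep only ngrams occurring more than 'frequency' times
unigrams.most  <- unigrams.freq[unigrams.freq > frequency]
bigrams.most   <- bigrams.freq[bigrams.freq > frequency]
trigrams.most  <- trigrams.freq[trigrams.freq > frequency]
quadgrams.most <- quadgrams.freq[quadgrams.freq > frequency]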
We then create data frames, select the 25 most frequent entries from the sorted lists of unigrams, bigrams, trigrams and quadgrams, and label the columns:
unigrams_most <- data.frame(unigrams.most)
unigrams_most[, 1] <- as.character(names(unigrams.most))
unigrams_most[, 2] <- as.numeric(unigrams.most)
unigrams_most0 <- unigrams_most[1:25,]
colnames(unigrams_most0) <- c("Word","Frequency")
row.names(unigrams_most0) <- NULL
head(unigrams_most0)
## Word Frequency
## 1 the 4665
## 2 to 2626
## 3 and 2368
## 4 a 2351
## 5 of 1948
## 6 i 1635
#
bigrams_most <- data.frame(bigrams.most)
bigrams_most[, 1] <- as.character(names(bigrams.most))
bigrams_most[, 2] <- as.numeric(bigrams.most)
bigrams_most0 <- bigrams_most[1:25,]
colnames(bigrams_most0) <- c("Word","Frequency")
row.names(bigrams_most0) <- NULL
head(bigrams_most0)
## Word Frequency
## 1 of the 449
## 2 in the 414
## 3 for the 221
## 4 to the 180
## 5 on the 174
## 6 to be 158
#
trigrams_most <- data.frame(trigrams.most)
trigrams_most[, 1] <- as.character(names(trigrams.most))
trigrams_most[, 2] <- as.numeric(trigrams.most)
trigrams_most0 <- trigrams_most[1:25,]
colnames(trigrams_most0) <- c("Word","Frequency")
row.names(trigrams_most0) <- NULL
head(trigrams_most0)
## Word Frequency
## 1 one of the 48
## 2 a lot of 30
## 3 thanks for the 26
## 4 going to be 24
## 5 i love you 20
## 6 call call call 20
#
quadgrams_most <- data.frame(quadgrams.most)
quadgrams_most[, 1] <- as.character(names(quadgrams.most))
quadgrams_most[, 2] <- as.numeric(quadgrams.most)
quadgrams_most0 <- quadgrams_most[1:25,]
colnames(quadgrams_most0) <- c("Word","Frequency")
row.names(quadgrams_most0) <- NULL
head(quadgrams_most0)
## Word Frequency
## 1 call call call call 18
## 2 at the same time 11
## 3 at the end of 9
## 4 the rest of the 8
## 5 when it comes to 7
## 6 one of the best 7
We have created bar plots of the unigram, bigram, trigram and quadgram occurrences in the selected text corpus:
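The plots themselves are not reproduced in this extract. A minimal sketch of how one of them could be drawn from the data frames built above, assuming ggplot2 is available:
library(ggplot2)
# Horizontal bars for the 25 most frequent unigrams
ggplot(unigrams_most0, aes(x = reorder(Word, Frequency), y = Frequency)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency", title = "25 most frequent unigrams")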
As a further illustration, here are word cloud plots for the four different ngram statistics:
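The warnings below indicate the clouds were drawn via wordcloud::wordcloud (as wrapped by quanteda's dfm plotting). An equivalent direct call, sketched from the frequency vectors computed above:
library(wordcloud)
# Sketch: word cloud of the 100 most frequent unigrams
wordcloud(names(unigrams.freq), unigrams.freq, max.words = 100, random.order = FALSE)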
## Warning in wordcloud::wordcloud(features(x), colSums(x), ...): a number of frequent
## quadgrams (e.g. 'call call call call', 'at the same time', 'one of the most',
## 'thanks for the follow') could not be fit on page and were not plotted.
Finally, this is a snapshot of the final interactive 'shiny' application: a word prediction app that uses Kneser-Ney smoothing, with continuation probabilities calculated from the individual ngram occurrences in the selected text corpus and probabilities estimated for rare ngrams that do not appear in the corpus. The app provides up to three of the most likely words to continue a sentence typed and submitted by the user.
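As an illustration of the continuation-probability idea mentioned above (not the app's actual code), the Kneser-Ney continuation probability of a word can be estimated from the bigram type counts already computed:
# Sketch: P_continuation(w) = (# distinct bigram types ending in w) / (# bigram types)
bigram.parts   <- strsplit(names(bigrams.freq0), " ", fixed = TRUE)
last.words     <- vapply(bigram.parts, function(x) x[length(x)], character(1))
cont.counts    <- table(last.words)
p.continuation <- cont.counts / length(bigrams.freq0)
head(sort(p.continuation, decreasing = TRUE), 3)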