Introduction

This document contains an exploratory analysis of a corpus of three documents. It also outlines the modeling approach that will be used to develop an app for predicting the next word given a phrase. Because of the memory and time requirements, I decided to use samples from the documents for the exploratory analyses.

# set seed for reproducible results. Use the seed number as a suffix for the sample file names in the last three lines.
n <- 1
set.seed(n)
path <- "C:/Users/Hernan/Documents/CapstoneProject/Coursera-SwiftKey/final/en_US/"
get_sample <- function(filename, lineCount) {
# filename = name of the file from which to sample lines
# lineCount = number of lines to sample 
      fcon <- file(filename)
      a <- readLines(fcon, ok = TRUE, skipNul = TRUE, warn = FALSE)
      close(fcon)
      sample(a, lineCount)
}
blogs_sample <- get_sample(paste(path, "en_US.blogs.txt", sep=""), 25000)
news_sample <- get_sample(paste(path, "en_US.news.txt", sep=""), 25000)
twitter_sample <- get_sample(paste(path, "en_US.twitter.txt", sep=""), 25000)
path <- "C:/Users/Hernan/Documents/CapstoneProject/Coursera-SwiftKey/en_samples/"
write_sample <- function(char_vector, filename) {
# char_vector = character vector containing the lines to be written
# filename = name of the file to be written out to disk
      fcon <- file(filename)
      writeLines(char_vector, con=fcon)
      close(fcon)
}
write_sample(blogs_sample, paste(path, "en_US.blogs.sample", n, ".txt", sep=""))
write_sample(news_sample, paste(path, "en_US.news.sample", n, ".txt", sep=""))
write_sample(twitter_sample, paste(path, "en_US.twitter.sample", n, ".txt", sep=""))
rm(blogs_sample, news_sample, twitter_sample)

Prepare Corpus and Term Document Matrices

Use the sample files to create a volatile corpus and term document matrices. Output the number of lines and the number of characters. Perform transformations and create the term document matrices.

#
# Create term matrix and term document matrix. Save them for future use.
#
# Load Text Mining Libraries
library(tm)
library(RTextTools)
#
# Load list of potentially offensive words to remove
# downloaded from https://www.cs.cmu.edu/~biglou/resources/bad-words.txt
#
fcon <- file("C:/Users/Hernan/Documents/CapstoneProject/badwords.txt", "r")
potentiallyOffensive <- as.vector(readLines(con = fcon))
close(fcon)
#
# Create volatile corpus and inspect
#
path <- "C:/Users/Hernan/Documents/CapstoneProject/Coursera-SwiftKey/en_samples"
docs <- VCorpus(DirSource(path))
#inspect(docs)
#
# Apply transformations, generate term document matrices and inspect them
#
# map special characters to hex codes
docs <- tm_map(docs, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
# remove numbers
docs <- tm_map(docs, removeNumbers)
# convert letters to lower case
docs <- tm_map(docs, content_transformer(tolower))
# remove commonly occurring words not useful for prediction
docs <- tm_map(docs, removeWords, stopwords("english"))
# remove punctuation marks
docs <- tm_map(docs, removePunctuation)
# remove potentially offensive words
docs <- tm_map(docs, removeWords, potentiallyOffensive)
# remove extra white space between words, leaving only one space
docs <- tm_map(docs, stripWhitespace)
# create term document matrix containing words and inspect it
tdm <- TermDocumentMatrix(docs)
#inspect(tdm)
# create term document matrix containing 2-grams and inspect it
biGrams <- function(x) {
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
bg_tdm <- TermDocumentMatrix(docs, control = list(tokenize = biGrams))
#inspect(bg_tdm)
# create term document matrix containing 3-grams and inspect it
triGrams <- function(x) {
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
tg_tdm <- TermDocumentMatrix(docs, control = list(tokenize = triGrams))
#inspect(tg_tdm)
#
# Save objects for future use
#
save(path, docs, tdm, bg_tdm, tg_tdm, 
     file="C:/Users/Hernan/Documents/CapstoneProject/docstdms.Rdata")

Explore File Size

Verify the size of the sample documents in the corpus. Print the number of terms, bigrams and trigrams.

for (k in 1:3) {
      print(paste(docs[[k]]$meta$id, ", lines = ", length(docs[[k]]$content),
            ", characters = ", sum(nchar(docs[[k]]$content)), sep=""))
}
## [1] "en_US.blogs.sample1.txt, lines = 25000, characters = 3676096"
## [1] "en_US.news.sample1.txt, lines = 25000, characters = 3460512"
## [1] "en_US.twitter.sample1.txt, lines = 25000, characters = 1112117"
print(paste("Number of Terms =", tdm$nrow))
## [1] "Number of Terms = 89281"
print(paste("Number of Bigrams =", bg_tdm$nrow))
## [1] "Number of Bigrams = 930632"
print(paste("Number of Trigrams =", tg_tdm$nrow))
## [1] "Number of Trigrams = 1158120"

Remove Sparse Terms

Removing sparse terms reduces the size of the term document matrix, which reduces memory requirements and response time. Additionally, some of the sparse terms are nonsensical. The following code removes terms that do not appear in all three documents: with only three documents, a term missing from even one of them has a sparsity of at least 1/3, so a threshold of 0.1 keeps only terms present in every document.

library(tm)
tdmsr <- removeSparseTerms(tdm, 0.1)
#inspect(tdmsr)
bg_tdmsr <- removeSparseTerms(bg_tdm, 0.1)
#inspect(bg_tdmsr)
tg_tdmsr <- removeSparseTerms(tg_tdm, 0.1)
#inspect(tg_tdmsr)

Explore the Distribution of Terms

Zipf’s law postulates an inverse relationship between a word’s frequency in a corpus and its rank in the frequency table. This relationship can be examined by plotting the logarithm of the frequency against the logarithm of the rank (https://en.Wikipedia.org/wiki/Zipf%27s_law).

library(tm)
Zipf_plot(tdm, main="Zipf's Law Plot", sub="Original Term Document Matrix")
## (Intercept)           x 
##   13.939736   -1.258647
Zipf_plot(tdmsr, main="Zipf's Law Plot", sub="Sparse Terms Removed")
## (Intercept)           x 
##   13.849162   -1.255564

Heaps’ law (https://en.Wikipedia.org/wiki/Heaps%27_law) relates the size of the vocabulary to the size of the text, that is, the number of distinct words to the total number of words. This relationship can also be explored with a log-log plot. Note that the plot in the Wikipedia article does not use log-transformed variables, but the function in the tm package does.

library(tm)
Heaps_plot(tdm, main="Heap's Law Plot", sub="Original Term Document Matrix")
## (Intercept)           x 
##   2.7844883   0.6169066
Heaps_plot(tdmsr, main="Heap's Law Plot", sub="Sparse Terms Removed")
##   (Intercept)             x 
##  9.442166e+00 -4.081434e-15

Explore Frequencies

Find the most frequent terms, bigrams and trigrams. Show the distribution of term, bigram and trigram frequencies.

sortFrequent <- function(tdmObject) {
# tdmObject = term document matrix whose rows are terms (or n-grams)
      x <- as.matrix(tdmObject)
      y <- as.data.frame(x)
      # shorten document names to "blog", "news" and "twit"
      colnames(y) <- substr(colnames(y), 7, 10)
      # total frequency across the three documents
      y$total <- rowSums(x)
      # return rows sorted by decreasing total frequency
      y[order(-y$total), ]
}
z <- sortFrequent(tdmsr)
print(z[1:10,])
##        blog news twit total
## said   1043 6268  178  7489
## will   3048 2770  977  6795
## one    3557 2096  893  6546
## just   2769 1316 1498  5583
## can    2756 1472  972  5200
## like   2632 1236 1304  5172
## time   2391 1308  840  4539
## get    1960 1116 1186  4262
## new    1508 1682  705  3895
## people 1567 1224  547  3338
summary(z$total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   10.00   22.00   76.47   57.00 7489.00
hist(log(z$total), main="Histogram", sub="Across All Documents",
     xlab="Log of Term Frequency")

z <- sortFrequent(bg_tdmsr)
print(z[1:10,])
##             blog news twit total
## last year    110  314   19   443
## new york     155  249   27   431
## right now    145   88  166   399
## years ago    132  176   26   334
## high school   65  225   27   317
## last week    116  179   16   311
## first time   129  108   36   273
## feel like    108   46   84   238
## new jersey    20  205    6   231
## make sure    121   71   38   230
summary(z$total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    4.00    6.00   10.83   11.00  443.00
hist(log(z$total), main="Histogram", sub="Across All Documents",
     xlab="Log of Bigram Frequency")

z <- sortFrequent(tg_tdmsr)
print(z[1:10,])
##                  blog news twit total
## new york city      24   27    2    53
## two years ago      16   26    2    44
## new york times     19   15    3    37
## first time since    7   18    1    26
## let us know         4    2   18    24
## past two years      2   18    1    21
## world war ii        8   12    1    21
## u u u               1   17    1    19
## couple years ago    9    8    1    18
## three years ago     3   14    1    18
summary(z$total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   3.000   4.500   6.393   7.000  53.000
hist(log(z$total), main="Histogram", sub="Across All Documents",
     xlab="Log of Trigram Frequency")

Estimate the number of unique words needed to cover 50% and 90% of the terms in the reduced term document matrix (computed here as a simple fraction of the vocabulary).

print(paste("Words needed for 50% =", round(0.5*tdmsr$nrow,0)))
## [1] "Words needed for 50% = 6304"
print(paste("Words needed for 90% =", round(0.9*tdmsr$nrow,0)))
## [1] "Words needed for 90% = 11348"

To evaluate how many words come from foreign languages, we would need foreign language dictionaries. However, it may be more productive to use an English dictionary and filter out words that are not in it, although this might exclude words and n-grams commonly used in the United States, such as “Cinco de Mayo.”
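
A minimal sketch of that filter is shown below. It assumes a plain-text English word list is available; english_words.txt is a hypothetical placeholder for any one-word-per-line dictionary file.

# Hypothetical English-dictionary filter; english_words.txt is a placeholder file
fcon <- file("C:/Users/Hernan/Documents/CapstoneProject/english_words.txt", "r")
englishWords <- readLines(con = fcon)
close(fcon)
# keep only rows of the term document matrix whose terms appear in the word list
tdm_en <- tdm[which(Terms(tdm) %in% englishWords), ]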

The documents provided for the project contain a lot of garbage: “words” consisting of special characters that are not used in the English language (or any language, as far as I could tell). I tried to eliminate them by keeping only terms that appear in all three documents, and had some success with that; however, some of the garbage terms appear in all three documents and therefore survive the filter. Thus, I plan to obtain additional, perhaps cleaner, documents. This will serve two purposes: first, to increase the vocabulary, and second, to help clean up the term document matrices.
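
In the meantime, one possible cleanup, sketched below, is to drop terms that contain anything other than lowercase letters (the corpus has already been lower-cased and stripped of numbers and punctuation). This is an illustration only, not the filtering used above.

# Possible cleanup sketch: keep only terms made up entirely of lowercase ASCII letters
keep <- grepl("^[a-z]+$", Terms(tdm))
tdm_letters <- tdm[which(keep), ]
print(paste("Terms kept =", tdm_letters$nrow, "of", tdm$nrow))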

Next Steps