This document contains an exploratory analysis of a corpus containing three documents. It also outlines the modeling approach that will be used to develop an app for predicting the next word given a phrase. Because of the memory and time requirements, I decided to use samples from the documents for the exploratory analyses.
# set seed for reproducible results; the seed number is also used as a suffix for the sample file names written out below
n <- 1
set.seed(n)
path <- "C:/Users/Hernan/Documents/CapstoneProject/Coursera-SwiftKey/final/en_US/"
get_sample <- function(filename, lineCount) {
    # filename = name of the file from which to sample lines
    # lineCount = number of lines to sample
    fcon <- file(filename)
    a <- readLines(fcon, ok = TRUE, skipNul = TRUE, warn = FALSE)
    close(fcon)
    sample(a, lineCount)
}
blogs_sample <- get_sample(paste(path, "en_US.blogs.txt", sep=""), 25000)
news_sample <- get_sample(paste(path, "en_US.news.txt", sep=""), 25000)
twitter_sample <- get_sample(paste(path, "en_US.twitter.txt", sep=""), 25000)
path <-"C:/Users/Hernan/Documents/CapstoneProject/Coursera-SwiftKey/en_samples/"
write_sample <- function(char_vector, filename) {
    # char_vector = name of a character vector
    # filename = name of the file to be written out to disk
    fcon <- file(filename)
    writeLines(char_vector, con=fcon)
    close(fcon)
}
write_sample(blogs_sample, paste(path, "en_US.blogs.sample", n, ".txt", sep=""))
write_sample(news_sample, paste(path, "en_US.news.sample", n, ".txt", sep=""))
write_sample(twitter_sample, paste(path, "en_US.twitter.sample", n, ".txt", sep=""))
rm(blogs_sample, news_sample, twitter_sample)
Use the sample files to create a volatile corpus, output the number of lines and characters of each sample, perform transformations, and create term document matrices.
#
# Create term matrix and term document matrix. Save them for future use.
#
# Load Text Mining Libraries
library(tm)
library(RTextTools)
#
# Load list of potentially offensive words to remove
# downloaded from https://www.cs.cmu.edu/~biglou/resources/bad-words.txt
#
fcon <- file("C:/Users/Hernan/Documents/CapstoneProject/badwords.txt", "r")
potentiallyOffensive <- as.vector(readLines(con = fcon))
close(fcon)
#
# Create volatile corpus and inspect
#
path <- "C:/Users/Hernan/Documents/CapstoneProject/Coursera-SwiftKey/en_samples"
docs <- VCorpus(DirSource(path))
#inspect(docs)
#
# Apply transformations, generate document term matrices and inspect them
#
# map special characters to hex codes
docs <- tm_map(docs, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
# remove numbers
docs <- tm_map(docs, removeNumbers)
# convert letters to lower case
docs <- tm_map(docs, content_transformer(tolower))
# remove commonly occurring words not useful for prediction
docs <- tm_map(docs, removeWords, stopwords("english"))
# remove punctuation marks
docs <- tm_map(docs, removePunctuation)
# remove potentially offensive words
docs <- tm_map(docs, removeWords, potentiallyOffensive)
# remove extra white space between words, leaving only one space
docs <- tm_map(docs, stripWhitespace)
# create term document matrix containing words and inspect it
tdm <- TermDocumentMatrix(docs)
#inspect(tdm)
# create term document matrix containing 2-grams and inspect it
biGrams <- function(x) {
    # tokenize into 2-grams; ngrams() and words() come from the NLP package,
    # which is attached automatically when tm is loaded
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
bg_tdm <- TermDocumentMatrix(docs, control = list(tokenize = biGrams))
#inspect(bg_tdm)
# create term document matrix containing 3-grams and inspect it
triGrams <- function(x) {
    # tokenize into 3-grams, again using ngrams() and words() from NLP
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
tg_tdm <- TermDocumentMatrix(docs, control = list(tokenize = triGrams))
#inspect(tg_tdm)
#
# Save objects for future use
#
save(path, docs, tdm, bg_tdm, tg_tdm,
file="C:/Users/Hernan/Documents/CapstoneProject/docstdms.Rdata")
Verify the size of the sample documents in the corpus, and print the number of terms, bigrams, and trigrams.
for (k in 1:3) {
    print(paste(docs[[k]]$meta$id, ", lines = ", length(docs[[k]]$content),
                ", characters = ", sum(nchar(docs[[k]]$content)), sep=""))
}
## [1] "en_US.blogs.sample1.txt, lines = 25000, characters = 3676096"
## [1] "en_US.news.sample1.txt, lines = 25000, characters = 3460512"
## [1] "en_US.twitter.sample1.txt, lines = 25000, characters = 1112117"
print(paste("Number of Terms =", tdm$nrow))
## [1] "Number of Terms = 89281"
print(paste("Number of Bigrams =", bg_tdm$nrow))
## [1] "Number of Bigrams = 930632"
print(paste("Number of Trigrams =", tg_tdm$nrow))
## [1] "Number of Trigrams = 1158120"
Removing sparse terms reduces the size of the term document matrices, which lowers memory requirements and response time. Additionally, some of the sparse terms are nonsensical. With only three documents, a term missing from even one of them has a sparsity of about 0.33, so a sparsity threshold of 0.1 removes every term that does not appear in all three documents. The following code does exactly that.
library(tm)
tdmsr <- removeSparseTerms(tdm, 0.1)
#inspect(tdmsr)
bg_tdmsr <- removeSparseTerms(bg_tdm, 0.1)
#inspect(bg_tdmsr)
tg_tdmsr <- removeSparseTerms(tg_tdm, 0.1)
#inspect(tg_tdmsr)
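As a quick check on the effect of the 0.1 sparsity threshold, the term counts before and after removal can be compared (a sketch using the objects created above; the resulting counts depend on the sample and are not shown here):
# compare the number of terms, bigrams and trigrams before and after
# removing sparse terms
print(paste("Terms:", tdm$nrow, "->", tdmsr$nrow))
print(paste("Bigrams:", bg_tdm$nrow, "->", bg_tdmsr$nrow))
print(paste("Trigrams:", tg_tdm$nrow, "->", tg_tdmsr$nrow))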
Zipf’s law postulates an inverse relationship between a word’s frequency in a corpus and its rank in the frequency table. This relationship can be examined by plotting the logarithm of the frequency versus the logarithm of the rank (https://en.Wikipedia.org/wiki/Zipf%27s_law).
library(tm)
Zipf_plot(tdm, main="Zipf's Law Plot", sub="Original Term Document Matrix")
## (Intercept) x
## 13.939736 -1.258647
Zipf_plot(tdmsr, main="Zipf's Law Plot", sub="Sparse Terms Removed")
## (Intercept) x
## 13.849162 -1.255564
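The coefficients reported by Zipf_plot correspond to a straight-line fit on the log-log scale, that is, log(frequency) ≈ intercept + slope × log(rank), with an ideal Zipf slope of -1; the fitted slopes of roughly -1.26 are reasonably close to that ideal. A minimal sketch that reproduces this kind of fit directly from the term document matrix (the dense conversion is affordable here because the matrix has only three columns):
# reproduce the log-log fit behind the Zipf plot for the original matrix
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)  # term frequency by rank
zipf_fit <- lm(log(freqs) ~ log(seq_along(freqs)))         # log(frequency) ~ log(rank)
print(coef(zipf_fit))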
Heaps’ law (https://en.Wikipedia.org/wiki/Heaps%27_law) relates the size of the vocabulary to the size of the text, that is, the number of distinct words to the total number of words. This relationship can also be explored with a log-log plot. Note that the plot in the Wikipedia article does not use log-transformed variables, but the function in the tm package does.
library(tm)
Heaps_plot(tdm, main="Heaps' Law Plot", sub="Original Term Document Matrix")
## (Intercept) x
## 2.7844883 0.6169066
Heaps_plot(tdmsr, main="Heaps' Law Plot", sub="Sparse Terms Removed")
## (Intercept) x
## 9.442166e+00 -4.081434e-15
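On the log scale the fitted coefficients can be read as the parameters of Heaps’ law, V ≈ K·n^beta, where n is the text size, V the vocabulary size, beta the slope, and log(K) the intercept; the slope of about 0.62 for the original matrix is in the range typically reported for English text. A minimal sketch of that reading, using the coefficients reported above (the text size below is an arbitrary illustration, not a corpus statistic):
# Heaps' law reading of the coefficients above: V = K * n^beta
K        <- exp(2.7844883)   # exp(intercept) from the fit on the original matrix
beta     <- 0.6169066        # slope from the same fit
n_tokens <- 1e6              # hypothetical text size in tokens, for illustration only
print(K * n_tokens^beta)     # predicted vocabulary size at that text size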
Find the most frequent terms, bigrams, and trigrams. Show the distributions of term, bigram, and trigram frequencies.
sortFrequent <- function(tdmObject) {
    # convert the term document matrix to a data frame of term counts
    x <- as.matrix(tdmObject)
    y <- as.data.frame(x)
    # shorten the column names, e.g. "en_US.blogs.sample1.txt" becomes "blog"
    colnames(y) <- substr(colnames(y), 7, 10)
    # add a total column and return the rows sorted by decreasing total frequency
    y$total <- rowSums(x)
    y[order(-y$total), ]
}
z <- sortFrequent(tdmsr)
print(z[1:10,])
## blog news twit total
## said 1043 6268 178 7489
## will 3048 2770 977 6795
## one 3557 2096 893 6546
## just 2769 1316 1498 5583
## can 2756 1472 972 5200
## like 2632 1236 1304 5172
## time 2391 1308 840 4539
## get 1960 1116 1186 4262
## new 1508 1682 705 3895
## people 1567 1224 547 3338
summary(z$total)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 10.00 22.00 76.47 57.00 7489.00
hist(log(z$total), main="Histogram", sub="Across All Documents",
xlab="Log of Term Frequency")
z <- sortFrequent(bg_tdmsr)
print(z[1:10,])
## blog news twit total
## last year 110 314 19 443
## new york 155 249 27 431
## right now 145 88 166 399
## years ago 132 176 26 334
## high school 65 225 27 317
## last week 116 179 16 311
## first time 129 108 36 273
## feel like 108 46 84 238
## new jersey 20 205 6 231
## make sure 121 71 38 230
summary(z$total)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 4.00 6.00 10.83 11.00 443.00
hist(log(z$total), main="Histogram", sub="Across All Documents",
xlab="Log of Bigram Frequency")
z <- sortFrequent(tg_tdmsr)
print(z[1:10,])
## blog news twit total
## new york city 24 27 2 53
## two years ago 16 26 2 44
## new york times 19 15 3 37
## first time since 7 18 1 26
## let us know 4 2 18 24
## past two years 2 18 1 21
## world war ii 8 12 1 21
## u u u 1 17 1 19
## couple years ago 9 8 1 18
## three years ago 3 14 1 18
summary(z$total)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 3.000 4.500 6.393 7.000 53.000
hist(log(z$total), main="Histogram", sub="Across All Documents",
xlab="Log of Trigram Frequency")
Words needed to cover 50% and 90% of the terms.
print(paste("Words needed for 50% =", round(0.5*tdmsr$nrow,0)))
## [1] "Words needed for 50% = 6304"
print(paste("Words needed for 90% =", round(0.9*tdmsr$nrow,0)))
## [1] "Words needed for 90% = 11348"
To evaluate how many words come from foreign languages, we would need foreign-language dictionaries. However, it seems more productive to use an English dictionary and filter out words that are not in it. This might exclude words and n-grams commonly used in the United States, such as “Cinco de Mayo.”
The documents provided for the project contain a lot of garbage: “words” consisting of special characters that are not used in English (or, as far as I could tell, in any language). I tried to eliminate them by keeping only terms that appear in all three documents, and I had some success with that; however, some of the garbage terms do appear in all three documents. I therefore plan to obtain additional, perhaps cleaner, documents. This will serve two purposes: first, to increase the vocabulary, and second, to help clean up the term document matrices.
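A minimal sketch of the dictionary-filtering idea described above, assuming a plain-text English word list with one word per line; the file name and its location are hypothetical, not project files:
# hypothetical English word list, one word per line (path is an assumption)
dict_file <- "C:/Users/Hernan/Documents/CapstoneProject/english-words.txt"
english <- tolower(readLines(dict_file, warn = FALSE))
# keep only rows of the term document matrix whose terms appear in the word list
keep <- which(Terms(tdmsr) %in% english)
tdm_english <- tdmsr[keep, ]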