Scope: preliminary text mining of the HC Corpora text set using the R statistical computing environment.

First version: 2016-12-01 00:32:22
Second version: 2016-12-02 21:06:58

Second version change: added a table reporting the number of lines and terms per input text.

Text mining aims at automating information extraction from large sets of free text, referred to as a corpus.

The data set underlying this report consisted of three distinct input text documents: blogs, news and tweets.

Human-written texts follow linguistic and semantic rules that are not trivial to emulate programmatically.

One approach to text mining, very briefly illustrated in this report, consists of transforming the input “raw” text data into a data structure amenable to standard analyses. Simple rules are used to parse the input text in order to create a data object reminiscent of a matrix. Such matrices are populated with the number of occurrences of individual terms or of series of n associated terms (referred to as n-grams).
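
As a toy illustration of this transformation (two made-up sentences, not the HC Corpora data), a document-term matrix and word bigram counts can be built with the tm and tau packages attached in the Dependencies section below:

#toy example: turning two short documents into a term-count matrix and word bigrams
toy <- c("the cat sat on the mat", "the dog sat on the rug")
#document-term matrix: one row per document, one column per term, cells are counts
as.matrix(DocumentTermMatrix(Corpus(VectorSource(toy))))
#word 2-grams (bigrams) counted with tau::textcnt
textcnt(toy, method = "string", n = 2L)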

“Term” is used to designate a unit of information, as opposed to “word”, because it already represents the outcome of data processing steps. One typical such step is stemming, which trims the prefixes and suffixes used to mark, e.g., gender and/or plurality in nouns.
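
A one-line illustration of stemming, using the SnowballC package attached in the Dependencies section below (toy words, not corpus data):

#stemming collapses inflected forms onto a common stem
wordStem(c("run", "runs", "running", "houses"), language = "english")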

Given the large size of the corpus, analyzing it in extenso is computationally time consuming. Sampling was therefore evaluated, in order to determine whether analyzing excerpts of the input corpus could lead to equally valid conclusions.

This report documents the extraction of the most frequent terms from the three input text categories, based on a sample whose size was determined to be reasonably representative of the whole input collection.

One major intermediate data outcome of this analysis was a so-called data frame named FreqDf, which contained the frequencies of the most frequent terms for each document category (885 terms).

The first question addressed with the FreqDf data was: can the number of occurrences of this set of 885 terms be used to classify an input text? A standard analytical method, Principal Component Analysis (PCA), was used to that end: PCA summarized the 885-dimensional input data into three components that reflect the greatest spread of the term-occurrence values. The easiest way to interpret such an analysis is a plot referred to as a score plot. Its three axes are the three main components (i.e. aggregates of the 885 underlying initial variables, which were the terms). Each dot on the plot represents a sample of the corpus, annotated with its original category, i.e. tweets, blogs or news. The fact that the dots cluster by category suggests that the occurrence values of these 885 terms might allow assigning an unknown input text to one of the three aforementioned categories. This leads to the next analyses, whose objective will be precisely to assess the extent to which a data model using term and/or n-gram frequency values could predict a text category accurately and sensitively.
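
A compact toy illustration of a PCA score plot, using the built-in iris data rather than the corpus (FactoMineR is attached in the Dependencies section below):

#toy PCA: 4 numeric variables summarized into components, individuals colored by a supplementary factor
res.toy <- PCA(cbind(iris[, 1:4], Species = iris$Species), quali.sup = 5, graph = FALSE)
plot(res.toy, choix = "ind", habillage = 5, label = "none")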

Dependencies

# attaching libraries
sapply(c("curl", "tm", "SnowballC", "tau", "knitr", "magrittr", "stringr", "plyr", "reshape2", "ggplot2", "gridExtra", "FactoMineR", "factoextra", "plotly", "foreach", "doParallel"), 
       library, character.only = TRUE)
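
If the packages' start-up messages are not wanted in the report, the same attach step can be wrapped as follows (a minor variant, assuming all listed packages are installed):

#same attach step, with start-up messages and the returned list suppressed
invisible(suppressPackageStartupMessages(
  sapply(c("curl", "tm", "SnowballC", "tau", "knitr", "magrittr", "stringr",
           "plyr", "reshape2", "ggplot2", "gridExtra", "FactoMineR",
           "factoextra", "plotly", "foreach", "doParallel"),
         library, character.only = TRUE)))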

Retrieving the data.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/"
arch <- "Coursera-SwiftKey.zip"
#creating a subdirectory under local file system
if(!file.exists("capstoneProj")){
  dir.create("capstoneProj")
}
#downloading
download.file(paste(url, arch, sep = ""), 
              destfile = "./capstoneProj/Coursera-SwiftKey.zip")
unzip("./capstoneProj/Coursera-SwiftKey.zip", exdir = "./capstoneProj")
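
On repeated runs, the download and extraction can be skipped when the archive is already present locally (a minimal guard reusing the url and arch variables above):

#optional: only download and unzip when the archive is not already on disk
zipPath <- "./capstoneProj/Coursera-SwiftKey.zip"
if(!file.exists(zipPath)){
  download.file(paste(url, arch, sep = ""), destfile = zipPath)
  unzip(zipPath, exdir = "./capstoneProj")
}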

Loading the corpus into memory and preprocessing the text documents

#creating a Corpus object with the tm package Corpus constructor
enCorpus <- Corpus(DirSource("capstoneProj/final/en_US/"))
#printing out the TextDocument content of the enCorpus object
kable(summary(enCorpus)[,-3])
                  Length Class
en_US.blogs.txt        2 PlainTextDocument
en_US.news.txt         2 PlainTextDocument
en_US.twitter.txt      2 PlainTextDocument
##preprocessing steps
#creating a custom preprocessing function aimed at replacing common contractions
removeContractions <- content_transformer(function(x) {
  str_replace_all(str_c(x), c("'m" = " am", "'re" = " are", "'s" = " is"))
})
#replacing contractions, stemming, stripping white space, removing punctuation, numbers and English stop words, and converting to lower case
enCorpus <- enCorpus %>% tm_map(removeContractions) %>% 
  tm_map(stemDocument) %>% tm_map(stripWhitespace) %>% 
  tm_map(removePunctuation) %>% tm_map(removeNumbers) %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(removeWords, stopwords("english"))
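
A quick sanity check of the preprocessing, printing the first two processed lines of the blogs document:

#inspecting a couple of processed lines
head(enCorpus[[1]]$content, 2)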

Highly frequent words in the corpus

#printing out the size of each input text document of the
#given enCorpus object in terms of lines.
corpSize <- as.data.frame(t(rbind(row.names(summary(enCorpus)), sapply(1:3, function(i) {length(enCorpus[[i]]$content)}))))

#the enCorpus corpus object can be used to infer the number of terms.
#the analysis was run on a 36-core linux computer.
#for each input text category of the enCorpus object, at most the first 1,000,000 lines were split into non-overlapping chunks of up to 100,000 lines.
#each chunk was transformed into a DTM (DocumentTermMatrix) and passed to the tm nTerms function, which derives the number of terms per DTM.
#note: summing nTerms over chunks counts a term once per chunk in which it appears, so the totals below are upper bounds on the numbers of unique terms.
registerDoParallel(18)
TermNumberList <- lapply(1:3, function(i) {
  nLines <- length(enCorpus[[i]]$content)
  foreach(j = 1:10, .combine = rbind) %dopar% {
    li <- ((j - 1) * 100000) + 1
    hi <- min(j * 100000, nLines)
    #chunks starting past the end of the document contribute 0 terms
    if (li > nLines) 0L else
      nTerms(DocumentTermMatrix(Corpus(VectorSource(
        PlainTextDocument(enCorpus[[i]][[1]][li:hi])))))
  }
})
corpSize <- cbind(corpSize, t(as.data.frame(lapply(TermNumberList, colSums))))
colnames(corpSize) <- c("inputFileName", "Lines", "Terms")
kable(corpSize, caption = "Lines and terms per input text file", row.names = F)
Lines and terms per input text file

inputFileName        Lines     Terms
en_US.blogs.txt      899288    928107
en_US.news.txt       1010242   924254
en_US.twitter.txt    2360148   585012
#assessing the minimum sample size at which the inter-sample variability flattens out for the en_US.blogs input.
set.seed(112916)
n5 = 5000
l5 <- lapply(1:10, function(i) {termFreq(PlainTextDocument(enCorpus[[1]][[1]][as.numeric(sample(length(enCorpus[[1]]$content), n5))]))})
l5 <- lapply(1:10, function(i) {as.data.frame(head(l5[[i]][order(-l5[[i]])], n = 10))})
freq5 <- na.omit(join_all(l5, by = 'txt', type = 'left'))
row.names(freq5) <- freq5$txt
g1.1 <- ggplot(melt(t(freq5[, -1] / colSums(freq5[, -1]))), aes(Var2, value, fill = Var2)) +
  geom_boxplot() +
  ggtitle("Sampling distribution of the top frequent words in the en_US.blogs input, n = 5,000") +
  xlab("top frequent words") + theme(legend.title = element_blank())

n20 = 20000
l20 <- lapply(1:10, function(i) {termFreq(PlainTextDocument(enCorpus[[1]][[1]][as.numeric(sample(length(enCorpus[[1]]$content), n20))]))})
l20 <- lapply(1:10, function(i) {as.data.frame(head(l20[[i]][order(-l20[[i]])], n = 10))})
freq20 <- na.omit(join_all(l20, by = 'txt', type = 'left'))
row.names(freq20) <- freq20$txt
g1.2 <- ggplot(melt(t(freq20[, -1] / colSums(freq20[, -1]))), aes(Var2, value, fill = Var2)) +
  geom_boxplot() +
  ggtitle("Sampling distribution of the top frequent words in the en_US.blogs input, n = 20,000") +
  xlab("top frequent words") + theme(legend.title = element_blank())

n50 = 50000
l50 <- lapply(1:10, function(i) {termFreq(PlainTextDocument(enCorpus[[1]][[1]][as.numeric(sample(length(enCorpus[[1]]$content), n50))]))})
l50 <- lapply(1:10, function(i) {as.data.frame(head(l50[[i]][order(-l50[[i]])], n = 10))})
freq50 <- na.omit(join_all(l50, by = 'txt', type = 'left'))
row.names(freq50) <- freq50$txt
g1.3 <- ggplot(melt(t(freq50[, -1] / colSums(freq50[, -1]))), aes(Var2, value, fill = Var2)) +
  geom_boxplot() +
  ggtitle("Sampling distribution of the top frequent words in the en_US.blogs input, n = 50,000") +
  xlab("top frequent words") + theme(legend.title = element_blank())

n100 = 100000
l100 <- lapply(1:10, function(i) {termFreq(PlainTextDocument(enCorpus[[1]][[1]][as.numeric(sample(length(enCorpus[[1]]$content), n100))]))})
l100 <- lapply(1:10, function(i) {as.data.frame(head(l100[[i]][order(-l100[[i]])], n = 10))})
freq100 <- na.omit(join_all(l100, by = 'txt', type = 'left'))
row.names(freq100) <- freq100$txt
g1.4 <- ggplot(melt(t(freq100[, -1] / colSums(freq100[, -1]))), aes(Var2, value, fill = Var2)) +
  geom_boxplot() +
  ggtitle("Sampling distribution of the top frequent words in the en_US.blogs input, n = 100,000") +
  xlab("top frequent words")

grid.arrange(g1.1, g1.2, g1.3, g1.4, ncol = 2,
             top = "Sampling distribution of most frequent words in the en blogs doc")

The sampling distributions above suggested that a sample of 100,000 lines from the en_US.blogs input document was sufficient for the subsequent analyses.

#same analysis for the 2nd input document element of the enCorpus object (en_US.news)
#assessing the minimum sample size at which the inter-sample variability flattens out for the en_US.news input.
l5.2 <- lapply(1:10, function(i) {termFreq(PlainTextDocument(enCorpus[[2]][[1]][as.numeric(sample(length(enCorpus[[2]]$content), n5))]))})
l5.2 <- lapply(1:10, function(i) {as.data.frame(head(l5.2[[i]][order(-l5.2[[i]])], n = 10))})
freq5.2 <- na.omit(join_all(l5.2, by = 'txt', type = 'left'))
row.names(freq5.2) <- freq5.2$txt
g2.1 <- ggplot(melt(t(freq5.2[, -1] / colSums(freq5.2[, -1]))), aes(Var2, value, fill = Var2)) +
  geom_boxplot() +
  ggtitle("Sampling distribution of the top frequent words in the en_US.news input, n = 5,000") +
  xlab("top frequent words") + theme(legend.title = element_blank())

l20.2 <- lapply(1:10, function(i) {termFreq(PlainTextDocument(enCorpus[[2]][[1]][as.numeric(sample(length(enCorpus[[2]]$content), n20))]))})
l20.2 <- lapply(1:10, function(i) {as.data.frame(head(l20.2[[i]][order(-l20.2[[i]])], n = 10))})
freq20.2 <- na.omit(join_all(l20.2, by = 'txt', type = 'left'))
row.names(freq20.2) <- freq20.2$txt
g2.2 <- ggplot(melt(t(freq20.2[, -1] / colSums(freq20.2[, -1]))), aes(Var2, value, fill = Var2)) +
  geom_boxplot() +
  ggtitle("Sampling distribution of the top frequent words in the en_US.news input, n = 20,000") +
  xlab("top frequent words") + theme(legend.title = element_blank())

l50.2 <- lapply(1:10, function(i) {termFreq(PlainTextDocument(enCorpus[[2]][[1]][as.numeric(sample(length(enCorpus[[2]]$content), n50))]))})
l50.2 <- lapply(1:10, function(i) {as.data.frame(head(l50.2[[i]][order(-l50.2[[i]])], n = 10))})
freq50.2 <- na.omit(join_all(l50.2, by = 'txt', type = 'left'))
row.names(freq50.2) <- freq50.2$txt
g2.3 <- ggplot(melt(t(freq50.2[, -1] / colSums(freq50.2[, -1]))), aes(Var2, value, fill = Var2)) +
  geom_boxplot() +
  ggtitle("Sampling distribution of the top frequent words in the en_US.news input, n = 50,000") +
  xlab("top frequent words") + theme(legend.title = element_blank())

l100.2 <- lapply(1:10, function(i) {termFreq(PlainTextDocument(enCorpus[[2]][[1]][as.numeric(sample(length(enCorpus[[2]]$content), n100))]))})
l100.2 <- lapply(1:10, function(i) {as.data.frame(head(l100.2[[i]][order(-l100.2[[i]])], n = 10))})
freq100.2 <- na.omit(join_all(l100.2, by = 'txt', type = 'left'))
row.names(freq100.2) <- freq100.2$txt
g2.4 <- ggplot(melt(t(freq100.2[, -1] / colSums(freq100.2[, -1]))), aes(Var2, value, fill = Var2)) +
  geom_boxplot() +
  ggtitle("Sampling distribution of the top frequent words in the en_US.news input, n = 100,000") +
  xlab("top frequent words")

grid.arrange(g2.1, g2.2, g2.3, g2.4, ncol = 2,
             top = "Sampling distribution of most frequent words in the news doc")

First assessment of the inter-document variability

#subsetting the enCorpus VCorpus object and filling a list object with 5 samples of n100 lines per document category
enCorpus_subl <- lapply(1:5, function(i) {
  Corpus(VectorSource(c(
    PlainTextDocument(enCorpus[[1]][[1]][as.numeric(sample(length(enCorpus[[1]]$content), n100))]),
    PlainTextDocument(enCorpus[[2]][[1]][as.numeric(sample(length(enCorpus[[2]]$content), n100))]),
    PlainTextDocument(enCorpus[[3]][[1]][as.numeric(sample(length(enCorpus[[3]]$content), n100))]))))
})

#instantiating a dtm featuring the tf-idf value per term in the enCorpus
enDtm1.2 <- DocumentTermMatrix(
  Corpus(VectorSource(c(
    PlainTextDocument(enCorpus[[1]][[1]][as.numeric(sample(length(enCorpus[[1]]$content), n100))]),
    PlainTextDocument(enCorpus[[2]][[1]][as.numeric(sample(length(enCorpus[[2]]$content), n100))]),
    PlainTextDocument(enCorpus[[3]][[1]][as.numeric(sample(length(enCorpus[[3]]$content), n100))])))),
  control = list(weighting = function(x) weightTfIdf(x, normalize = F)))

enDtm1.1_list <- lapply(enCorpus_subl, DocumentTermMatrix)

#subsetting each dtm in enDtm1.1_list for the most frequent terms of the enCorpus object, as selected with findFreqTerms on the tf-idf weighted enDtm1.2
enDtm1.1_HiFreq_list <- lapply(1:5, function(i) {
  as.data.frame(t(as.matrix(
    enDtm1.1_list[[i]][, colnames(enDtm1.1_list[[i]]) %in%
                         findFreqTerms(enDtm1.2, lowfreq = 20, highfreq = Inf)])))
})

#adding the document category variable
enDtm1.1_HiFreq_list <- lapply(1:5, function(i) enDtm1.1_HiFreq_list[[i]] <- cbind(row.names(enDtm1.1_HiFreq_list[[i]]), enDtm1.1_HiFreq_list[[i]]))

#joining the samples' set
FreqDf <- na.omit(join_all(enDtm1.1_HiFreq_list, by = "row.names(enDtm1.1_HiFreq_list[[i]])", type = 'left'))

#editing the FreqDf df object
row.names(FreqDf) <- FreqDf$`row.names(enDtm1.1_HiFreq_list[[i]])`
colnames(FreqDf)[-1] <- c(sapply(1:5, function(i) c(paste("blog",i, sep = ""), paste("news",i, sep = ""), paste("tweet",i, sep = ""))))

#transposing
FreqDft<- as.data.frame(t(FreqDf[,-1]))

#adding an input text source variable to the FreqDft df  
FreqDft$DocSource <- as.factor(gsub("\\d*", "", row.names(FreqDft)))

#PCA
res.pca <- PCA(FreqDft, quali.sup = dim(FreqDft)[2], graph = F)

#getting the PCA scores.
pcaScore <- as.data.frame(res.pca$ind$coord)
row.names(pcaScore) <- row.names(FreqDft)

#plotlying.
plot_ly(pcaScore, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, text = row.names(pcaScore), 
        type = "scatter3d", mode = "markers", color = FreqDft$DocSource) %>% 
  layout(title = "PCA score plot of 15 input docs on the 882 most frequent corpus terms")

This simple exploration suggests that the term content of the input corpus might allow predicting the document category.
In addition to single-term frequencies, n-gram distributions might allow predicting the document category with better specificity and sensitivity.
Non-linear models (e.g. random forest and/or SVM) will be used to train a classifier and cross-validate it on held-out test samples.
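
As a preview of that modeling step, a minimal sketch of a random forest fit on the FreqDft table follows; it assumes the randomForest package (not loaded above) is installed and, with only 15 samples, it is an illustration rather than a validated model:

#illustrative sketch only: fitting a random forest on FreqDft to predict DocSource
#assumes the randomForest package is installed; with 15 samples this is a toy fit
library(randomForest)
set.seed(112916)
rfFit <- randomForest(x = FreqDft[, -ncol(FreqDft)], y = FreqDft$DocSource, ntree = 500)
print(rfFit) #the out-of-bag error gives a first, rough proxy for cross-validation performance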