Data Science Capstone - Data Exploration

Summary

The datasets are reduced in size by sampling and loaded into the standard corpus format for text mining. The data are cleaned to allow basic statistical analysis of the sample texts. Plots have been generated for the most frequent words and, as an example of an n-gram, for 4-grams, that is 4-word tokens. The next step is to generate 2-, 3- and 4-grams as reference data for predicting the text to be typed.

Introduction and approach

The main idea for creating the text prediction application is to analyze the text files, extract the most common sequences of words (n-grams) and store these in objects as reference for the application. For larger “n” (say 3 or 4) the number of combinations is very high, and memory and processing time are expected to become an issue, so fallback approaches need to be considered. At this stage the ideas are to move from larger n’s to lower ones and to consider word stemming, clustering or similarity analysis of the words in the texts.
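The fallback from larger to lower n can be illustrated with a minimal sketch. It is not the application's implementation: the three reference tables and the function name predict_next are hypothetical, and the entries are invented toy values rather than results from the data.

```r
# Hypothetical toy reference tables. Names are the preceding words (the context),
# values are the word to suggest next. Real tables would come from n-gram counts.
ref_ctx3 <- c("thanks for the" = "follow", "at the end" = "of")  # from 4-grams
ref_ctx2 <- c("for the" = "first", "one of" = "the")             # from 3-grams
ref_ctx1 <- c("the" = "same", "of" = "the")                      # from 2-grams

# Back-off lookup: try the longest context first, then fall back to shorter ones.
predict_next <- function(text, refs = list(ref_ctx3, ref_ctx2, ref_ctx1)) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  for (ref in refs) {
    n <- length(strsplit(names(ref)[1], " ")[[1]])  # context length of this table
    if (length(words) >= n) {
      key <- paste(tail(words, n), collapse = " ")
      if (key %in% names(ref)) return(unname(ref[key]))
    }
  }
  NA_character_  # no table contains a matching context
}

predict_next("Thanks for the")   # matched in the 3-word context table
predict_next("waiting for the")  # falls back to the 2-word context table
predict_next("part of")          # falls back to the 1-word context table
```

Storing only the most frequent continuation per context keeps the reference objects small, at the cost of offering a single suggestion per lookup.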

After getting familiar with the task of creating a text prediction application, with text mining in general and the “tm” package in particular, and after some basic data exploration, the approach taken so far is basically the following:

As the data sets are very large and processing takes a long time and/or hits memory limits, samples of 1%, 3% and 10% of the original size have been created. These are stored in specific directories, for example “en_US_1per”.
There are different options for building variations of the class “corpus”, which is key for text mining. However, technical issues have been encountered, in particular when analyzing n-grams: this option appears to work only when building a “VCorpus” from a vector source, meaning that each individual tweet, news item and blog entry becomes a text document within the corpus.
This allows the analysis of smaller chunks and comparison of the results.
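The report does not show how the samples were drawn. One possible way is sketched below, assuming simple random sampling of lines; the function name sample_lines and the seed value are assumptions made for illustration only.

```r
# Draw a random fraction of the lines of a text file (given as a character vector).
sample_lines <- function(lines, fraction, seed = 1234) {
  set.seed(seed)                                    # makes the sample reproducible
  n_keep <- max(1, round(length(lines) * fraction))
  lines[sort(sample.int(length(lines), n_keep))]    # keep the original line order
}

# Toy input standing in for the result of readLines() on one of the source files.
toy_lines <- paste("line", 1:1000)
one_percent <- sample_lines(toy_lines, 0.01)
length(one_percent)  # 10
```

The sampled vector could then be written with writeLines() into a directory such as “en_US_1per”.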

Function definitions for building Corpora from sample text files.

Corpus for directory-source
VCorpus for directory-source
Corpus for vector-source
VCorpus for vector-source

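library(tm)       # corpus classes, tm_map, DocumentTermMatrix
library(RWeka)    # NGramTokenizer
library(ggplot2)  # plots

# Corpus built from all files in a directory that match a file-name pattern
createCorpus_D <- function(filepath, filename) {
  ds <- DirSource(filepath, pattern = filename, mode = "text", encoding = "UTF-8")
  Corpus(ds, readerControl = list(reader = readPlain, language = "en_US", load = TRUE))
}

# Same source, but returning a VCorpus
createVCorpus_D <- function(filepath, filename) {
  ds <- DirSource(filepath, pattern = filename, mode = "text", encoding = "UTF-8")
  VCorpus(ds, readerControl = list(reader = readPlain, language = "en_US", load = TRUE))
}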

# Corpus built from a vector source: every line of the file becomes one document
createCorpus <- function(filepath) {
  conn <- file(filepath, "r")
  fulltext <- readLines(conn)
  close(conn)

  vs <- VectorSource(fulltext)
  Corpus(vs, readerControl = list(reader = readPlain, language = "en_US", load = TRUE))
}

# Same source, but returning a VCorpus (needed for the n-gram tokenizers)
createVCorpus <- function(filepath) {
  conn <- file(filepath, "r")
  fulltext <- readLines(conn)
  close(conn)

  vs <- VectorSource(fulltext)
  VCorpus(vs, readerControl = list(reader = readPlain, language = "en_US", load = TRUE))
}

all_corp <- createCorpus_D("./en_US_1per/",NULL)
# str(all_corp)                          
# object.size(all_corp)

Loading and cleaning the text files in a single simple corpus

Two helper functions have been defined: one turns a given character string into a space, the other replaces it with another string. These are used in combination with existing functions (e.g. “removePunctuation”) to clean up the text files; “replaceBy” is defined but not used in this document. The function ‘tolower’, which also requires a content_transformer wrapper, takes a very long processing time. This led to the idea of replacing each uppercase character by its lowercase counterpart separately (this did in fact run much faster, but would require more code).

In this case the 1% sample is being used.

cleanup <- function(corp) {
  
  toSpace <- content_transformer(function(x, pattern) {return(gsub(pattern, " ", x))})
  replaceBy <- content_transformer(function(x, pat, rep_pat) {return(gsub(pat, rep_pat, x))})

  corp <- tm_map(corp, toSpace, ";")
  corp <- tm_map(corp, toSpace, "`")
  corp <- tm_map(corp, toSpace, "´")
  corp <- tm_map(corp, toSpace, "-")
  corp <- tm_map(corp, toSpace, "–")
  corp <- tm_map(corp, toSpace, "—")
  corp <- tm_map(corp, toSpace, "”")
  corp <- tm_map(corp, toSpace, "“")

  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp,content_transformer(tolower))
  # corp <- tm_map(corp,tolower)
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, stripWhitespace)

  corp  # return the cleaned corpus explicitly
}

all_corp <- cleanup(all_corp)
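Another conceivable way around the slow ‘tolower’ step is to clean the raw character vector before it is wrapped in a corpus, because base R string functions are vectorised. The sketch below is an assumption, not the approach used in this report; the function name preclean is hypothetical and no timings have been measured.

```r
# Clean a plain character vector of text lines before building a corpus from it.
preclean <- function(lines) {
  lines <- tolower(lines)                              # vectorised over all lines
  # the same characters that cleanup() turns into spaces, written as escapes
  lines <- gsub("[;`\u00b4\u2013\u2014\u201c\u201d-]", " ", lines)
  lines <- gsub("[[:punct:]]", "", lines)              # remaining punctuation
  lines <- gsub("[[:digit:]]", "", lines)              # numbers
  trimws(gsub("\\s+", " ", lines))                     # collapse and trim whitespace
}

preclean("Hello - World; it's 2015!")  # "hello world its"
```

Unlike stripWhitespace, this sketch also trims leading and trailing spaces. The cleaned vector could be passed to VectorSource() in place of the raw lines.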

Basic Statistical analysis

Basic KPIs for the sample dataset have been determined by extracting information from the DocumentTermMatrix. The 20 most frequent words are shown in a plot.

all_DTM <- DocumentTermMatrix(all_corp)

all_mat <- as.matrix(all_DTM)
all_fr <- colSums(all_mat)
all_ord <- order(all_fr,decreasing=TRUE)
all_sort <- all_fr[all_ord]
all_top20 <- head(all_sort,20)
all_sum <- sum(all_fr)
all_words <- length(all_fr)
all_mean <- mean(all_fr)
all_median <- median(all_fr)
print(c("Total words in sample",all_sum, "Unique words in Sample", all_words))

## [1] "Total words in sample"  "778458"                
## [3] "Unique words in Sample" "54842"

print(c("Average number of word occurrences in sample",all_mean))

## [1] "Average number of word occurrences in sample"
## [2] "14.1945589147004"

print(c("Median number of word occurrences in sample", all_median))

## [1] "Median number of word occurrences in sample"
## [2] "1"

all_df <-  data.frame(word=names(all_top20),occurrences=all_top20)
all_plot <- ggplot(all_df, aes(word, occurrences)) + geom_bar(stat="identity")
all_plot <- all_plot + theme(axis.text.x=element_text(angle=45, hjust=1))
all_plot <- all_plot + scale_x_discrete(limits = all_df$word[]) 
all_plot <- all_plot + xlab("20 most frequent words") + ylab("occurrences") + 
              ggtitle("1% of all English texts")
all_plot

all_sort_sum <- cumsum(all_sort)
all_cumsum_df <- data.frame(1:all_words, all_sort_sum)

all_cumsum_plot <- ggplot(all_cumsum_df, aes(X1.all_words,all_sort_sum)) + geom_line()
all_cumsum_plot <- all_cumsum_plot + geom_hline(yintercept = all_sum/2) +  
                    scale_x_log10()
all_cumsum_plot <- all_cumsum_plot + xlab("words ordered by highest to lowest frequency, log-scale!") +
                    ylab("cumulative sum of occurrences") + ggtitle("1% of all English texts") +
                    ylim(0,750000)
all_cumsum_plot

## Warning: Removed 28458 rows containing missing values (geom_path).

Investigating the impact of stop-words.

As can be seen above, the most frequent words are all so-called ‘stop-words’. The following investigates how the word frequencies change when stop-words are removed. For this project this is more of academic interest, as stop-words also appear in the text being typed and should therefore be part of the prediction. For other text analytics purposes, for example analyzing content by the simple approach of counting key-words, removing stop-words is very sensible.

The analysis from above is being repeated with stop-words being removed.

all_ns_corp <- tm_map(all_corp,removeWords, stopwords("english")) 

all_ns_DTM <- DocumentTermMatrix(all_ns_corp)

# findFreqTerms(all_ns_DTM, 1000)
all_ns_mat <- as.matrix(all_ns_DTM)
all_ns_fr <- colSums(all_ns_mat)
all_ns_ord <- order(all_ns_fr,decreasing=TRUE)
all_ns_sort <- all_ns_fr[all_ns_ord]
all_ns_top20 <- head(all_ns_sort,20)
all_ns_sum <- sum(all_ns_fr)
all_ns_words <- length(all_ns_fr)
all_ns_mean <- mean(all_ns_fr)
all_ns_median <- median(all_ns_fr)
print(c("Total words in sample",all_ns_sum, "Unique words in Sample", all_ns_words))

## [1] "Total words in sample"  "541328"                
## [3] "Unique words in Sample" "54486"

all_ns_df <-  data.frame(word=names(all_ns_top20),occurrences=all_ns_top20)
all_ns_plot <- ggplot(all_ns_df, aes(word, occurrences)) + geom_bar(stat="identity")
all_ns_plot <- all_ns_plot + theme(axis.text.x=element_text(angle=45, hjust=1))
all_ns_plot <- all_ns_plot + scale_x_discrete(limits = all_ns_df$word[]) 
all_ns_plot

all_ns_sort_sum <- cumsum(all_ns_sort)
all_ns_cumsum_df <- data.frame(1:all_ns_words, all_ns_sort_sum)

all_ns_cumsum_plot <- ggplot(all_ns_cumsum_df, aes(X1.all_ns_words,all_ns_sort_sum)) + geom_line()
all_ns_cumsum_plot <- all_ns_cumsum_plot + geom_hline(yintercept = all_ns_sum/2) + scale_x_log10()
all_ns_cumsum_plot <- all_ns_cumsum_plot + 
                      xlab("words ordered by highest to lowest frequency, log-scale!") +
                      ylab("cumulative sum of occurrences") + ggtitle("1% of all English texts") +
                      ylim(0,500000)
all_ns_cumsum_plot

## Warning: Removed 35435 rows containing missing values (geom_path).

As can be seen above, the distribution of the “non-stop-words” is much flatter, and the cumulative sum reaches the 50% mark only at about 1000 words.
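The 50% mark read off the plots can also be computed directly. The following is a small sketch on an invented toy frequency vector; the function name words_for_coverage is hypothetical. In this report the input would be a frequency vector such as all_fr or all_ns_fr.

```r
# Number of distinct words needed to cover a given share of all word occurrences.
words_for_coverage <- function(freq, share = 0.5) {
  sorted <- sort(freq, decreasing = TRUE)
  unname(which(cumsum(sorted) >= share * sum(sorted))[1])
}

# Toy frequencies (100 occurrences in total), dominated by a few frequent words.
toy_fr <- c(the = 50, and = 20, to = 15, cat = 5, dog = 5, sat = 3, mat = 2)
words_for_coverage(toy_fr)       # 1: "the" alone covers half of all occurrences
words_for_coverage(toy_fr, 0.8)  # 3
```

Comparing this number with and without stop-words gives a single figure for how much flatter the distribution becomes.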

Analyzing n-gram occurrences

The DocumentTermMatrix can also be created using so-called tokenizers, in this case from the RWeka package. That way n-word tokens, so-called n-grams, are analyzed instead of single words. As already stated above, one finding has been that the tokenizer only works and has an effect for a VCorpus class built from a vector source. For now the analysis is applied to the ‘news’ sample file. (Remark: creating the DTM led to a crash in function ‘tolower’, which is strange, as ‘tolower’ has been applied before.)
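To make clear what an n-gram tokenizer produces, here is a minimal base R sketch that does not use RWeka. The function name make_ngrams is hypothetical and the input lines are invented toy examples; this is an illustration, not a replacement for NGramTokenizer.

```r
# Build all n-grams (n consecutive words) from one line of text.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))  # line too short for an n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("thanks for the follow back", 4)
# "thanks for the follow" "for the follow back"

# Counting 2-grams over several lines
toy <- c("thanks for the follow", "thanks for the help")
counts <- sort(table(unlist(lapply(toy, make_ngrams, n = 2))), decreasing = TRUE)
counts
```

A line with fewer than n words yields no n-gram at all, which is one reason why short documents such as tweets contribute little to the 4-gram counts.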

The steps in the analysis and the code below are as follows:

The tokenizers for 2-, 3- and 4-grams are defined.
The VCorpus object is created.
Special characters are cleaned up as before.
The DocumentTermMatrices are generated for the n-grams.
For the example of the 4-grams, the frequencies are extracted from the DTM. Normally one would do this by converting the DTM to a matrix, but this did not work due to memory constraints (a consequence of the vector-source-based VCorpus class underlying the DTM). Hence it is done by looping over “findFreqTerms”, embedded in a new function “DTM_freq_extract” that is called with exactly one frequency “bin” at a time and returns a data frame with the n-grams and their number of occurrences. The highest frequency has been determined manually beforehand.

twogramTokenizer <- function(x) {
    NGramTokenizer(x, Weka_control(min=2, max=2))
}

threegramTokenizer <- function(x) {
    NGramTokenizer(x, Weka_control(min=3, max=3))
}

fourgramTokenizer <- function(x) {
    NGramTokenizer(x, Weka_control(min=4, max=4))
}

news_corp <- createVCorpus("./en_US_1per/en_US.news.txt")
news_corp <- cleanup(news_corp)
news_DTM <- DocumentTermMatrix(news_corp)

news_DTM2 <- DocumentTermMatrix(news_corp,
                           control=list(tokenize=twogramTokenizer))
news_DTM3 <- DocumentTermMatrix(news_corp,
                           control=list(tokenize=threegramTokenizer))
news_DTM4 <- DocumentTermMatrix(news_corp,
                           control=list(tokenize=fourgramTokenizer))

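# Returns a data frame of all n-grams that occur exactly 'freq' times in the DTM
DTM_freq_extract <- function(DTM, freq) {
  ngrams <- c(findFreqTerms(DTM, freq,freq))
  occurence <- rep(freq,each = length(ngrams))
  df <- data.frame(occurence,ngrams)
  df  # return the data frame explicitly
} 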

imax <- 33  # highest 4-gram frequency, determined manually beforehand
# one data frame per frequency bin 1..imax, stacked in ascending order of frequency
news_fr_4 <- do.call(rbind, lapply(1:imax, function(i) DTM_freq_extract(news_DTM4, i)))

news_fr_4_top20 <- tail(news_fr_4,20)
news_4gram_plot <- ggplot(news_fr_4_top20, aes(ngrams,occurence)) + geom_bar(stat="identity")
news_4gram_plot <- news_4gram_plot + theme(axis.text.x=element_text(angle=45, hjust=1))
news_4gram_plot <- news_4gram_plot + xlab("most frequent 4-grams") + ggtitle("From sample news data")
news_4gram_plot
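The loop over frequency bins could possibly be avoided by summing the counts straight from the sparse representation. tm stores a DocumentTermMatrix as a sparse triplet matrix (class simple_triplet_matrix from the slam package) whose components j and v hold the column index and the value of each non-zero entry. The sketch below uses a toy stand-in rather than a real DTM, and the function name triplet_col_sums is hypothetical; it has not been run on the full data.

```r
# Sum the counts per term from the triplet representation, without a dense matrix.
# j: column (term) index of each non-zero entry, v: its value, terms: column names.
triplet_col_sums <- function(j, v, terms) {
  sums <- numeric(length(terms))
  agg <- tapply(v, j, sum)              # totals for the columns that actually occur
  sums[as.integer(names(agg))] <- agg
  names(sums) <- terms
  sums
}

# Toy stand-in for a DTM with 3 documents and 3 four-gram terms (invented values).
toy <- list(i = c(1, 1, 2, 3, 3), j = c(1, 2, 1, 1, 3), v = c(2, 1, 1, 1, 4),
            terms = c("at the end of", "the end of the", "for the first time"))
fr <- triplet_col_sums(toy$j, toy$v, toy$terms)
sort(fr, decreasing = TRUE)
```

With a real DTM the arguments would be news_DTM4$j, news_DTM4$v and Terms(news_DTM4); slam::col_sums(news_DTM4) should give the same totals.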

Planned next steps

The next steps are to explore the effect of word stemming and what options or advantages clustering and similarity analysis would bring. It is probably also sensible to run this analysis, in particular the 2-, 3- and 4-gram frequency analysis, on larger data sets. Likewise it would be good to start early with the application prototype to understand how response time depends on the size of the reference dataset.