Summary

Exploratory data analysis is performed on a large corpus of text documents. This forms the first step towards building a language model. The aim is to understand the distribution of words and the relationships between words in the corpora. The frequencies of words and word combinations are presented graphically.

Packages

The following libraries are loaded first:

library(dplyr)
library(tm) # text mining
library(ggplot2) # plotting

File sizes

The text files of the corpora have the following sizes (in bytes):

file.size("./data./final./en_US./en_US.blogs.txt")
## [1] 210160014
file.size("./data./final./en_US./en_US.twitter.txt")
## [1] 167105338
file.size("./data./final./en_US./en_US.news.txt")
## [1] 205811889
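
For readability, the sizes can also be expressed in megabytes, roughly 200 MB (blogs), 159 MB (Twitter) and 196 MB (news); a quick conversion (output not reproduced here):

# file sizes in megabytes, rounded to one decimal place
round(file.size("./data/final/en_US/en_US.blogs.txt") / 1024^2, 1)
round(file.size("./data/final/en_US/en_US.twitter.txt") / 1024^2, 1)
round(file.size("./data/final/en_US/en_US.news.txt") / 1024^2, 1)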

File length

The file sizes are large, which may affect processing time. Here the entire dataset is used for the analysis; if computational difficulties arise, a reduced (sampled) dataset can be used instead, as sketched after the line counts below. Now, let's examine the length of the text files:

# length of each file 
con <- file("./data./final./en_US./en_US.twitter.txt")
open(con, 'r')
x <- readLines(con)
close(con)

con <- file("./data./final./en_US./en_US.blogs.txt")
open(con, 'r')
y <- readLines(con)
close(con)

con <- file("./data./final./en_US./en_US.news.txt")
open(con, 'r')
z <- readLines(con)
close(con)

length(x) #twitter
## [1] 2360148
length(y) #blog
## [1] 899288
length(z) #news
## [1] 77259

The Twitter data has the largest number of lines, followed by the blog data and the news data.
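
If memory or processing time becomes a constraint, a random sample of each file could be used instead of the full data. A minimal sketch is given below; the 5% sampling fraction is an arbitrary choice and the sampled objects are not used in the rest of this report:

# optional: sample a fraction of each file to reduce processing time
# (5% is an arbitrary fraction; the full data is used in this report)
set.seed(1234)
sample_frac <- 0.05
x_sample <- sample(x, floor(length(x) * sample_frac))
y_sample <- sample(y, floor(length(y) * sample_frac))
z_sample <- sample(z, floor(length(z) * sample_frac))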

Combined text vector

For the analysis, a vector combining all the text data is created and written as a plain text file to a subdirectory of the working directory, so that it can later be read into a corpus:

# vector combining all the text data
# vector combining all the text data
v <- c(x, y, z)

# directory for the combined data file
dir.create("corp")

# saving the combined text as a plain text file
writeLines(v, "./corp/corp.txt")
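
A quick consistency check confirms that the combined vector contains the expected number of lines (optional; output not shown):

# the combined vector should contain the sum of the three line counts
length(v) == length(x) + length(y) + length(z)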

Data import

A corpus is created using the VCorpus() function from the tm package. The data stored in the corp.txt document is imported as follows:

# corpus data
doc <- VCorpus(DirSource("./corp"))
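
As a quick sanity check (optional, output not shown), the imported corpus can be inspected:

# quick check of the imported corpus (optional; output not shown)
print(doc)                  # number of documents in the corpus
head(content(doc[[1]]), 3)  # first few lines of the combined text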

Data transformation

The corpus data is transformed using various operations such as stemming, removing punctuation, stripping extra whitespace, converting to lower case, and removing stopwords. This is achieved with the tm_map() function as follows (note that stemDocument() relies on the SnowballC package being installed):

# Transform data 
doc2 <- tm_map(doc, stripWhitespace) # remove extra white spaces
doc2 <- tm_map(doc2, content_transformer(tolower)) # convert all to lower case
doc2 <- tm_map(doc2, removeNumbers)# remove numbers
doc2 <- tm_map(doc2, removeWords, stopwords('english')) # remove stop words
doc2 <- tm_map(doc2, removePunctuation) # remove punctuation 
doc2 <- tm_map(doc2, stemDocument) # stemming 
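
To verify that the transformations behaved as expected, a few lines of the cleaned text can be printed (an optional check; output not shown):

# inspect the first few lines of the cleaned corpus (optional; output not shown)
head(content(doc2[[1]]), 3)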

Term document matrix and ngram tokenizer

The term-document matrix is created using the TermDocumentMatrix() function from the tm package. The words are split into n-grams using the ngrams() and words() functions from the NLP package (loaded as a dependency of tm). A function combining both steps is defined as follows:

# ngram tokenizer and term-document matrix
term_mat <- function(ngram, data_text){
  # tokenizer that splits each document into n-grams of the requested size
  ngram_tokenizer <- function(x){
    unlist(lapply(ngrams(words(x), ngram), paste, collapse = " "), use.names = FALSE)
  }

  # term-document matrix built with the n-gram tokenizer
  tdm <- TermDocumentMatrix(data_text, control = list(tokenize = ngram_tokenizer))
  tdm
}
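
For illustration, ngrams() slides a window of the requested size over a word sequence; a toy example (not part of the analysis):

# toy illustration of how ngrams() forms 2-grams from a word sequence
toy_words <- c("the", "quick", "brown", "fox")
unlist(lapply(ngrams(toy_words, 2), paste, collapse = " "), use.names = FALSE)
# expected result: "the quick" "quick brown" "brown fox"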

Unigram, bigram and trigram matrix

Term-document matrices for 1-grams, 2-grams and 3-grams are generated as follows. The resulting term frequencies are sorted in decreasing order:

unigram_mat <- sort(rowSums(as.matrix(term_mat(1, doc2))), decreasing = T) #1-gram
bi_gram_mat <- sort(rowSums(as.matrix(term_mat(2, doc2))), decreasing = T) #2-gram
tri_gram_mat <- sort(rowSums(as.matrix(term_mat(3, doc2))), decreasing = T) #3-gram
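
The most frequent terms in each table can be previewed directly (optional; output not shown):

# preview the most frequent terms in each n-gram table (output not shown)
head(unigram_mat, 5)
head(bi_gram_mat, 5)
head(tri_gram_mat, 5)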

Frequencies of 1-gram, 2-gram, 3-grams in dataset

The following plots show the frequencies of the 15 most common 1-grams, 2-grams and 3-grams in the dataset:

#1-gram
uni_freq <- data.frame(word = names(unigram_mat), Frequency = unigram_mat)
ggplot(data = uni_freq[1:15, ], mapping = aes(word, Frequency)) +
  geom_col() +
  ggtitle("1-gram word frequency") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1))

# 2-gram
bi_freq <- data.frame(word = names(bi_gram_mat), Frequency = bi_gram_mat)
ggplot(data = bi_freq[1:15, ], mapping = aes(word, Frequency)) +
  geom_col() +
  ggtitle("2-gram frequency") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1))

#3-gram
tri_freq <- data.frame(word = names(tri_gram_mat), Frequency = tri_gram_mat)
ggplot(data = tri_freq[1:15, ], mapping = aes(word, Frequency)) +
  geom_col() +
  ggtitle("3-gram frequency") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1))

Number of words required for covering 50% and 90% of the word instances

The number of words from the frequency-sorted dictionary required to cover 50% and 90% of all word instances in the corpus is determined as follows:

# cumulative fraction of all word instances covered by the most frequent words
uni_freq$cumsum <- cumsum(uni_freq$Frequency) / sum(uni_freq$Frequency)
uni_freq$index <- 1:nrow(uni_freq)

# print the first index at which cumulative coverage exceeds the given threshold
prob_words <- function(prob){
  for (i in 1:nrow(uni_freq)){
    if (uni_freq$cumsum[i] > prob){
      print(uni_freq$index[i])
      break
    }
  }
}

prob_words(0.5)
## [1] 542
prob_words(0.9)
## [1] 6637
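
Equivalently, the same indices can be obtained without an explicit loop (an alternative sketch giving the same result):

# vectorized lookup: first index where cumulative coverage exceeds the threshold
min(which(uni_freq$cumsum > 0.5))  # 50% coverage
min(which(uni_freq$cumsum > 0.9))  # 90% coverage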

From the frequency-sorted dictionary, 542 words are required to cover 50% of all word instances and 6637 words to cover 90%. Covering a larger share of word instances therefore requires a disproportionately larger vocabulary.