Exploratory data analysis is performed on a large corpus of text documents as a first step towards building a linguistic model. The aim is to understand the distribution of words and the relationships between them in the corpora. The frequencies of words and word combinations are also presented graphically.
The following libraries are loaded first:
library(dplyr)   # data manipulation
library(tm)      # text mining
library(ggplot2) # plotting
The text files in the corpora have the following sizes (in bytes):
file.size("./data./final./en_US./en_US.blogs.txt")
## [1] 210160014
file.size("./data./final./en_US./en_US.twitter.txt")
## [1] 167105338
file.size("./data./final./en_US./en_US.news.txt")
## [1] 205811889
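For readability, the byte counts above can be expressed in megabytes (roughly 200 MB, 159 MB and 196 MB):
# convert the reported byte counts to megabytes
round(c(blogs = 210160014, twitter = 167105338, news = 205811889) / 1024^2, 1)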
The files are large, which affects processing time. The entire dataset is used for the analysis here, but a reduced dataset could be used if processing the full data proves computationally difficult (a sampling sketch is given after the line counts below). Next, the length of each text file is examined:
# read each file line by line
con <- file("./data/final/en_US/en_US.twitter.txt")
open(con, 'r')
x <- readLines(con) # Twitter
close(con)
con <- file("./data/final/en_US/en_US.blogs.txt")
open(con, 'r')
y <- readLines(con) # blogs
close(con)
con <- file("./data/final/en_US/en_US.news.txt")
open(con, 'r')
z <- readLines(con) # news
close(con)
length(x) #twitter
## [1] 2360148
length(y) #blog
## [1] 899288
length(z) #news
## [1] 77259
The Twitter data has the largest number of lines, followed by the blog data and the news data.
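If processing the full corpus proves too slow, a smaller random sample of each source could be taken at this point instead, as noted earlier. A minimal sketch, assuming an arbitrary 10% sampling fraction (the objects x_s, y_s and z_s are hypothetical):
# draw a random 10% subsample of each source (fraction chosen arbitrarily)
set.seed(123) # for reproducibility
sample_frac <- 0.10
x_s <- sample(x, round(length(x) * sample_frac))
y_s <- sample(y, round(length(y) * sample_frac))
z_s <- sample(z, round(length(z) * sample_frac))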
For the analysis, a single vector combining all the data is created and written to a subdirectory of the working directory:
# vector combining all the text data
v <- c(x, y, z)
# directory for the combined data file
dir.create("corp")
# write the combined text to a single plain-text file
writeLines(v, "./corp/corp.txt")
A corpus is created using the ‘VCorpus’ function from the ‘tm’ package. The data stored in the corp.txt document is imported as follows:
# corpus data
doc <- VCorpus(DirSource("./corp"))
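Before transforming the text, it may be worth confirming that the import produced a single plain-text document; a quick, optional sanity check could look like this:
# sanity check on the imported corpus
print(doc) # corpus summary: should report one document
length(as.character(doc[[1]])) # number of lines read back from corp.txt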
The corpus is transformed through several operations: stripping extra whitespace, converting to lower case, removing numbers, stop words and punctuation, and stemming. This is achieved with the tm_map() function as follows:
# Transform data
doc2 <- tm_map(doc, stripWhitespace) # remove extra white spaces
doc2 <- tm_map(doc2, content_transformer(tolower)) # convert all to lower case
doc2 <- tm_map(doc2, removeNumbers) # remove numbers
doc2 <- tm_map(doc2, removeWords, stopwords('english')) # remove stop words
doc2 <- tm_map(doc2, removePunctuation) # remove punctuation
doc2 <- tm_map(doc2, stemDocument) # stemming
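The effect of these transformations can be spot-checked by comparing a raw line with its cleaned counterpart, for example:
# compare one line before and after cleaning (illustrative check)
as.character(doc[[1]])[1] # original text
as.character(doc2[[1]])[1] # lower-cased, stemmed, stop words and punctuation removed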
A term-document matrix is created using the ‘TermDocumentMatrix’ function from the ‘tm’ package. An n-gram tokenizer built on the ngrams() and words() functions from the ‘NLP’ package (loaded together with ‘tm’) is used to split the text into word combinations. A helper function is defined for this as follows:
# ngram tokenizer
term_mat <- function(ngram, data_text){
  ngram_tokenizer <- function(x){unlist(lapply(ngrams(words(x), ngram), paste, collapse = " "), use.names = FALSE)}
  # Term document matrix
  tdm <- TermDocumentMatrix(data_text, control = list(tokenize = ngram_tokenizer))
  tdm
}
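To illustrate what the tokenizer returns, the same ngrams() call from the NLP package can be applied to a toy word vector (purely illustrative input):
# illustrative 2-gram tokenization of a toy sentence
toy <- strsplit("the quick brown fox jumps", " ")[[1]]
unlist(lapply(ngrams(toy, 2), paste, collapse = " "))
# expected: "the quick" "quick brown" "brown fox" "fox jumps"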
Term-document matrices for 1-grams, 2-grams and 3-grams are generated as follows. Each matrix is collapsed to per-term frequencies with rowSums() and sorted in decreasing order of frequency:
unigram_mat <- sort(rowSums(as.matrix(term_mat(1, doc2))), decreasing = T) #1-gram
bi_gram_mat <- sort(rowSums(as.matrix(term_mat(2, doc2))), decreasing = T) #2-gram
tri_gram_mat <- sort(rowSums(as.matrix(term_mat(3, doc2))), decreasing = T) #3-gram
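Before plotting, the most frequent terms can be previewed directly from the sorted vectors (the actual terms depend on the data):
# preview the ten most frequent terms of each order
head(unigram_mat, 10)
head(bi_gram_mat, 10)
head(tri_gram_mat, 10)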
The following plots show the 15 most frequent 1-grams, 2-grams and 3-grams in the dataset:
#1-gram (reorder() keeps the bars in decreasing order of frequency)
uni_freq <- data.frame(word = names(unigram_mat), Frequency = unigram_mat)
ggplot(data = uni_freq[1:15,], mapping = aes(reorder(word, -Frequency), Frequency)) +
  geom_col() + xlab("word") + ggtitle("1-gram word frequency") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1))
# 2-gram
bi_freq <- data.frame(word = names(bi_gram_mat), Frequency = bi_gram_mat)
ggplot(data = bi_freq[1:15,], mapping = aes(reorder(word, -Frequency), Frequency)) +
  geom_col() + xlab("word") + ggtitle("2-gram frequency") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1))
#3-gram
tri_freq <- data.frame(word = names(tri_gram_mat), Frequency = tri_gram_mat)
ggplot(data = tri_freq[1:15,], mapping = aes(reorder(word, -Frequency), Frequency)) +
  geom_col() + xlab("word") + ggtitle("3-gram frequency") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1))
The number of words from the frequency-sorted dictionary required to cover 50% and 90% of all word instances in the language is determined as follows:
uni_freq$cumsum <- cumsum(uni_freq$Frequency) / sum(uni_freq$Frequency) # cumulative coverage
uni_freq$index <- 1:nrow(uni_freq) # rank in the frequency-sorted dictionary
prob_words <- function(prob){
  for (i in 1:nrow(uni_freq)){
    if (uni_freq$cumsum[i] > prob){
      print(uni_freq$index[i])
      break
    }
  }
}
prob_words(0.5)
## [1] 542
prob_words(0.9)
## [1] 6637
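The same indices can also be obtained without a loop, using a vectorized lookup (the helper name coverage_index is a hypothetical alternative, not part of the analysis above):
# vectorized equivalent of prob_words(): first index where cumulative coverage exceeds prob
coverage_index <- function(prob) min(which(uni_freq$cumsum > prob))
coverage_index(0.5) # should again give 542
coverage_index(0.9) # should again give 6637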
Thus 542 and 6637 words from the frequency-sorted dictionary are required to cover 50% and 90% of all word instances, respectively. Covering a greater percentage of word instances requires considerably more words, as the coverage curve sketched below illustrates.
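A simple way to visualise this relationship is a coverage curve built from the columns computed above; a minimal sketch using ggplot2:
# cumulative word coverage as a function of dictionary size
ggplot(uni_freq, aes(x = index, y = cumsum)) +
  geom_line() +
  geom_hline(yintercept = c(0.5, 0.9), linetype = "dashed") +
  labs(x = "Number of words (frequency-sorted)", y = "Cumulative coverage",
       title = "Word coverage vs. dictionary size")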