The goal of this task is to understand the basic relationships in the data and to prepare for building a linguistic model. We explore a large corpus of US news, blog, and Twitter text documents from the Capstone dataset.
The frequencies of words and word pairs in the documents show that a small subset of the words and word pairs in the corpus is used very frequently. This skewed distribution can be exploited to predict the next word for a given input text sequence.
Let's read in the US blogs, news, and Twitter data for the exploratory data analysis (EDA).
# data loading (file_path is assumed to hold the directory that contains the dataset files)
en_us_news <- readLines(file.path(file_path, "en_US.news.txt"), encoding = "UTF-8")
en_us_blogs <- readLines(file.path(file_path, "en_US.blogs.txt"), encoding = "UTF-8")
en_us_twitter <- readLines(file.path(file_path, "en_US.twitter.txt"), encoding = "UTF-8")
After cleaning the data, we tokenize the text and normalize it by stemming and removing stop words, using the quanteda package. The results of this initial processing, tokenization, and normalization are presented in a summary table (Table 1).
library(quanteda, verbose = FALSE, warn.conflicts = FALSE, quietly = TRUE)
## Package version: 1.3.4
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
library(stringi)
library(gridExtra)
preproc_token <- function(corpus) { # corpus, or character vector
  stri_trim(corpus, side = "both") %>% # corpus --> character vector
    stri_trans_tolower() %>%
    stri_replace_all_charclass("[\\p{P}\\p{S}]", "", merge = TRUE) %>% # remove punctuation and symbols
    stri_replace_all_regex("[0-9]+", "") %>% # remove numbers
    tokens() %>%
    tokens_wordstem() %>% # stemming
    tokens_remove(stopwords("english")) # remove stop words
}
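To show what the character-level cleaning does to a single line, here is a minimal base-R sketch. It is only an approximation of `preproc_token()`: the helper name `clean_line`, the sample sentence, and the use of `[[:punct:]]` in place of the `\p{P}\p{S}` character classes are assumptions made for illustration, and stemming and stop-word removal are left out.

```r
# Base-R approximation of the cleaning steps (illustrative only; the report's
# preproc_token() relies on stringi and quanteda instead).
clean_line <- function(line) {
  line <- trimws(line)                    # trim whitespace on both ends
  line <- tolower(line)                   # convert to lower case
  line <- gsub("[[:punct:]]+", "", line)  # drop ASCII punctuation and symbols
  line <- gsub("[0-9]+", "", line)        # drop runs of digits
  strsplit(line, "\\s+")[[1]]             # split on whitespace
}

# Made-up sample line, not taken from the dataset.
sample_line <- "  The 3 Cats' toys cost $25, really!  "
clean_tokens <- clean_line(sample_line)
print(clean_tokens)
```

The digits and the surrounding punctuation disappear before tokenization, so they never reach the stemming or stop-word steps.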
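stat_doc <- function(doc_corpus, doc_token) {
  data.frame(lines = ndoc(doc_token),                 # number of lines (documents)
             sentences = sum(nsentence(doc_corpus)),  # sentences, counted on the raw text
             types = sum(ntype(doc_token)),           # per-line type counts, summed over lines
             tokens = sum(ntoken(doc_token)))         # token instances after preprocessing
}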
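One detail of `stat_doc()` is easy to miss: `sum(ntype(...))` adds up the number of distinct words in each line, which is generally larger than the number of distinct words in the whole file. The base-R sketch below shows the difference on two made-up tokenized lines; the variable names and the toy data are assumptions for illustration only.

```r
# Two toy tokenized lines (one character vector per line); not from the dataset.
toy_lines <- list(
  c("good", "day", "good", "news"),
  c("good", "news", "travels", "fast")
)

# All token instances across the lines.
n_tokens <- sum(vapply(toy_lines, length, integer(1)))

# Distinct words per line, summed over lines (the quantity stat_doc() reports).
types_summed <- sum(vapply(toy_lines, function(x) length(unique(x)), integer(1)))

# Distinct words in the whole toy corpus.
types_corpus <- length(unique(unlist(toy_lines)))

print(c(tokens = n_tokens, types_summed = types_summed, types_corpus = types_corpus))
```

Here "good" and "news" occur in both lines, so the summed per-line count (7) exceeds the corpus-wide vocabulary size (5).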
plot_doc <- function(doc_dtm, name_ngram) {
  x <- textstat_frequency(doc_dtm) # features sorted by decreasing frequency
  num_breaks <- round(x$frequency[1]) # highest frequency, used as the suggested number of histogram cells
  cum_pct <- cumsum(x$frequency) / sum(x$frequency) * 100 # cumulative % of all instances
  par(mfrow = c(3, 1), mar = c(5.5, 5, 4, 2))
  # top panel: histogram of feature frequencies
  hist(x$frequency,
       breaks = num_breaks,
       xlim = c(0, 30),
       xlab = paste("number of ", name_ngram, " appearances in the corpus", sep = ""),
       main = paste("Histogram of ", name_ngram, " appearances", sep = ""))
  # middle panel: cumulative frequency over the frequency-sorted dictionary
  plot(seq_along(x$frequency),
       cum_pct,
       xlab = paste("number of unique ", name_ngram, "s (features)", sep = ""),
       ylab = paste("% of all ", name_ngram, " instances", sep = ""),
       main = paste("Cumulative frequency (%) of a frequency sorted ", name_ngram, " dictionary", sep = ""))
  abline(h = c(50, 90), col = c("blue", "red")) # 50% and 90% of instances
  x50_idx <- which.max(cum_pct > 50) # [1] 667
  x90_idx <- which.max(cum_pct > 90) # [1] 7524
  abline(v = c(x50_idx, x90_idx), col = c("blue", "red"), lty = c("dashed", "dashed")) # unique features required for 50% and 90% of instances
  pos_x <- length(x$frequency) / 2 # middle of the x-axis
  pos_y <- 10 # 10% on the y-axis
  text(pos_x, pos_y, paste(x50_idx, " unique ", name_ngram, "s required for 50% ", name_ngram, " instances", sep = ""), col = "blue")
  text(pos_x, pos_y + 50, paste(x90_idx, " unique ", name_ngram, "s required for 90% ", name_ngram, " instances", sep = ""), col = "red")
  # bottom panel: frequencies of the 20 most frequent features
  barplot(x$frequency[1:20], las = 2,
          names.arg = x$feature[1:20],
          main = paste("Frequency of top 20 unique ", name_ngram, "s in the corpus", sep = ""),
          cex.names = 0.9)
  par(mfrow = c(1, 1))
  x # returned as a frequency-sorted dictionary
}
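The coverage markers drawn by `plot_doc()` come from a cumulative sum over the frequency-sorted counts. The base-R sketch below repeats that calculation on a small made-up frequency vector, so the numbers are hypothetical and only the logic mirrors the function above.

```r
# Hypothetical counts sorted in decreasing order, standing in for x$frequency.
freq <- c(40, 20, 15, 10, 6, 4, 2, 2, 1)

# Cumulative share (%) of all instances covered by the k most frequent features.
cum_pct <- cumsum(freq) / sum(freq) * 100

# which.max() on a logical vector returns the position of the first TRUE,
# i.e. the smallest dictionary size whose coverage exceeds the threshold.
idx50 <- which.max(cum_pct > 50)
idx90 <- which.max(cum_pct > 90)

print(c(features_for_50pct = idx50, features_for_90pct = idx90))
```

In this toy case, 2 of the 9 features cover more than half of all instances and 5 cover more than 90%. Note that `which.max()` returns 1 when no element exceeds the threshold, so the idiom is only safe for thresholds below 100%.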
### operations on each line of the texts
# en_US.news.txt:
news_corpus <- corpus(en_us_news)
news_token <- preproc_token(news_corpus)
stat_news <- stat_doc(news_corpus, news_token)
# en_US.blogs.txt:
blog_corpus <- corpus(en_us_blogs)
blog_token <- preproc_token(blog_corpus)
stat_blog <- stat_doc(blog_corpus, blog_token)
# en_US.twitter.txt:
twitter_corpus <- corpus(en_us_twitter)
twitter_token <- preproc_token(twitter_corpus)
stat_twitter <- stat_doc(twitter_corpus, twitter_token)
sum_table <- t(data.frame(en_US.news.txt = unlist(stat_news), en_US.blogs.txt=unlist(stat_blog), en_US.twitter.txt=unlist(stat_twitter)))
### print the summary table
grid.table(sum_table)
[Table 1. Summary table of the US news, blogs, and Twitter documents: the number of lines, sentences, types, and tokens in each document.]
### clean-up memory
rm(en_us_news, en_us_blogs, en_us_twitter)
rm(news_corpus, blog_corpus, twitter_corpus)
gc(verbose = FALSE, reset = TRUE, full = TRUE)
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 8359675 446.5 27120444 1448.4 8359675 446.5
## Vcells 48784329 372.2 449580998 3430.1 48784329 372.2
The distribution of the words in en_US.news.txt is summarized in Figure 1. Words that appear only once dominate the histogram, while some words appear very frequently in the document.
# tokens ==> document-feature matrix(dfm)
dtm_news <- dfm(news_token)
df_dic_news_uni <- plot_doc(dtm_news, "word")
[Figure 1. Summary plot of the en_US.news.txt document: histogram of word appearances (top), cumulative frequency of unique words in a frequency-sorted dictionary (middle), and frequency of the top 20 unique words (bottom).]
The distribution of the words in en_US.blogs.txt is shown in the same way in Figure 2.
dtm_blog <- dfm(blog_token)
df_dic_blog_uni <- plot_doc(dtm_blog, "word")
[Figure 2. Summary plot of the en_US.blogs.txt document: histogram of word appearances (top), cumulative frequency of unique words in a frequency-sorted dictionary (middle), and frequency of the top 20 unique words (bottom).]
The distribution of the words in en_US.twitter.txt is shown in the same way in Figure 3.
dtm_twitter <- dfm(twitter_token)
df_dic_twitter_uni <- plot_doc(dtm_twitter, "word")
[Figure 3. Summary plot of the en_US.twitter.txt document: histogram of word appearances (top), cumulative frequency of unique words in a frequency-sorted dictionary (middle), and frequency of the top 20 unique words (bottom).]
So far, only word frequency has been considered, which corresponds to the 1-gram (unigram) case of the N-gram model. We can extend this analysis to 2-grams, 3-grams, or general N-grams by generating the corresponding N-gram tokens. For instance, we can easily extend the unigram analysis of the en_US.news.txt document to bigrams and apply the same analysis (Figure 4).
# bigrams: en_US.news.txt case
news_token_bigram <- tokens_ngrams(news_token, n = 2) # build bigrams from the preprocessed tokens
dtm_news_bigram <- dfm(news_token_bigram)
df_dic_news_bi <- plot_doc(dtm_news_bigram, "bigram")
[Figure 4. Summary plot of the bigram model of the en_US.news.txt document: histogram of bigram appearances (top), cumulative frequency of unique bigrams in a frequency-sorted dictionary (middle), and frequency of the top 20 unique bigrams (bottom).]
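To make concrete how a frequency-sorted bigram dictionary can be queried for a following word, here is a minimal base-R sketch on a made-up token sequence. The helper `predict_next()` and the toy data are hypothetical and are not the report's model; the sketch only illustrates the lookup idea under the assumption that bigrams are stored as two words joined by "_".

```r
# Toy token sequence (already cleaned); not taken from the dataset.
toks <- c("new", "york", "city", "new", "york", "times",
          "new", "year", "new", "york")

# Adjacent word pairs joined with "_".
bigrams <- paste(head(toks, -1), tail(toks, -1), sep = "_")

# Frequency-sorted bigram dictionary as a named integer vector.
tab <- table(bigrams)
bigram_freq <- sort(setNames(as.vector(tab), names(tab)), decreasing = TRUE)

# Hypothetical helper: the most frequent word observed after `word`,
# or NA if `word` never starts a bigram in the dictionary.
predict_next <- function(word, freq) {
  hits <- freq[startsWith(names(freq), paste0(word, "_"))]
  if (length(hits) == 0) return(NA_character_)
  sub("^[^_]*_", "", names(hits)[which.max(hits)])
}

print(bigram_freq)
next_word <- predict_next("new", bigram_freq)
print(next_word)
```

In the toy data "new" is followed by "york" three times and by "year" once, so the lookup returns "york"; a word that never starts a bigram yields `NA`, which is the situation where an unseen-word strategy would be needed.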
Using the frequency- and context-based information obtained from the EDA above (i.e. the frequency-sorted N-gram dictionaries), we will build a predictive text model. Considering both the accuracy and the efficiency of the model, a 3-gram model will be tried first. For the case where a word has not appeared in the training documents, we will