US news, blogs, and twitter corpora EDA

Summary

The goal of this task is to understand the basic relationships in the data and prepare to build a linguistic model. We explore a large corpus of US news, blog, and Twitter text documents from the Capstone dataset.

The frequencies of words and word pairs in the documents show that a small share of the unique words and word pairs accounts for most of their occurrences in the corpus. This distribution can be used to predict the next word for a given input text sequence.

Data read-in

Let’s read in the US blogs, news, and Twitter data for the exploratory data analysis (EDA).

# data loading (file_path is the directory that holds the en_US.*.txt files)
en_us_news <- readLines(file.path(file_path, "en_US.news.txt"), encoding = "UTF-8")
en_us_blogs <- readLines(file.path(file_path, "en_US.blogs.txt"), encoding = "UTF-8")
en_us_twitter <- readLines(file.path(file_path, "en_US.twitter.txt"), encoding = "UTF-8")

Data processing and analysis

After cleaning the data, we tokenize the texts and normalize them by stemming and removing stop words. We use the quanteda package to perform these operations. The results of this initial data processing, tokenization, and normalization are presented in a summary table (Table 1).

library(quanteda, verbose = FALSE, warn.conflicts = FALSE, quietly = TRUE)

## Package version: 1.3.4

## Parallel computing: 2 of 4 threads used.

## See https://quanteda.io for tutorials and examples.

library(stringi)
library(gridExtra)

preproc_token <- function(corpus) { # corpus or character vector
    stri_trim(corpus, side = "both") %>%   # trim whitespace; yields a character vector
         stri_trans_tolower() %>%          # lowercase
         stri_replace_all_charclass("[\\p{P}\\p{S}]", "", merge = TRUE) %>% # remove punctuation and symbols
         stri_replace_all_regex("[0-9]+", "") %>% # remove numbers
         tokens() %>%
         tokens_wordstem() %>%               # stemming
         tokens_remove(stopwords("english")) # remove stop words
}
stat_doc <- function(doc_corpus, doc_token) {
    data.frame(lines=ndoc(doc_token), 
                sentences = sum(nsentence(doc_corpus)),
                types = sum(ntype(doc_token)),
                tokens = sum(ntoken(doc_token)) )
}
plot_doc <- function(doc_dtm, name_ngram) {
    x <- textstat_frequency(doc_dtm)
    num_breaks = round(x$frequency[1])
    par(mfrow=c(3,1), mar=c(5.5,5,4,2))
    hist(x$frequency, #breaks = 15000,
            breaks = num_breaks,
            xlim = c(0, 30), #ylim = c(0,1000),
            xlab = paste("number of ",name_ngram, " appearance in the corpus", sep = ""),
            main = paste("Histogram of ", name_ngram, " appearance", sep = "") ) 
    plot(1:length(x$frequency), 
            cumsum(x$frequency)/sum(x$frequency)*100,
            xlab = paste("number of unique ", name_ngram, "s (features)", sep=""),
            ylab = paste("% of all ", name_ngram, " instances", sep = ""),
            main = paste("Cumulative frequency (%) of a frequency sorted ", name_ngram, " dictionary", sep = "") )
    abline(h=c(50, 90), col=c("blue", "red") ) # 50% and 90% of word instances
    x50_idx <- which.max(cumsum(x$frequency)/sum(x$frequency)*100 > 50)#[1] 667
    x90_idx <- which.max(cumsum(x$frequency)/sum(x$frequency)*100 > 90)#[1] 7524
    abline(v=c(x50_idx, x90_idx), col=c("blue", "red"), lty=c("dashed", "dashed")) # unique words required for 50% and 90% of word instances
    pos_x <- length(x$frequency)/2  # middle in x-position
    pos_y <- 10                     # 10% in y-position 
    text(pos_x, pos_y, paste(x50_idx, " unique ", name_ngram,"s required for 50% ", name_ngram, " instances", sep = ""), col="blue")
    text(pos_x, pos_y+50, paste(x90_idx, " unique ", name_ngram,"s required for 90% ", name_ngram, " instances", sep = ""), col="red")
    #par(mar=c(11,6,4,4)) # increase margin
    barplot(x$frequency[1:20], las=2, 
            #ylab="Frequency", #xlab = "words", 
            names.arg = x$feature[1:20], 
            main = paste("Frequency of top 20 unique ", name_ngram, "s in the corpus", sep=""),
            cex.names = 0.9)
    par(mfrow=c(1,1))
    x # return the frequency-sorted dictionary
}

### operation on each line of the texts
# en_US.news.txt:
news_corpus <- corpus(en_us_news)
news_token <- preproc_token(news_corpus)
stat_news <- stat_doc(news_corpus, news_token)
# en_US.blogs.txt:
blog_corpus <- corpus(en_us_blogs)
blog_token <- preproc_token(blog_corpus)
stat_blog <- stat_doc(blog_corpus, blog_token)
# en_US.twitter.txt:
twitter_corpus <- corpus(en_us_twitter)
twitter_token <- preproc_token(twitter_corpus)
stat_twitter <- stat_doc(twitter_corpus, twitter_token)

sum_table <- t(data.frame(en_US.news.txt = unlist(stat_news), en_US.blogs.txt=unlist(stat_blog), en_US.twitter.txt=unlist(stat_twitter)))

### print the summary table
grid.table(sum_table)

[Table 1. Summary table of the US news, blogs, and Twitter documents. The number of lines, sentences, types, and tokens in each document is summarized.]

### clean-up memory
rm(en_us_news, en_us_blogs, en_us_twitter)
rm(news_corpus, blog_corpus, twitter_corpus)
gc(verbose = FALSE, reset = TRUE, full = TRUE)

##            used  (Mb) gc trigger   (Mb) max used  (Mb)
## Ncells  8359675 446.5   27120444 1448.4  8359675 446.5
## Vcells 48784329 372.2  449580998 3430.1 48784329 372.2

EDA of en_US.news.txt

The distribution of word frequencies in en_US.news.txt is summarized in Figure 1. Words that appear only once dominate the histogram, while a small number of words appear very frequently in the document.

# tokens ==> document-feature matrix(dfm)
dtm_news <- dfm(news_token)
df_dic_news_uni <- plot_doc(dtm_news, "word")

[Figure 1. Summary plot of the en_US.news.txt document. Histogram of word appearance (top), cumulative frequency of unique words in a frequency-sorted dictionary (middle), and frequency of the top 20 unique words (bottom).]
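The cumulative-frequency panel answers the question "how many unique words are needed to cover 50% or 90% of all word instances?". A minimal base R sketch of that calculation is given here on made-up counts; the vector `freq` and the helper `words_needed` are hypothetical and are not part of the analysis code.

```r
# Toy word counts (made-up numbers, not taken from the corpus).
freq <- c(the = 45, said = 20, year = 10, time = 8, city = 6,
          game = 5, team = 3, zebra = 2, quark = 1)

# Sort from most to least frequent, like a frequency-sorted dictionary.
freq <- sort(freq, decreasing = TRUE)

# Cumulative share (%) of all word instances covered by the top-k words.
cum_pct <- cumsum(freq) / sum(freq) * 100

# Smallest k whose cumulative share reaches the target percentage.
# Returns NA if the target is never reached.
words_needed <- function(cum_pct, target) {
  unname(which(cum_pct >= target)[1])
}

n50 <- words_needed(cum_pct, 50)
n90 <- words_needed(cum_pct, 90)
cat(n50, "unique words cover 50% of instances;",
    n90, "unique words cover 90%\n")
```

In this toy dictionary 2 of the 9 unique words already cover half of all instances, which is the same long-tailed pattern the figure shows at corpus scale.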

EDA of en_US.blogs.txt

The distribution of word frequencies in en_US.blogs.txt is shown in the same way in Figure 2.

dtm_blog <- dfm(blog_token)
df_dic_blog_uni <- plot_doc(dtm_blog, "word")

[Figure 2. Summary plot of the en_US.blogs.txt document. Histogram of word appearance (top), cumulative frequency of unique words in a frequency-sorted dictionary (middle), and frequency of the top 20 unique words (bottom).]

EDA of en_US.twitter.txt

The distribution of word frequencies in en_US.twitter.txt is shown in the same way in Figure 3.

dtm_twitter <- dfm(twitter_token)
df_dic_twitter_uni <- plot_doc(dtm_twitter, "word")

[Figure 3. Summary plot of the en_US.twitter.txt document. Histogram of word appearance (top), cumulative frequency of unique words in a frequency-sorted dictionary (middle), and frequency of the top 20 unique words (bottom).]

Extension of EDA to N-grams

So far only word frequency has been considered, which corresponds to the 1-gram (unigram) case of the N-gram model. We can extend this analysis to 2-grams, 3-grams, or general N-grams by generating the corresponding N-gram model. For instance, we can easily extend our unigram model of the en_US.news.txt document to a bigram model and apply the same analysis (Figure 4).

# bigrams: en_US.news.txt case
news_token_bigram <- tokens_ngrams(news_token, n = 2) # build bigrams from the existing tokens
dtm_news_bigram <- dfm(news_token_bigram)
df_dic_news_bi <- plot_doc(dtm_news_bigram, "bigram")

[Figure 4. Summary plot of the bigram model of the en_US.news.txt document. Histogram of bigram appearance (top), cumulative frequency of unique bigrams in a frequency-sorted dictionary (middle), and frequency of the top 20 unique bigrams (bottom).]
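To make the bigram step concrete, here is a small base R sketch of what forming bigrams from a token sequence amounts to. The helper `make_bigrams` and the toy tokens are hypothetical; the analysis itself relies on quanteda, and this sketch only assumes the same "_" separator between the two words.

```r
# Join each token with its successor to form bigrams.
make_bigrams <- function(tok) {
  if (length(tok) < 2) return(character(0))
  paste(head(tok, -1), tail(tok, -1), sep = "_")
}

# Toy, already stemmed tokens (made up for illustration).
tok <- c("new", "york", "citi", "new", "york", "time")

bigrams <- make_bigrams(tok)

# Frequency-sorted bigram dictionary, most frequent first.
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
print(bigram_freq)
```

Because stop words are removed before the bigrams are formed, a pair in the dictionary can join two words that were not adjacent in the raw text. This is worth keeping in mind when the dictionary is later used for prediction.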

Plan for building a predictive text model and Shiny app (predictive text mining app)

Using the frequency- and context-based information obtained from the EDA above (i.e. the frequency-sorted N-gram dictionaries), we will build a predictive text model. Considering both the accuracy and the efficiency of the model, a 3-gram model will be tried initially. To handle words that have not appeared in the training documents, we will add an unknown-word entry to our dictionary at the model training stage. For word sequences not seen in the N-gram model, we will try several smoothing techniques such as simple Laplace, add-k, and backoff smoothing. The model will be evaluated using separate validation and test data sets. Once a satisfactory model has been built, we will make a Shiny app as a predictive text mining app. The essential components of the Shiny app will be a reactive input text box and three output text boxes that display the suggested next words.
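As a rough illustration of the plan, the sketch below suggests three next words by looking up the trigram dictionary first and falling back to bigrams and then unigrams to fill the remaining slots. It also shows an add-k estimate for a bigram probability. Everything here is an assumption made for illustration: the toy counts, the "_" separator, and the helpers `continuations`, `predict_next`, and `addk_prob` are hypothetical, and the final model may handle backoff and smoothing differently.

```r
# Toy frequency dictionaries (made-up counts, "_" joins the words).
unigrams <- c(new = 8, york = 5, time = 4, year = 3, citi = 2)
bigrams  <- c(new_york = 5, new_year = 3, york_citi = 2, york_time = 1)
trigrams <- c(new_york_citi = 2, new_york_time = 1)

# Words that follow `prefix` in an n-gram dictionary, most frequent first.
continuations <- function(prefix, ngrams) {
  key  <- paste0(prefix, "_")
  hits <- sort(ngrams[startsWith(names(ngrams), key)], decreasing = TRUE)
  substring(names(hits), nchar(key) + 1)
}

# Suggest up to n next words: trigram matches first, then bigram matches,
# then the most frequent unigrams as a last resort.
predict_next <- function(words, n = 3) {
  k <- length(words)
  cand <- character(0)
  if (k >= 2) {
    cand <- continuations(paste(words[(k - 1):k], collapse = "_"), trigrams)
  }
  if (k >= 1) {
    cand <- union(cand, continuations(words[k], bigrams))
  }
  cand <- union(cand, names(sort(unigrams, decreasing = TRUE)))
  head(cand, n)
}

# Add-k estimate of P(word | prev) from bigram counts; k = 1 is Laplace.
addk_prob <- function(prev, word, k = 1) {
  pair   <- paste(prev, word, sep = "_")
  c_pair <- if (pair %in% names(bigrams)) bigrams[[pair]] else 0
  c_prev <- sum(bigrams[startsWith(names(bigrams), paste0(prev, "_"))])
  (c_pair + k) / (c_prev + k * length(unigrams))
}

print(predict_next(c("new", "york")))
print(predict_next("zebra"))          # unseen word: unigram fallback
print(addk_prob("new", "citi"))       # unseen pair still gets some mass
```

The three values returned by `predict_next` correspond to the three output text boxes planned for the Shiny app. With add-k smoothing an unseen pair such as "new citi" receives a small non-zero probability instead of zero, and the probabilities over the toy vocabulary still sum to one.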