Introduction
This report documents the exploratory analysis I conducted on the raw text data, to better understand it and to begin developing a strategy for building a text prediction model that predicts the next word a user will type based on her/his previous input.
Raw Data Overview
The training data is provided by SwiftKey. I chose to use the English versions of the Twitter, news, and blog text files.
Let’s first load the source files and convert them into a corpus object using the quanteda package in R. The summary of the three .txt files is below:
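The corpus construction is sketched below (a minimal illustration, not necessarily the exact code used; it assumes the files sit under /home/roger/NLP-R/Data/ as in the later chunks):

library(quanteda)
library(readr)

# Sketch: read each source file into one long document, build a corpus whose
# document names come from the file names, then summarize it.
src_files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
src_texts <- sapply(src_files, function(f)
  paste(read_lines(file.path("/home/roger/NLP-R/Data", f)), collapse = "\n"))
full_corpus <- corpus(src_texts)
summary(full_corpus)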
## Corpus consisting of 3 documents:
##
## Text Types Tokens Sentences
## en_US.blogs.txt 482484 42840192 2072941
## en_US.news.txt 431667 39918317 1867522
## en_US.twitter.txt 566995 36719702 2588548
##
## Source: /home/roger/NLP-R/* on x86_64 by roger
## Created: Wed Sep 25 14:12:23 2019
## Notes:
Since the raw data consists of a large amount of information - millions of sentences and tens of millions of tokens - it would take large resources and a long time to process it all at once. For the purposes of exploratory data analysis, only 1% of randomly sampled data from each of the three text files is used, for practical reasons: fast processing while retaining enough information to find patterns.
Once a good strategy for cleaning/processing the data and for constructing the text prediction model is developed, a greater portion of the raw data will be used/revisited as needed later.
Exploratory Analysis
Twitter Data
Start by looking into the Twitter file: read the lines from the .txt file into a data.table and randomly draw 1% of the lines for the analysis.
twit.dt <- as.data.table(read_lines(file = "/home/roger/NLP-R/Data/en_US.twitter.txt"))
set.seed(95130)
samp_twit <- twit.dt[sample(.N, round(.N * 0.01))]
Then create a corpus from the sampled Twitter data and see what the most frequent tokens are.
twit_features <- samp_twit[, V1] %>%
corpus() %>%
dfm() %>%
textstat_frequency() %>%
setDT()
twit_features[, 1:2]
## feature frequency
## 1: . 25177
## 2: ! 12619
## 3: the 9258
## 4: to 7776
## 5: , 7456
## ---
## 27276: ankel 1
## 27277: sports-_ 1
## 27278: kassim 1
## 27279: tp 1
## 27280: #iamamentor 1
At first glance, the sampled Twitter data has 27280 unique features/tokens before any processing/trimming/stemming is performed. A closer look at the features is required to determine the strategy for cleaning up the text data.
It’s easy to notice that the following features should be removed:
- punctuation
- numbers
- emojis
- foreign characters
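The exact filter that produced the listing below is not shown; one possible sketch is to keep only features that contain no Latin letters at all:

# Sketch (assumed filter): features with no Latin letters, which captures
# punctuation-only tokens, symbols, and emojis in the raw frequency table.
twit_features[!grepl("[A-Za-z]", feature)]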
## feature frequency rank docfreq group
## 1: . 25177 1 12340 all
## 2: ! 12619 2 7262 all
## 3: , 7456 5 5416 all
## 4: ? 4179 10 3289 all
## 5: : 4041 11 3576 all
## ---
## 172: 😢 1 11085 1 all
## 173: 🍆 1 11085 1 all
## 174: 💏 1 11085 1 all
## 175: 🚼 1 11085 1 all
## 176: 🐬 1 11085 1 all
Additionally, common English stopwords, URLs, Twitter characters, and hyphens will also be removed, and trimming will be applied.
twit_features <- samp_twit[, V1] %>%
corpus() %>%
dfm(
tolower = TRUE,
remove = stopwords("english"),
stem = FALSE,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_twitter = TRUE,
remove_url = TRUE,
remove_symbols = TRUE,
remove_hyphens = TRUE
) %>%
dfm(remove = "^[^a-zA-Z]$", valuetype = "regex") %>% # Remove non-english single char
textstat_frequency() %>%
setDT()
We will then follow a similar approach to analyze/process the news and blog text data.
news.dt <- as.data.table(read_lines(file = "/home/roger/NLP-R/Data/en_US.news.txt"))
blog.dt <- as.data.table(read_lines(file = "/home/roger/NLP-R/Data/en_US.blogs.txt"))
set.seed(95130)
samp_news <- news.dt[sample(.N, round(.N * 0.01))]
samp_blog <- blog.dt[sample(.N, round(.N * 0.01))]
news_features <- samp_news[, V1] %>%
corpus() %>%
dfm(
tolower = TRUE,
remove = stopwords("english"),
stem = FALSE,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_twitter = TRUE,
remove_url = TRUE,
remove_symbols = TRUE,
remove_hyphens = TRUE
) %>%
dfm(remove = "^[^a-zA-Z]$", valuetype = "regex") %>% # Remove non-english single char
textstat_frequency() %>%
setDT()
blog_features <- samp_blog[, V1] %>%
corpus() %>%
dfm(
tolower = TRUE,
remove = stopwords("english"),
stem = FALSE,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_twitter = TRUE,
remove_url = TRUE,
remove_symbols = TRUE,
remove_hyphens = TRUE
) %>%
dfm(remove = "^[^a-zA-Z]$", valuetype = "regex") %>% # Remove non-english single char
textstat_frequency() %>%
setDT()
Visualizing the features/tokens
As the last step of the initial exploratory analysis, we will visualize the top 50 features from each of the three data sets.
p1 <-
ggplot(twit_features[1:50, ], aes(x = reorder(feature, -frequency), y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Top 50 Most Frequent Features of US Twitter Data (1% Sample)",
x = "feature", y = "frequency")
p2 <-
ggplot(news_features[1:50, ], aes(x = reorder(feature, -frequency), y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Top 50 Most Frequent Features of US News Data (1% Sample)",
x = "feature", y = "frequency")
p3 <-
ggplot(blog_features[1:50, ], aes(x = reorder(feature, -frequency), y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Top 50 Most Frequent Features of US Blog Data (1% Sample)",
x = "feature", y = "frequency")
grid.arrange(p1, p2, p3, ncol = 1)
Finally, let's see how the features look when all three data sets are combined.
combined_features <- rbindlist(list(twit_features[, 1:2], blog_features[, 1:2], news_features[, 1:2]))[, lapply(.SD, sum, na.rm = TRUE), by = feature]
ggplot(combined_features[1:50, ], aes(x = reorder(feature, -frequency), y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Top 50 Most Frequent Features of Combined Data (1% Sample)",
x = "feature", y = "frequency") Word cloud of top 200 features from all three source data combined.
topfeatures <- head(combined_features[order(-frequency)],200)
wordcloud(words = topfeatures[,feature],
freq = topfeatures[,frequency],
colors = brewer.pal(6,"Dark2"),
random.order = FALSE)
N-Gram Modeling
N-grams can be created easily using the same process as the feature creation above, with minor modifications to the dfm() call from the quanteda package.
We will first write a function to create n-grams.
createNG <- function(dt, n = 1L) {
ng <- dt[, V1] %>%
corpus() %>%
dfm(
tolower = TRUE,
remove = stopwords("english"),
stem = FALSE,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_twitter = TRUE,
remove_url = TRUE,
remove_symbols = TRUE,
remove_hyphens = TRUE,
ngrams = n
) %>%
dfm(remove = "^[^a-zA-Z]$", valuetype = "regex") %>% # Remove non-english single char
textstat_frequency() %>%
setDT()
ng[,ngram := n]
return(ng[,c("feature", "frequency", "ngram")])
}
We will then create 2-gram, 3-gram, 4-gram, and 5-gram models for each of the three data sets.
twit.ng2 <- createNG(samp_twit,2)
twit.ng3 <- createNG(samp_twit,3)
twit.ng4 <- createNG(samp_twit,4)
twit.ng5 <- createNG(samp_twit,5)
news.ng2 <- createNG(samp_news,2)
news.ng3 <- createNG(samp_news,3)
news.ng4 <- createNG(samp_news,4)
news.ng5 <- createNG(samp_news,5)
blog.ng2 <- createNG(samp_blog,2)
blog.ng3 <- createNG(samp_blog,3)
blog.ng4 <- createNG(samp_blog,4)
blog.ng5 <- createNG(samp_blog,5)
Then combine them into a single table.
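The combining code is not shown above; a minimal sketch, following the same rbindlist() aggregation used earlier for combined_features, would build the combined_ng table used in the plots below:

# Sketch: stack all n-gram tables, sum frequencies of identical features
# within each n-gram order, and sort by frequency within each order.
combined_ng <- rbindlist(list(
  twit.ng2, twit.ng3, twit.ng4, twit.ng5,
  news.ng2, news.ng3, news.ng4, news.ng5,
  blog.ng2, blog.ng3, blog.ng4, blog.ng5
))[, .(frequency = sum(frequency, na.rm = TRUE)), by = .(feature, ngram)]
setorder(combined_ng, ngram, -frequency)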
Visualize N-grams
For demonstration purposes, the top 25 most frequent 2-grams and 3-grams are plotted below.
ggplot(combined_ng[ngram==2, head(.SD, 25)], aes(x = reorder(feature, frequency), y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(
title = "Top 25 Frequncy of 2-grams from US Combined Data (1% Sample)",
x = "feature", y = "frequency"
) +
coord_flip()
ggplot(combined_ng[ngram==3, head(.SD, 25)], aes(x = reorder(feature, frequency), y = frequency)) +
geom_point() +
labs(
title = "Top 25 Frequncy of 3-grams from US Combined Data (1% Sample)",
x = "feature", y = "frequency"
)+
coord_flip()
Next Steps
- Study various smoothing methods and choose the most appropriate one to apply to the n-gram model
- Experiment with various sample data sizes (e.g. 1% > 2% > 5% > 10%?) to see if there is an improvement in prediction accuracy. If so, rebuilding the n-gram models with a larger data set will be necessary
- Build a Shiny app that makes next word suggestions as the user types text into an input bar (see the sketch after this list)
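As a rough illustration of the lookup the Shiny app would eventually perform, the sketch below suggests next words from the 2-gram table. It assumes quanteda's "_"-separated n-gram features and the combined_ng table built above; predict_next is a hypothetical helper, not part of the final model:

# Sketch only: return the most frequent continuations of a single word by
# matching "word_" prefixes among the 2-gram features.
predict_next <- function(word, ng_table = combined_ng, top_n = 3) {
  cand <- ng_table[ngram == 2 & grepl(paste0("^", tolower(word), "_"), feature)]
  head(sub("^[^_]+_", "", cand[order(-frequency), feature]), top_n)
}
predict_next("happy")   # e.g. the three most frequent words seen after "happy"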
References
- Speech and Language Processing. Daniel Jurafsky & James H. Martin. [https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf]
- Quanteda Quick Start Guide [https://quanteda.io/articles/quickstart.html]
- Frequently Asked Questions about data.table [https://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.html]