Many people around the world spend a lot of time on their mobile devices for email, social networking, banking and a whole range of other activities. Smart keyboards such as SwiftKey and QuickPath are designed to make typing on a mobile device easier: when you type a word, you immediately get a suggestion for the next word. How do they do that?
The answer is Natural Language Processing (NLP). They use NLP to discover the structure in text data and how words are put together. You can read more about NLP here.
An important step in discovering that structure is exploration, which is the main content of this analysis.
The data I use in this report is a collection of text from blogs, news and Twitter provided by the Coursera Capstone Project. You can download it from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
For this analysis I use the en_US text data. Note: sizes are in bytes.
## size mtime
## ./data/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:06
## ./data/en_US/en_US.news.txt 205811889 2014-07-22 10:13:04
## ./data/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:58
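The sizes above can be obtained with file.info; my assumption about the hidden chunk is something like:
file.info(c("./data/en_US/en_US.blogs.txt",
            "./data/en_US/en_US.news.txt",
            "./data/en_US/en_US.twitter.txt"))[, c("size", "mtime")]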
You can try the other locales, or feel free to use your local language.
I use the following common R libraries to process and explore the text data. One of them in particular, tidytext, is very powerful for working with text data, n-grams and frequencies; read more here. I also created my own utils for common processing such as reading a text file, finding the maximum line length, etc.
library(dplyr)
library(tidytext)
library(tidyr)
library(tools)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(igraph)
library(ggraph)
library(stringr)
source("./Utils.R")
First step: we get the dataset from the text files and clean it by removing profanity and other words that we do not want to predict. I use the util getDataSet with three params:
data_sources: the data source files as a character vector, here the en_US txt file paths
prob: the probability of keeping each line when drawing a random sample per source file. Default: 0.05 = 5%
nonascii.rm: whether to remove non-ASCII characters. Default: FALSE
Because we use prob, we should set a seed:
set.seed(1102)
filePath <- "./data/dataset.rds"
if(!file.exists(filePath)) {
dataset <- getDataSet(data_sources, nonascii.rm=TRUE)
saveRDS(dataset, filePath)
} else {
dataset <- readRDS(filePath)
}
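For reference, data_sources is just a character vector with the three file paths listed above, and getDataSet itself lives in Utils.R (not shown in the Appendix). A minimal sketch of what it might look like, assuming simple per-line random sampling and iconv-based non-ASCII removal:
data_sources <- c("./data/en_US/en_US.blogs.txt",
                  "./data/en_US/en_US.news.txt",
                  "./data/en_US/en_US.twitter.txt")

# Sketch only; the real implementation is in Utils.R
getDataSet <- function(data_sources, prob = 0.05, nonascii.rm = FALSE) {
  dplyr::bind_rows(lapply(data_sources, function(path) {
    lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
    lines <- lines[rbinom(length(lines), 1, prob) == 1]                 # keep ~prob of the lines per source
    if (nonascii.rm) lines <- iconv(lines, "UTF-8", "ASCII", sub = "")  # drop non-ASCII characters
    src <- tools::toTitleCase(sub("en_US\\.(\\w+)\\.txt", "\\1", basename(path)))
    dplyr::tibble(txt = lines, source = src)
  }))
}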
As a summary of the dataset: we have 213483 records in total, and per source:
##
## Blogs News Twitter
## 44964 50512 118007
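The totals above can be reproduced with a simple tabulation, for example:
nrow(dataset)           # total records
table(dataset$source)   # records per source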
Sample data:
## # A tibble: 15 x 2
## # Groups: source [3]
## txt source
## <chr> <chr>
## 1 Zuma has tried to reassure investors that there will be no whole… Blogs
## 2 ZZ TOP: Neighbor, Neighbor Blogs
## 3 Zukini -1 Blogs
## 4 Zuma, not being the Chaka, might be interrupted with impunity by… Blogs
## 5 Zythophile informs us that, according to chairman Colin Valentin… Blogs
## 6 Zuerlein is coming off one of the best seasons by a kicker in co… News
## 7 "Zydeco is also peppered throughout the festival, including perf… News
## 8 ZZ Top and 3 Doors Down: with Gretchen Wilson and Leroy Powell &… News
## 9 Zufall agreed with his coach. News
## 10 Zumwalt West's relay team of Matthews, Eric Rogers, Tyler Percy … News
## 11 Zucotti Park eviction, #SOPA controversies, Penn State... Twitt…
## 12 zooming their way to sunny PDX for a big show in the lounge star… Twitt…
## 13 Zotero is amazing, and (imho) should really be taught to high sc… Twitt…
## 14 Zoos furnish the live experience of animals, but books give you … Twitt…
## 15 Zubaz never die Nate Twitt…
Second step: we will explore the frequency of single words, 2-grams and 3-grams, where an n-gram means:
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application
Read more here
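As a quick toy illustration (not part of the report's code), unnest_tokens from tidytext turns a sentence into bigrams like this:
tibble(txt = "thank you for the follow") %>%
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)
# gives 4 bigrams: "thank you", "you for", "for the", "the follow"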
For word frequency, I process the following steps: tokenize the dataset into single words, remove profanity and stop words, then count each word and compute its frequency. See the Appendix for function definitions. We also get a profanity dataset, used to skip those words in this analysis, from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words. Thanks to @Shutterstock for it.
profanity_words <- getProfanityDf("./en_US.profanity.txt")
head(profanity_words)
## # A tibble: 6 x 1
## word
## <chr>
## 1 2g1c
## 2 2 girls 1 cup
## 3 acrotomophilia
## 4 alabama hot pocket
## 5 alaskan pipeline
## 6 anal
data(stop_words)
# tokenize into single words, then drop profanity and stop words (joined on the "word" column)
token_word <- tokenized(dataset) %>% anti_join(profanity_words) %>% anti_join(stop_words)
freq_word <- freq(token_word)
summary(freq_word)
## word n freq
## Length:131158 Min. : 1.00 Min. :4.630e-07
## Class :character 1st Qu.: 1.00 1st Qu.:4.630e-07
## Mode :character Median : 1.00 Median :4.630e-07
## Mean : 16.45 Mean :7.624e-06
## 3rd Qu.: 4.00 3rd Qu.:1.854e-06
## Max. :11160.00 Max. :5.172e-03
For a histogram of word frequencies:
ggplot(freq_word, aes(freq)) + geom_histogram(aes(fill=..count..),show.legend = FALSE, binwidth = 1e-06) +
xlim(NA, 0.00005) + scale_fill_gradient("count", low="green", high="red")
With a word cloud, we can again see the most common words in the dataset as a whole.
freq_word %>% with(wordcloud(word, n, max.words = 500, min.freq = 100, random.order = FALSE, colors=brewer.pal(8, "Dark2")))
For the top positive and negative sentiment words (using the Bing lexicon):
bing_word_counts <- token_word %>% inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>% ungroup()
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
And for sentiment per source: you can see that Twitter has fewer negative words than the blogs and news sources.
sentiment_word <- token_word %>% inner_join(get_sentiments("bing")) %>%
count(source, index = row_number() %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(sentiment_word, aes(index, sentiment, fill = source)) +
geom_col(show.legend = FALSE) +
facet_wrap(~source, ncol = 3, scales = "free_x", shrink = TRUE) +
theme(panel.spacing.x=unit(1.5, "lines"))
We follow the same steps as for single words, but with 2-grams (bigrams):
token_two_grams <- tokenized(dataset, token="ngrams")
token_two_grams_separated <- token_two_grams %>%
separate(bigram, c("word1", "word2"), sep=" ") %>%
filter(!word1 %in% profanity_words$word) %>%
filter(!word2 %in% profanity_words$word) %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
token_two_grams_united <- token_two_grams_separated %>%
unite(bigram, word1, word2, sep=" ")
For n-gram analysis, a bigram can be treated as a term in a document in the same way that we treated individual words, so we can compute tf-idf per source.
two_grams_tf_idf <- token_two_grams_united %>%
count(source, bigram) %>%
bind_tf_idf(bigram, source, n) %>%
arrange(desc(tf_idf))
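For reference, bind_tf_idf treats each source as a “document” here; the same tf-idf values could be computed by hand roughly like this (a sketch with a hypothetical two_grams_tf_idf_manual, not run in the report):
n_sources <- n_distinct(token_two_grams_united$source)  # 3 "documents": Blogs, News, Twitter
two_grams_tf_idf_manual <- token_two_grams_united %>%
  count(source, bigram) %>%
  group_by(source) %>%
  mutate(tf = n / sum(n)) %>%              # term frequency of the bigram within its source
  group_by(bigram) %>%
  mutate(idf = log(n_sources / n())) %>%   # log(number of sources / sources containing the bigram)
  ungroup() %>%
  mutate(tf_idf = tf * idf)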
two_grams_tf_idf %>%
arrange(desc(tf_idf)) %>%
mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) %>%
group_by(source) %>%
top_n(15) %>%
ungroup() %>%
ggplot(aes(bigram, tf_idf, fill = source)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~source, ncol = 3, scales = "free") +
coord_flip()
For sentiment analysis on the bigram data, we can examine how often sentiment-associated words are preceded by “not” or other negating words. We could use this to ignore or even reverse their contribution to the sentiment score. As an example, here are the words that follow “hate”:
AFINN <- get_sentiments("afinn")
two_grams_hate_words <- token_two_grams_separated %>%
filter(word1 == "hate") %>%
inner_join(AFINN, by=c(word2 = "word")) %>%
count(word2, value, sort = TRUE)
two_grams_hate_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(20) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(word2, n * value, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
xlab("Words preceded by \"hate\"") +
ylab("Sentiment value * number of occurrences") +
coord_flip()
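The same idea works for negating words. Since “not” and its friends were already removed along with the stop words above, this sketch (a hypothetical two_grams_negated, not run in the report) goes back to the raw bigrams and flips the sign of the contribution:
two_grams_negated <- token_two_grams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% c("not", "no", "never", "without")) %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  ungroup() %>%
  count(word1, word2, value, sort = TRUE) %>%
  mutate(contribution = -1 * n * value)   # reverse the contribution of negated words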
Another useful thing to do with bigrams is to visualize the relationships between words as a graph.
token_two_grams_separated %>%
count(word1, word2, sort=TRUE) %>%
filter(n > 100, !str_detect(word1, "\\d"), !str_detect(word2, "\\d")) %>%
top_n(100) %>%
visualizeBigrams()
With this exploratory analysis, we can see some informative patterns, for example the bigrams “2 tsp”, “rt rt”, …
Next step: with this experience with n-grams, sentiment and tf-idf, I think we can use it to build a model for predicting the next word.
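As a first, very rough sketch of that idea (predictNextWord is a hypothetical helper, my assumption about a possible next step, not the final model), we could predict the next word as the most frequent bigram completion of the current word:
bigram_counts <- token_two_grams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  ungroup() %>%
  count(word1, word2, sort = TRUE)

# most frequent completions of `current_word` among the observed bigrams
predictNextWord <- function(current_word, top = 3) {
  bigram_counts %>%
    filter(word1 == current_word) %>%
    head(top) %>%
    pull(word2)
}

predictNextWord("happy")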
To tokenize the dataset, I created a util like this:
## function (df, token = "word", n = 2)
## {
## if (token == "ngrams") {
## tokenized <- df %>% unnest_tokens(bigram, txt, token = "ngrams",
## n = n)
## }
## else {
## tokenized <- df %>% unnest_tokens(word, txt)
## }
## return(tokenized)
## }
To get the frequency of tokens, I also created a util like this:
## function (tokenized, token = "word")
## {
## if (token == "ngrams") {
## word_count <- tokenized %>% count(bigram, sort = TRUE)
## }
## else {
## word_count <- tokenized %>% count(word, sort = TRUE)
## }
## total_words <- sum(word_count$n)
## word_freq <- word_count %>% mutate(freq = n/total_words)
## return(word_freq)
## }
And to visualize the bigram graph:
## function (bigrams)
## {
## a <- grid::arrow(type = "closed", length = unit(0.15, "inches"))
## bigrams %>% graph_from_data_frame() %>% ggraph(layout = "fr") +
## geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
## arrow = a) + geom_node_point(color = "lightblue",
## size = 5) + geom_node_text(aes(label = name), vjust = 1,
## hjust = 1) + theme_void()
## }