Many people around the world spend a lot of time on their mobile devices for email, social networking, banking and a whole range of other activities. Smart keyboards such as SwiftKey and QuickPath are designed to make typing on a mobile device easier: when you type a word, you immediately get a suggestion for the next word. How do they do that?
The answer is Natural Language Processing (NLP). They use NLP to discover the structure in text data and how words are put together. You can read more about NLP here.
An important step in discovering that structure is exploration, which is the main content of this analysis.
The data I use in this report is a collection of text from blogs, news and Twitter provided by the Coursera Capstone Project. You can download it from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
For this analysis I use the en_US text data. Note: sizes are in bytes.
## size mtime
## ./data/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:06
## ./data/en_US/en_US.news.txt 205811889 2014-07-22 10:13:04
## ./data/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:58
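The sizes above can be obtained with file.info; my assumption about the hidden chunk is something like:
file.info(c("./data/en_US/en_US.blogs.txt",
            "./data/en_US/en_US.news.txt",
            "./data/en_US/en_US.twitter.txt"))[, c("size", "mtime")]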
You can try the other locales, or feel free to use your local language.
I use the following common R libraries to process and explore the text data. One of them in particular, tidytext, is very powerful for working with text data, n-grams and frequencies; read more here. I also created my own utils for common processing such as reading a text file, finding the maximum line length, etc.
library(dplyr)
library(tidytext)
library(tidyr)
library(tools)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(igraph)
library(ggraph)
library(stringr)
source("./Utils.R")
First step: we get the dataset from the text files and clean it by removing profanity and other words that we do not want to predict. I use the util getDataSet with three params:
data_sources: the data source files as a character vector, here the en_US txt file paths
prob: the probability of keeping each line when drawing a random sample per source file. Default: 0.05 = 5%
nonascii.rm: whether to remove non-ASCII characters. Default: FALSE
Because we use prob, we should set a seed:
set.seed(1102)
filePath <- "./data/dataset.rds"
if(!file.exists(filePath)) {
dataset <- getDataSet(data_sources, nonascii.rm=TRUE)
saveRDS(dataset, filePath)
} else {
dataset <- readRDS(filePath)
}
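For reference, data_sources is just a character vector with the three file paths listed above, and getDataSet itself lives in Utils.R (not shown in the Appendix). A minimal sketch of what it might look like, assuming simple per-line random sampling and iconv-based non-ASCII removal:
data_sources <- c("./data/en_US/en_US.blogs.txt",
                  "./data/en_US/en_US.news.txt",
                  "./data/en_US/en_US.twitter.txt")

# Sketch only; the real implementation is in Utils.R
getDataSet <- function(data_sources, prob = 0.05, nonascii.rm = FALSE) {
  dplyr::bind_rows(lapply(data_sources, function(path) {
    lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
    lines <- lines[rbinom(length(lines), 1, prob) == 1]                 # keep ~prob of the lines per source
    if (nonascii.rm) lines <- iconv(lines, "UTF-8", "ASCII", sub = "")  # drop non-ASCII characters
    src <- tools::toTitleCase(sub("en_US\\.(\\w+)\\.txt", "\\1", basename(path)))
    dplyr::tibble(txt = lines, source = src)
  }))
}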
As a summary of the dataset: we have 213483 records in total, and per source:
##
## Blogs News Twitter
## 44964 50512 118007
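The totals above can be reproduced with a simple tabulation, for example:
nrow(dataset)           # total records
table(dataset$source)   # records per source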
Sample data:
## # A tibble: 15 x 2
## # Groups: source [3]
## txt source
## <chr> <chr>
## 1 Zuma has tried to reassure investors that there will be no whole… Blogs
## 2 ZZ TOP: Neighbor, Neighbor Blogs
## 3 Zukini -1 Blogs
## 4 Zuma, not being the Chaka, might be interrupted with impunity by… Blogs
## 5 Zythophile informs us that, according to chairman Colin Valentin… Blogs
## 6 Zuerlein is coming off one of the best seasons by a kicker in co… News
## 7 "Zydeco is also peppered throughout the festival, including perf… News
## 8 ZZ Top and 3 Doors Down: with Gretchen Wilson and Leroy Powell &… News
## 9 Zufall agreed with his coach. News
## 10 Zumwalt West's relay team of Matthews, Eric Rogers, Tyler Percy … News
## 11 Zucotti Park eviction, #SOPA controversies, Penn State... Twitt…
## 12 zooming their way to sunny PDX for a big show in the lounge star… Twitt…
## 13 Zotero is amazing, and (imho) should really be taught to high sc… Twitt…
## 14 Zoos furnish the live experience of animals, but books give you … Twitt…
## 15 Zubaz never die Nate Twitt…
Second step: we will explore the frequency of single words, 2-grams and 3-grams, where an n-gram means:
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application
Read more here
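As a quick toy illustration (not part of the report's code), unnest_tokens from tidytext turns a sentence into bigrams like this:
tibble(txt = "thank you for the follow") %>%
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)
# gives 4 bigrams: "thank you", "you for", "for the", "the follow"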
For word frequency, I process the following steps: tokenize the dataset into single words, remove profanity and stop words, then count each word and compute its frequency. See the Appendix for function definitions. We also get a profanity dataset, used to skip those words in this analysis, from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words. Thanks to @Shutterstock for it.
profanity_words <- getProfanityDf("./en_US.profanity.txt")
head(profanity_words)
## # A tibble: 6 x 1
## word
## <chr>
## 1 2g1c
## 2 2 girls 1 cup
## 3 acrotomophilia
## 4 alabama hot pocket
## 5 alaskan pipeline
## 6 anal
data(stop_words)
# tokenize into single words, then drop profanity and stop words (joined on the "word" column)
token_word <- tokenized(dataset) %>% anti_join(profanity_words) %>% anti_join(stop_words)
freq_word <- freq(token_word)
summary(freq_word)
## word n freq
## Length:131158 Min. : 1.00 Min. :4.630e-07
## Class :character 1st Qu.: 1.00 1st Qu.:4.630e-07
## Mode :character Median : 1.00 Median :4.630e-07
## Mean : 16.45 Mean :7.624e-06
## 3rd Qu.: 4.00 3rd Qu.:1.854e-06
## Max. :11160.00 Max. :5.172e-03
For a histogram of word frequencies:
ggplot(freq_word, aes(freq)) + geom_histogram(aes(fill=..count..),show.legend = FALSE, binwidth = 1e-06) +
xlim(NA, 0.00005) + scale_fill_gradient("count", low="green", high="red")
With a word cloud, we can again see the most common words in the dataset as a whole.
freq_word %>% with(wordcloud(word, n, max.words = 500, min.freq = 100, random.order = FALSE, colors=brewer.pal(8, "Dark2")))
For the top positive and negative sentiment words (using the Bing lexicon):
bing_word_counts <- token_word %>% inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>% ungroup()
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
And for sentiment per source: you can see that Twitter has fewer negative words than the blogs and news sources.
sentiment_word <- token_word %>% inner_join(get_sentiments("bing")) %>%
count(source, index = row_number() %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(sentiment_word, aes(index, sentiment, fill = source)) +
geom_col(show.legend = FALSE) +
facet_wrap(~source, ncol = 3, scales = "free_x", shrink = TRUE) +
theme(panel.spacing.x=unit(1.5, "lines"))
We follow the same steps as for single words, but with 2-grams (bigrams):
token_two_grams <- tokenized(dataset, token="ngrams")
token_two_grams_separated <- token_two_grams %>%
separate(bigram, c("word1", "word2"), sep=" ") %>%
filter(!word1 %in% profanity_words$word) %>%
filter(!word2 %in% profanity_words$word) %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
token_two_grams_united <- token_two_grams_separated %>%
unite(bigram, word1, word2, sep=" ")
For n-gram analysis, a bigram can be treated as a term in a document in the same way that we treated individual words, so we can compute tf-idf per source.
two_grams_tf_idf <- token_two_grams_united %>%
count(source, bigram) %>%
bind_tf_idf(bigram, source, n) %>%
arrange(desc(tf_idf))
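For reference, bind_tf_idf treats each source as a “document” here; the same tf-idf values could be computed by hand roughly like this (a sketch with a hypothetical two_grams_tf_idf_manual, not run in the report):
n_sources <- n_distinct(token_two_grams_united$source)  # 3 "documents": Blogs, News, Twitter
two_grams_tf_idf_manual <- token_two_grams_united %>%
  count(source, bigram) %>%
  group_by(source) %>%
  mutate(tf = n / sum(n)) %>%              # term frequency of the bigram within its source
  group_by(bigram) %>%
  mutate(idf = log(n_sources / n())) %>%   # log(number of sources / sources containing the bigram)
  ungroup() %>%
  mutate(tf_idf = tf * idf)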
two_grams_tf_idf %>%
arrange(desc(tf_idf)) %>%
mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) %>%
group_by(source) %>%
top_n(15) %>%
ungroup() %>%
ggplot(aes(bigram, tf_idf, fill = source)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~source, ncol = 3, scales = "free") +
coord_flip()
For sentiment analysis on the bigram data, we can examine how often sentiment-associated words are preceded by “not” or other negating words. We could use this to ignore or even reverse their contribution to the sentiment score. As an example, here are the words that follow “hate”:
AFINN <- get_sentiments("afinn")
two_grams_hate_words <- token_two_grams_separated %>%
filter(word1 == "hate") %>%
inner_join(AFINN, by=c(word2 = "word")) %>%
count(word2, value, sort = TRUE)
two_grams_hate_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(20) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(word2, n * value, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
xlab("Words preceded by \"hate\"") +
ylab("Sentiment value * number of occurrences") +
coord_flip()
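The same idea works for negating words. Since “not” and its friends were already removed along with the stop words above, this sketch (a hypothetical two_grams_negated, not run in the report) goes back to the raw bigrams and flips the sign of the contribution:
two_grams_negated <- token_two_grams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% c("not", "no", "never", "without")) %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  ungroup() %>%
  count(word1, word2, value, sort = TRUE) %>%
  mutate(contribution = -1 * n * value)   # reverse the contribution of negated words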
Another useful thing to do with bigrams is to visualize the relationships between words as a graph.
token_two_grams_separated %>%
count(word1, word2, sort=TRUE) %>%
filter(n > 100, !str_detect(word1, "\\d"), !str_detect(word2, "\\d")) %>%
top_n(100) %>%
visualizeBigrams()
With this exploratory analysis, we can see some informative patterns, for example the bigrams “2 tsp”, “rt rt”, …
Next step: with this experience with n-grams, sentiment and tf-idf, I think we can use it to build a model for predicting the next word.
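As a first, very rough sketch of that idea (predictNextWord is a hypothetical helper, my assumption about a possible next step, not the final model), we could predict the next word as the most frequent bigram completion of the current word:
bigram_counts <- token_two_grams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  ungroup() %>%
  count(word1, word2, sort = TRUE)

# most frequent completions of `current_word` among the observed bigrams
predictNextWord <- function(current_word, top = 3) {
  bigram_counts %>%
    filter(word1 == current_word) %>%
    head(top) %>%
    pull(word2)
}

predictNextWord("happy")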
To tokenize the dataset, I created a util like this:
## function (df, token = "word", n = 2)
## {
## if (token == "ngrams") {
## tokenized <- df %>% unnest_tokens(bigram, txt, token = "ngrams",
## n = n)
## }
## else {
## tokenized <- df %>% unnest_tokens(word, txt)
## }
## return(tokenized)
## }
To get the frequency of tokens, I also created a util like this:
## function (tokenized, token = "word")
## {
## if (token == "ngrams") {
## word_count <- tokenized %>% count(bigram, sort = TRUE)
## }
## else {
## word_count <- tokenized %>% count(word, sort = TRUE)
## }
## total_words <- sum(word_count$n)
## word_freq <- word_count %>% mutate(freq = n/total_words)
## return(word_freq)
## }
And to visualize the bigram graph:
## function (bigrams)
## {
## a <- grid::arrow(type = "closed", length = unit(0.15, "inches"))
## bigrams %>% graph_from_data_frame() %>% ggraph(layout = "fr") +
## geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
## arrow = a) + geom_node_point(color = "lightblue",
## size = 5) + geom_node_text(aes(label = name), vjust = 1,
## hjust = 1) + theme_void()
## }