We continue from our previous report on #qurananalytics using the tidytext and quRan packages. We explore the tidytext features applied to the English translation of the Quran following the examples used in Text Mining with R - A Tidy Approach, by Julia Silge and David Robinson.
Many interesting text analyses are based on the relationships between words, like examining which words tend to follow others immediately, or that tend to co-occur within the same documents.
In this article, we explore some of the methods tidytext offers for calculating and visualizing relationships between words in the English Quran dataset. This includes the token = “ngrams” argument, which tokenizes by pairs of adjacent words rather than by individual ones. We’ll also use the widyr package, which calculates pairwise correlations and distances within a tidy data frame.
packages <- c('dplyr', 'tidyverse', 'tidytext', 'ggplot2', 'ggraph', 'knitr', 'quRan')
for (p in packages) {
  # install any package that is not yet available, then attach it
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
We will analyze selected variables (columns) from quran_en_sahih in the quRan package.
quranES <- quran_en_sahih %>% select(surah_id,
ayah_id,
surah_title_en,
surah_title_en_trans,
revelation_type,
text,
ayah_title)
quranES
To work with this as a tidy dataset, we need to restructure it in the one-token-per-row format, which as we saw earlier is done with the unnest_tokens() function.
tidyQ <- quranES %>%
unnest_tokens(word, text)
tidyQ
This function separates each line of text in the original data frame into tokens. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.
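As a small illustration of the token argument, the sketch below tokenizes one made-up sentence (not Quran text), first by words and then by characters. The toy tibble and its column names are assumptions for this example only.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row data frame; the sentence is made up for illustration
toy <- tibble(line = 1, text = "The quick brown fox jumps")

# Default tokenizer: one lowercased word per row
toy_words <- toy %>% unnest_tokens(word, text)

# Same text split into individual characters instead
toy_chars <- toy %>% unnest_tokens(char, text, token = "characters")

toy_words
toy_chars
```

Other metadata columns (here, line) are carried along with every token, which is what lets us keep the Surah and ayah information in tidyQ.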
Now that the data is in one-word-per-row format, we can manipulate it with tidy tools like dplyr. Often in text analysis, we will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join().
data(stop_words)
tidyQ <- tidyQ %>%
  anti_join(stop_words, by = "word")
# apply(tidyQ, 2, function(x) any(is.na(x))) (Good habit to test for NA and others like NaN etc)
The stop_words dataset in the tidytext package contains stop words from three lexicons. We can use them all together, as we have here, or filter() to only use one set of stop words if that is more appropriate for a certain analysis.
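To restrict the removal to a single lexicon, we can filter stop_words on its lexicon column before the anti_join(). The sketch below uses a few hypothetical tokens rather than the Quran data; the names snowball_only, toy_tokens and toy_kept are made up for the example.

```r
library(dplyr)
library(tidytext)

data(stop_words)

# The lexicons bundled in stop_words are identified by the `lexicon` column
unique(stop_words$lexicon)

# Keep only the snowball list
snowball_only <- stop_words %>% filter(lexicon == "snowball")

# Hypothetical tokens for illustration
toy_tokens <- tibble(word = c("the", "straight", "path", "of", "guidance"))

# Drop every token that appears in the snowball list
toy_kept <- toy_tokens %>% anti_join(snowball_only, by = "word")
toy_kept
```

A shorter list removes fewer words, so the choice of lexicon changes which words reach the later counts.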
We can also use dplyr’s count() to find the most common words in the Quran as a whole. Because we’ve been using tidy tools, our word counts are stored in a tidy data frame, which we can pipe directly into ggplot2.
tidyQ %>% count(word, sort = TRUE)
tidyQ %>%
count(word, sort = TRUE) %>%
filter(n > 150) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col(fill = "#79ad95") +
coord_flip() +
theme(axis.text = element_text(
angle = 0,
color="blue",
size=10)) +
labs(title = "Top Single Words in the Quran",
subtitle = "Count > 150",
caption = "Saheeh International Translation")
Lots of useful work can be done by tokenizing at the word level (please refer to the earlier post), but sometimes it is useful or necessary to look at different units of text, such as words that follow one another or that co-occur within the same verses or Surahs.
Besides tidytext and widyr, we’ll use ggraph, which extends ggplot2 to construct network plots. Together these expand our toolbox for exploring text within the tidy data framework.
We can also use the function unnest_tokens to tokenize into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.
We do this by adding the token = “ngrams” option to unnest_tokens(), and setting n to the number of words we wish to capture in each n-gram. When we set n to 2, we are examining pairs of two consecutive words, often called “bigrams”.
quran_bigrams <- quranES %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
quran_bigrams
# apply(quran_bigrams, 2, function(x) any(is.na(x)))
This data structure is still a variation of the tidy text format. It is structured as one-token-per-row (with extra metadata, such as surah, still preserved), but each token now represents a bigram.
Notice that these bigrams overlap: “in the”, “the name”, “name of” are separate bigrams.
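The overlap is easy to see on a single short phrase. This is a minimal sketch on a toy tibble (the names toy and toy_bigrams are made up for the example), not on the full dataset.

```r
library(dplyr)
library(tidytext)

# One short phrase, tokenized into overlapping pairs of adjacent words
toy <- tibble(id = 1, text = "In the name of")

toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_bigrams$bigram
```

Four words give three bigrams, because every word except the first and the last appears in two pairs.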
We can examine the most common bigrams using dplyr’s count():
quran_bigrams %>%
count(bigram, sort = TRUE)
Many of the most common bigrams are pairs of common (uninteresting) words, such as “of the”, “those who”, “and the”, “do not”. We call these “stop-words”. We use tidyr’s separate(), which splits a column into multiple based on a delimiter. This lets us separate it into two columns, “word1” and “word2”, at which point we can remove cases where either is a stop-word.
bigrams_separated <- quran_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_counts
We can see that words related to some aspects of the Islamic teachings form the most common pairs in Quran Surahs.
In other analyses, we may want to work with the recombined words. tidyr’s unite() function is the inverse of separate(), and lets us recombine the columns into one. Thus, “separate/filter/count/unite” let us find the most common bigrams not containing stop-words.
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
bigrams_united
# apply(bigrams_united, 2, function(x) any(is.na(x)))
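The whole separate/filter/count/unite sequence can be traced on a handful of rows. The sketch below uses a made-up bigram column (toy_bigrams is a hypothetical name) so that each step is easy to follow.

```r
library(dplyr)
library(tidyr)
library(tidytext)

data(stop_words)

# Made-up bigrams: two contain stop words, one informative pair occurs twice
toy_bigrams <- tibble(
  bigram = c("of the", "straight path", "the path", "straight path")
)

toy_result <- toy_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%  # split into two columns
  filter(!word1 %in% stop_words$word,                   # drop pairs with a stop word
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE) %>%                  # count the remaining pairs
  unite(bigram, word1, word2, sep = " ")                # glue the words back together

toy_result
```

Only "straight path" survives, with a count of 2.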
In other analyses you may be interested in the most common trigrams, which are consecutive sequences of 3 words. We can find this by setting n = 3:
trigram_counts <- quranES %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
count(word1, word2, word3, sort = TRUE)
# apply(trigram_counts, 2, function(x) any(is.na(x))) (found some NA, so next line corrects)
trigram_counts <- trigram_counts %>% filter(!is.na(word1) & !is.na(word2) & !is.na(word3))
trigrams_united <- trigram_counts %>%
unite(trigram, word1, word2, word3, sep = " ")
trigrams_united
This one-bigram-per-row format is helpful for exploratory analyses of the text. As a simple example, we might be interested in the words that most commonly follow “allah” in each Surah:
bigrams_filtered %>%
filter(word1 == "allah") %>%
count(surah_title_en, word2, sort = TRUE)
We plot such words with counts >= 5.
bigrams_filtered %>%
filter(word1 == "allah") %>%
group_by(word2) %>%
summarise(num = n()) %>%
filter(num >= 5) %>%
mutate(word2 = reorder(word2, num)) %>%
ggplot(aes(x = word2, y = num)) +
geom_col(fill = "#79ad95") +
coord_flip() +
theme(axis.text = element_text(
angle = 0,
color="blue",
size=10)) +
labs(title = "Top Words With 'Allah' in the Quran",
subtitle = "Count >= 5",
caption = "Saheeh International Translation")
A bigram can also be treated as a term in a document in the same way that we treated individual words. For example, we can look at the tf-idf of bigrams across Quran Surahs. These tf-idf values can be visualized within each long surah, just as we did for words.
bigram_tf_idf <- bigrams_united %>%
filter(surah_title_en %in%
c("Al-Baqara", "Aal-i-Imraan", "An-Nisaa", "Al-Maaida", "Al-An'aam", "Al-A'raaf")) %>%
count(surah_title_en, bigram) %>%
bind_tf_idf(bigram, surah_title_en, n) %>%
arrange(desc(tf_idf))
# bigram_tf_idf
bigram_tf_idf %>%
  # order the bigrams by tf-idf so the bars are sorted in the plot
  mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) %>%
  group_by(surah_title_en) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(bigram, tf_idf, fill = surah_title_en)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~surah_title_en, ncol = 2, scales = "free") +
  coord_flip()
There are advantages and disadvantages to examining the tf-idf of bigrams rather than individual words. Pairs of consecutive words might capture structure that isn’t present when we are just counting single words, and may provide context that makes tokens more understandable (for example, “sacred house”, in Al-Maaida, is more informative than “house” or “sacred” separately). However, the per-bigram counts are also more sparse: a typical two-word pair is more rare than either of its component words. Thus, bigrams can be especially useful when you have a very large text dataset.
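To make the tf-idf arithmetic concrete, here is a minimal sketch on made-up counts for two hypothetical documents "A" and "B". bind_tf_idf() computes tf as a term's share of its document and idf as the natural log of the number of documents divided by the number of documents containing the term.

```r
library(dplyr)
library(tidytext)

# Made-up bigram counts for two hypothetical documents
toy_counts <- tibble(
  surah  = c("A", "A", "B", "B"),
  bigram = c("common pair", "unique one", "common pair", "unique two"),
  n      = c(3, 1, 2, 2)
)

toy_tf_idf <- toy_counts %>%
  bind_tf_idf(bigram, surah, n)

toy_tf_idf

# Manual check for "unique one": tf = 1/4, idf = log(2/1)
manual_unique_one <- (1 / 4) * log(2 / 1)
manual_unique_one
```

A bigram that occurs in every document gets idf = log(1) = 0, so its tf-idf is zero however often it appears. This is why tf-idf surfaces the bigrams that distinguish one Surah from the others.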
We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network graph. Here we’ll be referring to a “graph” not in the sense of a visualization, but as a combination of connected nodes. A graph can be constructed from a tidy object since it has three variables: from (the node an edge is coming from), to (the node an edge is going towards), and weight (a numeric value associated with each edge).
The igraph package has many powerful functions for manipulating and analyzing networks. One way to create an igraph object from tidy data is the graph_from_data_frame() function, which takes a data frame of edges with columns for “from” (word1), “to” (word2), and edge attributes (in this case n):
library(igraph)
head(bigram_counts, 10) # original counts
bigram_graph <- bigram_counts %>%
filter(n >= 10) %>%
graph_from_data_frame()
bigram_graph
## IGRAPH 0ccf21d DN-- 65 44 --
## + attr: name (v/c), n (e/n)
## + edges from 0ccf21d (vertex names):
## [1] fear ->allah painful ->punishment righteous ->deeds
## [4] worldly ->life allah ->belongs defiantly ->disobedient
## [7] rivers ->flow establish ->prayer straight ->path
## [10] abide ->eternally gardens ->beneath worship ->allah
## [13] wrongdoing->people rightly ->guided severe ->punishment
## [16] believing ->women palm ->trees allah ->loves
## [19] al ->haram al ->masjid masjid ->al
## [22] abiding ->eternally allah ->lord obey ->allah
## + ... omitted several edges
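The mapping from columns to graph elements is easier to see on a tiny edge list. In the sketch below the word pairs are taken from the output above, but the counts are made up for illustration and toy_edges and toy_graph are hypothetical names.

```r
library(igraph)

# First two columns become the edge endpoints; remaining columns become edge attributes
toy_edges <- data.frame(
  word1 = c("straight", "establish", "fear"),
  word2 = c("path", "prayer", "allah"),
  n     = c(30, 25, 60)  # made-up counts, not the real bigram frequencies
)

toy_graph <- graph_from_data_frame(toy_edges)

vcount(toy_graph)       # number of distinct words (nodes)
ecount(toy_graph)       # number of bigrams (edges)
E(toy_graph)$n          # edge attribute carried over from the n column
is_directed(toy_graph)  # edges point from word1 to word2
```

The graph is directed by default, which matches the reading order of a bigram.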
igraph has plotting functions built in, but many other packages have developed visualization methods for graph objects. The ggraph package (Pedersen 2017) implements these visualizations in terms of the grammar of graphics. We can convert an igraph object into a ggraph with the ggraph function, after which we add layers to it, much like layers are added in ggplot2. For example, for a basic graph we need to add three layers: nodes, edges, and text.
library(ggraph)
set.seed(2017)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(color = "gray") +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1, color = "steelblue") +
labs(title = "Common Bigrams in English Quran",
subtitle = "Count >= 10",
caption = "Saheeh International Translation")
We can visualize some details of the text structure. For example, we see that “allah” has a dominant role. We also see some known concepts in Islam like “straight path” and “establish prayer”. “perpetual residence”, “rivers flow”, “gardens beneath” are known rewards for those who “enter paradise”.
We conclude with a few polishing operations to make a better looking graph:
set.seed(2016)
# closed arrowheads show the direction of each bigram (word1 -> word2)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(width = n, edge_alpha = 3*n), edge_colour = "red",
                 arrow = a, end_cap = circle(.07, "inches"), show.legend = FALSE) +
  geom_node_point(color = "blue", size = 2) +
  geom_node_text(aes(label = name), repel = TRUE, size = 4) +
  theme_void() +
  labs(title = "Common Bigrams in English Quran",
       subtitle = "Count >= 10",
       caption = "Saheeh International Translation")
It may take some experimentation with ggraph to get your networks into a presentable format, but the network model is a useful and flexible way to visualize relational tidy data.
Note that this is a visualization of a Markov chain, a common model in text processing. In a Markov chain, each choice of word depends only on the previous word. In this case, a random generator following this model might spit out “lord”, then “loves”, then “guides/obey”, by following each word to the most common words that follow it. To make the visualization interpretable, we chose to show only the most common word to word connections, but one could imagine an enormous graph representing all connections that occur in the Quran.
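The Markov chain idea can be sketched in a few lines of base R. The example below is only an illustration under stated assumptions: the word pairs come from the graph output earlier in the post, the counts are made up, and next_word() and generate_chain() are hypothetical helpers, not part of tidytext or igraph.

```r
# Made-up bigram counts in the same shape as bigram_counts (word1, word2, n)
toy_counts <- data.frame(
  word1 = c("fear", "allah", "allah", "straight"),
  word2 = c("allah", "loves", "belongs", "path"),
  n     = c(6, 3, 1, 4),
  stringsAsFactors = FALSE
)

# Draw the next word with probability proportional to the bigram count
next_word <- function(counts, current) {
  candidates <- counts[counts$word1 == current, ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$word2[sample.int(nrow(candidates), 1, prob = candidates$n)]
}

# Follow the chain until no successor exists or max_len words are reached
generate_chain <- function(counts, start, max_len = 5) {
  chain <- start
  while (length(chain) < max_len) {
    nxt <- next_word(counts, chain[length(chain)])
    if (is.na(nxt)) break
    chain <- c(chain, nxt)
  }
  chain
}

set.seed(42)
toy_chain <- generate_chain(toy_counts, "fear")
toy_chain
```

Each step looks only at the current word, which is exactly the Markov property described above. The same functions would run on the real bigram_counts, since it has the same three columns.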
Some tokens that survive the default stop word list, such as digits and the Arabic article “al”, add little to the analysis. We can remove them by making our own list of custom stop words and using anti_join() to drop them from the data frame, just like we removed the default stop words that are in the tidytext package. As we go along we will add to this tibble.
We start with the surah_words tibble and clean it, saving the result into a different tibble in case we need the original later.
surah_words <- quranES %>%
unnest_tokens(word, text) %>%
count(surah_title_en, word, sort = TRUE)
# surah_words
total_words <- surah_words %>%
group_by(surah_title_en) %>%
summarize(total = sum(n))
surah_words <- left_join(surah_words, total_words)
# surah_words
my_stopwords <- tibble(word = c(as.character(1:10),
"al"))
# my_stopwords
surah_words_clean <- surah_words %>%
anti_join(stop_words) %>%
anti_join(my_stopwords)
## Remove rows with a missing word (is.null() is always FALSE for a column, so it is not needed)
surah_words_clean <- surah_words_clean %>% filter(!is.na(word))
# Plot
surah_words_clean %>% filter (n >= 20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n, fill = surah_title_en)) +
geom_col() +
coord_flip() +
labs(title = "Common Words in English Quran",
subtitle = "Count >= 20 and Separated by Surah",
caption = "Saheeh International Translation")
As expected, “allah” and “lord” dominate. Surah Yusuf is about Prophet Joseph. Surahs Taa-Haa and Al-A’raaf provide many details about the story of Prophet Moses.
As a next step, let’s examine which words commonly occur together in the Surahs. We can then examine word networks for these words; this may help us see, for example, which Surahs are related to each other.
We can use pairwise_count() from the widyr package to count how many Surahs each pair of words appears in together.
library(widyr)
word_pairs <- surah_words_clean %>%
pairwise_count(word, surah_title_en, sort = TRUE, upper = FALSE)
word_pairs
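What pairwise_count() returns is easiest to check on a toy example. The sketch below uses two hypothetical documents "A" and "B" and made-up word lists, with one row per document and word, like surah_words_clean.

```r
library(dplyr)
library(widyr)

# One row per (document, word); the words and documents are made up
toy <- tibble(
  surah = c("A", "A", "B", "B", "B"),
  word  = c("mercy", "path", "mercy", "path", "earth")
)

# n = number of documents in which both words of the pair appear
toy_pairs <- toy %>%
  pairwise_count(word, surah, sort = TRUE, upper = FALSE)

toy_pairs
```

"mercy" and "path" share both documents, so their count is 2; each pair involving "earth" occurs only in "B". Setting upper = FALSE keeps each unordered pair once instead of listing it in both directions.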
Create a graph network of these co-occurring words so we can see the relationships better. The filter will determine the size of the graph.
Gquran <- word_pairs %>%
filter(n >= 30) %>%
graph_from_data_frame()
Gquran
## IGRAPH 16f8561 DN-- 124 1683 --
## + attr: name (v/c), n (e/n)
## + edges from 16f8561 (vertex names):
## [1] allah ->lord allah ->earth allah ->day
## [4] lord ->day allah ->muhammad lord ->earth
## [7] earth ->day lord ->muhammad allah ->people
## [10] earth ->muhammad earth ->punishment allah ->punishment
## [13] people ->lord people ->earth lord ->punishment
## [16] day ->muhammad allah ->righteous day ->punishment
## [19] punishment->muhammad lord ->righteous people ->day
## [22] people ->muhammad allah ->created allah ->heavens
## + ... omitted several edges
Now plot.
set.seed(1234)
Gquran %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "#a7f2dc") +
geom_node_point(aes(size = igraph::degree(Gquran)), colour = "#e384dd") +
geom_node_text(aes(label = name), repel = TRUE, size = 2) +
theme_void () +
labs(title = "Common Word Pairs in English Quran",
subtitle = "Count (n) >= 30",
caption = "Saheeh International Translation")
Let’s make a network of the keywords to see which keywords commonly occur together in the Quran.
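Which keywords appear in the most Surahs? Because surah_words_clean has one row per Surah and word, counting rows per word gives the number of Surahs in which each word occurs.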
surah_words_clean %>%
group_by(word) %>%
count(sort = TRUE)
To examine the relationships among keywords in a different way, we can find the correlation among the keywords by looking for those that are more likely to occur together in the same Surah than with other keywords. We keep only words that appear in at least 50 Surahs.
keyword_cors <- surah_words_clean %>%
group_by(word) %>%
filter(n() >= 50) %>%
pairwise_cor(word, surah_title_en, sort = TRUE, upper = FALSE)
keyword_cors
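The correlation reported by pairwise_cor() can be reproduced by hand on a toy example. In this sketch (made-up words across four hypothetical documents), each word is treated as a 0/1 indicator of presence in a document, and the result is compared with base R's cor() on those indicators.

```r
library(dplyr)
library(widyr)

# One row per (document, word); the data are made up for illustration
toy <- tibble(
  surah = c("A", "A", "B", "B", "C", "C", "D"),
  word  = c("mercy", "path", "mercy", "path", "mercy", "earth", "earth")
)

toy_cors <- toy %>%
  pairwise_cor(word, surah, sort = TRUE, upper = FALSE)

toy_cors

# Presence indicators over documents A, B, C, D
mercy <- c(1, 1, 1, 0)
path  <- c(1, 1, 0, 0)
manual_cor <- cor(mercy, path)
manual_cor
```

Unlike a raw co-occurrence count, the correlation also accounts for the documents in which one word appears without the other, so very frequent words do not automatically dominate.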
Let’s visualize the network of keyword correlations.
set.seed(1234)
Gquran <- keyword_cors %>%
filter(correlation > .3) %>%
graph_from_data_frame()
Gquran
## IGRAPH 34bd498 DN-- 27 280 --
## + attr: name (v/c), correlation (e/n)
## + edges from 34bd498 (vertex names):
## [1] earth ->punishment punishment->exalted earth ->heavens
## [4] people ->truth punishment->heavens heavens ->exalted
## [7] knowing ->merciful merciful ->heavens truth ->heavens
## [10] knowing ->truth believed ->exalted people ->heavens
## [13] people ->knowing punishment->truth righteous ->deeds
## [16] earth ->exalted knowing ->heavens knowing ->punishment
## [19] merciful ->exalted knowing ->exalted fear ->punishment
## [22] people ->exalted punishment->muhammad truth ->disbelievers
## + ... omitted several edges
Gquran %>% ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "#a7f2dc") +
geom_node_point(size = 0.5*igraph::degree(Gquran), colour = "#e384dd") +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()+
labs(title = "Common Keyword Correlations in English Quran",
subtitle = "Correlation Factor > 0.3",
caption = "Saheeh International Translation")
This post showed how the tidy text approach is useful not only for analyzing individual words, but also for exploring the relationships and connections between words. Such relationships can involve n-grams, which enable us to see what words tend to appear after others, or co-occurrences and correlations, for words that appear in proximity to each other. This post also demonstrated the ggraph package for visualizing both of these types of relationships as networks. These network visualizations are a flexible tool for exploring relationships, and will play an important role in the future case studies on #qurananalytics.