Introduction

We continue from our previous report on #qurananalytics using the tidytext and quRan packages. We explore tidytext features applied to the English translation of the Quran, following the examples in Text Mining with R: A Tidy Approach by Julia Silge and David Robinson.

Many interesting text analyses are based on the relationships between words, such as examining which words tend to follow others immediately, or which tend to co-occur within the same documents.

In this article, we explore some of the methods tidytext offers for calculating and visualizing relationships between words in the English Quran dataset. This includes the token = “ngrams” argument, which tokenizes by pairs of adjacent words rather than by individual ones. We’ll also use the widyr package, which calculates pairwise correlations and distances within a tidy data frame.


Preliminaries

Load Packages and Libraries

# Install any missing packages, then load them all
packages <- c('dplyr', 'tidyverse', 'tidytext', 'ggplot2', 'ggraph', 'knitr', 'quRan')
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

Focus on Selected Quran Version and Variables

We will analyze selected variables (columns) from quran_en_sahih in the quRan package.

quranES <- quran_en_sahih %>% select(surah_id, 
                                   ayah_id,
                                   surah_title_en, 
                                   surah_title_en_trans, 
                                   revelation_type, 
                                   text,
                                   ayah_title)
quranES

To work with this as a tidy dataset, we need to restructure it in the one-token-per-row format, which as we saw earlier is done with the unnest_tokens() function.

tidyQ <- quranES %>%
  unnest_tokens(word, text)
tidyQ

This function separates each line of text in the original data frame into tokens. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.
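For instance, here is a minimal sketch of tokenizing by sentences instead of words (this object is not used in the rest of the analysis):

quranES %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  select(surah_title_en, sentence)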

Now that the data is in one-word-per-row format, we can manipulate it with tidy tools like dplyr. Often in text analysis, we will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join().

data(stop_words)

tidyQ <- tidyQ %>%
  anti_join(stop_words)
# Good habit: check for NA/NaN values after the join, e.g.
# apply(tidyQ, 2, function(x) any(is.na(x)))

The stop_words dataset in the tidytext package contains stop words from three lexicons. We can use them all together, as we have here, or filter() to only use one set of stop words if that is more appropriate for a certain analysis.
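For example, a minimal sketch restricting the anti_join() to just the snowball lexicon (tidyQ_snowball is a throwaway name for illustration):

stop_words %>% count(lexicon)   # the three lexicons and their sizes
tidyQ_snowball <- quranES %>%
  unnest_tokens(word, text) %>%
  anti_join(filter(stop_words, lexicon == "snowball"), by = "word")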

Count and Plot

We can also use dplyr’s count() to find the most common words in all the Quran as a whole. Because we’ve been using tidy tools, our word counts are stored in a tidy data frame. This allows us to pipe this directly to the ggplot2 package.

tidyQ %>% count(word, sort = TRUE)
tidyQ %>%
  count(word, sort = TRUE) %>%
  filter(n > 150) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "#79ad95") +
  coord_flip() +
  theme(axis.text = element_text( 
    angle = 0, 
    color="blue", 
    size=10)) +
  labs(title = "Top Single Words in the Quran",
       subtitle = "Count > 150",
       caption = "Saheeh International Translation")


Relationships between words: n-grams and correlations

Lots of useful work can be done by tokenizing at the word level (please refer to the earlier post). But sometimes it is useful or necessary to look at different units of text. Many interesting text analyses are based on the relationships between words, whether examining which words tend to follow others immediately, or which tend to co-occur within the same sentences or chapters (verses or Surahs in the Quran).

Here we explore some of the methods tidytext offers for calculating and visualizing these relationships between words in the English Quran dataset. This includes the token = “ngrams” argument, which tokenizes by pairs of adjacent words rather than by individual ones.

We’ll also use two new packages: ggraph, which extends ggplot2 to construct network plots, and widyr, which calculates pairwise correlations and distances within a tidy data frame. Together these expand our toolbox for exploring text within the tidy data framework.

Tokenizing by n-gram

We can also use the function unnest_tokens to tokenize into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.

We do this by adding the token = “ngrams” option to unnest_tokens(), and setting n to the number of words we wish to capture in each n-gram. When we set n to 2, we are examining pairs of two consecutive words, often called “bigrams”.

quran_bigrams <- quranES %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

quran_bigrams
# apply(quran_bigrams, 2, function(x) any(is.na(x)))

This data structure is still a variation of the tidy text format. It is structured as one-token-per-row (with extra metadata, such as surah, still preserved), but each token now represents a bigram.

Notice that these bigrams overlap: “in the”, “the name”, “name of” are separate bigrams.
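A quick sketch shows this for the opening verse, “In the name of Allah, the Entirely Merciful, the Especially Merciful” (assuming ayah_id 1 is that verse, as in quran_en_sahih):

quran_bigrams %>%
  filter(ayah_id == 1) %>%
  pull(bigram)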

Counting and filtering n-grams

We can examine the most common bigrams using dplyr’s count():

quran_bigrams %>%
  count(bigram, sort = TRUE)

Many of the most common bigrams are pairs of common (uninteresting) words, such as “of the”, “those who”, “and the”, and “do not”: pairs built from stop words. We can use tidyr’s separate(), which splits a column into multiple columns based on a delimiter, to split the bigram column into two columns, “word1” and “word2”, at which point we can remove cases where either word is a stop word.

bigrams_separated <- quran_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts

We can see that word pairs related to aspects of Islamic teachings are the most common bigrams across the Quran’s Surahs.

In other analyses, we may want to work with the recombined words. tidyr’s unite() function is the inverse of separate(), and lets us recombine the columns into one. Thus, “separate/filter/count/unite” let us find the most common bigrams not containing stop-words.

bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")

bigrams_united
# apply(bigrams_united, 2, function(x) any(is.na(x)))

In other analyses you may be interested in the most common trigrams, consecutive sequences of 3 words, which we can find by setting n = 3:

trigram_counts <- quranES %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)
# apply(trigram_counts, 2, function(x) any(is.na(x))) found some NAs, so remove them:
trigram_counts <- trigram_counts %>% filter(!is.na(word1) & !is.na(word2) & !is.na(word3))
trigrams_united <- trigram_counts %>%
  unite(trigram, word1, word2, word3, sep = " ")
trigrams_united

Analyzing bigrams

This one-bigram-per-row format is helpful for exploratory analyses of the text. As a simple example, we might be interested in the most common words that follow “allah” in each Surah:

bigrams_filtered %>%
  filter(word1 == "allah") %>%
  count(surah_title_en, word2, sort = TRUE)

We plot such words with counts >= 5.

bigrams_filtered %>%
   filter(word1 == "allah") %>%
   group_by(word2) %>%
   summarise(num = n()) %>%
   filter(num >= 5) %>%
   mutate(word2 = reorder(word2, num)) %>%
   ggplot(aes(x = word2, y = num)) +
   geom_col(fill = "#79ad95") +
   coord_flip() +
   theme(axis.text = element_text( 
     angle = 0, 
     color="blue", 
     size=10)) +
   labs(title = "Top Words With 'Allah' in the Quran",
        subtitle = "Count >= 5",
        caption = "Saheeh International Translation")

A bigram can also be treated as a term in a document, in the same way that we treated individual words. For example, we can look at the tf-idf of bigrams across Quran Surahs. These tf-idf values can be visualized within each of the longer Surahs, just as we did for words.

bigram_tf_idf <- bigrams_united %>%
  filter(surah_title_en %in% 
            c("Al-Baqara", "Aal-i-Imraan", "An-Nisaa", "Al-Maaida", "Al-An'aam", "Al-A'raaf")) %>%
  count(surah_title_en, bigram) %>%
  bind_tf_idf(bigram, surah_title_en, n) %>%
  arrange(desc(tf_idf))

# bigram_tf_idf
bigram_tf_idf %>%
  # convert bigram to a factor ordered by tf-idf so the bars sort within facets
  mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) %>%
  group_by(surah_title_en) %>%
  top_n(10) %>%
  ungroup() %>%
  ggplot(aes(bigram, tf_idf, fill = surah_title_en)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~surah_title_en, ncol = 2, scales = "free") +
  coord_flip()

There are advantages and disadvantages to examining the tf-idf of bigrams rather than individual words. Pairs of consecutive words might capture structure that isn’t present when we are just counting single words, and may provide context that makes tokens more understandable (for example, “sacred house”, in Al-Maaida, is more informative than “house” or “sacred” separately). However, the per-bigram counts are also more sparse: a typical two-word pair is more rare than either of its component words. Thus, bigrams can be especially useful when you have a very large text dataset.
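A quick sketch of that sparsity point, using the “sacred house” example from above:

# the bigram is much rarer than either of its component words
bigrams_united %>% filter(bigram == "sacred house") %>% nrow()
tidyQ %>% filter(word %in% c("sacred", "house")) %>% count(word)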

Visualizing a network of bigrams with ggraph

We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network graph. Here we’ll be referring to a “graph” not in the sense of a visualization, but as a combination of connected nodes. A graph can be constructed from a tidy object since it has three variables:

  1. from: the node an edge is coming from
  2. to: the node an edge is going towards
  3. weight: A numeric value associated with each edge

The igraph package has many powerful functions for manipulating and analyzing networks. One way to create an igraph object from tidy data is the graph_from_data_frame() function, which takes a data frame of edges with columns for “from” (word1), “to” (word2), and edge attributes (in this case n):

library(igraph)

head(bigram_counts, 10) # original counts
bigram_graph <- bigram_counts %>%
  filter(n >= 10) %>%
  graph_from_data_frame()

bigram_graph
## IGRAPH 0ccf21d DN-- 65 44 -- 
## + attr: name (v/c), n (e/n)
## + edges from 0ccf21d (vertex names):
##  [1] fear      ->allah       painful   ->punishment  righteous ->deeds      
##  [4] worldly   ->life        allah     ->belongs     defiantly ->disobedient
##  [7] rivers    ->flow        establish ->prayer      straight  ->path       
## [10] abide     ->eternally   gardens   ->beneath     worship   ->allah      
## [13] wrongdoing->people      rightly   ->guided      severe    ->punishment 
## [16] believing ->women       palm      ->trees       allah     ->loves      
## [19] al        ->haram       al        ->masjid      masjid    ->al         
## [22] abiding   ->eternally   allah     ->lord        obey      ->allah      
## + ... omitted several edges

igraph has plotting functions built in, but many other packages have developed visualization methods for graph objects. The ggraph package (Pedersen 2017) implements these visualizations in terms of the grammar of graphics. We can convert an igraph object into a ggraph with the ggraph function, after which we add layers to it, much like layers are added in ggplot2. For example, for a basic graph we need to add three layers: nodes, edges, and text.

library(ggraph)
set.seed(2017)

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(color = "gray") +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1, color = "steelblue") +
  labs(title = "Common Bigrams in English Quran",
        subtitle = "Count >= 10",
        caption = "Saheeh International Translation")

We can visualize some details of the text structure. For example, we see that “allah” has a dominant role. We also see well-known concepts in Islam such as “straight path” and “establish prayer”. Phrases like “perpetual residence”, “rivers flow”, and “gardens beneath” describe known rewards for those who “enter paradise”.

We conclude with a few polishing operations to make a better looking graph:

  • Add the edge_alpha aesthetic to the link layer to make links transparent based on how common or rare the bigram is
  • Add directionality with an arrow, constructed with grid::arrow() and applied to the edge layer
  • Tinker with the options to the node layer to make the nodes more attractive (larger, blue points)
  • Add a theme that’s useful for plotting networks, theme_void()

set.seed(2016)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_width = n, edge_alpha = n), edge_colour = "red",
                 arrow = a, end_cap = circle(.07, "inches"), show.legend = FALSE) +
  geom_node_point(color = "blue", size = 2) +
  geom_node_text(aes(label = name), repel = TRUE, size = 4) +
  theme_void() +
  labs(title = "Common Bigrams in English Quran",
        subtitle = "Count >= 10",
        caption = "Saheeh International Translation")

It may take some experimentation with ggraph to get your networks into a presentable format, but the network model is a useful and flexible way to visualize relational tidy data.

Note that this is a visualization of a Markov chain, a common model in text processing. In a Markov chain, each choice of word depends only on the previous word. In this case, a random generator following this model might spit out “lord”, then “loves”, then “guides”/“obey”, by following each word to the most common words that follow it. To make the visualization interpretable, we chose to show only the most common word-to-word connections, but one could imagine an enormous graph representing all connections that occur in the Quran.
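As a minimal sketch of that idea (next_word is our own illustrative helper; chains are short because stop words were removed when building bigram_counts):

# follow a word to the most common word that comes after it
next_word <- function(current) {
  follows <- bigram_counts %>%
    filter(word1 == current) %>%
    arrange(desc(n))
  if (nrow(follows) == 0) return(NA_character_)
  follows$word2[1]
}

chain <- "fear"
for (i in 1:3) {
  chain <- c(chain, next_word(tail(chain, 1)))
}
chain  # e.g. starts "fear" -> "allah" -> ...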

Fine Tuning

Custom Stop Words

We can remove additional uninformative tokens by making our own list of custom stop words and using anti_join() to remove them from the data frame, just as we removed the default stop words that ship with the tidytext package. As we go along we will add to this tibble.

We start with the surah_words tibble and clean it, saving the result into a different tibble in case we need the original later.

surah_words <- quranES %>%
  unnest_tokens(word, text) %>%
  count(surah_title_en, word, sort = TRUE)
# surah_words

total_words <- surah_words %>% 
  group_by(surah_title_en) %>% 
  summarize(total = sum(n))

surah_words <- left_join(surah_words, total_words)

# surah_words
my_stopwords <- tibble(word = c(as.character(1:10), "al"))
# my_stopwords

surah_words_clean <- surah_words %>%
   anti_join(stop_words) %>%
   anti_join(my_stopwords)
## Remove any NA tokens (note: !is.null() on a column is always TRUE, so we test is.na() only)
surah_words_clean <- surah_words_clean %>% filter(!is.na(word))
# Plot
surah_words_clean %>% filter (n >= 20) %>% 
   mutate(word = reorder(word, n)) %>% 
   ggplot(aes(x = word, y = n, fill = surah_title_en)) + 
   geom_col() + 
   coord_flip() +
   labs(title = "Common Words in English Quran",
        subtitle = "Count >= 20 and Separated by Surah",
        caption = "Saheeh International Translation")

As expected, “allah” and “lord” dominate. Surah Yusuf is about Prophet Joseph, and Surahs Taa-Haa and Al-A’raaf provide many details about the story of Prophet Moses.

Word co-occurrences and correlations

As a next step, let’s examine which words commonly occur together in the Surahs. We can then build word networks from these co-occurrences; this may help us see, for example, which Surahs are related to each other.

We can use pairwise_count() from the widyr package to count how many times each pair of words occurs together in a Surah.

library(widyr)
word_pairs <- surah_words_clean %>% 
  pairwise_count(word, surah_title_en, sort = TRUE, upper = FALSE)
word_pairs

Let’s create a network graph of these co-occurring words so we can see the relationships better. The filter threshold determines the size of the graph.

Gquran <- word_pairs %>%
  filter(n >= 30) %>%
  graph_from_data_frame()
Gquran
## IGRAPH 16f8561 DN-- 124 1683 -- 
## + attr: name (v/c), n (e/n)
## + edges from 16f8561 (vertex names):
##  [1] allah     ->lord       allah     ->earth      allah     ->day       
##  [4] lord      ->day        allah     ->muhammad   lord      ->earth     
##  [7] earth     ->day        lord      ->muhammad   allah     ->people    
## [10] earth     ->muhammad   earth     ->punishment allah     ->punishment
## [13] people    ->lord       people    ->earth      lord      ->punishment
## [16] day       ->muhammad   allah     ->righteous  day       ->punishment
## [19] punishment->muhammad   lord      ->righteous  people    ->day       
## [22] people    ->muhammad   allah     ->created    allah     ->heavens   
## + ... omitted several edges

Now plot.

set.seed(1234)
Gquran %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "#a7f2dc") +
  geom_node_point(aes(size = igraph::degree(Gquran)), colour = "#e384dd") +
  geom_node_text(aes(label = name), repel = TRUE, size = 2) +
  theme_void () +
  labs(title = "Common Word Pairs in English Quran",
        subtitle = "Count (n) >= 30",
        caption = "Saheeh International Translation")

Networks of Keywords

Let’s make a network of the keywords to see which keywords commonly occur together in the Quran.

What are the most common keywords?

surah_words_clean %>% 
  group_by(word) %>% 
  count(sort = TRUE)

To examine the relationships among keywords in a different way, we can find the correlation among them: how much more likely two keywords are to appear together in the same Surah than they would be if they occurred independently. widyr’s pairwise_cor() measures this with the phi coefficient.

keyword_cors <- surah_words_clean %>% 
  group_by(word) %>%
  filter(n() >= 50) %>%
  pairwise_cor(word, surah_title_en, sort = TRUE, upper = FALSE)

keyword_cors
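For binary presence/absence data, the phi coefficient equals the Pearson correlation of the two indicator vectors, so we can verify one pair by hand. A minimal sketch (the pair “earth” and “heavens” is just an example, and has_word is our own helper, not part of widyr):

surah_list <- distinct(surah_words_clean, surah_title_en)$surah_title_en
# 1 if the word occurs in a given Surah, 0 otherwise
has_word <- function(w) {
  as.numeric(surah_list %in%
               (surah_words_clean %>% filter(word == w) %>% pull(surah_title_en)))
}
cor(has_word("earth"), has_word("heavens"))  # should match the pairwise_cor() value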

Let’s visualize the network of keyword correlations.

set.seed(1234)
Gquran <- keyword_cors %>%
  filter(correlation > .3) %>%
  graph_from_data_frame()
Gquran
## IGRAPH 34bd498 DN-- 27 280 -- 
## + attr: name (v/c), correlation (e/n)
## + edges from 34bd498 (vertex names):
##  [1] earth     ->punishment   punishment->exalted      earth     ->heavens     
##  [4] people    ->truth        punishment->heavens      heavens   ->exalted     
##  [7] knowing   ->merciful     merciful  ->heavens      truth     ->heavens     
## [10] knowing   ->truth        believed  ->exalted      people    ->heavens     
## [13] people    ->knowing      punishment->truth        righteous ->deeds       
## [16] earth     ->exalted      knowing   ->heavens      knowing   ->punishment  
## [19] merciful  ->exalted      knowing   ->exalted      fear      ->punishment  
## [22] people    ->exalted      punishment->muhammad     truth     ->disbelievers
## + ... omitted several edges
Gquran %>% ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "#a7f2dc") +
  geom_node_point(size = 0.5*igraph::degree(Gquran), colour = "#e384dd") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()+
  labs(title = "Common Keyword Correlations in English Quran",
        subtitle = "Correlation Factor > 0.3",
        caption = "Saheeh International Translation")

Summary

This post showed how the tidy text approach is useful not only for analyzing individual words, but also for exploring the relationships and connections between words. Such relationships can involve n-grams, which let us see which words tend to appear after others, or co-occurrences and correlations, for words that appear in proximity to each other. This post also demonstrated the ggraph package for visualizing both of these types of relationships as networks. These network visualizations are a flexible tool for exploring relational data, and will play an important role in future case studies on #qurananalytics.

Reference

  1. Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly. https://www.tidytextmining.com/tidytext.html
  2. quRan: Complete Text of the Qur’an (R package). https://cran.r-project.org/package=quRan
  3. Pedersen, Thomas Lin. 2017. ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. https://cran.r-project.org/package=ggraph.