1 Introduction

We posted our first report on #qurananalytics using the tidytext and quRan packages. Here we explore the quanteda package.

1.1 Brief On Natural Language Processing and Word Network Analysis

Natural Language Processing (NLP) combines linguistics and data science to analyze large amounts of natural language data, including collections of speeches, text corpora, and other data generated from language use. NLP tasks range from text mining and speech recognition (data-driven) to more complex tasks such as automatic text generation or speech production (AI-driven).

In this article we focus on one particular aspect of NLP, lexical semantic analysis, applied to a chosen English translation of the Quran. This analysis examines individual words in context. Lexical semantics is the study of word meanings, either through the internal semantic structure of a word or through the semantic relations that occur within the corpus as a whole.1 We will take the second approach, studying words in relation to the rest of the words in the complete text, in this case an English translation of the Quran.

We will also take a very specific approach by deploying network analysis (more formally, graph theory). We start by visualizing the words in the text as a network of relations, with words as nodes and their co-presence in a sentence as directed edges. The relations can serve several purposes: discovering the main messages in the text, analytical reasoning such as uncovering the major topics within those messages, and exploratory analysis of how these messages and topics relate to each other within the text.

Another important point is the difference between the parametric and non-parametric approaches to the task at hand. The parametric approach relies on pre-built models, such as sentiment scoring or semantic ontologies, whereas the non-parametric approach does not rely on any model and is instead driven by the empirical nature of the words and the text itself (i.e. it does not rely on samples from outside the text at hand). We will use the second approach, through network analysis and graph theory.

In network analysis, a few questions are useful. One is whether the formation of the network follows a random graph or some other particular graph structure. Another is whether any emergent structure can be observed and, if so, what factors drive the emergent structures and sub-structures. This leads into the subject of complicatedness and complexity in systems analysis.

Given the enormous possibilities and size of the task, this article will focus on providing preliminary findings using basic network analysis and leave the rest for future work.

1.2 R packages, graph software, and data used

To perform the various analyses, we will use two main R packages, quanteda and igraph. quanteda is a complete suite of R packages for text analytics with many ready-made functions that are easy to use 2. igraph is a network (graph) analysis package for R 3. Both packages are well developed and supported within the R community. For the data, we will use the prebuilt text in tidy data format from the quRan package, and we will borrow some utilities from the tidytext package.

For fast computation and visualization of a large network, we will use the open-source software Gephi. Similar tools include Pajek, Cytoscape, and NodeXL. As far as computation is concerned, these tools offer no additional advantage over R; their benefit is easier visual manipulation and production of images. We will rely on Gephi for this purpose while using R as our main engine for computations.

# Install any missing packages, then attach them all
packages <- c('dplyr', 'tidyverse', 'tidytext', 'ggplot2', 'ggraph', 'knitr', 'quRan', 'quanteda')
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

1.3 Focus on Selected Quran version and Variables

The quRan package has 4 versions of the Quran.

  1. quran_ar
  2. quran_ar_min
  3. quran_en_sahih
  4. quran_en_yusufali

We will analyze selected variables (columns) from quran_en_sahih and the stop_words data (from tidytext).

quranES <- quran_en_sahih %>% select(surah_id, 
                                   ayah_id,
                                   surah_title_en, 
                                   surah_title_en_trans, 
                                   revelation_type, 
                                   text,
                                   ayah_title)
data(stop_words)

Now we create tokenized documents, grouped by each verse in the Quran. This approach differs from tidytext in that all tokens are still kept under their own verse (sentence), which is useful for some of the analysis later.

tokensQ = quranES$text %>% 
      tokens(remove_punct = TRUE) %>%
      tokens_tolower() %>%
      tokens_remove(pattern = stop_words$word, padding = FALSE)

2 Top Words Structure

Our analysis begins with the top features (words) of the whole English Quran.

We can quickly see that our word network is centered on “Allah”, and the main emphasis relates to Rukun Iman (Pillars of Faith): believed, believers, disbelieve, disbelievers, Allah (Allah, lord, merciful), the Angels (angels), the Book (scripture, verses, revealed), the Messengers (messengers, Muhammad, Moses), the Hereafter (reward, punishment, mercy). Furthermore, we can see the emphasis on deeds (deeds, worship, prayer, evil, hearts), and creation (earth, heavens, people, children, created, life, day, night, time), and about truth, knowledge, signs. The only unclear subject is Qadha’ and Qadr - whether it appears in the meaning of other words (such as creation) is a bit ambiguous at this stage. These findings are consistent with what is mentioned in a recent book by one of us 4.

When we expand it to a wider range of 100 top words, the relative importance of “allah” becomes more prominent.

If we expand it to 200 top words5, we can see that the word network remains dense at the center (i.e. approximately the top 50 words). Here we have to weight the words (nodes) on a log scale to allow for better visualization.
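The top-word lists behind these views can be obtained with quanteda's dfm() and topfeatures(). The following is a minimal sketch on a toy set of three "verses" (hypothetical data, not the Quran text); applying the same calls to tokensQ is our assumption about how the top words were extracted.

```r
library(quanteda)

# Toy stand-in for the tokenized verses (hypothetical data)
toy_tokens <- tokens(c(v1 = "allah lord mercy",
                       v2 = "allah lord",
                       v3 = "allah"))

# Document-feature matrix: one row per verse, one column per word
toy_dfm <- dfm(toy_tokens)

# Named vector of the n most frequent words, in decreasing order
top_words <- topfeatures(toy_dfm, n = 2)
print(top_words)
```

On the real data, something like topfeatures(dfm(tokensQ), 50) would give the 50 most frequent words.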

3 Network of Word Co-occurrences

The word co-occurrence network is about understanding how each word that appears in the text relates to all other words which appear in the whole text. The connectivities between the words explain the structure of the messages or topics of the texts. An example below is a word co-occurrence network for the novel, Moby Dick (Chapter 1):

Sample word co-occurrence network


Here we can see that the whole text is centered around “sea” and “man”, and sub-grouped by “water”, “ship”, and “voyage”. The colors of the nodes represent sub-groupings (or cliques) whereby such sub-groupings may represent another message or sub-topic by itself.

Now let us use the igraph package to explore various graph-theoretic analytics for understanding the word co-occurrence network of the English Quran.
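The object fcmQ used in the next step is not defined in this post. It is presumably a quanteda feature co-occurrence matrix built from tokensQ. A minimal sketch of how such a matrix can be built, on toy data and with argument choices (context, count, tri) that are our assumptions rather than the settings actually used:

```r
library(quanteda)

# Two toy "verses" (hypothetical data, not the Quran text)
toy_tokens <- tokens(c(v1 = "praise allah lord worlds",
                       v2 = "allah merciful lord"))

# Count how often two words appear in the same verse (document);
# tri = FALSE keeps the full symmetric matrix, not only the upper triangle
toy_fcm <- fcm(toy_tokens, context = "document", count = "frequency", tri = FALSE)

# Convert to a base matrix so that cells can be looked up by word
cooc <- as.matrix(toy_fcm)
print(cooc["allah", "lord"])  # "allah" and "lord" share both verses
```

With the real data, a call of the form fcm(tokensQ, context = "document") would play the same role; the exact options determine how many edges the resulting graph has.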

library(igraph)
igphQ = quanteda::as.igraph(fcmQ)  # this is from quanteda package
igphQ
## IGRAPH 75aa3b5 DN-- 4801 267668 -- 
## + attr: name (v/c), frequency (v/n)
## + edges from 75aa3b5 (vertex names):
##  [1] merciful->allah merciful->allah merciful->allah merciful->allah
##  [5] merciful->allah merciful->allah merciful->allah merciful->allah
##  [9] merciful->allah merciful->allah merciful->allah merciful->allah
## [13] merciful->allah merciful->allah merciful->allah merciful->allah
## [17] merciful->allah merciful->allah merciful->allah merciful->allah
## [21] merciful->allah merciful->allah merciful->allah merciful->allah
## [25] merciful->allah merciful->allah merciful->allah merciful->allah
## [29] merciful->allah merciful->allah merciful->allah merciful->allah
## + ... omitted several edges

First let us get the number of words (nodes) and co-occurrences (edges) in the whole word co-occurrence network (graph).

V(igphQ)
## + 4801/4801 vertices, named, from 75aa3b5:
##    [1] allah             merciful          praise            due              
##    [5] lord              worlds            sovereign         day              
##    [9] recompense        worship           guide             straight         
##   [13] path              bestowed          favor             evoked           
##   [17] anger             astray            alif              lam              
##   [21] meem              book              doubt             guidance         
##   [25] conscious         unseen            establish         prayer           
##   [29] spend             provided          revealed          muhammad         
##   [33] faith             successful        disbelieve        warn             
##   [37] set               seal              hearts            hearing          
## + ... omitted several vertices
E(igphQ)
## + 267668/267668 edges from 75aa3b5 (vertex names):
##  [1] merciful->allah merciful->allah merciful->allah merciful->allah
##  [5] merciful->allah merciful->allah merciful->allah merciful->allah
##  [9] merciful->allah merciful->allah merciful->allah merciful->allah
## [13] merciful->allah merciful->allah merciful->allah merciful->allah
## [17] merciful->allah merciful->allah merciful->allah merciful->allah
## [21] merciful->allah merciful->allah merciful->allah merciful->allah
## [25] merciful->allah merciful->allah merciful->allah merciful->allah
## [29] merciful->allah merciful->allah merciful->allah merciful->allah
## [33] merciful->allah merciful->allah merciful->allah merciful->allah
## [37] merciful->allah merciful->allah merciful->allah merciful->allah
## + ... omitted several edges

There are 4,801 words (nodes). Note that we have removed all the stop-words. There are 267,668 directed relations (directed edges) between the nodes in the network.
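The same counts can be read off directly with igraph's vcount() and ecount() instead of from the printed vertex and edge sequences. A small sketch on a hypothetical four-word graph:

```r
library(igraph)

# Toy directed co-occurrence edges (hypothetical data)
edges <- data.frame(from = c("merciful", "lord", "allah", "people"),
                    to   = c("allah", "allah", "people", "allah"))
toy <- graph_from_data_frame(edges, directed = TRUE)

n_nodes <- vcount(toy)  # number of words (nodes)
n_edges <- ecount(toy)  # number of directed relations (edges)
print(c(nodes = n_nodes, edges = n_edges))
```

Applied to the full graph, vcount(igphQ) and ecount(igphQ) should return the 4,801 and 267,668 reported above.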

To physically view the entire graph is daunting due to the computer memory required. We create an edge list and use Gephi for large network visualization. This step is done below.

# This needs to be evaluated only once; it saves the edge list as a csv file.
elQ = as_edgelist(igphQ)
elQ = as.data.frame(elQ)
elQ = elQ %>% rename("Source" = V1, "Target" = V2 )
write_csv(elQ,"elQ.csv")

Network visualization requires graph layout algorithms such as Force Atlas, Fruchterman-Reingold, Kamada-Kawai, and many others. These algorithms produce two-dimensional visualizations of network structures by treating the network like a “spring” system: connected nodes are pulled together along their edges, while all nodes push each other apart until the layout settles. Depending on the objective of the visualization, different algorithms are used. We will use the force-directed layout algorithm of Fruchterman-Reingold, which combines attractive forces along the edges with repulsive forces between the nodes so that the nodes are spread out systematically to produce readable visual results.6
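For small graphs the same family of layout is available inside R. The sketch below uses igraph's layout_with_fr() on a toy hub-and-spoke graph; the images in this post were produced in Gephi, so this only illustrates the idea and is not the procedure used for the figures.

```r
library(igraph)

set.seed(42)  # the layout starts from random positions

# Toy hub-and-spoke graph: one hub connected to five spokes
toy <- make_star(6, mode = "undirected")

# Fruchterman-Reingold layout: one (x, y) coordinate pair per node
coords <- layout_with_fr(toy)
print(dim(coords))
```

The coordinates can then be passed to plot(toy, layout = coords).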

4 General network layout

Here we present the general layout based on the Fruchterman-Reingold algorithm. The dots represent the nodes (one for each of the 4,801 words), and the lines (of various colors) represent the edges (or relations) between the words (nodes). The size and gradient intensity of each node are scaled by the number of edges (relations) it has. We can see visually that it is a very dense and highly “connected” network.

Quran English word co-occurrence network


A snapshot zoomed to the center of the network is shown below, where we can observe the appearance of extremely large nodes (words) which are densely related to almost all other nodes (words) in the network.

Close up view of the center of the network


At the center of the network are the main words - Allah, Lord, people, Muhammad, etc., which are similar to the network shown before for the top 50, 100, and 200 words.

A clear emergent structure is visible in the network: the main subject matter is Allah (SWT), surrounded by other subject matters: the Lord, Prophet Muhammad (SAW), the people, and other related subject matters. Furthermore, we observe what is called a hub-and-spoke structure, whereby a few nodes (words) serve as hubs. Evidently a single word (kalimah), Allah (SWT), is the main hub, and a few other words serve as supporting hubs. We will explore this later.

4.1 General structure of the network

Several structural properties of the network will be of interest, namely: diameter, paths and distances, connectedness, clustering, and modularity measures. These are the subjects we turn to next.

Diameter and average distance and maximum distance

The network diameter is 6, which means that any word can be reached from any other connected word in at most 6 steps, and the average distance is 2.55. These properties strongly indicate how closely related the words in the Quran are and how concise the sentences in the text are. As a comparison, the average distance for the internet is 6.98, and its diameter is 26. The closest network in nature which exhibits similar measures is the E. coli metabolism network, which has a diameter of 8 and an average distance of 2.98. Low diameters and average distances are good measures of “efficiency” in a large network. In the case of the English Quran, the measures suggest that the words in the text are used with great efficiency.
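Both quantities are available in igraph as diameter() and mean_distance(). The post does not say whether the figures above were computed in R or in Gephi, so the following is only a sketch of the R route on a toy ring of six nodes.

```r
library(igraph)

# Toy undirected ring of six nodes
toy <- make_ring(6)

# Longest shortest path between any two nodes
toy_diameter <- diameter(toy, directed = FALSE)

# Average shortest-path length over all connected pairs
toy_avg_dist <- mean_distance(toy, directed = FALSE)

print(c(diameter = toy_diameter, average_distance = toy_avg_dist))
```

In a ring of six nodes the farthest node is 3 steps away, and the distances from any node are 1, 1, 2, 2, 3, which average to 1.8.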

Connectedness

Now we measure the connectedness of the network:

comp_size = components(igphQ)
comp_size$no
## [1] 9
comp_size$csize
## [1] 4783    2    2    2    3    2    3    2    2

The result shows that there are 9 components. The single largest component (the giant component) consists of 4,783 nodes, which is about 99.6% of the 4,801 nodes; the other 8 components hold only 2 or 3 words each (18 words in total). The network is therefore essentially a single giant component. No word is isolated, since every word is related to at least one other word, and within the giant component every word has a relation (directly or indirectly) with every other word! 7
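The share of the giant component follows directly from the component sizes printed above. A short check in base R, using only those printed numbers:

```r
# Component sizes as printed by comp_size$csize above
csize <- c(4783, 2, 2, 2, 3, 2, 3, 2, 2)

total_nodes <- sum(csize)                # all words in the network
giant_share <- max(csize) / total_nodes  # fraction in the largest component
outside     <- total_nodes - max(csize)  # words outside the giant component

print(round(100 * giant_share, 1))  # percentage of words in the giant component
print(outside)
```

The sizes add up to the 4,801 nodes of the graph, with 18 words sitting outside the giant component.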

Clustering coefficients

There are many ways to compute clustering coefficients in a network; we will use the igraph function transitivity. The measure for our network is 0.11. This is the probability that two neighbors of a given node (word) are themselves connected. The number obtained here is very high. For most other real networks, the probabilities are extremely small; for example, the internet (0), World Wide Web (0), and E. coli metabolism (0.0054). This finding indicates how “dense” the words are in the text, and that no word is left without relations to other words.
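To make the definition concrete, here is a small sketch of igraph's transitivity() on a toy graph: a triangle with one extra node hanging off it. The graph is hypothetical and chosen so that the value can be checked by hand.

```r
library(igraph)

# Triangle a-b-c plus a pendant node d attached to c
toy <- graph_from_literal(a - b, b - c, c - a, c - d)

# Global transitivity = 3 * (number of triangles) / (number of connected triples)
toy_trans <- transitivity(toy, type = "global")
print(toy_trans)
```

Here there is 1 triangle and 5 connected triples (one centered on a, one on b, three on c), so the coefficient is 3 × 1 / 5 = 0.6.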

Modularity

Modularity-based algorithms are used to find community structures, i.e. groupings of nodes and edges, in large networks. In igraph, one way to do this is the cluster_walktrap function. However, this approach has some shortcomings: it relies on random walks to find communities, and it is mainly used on undirected graphs. We therefore rely on the “modularity class” function of Gephi for the calculations.

Modularity class


It is interesting to note that there are seven major modularity classes with 300 or more members each, the largest having about 1,600 members. The remaining classes have fewer than ten members each and can be ignored. The percentage of nodes within each class is as follows: 33.63% (one-third of the nodes), 17.87%, 15.68%, 9.58%, 8.33%, 7.42%, and 6.33% (from the first to the seventh).
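For readers who want to stay in R, the cluster_walktrap function mentioned earlier can be tried on a small undirected graph. This is a sketch on hypothetical data (two triangles joined by a single edge) and is not the Gephi computation that produced the classes reported here.

```r
library(igraph)

# Two triangles (a-b-c and d-e-f) joined by the bridge c-d
toy <- graph_from_literal(a - b, b - c, c - a,
                          d - e, e - f, f - d,
                          c - d)

# Community detection based on short random walks
wc <- cluster_walktrap(toy)

print(membership(wc))  # community label for each node
print(sizes(wc))       # number of nodes per community
print(modularity(wc))  # modularity score of the chosen partition
```

The share of nodes per class, as quoted above, would then be sizes(wc) / vcount(toy).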

Let us have a view of the total picture of the modularity classes within the network.

Now let us check the structure of each of the various sub-groups.

The largest group is centered on the word “Allah” and is similar to the main network that we saw earlier.

The second largest group does not have any clear word at its center and is distributed across many words.

The third largest group likewise has no clear word at its center and is distributed across many words. Further inspection reveals that the other groupings look quite similar.

These groupings might reveal many other interpretations if we dive deeper into the various sub-networks of word co-occurrences to bring out the messages or topics of interest. For example, we can compile all the words within a clique and perform sentiment analysis on that subset of words. We can also weight the sentiment scores by the position of the word within the sub-network, etc. We will leave this for future work.

Degree distributions

With 267,668 edges among 4,801 nodes, the average (total) degree of the network is 2 × 267,668 / 4,801 ≈ 111.5. A plot of the degree distribution below shows that a very small number of words have a very high degree, whilst a very large number of words have only a small number of edges.
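The average degree can be checked against the node and edge counts, since in any graph the mean total degree equals 2E/N. The sketch below shows this on a toy star graph and then with the counts reported in this post. It also shows that averaging the output of degree_distribution() gives 1/(maximum degree + 1), which for a maximum degree of 37,132 is about 2.69 × 10^{-5}; that is a different quantity from the average degree.

```r
library(igraph)

# Toy star: one hub with four spokes (5 nodes, 4 edges)
toy <- make_star(5, mode = "undirected")

avg_degree  <- mean(degree(toy))                # mean total degree
from_counts <- 2 * ecount(toy) / vcount(toy)    # same value via 2E/N

# Relative frequency of degrees 0, 1, ..., max; its mean is 1 / (max degree + 1)
dist_mean <- mean(degree_distribution(toy))

# Same formula with the counts reported for the Quran network
quran_avg_degree <- 2 * 267668 / 4801

print(c(avg_degree, from_counts, dist_mean))
print(round(quran_avg_degree, 1))
```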

Let us observe the words with high degree values.

degQ = degree(igphQ, v = V(igphQ), mode = "total", loops = TRUE, normalized = FALSE)
top_degree = degQ[rev(order(degQ))]
top_degree[1:30]
##        allah         lord       people          day        earth   punishment 
##        37132        10668         8796         5484         5272         3794 
##         fear    messenger     believed     muhammad        truth      heavens 
##         3510         3242         3026         2912         2802         2612 
##      knowing    believers         life disbelievers         fire        mercy 
##         2190         2190         2048         1962         1912         1898 
##       surely        moses        wills     revealed       hearts    righteous 
##         1882         1872         1804         1770         1756         1746 
##        signs        deeds       verses     merciful         evil   messengers 
##         1742         1736         1720         1684         1662         1622

A quick observation confirms that most of these words are the same as the top-feature words from the earlier section. One interesting observation is that the word “moses” has a high degree compared to, say, “abraham”. The degree is a measure of the “prestige” of a word within the whole text.

Betweenness

Betweenness measures the relative importance of words in connecting other words as the word in between.

btwnQ = betweenness(igphQ, v = V(igphQ), directed = TRUE, weights = NULL, nobigint = TRUE, normalized = FALSE)
top_btwn = btwnQ[rev(order(btwnQ))]
top_btwn[1:30]
##        allah         lord       people          day        earth   punishment 
##   12393253.6    2372992.3    1914394.2    1020178.4     857687.5     398517.5 
##         fear         fire     believed     muhammad      created         evil 
##     301924.0     263718.0     261461.9     261025.5     232926.7     218509.2 
## disbelievers        truth    messenger        women        moses        bring 
##     215326.0     191211.2     185947.3     182885.9     173676.7     152897.7 
##       hearts       surely         life         land        signs       qur'an 
##     148511.3     144949.8     144215.4     142438.9     133450.6     132977.6 
##         time      gardens    believers        night   disbelieve     merciful 
##     124821.6     124602.2     124262.2     117464.4     117338.4     117254.3

A comparison with the top-degree words reveals some interesting observations: in the case of betweenness, some words such as “evil”, “women” and “qur’an” rank higher than they do by degree.
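To see what betweenness captures, consider a toy hub-and-spoke graph, where every shortest path between two spokes must pass through the hub. This is a sketch on hypothetical data, not on the Quran network.

```r
library(igraph)

# Toy star: node 1 is the hub, nodes 2 to 5 are spokes
toy <- make_star(5, mode = "undirected")

# Number of shortest paths between other node pairs that pass through each node
toy_btwn <- betweenness(toy, directed = FALSE)
print(toy_btwn)
```

The hub lies on the shortest path of all 6 spoke pairs, while each spoke lies on none, which mirrors how a word like “allah” connects many otherwise distant words.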

Centrality

Centrality measures describe how central a word's position is in the whole text. There are many ways to measure centrality; one common choice is eigenvector centrality. This is done below:

evcentQ = eigen_centrality(igphQ)
top_evcent = evcentQ$vector
top_evcent = top_evcent[rev(order(top_evcent))]
top_evcent[1:30]
##        allah       people         lord        earth         fear    messenger 
##   1.00000000   0.43968293   0.41559959   0.31946184   0.28629404   0.27017213 
##          day      heavens      knowing     believed   punishment        truth 
##   0.25094688   0.22366509   0.20385360   0.19517225   0.18111264   0.17724441 
##     muhammad    believers        wills     merciful      exalted        mercy 
##   0.15861383   0.15488163   0.15316060   0.12926644   0.12379714   0.11810613 
##      worship      belongs disbelievers     revealed    forgiving         wise 
##   0.11752424   0.11637912   0.11282551   0.11011288   0.10873771   0.10595218 
##       reward     religion         life       verses       surely       hearts 
##   0.10514239   0.10454492   0.10389173   0.10163132   0.10032456   0.09896275
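Eigenvector centrality gives a node a high score when it is connected to other high-scoring nodes, and igraph rescales the scores so that the maximum is 1. A sketch on a toy hub-and-spoke graph (hypothetical data):

```r
library(igraph)

# Toy star: node 1 is the hub, nodes 2 to 5 are spokes
toy <- make_star(5, mode = "undirected")

# Scores are scaled so that the most central node has value 1
toy_evc <- eigen_centrality(toy)$vector
print(round(toy_evc, 3))
```

The hub scores 1 and each of the four spokes scores 0.5, which parallels “allah” having the value 1 in the output above.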

Summary

Summarizing the three statistical properties of the nodes, we can say that all the important words (top features) have high degree, betweenness, and eigenvector centrality within the network. The consistency of these measures across the top features is very interesting: the first word, Allah, is the topmost in all cases (the highest degree/prestige, the highest betweenness, and the highest centrality).


5 Deep Dive Into Selected Surahs

The methods introduced earlier can be applied to a single Surah (chapter of the Quran) as well as to comparisons between Surahs. Let us choose two intermediate-length Surahs, namely Al-Kahf (No. 18, 110 verses) and Maryam (No. 19, 98 verses). There are a few approaches we could take:

  1. understanding the total network of the texts (tokens) in the Surah;
  2. understanding the “topics” of the Surah.

Surah Al-Kahf:

Surah Maryam:

Let us look at comparing key words in Surah Al-Kahf versus Maryam by plotting the keyness of the main words.

We can see that in Surah Al-Kahf the key word is “found”, which relates to the cave-dwellers, al-khidh (Khidir), and Dhul-Qarnayn, while in Surah Maryam the key message is the “merciful” attribute of Allah, and of course Maryam (Mary) and her son, Isa (AS).

Another approach to comparison is the key words in context (kwic) lexical dispersion plot, which is shown below. Note that “text1” to “text110” are from Surah Al-Kahf and the rest (“text111” to “text208”) are from Surah Maryam.

First let us apply it to the word “Allah”.

And to the word “people”.

The plots display where the selected key word appears across the verses (“texts”) and how often. Lexical dispersion demonstrates the richness of emphasis of the whole document in regard to the message, via the frequency of occurrence relative to the sentences (verses) within the document.
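The key-words-in-context table that underlies such a dispersion plot can be produced with quanteda's kwic(). The sketch below uses three made-up "verses" (hypothetical data); the plotting step itself, which we assume is done with a helper such as textplot_xray from the quanteda.textplots package, is left out.

```r
library(quanteda)

# Three toy "verses" (hypothetical data, not the Quran text)
toy_tokens <- tokens(c(text1 = "allah is forgiving and merciful",
                       text2 = "the people of the cave",
                       text3 = "praise be to allah"))

# Every occurrence of the key word with two words of context on each side
hits <- kwic(toy_tokens, pattern = "allah", window = 2)
print(hits)

# Which verses contain the key word
hit_docs <- as.character(hits$docname)
print(hit_docs)
```

Each row records the verse and the token position of a match, which is exactly the information a dispersion plot draws.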

There are many other methods that can be applied in NLP studies such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis, and numerous other methods of machine learning and probabilistic models. All these methods reveal different meanings and purposes depending on the interest of the task.


6 Closing remarks

We have demonstrated the use and power of NLP tools using network analysis (or graph theory). Important observations that we must take cognizance of are:

  1. the emergent structures of the network could reveal much interesting meaning;
  2. many properties (general and statistical) of the network provide lots of insights into the formation of the network - which is an important subject by itself;
  3. the general findings provided here could lead to many other paths of analysis.

Based on the previous posts, we have covered some tools for text analysis in the form of R packages, notably tidytext, udpipe, and quanteda. In all the posts, #networkscience tools like igraph, tidygraph and ggraph are important for our #qurananalytics project. As such, we have also posted a simple tutorial on these tools.

We will post future work focused on specific Surahs of the Quran. Based on the post on Surah Yusuf, we find that it is easier to comment on the subject matters and topics of individual Surahs.


7 References


  1. Lexical semantics, Oxford Research Encyclopedias, https://oxfordre.com/linguistics/view/10.1093/acrefore/9780199384655.001.0001/acrefore-9780199384655-e-29

  2. Quanteda reference, https://quanteda.io/index.html

  3. iGraph reference, https://igraph.org/r/

  4. Tareq Alsuwaidan and Azman Hussin, Islam Simplified - A Holistic View of the Quran, to be published.

  5. We limit the network to 200 words due to processing time on our own computer.

  6. Fruchterman-Reingold algorithm, https://github.com/gephi/gephi/wiki/Fruchterman-Reingold

  7. This is a very important phenomenon that requires deeper interpretation. Just imagine the webpages of the Internet, with every page directly or indirectly connected to every other page on the network! We know that this is not true in the case of the Internet.