Introduction

Earlier we posted our first report on #qurananalytics using the tidytext and quRan packages. We continue to explore the tidytext features applied to the English translation of the Quran following the examples used in Text Mining with R - A Tidy Approach, by Julia Silge and David Robinson.

This article examines word and document frequencies: tf-idf. The utility of statistical word analysis in natural language processing (NLP) and text analytics, which we apply in this post, is nicely summarized in the visual below. (Coursera: Text Mining and Analytics by Atsushi Takayama is a good, robust introduction to the subject.)

Takayama commented that word-based representation of documents is general and robust, requires little or no manual effort, and is “surprisingly” powerful.


Preliminaries

Load Packages and Libraries

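# Packages used in this post (tidyverse already bundles dplyr and ggplot2)
packages <- c('dplyr', 'tidyverse', 'tidytext', 'ggplot2', 'ggraph', 'knitr', 'quRan')
for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}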

Focus on Selected Quran version and Variables

The quRan package has 4 versions of the Quran.

  1. quran_ar
  2. quran_ar_min
  3. quran_en_sahih
  4. quran_en_yusufali

We will analyze selected variables (columns) from quran_en_sahih.

quranES <- quran_en_sahih %>% select(surah_id, 
                                   ayah_id,
                                   surah_title_en, 
                                   surah_title_en_trans, 
                                   revelation_type, 
                                   text,
                                   ayah_title)
quranES
surah_id  ayah_id  surah_title_en  surah_title_en_trans  revelation_type
<int>     <int>    <fctr>          <fctr>                <chr>
1         1        Al-Faatiha      The Opening           Meccan
1         2        Al-Faatiha      The Opening           Meccan
1         3        Al-Faatiha      The Opening           Meccan
1         4        Al-Faatiha      The Opening           Meccan
1         5        Al-Faatiha      The Opening           Meccan
1         6        Al-Faatiha      The Opening           Meccan
1         7        Al-Faatiha      The Opening           Meccan
2         8        Al-Baqara       The Cow               Medinan
2         9        Al-Baqara       The Cow               Medinan
2         10       Al-Baqara       The Cow               Medinan

To work with this as a tidy dataset, we need to restructure it in the one-token-per-row format. This is done with the unnest_tokens() function.

tidyES <- quranES %>%
  unnest_tokens(word, text)
tidyES
surah_id  ayah_id  surah_title_en  surah_title_en_trans  revelation_type  ayah_title
<int>     <int>    <fctr>          <fctr>                <chr>            <chr>
1         1        Al-Faatiha      The Opening           Meccan           1:1
1         1        Al-Faatiha      The Opening           Meccan           1:1
1         1        Al-Faatiha      The Opening           Meccan           1:1
1         1        Al-Faatiha      The Opening           Meccan           1:1
1         1        Al-Faatiha      The Opening           Meccan           1:1
1         1        Al-Faatiha      The Opening           Meccan           1:1
1         1        Al-Faatiha      The Opening           Meccan           1:1
1         1        Al-Faatiha      The Opening           Meccan           1:1
1         1        Al-Faatiha      The Opening           Meccan           1:1
1         1        Al-Faatiha      The Opening           Meccan           1:1

This function uses the tokenizers package to separate each line of text in the original data frame into tokens. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.

Now that the data is in one-word-per-row format, we can manipulate it with tidy tools like dplyr.
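As a minimal sketch of the tokenizing options and of the dplyr step, the example below uses a small made-up data frame (the `toy` object and its two lines of text are illustrative assumptions, not part of quranES). It produces single words with the default tokenizer, bigrams with token = "ngrams", and then counts words with an ordinary dplyr verb.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-row data frame standing in for quranES
toy <- tibble(
  id   = 1:2,
  text = c("guidance for the righteous",
           "mercy and guidance for the believers")
)

# Default tokenizer: one lowercase word per row
toy_words <- toy %>%
  unnest_tokens(word, text)

# token = "ngrams" with n = 2: one bigram per row
toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Once tokenized, ordinary dplyr verbs apply
toy_counts <- toy_words %>%
  count(word, sort = TRUE)
toy_counts
```

The first argument after the data is the name of the new token column and the second is the column holding the raw text; the input column is dropped from the result by default.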


Analyzing Word and Document Frequency: tf-idf

A central question in text mining and natural language processing (NLP) is how to quantify what a document is about. We can do this by looking at the words that make up the document. One measure of how important a word may be is its term frequency (tf), how frequently a word occurs in a document. There are words in a document, however, that occur many times but may not be important; in English, these are probably words like “the”, “is”, “of”, and so forth. We might take the approach of adding words like these to a list of stop words and removing them before analysis, but it is possible that some of these words might be more important in some documents than others. A list of stop words is not a very sophisticated approach to adjusting term frequency for commonly used words.

Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf (the two quantities multiplied together), the frequency of a term adjusted for how rarely it is used.

The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites. In the case of the Quran, the documents being compared can be Surahs, Juz, or Hizb.

The tf-idf is a rule-of-thumb or heuristic quantity; while it has proved useful in text mining, search engines, etc., its theoretical foundations are considered less than firm by information theory experts. The inverse document frequency for any given term is defined as

idf(term) = ln[(number of documents)/(number of documents containing term)]
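To make the definition concrete, here is a small worked sketch in base R. The three toy "documents" and the helper names `tf` and `idf` are made up for illustration; they are not part of tidytext or of the analysis in this post.

```r
# Toy corpus: three made-up "documents" of already-tokenized words
docs <- list(
  d1 = c("the", "light", "the", "path"),
  d2 = c("the", "mercy", "light"),
  d3 = c("the", "garden")
)

n_docs <- length(docs)

# Term frequency within one document: count of the term / total words
tf <- function(term, doc) sum(doc == term) / length(doc)

# Inverse document frequency:
# ln(number of documents / number of documents containing the term)
idf <- function(term) {
  n_containing <- sum(vapply(docs, function(d) term %in% d, logical(1)))
  log(n_docs / n_containing)
}

idf("the")     # in all 3 documents: ln(3/3) = 0
idf("light")   # in 2 of 3 documents: ln(3/2)
idf("garden")  # in 1 of 3 documents: ln(3)

# tf-idf is the product of the two quantities
garden_tf_idf <- tf("garden", docs$d3) * idf("garden")   # (1/2) * ln(3)
garden_tf_idf
```

A word found in every document gets an idf of zero, so its tf-idf is zero however often it occurs; a word confined to one document gets the largest possible idf, ln(number of documents).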

We can use tidy data principles to approach tf-idf analysis and use consistent, effective tools to quantify how important various terms are in a document that is part of a collection.

Term frequency in English Quran

We start by looking at the Surahs of the Quran and examine first term frequency, and then tf-idf. We can start just by using dplyr verbs such as group_by() and left_join(). What are the most commonly used words in the Surahs? We also calculate the total words in each Surah here, for later use.

surah_words <- quranES %>%
  unnest_tokens(word, text) %>%
  count(surah_title_en, word, sort = TRUE)
surah_words
surah_title_en  word  n
<fctr>          <chr> <int>
Al-Baqara       and   762
Al-Baqara       the   523
An-Nisaa        and   460
Al-Baqara       you   425
Aal-i-Imraan    and   410
Aal-i-Imraan    the   389
Al-A'raaf       and   365
Al-An'aam       and   350
Al-Baqara       of    323
Al-Maaida       and   322

total_words <- surah_words %>% 
  group_by(surah_title_en) %>% 
  summarize(total = sum(n))

surah_words <- left_join(surah_words, total_words, by = "surah_title_en")

surah_words
surah_title_en  word  n      total
<fctr>          <chr> <int>  <int>
Al-Baqara       and   762    12337
Al-Baqara       the   523    12337
An-Nisaa        and   460    7355
Al-Baqara       you   425    12337
Aal-i-Imraan    and   410    7083
Aal-i-Imraan    the   389    7083
Al-A'raaf       and   365    6798
Al-An'aam       and   350    6228
Al-Baqara       of    323    12337
Al-Maaida       and   322    5602

There is one row in this surah_words data frame for each word-surah combination; n is the number of times that word is used in that surah and total is the total number of words in that surah. The usual suspects have the highest n: “and”, “the”, “of”, and so forth. In the following figures, we look at the distribution of n/total for six selected long surahs (each with 120 or more verses), that is, the number of times a word appears in a surah divided by the total number of terms (words) in that surah. This is exactly what term frequency is.

# Histogram of term frequency (n/total) for six long Surahs
surah_words %>%
  filter(surah_title_en %in%
           c("Al-Baqara", "Aal-i-Imraan", "An-Nisaa", "Al-Maaida", "Al-An'aam", "Al-A'raaf")) %>%
  ggplot(aes(n/total, fill = surah_title_en)) +
  geom_histogram(show.legend = TRUE) +
  xlim(NA, 0.01) +
  facet_wrap(~surah_title_en, ncol = 2, scales = "free_y") +
  labs(title = "Term frequency distribution of the 6 Long Surahs",
       x = "(Word Frequency)/(Total Words)",
       y = "Count")

There are long tails to the right for these selected surahs (the few extremely common words, such as “and” and “the”, whose term frequency exceeds 0.01) that we have not shown in these plots.

We repeat the same with the Hamim Surahs, medium-length Surahs of between 4 and 10 pages each.

# Histogram of term frequency (n/total) for six medium-length Surahs
surah_words %>%
  filter(surah_title_en %in%
           c("Ghafir", "Fussilat", "Ash-Shura", "Az-Zukhruf", "Ad-Dukhaan", "Al-Jaathiya")) %>%
  ggplot(aes(n/total, fill = surah_title_en)) +
  geom_histogram(show.legend = TRUE) +
  xlim(NA, 0.01) +
  facet_wrap(~surah_title_en, ncol = 2, scales = "free_y") +
  labs(title = "Term frequency distribution of 6 Medium Surahs",
       x = "(Word Frequency)/(Total Words)",
       y = "Count")

These plots exhibit similar distributions for all the surahs, with many words that occur rarely and fewer words that occur frequently.


The bind_tf_idf function

The idea of tf-idf is to find the important words for the content of each Surah by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents, in this case, the group of Quran Surahs as a whole. Calculating tf-idf attempts to find the words that are important (i.e., common) in a Surah, but not too common. Let’s do that now.

The bind_tf_idf function in the tidytext package takes a tidy text dataset as input with one row per token (term), per document. One column (word here) contains the terms/tokens, one column contains the documents (Surah in this case), and the last necessary column contains the counts, how many times each document contains each term (n in this example). We calculated a total for each Surah for our explorations in previous sections, but it is not necessary for the bind_tf_idf function; the table only needs to contain all the words in each document.
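As a toy illustration of the expected input shape, the sketch below calls bind_tf_idf on a hand-made table of counts. The `toy_counts` object and its document and term values are made up for this example; only the function call itself mirrors what we do with the Surahs.

```r
library(dplyr)
library(tidytext)

# Hypothetical counts: one row per (document, term) pair
toy_counts <- tibble(
  doc  = c("d1", "d1", "d2", "d2"),
  term = c("the", "light", "the", "garden"),
  n    = c(2, 1, 1, 1)
)

# Arguments after the data: term column, document column, count column
toy_tf_idf <- toy_counts %>%
  bind_tf_idf(term, doc, n)

toy_tf_idf
```

The result keeps the input columns and adds tf, idf, and tf_idf. Here “the” occurs in both toy documents, so its idf and tf_idf are zero, while “garden” occurs in one of the two documents and gets an idf of ln(2). Applied to the Surah word counts, the call looks like this: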

surah_words <- surah_words %>%
  bind_tf_idf(word, surah_title_en, n)

surah_words
surah_title_en  word  n      total  tf            idf         tf_idf
<fctr>          <chr> <int>  <int>  <dbl>         <dbl>       <dbl>
Al-Baqara       and   762    12337  0.0617654211  0.01769958  0.0010932218
Al-Baqara       the   523    12337  0.0423928021  0.02666825  0.0011305417
An-Nisaa        and   460    7355   0.0625424881  0.01769958  0.0011069756
Al-Baqara       you   425    12337  0.0344492178  0.09180755  0.0031626983
Aal-i-Imraan    and   410    7083   0.0578850769  0.01769958  0.0010245414
Aal-i-Imraan    the   389    7083   0.0549202315  0.02666825  0.0014646263
Al-A'raaf       and   365    6798   0.0536922624  0.01769958  0.0009503303
Al-An'aam       and   350    6228   0.0561978163  0.01769958  0.0009946776
Al-Baqara       of    323    12337  0.0261814055  0.05406722  0.0014155558
Al-Maaida       and   322    5602   0.0574794716  0.01769958  0.0010173623

Notice that idf, and thus tf-idf, is very close to zero for these extremely common words. These words appear in almost all of the 114 Quran Surahs; for example, the idf of 0.0177 for “and” is ln(114/112), meaning it occurs in 112 Surahs. A word that appeared in all 114 Surahs would have an idf of exactly zero (the natural log of 1). The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight for common words. The inverse document frequency will be a higher number for words that occur in fewer of the documents in the collection.

Let’s look at terms with high tf-idf in Quran Surahs.

surah_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))
surah_title_en  word        n      tf            idf         tf_idf
<fctr>          <chr>       <int>  <dbl>         <dbl>       <dbl>
Al-Asr          advised     2      0.0689655172  3.63758616  0.250868011
Al-Ikhlaas      begets      1      0.0416666667  4.73619845  0.197341602
Quraish         saving      2      0.0487804878  4.04305127  0.197222013
Quraish         accustomed  2      0.0487804878  3.63758616  0.177443227
An-Naas         mankind     5      0.1315789474  1.33500107  0.175658035
Al-Kawthar      kawthar     1      0.0370370370  4.73619845  0.175414757
Al-Humaza       crusher     2      0.0303030303  4.73619845  0.143521165
Al-Ikhlaas      eternal     1      0.0416666667  3.12676054  0.130281689
An-Naas         whisperer   1      0.0263157895  4.73619845  0.124636801
Quraish         security    2      0.0487804878  2.53897387  0.123852384

Some of the values for idf are the same for different terms because there are 114 Surahs and we are seeing the numerical value for ln(114/1), ln(114/2) etc.
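The repeated idf values can be reproduced directly in base R. The sketch below computes ln(114/k) for a few illustrative values of k (the number of Surahs containing a word) and, going the other way, recovers k from an idf value shown in the table above. The variable names are our own.

```r
n_surahs <- 114

# idf for a word found in k Surahs, for a few values of k
k <- c(1, 2, 3, 5, 9, 30)
idf_values <- log(n_surahs / k)
round(idf_values, 6)

# Going the other way: how many Surahs contain a word with a given idf?
k_advised <- round(n_surahs / exp(3.63758616))   # idf of "advised"
k_advised
```

Read this way, an idf of 4.736 marks a word that occurs in a single Surah, while the 3.638 for “advised” corresponds to a word found in three Surahs.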

Let’s look at a visualization for these high tf-idf words. We focus on selected long Surahs.

surah_words %>%
  filter(surah_title_en %in% 
            c("Al-Baqara", "Aal-i-Imraan", "An-Nisaa", "Al-Maaida", "Al-An'aam", "Al-A'raaf")) %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(surah_title_en) %>% 
  top_n(10, tf_idf) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = surah_title_en)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~surah_title_en, ncol = 2, scales = "free") +
  coord_flip()

We repeat this with the last six Surahs.

surah_words %>%
  filter(surah_title_en %in% 
            c("An-Naas", "Al-Falaq", "Al-Ikhlaas", "Al-Masad", "An-Nasr", "Al-Kaafiroon")) %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(surah_title_en) %>% 
  top_n(10, tf_idf) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = surah_title_en)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~surah_title_en, ncol = 2, scales = "free") +
  coord_flip()

The top words in the two figures are not only proper nouns; for the short Surahs they include ordinary words such as “mankind”, “whisperer”, and “begets”. These words are, as measured by tf-idf, the most important to each Surah and most readers would likely agree. This is the point of tf-idf; it identifies words that are important to one document within a collection of documents.


Summary

Using term frequency (tf) and inverse document frequency (idf) allows us to find words that are characteristic of one document within a collection of documents. Exploring term frequency on its own can give us insight into how language is used in a collection of natural language text, and functions like count() and rank() give us tools to reason about term frequency. The tidytext package uses an implementation of tf-idf consistent with tidy data principles that enables us to see how different words are important in documents within a collection or corpus of documents.

The proper noun “Allah” ranks very high on almost all the statistics of the English Quran. This confirms that “Allah” is the central and most important subject matter of the Quran, a topic that I will cover in an upcoming book [4].

Numerical analysis of words from the Quran is a good and “easy” start to #qurananalytics. It is general and robust, requires little or no manual effort, and is “surprisingly” powerful.

References

  1. Silge, J. and Robinson, D., Text Mining with R - A Tidy Approach, https://www.tidytextmining.com/tidytext.html
  2. quRan R package, library(quRan)
  3. Takayama, A., Coursera: Text Mining and Analytics, https://medium.com/@taka.atsushi/coursera-text-mining-and-analytics-bf314d7e130e
  4. Alsuwaidan, T. and Hussin, A., Islam Simplified - A Holistic View of the Quran (to be published)