Introduction
Preliminaries
- Load Packages and Libraries
- Focus on Selected Quran version and Variables
Analyzing Word and Document Frequency: tf-idf
- Term frequency in English Quran
- The bind_tf_idf function
Summary
Reference

Introduction

Earlier we posted our first report on #qurananalytics using the tidytext and quRan packages. We continue to explore the tidytext features applied to the English translation of the Quran following the examples used in Text Mining with R - A Tidy Approach, by Julia Silge and David Robinson.

This article examines word and document frequencies using tf-idf. The utility of statistical word analysis in natural language processing (NLP) and text analytics, which we apply in this post, is nicely summarized in the visual below. (Coursera: Text Mining and Analytics by Atsushi Takayama is a good, robust introduction to the subject.)

Takayama commented that word-based representation of documents is general and robust, requires little or no manual effort, and is “surprisingly” powerful.

Preliminaries

Load Packages and Libraries

packages <- c('dplyr', 'tidyverse', 'tidytext', 'ggplot2', 'ggraph', 'knitr', 'quRan')
for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

Focus on Selected Quran version and Variables

The quRan package provides four versions of the Quran:

quran_ar
quran_ar_min
quran_en_sahih
quran_en_yusufali

We will analyze selected variables (columns) from quran_en_sahih.

quranES <- quran_en_sahih %>% select(surah_id, 
                                   ayah_id,
                                   surah_title_en, 
                                   surah_title_en_trans, 
                                   revelation_type, 
                                   text,
                                   ayah_title)
quranES

surah_id <int>	ayah_id <int>	surah_title_en <fctr>	surah_title_en_trans <fctr>	revelation_type <chr>
1	1	Al-Faatiha	The Opening	Meccan
1	2	Al-Faatiha	The Opening	Meccan
1	3	Al-Faatiha	The Opening	Meccan
1	4	Al-Faatiha	The Opening	Meccan
1	5	Al-Faatiha	The Opening	Meccan
1	6	Al-Faatiha	The Opening	Meccan
1	7	Al-Faatiha	The Opening	Meccan
2	8	Al-Baqara	The Cow	Medinan
2	9	Al-Baqara	The Cow	Medinan
2	10	Al-Baqara	The Cow	Medinan

To work with this as a tidy dataset, we need to restructure it in the one-token-per-row format. This is done with the unnest_tokens() function.

tidyES <- quranES %>%
  unnest_tokens(word, text)
tidyES

surah_id <int>	ayah_id <int>	surah_title_en <fctr>	surah_title_en_trans <fctr>	revelation_type <chr>	ayah_title <chr>
1	1	Al-Faatiha	The Opening	Meccan	1:1
1	1	Al-Faatiha	The Opening	Meccan	1:1
1	1	Al-Faatiha	The Opening	Meccan	1:1
1	1	Al-Faatiha	The Opening	Meccan	1:1
1	1	Al-Faatiha	The Opening	Meccan	1:1
1	1	Al-Faatiha	The Opening	Meccan	1:1
1	1	Al-Faatiha	The Opening	Meccan	1:1
1	1	Al-Faatiha	The Opening	Meccan	1:1
1	1	Al-Faatiha	The Opening	Meccan	1:1
1	1	Al-Faatiha	The Opening	Meccan	1:1

This function uses the tokenizers package to separate each line of text in the original data frame into tokens. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.
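As a small illustration of the other tokenizing options, the sketch below splits a made-up sentence (not a Quran verse) into bigrams with the token = "ngrams" argument of unnest_tokens(). The data frame toy and the column name bigram are hypothetical names chosen only for this example.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-row data frame; the sentence is made up for illustration
toy <- tibble(id = 1, text = "The quick brown fox")

# token = "ngrams" with n = 2 yields overlapping two-word tokens,
# lowercased by default just like single-word tokens
toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_bigrams$bigram
```

The same call with the default token = "words" would instead return one row per word, which is the format used in the rest of this post.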

Now that the data is in one-word-per-row format, we can manipulate it with tidy tools like dplyr.

Analyzing Word and Document Frequency: tf-idf

A central question in text mining and natural language processing (NLP) is how to quantify what a document is about. We can do this by looking at the words that make up the document. One measure of how important a word may be is its term frequency (tf), how frequently a word occurs in a document. There are words in a document, however, that occur many times but may not be important; in English, these are probably words like “the”, “is”, “of”, and so forth. We might take the approach of adding words like these to a list of stop words and removing them before analysis, but it is possible that some of these words might be more important in some documents than others. A list of stop words is not a very sophisticated approach to adjusting term frequency for commonly used words.
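For comparison, the stop-word approach mentioned above can be sketched as follows. This is only an illustration on four made-up tokens, using the stop_words data set that ships with tidytext; toy_tokens and kept are hypothetical names.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens, made up for illustration
toy_tokens <- tibble(word = c("the", "mercy", "of", "lord"))

# anti_join() drops every token that appears in the stop_words lexicon
kept <- toy_tokens %>%
  anti_join(stop_words, by = "word")

kept$word
```

Every listed word is removed regardless of the document it comes from, which is the limitation that the weighting approach described next tries to address.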

Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf (the two quantities multiplied together), the frequency of a term adjusted for how rarely it is used.

The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites. In the case of the Quran, the comparison can be between Surahs, Juz, or Hizb.

The tf-idf is a rule-of-thumb or heuristic quantity; while it has proved useful in text mining, search engines, etc., its theoretical foundations are considered less than firm by information theory experts. The inverse document frequency for any given term is defined as

idf(term) = ln[(number of documents)/(number of documents containing term)]
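The formula can be tried out directly in base R. The sketch below is only a worked example of the definition above, assuming a corpus of 114 documents (the Surahs); the function name idf and its argument names are our own and are not part of tidytext.

```r
# idf as defined above: natural log of
# (number of documents) / (number of documents containing the term)
idf <- function(n_docs, n_docs_with_term) {
  log(n_docs / n_docs_with_term)
}

n_surahs <- 114  # number of Surahs, i.e. documents in this corpus

# A term found in every Surah, in half of them, and in only one
idf_values <- idf(n_surahs, c(114, 57, 1))
round(idf_values, 4)
# 0.0000 0.6931 4.7362
```

A term that appears everywhere gets a weight of zero, while a term found in a single Surah gets the maximum possible weight, ln(114).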

We can use tidy data principles to approach tf-idf analysis and use consistent, effective tools to quantify how important various terms are in a document that is part of a collection.

Term frequency in English Quran

We start by looking at the Surahs of the Quran and examine first term frequency, and then tf-idf. We can start just by using dplyr verbs such as group_by() and left_join(). What are the most commonly used words in the Surahs? We also calculate the total words in each Surah here, for later use.

surah_words <- quranES %>%
  unnest_tokens(word, text) %>%
  count(surah_title_en, word, sort = TRUE)
surah_words

surah_title_en <fctr>	word <chr>	n <int>
Al-Baqara	and	762
Al-Baqara	the	523
An-Nisaa	and	460
Al-Baqara	you	425
Aal-i-Imraan	and	410
Aal-i-Imraan	the	389
Al-A'raaf	and	365
Al-An'aam	and	350
Al-Baqara	of	323
Al-Maaida	and	322

total_words <- surah_words %>% 
  group_by(surah_title_en) %>% 
  summarize(total = sum(n))

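# Join explicitly on the Surah title to attach each Surah's total word count
surah_words <- left_join(surah_words, total_words, by = "surah_title_en")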

surah_words

surah_title_en <fctr>	word <chr>	n <int>	total <int>
Al-Baqara	and	762	12337
Al-Baqara	the	523	12337
An-Nisaa	and	460	7355
Al-Baqara	you	425	12337
Aal-i-Imraan	and	410	7083
Aal-i-Imraan	the	389	7083
Al-A'raaf	and	365	6798
Al-An'aam	and	350	6228
Al-Baqara	of	323	12337
Al-Maaida	and	322	5602

There is one row in this surah_words data frame for each word-surah combination; n is the number of times that word is used in that surah and total is the total number of words in that surah. The usual suspects are here with the highest n: “and”, “the”, “you”, “of”, and so forth. In the following figures, we look at the distribution of n/total for selected long surahs (each with 120 or more verses), that is, the number of times a word appears in a surah divided by the total number of terms (words) in that surah. This is exactly what term frequency is.
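To make the n/total ratio concrete, here is a minimal sketch that recomputes it for three rows copied from the output above. The tibble tf_example is a hypothetical stand-in for surah_words so that the snippet runs without the quRan package.

```r
library(dplyr)

# Three rows copied from the surah_words output above
tf_example <- tibble(
  surah_title_en = c("Al-Baqara", "Al-Baqara", "An-Nisaa"),
  word           = c("and", "the", "and"),
  n              = c(762, 523, 460),
  total          = c(12337, 12337, 7355)
) %>%
  mutate(tf = n / total)  # term frequency = count / total words in the Surah

tf_example
```

The word “and” makes up roughly 6% of the words in both Al-Baqara and An-Nisaa, which is why it lies far to the right of the 0.01 cutoff used in the histograms.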

surah_words %>%
  filter(surah_title_en %in%
           c("Al-Baqara", "Aal-i-Imraan", "An-Nisaa", "Al-Maaida", "Al-An'aam", "Al-A'raaf")) %>%
  ggplot(aes(n / total, fill = surah_title_en)) +
  geom_histogram(show.legend = TRUE) +
  # Cut the x-axis at 0.01; words with a higher term frequency are not drawn
  xlim(NA, 0.01) +
  facet_wrap(~surah_title_en, ncol = 2, scales = "free_y") +
  labs(title = "Term frequency distribution of the 6 Long Surahs",
       x = "(Word Frequency)/(Total Words)",
       y = "Count")

There are long tails to the right for these selected surahs (those extremely common words, with a term frequency above 0.01) that we have not shown in these plots.

We repeat the same with the Hamim Surahs, which are medium-length Surahs of between 4 and 10 pages each.

surah_words %>%
  filter(surah_title_en %in%
           c("Ghafir", "Fussilat", "Ash-Shura", "Az-Zukhruf", "Ad-Dukhaan", "Al-Jaathiya")) %>%
  ggplot(aes(n / total, fill = surah_title_en)) +
  geom_histogram(show.legend = TRUE) +
  # Cut the x-axis at 0.01; words with a higher term frequency are not drawn
  xlim(NA, 0.01) +
  facet_wrap(~surah_title_en, ncol = 2, scales = "free_y") +
  labs(title = "Term frequency distribution of 6 Medium Surahs",
       x = "(Word Frequency)/(Total Words)",
       y = "Count")

These plots exhibit similar distributions for all the surahs, with many words that occur rarely and fewer words that occur frequently.

The bind_tf_idf function

The idea of tf-idf is to find the important words for the content of each Surah by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents, in this case, the group of Quran Surahs as a whole. Calculating tf-idf attempts to find the words that are important (i.e., common) in a Surah, but not too common. Let’s do that now.

The bind_tf_idf function in the tidytext package takes a tidy text dataset as input with one row per token (term), per document. One column (word here) contains the terms/tokens, one column contains the documents (Surah in this case), and the last necessary column contains the counts, how many times each document contains each term (n in this example). We calculated a total for each Surah for our explorations in previous sections, but it is not necessary for the bind_tf_idf function; the table only needs to contain all the words in each document.
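Before applying it to the Quran, the expected input and output can be checked on a tiny example. The sketch below uses made-up counts for two hypothetical documents, "A" and "B"; the names toy_counts, toy_tf_idf and manual are ours, and the manual value follows the tf and idf definitions given earlier in this post.

```r
library(dplyr)
library(tidytext)

# Hypothetical counts for two made-up documents, one row per term per document
toy_counts <- tibble(
  doc  = c("A", "A", "B", "B"),
  word = c("mercy", "the", "light", "the"),
  n    = c(1, 3, 2, 2)
)

# Arguments are: term column, document column, count column
toy_tf_idf <- toy_counts %>%
  bind_tf_idf(word, doc, n)

# Manual check for "mercy" in document A:
# tf = 1 / (1 + 3) = 0.25; idf = ln(2 / 1), since 1 of the 2 documents has it
manual <- 0.25 * log(2)

toy_tf_idf
```

In this toy corpus “the” occurs in both documents, so its idf and tf-idf are exactly zero, while “mercy” and “light” each occur in only one document and receive a positive weight.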

surah_words <- surah_words %>%
  bind_tf_idf(word, surah_title_en, n)

surah_words

surah_title_en <fctr>	word <chr>	n <int>	total <int>	tf <dbl>	idf <dbl>	tf_idf <dbl>
Al-Baqara	and	762	12337	0.0617654211	0.01769958	0.0010932218
Al-Baqara	the	523	12337	0.0423928021	0.02666825	0.0011305417
An-Nisaa	and	460	7355	0.0625424881	0.01769958	0.0011069756
Al-Baqara	you	425	12337	0.0344492178	0.09180755	0.0031626983
Aal-i-Imraan	and	410	7083	0.0578850769	0.01769958	0.0010245414
Aal-i-Imraan	the	389	7083	0.0549202315	0.02666825	0.0014646263
Al-A'raaf	and	365	6798	0.0536922624	0.01769958	0.0009503303
Al-An'aam	and	350	6228	0.0561978163	0.01769958	0.0009946776
Al-Baqara	of	323	12337	0.0261814055	0.05406722	0.0014155558
Al-Maaida	and	322	5602	0.0574794716	0.01769958	0.0010173623

Notice that idf, and thus tf-idf, is very close to zero for these extremely common words. A word that appears in all 114 Surahs would have an idf of exactly zero (the natural log of 1); the words above appear in almost all of them. For example, “and” has an idf of 0.0177 = ln(114/112), so it occurs in 112 of the 114 Surahs. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight for common words. The inverse document frequency will be a higher number for words that occur in fewer of the documents in the collection.
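The idf column in the table above can be inverted to see how many Surahs contain each word. This is a base R sketch that assumes the 114-Surah corpus and copies the idf values from the output; idf_table and surahs_with_word are hypothetical names.

```r
n_surahs <- 114

# idf values copied from the bind_tf_idf output above
idf_table <- c(and = 0.01769958, the = 0.02666825,
               you = 0.09180755, of = 0.05406722)

# Invert idf = ln(N / n_t) to recover n_t, the number of Surahs with the word
surahs_with_word <- round(n_surahs / exp(idf_table))
surahs_with_word
# and the you  of
# 112 111 104 108
```

Each of these words is missing from only a handful of Surahs, which is why its idf is small but not exactly zero.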

Let’s look at terms with high tf-idf in Quran Surahs.

surah_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))

surah_title_en <fctr>	word <chr>	n <int>	tf <dbl>	idf <dbl>	tf_idf <dbl>
Al-Asr	advised	2	0.0689655172	3.63758616	0.250868011
Al-Ikhlaas	begets	1	0.0416666667	4.73619845	0.197341602
Quraish	saving	2	0.0487804878	4.04305127	0.197222013
Quraish	accustomed	2	0.0487804878	3.63758616	0.177443227
An-Naas	mankind	5	0.1315789474	1.33500107	0.175658035
Al-Kawthar	kawthar	1	0.0370370370	4.73619845	0.175414757
Al-Humaza	crusher	2	0.0303030303	4.73619845	0.143521165
Al-Ikhlaas	eternal	1	0.0416666667	3.12676054	0.130281689
An-Naas	whisperer	1	0.0263157895	4.73619845	0.124636801
Quraish	security	2	0.0487804878	2.53897387	0.123852384

Some of the values for idf are the same for different terms because there are 114 Surahs and we are seeing the numerical value for ln(114/1), ln(114/2) etc.

Let’s look at a visualization for these high tf-idf words. We focus on selected long Surahs.

surah_words %>%
  filter(surah_title_en %in% 
            c("Al-Baqara", "Aal-i-Imraan", "An-Nisaa", "Al-Maaida", "Al-An'aam", "Al-A'raaf")) %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(surah_title_en) %>% 
  top_n(10, tf_idf) %>%  # keep the 10 highest tf-idf words per Surah
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = surah_title_en)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~surah_title_en, ncol = 2, scales = "free") +
  coord_flip()

Redo with the last 6 Surahs.

surah_words %>%
  filter(surah_title_en %in% 
            c("An-Naas", "Al-Falaq", "Al-Ikhlaas", "Al-Masad", "An-Nasr", "Al-Kaafiroon")) %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(surah_title_en) %>% 
  top_n(10, tf_idf) %>%  # keep the 10 highest tf-idf words per Surah
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = surah_title_en)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~surah_title_en, ncol = 2, scales = "free") +
  coord_flip()

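The words in the two figures are, as measured by tf-idf, the most important to each Surah, and most readers would likely agree. They are not only proper nouns: the table above already shows distinctive words such as “mankind” and “whisperer” for An-Naas and “begets” and “eternal” for Al-Ikhlaas. This is the point of tf-idf; it identifies words that are important to one document within a collection of documents.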

Summary

Using term frequency (tf) and inverse document frequency (idf) allows us to find words that are characteristic of one document within a collection of documents. Exploring term frequency on its own can give us insight into how language is used in a collection of natural language, and dplyr verbs like count(), group_by() and left_join() give us tools to reason about term frequency. The tidytext package uses an implementation of tf-idf consistent with tidy data principles that enables us to see how different words are important in documents within a collection or corpus of documents.

The proper noun “Allah” ranks very high on almost all the statistics of the English Quran. This confirms that “Allah” is the central and most important subject matter of the Quran, a topic that I will cover in an upcoming book [4].

Numerical analysis of words from the Quran is a good and “easy” start to #qurananalytics. It is general and robust, requires no or little manual effort, and is “surprisingly” powerful.

Reference

[1] Silge, J. and Robinson, D., Text Mining with R - A Tidy Approach, https://www.tidytextmining.com/tidytext.html
[2] The quRan R package, library(quRan)
[3] Takayama, A., Coursera: Text Mining and Analytics, https://medium.com/@taka.atsushi/coursera-text-mining-and-analytics-bf314d7e130e
[4] Alsuwaidan, T. and Hussin, A., Islam Simplified - A Holistic View of the Quran (to be published)

Quran English Word and Document Frequency With Tidytext

Azman Hussin

2020-11-13
