Introduction

The Prophet (SAW) said, “The best of you are those who learn the Quran and teach it.” [Bukhari]

Natural Language Processing (NLP) combines linguistics and data science to analyze large amounts of natural language data, including collections of speeches, text corpora, and other forms of data generated from the use of language.

Certainly these tools are new and allow new forms of analysis on old sources of text, as shown in the examples in Text Mining with R - A Tidy Approach by Julia Silge and David Robinson, my first introduction to the subject matter. I recently came across a well-prepared dataset of the Holy Quran, published on 2019-01-17. The obvious idea was to apply current text analysis tools and methods to the Quran datasets. In research, different methods applied to the same data often yield different results, so using the new methods and tools of #datascience, #textanalytics, and #networkscience to “learn the Quran” should be interesting. We intend to share our work as we progress through this project.

This article focuses on preliminary findings from basic text mining analysis. We have also done work on network analysis and topic modeling, among others, and have started applying knowledge graph techniques to the Quran data. These are for future posts.

R packages and data used

For this article we replicated some steps in Text Mining with R - A Tidy Approach. As such, we used two main R packages: the quRan package, which provides the prebuilt text of the Quran in tidy data format, and the tidytext package.

Preliminaries

Load Packages and Libraries

packages <- c('dplyr', 'tidyverse', 'tidytext', 'ggplot2', 'ggraph', 'knitr', 'quRan')
for (p in packages) {
  # Install the package if it is not already available, then load it
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

The unnest_tokens function

Before we load the data, let us start with the first chapter or Surah in the Quran, Al-Fatihah. We will mainly use the Saheeh International English translation of the Quran. For comparison purposes, we will use the Yusuf Ali translation. Both are available in the quRan package.

This section just introduces the tidy tools and the concept of tokens.

text <- c("In the name of Allah, the Entirely Merciful, the Especially Merciful",
          "[All] praise is [due] to Allah, Lord of the worlds",
          "The Entirely Merciful, the Especially Merciful",
          "Sovereign of the Day of Recompense.",
          "It is You we worship and You we ask for help.",
          "Guide us to the straight path",
          "The path of those upon whom You have bestowed favor, not of those who have evoked [Your] anger or those who go astray")

text
## [1] "In the name of Allah, the Entirely Merciful, the Especially Merciful"                                                 
## [2] "[All] praise is [due] to Allah, Lord of the worlds"                                                                   
## [3] "The Entirely Merciful, the Especially Merciful"                                                                       
## [4] "Sovereign of the Day of Recompense."                                                                                  
## [5] "It is You we worship and You we ask for help."                                                                        
## [6] "Guide us to the straight path"                                                                                        
## [7] "The path of those upon whom You have bestowed favor, not of those who have evoked [Your] anger or those who go astray"

This is a typical character vector. We first need to put it into a data frame to make it a tidy text dataset.

text_df <- tibble(line = 1:7, text = text)
text_df

A tibble is a modern class of data frame within R, available in the dplyr and tibble packages. Tibbles are great for use with tidy tools.

Notice that this data frame containing text isn’t yet compatible with tidy text analysis. We cannot filter out words or count which words occur most frequently, since each row is made up of multiple combined words. We need to convert this so that it has one-token-per-document-per-row.

A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens.

In this first example, we only have one document (Al-Fatihah), but we will explore examples with multiple documents (Surahs).

Within our tidy text framework, we need to both break the text into individual tokens (a process called tokenization) and transform it to a tidy data structure. To do this, we use tidytext’s unnest_tokens() function.

text_df %>%
  unnest_tokens(word, text)

After using unnest_tokens, there is one token (word) in each row of the new data frame; the default tokenization in unnest_tokens() is for single words.

  • Other columns, such as the line number each word came from, are retained.
  • Punctuation has been stripped.
  • By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the to_lower = FALSE argument to turn off this feature, as shown in the example right after this list.)
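
For example, a quick check using the text_df from above, keeping the original capitalization:

text_df %>%
  unnest_tokens(word, text, to_lower = FALSE)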

Now we can manipulate, process, and visualize the text using the standard set of tidy tools, namely dplyr, tidyr, and ggplot2.


Focus on Selected Quran version and Variables

The quRan package has 4 versions of the Quran. It gives the full verses of the Quran in data frames containing one row per verse, formatted to be convenient for text analysis.

  1. quran_ar (Quran in Arabic with vowels)
  2. quran_ar_min (Quran in Arabic without vowels)
  3. quran_en_sahih (Quran in English, Saheeh International translation)
  4. quran_en_yusufali (Quran in English, Yusuf Ali translation)
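
A quick way to inspect the one-row-per-verse structure described above is dplyr’s glimpse(), loaded earlier:

glimpse(quran_en_sahih)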

We will analyze selected variables (columns) from quran_en_sahih.

quranES <- quran_en_sahih %>% select(surah_id, 
                                   ayah_id,
                                   surah_title_en, 
                                   surah_title_en_trans, 
                                   revelation_type, 
                                   text, 
                                   surah, 
                                   ayah, 
                                   ayah_title)
# quranES

To work with this as a tidy dataset, we need to restructure it in the one-token-per-row format, which as we saw earlier is done with the unnest_tokens() function.

tidyES <- quranES %>%
  unnest_tokens(word, text)
tidyES

This function separates each line of text in the original data frame into tokens. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.
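
For instance, a minimal sketch of tokenizing into bigrams instead of single words (the bigram column name is just illustrative):

quranES %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)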

Now that the data is in one-word-per-row format, we can manipulate it with tidy tools like dplyr. Often in text analysis, we remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join().

data(stop_words)

tidyES <- tidyES %>%
  anti_join(stop_words)

The stop_words dataset in the tidytext package contains stop words from three lexicons. We can use them all together, as we have here, or filter() to only use one set of stop words if that is more appropriate for a certain analysis.
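
For example, a sketch restricting the stop words to a single lexicon (stop_words carries a lexicon column; "snowball" is one of the three sets):

stop_words %>%
  filter(lexicon == "snowball")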

Count and Plot

We can also use dplyr’s count() to find the most common words in the Quran as a whole.

Because we’ve been using tidy tools, our word counts are stored in a tidy data frame. This allows us to pipe the counts directly to the ggplot2 package, for example to create a visualization of the most common words in the Saheeh International translation of the Quran. We plot the words that occur more than 150 times.

tidyES %>% count(word, sort = TRUE)
tidyES %>%
  count(word, sort = TRUE) %>%
  filter(n > 150) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  theme(axis.text = element_text( 
    angle = 0, 
    color="blue", 
    size=10)
  )

Repeat the same for the Yusuf Ali translation (quran_en_yusufali)
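
The tidy Yusuf Ali dataset, tidyEY, is built the same way as tidyES; here is a minimal sketch, assuming quran_en_yusufali has the same columns as quran_en_sahih:

# Select the same columns, tokenize, and remove standard stop words (sketch)
quranEY <- quran_en_yusufali %>% select(surah_id, ayah_id, surah_title_en,
                                        surah_title_en_trans, revelation_type,
                                        text, surah, ayah, ayah_title)

tidyEY <- quranEY %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)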

This translation uses words like “thee” and “thou”, so we want to filter those out too.

my_stopwords <- tibble(word = c('ye', 'verily', 'will', 'said', 'say', 'us', 
                                'thy', 'thee', 'thou', 'hath', 'doth'))
tidyEY <- tidyEY %>%
  anti_join(my_stopwords)

Now we count and plot the words that occur more than 150 times.
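
The code mirrors the Saheeh International step above, with tidyEY swapped in:

tidyEY %>%
  count(word, sort = TRUE) %>%
  filter(n > 150) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()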


Sentiment analysis with tidy data

In the previous section, we explored the tidy text format and showed how it can easily be used to approach questions about word frequency in the English Quran. This allowed us to analyze which words are used most frequently in the Quran and to compare two versions of the English Quran.

Now let us address the topic of opinion mining or sentiment analysis. We can use the tools of text mining to approach the emotional content of text programmatically.

One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. There are other approaches but this approach can easily take advantage of the tidy tools.

The sentiments dataset

The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons are

  1. AFINN from Finn Årup Nielsen,
  2. bing from Bing Liu and collaborators, and
  3. nrc from Saif Mohammad and Peter Turney.

All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth. In this post, we will use the bing lexicon which categorizes words in a binary fashion into positive and negative categories.

The function get_sentiments() allows us to get specific sentiment lexicons, for example get_sentiments(“afinn”). Here we will use the bing lexicon:

get_sentiments("bing")
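
As a rough illustration of the word-sum approach described earlier (not part of the original analysis), we can sketch a net bing sentiment score per Surah; net_sentiment is an illustrative column name:

tidyES %>%
  inner_join(get_sentiments("bing")) %>%
  count(surah_title_en, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%  # spread() comes from tidyr (loaded via tidyverse)
  mutate(net_sentiment = positive - negative)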

Most common positive and negative words

One advantage of having the data frame with both sentiment and word is that we can analyze word counts that contribute to each sentiment. By implementing count() here with arguments of both word and sentiment, we find out how much each word contributed to each sentiment.

This can be shown visually, and we can pipe straight into ggplot2, if we like, because of the way we are consistently using tools built for handling tidy data frames.

bing_word_counts <- tidyES %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(20) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() +
  theme(axis.text = element_text( 
    angle = 0, 
    color="blue", 
    size=10))

Again, we repeat with the Yusuf Ali version.

Interestingly, for both translations the shape is identical for the two pairs of graphs. Many of the words are also similar.

Zooming into the Surahs

bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidyES %>%
  group_by(surah_title_en) %>%
  summarize(words = n())

tidyES %>%
  semi_join(bingnegative) %>%
  group_by(surah_title_en) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("surah_title_en")) %>%
  mutate(ratio = negativewords/words) %>%
  top_n(20) %>%
  ggplot(aes(x = surah_title_en, y = ratio)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  theme(axis.text = element_text( 
    angle = 0, 
    color="blue", 
    size=10))

These are the Surahs, or chapters, with the highest proportion of negative words, normalized by the number of words in each Surah. We repeat for the Yusuf Ali version.

Again, the similarity is interesting.


Wordclouds

Having our data in a tidy format is useful for other plots as well. Using the wordcloud package, we can look at the most common words in the English Quran again, but this time as a wordcloud.

library(wordcloud)
tidyES %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

For the Yusuf Ali version, the wordcloud looks similar.
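
A sketch of the same step, using tidyEY from above:

tidyEY %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))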

For other functions, such as comparison.cloud(), we may need to turn the data frame into a matrix with reshape2’s acast(). Many of these steps can be done with joins, piping, and dplyr because the data is in tidy format.

library(reshape2)
tidyES %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#eb52a6", "#54f0b1"),
                   max.words = 50)

We repeat the same for the Yusuf Ali version.

Summary

Sentiment analysis provides a way to understand the attitudes and opinions expressed in texts. In this article, we explored how to apply sentiment analysis to two versions of the English Quran using tidy data principles. The results for the two translations are broadly similar.

References

  1. Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media. https://www.tidytextmining.com/.
  2. Heiss, Andrew. 2019. quRan: Complete Text of the Qur’an. https://cran.r-project.org/package=quRan.
  3. Arnold, Taylor B. 2016. cleanNLP: A Tidy Data Model for Natural Language Processing. https://cran.r-project.org/package=cleanNLP.
  4. Arnold, Taylor, and Lauren Tilton. 2016. coreNLP: Wrappers Around Stanford Corenlp Tools. https://cran.r-project.org/package=coreNLP.
  5. Rinker, Tyler W. 2017. sentimentr: Calculate Text Polarity Sentiment. Buffalo, New York: University at Buffalo/SUNY. http://github.com/trinker/sentimentr.
  6. Pedersen, Thomas Lin. 2017. ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. https://cran.r-project.org/package=ggraph.