The Prophet (SAW) said, “The best of you are those who learn the Quran and teach it.” [Bukhari]
Natural Language Processing (NLP) combines linguistics and data science to analyze large amounts of natural language data, including collections of speeches, text corpora, and other forms of data generated from the use of language.
Certainly these tools are new, and they allow new forms of analysis on old sources of text, as shown in the examples in Text Mining with R - A Tidy Approach by Julia Silge and David Robinson, my first introduction to the subject. Recently, I came across a well-prepared dataset of the Holy Quran, published quite recently (2019-01-17). The obvious idea was to apply current text analysis tools and methods to the Quran datasets. In research, different methods applied to the same data often yield different results, so using the new methods and tools of #datascience, #textanalytics, and #networkscience to “learn the Quran” should be interesting. We intend to share our work as we progress through this project.
This article focuses on preliminary findings from basic text mining analysis. We have also done work on network analysis and topic modeling, among other techniques, and we have started applying knowledge graph techniques to the Quran data; these are for future posts.
For this article we replicated some steps from Text Mining with R - A Tidy Approach. As such, we used two main R packages: quRan, which provides the Quran text prebuilt in tidy data format, and tidytext.
# Install (if needed) and load the required packages
packages <- c('dplyr', 'tidyverse', 'tidytext', 'ggplot2', 'ggraph', 'knitr', 'quRan')
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
Before we load the data, let us start with the first chapter, or Surah, of the Quran, Al-Fatihah. We will mainly use the Saheeh International English translation of the Quran. For comparison purposes, we will use the Yusuf Ali translation. Both are available in the quRan package.
This section just introduces the tidy tools and the concept of tokens.
text <- c("In the name of Allah, the Entirely Merciful, the Especially Merciful",
"[All] praise is [due] to Allah, Lord of the worlds",
"The Entirely Merciful, the Especially Merciful",
"Sovereign of the Day of Recompense.",
"It is You we worship and You we ask for help.",
"Guide us to the straight path",
"The path of those upon whom You have bestowed favor, not of those who have evoked [Your] anger or those who go astray")
text
## [1] "In the name of Allah, the Entirely Merciful, the Especially Merciful"
## [2] "[All] praise is [due] to Allah, Lord of the worlds"
## [3] "The Entirely Merciful, the Especially Merciful"
## [4] "Sovereign of the Day of Recompense."
## [5] "It is You we worship and You we ask for help."
## [6] "Guide us to the straight path"
## [7] "The path of those upon whom You have bestowed favor, not of those who have evoked [Your] anger or those who go astray"
This is a typical character vector. We first need to put it into a data frame to make it a tidy text dataset.
text_df <- tibble(line = 1:7, text = text)
text_df
A tibble is a modern class of data frame within R, available in the dplyr and tibble packages. Tibbles are great for use with tidy tools.
Notice that this data frame containing text isn’t yet compatible with tidy text analysis. We cannot filter out words or count which words occur most frequently, since each row is made up of multiple combined words. We need to convert this so that it has one-token-per-document-per-row.
A token is a meaningful unit of text, most often a word, that we are interested in using for further analysis, and tokenization is the process of splitting text into tokens.
In this first example, we only have one document (Al-Fatihah), but we will explore examples with multiple documents (Surahs).
Within our tidy text framework, we need to both break the text into individual tokens (a process called tokenization) and transform it to a tidy data structure. To do this, we use tidytext’s unnest_tokens() function.
text_df %>%
unnest_tokens(word, text)
After using unnest_tokens, there is one token (word) in each row of the new data frame; the default tokenization in unnest_tokens() is for single words.
Now we can manipulate, process, and visualize the text using the standard set of tidy tools, namely dplyr, tidyr, and ggplot2.
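As a quick illustration (a minimal sketch; output not shown), we can count word frequencies in the tokenized Al-Fatihah with dplyr:
# Count how often each word appears in the tokenized Al-Fatihah
text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)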
The quRan package has 4 versions of the Quran. It gives the full verses of the Quran in data frames containing one row per verse, formatted to be convenient for text analysis: two in Arabic (with and without vowel marks) and two English translations (Saheeh International and Yusuf Ali).
We will analyze selected variables (columns) from quran_en_sahih.
quranES <- quran_en_sahih %>% select(surah_id,
ayah_id,
surah_title_en,
surah_title_en_trans,
revelation_type,
text,
surah,
ayah,
ayah_title)
# quranES
To work with this as a tidy dataset, we need to restructure it in the one-token-per-row format, which as we saw earlier is done with the unnest_tokens() function.
tidyES <- quranES %>%
unnest_tokens(word, text)
tidyES
This function separates each line of text in the original data frame into tokens. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.
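As a small sketch (not used further in this article), tokenizing the Saheeh International text into bigrams instead of single words would look like this:
# Tokenize into bigrams (pairs of adjacent words) instead of single words
quranES %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)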
Now that the data is in one-word-per-row format, we can manipulate it with tidy tools like dplyr. Often in text analysis, we remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join().
data(stop_words)
tidyES <- tidyES %>%
anti_join(stop_words)
The stop_words dataset in the tidytext package contains stop words from three lexicons. We can use them all together, as we have here, or filter() to only use one set of stop words if that is more appropriate for a certain analysis.
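For example, a minimal sketch of keeping only the Snowball stop-word list before the anti_join (snowball_stops is just an illustrative name):
# Keep only the Snowball lexicon from the combined stop_words dataset
snowball_stops <- stop_words %>%
  filter(lexicon == "snowball")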
We can also use dplyr’s count() to find the most common words in the Quran as a whole.
Because we’ve been using tidy tools, our word counts are stored in a tidy data frame. This allows us to pipe the result directly into the ggplot2 package, for example to create a visualization of the most common words in the Saheeh International translation of the Quran. We plot the words that occur more than 150 times.
tidyES %>% count(word, sort = TRUE)
tidyES %>%
count(word, sort = TRUE) %>%
filter(n > 150) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
theme(axis.text = element_text(
angle = 0,
color="blue",
size=10)
)
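For comparison, the Yusuf Ali translation can be prepared in the same way. The tidyEY data frame used below can be built with the following sketch, assuming quran_en_yusufali has the same columns as quran_en_sahih:
# Build the tidy, stop-word-free data frame for the Yusuf Ali translation
tidyEY <- quran_en_yusufali %>%
  select(surah_id, ayah_id, surah_title_en, surah_title_en_trans,
         revelation_type, text, surah, ayah, ayah_title) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)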
The Yusuf Ali translation uses archaic words like “thee” and “thou”, so we add a custom list of stop words and filter those out too.
my_stopwords <- tibble(word = c('ye', 'verily', 'will', 'said', 'say', 'us',
'thy', 'thee', 'thou', 'hath', 'doth'))
tidyEY <- tidyEY %>%
anti_join(my_stopwords)
Now we count and plot the words that occur more than 150 times.
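A sketch of the corresponding count and plot for the Yusuf Ali translation, mirroring the code above:
# Plot Yusuf Ali words that occur more than 150 times
tidyEY %>%
  count(word, sort = TRUE) %>%
  filter(n > 150) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  theme(axis.text = element_text(angle = 0, color = "blue", size = 10))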
In the previous section, we explored the tidy text format and showed how it can be used to approach questions about word frequency in the English Quran. This allowed us to analyze which words are used most frequently and to compare two versions of the English Quran.
Now let us address the topic of opinion mining or sentiment analysis. We can use the tools of text mining to approach the emotional content of text programmatically.
One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. There are other approaches, but this one can easily take advantage of the tidy tools.
The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons are AFINN, bing, and nrc.
All three of these lexicons are based on unigrams, i.e., single words. They contain many English words, and the words are assigned scores for positive/negative sentiment and, in some cases, emotions like joy, anger, and sadness. In this post, we will use the bing lexicon, which categorizes words in a binary fashion into positive and negative categories.
The function get_sentiments() allows us to retrieve a specific sentiment lexicon, for example get_sentiments("afinn"). Here we will use
get_sentiments("bing")
One advantage of having the data frame with both sentiment and word is that we can analyze word counts that contribute to each sentiment. By implementing count() here with arguments of both word and sentiment, we find out how much each word contributed to each sentiment.
This can be shown visually, and we can pipe straight into ggplot2, if we like, because of the way we are consistently using tools built for handling tidy data frames.
bing_word_counts <- tidyES %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
bing_word_counts
bing_word_counts %>%
group_by(sentiment) %>%
top_n(20) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip() +
theme(axis.text = element_text(
angle = 0,
color="blue",
size=10))
Interestingly, the shapes of the two pairs of graphs are nearly identical across the translations, and many of the words are similar as well.
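The Yusuf Ali graphs referred to above can be produced with the same pipeline; a minimal sketch of the underlying counts (bing_word_counts_EY is just an illustrative name):
# Word-level sentiment counts for the Yusuf Ali translation
bing_word_counts_EY <- tidyEY %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)
Next, we look at which Surahs have the highest proportion of negative words.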
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidyES %>%
group_by(surah_title_en) %>%
summarize(words = n())
tidyES %>%
semi_join(bingnegative) %>%
group_by(surah_title_en) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("surah_title_en")) %>%
mutate(ratio = negativewords/words) %>%
top_n(20) %>%
ggplot(aes(x = surah_title_en, y = ratio)) +
geom_col() +
xlab(NULL) +
coord_flip() +
theme(axis.text = element_text(
angle = 0,
color="blue",
size=10))
These are the Surahs, or chapters, with the highest proportion of negative words, normalized by the number of words in each Surah. We repeat this for the Yusuf Ali version.
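A sketch of the same calculation for the Yusuf Ali data frame (wordcounts_EY is just an illustrative name):
# Negative-word ratio per Surah for the Yusuf Ali translation
wordcounts_EY <- tidyEY %>%
  group_by(surah_title_en) %>%
  summarize(words = n())

tidyEY %>%
  semi_join(bingnegative) %>%
  group_by(surah_title_en) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts_EY, by = "surah_title_en") %>%
  mutate(ratio = negativewords / words) %>%
  top_n(20) %>%
  ggplot(aes(x = surah_title_en, y = ratio)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()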
Again, the similarity is interesting.
Having our data in a tidy format is useful for other plots as well. Using the wordcloud package, we can look at the most common words in the English Quran again, but this time as a wordcloud.
library(wordcloud)
tidyES %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
For the Yusuf Ali version, the wordcloud looks like this:
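A sketch of the wordcloud code for the Yusuf Ali data frame:
# Wordcloud of the 100 most common Yusuf Ali words
tidyEY %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))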
In other functions, such as comparison.cloud(), we may need to turn the data frame into a matrix with reshape2’s acast(). These steps can all be done with joins, piping, and dplyr because the data is in tidy format.
library(reshape2)
tidyES %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("#eb52a6", "#54f0b1"),
max.words = 50)
We repeat the same for the Yusuf Ali version.
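A sketch of the same comparison cloud for the Yusuf Ali translation:
# Comparison cloud of positive vs. negative words, Yusuf Ali translation
tidyEY %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#eb52a6", "#54f0b1"),
                   max.words = 50)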
Sentiment analysis provides a way to understand the attitudes and opinions expressed in texts. In this article, we explored how to apply sentiment analysis to two versions of the English Quran using tidy data principles. Most of the results are broadly similar across the two translations.