The aim of this post is to explore the R package udpipe as a tool for our project on #qurananalytics. udpipe is an NLP (natural language processing) and text analytics tool that provides tokenization, tagging, lemmatization, and dependency parsing of raw text, all essential steps in NLP. We will use it together with the quRan R package.
The utility of lexical analysis (part-of-speech tagging, or POS tagging) and syntactic analysis (parsing) in NLP, both of which we apply in this post, is nicely summarized in this visual [1].
(Coursera: Text Mining and Analytics by Atsushi Takayama is a good robust introduction to the subject.)
We will show a few things that can be done easily with text annotated by the udpipe package, using merely the part-of-speech (POS) tag and the lemma of each word. We closely follow the UDPipe - Basic Analytics guide, but use the verses of Surah Yusuf from the Saheeh International English Quran as found in the quRan package. Surah Yusuf is a fairly long chapter of 111 verses and mainly narrates the story of Prophet Yusuf (Joseph).
In this report we focus on basic analytical use cases of POS tagging, lemmatisation, and co-occurrences, showing some basic frequency statistics that can be extracted easily once Surah Yusuf has been annotated.
udpipe provides pretrained language models for many languages, and we can download the required model with udpipe_download_model().
# Install (if needed) and load the required packages
packages <- c("dplyr", "tidyverse", "udpipe", "ggplot2",
              "igraph", "ggraph", "knitr", "quRan", "textrank", "wordcloud")
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
# On the first run, download the pretrained English model:
# udmodel <- udpipe_download_model(language = "english")
# Set the working directory that contains the downloaded model file
setwd("F:/RProjects")
# Load the model
udmodel <- udpipe_load_model(file = 'english-ewt-ud-2.5-191206.udpipe')
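A more portable variant is sketched below: udpipe_download_model() saves the model file in the current working directory by default and returns a data.frame whose file_model column holds the path to the downloaded file, so the file name need not be hard-coded.
# Download on first use, then load, without hard-coding the model file name
udmodel_info <- udpipe_download_model(language = "english")
udmodel <- udpipe_load_model(file = udmodel_info$file_model)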
Let’s start by annotating Surah Yusuf. The annotated data.frame can then be used for basic text analytics.
# Select the surah (Surah Yusuf is chapter 12)
Q01 <- quran_en_sahih %>% filter(surah == 12)
# Annotate each verse, using the ayah title as the document id
x <- udpipe_annotate(udmodel, x = Q01$text, doc_id = Q01$ayah_title)
x <- as.data.frame(x)
The resulting data.frame has a field called upos, the Universal Part of Speech tag, and a field called lemma, the root form of each token in the text. These two fields open up a broad range of analytical possibilities.
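For instance, a quick peek at the relevant columns of the annotation (a small sketch using dplyr, which we loaded above):
# One row per token: inspect the POS tag and lemma of the first few tokens
x %>% select(doc_id, token, lemma, upos) %>% head(10)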
In most languages, nouns (NOUN) are the most common word type, followed by verbs (VERB); together with adjectives (ADJ) and proper nouns (PROPN), these are the most relevant tags for analytical purposes. For a detailed list of all POS tags, see https://universaldependencies.org/u/pos/index.html.
Takayama [1] commented that a word-based representation of documents is general and robust, requires little or no manual effort, and is “surprisingly” powerful.
stats <- txt_freq(x$upos)
# stats
stats$key <- factor(stats$key, levels = rev(stats$key))
# Plot
stats %>% ggplot() +
  geom_bar(aes(x = key, y = freq), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "UPOS (Universal Parts of Speech)",
       subtitle = "Frequency of occurrence",
       caption = "Surah Yusuf (Saheeh International)")
Parts of Speech (POS) tags allow us to easily extract the words we want to plot. We do not even need a stopword list for this: just select the nouns, verbs, or adjectives and we have the most relevant tokens for basic frequency analysis.
stats <- subset(x, upos %in% c("NOUN"))
stats <- txt_freq(stats$token)
# stats
stats$key <- factor(stats$key, levels = rev(stats$key))
# Plot
stats %>% head(20) %>% ggplot() +
  geom_bar(aes(x = key, y = freq), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "UPOS (Universal Parts of Speech)",
       subtitle = "Most occurring nouns",
       caption = "Surah Yusuf (Saheeh International)")
Obviously we are missing the names of Prophet Yusuf and his father, Prophet Ya’qub, so we need to include proper nouns (PROPN) too.
stats <- subset(x, upos %in% c("NOUN", "PROPN"))
stats <- txt_freq(stats$token)
# stats
stats$key <- factor(stats$key, levels = rev(stats$key))
# Plot
stats %>% head(20) %>% ggplot() +
  geom_bar(aes(x = key, y = freq), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "UPOS (Universal Parts of Speech)",
       subtitle = "Most Occurring Nouns and Proper Nouns",
       caption = "Surah Yusuf (Saheeh International)",
       y = "Frequency",
       x = "Keywords")
The NOUN and PROPN frequency plot correctly reflects Allah (SWT) (along with the words Lord and Him) as the central, dominant subject matter of the Quran [3]. The one noticeable noun missing from the plot is “prison”; the others are all recognizable to those familiar with Surah Yusuf.
stats <- subset(x, upos %in% c("ADJ"))
stats <- txt_freq(stats$token)
# stats
stats$key <- factor(stats$key, levels = rev(stats$key))
# Plot
stats %>% head(20) %>% ggplot() +
  geom_bar(aes(x = key, y = freq), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "UPOS (Universal Parts of Speech)",
       subtitle = "Most Occurring Adjectives",
       caption = "Surah Yusuf (Saheeh International)",
       y = "Frequency",
       x = "Keywords")
stats <- subset(x, upos %in% c("VERB"))
stats <- txt_freq(stats$token)
# stats
stats$key <- factor(stats$key, levels = rev(stats$key))
# Plot
stats %>% head(20) %>% ggplot() +
  geom_bar(aes(x = key, y = freq), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "UPOS (Universal Parts of Speech)",
       subtitle = "Most Occurring Verbs",
       caption = "Surah Yusuf (Saheeh International)",
       y = "Frequency",
       x = "Keywords")
Simple, everyday adjectives and verbs dominate the Surah. Our previous post, Quran English Word and Document Frequency With Tidytext, confirms the same. No wonder the Quran says,
“And certainly We have made the Quran easy to remember, but is there any one who will mind?” [54:17]
Analyzing single words is a good start, but multi-word expressions are usually more interesting. We can find multi-word expressions by looking at collocations (words following one another), at word co-occurrences within the same sentence, or at co-occurrences of words that appear close to one another.
Co-occurrences let us see how words are used, either in the same sentence or next to each other, and the udpipe package makes it easy to build co-occurrence graphs using the relevant POS tags.
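As a roadmap, the three approaches map onto udpipe functions roughly as follows (a sketch; each call is demonstrated in full further below):
# 1. Collocations (words following one another), e.g.
#    keywords_collocation(x, term = "word", group = "doc_id")
# 2. Co-occurrences within the same sentence, e.g.
#    cooccurrence(subset(x, upos %in% c("NOUN", "ADJ")), term = "lemma",
#                 group = c("doc_id", "paragraph_id", "sentence_id"))
# 3. Co-occurrences within a neighbourhood of n words, e.g.
#    cooccurrence(x$lemma, relevant = x$upos %in% c("NOUN", "ADJ"), skipgram = n)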
First, we look at how many times nouns, proper nouns, adjectives, verbs, and adverbs are used in the same sentence.
cooccur <- cooccurrence(x = subset(x, upos %in% c("NOUN", "PROPN", "VERB",
                                                  "ADJ", "ADV")),
                        term = "lemma",
                        group = c("doc_id", "paragraph_id", "sentence_id"))
head(cooccur)
The result can be easily visualised using the igraph and ggraph R packages.
library(igraph)
library(ggraph)
library(ggplot2)
wordnetwork <- head(cooccur, 100)
wordnetwork <- graph_from_data_frame(wordnetwork)
ggraph(wordnetwork, layout = "fr") +
  geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "#ed9de9") +
  geom_node_point(aes(size = igraph::degree(wordnetwork)), shape = 1, color = "black") +
  geom_node_text(aes(label = name), col = "darkblue", size = 3) +
  labs(title = "Co-occurrences within sentence",
       subtitle = "Top 100 Nouns, Names, Adjectives, Verbs, Adverbs",
       caption = "Surah Yusuf (Saheeh International)")
The story is revealed by Allah (SWT). The main characters are Joseph, his father, his brothers, the king, and the wife of the minister (al-’Azeez). The verb “say” dominates because it is a narrated story. It is interesting to see the strong link and co-occurrence of “know” with “Allah”.
We can visualise which words follow one another by calculating co-occurrences of words of a specific POS type that follow one another, specifying how far away we want to look for “following one another” (in the example below we set skipgram = 1, which means we look at the next word and the word after that). Here we include the major POS tags.
cooccur <- cooccurrence(x$lemma,
                        relevant = x$upos %in% c("NOUN", "PROPN", "VERB", "ADV", "ADJ"),
                        skipgram = 1)
head(cooccur, 15)
Once we have these co-occurrences, we can easily produce the same kind of plot as above.
wordnetwork <- head(cooccur, 100)
wordnetwork <- graph_from_data_frame(wordnetwork)
ggraph(wordnetwork, layout = "fr") +
  geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "#ed9de9") +
  geom_node_text(aes(label = name), col = "darkblue", size = 3, repel = TRUE) +
  labs(title = "Words following one another",
       caption = "Surah Yusuf (Saheeh International)")
Keyword correlations indicate how often terms occur together in the same document or sentence. While co-occurrences focus on frequency, the correlation between two terms can be high even when both occur only a few times, provided they almost always appear together.
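A tiny numeric illustration of this point (hypothetical data): two terms that each appear in only three of ten sentences, but always the same three, correlate perfectly even though their co-occurrence count is low.
# Hypothetical presence/absence of two terms across 10 sentences
t1 <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
t2 <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
cor(t1, t2)  # 1: perfect correlation despite only 3 co-occurrences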
We show how nouns, proper nouns, verbs, adverbs, and adjectives are correlated within each verse of Surah Yusuf.
# Build a document/term matrix: one document per sentence, terms = lemmas
x$id <- unique_identifier(x, fields = c("sentence_id", "doc_id"))
dtm <- subset(x, upos %in% c("NOUN", "PROPN", "VERB", "ADV", "ADJ"))
dtm <- document_term_frequencies(dtm, document = "id", term = "lemma")
dtm <- document_term_matrix(dtm)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 5)
# Correlate terms and keep the strongest pairings
termcorrelations <- dtm_cor(dtm)
y <- as_cooccurrence(termcorrelations)
y <- subset(y, term1 < term2 & abs(cooc) > 0.2)
y <- y[order(abs(y$cooc), decreasing = TRUE), ]
head(y, 15)
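The same data.frame can also be filtered to zoom in on a single term; for instance (a base-R sketch, assuming the lemma “Allah” survived the low-frequency filter):
# Correlations involving the lemma "Allah", strongest first
allah_cor <- subset(y, term1 == "Allah" | term2 == "Allah")
head(allah_cor[order(abs(allah_cor$cooc), decreasing = TRUE), ], 10)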
The above pairings indeed reflect the story of Prophet Joseph.
Frequency statistics of single words are nice, but many words only make sense in combination with other words, so we also want to find keywords that are combinations of words. We follow the example from An overview of keyword extraction techniques, which suggests six techniques for extracting keywords easily.
We covered (1) and (2) earlier. In this section we cover (3) and (4), and we cover (5) and (6) in the sections that follow.
Currently, the udpipe R package provides three methods to identify keywords in text: RAKE (Rapid Automatic Keyword Extraction), collocation ordering using Pointwise Mutual Information, and parts-of-speech phrase sequence detection.
Time for some more advanced machine learning. RAKE (Rapid Automatic Keyword Extraction) is one of the most popular unsupervised algorithms for extracting keywords. It is domain-independent and tries to determine key phrases in a body of text by analyzing the frequency of word appearances and their co-occurrence with other words.
RAKE looks for keywords as contiguous sequences of words that do not contain irrelevant (stop) words. Each word that is part of a candidate keyword is scored by the ratio of its degree (how often it co-occurs with other words in candidate keywords) to its frequency (how often it occurs), and the RAKE score of a candidate keyword is the sum of the scores of its constituent words.
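To make the scoring concrete, here is a minimal toy sketch of the RAKE idea on hand-made candidate phrases (the phrases are hypothetical; keywords_rake() below does the real work, including extracting the candidates from the text):
# Toy candidate phrases (real RAKE derives these by splitting the text
# at stopwords and punctuation)
phrases <- list(c("beautiful", "patience"), c("manifest", "error"),
                c("beautiful", "patience"), "patience")
words  <- unlist(phrases)
freq   <- table(words)                             # word frequency
# Word degree: total length of the phrases each word appears in
degree <- tapply(rep(lengths(phrases), lengths(phrases)), words, sum)
score  <- degree / as.vector(freq[names(degree)])  # degree-to-frequency ratio
# RAKE score of a phrase = sum of the scores of its words
sapply(phrases, function(p) sum(score[p]))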
stats <- keywords_rake(x = x, term = "lemma", group = "doc_id",
                       relevant = x$upos %in% c("NOUN", "PROPN", "VERB", "ADJ"))
# stats
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
head(subset(stats, freq > 2), 30) %>% ggplot() +
  geom_bar(aes(x = key, y = rake), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "Keywords (NOUN, PROPN, VERB, ADJ) identified by RAKE",
       caption = "Surah Yusuf (Saheeh International)",
       y = "Rake",
       x = "Keywords")
Next we look at collocations: words that follow one another more often than chance would suggest, ranked by Pointwise Mutual Information (PMI). The collocation statistics are computed on lowercased tokens.
x$word <- tolower(x$token)
stats <- keywords_collocation(x = x, term = "word", group = "doc_id")
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
# stats
head(subset(stats, freq > 2), 30) %>% ggplot() +
  geom_bar(aes(x = key, y = pmi), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "Keywords identified by PMI Collocation",
       caption = "Surah Yusuf (Saheeh International)",
       y = "PMI (Pointwise Mutual Information)",
       x = "Keywords")
Textrank builds a word network and orders it using the Google PageRank algorithm, as implemented in the textrank R package. The algorithm can both summarise text and extract keywords. It constructs a word network by checking whether words follow one another, applies PageRank on that network to identify relevant words, and then combines relevant words that follow one another into keywords. In the example below, we look for keywords made up of nouns (NOUN), proper nouns (PROPN), verbs (VERB), or adjectives (ADJ) following one another.
# library(textrank)
stats <- textrank_keywords(x$lemma,
                           relevant = x$upos %in% c("NOUN", "PROPN", "VERB", "ADJ"),
                           ngram_max = 8, sep = " ")
stats <- subset(stats$keywords, ngram > 1 & freq >= 2)
stats
# library(wordcloud)
wordcloud(words = stats$keyword, freq = stats$freq)
The plot above shows that the algorithm combines words into multi-word expressions. Again we see the dominance of the verb “say”, since Surah Yusuf is a narrated story. It is heartening to note that “fear Allah” and “do good” are in fact the top moral lessons from this Surah.
“We relate to you the best of stories through Our revelation of this Quran, though before this you were totally unaware of them.” [12:3]
We use the dependency parsing output to pair the nominal subject with the adjective or verb that governs it. When udpipe performed the annotation, each token was linked to its parent through token_id and head_token_id, and the dep_rel field indicates how the two words are related. The relation types are defined at http://universaldependencies.org/u/dep/index.html. Here we take the words whose dependency relation is nsubj (nominal subject) and add to each the adjective or verb that modifies it.
In this way we can combine what the Surah is talking about with the adjective or verb it uses when talking about that subject.
# Join each token to its parent token within the same sentence
stats <- merge(x, x,
               by.x = c("doc_id", "paragraph_id", "sentence_id", "head_token_id"),
               by.y = c("doc_id", "paragraph_id", "sentence_id", "token_id"),
               all.x = TRUE, all.y = FALSE,
               suffixes = c("", "_parent"), sort = FALSE)
# Keep nominal subjects (nouns/proper nouns) whose parent is a verb or adjective
stats <- subset(stats, dep_rel %in% "nsubj" &
                  upos %in% c("NOUN", "PROPN") &
                  upos_parent %in% c("VERB", "ADJ"))
stats$term <- paste(stats$lemma_parent, stats$lemma, sep = " ")
stats <- txt_freq(stats$term)
stats
library(wordcloud)
wordcloud(words = stats$key, freq = stats$freq, min.freq = 2, max.words = 100,
          random.order = FALSE, colors = brewer.pal(6, "Dark2"))
The plot above confirms the comment we made earlier about “say”. Another well-known moral lesson from Surah Yusuf, “fitting patience”, now appears.
This initial exploration of the udpipe package, using just one of the 114 Surahs in the Quran, has shown how easily all the sample code can be replicated. The results confirm many familiar lessons for those acquainted with the Quran, and with Surah Yusuf in particular.
The study also opens up other avenues of investigation, such as
We will probably explore (1) and (3) in our next post.