The aim of this post is to explore the R package udpipe as a tool for our project on #qurananalytics. udpipe is an NLP (natural language processing) and text analytics tool that provides tokenization, tagging, lemmatization, and dependency parsing of raw text, all essential steps in NLP. We will use it together with the quRan R package.
The utility of lexical analysis (part-of-speech tagging, or POS tagging) and syntactic analysis (parsing) in NLP, both of which we apply in this post, is nicely summarized in this visual [1].
(Coursera: Text Mining and Analytics by Atsushi Takayama is a good robust introduction to the subject.)
We will show a few things that can be done easily with text annotated by the udpipe package, using merely the part-of-speech (POS) tag and the lemma of each word. We closely follow the UDPipe - Basic Analytics guide, but use the verses of Surah Yusuf from the Saheeh International English Quran as found in the quRan package. Surah Yusuf is a fairly long chapter of 111 verses and mainly narrates the story of Prophet Yusuf (Joseph).
In this report we focus on basic analytical use cases of POS tagging, lemmatisation, and co-occurrences, showing some basic frequency statistics that can be extracted easily once Surah Yusuf has been annotated.
udpipe provides pretrained language models for many languages, and we can download the required model with udpipe_download_model().
# Install (if needed) and load the required packages
packages <- c("dplyr", "tidyverse", "udpipe", "ggplot2",
              "igraph", "ggraph", "knitr", "quRan", "textrank", "wordcloud")
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
# On the first run, download the pretrained English model:
# udmodel <- udpipe_download_model(language = "english")
# Set the working directory that contains the downloaded model file
setwd("F:/RProjects")
# Load the model
udmodel <- udpipe_load_model(file = 'english-ewt-ud-2.5-191206.udpipe')
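A more portable variant is sketched below: udpipe_download_model() saves the model file in the current working directory by default and returns a data.frame whose file_model column holds the path to the downloaded file, so the file name need not be hard-coded.
# Download on first use, then load, without hard-coding the model file name
udmodel_info <- udpipe_download_model(language = "english")
udmodel <- udpipe_load_model(file = udmodel_info$file_model)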
Let’s start by annotating Surah Yusuf. The annotated data.frame can then be used for basic text analytics.
# Select the surah (Surah Yusuf is chapter 12)
Q01 <- quran_en_sahih %>% filter(surah == 12)
# Annotate each verse, using the ayah title as the document id
x <- udpipe_annotate(udmodel, x = Q01$text, doc_id = Q01$ayah_title)
x <- as.data.frame(x)
The resulting data.frame has a field called upos, the Universal Part of Speech tag, and a field called lemma, the root form of each token in the text. These two fields open up a broad range of analytical possibilities.
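For instance, a quick peek at the relevant columns of the annotation (a small sketch using dplyr, which we loaded above):
# One row per token: inspect the POS tag and lemma of the first few tokens
x %>% select(doc_id, token, lemma, upos) %>% head(10)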
In most languages, nouns (NOUN) are the most common word type, followed by verbs (VERB); together with adjectives (ADJ) and proper nouns (PROPN), these are the most relevant tags for analytical purposes. For a detailed list of all POS tags, see https://universaldependencies.org/u/pos/index.html.
Takayama [1] commented that a word-based representation of documents is general and robust, requires little or no manual effort, and is “surprisingly” powerful.
stats <- txt_freq(x$upos)
# stats
stats$key <- factor(stats$key, levels = rev(stats$key))
# Plot
stats %>% ggplot() +
  geom_bar(aes(x = key, y = freq), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "UPOS (Universal Parts of Speech)",
       subtitle = "Frequency of occurrence",
       caption = "Surah Yusuf (Saheeh International)")
Parts of Speech (POS) tags allow us to easily extract the words we want to plot. We do not even need a stopword list for this: just select the nouns, verbs, or adjectives and we have the most relevant tokens for basic frequency analysis.
stats <- subset(x, upos %in% c("NOUN"))
stats <- txt_freq(stats$token)
# stats
stats$key <- factor(stats$key, levels = rev(stats$key))
# Plot
stats %>% head(20) %>% ggplot() +
  geom_bar(aes(x = key, y = freq), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "UPOS (Universal Parts of Speech)",
       subtitle = "Most occurring nouns",
       caption = "Surah Yusuf (Saheeh International)")
Obviously we are missing the names of Prophet Yusuf and his father, Prophet Ya’qub, so we need to include proper nouns (PROPN) too.
stats <- subset(x, upos %in% c("NOUN", "PROPN"))
stats <- txt_freq(stats$token)
# stats
stats$key <- factor(stats$key, levels = rev(stats$key))
# Plot
stats %>% head(20) %>% ggplot() +
  geom_bar(aes(x = key, y = freq), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "UPOS (Universal Parts of Speech)",
       subtitle = "Most Occurring Nouns and Proper Nouns",
       caption = "Surah Yusuf (Saheeh International)",
       y = "Frequency",
       x = "Keywords")
The NOUN and PROPN frequency plot correctly reflects Allah (SWT) (along with the words Lord and Him) as the central, dominant subject matter of the Quran [3]. The one noticeable noun missing from the plot is “prison”; the others are all recognizable to those familiar with Surah Yusuf.
stats <- subset(x, upos %in% c("ADJ"))
stats <- txt_freq(stats$token)
# stats
stats$key <- factor(stats$key, levels = rev(stats$key))
# Plot
stats %>% head(20) %>% ggplot() +
  geom_bar(aes(x = key, y = freq), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "UPOS (Universal Parts of Speech)",
       subtitle = "Most Occurring Adjectives",
       caption = "Surah Yusuf (Saheeh International)",
       y = "Frequency",
       x = "Keywords")
stats <- subset(x, upos %in% c("VERB"))
stats <- txt_freq(stats$token)
# stats
stats$key <- factor(stats$key, levels = rev(stats$key))
# Plot
stats %>% head(20) %>% ggplot() +
  geom_bar(aes(x = key, y = freq), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "UPOS (Universal Parts of Speech)",
       subtitle = "Most Occurring Verbs",
       caption = "Surah Yusuf (Saheeh International)",
       y = "Frequency",
       x = "Keywords")
Simple, everyday adjectives and verbs dominate the Surah. Our previous post, Quran English Word and Document Frequency With Tidytext, confirms the same. No wonder the Quran says,
“And certainly We have made the Quran easy to remember, but is there any one who will mind?” [54:17]
Analyzing single words is a good start, but multi-word expressions are usually more interesting. We can find multi-word expressions by looking at collocations (words following one another), at word co-occurrences within the same sentence, or at co-occurrences of words that appear close to one another.
Co-occurrences let us see how words are used, either in the same sentence or next to each other, and the udpipe package makes it easy to build co-occurrence graphs using the relevant POS tags.
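As a roadmap, the three approaches map onto udpipe functions roughly as follows (a sketch; each call is demonstrated in full further below):
# 1. Collocations (words following one another), e.g.
#    keywords_collocation(x, term = "word", group = "doc_id")
# 2. Co-occurrences within the same sentence, e.g.
#    cooccurrence(subset(x, upos %in% c("NOUN", "ADJ")), term = "lemma",
#                 group = c("doc_id", "paragraph_id", "sentence_id"))
# 3. Co-occurrences within a neighbourhood of n words, e.g.
#    cooccurrence(x$lemma, relevant = x$upos %in% c("NOUN", "ADJ"), skipgram = n)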
First, we look at how many times nouns, proper nouns, adjectives, verbs, and adverbs are used in the same sentence.
cooccur <- cooccurrence(x = subset(x, upos %in% c("NOUN", "PROPN", "VERB",
                                                  "ADJ", "ADV")),
                        term = "lemma",
                        group = c("doc_id", "paragraph_id", "sentence_id"))
head(cooccur)
The result can be easily visualised using the igraph and ggraph R packages.
library(igraph)
library(ggraph)
library(ggplot2)
wordnetwork <- head(cooccur, 100)
wordnetwork <- graph_from_data_frame(wordnetwork)
ggraph(wordnetwork, layout = "fr") +
  geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "#ed9de9") +
  geom_node_point(aes(size = igraph::degree(wordnetwork)), shape = 1, color = "black") +
  geom_node_text(aes(label = name), col = "darkblue", size = 3) +
  labs(title = "Co-occurrences within sentence",
       subtitle = "Top 100 Nouns, Names, Adjectives, Verbs, Adverbs",
       caption = "Surah Yusuf (Saheeh International)")
The story is revealed by Allah (SWT). The main characters are Joseph, his father, his brothers, the king, and the wife of the minister (al-’Azeez). The verb “say” dominates because it is a narrated story. It is interesting to see the strong link and co-occurrence of “know” with “Allah”.
We can visualise which words follow one another by calculating co-occurrences of words of a specific POS type that follow one another, specifying how far away we want to look for “following one another” (in the example below we set skipgram = 1, which means we look at the next word and the word after that). Here we include the major POS tags.
cooccur <- cooccurrence(x$lemma,
                        relevant = x$upos %in% c("NOUN", "PROPN", "VERB", "ADV", "ADJ"),
                        skipgram = 1)
head(cooccur, 15)
Once we have these co-occurrences, we can easily produce the same kind of plot as above.
wordnetwork <- head(cooccur, 100)
wordnetwork <- graph_from_data_frame(wordnetwork)
ggraph(wordnetwork, layout = "fr") +
  geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "#ed9de9") +
  geom_node_text(aes(label = name), col = "darkblue", size = 3, repel = TRUE) +
  labs(title = "Words following one another",
       caption = "Surah Yusuf (Saheeh International)")
Keyword correlations indicate how often terms occur together in the same document or sentence. While co-occurrences focus on frequency, the correlation between two terms can be high even when both occur only a few times, provided they almost always appear together.
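A tiny numeric illustration of this point (hypothetical data): two terms that each appear in only three of ten sentences, but always the same three, correlate perfectly even though their co-occurrence count is low.
# Hypothetical presence/absence of two terms across 10 sentences
t1 <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
t2 <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
cor(t1, t2)  # 1: perfect correlation despite only 3 co-occurrences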
We show how nouns, proper nouns, verbs, adverbs, and adjectives are correlated within each verse of Surah Yusuf.
# Build a document/term matrix: one document per sentence, terms = lemmas
x$id <- unique_identifier(x, fields = c("sentence_id", "doc_id"))
dtm <- subset(x, upos %in% c("NOUN", "PROPN", "VERB", "ADV", "ADJ"))
dtm <- document_term_frequencies(dtm, document = "id", term = "lemma")
dtm <- document_term_matrix(dtm)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 5)
# Correlate terms and keep the strongest pairings
termcorrelations <- dtm_cor(dtm)
y <- as_cooccurrence(termcorrelations)
y <- subset(y, term1 < term2 & abs(cooc) > 0.2)
y <- y[order(abs(y$cooc), decreasing = TRUE), ]
head(y, 15)
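The same data.frame can also be filtered to zoom in on a single term; for instance (a base-R sketch, assuming the lemma “Allah” survived the low-frequency filter):
# Correlations involving the lemma "Allah", strongest first
allah_cor <- subset(y, term1 == "Allah" | term2 == "Allah")
head(allah_cor[order(abs(allah_cor$cooc), decreasing = TRUE), ], 10)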
The above pairings indeed reflect the story of Prophet Joseph.
Frequency statistics of single words are nice, but many words only make sense in combination with other words, so we also want to find keywords that are combinations of words. We follow the example from An overview of keyword extraction techniques, which suggests six techniques for extracting keywords easily.
We covered (1) and (2) earlier. In this section we cover (3) and (4), and we cover (5) and (6) in the sections that follow.
Currently, the udpipe R package provides three methods to identify keywords in text: RAKE (Rapid Automatic Keyword Extraction), collocation ordering using Pointwise Mutual Information, and parts-of-speech phrase sequence detection.
Time for some more advanced machine learning. RAKE (Rapid Automatic Keyword Extraction) is one of the most popular unsupervised algorithms for extracting keywords. It is domain-independent and tries to determine key phrases in a body of text by analyzing the frequency of word appearances and their co-occurrence with other words.
RAKE looks for keywords as contiguous sequences of words that do not contain irrelevant (stop) words. Each word that is part of a candidate keyword is scored by the ratio of its degree (how often it co-occurs with other words in candidate keywords) to its frequency (how often it occurs), and the RAKE score of a candidate keyword is the sum of the scores of its constituent words.
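To make the scoring concrete, here is a minimal toy sketch of the RAKE idea on hand-made candidate phrases (the phrases are hypothetical; keywords_rake() below does the real work, including extracting the candidates from the text):
# Toy candidate phrases (real RAKE derives these by splitting the text
# at stopwords and punctuation)
phrases <- list(c("beautiful", "patience"), c("manifest", "error"),
                c("beautiful", "patience"), "patience")
words  <- unlist(phrases)
freq   <- table(words)                             # word frequency
# Word degree: total length of the phrases each word appears in
degree <- tapply(rep(lengths(phrases), lengths(phrases)), words, sum)
score  <- degree / as.vector(freq[names(degree)])  # degree-to-frequency ratio
# RAKE score of a phrase = sum of the scores of its words
sapply(phrases, function(p) sum(score[p]))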
stats <- keywords_rake(x = x, term = "lemma", group = "doc_id",
                       relevant = x$upos %in% c("NOUN", "PROPN", "VERB", "ADJ"))
# stats
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
head(subset(stats, freq > 2), 30) %>% ggplot() +
  geom_bar(aes(x = key, y = rake), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "Keywords (NOUN, PROPN, VERB, ADJ) identified by RAKE",
       caption = "Surah Yusuf (Saheeh International)",
       y = "Rake",
       x = "Keywords")
Next we look at collocations: words that follow one another more often than chance would suggest, ranked by Pointwise Mutual Information (PMI). The collocation statistics are computed on lowercased tokens.
x$word <- tolower(x$token)
stats <- keywords_collocation(x = x, term = "word", group = "doc_id")
stats$key <- factor(stats$keyword, levels = rev(stats$keyword))
# stats
head(subset(stats, freq > 2), 30) %>% ggplot() +
  geom_bar(aes(x = key, y = pmi), stat = "identity", fill = "#79ad95") +
  theme_minimal() +
  coord_flip() +
  labs(title = "Keywords identified by PMI Collocation",
       caption = "Surah Yusuf (Saheeh International)",
       y = "PMI (Pointwise Mutual Information)",
       x = "Keywords")
Textrank builds a word network and orders it using the Google PageRank algorithm, as implemented in the textrank R package. The algorithm can both summarise text and extract keywords. It constructs a word network by checking whether words follow one another, applies PageRank on that network to identify relevant words, and then combines relevant words that follow one another into keywords. In the example below, we look for keywords made up of nouns (NOUN), proper nouns (PROPN), verbs (VERB), or adjectives (ADJ) following one another.
# library(textrank)
stats <- textrank_keywords(x$lemma,
                           relevant = x$upos %in% c("NOUN", "PROPN", "VERB", "ADJ"),
                           ngram_max = 8, sep = " ")
stats <- subset(stats$keywords, ngram > 1 & freq >= 2)
stats
# library(wordcloud)
wordcloud(words = stats$keyword, freq = stats$freq)
The plot above shows that the algorithm combines words into multi-word expressions. Again we see the dominance of the verb “say”, since Surah Yusuf is a narrated story. It is heartening to note that “fear Allah” and “do good” are in fact the top moral lessons from this Surah.
“We relate to you the best of stories through Our revelation of this Quran, though before this you were totally unaware of them.” [12:3]
We use the dependency parsing output to pair the nominal subject with the adjective or verb that governs it. When udpipe performed the annotation, each token was linked to its parent through token_id and head_token_id, and the dep_rel field indicates how the two words are related. The relation types are defined at http://universaldependencies.org/u/dep/index.html. Here we take the words whose dependency relation is nsubj (nominal subject) and add to each the adjective or verb that modifies it.
In this way we can combine what the Surah is talking about with the adjective or verb it uses when talking about that subject.
# Join each token to its parent token within the same sentence
stats <- merge(x, x,
               by.x = c("doc_id", "paragraph_id", "sentence_id", "head_token_id"),
               by.y = c("doc_id", "paragraph_id", "sentence_id", "token_id"),
               all.x = TRUE, all.y = FALSE,
               suffixes = c("", "_parent"), sort = FALSE)
# Keep nominal subjects (nouns/proper nouns) whose parent is a verb or adjective
stats <- subset(stats, dep_rel %in% "nsubj" &
                  upos %in% c("NOUN", "PROPN") &
                  upos_parent %in% c("VERB", "ADJ"))
stats$term <- paste(stats$lemma_parent, stats$lemma, sep = " ")
stats <- txt_freq(stats$term)
stats
library(wordcloud)
wordcloud(words = stats$key, freq = stats$freq, min.freq = 2, max.words = 100,
          random.order = FALSE, colors = brewer.pal(6, "Dark2"))
The plot above confirms the comment we made earlier about “say”. Another well-known moral lesson from Surah Yusuf, “fitting patience”, now appears.
This initial exploration of the udpipe package, using just one of the 114 Surahs in the Quran, has shown how easily all the sample code can be replicated. The results confirm many familiar lessons for those acquainted with the Quran, and with Surah Yusuf in particular.
The study also opens up other avenues of investigation, such as
We will probably explore (1) and (3) in our next post.