1 Introduction

We have posted a few articles based on our proposed #qurananalytics project brief.

These can be considered exploratory work in applying tools from text analytics and natural language processing (NLP) to the Quran dataset available in the quRan package. We have demonstrated that many informative and interesting text analyses can be performed on the English translation of the Quran using these tools. We now have some building blocks, based on individual operations, that can perhaps be glued together into a more complete and powerful #qurananalytics process, like a flowchart that combines the various tools, data analyses and visualizations.

This article will examine topic modeling as one of the analytical building blocks. Topic modeling is a method for unsupervised classification of texts, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for.

The above figure shows that we can use tidy text principles to approach topic modeling with the same set of tidy tools we have used in the earlier posts.1 In this post, we will use the topicmodels package, tidying such models so that they can be manipulated with ggplot2 and dplyr. We’ll explore an example of clustering topics from Surah Al-Kahfi.

The figure also shows that a key step in topic modeling is the creation of a document-term matrix (DTM). In an earlier post2, we used the quanteda package to create a DTM. We will also be using it in this post. As such, we decided to begin this post with a customized tutorial on quanteda using the selected Surah Al-Kahfi. This will enable us to explore more details about this useful R tool for NLP and text analytics.3


2 Preliminaries

2.1 Load Packages and Libraries

The necessary libraries are installed (if not already available) and then loaded.

packages <- c('dplyr', 'tidyverse', 'readtext', 'ggplot2', 'ggraph', 
              'tidytext', 'knitr', 'quRan', 'igraph', 'quanteda', 'topicmodels')
for (p in packages){
  if (!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

2.2 Select Quran Version, Variables and Surah

The quRan package has 4 versions of the Quran.

  1. quran_ar
  2. quran_ar_min
  3. quran_en_sahih
  4. quran_en_yusufali

We will analyze selected Surahs and variables (columns) from quran_en_sahih.

Q18 <- quran_en_sahih %>% filter(surah == 18)
quranES <- Q18 %>% select(surah_id, 
                          ayah_id,
                          ayah,
                          surah_title_en, 
                          text,
                          ayah_title)
data(stop_words)

Next we manually create groups of the verses in Surah Al-Kahfi based on known distinct topics in the Surah. This is, we emphasize, a manual intervention based on human subject-matter knowledge. It will be useful for topic modeling and for later sections of this post.

quranES <- quranES %>% mutate(Group = ifelse(ayah %in% 1:31, "Story1",
                                      ifelse(ayah %in% 32:44, "Parable1",
                                      ifelse(ayah %in% 45:59, "Parable2",
                                      ifelse(ayah %in% 60:82, "Story2",
                                      ifelse(ayah %in% 83:102, "Story3", "End"))))))
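
For readers who prefer dplyr's case_when() over nested ifelse() calls, an equivalent formulation might look like the following sketch (not part of the original workflow; it produces the same Group variable).

# equivalent grouping with dplyr::case_when() (a sketch; same result as the nested ifelse above)
quranES <- quranES %>%
  mutate(Group = case_when(
    ayah %in% 1:31   ~ "Story1",
    ayah %in% 32:44  ~ "Parable1",
    ayah %in% 45:59  ~ "Parable2",
    ayah %in% 60:82  ~ "Story2",
    ayah %in% 83:102 ~ "Story3",
    TRUE             ~ "End"
  ))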

Now we create tokenized documents, grouped by each verse in the selected Surah. This approach differs from tidytext in that all tokens are kept grouped under each verse (sentence), which is useful for later analysis.

tokensQ = quranES$text %>% 
      tokens(remove_punct = TRUE) %>%
      tokens_tolower() %>%
      tokens_remove(pattern = stop_words$word, padding = FALSE)

Note that the tokens object is not yet in the tidy one-word-per-row format; it can, however, be converted to a document-feature matrix and then tidied for manipulation with tidy tools like dplyr.
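
As a quick illustration (a minimal sketch, not in the original post), the tokens object can be turned into a DFM and then into a one-term-per-row tibble:

# a minimal sketch: convert the tokens object to a DFM, then to a tidy tibble
tidyQ <- dfm(tokensQ) %>% tidytext::tidy()
head(tidyQ)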


3 Brief Tutorial on quanteda

quanteda has three basic types of objects:4

  1. Corpus
  • Saves character strings and variables in a data frame
  • Combines texts with document-level variables
  2. Tokens
  • Stores tokens in a list of vectors
  • More efficient than character strings, but preserves positions of words
  • Positional (string-of-words) analysis is performed using textstat_collocations(), tokens_ngrams() and tokens_select() or fcm() with window option
  3. Document-feature matrix (DFM)
  • Represents frequencies of features in documents in a matrix
  • The most efficient structure, but it does not have information on positions of words

Text analysis with quanteda goes through all those three types of objects either explicitly or implicitly.

3.1 Corpus

3.1.1 Construct a corpus from dataframe

library(quanteda)
corpQ18 <- corpus(quranES)
print(corpQ18)
## Corpus consisting of 110 documents and 6 docvars.
## text1 :
## "[All] praise is [due] to Allah, who has sent down upon His S..."
## 
## text2 :
## "[He has made it] straight, to warn of severe punishment from..."
## 
## text3 :
## "In which they will remain forever"
## 
## text4 :
## "And to warn those who say, "Allah has taken a son.""
## 
## text5 :
## "They have no knowledge of it, nor had their fathers. Grave i..."
## 
## text6 :
## "Then perhaps you would kill yourself through grief over them..."
## 
## [ reached max_ndoc ... 104 more documents ]
summary(corpQ18)
class(corpQ18)
## [1] "corpus"    "character"

We can edit the docnames for a corpus to change them from text1, text2, etc. to a meaningful identifier.
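
For example (a sketch, not run here since the rest of the post keeps the default text1, text2, ... names), the ayah titles could serve as document names:

# a sketch: use the ayah titles such as "18:1", "18:2", ... as document names
docnames(corpQ18) <- quranES$ayah_title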

3.1.2 Document-level variables

quanteda's objects keep information associated with documents. These are called "document-level variables", or "docvars", and are accessed using docvars(). To extract individual elements of document variables, we can specify field.

head(docvars(corpQ18), 10)
tail(docvars(corpQ18), 10)
head(docvars(corpQ18, field = "Group"), 10)
##  [1] "Story1" "Story1" "Story1" "Story1" "Story1" "Story1" "Story1" "Story1"
##  [9] "Story1" "Story1"
tail(docvars(corpQ18, field = "Group"), 10)
##  [1] "Story3" "Story3" "End"    "End"    "End"    "End"    "End"    "End"   
##  [9] "End"    "End"

We can also access the individual document-level variables using the $ operator.

head(corpQ18$ayah_title, 10)
##  [1] "18:1"  "18:2"  "18:3"  "18:4"  "18:5"  "18:6"  "18:7"  "18:8"  "18:9" 
## [10] "18:10"
tail(corpQ18$ayah_title, 10)
##  [1] "18:101" "18:102" "18:103" "18:104" "18:105" "18:106" "18:107" "18:108"
##  [9] "18:109" "18:110"

docvars() also allows us to create or update document variables. We can also create the document-level variable using the $ operator.
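
For instance (a sketch with an assumed variable name, not part of the original post), we could store the token count of each verse as a new docvar:

# a sketch: add a document-level variable (hypothetical name "n_tokens") holding each verse's token count
docvars(corpQ18, "n_tokens") <- ntoken(corpQ18)
# equivalently, using the $ operator
corpQ18$n_tokens <- ntoken(corpQ18)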

docvars() is explained only in this section, but it works on other quanteda objects (tokens and dfm) in the same way.

3.1.3 Subset corpus

corpus_subset() allows us to select documents in a corpus based on document-level variables.

ndoc(corpQ18)
## [1] 110
head(docvars(corpQ18))
corpQ18a <- corpus_subset(corpQ18, ayah >= 100)
ndoc(corpQ18a)
## [1] 11
head(docvars(corpQ18a))
corpQ18b <- corpus_subset(corpQ18, Group %in% c("Story2", "Parable2"))
ndoc(corpQ18b)
## [1] 38
head(docvars(corpQ18b))

3.2 Tokens

tokens() segments texts in a corpus into tokens (words or sentences) by word boundaries.

tokensQ <- tokens(corpQ18)
class(tokensQ)
## [1] "tokens"

A corpus is passed to tokens() in the code above, but it works with a character string too.

toks <- tokens(quranES$text[1])
print(toks)
## Tokens consisting of 1 document.
## text1 :
##  [1] "["      "All"    "]"      "praise" "is"     "["      "due"    "]"     
##  [9] "to"     "Allah"  ","      "who"   
## [ ... and 16 more ]

By default, tokens() only removes separators (typically whitespaces), but we can remove punctuation and numbers.

toks <- tokens(quranES$text[1], remove_punct = TRUE)
print(toks)
## Tokens consisting of 1 document.
## text1 :
##  [1] "All"    "praise" "is"     "due"    "to"     "Allah"  "who"    "has"   
##  [9] "sent"   "down"   "upon"   "His"   
## [ ... and 10 more ]

3.2.1 Keywords in context

We can see how keywords are used in the actual contexts in a concordance view produced by kwic().

kw1 <- kwic(tokensQ, pattern =  "allah*")
head(kw1, 10)

kwic() also takes multiple keywords in a character vector.

kw2 <- kwic(tokensQ, pattern = c("allah*", "lord*"))
head(kw2, 10)

With the window argument, you can specify the number of words to be displayed around the keyword.

kw2 <- kwic(tokensQ, pattern = c("allah*", "know*"), window = 5)
head(kw2, 10)

To find multi-word expressions, separate words by whitespace and wrap the character vector by phrase().

kw3 <- kwic(tokensQ, pattern = phrase("right* deed*"))
head(kw3)

Use View() to see the keywords-in-context in an interactive HTML table.

View(kw2)

3.2.2 Select tokens

We can remove tokens that we are not interested in by using tokens_select(). Usually we remove function words (grammatical words) that have little or no substantive meaning in pre-processing. stopwords() returns a pre-defined list of function words.

toks_nostop <- tokens_select(toks, pattern = stopwords("en"), selection = "remove")
print(toks)
## Tokens consisting of 1 document.
## text1 :
##  [1] "All"    "praise" "is"     "due"    "to"     "Allah"  "who"    "has"   
##  [9] "sent"   "down"   "upon"   "His"   
## [ ... and 10 more ]
print(toks_nostop)
## Tokens consisting of 1 document.
## text1 :
##  [1] "praise"   "due"      "Allah"    "sent"     "upon"     "Servant" 
##  [7] "Book"     "made"     "therein"  "deviance"

Removal of tokens changes the lengths of documents, but they remain the same if we set padding = TRUE. This option is useful especially when we perform positional analysis.

toks_nostop_pad <- tokens_select(toks, pattern = stopwords("en"), 
                                 padding = TRUE, selection = "remove")
print(toks_nostop_pad)
## Tokens consisting of 1 document.
## text1 :
##  [1] ""       "praise" ""       "due"    ""       "Allah"  ""       ""      
##  [9] "sent"   ""       "upon"   ""      
## [ ... and 10 more ]

If we are only interested in certain words, we can keep these and remove others.

toks2 <- tokens_select(tokensQ, pattern = c("mose*", "merc*"), padding = TRUE)

To analyze words that appear around keywords, use the window argument.

toks2 <- tokens_select(tokensQ, pattern = c("mose*", "merc*"), 
                       padding = FALSE, window = 5)

3.2.3 n-grams

We can generate n-grams of any length using tokens_ngrams(). n-grams are sequences of tokens constructed from already tokenized text objects.

toks_ngram <- tokens_ngrams(tokensQ, n = 2:4)
head(toks_ngram[[110]], 30)
##  [1] "Say_,"         ",_\""          "\"_I"          "I_am"         
##  [5] "am_only"       "only_a"        "a_man"         "man_like"     
##  [9] "like_you"      "you_,"         ",_to"          "to_whom"      
## [13] "whom_has"      "has_been"      "been_revealed" "revealed_that"
## [17] "that_your"     "your_god"      "god_is"        "is_one"       
## [21] "one_God"       "God_."         "._So"          "So_whoever"   
## [25] "whoever_would" "would_hope"    "hope_for"      "for_the"      
## [29] "the_meeting"   "meeting_with"

tokens_ngrams() also supports skip to generate skip-grams.

toks_skip <- tokens_ngrams(tokensQ, n = 2, skip = 1:2)
head(toks_skip[[10]], 30)
##  [1] "[_]"            "[_when"         "Mention_when"   "Mention_the"   
##  [5] "]_the"          "]_youths"       "when_youths"    "when_retreated"
##  [9] "the_retreated"  "the_to"         "youths_to"      "youths_the"    
## [13] "retreated_the"  "retreated_cave" "to_cave"        "to_and"        
## [17] "the_and"        "the_said"       "cave_said"      "cave_,"        
## [21] "and_,"          "and_\""         "said_\""        "said_Our"      
## [25] ",_Our"          ",_Lord"         "\"_Lord"        "\"_,"          
## [29] "Our_,"          "Our_grant"

3.2.4 Selective ngrams

While tokens_ngrams() generates n-grams or skip-grams in all possible combinations of tokens, tokens_compound() generates n-grams more selectively. For example, we can make negation bi-grams using phrase() and a wild card (*).

toks_neg_bigram <- tokens_compound(tokensQ, pattern = phrase("not *"))
toks_neg_bigram_select <- tokens_select(toks_neg_bigram, pattern = phrase("not_*"))
head(toks_neg_bigram_select, 30)
## Tokens consisting of 30 documents and 6 docvars.
## text1 :
## [1] "not_made"
## 
## text2 :
## character(0)
## 
## text3 :
## character(0)
## 
## text4 :
## character(0)
## 
## text5 :
## [1] "not_except"
## 
## text6 :
## [1] "not_believe"
## 
## [ reached max_ndoc ... 24 more documents ]

tokens_ngrams() is an efficient function, but it returns a large object if multiple values are given to n or skip. Since n-grams inflate the size of objects without adding much information, it is often better to generate them more selectively using tokens_compound().

3.3 Document-feature matrix (DFM)

dfm() constructs a document-feature matrix (DFM) from a tokens object.

tokensQ <- tokens(corpQ18, remove_punct = TRUE)
dfmQ <- dfm(tokensQ)
print(dfmQ)
## Document-feature matrix of: 110 documents, 751 features (96.7% sparse) and 6 docvars.
##        features
## docs    all praise is due to allah who has sent down
##   text1   1      1  1   1  1     1   1   2    1    1
##   text2   0      0  0   0  3     0   1   1    0    0
##   text3   0      0  0   0  0     0   0   0    0    0
##   text4   0      0  0   0  1     1   1   1    0    0
##   text5   0      0  1   0  0     0   0   0    0    0
##   text6   0      0  0   0  0     0   0   0    0    0
## [ reached max_ndoc ... 104 more documents, reached max_nfeat ... 741 more features ]

We can get the number of documents and features with ndoc() and nfeat().

ndoc(dfmQ)
## [1] 110
nfeat(dfmQ)
## [1] 751

We can also obtain the names of documents and features by docnames() and featnames().

head(docnames(dfmQ), 20)
##  [1] "text1"  "text2"  "text3"  "text4"  "text5"  "text6"  "text7"  "text8" 
##  [9] "text9"  "text10" "text11" "text12" "text13" "text14" "text15" "text16"
## [17] "text17" "text18" "text19" "text20"
head(featnames(dfmQ), 20)
##  [1] "all"     "praise"  "is"      "due"     "to"      "allah"   "who"    
##  [8] "has"     "sent"    "down"    "upon"    "his"     "servant" "the"    
## [15] "book"    "and"     "not"     "made"    "therein" "any"

We use rowSums() and colSums() to calculate marginals, just as with ordinary matrices.

head(rowSums(dfmQ), 10)
##  text1  text2  text3  text4  text5  text6  text7  text8  text9 text10 
##     22     31      6     11     26     24     27     14     19     26
head(colSums(dfmQ), 10)
##    all praise     is    due     to  allah    who    has   sent   down 
##      3      1     45      1     78     17     24     12      1      2

The most frequent features can be found using topfeatures().

topfeatures(dfmQ, 10)
##  and  the   of   to  you they will them    a   we 
##  167  130   91   78   78   72   68   66   62   54

To convert the frequency count to a proportion within documents, use dfm_weight(scheme = "prop").

dfmQ_prop <- dfm_weight(dfmQ, scheme  = "prop")
print(dfmQ_prop)
## Document-feature matrix of: 110 documents, 751 features (96.7% sparse) and 6 docvars.
##        features
## docs           all     praise         is        due         to      allah
##   text1 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455 0.04545455
##   text2 0          0          0          0          0.09677419 0         
##   text3 0          0          0          0          0          0         
##   text4 0          0          0          0          0.09090909 0.09090909
##   text5 0          0          0.03846154 0          0          0         
##   text6 0          0          0          0          0          0         
##        features
## docs           who        has       sent       down
##   text1 0.04545455 0.09090909 0.04545455 0.04545455
##   text2 0.03225806 0.03225806 0          0         
##   text3 0          0          0          0         
##   text4 0.09090909 0.09090909 0          0         
##   text5 0          0          0          0         
##   text6 0          0          0          0         
## [ reached max_ndoc ... 104 more documents, reached max_nfeat ... 741 more features ]

We can also weight the frequency counts by the uniqueness of the features across documents using dfm_tfidf().

dfmQ_tfidf <- dfm_tfidf(dfmQ)
print(dfmQ_tfidf)
## Document-feature matrix of: 110 documents, 751 features (96.7% sparse) and 6 docvars.
##        features
## docs         all   praise        is      due        to     allah       who has
##   text1 1.564271 2.041393 0.5228787 2.041393 0.2560629 0.8653014 0.7861202   2
##   text2 0        0        0         0        0.7681886 0         0.7861202   1
##   text3 0        0        0         0        0         0         0           0
##   text4 0        0        0         0        0.2560629 0.8653014 0.7861202   1
##   text5 0        0        0.5228787 0        0         0         0           0
##   text6 0        0        0         0        0         0         0           0
##        features
## docs        sent     down
##   text1 2.041393 1.740363
##   text2 0        0       
##   text3 0        0       
##   text4 0        0       
##   text5 0        0       
##   text6 0        0       
## [ reached max_ndoc ... 104 more documents, reached max_nfeat ... 741 more features ]

3.3.1 Select features

We can select features from a DFM using dfm_select().

dfm_select(dfmQ, pattern = stopwords("en"), selection = "remove")
## Document-feature matrix of: 110 documents, 652 features (98.3% sparse) and 6 docvars.
##        features
## docs    praise due allah sent upon servant book made therein deviance
##   text1      1   1     1    1    1       1    1    1       1        1
##   text2      0   0     0    0    0       0    0    1       0        0
##   text3      0   0     0    0    0       0    0    0       0        0
##   text4      0   0     1    0    0       0    0    0       0        0
##   text5      0   0     0    0    0       0    0    0       0        0
##   text6      0   0     0    0    0       0    0    0       0        0
## [ reached max_ndoc ... 104 more documents, reached max_nfeat ... 642 more features ]

We can select features based on the length of features. In the example below, we only keep features consisting of at least five characters.

dfm_keep(dfmQ, min_nchar = 5)
## Document-feature matrix of: 110 documents, 528 features (98.3% sparse) and 6 docvars.
##        features
## docs    praise allah servant therein deviance straight severe punishment
##   text1      1     1       1       1        1        0      0          0
##   text2      0     0       0       0        0        1      1          1
##   text3      0     0       0       0        0        0      0          0
##   text4      0     1       0       0        0        0      0          0
##   text5      0     0       0       0        0        0      0          0
##   text6      0     0       0       0        0        0      0          0
##        features
## docs    tidings believers
##   text1       0         0
##   text2       1         1
##   text3       0         0
##   text4       0         0
##   text5       0         0
##   text6       0         0
## [ reached max_ndoc ... 104 more documents, reached max_nfeat ... 518 more features ]
topfeatures(dfmQ, 10)
##  and  the   of   to  you they will them    a   we 
##  167  130   91   78   78   72   68   66   62   54
dfmQ_prop <- dfm_select(dfmQ, pattern = stopwords("en"), selection = "remove")
topfeatures(dfmQ_prop, 10)
##   lord   said    say  allah indeed    one  never except people  moses 
##     38     36     18     17     15     14     12     10     10     10

While dfm_select() selects features based on patterns, dfm_trim() does this based on feature frequencies. If min_termfreq = 10, features that occur fewer than 10 times in the corpus are removed.

dfm_trim(dfmQ, min_termfreq = 10)
## Document-feature matrix of: 110 documents, 63 features (78.7% sparse) and 6 docvars.
##        features
## docs    is to allah who has his the and not he
##   text1  1  1     1   1   2   1   1   1   1  0
##   text2  0  3     0   1   1   0   1   1   0  1
##   text3  0  0     0   0   0   0   0   0   0  0
##   text4  0  1     1   1   1   0   0   1   0  0
##   text5  1  0     0   0   0   0   1   0   1  0
##   text6  0  0     0   0   0   0   0   1   1  0
## [ reached max_ndoc ... 104 more documents, reached max_nfeat ... 53 more features ]
dfm_trim(dfmQ_prop, min_termfreq = 10)
## Document-feature matrix of: 110 documents, 10 features (85.6% sparse) and 6 docvars.
##        features
## docs    allah say except indeed said lord never people one moses
##   text1     1   0      0      0    0    0     0      0   0     0
##   text2     0   0      0      0    0    0     0      0   0     0
##   text3     0   0      0      0    0    0     0      0   0     0
##   text4     1   1      0      0    0    0     0      0   0     0
##   text5     0   0      1      0    0    0     0      0   0     0
##   text6     0   0      0      0    0    0     0      0   0     0
## [ reached max_ndoc ... 104 more documents ]

If max_docfreq = 0.1, features that occur in more than 10% of the documents are removed.

dfm_trim(dfmQ, max_docfreq = 0.1, docfreq_type = "prop")
## Document-feature matrix of: 110 documents, 700 features (98.2% sparse) and 6 docvars.
##        features
## docs    all praise due has sent down upon servant book made
##   text1   1      1   1   2    1    1    1       1    1    1
##   text2   0      0   0   1    0    0    0       0    0    1
##   text3   0      0   0   0    0    0    0       0    0    0
##   text4   0      0   0   1    0    0    0       0    0    0
##   text5   0      0   0   0    0    0    0       0    0    0
##   text6   0      0   0   0    0    0    0       0    0    0
## [ reached max_ndoc ... 104 more documents, reached max_nfeat ... 690 more features ]

3.3.2 Group documents

head(colSums(dfmQ), 10)
##    all praise     is    due     to  allah    who    has   sent   down 
##      3      1     45      1     78     17     24     12      1      2
dfmQ_Group <- dfm_group(dfmQ, groups = "Group")
print(dfmQ_Group)
## Document-feature matrix of: 6 documents, 751 features (71.0% sparse) and 3 docvars.
##           features
## docs       all praise is due to allah who has sent down
##   End        0      0  3   0  3     0   2   1    0    0
##   Parable1   0      0  4   0 10     5   1   1    0    0
##   Parable2   1      0  7   0 18     1   3   3    0    1
##   Story1     1      1 25   1 21     9  13   6    1    1
##   Story2     0      0  3   0 17     1   1   0    0    0
##   Story3     1      0  3   0  9     1   4   1    0    0
## [ reached max_nfeat ... 741 more features ]
head(colSums(dfmQ_Group), 10)
##    all praise     is    due     to  allah    who    has   sent   down 
##      3      1     45      1     78     17     24     12      1      2

dfm_group() merges documents based on a vector given to the groups argument. In grouping documents, it takes the sums of feature frequencies.

dfm_group() identifies document-level variables that are the same within groups and keeps these variables.

docvars(dfmQ_Group)

3.4 Feature Co-occurence Matrix (FCM)

A feature co-occurrence matrix (FCM) records the number of co-occurrences of tokens. This is a special object in quanteda, but behaves similarly to a DFM.

When a corpus is large, we have to select features of a DFM before constructing a FCM. In the example below, we first remove all stopwords and punctuation characters. Afterwards, we remove certain patterns. Then, we keep only terms that occur at least 3 times in the DFM.

dfmQ_prop <- dfm(corpQ18, remove = stopwords("en"), remove_punct = TRUE)
dfmQ_prop <- dfm_remove(dfmQ_prop, pattern = c("Al", "o", "r", "|"))
dfmQ_prop <- dfm_trim(dfmQ_prop, min_termfreq = 3)
topfeatures(dfmQ_prop, 20)
##   lord   said    say  allah indeed    one  never except people  moses   made 
##     38     36     18     17     15     14     12     10     10     10      9 
##  deeds  earth     us    two  found   upon   make  mercy   find 
##      8      8      8      8      8      7      7      7      7
nfeat(dfmQ_prop)
## [1] 114

We can construct a FCM from a DFM or a tokens object using fcm(). topfeatures() returns the most frequently co-occurring words.

fcmQ <- fcm(dfmQ_prop)
dim(fcmQ)
## [1] 114 114
topfeatures(fcmQ, 20)
##     lord      let      one     said  knowing    found      day     like 
##       97       83       80       52       45       45       42       42 
##     long    moses     find   anyone    whose    never al-khidh     open 
##       40       40       39       39       37       36       35       32 
##      dog       us remained     ever 
##       32       31       31       31

We can select features of a FCM using fcm_select().

feat <- names(topfeatures(fcmQ, 50))
fcmQa <- fcm_select(fcmQ, pattern = feat, selection = "keep")
dim(fcmQa)
## [1] 50 50

A FCM can be used to visualize a semantic network with textplot_network().

size <- colSums(dfm_select(dfmQ, feat, selection = "keep"))
set.seed(144)
textplot_network(fcmQa, min_freq = 0.8, 
                 edge_color = "gold",
                 edge_alpha = 0.8,
                 edge_size = 1,
                 vertex_color = "grey",
                 vertex_alpha = 0.5,
                 vertex_size = size,
                 vertex_labelcolor = "black",
                 vertex_labelsize = size*0.5,
                 offset = NULL)

We can see the nodes (words) with the highest numbers of edges. Sometimes a few nodes dominate, and a log scale may help.

size <- log(colSums(dfm_select(dfmQ, feat, selection = "keep")))
set.seed(144)
textplot_network(fcmQa, min_freq = 0.8,
                 vertex_size = size/max(size),
                 vertex_labelsize = size)

We next show a plot of the top features (200 words) for Surah Al-Kahfi.

3.5 Statistical Analysis

3.5.1 Simple frequency analysis

textstat_frequency() shows both term and document frequencies. We can also use the function to find the most frequent features within groups.

tstat_freq <- textstat_frequency(dfmQ_prop, n = 10, groups = "Group")
head(tstat_freq, 20)
tstat_freq %>% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency, fill=group)) +
  geom_col() +
  coord_flip() +
  labs(x = "Token", y = "Frequency")

We can plot the frequencies easily using ggplot().

dfmQ_prop %>% 
  textstat_frequency(n = 20) %>% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point(color="salmon", size=3) +
  coord_flip() +
  labs(x = "Token", y = "Frequency") +
  theme_minimal()

We can create a word cloud of the 200 most common tokens.

set.seed(132)
textplot_wordcloud(dfmQ_prop, max_words = 200)

textplot_wordcloud(dfmQ_Group, comparison = TRUE, max_words = 200)

As shown above, we can compare different groups within one word cloud. We first create a grouped dfm, as we did for dfmQ_Group, and then compare the groups. Recall that the groups were manually defined.


4 Creating A Document Term Matrix (DTM)

One of the most common structures that text mining packages work with is the document-term matrix (DTM). This is a matrix where:5

  • each row represents one document (such as a book or article or verse),
  • each column represents one term, and
  • each value (typically) contains the number of appearances of that term in that document.

Since most pairings of document and term do not occur (they have the value zero), DTMs are usually implemented as sparse matrices. These objects can be treated as though they were matrices (for example, accessing particular rows and columns), but are stored in a more efficient format.
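
Our DFM from Section 3.3 is exactly such a sparse matrix. As a quick check (a sketch using quanteda's sparsity() helper, not in the original post):

# the DFM created in Section 3.3 is stored as a sparse matrix
dim(dfmQ)        # 110 documents x 751 features
sparsity(dfmQ)   # proportion of zero cells (about 0.967, matching the earlier printout)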

DTM objects cannot be used directly with tidy tools, just as tidy data frames cannot be used as input for most text mining packages. Thus, the tidytext package provides two verbs that convert between the two formats.

  • tidy() turns a DTM into a tidy data frame.
  • cast_ verbs turn a tidy one-term-per-row data frame back into a matrix; for example, the tidytext function cast_dfm() converts it to a dfm object from quanteda (a round-trip sketch follows below).
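
A minimal round-trip sketch using the objects created earlier (the object names tidy_dfmQ and dfm_back are hypothetical):

# a minimal sketch of converting between the two formats
tidy_dfmQ <- tidy(dfmQ)                                      # tidy data frame: document, term, count
dfm_back  <- tidy_dfmQ %>% cast_dfm(document, term, count)   # back to a quanteda dfm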

In Section 3.3 we used the dfm() function from the quanteda package to create a DTM for our selected Surah; quanteda uses the term DFM for this structure. The DFM is the required input for the LDA() function, which implements Latent Dirichlet Allocation, one of the most common algorithms for topic modeling and the one we will cover in the next section.


5 Simple Process Steps in Topic Modeling

Topic modeling is an unsupervised machine learning method, suitable for the exploration of textual data. The calculation of topic models aims to determine the proportionate composition of a fixed number of topics in the documents of a collection. Since it is an unsupervised machine learning method, it is useful to experiment with different parameters in order to find the most suitable parameters for our own analysis needs.

Topic Modeling often involves the following steps.6

  1. Read in and preprocess text data,
  2. Calculate a topic model using the R package topicmodels and analyze its results in more detail,
  3. Visualize the results from the calculated model and
  4. Select documents based on their topic composition.

We have done step [1] in preparing the dataset and creating the document-feature matrix (DFM). We will use the DFM with the stop words removed (dfmQ_prop); its most frequent features are shown below.

##   lord   said    say  allah indeed    one  never except people  moses   made 
##     38     36     18     17     15     14     12     10     10     10      9 
##  deeds  earth     us    two  found   upon   make  mercy   find 
##      8      8      8      8      8      7      7      7      7

Next we will explain the inputs and analyze the results of step [2] in more detail.

5.1 Calculate Topic Model and Analyze Results

Latent Dirichlet Allocation (LDA) is one of the most common algorithms for topic modeling. LDA is guided by two principles.

  1. Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 80% topic A and 20% topic B, while Document 2 is 40% topic A and 60% topic B.”
  2. Every topic is a mixture of words. Importantly, words can be shared between topics; in the two-topic example above, a word like “deeds” might appear in both.

LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document.

For parameterized models such as LDA, the number of topics k is the most important parameter to define in advance. How an optimal k should be selected depends on various factors. If k is too small, the collection is divided into a few very general semantic contexts. If k is too large, the collection is divided into too many topics of which some may overlap and others are hardly interpretable.

For our analysis we choose a thematic “resolution” of k = 6 topics. In contrast to a resolution of 100 or more, 6 topics can be evaluated qualitatively very easily. We also set the seed for the random number generator to ensure reproducible results between repeated model inferences.
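
One hedged way to explore this choice (a sketch, not part of the original analysis) is to fit models for a few candidate values of k and compare their log-likelihoods, assuming the DFM has already been cleaned of empty documents as described in Section 5.1.1:

# a sketch: compare log-likelihoods across several candidate values of k
ks <- c(3, 6, 9, 12)
fits <- lapply(ks, function(k) LDA(dfmQ_prop, k, method = "Gibbs",
                                   control = list(iter = 300, seed = 1234)))
data.frame(k = ks, logLik = sapply(fits, function(m) as.numeric(logLik(m))))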

5.1.1 Fitting the LDA Model

We use the LDA() function from the topicmodels package, setting k = 6, to create a six-topic LDA model.

Almost any topic model in practice will use a larger k, but we will soon see that this analysis approach extends to a larger number of topics.

This function returns an object containing the full details of the model fit, such as how words are associated with topics and how topics are associated with documents. The first time we ran the command, there was an error.

  • Error in LDA(dfmQ_prop, k = 6, control = list(seed = 1234)) : Each row of the input matrix needs to contain at least one non-zero entry

So we first run the code below to remove the rows of dfmQ_prop that contain no non-zero entries. In the loop we count the number of non-zero values per row and assign TRUE if this count is less than 1 (FALSE otherwise). The vector named ‘drop’ therefore records which rows are empty. In the final step, we exclude those rows for which drop == TRUE.

drop <- NULL
for(i in 1:NROW(dfmQ_prop)){
     count.non.zero <- sum(dfmQ_prop[i,]!=0, na.rm=TRUE)
     drop <- c(drop, count.non.zero < 1)
   }
dfmQ_prop <- dfmQ_prop[!drop == TRUE,]
# Qlda <- LDA(dfmQ, k = 6, control = list(seed = 1234))
# Qlda
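
A more idiomatic quanteda alternative (a sketch, not part of the original code) achieves the same result in one step using dfm_subset() with ntoken():

# a sketch: drop documents with zero tokens (same effect as the loop above)
dfmQ_prop <- dfm_subset(dfmQ_prop, ntoken(dfmQ_prop) > 0)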

After making sure that each row of the input DTM contains at least one non-zero entry, we run the LDA function.

# load package topicmodels
require(topicmodels)
# number of topics
k <- 6
# compute the LDA model, inference via n iterations of Gibbs sampling
Qlda <- LDA(dfmQ_prop, k, method="Gibbs", 
            control=list(iter = 300, seed = 1234, verbose = 25))
## K = 6; V = 240; M = 109
## Sampling 300 iterations!
## Iteration 25 ...
## Iteration 50 ...
## Iteration 75 ...
## Iteration 100 ...
## Iteration 125 ...
## Iteration 150 ...
## Iteration 175 ...
## Iteration 200 ...
## Iteration 225 ...
## Iteration 250 ...
## Iteration 275 ...
## Iteration 300 ...
## Gibbs sampling completed!

Depending on the size of the vocabulary, the collection size and the number k, the inference of topic models can take a long time. This calculation may take several minutes. If it takes too long, reduce the vocabulary in the DTM by increasing the minimum frequency in the previous step.

The topic model inference results in two (approximate) posterior probability distributions: a distribution theta over k topics within each document and a distribution beta over V terms within each topic, where V represents the length of the vocabulary of the collection (V = 240). Let’s take a closer look at these results:

# have a look a some of the results (posterior distributions)
tmResult <- posterior(Qlda)
# format of the resulting object
attributes(tmResult)
## $names
## [1] "terms"  "topics"
ncol(dfmQ_prop)               # lengthOfVocab
## [1] 240
beta <- tmResult$terms   # get beta from results
dim(beta)                # K distributions over ncol(DTM) terms
## [1]   6 240
rowSums(beta)            # rows in beta sum to 1
## 1 2 3 4 5 6 
## 1 1 1 1 1 1
nrow(dfmQ_prop)               # size of collection
## [1] 109
# for every document we have a probability distribution of its contained topics
theta <- tmResult$topics 
dim(theta)               # nDocs(DTM) distributions over K topics
## [1] 109   6
rowSums(theta)[1:10]     # rows in theta sum to 1
##  text1  text2  text4  text5  text6  text7  text8  text9 text10 text11 
##      1      1      1      1      1      1      1      1      1      1

We take a look at the 10 most likely terms within the term probabilities beta of the inferred topics.

terms(Qlda, 10)
##       Topic 1    Topic 2      Topic 3  Topic 4      Topic 5    Topic 6    
##  [1,] "indeed"   "said"       "allah"  "lord"       "never"    "say"      
##  [2,] "deeds"    "people"     "moses"  "except"     "two"      "said"     
##  [3,] "make"     "made"       "upon"   "day"        "find"     "one"      
##  [4,] "among"    "mercy"      "one"    "therein"    "good"     "cave"     
##  [5,] "anything" "let"        "way"    "said"       "taken"    "knowledge"
##  [6,] "mention"  "us"         "reward" "present"    "think"    "earth"    
##  [7,] "done"     "anyone"     "see"    "able"       "ever"     "right"    
##  [8,] "within"   "patience"   "set"    "remained"   "found"    "guidance" 
##  [9,] "believed" "certainly"  "whose"  "send"       "allah"    "sea"      
## [10,] "besides"  "punishment" "warn"   "disbelieve" "muhammad" "best"

For the next steps, we want to give the topics more descriptive names than just numbers. Therefore, we simply concatenate the 3 most likely terms of each topic to a string that represents a pseudo-name for each topic.

top3termsPerTopic <- terms(Qlda, 3)
topicNames <- apply(top3termsPerTopic, 2, paste, collapse=" ")
topicNames
##             Topic 1             Topic 2             Topic 3             Topic 4 
## "indeed deeds make"  "said people made"  "allah moses upon"   "lord except day" 
##             Topic 5             Topic 6 
##    "never two find"      "say said one"

5.1.2 Word-topic probabilities

The tidy() function from the tidytext package provides a method for extracting the per-topic-per-word probabilities, called β (“beta”), from the model.

library(tidytext)
Qtopics <- tidy(Qlda, matrix = "beta")
Qtopics

5.2 Visualize Results

We use dplyr’s top_n() to find the terms that are most common within each topic and then plot.

Qtop_terms <- Qtopics %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

Qtop_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered()

For those familiar with Surah Al-Kahfi, it is fitting that terms such as “cave”, “Al-Khidr” and “Moses”, and “Dhul-Qarnain” appear in separate topics; they reflect the three major stories in the Surah.

As an alternative, we could consider the terms that had the greatest difference in β between topic 1 and topic 2. To constrain it to a set of especially relevant words, we can filter for relatively common words, such as those that have a β greater than 1/1000 in at least one topic.

beta_spread <- Qtopics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001 | topic3 > .001 | 
         topic4 > .001 | topic5 > .001 | topic6 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_spread
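
To visualize these differences (a sketch, not part of the original post), we might plot the terms with the largest absolute log ratio between topic 2 and topic 1:

# a sketch: terms most distinctive of topic 2 relative to topic 1
beta_spread %>%
  top_n(20, abs(log_ratio)) %>%
  mutate(term = reorder(term, log_ratio)) %>%
  ggplot(aes(term, log_ratio)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Term", y = "Log2 ratio of beta (topic 2 / topic 1)")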

5.2.1 Document-topic probabilities

Besides estimating each topic as a mixture of words, LDA also models each document as a mixture of topics. We can examine the per-document-per-topic probabilities, called γ (“gamma”), with the matrix = “gamma” argument to tidy().

Each document in this analysis represents a single verse (ayah). Thus, we may want to know which topics are associated with each verse. We can find this by examining the per-document-per-topic probabilities, γ (“gamma”).

Qdocuments <- tidy(Qlda, matrix = "gamma")
Qdocuments

Each of these values is an estimated proportion of words from that document that are generated from that topic. For example, the model estimates that about 20% of the words in document 11 (text11) were generated from topic 1.
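
As a further hedged sketch (not in the original post), we could join these gamma values with the manually defined Group docvar and compare topic proportions across the verse groups:

# a sketch: topic proportions (gamma) by the manually defined verse groups
Qdocuments %>%
  left_join(tibble(document = docnames(dfmQ_prop),
                   Group = docvars(dfmQ_prop, field = "Group")),
            by = "document") %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ Group) +
  labs(x = "Topic", y = "gamma")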

We can tidy() the document-term matrix and check what the most common words in that document were.

tidy(dfmQ_prop) %>%
  filter(document == "text11") %>%
  arrange(desc(count))

Based on the most common words, this appears to be a verse about the youths taking refuge in the cave, which is more suitable for topic 3.


6 Summary

This post explored topic modeling for finding clusters of words that characterize a set of documents, which in our case was the 110 verses (ayahs) of Surah Al-Kahfi. We showed how to use the quanteda package through a simple tutorial. We concluded by showing how to create a DTM (DFM) as an input for the LDA function from the topicmodels package. The tidy() verb let us explore and understand these models using the tidy tools, dplyr and ggplot2. This is one of the advantages of the tidy approach to model exploration: the challenges of different output formats are handled by the tidying functions, and we can explore model results using a standard set of tools.

The words “Allah” and “Lord” rank very high on almost all the statistics of the English Quran. This confirms that “Allah” is the central and most important subject matter of the Quran, a topic that one of us will cover in an upcoming book.7

Numerical and statistical analysis of words from the Quran is a good and “easy” start to #qurananalytics. It is general and robust, requires little or no manual effort, and is “surprisingly” powerful and insightful.

7 References