04/08/17 9:15pm
I was brought up in a Christian home and have read all the Bible stories more times than I can remember. However, I will be the first to admit that I often lose sight of the bigger story, not just because the Bible is so long and I forget what happened in the previous chapter, but because I often take a piecemeal approach, reading bits and pieces here and there, whatever seems relevant to my life at the time.
To feed my curiosity and to help me read the Bible with its intended context and purpose in mind, I performed a text and sentiment analysis to discover the overarching themes and the emotional rollercoaster ridden by the authors.
I managed to find the ESV Bible in XML format, available here (http://www.opensong.org/home/download).
Key concepts in the following text analysis include creating a corpus, N-gram tokenization, the term-document matrix, the use of lexicons for sentiment classification, and stop word lexicons.
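The code that read this XML into the Bible_ESV data frame isn’t shown here. As a rough illustration only, a minimal sketch using the xml2 package might look like the following, assuming the OpenSong layout of nested book/chapter/verse elements (b, c, v, each carrying an n attribute) and an assumed file name of ESV.xml; the Testament, Book#, and Author columns in the output below were added separately.
library(xml2)
library(tibble)
library(dplyr)
library(purrr)
doc <- read_xml("ESV.xml") # file name is an assumption
# walk each book, then each chapter, and collect its verses into one tibble
Bible_ESV_raw <- xml_find_all(doc, ".//b") %>%
  map_dfr(function(book) {
    xml_find_all(book, ".//c") %>%
      map_dfr(function(chap) {
        verses <- xml_find_all(chap, ".//v")
        tibble(Book    = xml_attr(book, "n"),
               Chapter = as.integer(xml_attr(chap, "n")),
               Verse   = as.integer(xml_attr(verses, "n")),
               Text    = xml_text(verses))
      })
  })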
## # A tibble: 6 x 7
## Testament Book `Book#` Author Chapter Verse Text
## <chr> <fct> <int> <chr> <int> <int> <chr>
## 1 Old Genesis 1 Moses 1 1 In the beginning, God cr~
## 2 Old Genesis 1 Moses 1 2 The earth was without fo~
## 3 Old Genesis 1 Moses 1 3 "And God said, \"Let the~
## 4 Old Genesis 1 Moses 1 4 And God saw that the lig~
## 5 Old Genesis 1 Moses 1 5 God called the light Day~
## 6 Old Genesis 1 Moses 1 6 "And God said, \"Let the~
library(tm)
textCorpus <- VCorpus(VectorSource(Bible_ESV$Text))
Here, I convert the text to lowercase and remove punctuation, numbers, URLs, stop words, and unnecessary whitespace from the corpus.
textCorpus <- tm_map(textCorpus, content_transformer(tolower))
textCorpus <- tm_map(textCorpus, removePunctuation)
textCorpus <- tm_map(textCorpus, removeNumbers)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) # create remove URL function
textCorpus <- tm_map(textCorpus, content_transformer(removeURL)) # remove URLs using function
textCorpus <- tm_map(textCorpus, removeWords, c(stopwords(kind="en"), "said"))
textCorpus <- tm_map(textCorpus, stripWhitespace)
Just to be sure I have cleaned the data properly, I want to see the first 10 documents of the corpus.
for (i in 1:10) {
cat(paste("[[", i, "]] ", sep = ""))
writeLines(as.character(textCorpus[[i]]))
}
## [[1]] beginning god created heavens earth
## [[2]] earth without form void darkness face deep spirit god hovering face waters
## [[3]] god let light light
## [[4]] god saw light good god separated light darkness
## [[5]] god called light day darkness called night evening morning first day
## [[6]] god let expanse midst waters let separate waters waters
## [[7]] god made expanse separated waters expanse waters expanse
## [[8]] god called expanse heaven evening morning second day
## [[9]] god let waters heavens gathered together one place let dry land appear
## [[10]] god called dry land earth waters gathered together called seas god saw good
The following function lets me specify whether the terms in my term-document matrix will be in unigram, bigram, trigram, or mixed format, by setting the “min” and “max” arguments when I use it.
library(rJava)
library(RWeka)
token_delim <- " \\t\\r\\n.!?,;\"()" # characters treated as word boundaries
# Returns a tokenizer function that splits text into n-grams of length min to max
NgramTokenizer <- function(min, max) {
  result <- function(x) {
    RWeka::NGramTokenizer(x, RWeka::Weka_control(min = min, max = max, delimiters = token_delim))
  }
  return(result)
}
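As a quick sanity check (not part of the original post), the returned tokenizer can be called directly on a short piece of text; with min and max both set to 2, for example, it should return overlapping bigrams.
bigramTokenizer <- NgramTokenizer(min = 2, max = 2) # hypothetical example, reusing the function above
bigramTokenizer("in the beginning god created the heavens")
# expected to return something like "in the", "the beginning", "beginning god", ...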
A term-document matrix (TDM) shows me which terms are used in which documents of a corpus. I’ve set min/max to “1” so my terms are in unigram format (setting min/max to “2” would give bigrams instead; a bigram sketch follows the inspection below).
textTDM<- TermDocumentMatrix(textCorpus, control=list(tokenize=NgramTokenizer(min=1, max=1)))
inspect(textTDM[5180:5190,1:10])
## <<TermDocumentMatrix (terms: 11, documents: 10)>>
## Non-/sparse entries: 10/100
## Sparsity : 91%
## Maximal term length: 10
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 1 10 2 3 4 5 6 7 8 9
## goats 0 0 0 0 0 0 0 0 0 0
## goatskin 0 0 0 0 0 0 0 0 0 0
## goatskins 0 0 0 0 0 0 0 0 0 0
## gob 0 0 0 0 0 0 0 0 0 0
## god 1 2 1 1 2 1 1 1 1 1
## godbut 0 0 0 0 0 0 0 0 0 0
## goddess 0 0 0 0 0 0 0 0 0 0
## godfearing 0 0 0 0 0 0 0 0 0 0
## godhis 0 0 0 0 0 0 0 0 0 0
## godless 0 0 0 0 0 0 0 0 0 0
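For bigrams instead of single words, the same tokenizer can be reused with min and max set to 2 (a sketch only; bigrams are not used in the rest of this analysis).
textTDM_bigram <- TermDocumentMatrix(textCorpus, control=list(tokenize=NgramTokenizer(min=2, max=2)))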
Once I’ve sorted my terms from most to least frequent, I plot the top 10 most used terms.
textTDM_freq <- sort(rowSums(as.matrix(textTDM)), decreasing=TRUE)
barplot(textTDM_freq[1:10], col="dark blue", las=2, main = "Most Frequent Words", ylab = "Frequency")
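To see the actual counts behind the bars (a quick check, not part of the original output), the head of the sorted frequency vector can simply be printed.
head(textTDM_freq, 10) # top 10 terms with their raw frequencies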
The word “god” falls in 4th place. Personally, I find comfort in knowing that the Bible is really all about God, and not just a rule book bashed into our heads about what we can and cannot do.
But what’s more interesting to me is that “lord” is the most frequent word. It is not just used as another name for Jesus or God, but as an address that characterizes the relationship between us humans and God: God’s deity and sovereignty over everything we think is in our control. To clarify, I believe we can choose how we go about our lives, but our choices always lead to a set of results that are pre-authorized by God.
Finally, what’s doubly interesting is that “shall” and “will” are the second and third most frequently used words in the Bible, which tells me that the Bible is actually full of God’s promises for us. Regardless of whether these promises are good things we wait for in anticipation or bad consequences we try hard to avoid by the way we live, the takeaway for me is that there’s hardly a serious case of ambiguity in the Bible.
It’s time to look at how the ESV Bible feels.
I load the following packages.
#install.packages(c("syuzhet", "tidytext", "ggplot2","dplyr"))
library(syuzhet)
library(tidytext)
library(ggplot2)
library(dplyr)
Here I’m using the NRC lexicon to get sentiment scores for each row of text. I then attach the sentiment scores to a copy of my data.
esvSentiment<-get_nrc_sentiment(Bible_ESV$Text)
Bible_ESV_Sentiment <- cbind(Bible_ESV, esvSentiment)
head(Bible_ESV_Sentiment[,7:17])
## Text
## 1 In the beginning, God created the heavens and the earth.
## 2 The earth was without form and void, and darkness was over the face of the deep. And the Spirit of God was hovering over the face of the waters.
## 3 And God said, "Let there be light," and there was light.
## 4 And God saw that the light was good. And God separated the light from the darkness.
## 5 God called the light Day, and the darkness he called Night. And there was evening and there was morning, the first day.
## 6 And God said, "Let there be an expanse in the midst of the waters, and let it separate the waters from the waters."
## anger anticipation disgust fear joy sadness surprise trust negative
## 1 0 1 0 1 2 0 0 2 0
## 2 1 1 0 2 1 1 0 1 1
## 3 0 1 0 1 1 0 0 1 0
## 4 1 2 0 2 2 1 1 2 1
## 5 1 1 0 2 1 1 0 1 1
## 6 0 1 0 1 1 0 0 1 0
## positive
## 1 2
## 2 2
## 3 1
## 4 2
## 5 1
## 6 1
sentimentTotals <- data.frame(colSums(Bible_ESV_Sentiment[,c(8:15)]))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL
sentimentTotals$sentiment <- reorder(sentimentTotals$sentiment, -sentimentTotals$count)
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
geom_bar(aes(fill = sentiment), stat = "identity") +
theme(legend.position = "none") +
xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Scores for ESV Bible")
Trust is the dominant sentiment in the Bible by a mile, while surprise is the weakest.
library(stringr)
Bible_ESV_line <- Bible_ESV %>% mutate(linenumber = row_number())
library(tidytext)
tidy_books <- Bible_ESV_line %>% unnest_tokens(word,Text)
library(tidyr)
tidy_books_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(Book, Chapter, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
ggplot(tidy_books_sentiment, aes(Chapter, sentiment, fill = Book)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~Book, ncol = 66, scales = "free_x") +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(),
        axis.ticks.x = element_blank(), plot.title = element_blank(),
        plot.subtitle = element_blank(), strip.text = element_blank())
When reading the Bible from cover to cover, we can see that it starts strongly positive with a sudden strong negative, follows a negative trajectory towards the middle, and then ends on a positive note in the New Testament.
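As a complementary view of this cover-to-cover trajectory (a sketch, not part of the original analysis), the syuzhet package loaded earlier can also smooth verse-level scores into a single narrative arc.
esv_bing <- get_sentiment(Bible_ESV$Text, method = "bing") # verse-level bing scores
simple_plot(esv_bing) # raw and smoothed sentiment trajectories across the whole text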
library(tidytext)
textTDM_tidy <- tidy(textTDM)
head(textTDM_tidy)
## # A tibble: 6 x 3
## term document count
## <chr> <chr> <dbl>
## 1 beginning 1 1.
## 2 created 1 1.
## 3 earth 1 1.
## 4 god 1 1.
## 5 heavens 1 1.
## 6 darkness 2 1.
textTDM_tidy_sentiments <- textTDM_tidy %>% inner_join(get_sentiments("bing"), by = c(term="word"))
head(textTDM_tidy_sentiments)
## # A tibble: 6 x 4
## term document count sentiment
## <chr> <chr> <dbl> <chr>
## 1 darkness 2 1. negative
## 2 darkness 4 1. negative
## 3 good 4 1. positive
## 4 darkness 5 1. negative
## 5 heaven 8 1. positive
## 6 good 10 1. positive
textTDM_tidy_sentiments %>%
count(sentiment, term, wt=count) %>%
ungroup() %>%
filter(n >= 150) %>%
mutate(n=ifelse(sentiment =="negative", -n, n))%>%
mutate(term = reorder(term, n)) %>%
ggplot(aes(term, n, fill=sentiment))+
geom_bar(stat = "identity")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ylab("Contribution to Sentiment")
As identified by the graph, “like” is the most common positive word. However, in nearly all instances in the Bible, “like” is a neutral word used in the context of similes, so I want to include it as a stop word.
First, I briefly review the stop word lists from the three lexicons available in stop_words.
stop_words %>% group_by(lexicon) %>% summarise(noOfWords = n())
## # A tibble: 3 x 2
## lexicon noOfWords
## <chr> <int>
## 1 onix 404
## 2 SMART 571
## 3 snowball 174
get_stopwords(language = "en", source = "snowball")
## # A tibble: 175 x 2
## word lexicon
## <chr> <chr>
## 1 i snowball
## 2 me snowball
## 3 my snowball
## 4 myself snowball
## 5 we snowball
## 6 our snowball
## 7 ours snowball
## 8 ourselves snowball
## 9 you snowball
## 10 your snowball
## # ... with 165 more rows
get_stopwords(language = "en", source = "smart")
## # A tibble: 571 x 2
## word lexicon
## <chr> <chr>
## 1 a smart
## 2 a's smart
## 3 able smart
## 4 about smart
## 5 above smart
## 6 according smart
## 7 accordingly smart
## 8 across smart
## 9 actually smart
## 10 after smart
## # ... with 561 more rows
stop_words %>% filter(lexicon == "onix")
## # A tibble: 404 x 2
## word lexicon
## <chr> <chr>
## 1 a onix
## 2 about onix
## 3 above onix
## 4 across onix
## 5 after onix
## 6 again onix
## 7 against onix
## 8 all onix
## 9 almost onix
## 10 alone onix
## # ... with 394 more rows
A few things I’ve noted: the SMART lexicon already contains “like”, the word I want removed, while the onix lexicon contains sentiment-bearing words such as “good” and “great” that I want to keep. So I build my custom stop word list from the SMART and snowball lexicons only, excluding onix.
tidy_custom_stop_words <- stop_words %>% filter(!lexicon == "onix")
tidy_custom_stop_words %>% group_by(lexicon) %>% summarise(noOfWords = n())
## # A tibble: 2 x 2
## lexicon noOfWords
## <chr> <int>
## 1 SMART 571
## 2 snowball 174
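As a quick check (not in the original post) that “like” will now be treated as a stop word, I can confirm it appears in the SMART lexicon, which I have kept.
tidy_custom_stop_words %>% filter(word == "like") # should return the SMART entry for "like"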
textTDM_tidy_sentiments %>%
anti_join(tidy_custom_stop_words, by=c("term" = "word")) %>%
count(sentiment, term, wt=count) %>%
ungroup() %>%
filter(n >= 150) %>%
mutate(n=ifelse(sentiment =="negative", -n, n))%>%
mutate(term = reorder(term, n)) %>%
ggplot(aes(term, n, fill=sentiment))+
geom_bar(stat = "identity")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ylab("Contribution to Sentiment")
Here, I see the top positive words are great, good, holy, love, and heaven. And, unsurprisingly, the top negative words are evil, death, sin, fear, and dead.
So what is the overarching theme and the grand plot of the Bible based on my observations through this analysis? Text analysis tells me that it’s all about God, and His lordship over what comes of the way we choose to live our lives. There is little ambiguity, as God frequently makes promises throughout the text. Overall, the sentiment of the Bible has a balance of both positives and negatives. Words frequently used by the authors to characterize God are great, good, holy, and love, while words such as evil, death, and sin are used most to characterize Satan, the earthly world, and the human condition. The authors have a profound sentiment of trust, undoubtedly towards God.
So now when I read the Bible, I will keep this bigger story in mind. A couple of thoughts on where this analysis could go next:
The sentiment scores across the plot trajectory are based on the order in which the Bible is printed; however, it would be interesting to score the plot sentiment in the chronological order in which each book or chapter of the Bible was actually written by the authors.
Another concept not touched on, though not necessary for the purpose of my analysis, is stemming. I may try this at some point in the future and create a word cloud from it, just for fun.
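A rough sketch of what that might look like (not run as part of this analysis; the stemming backend and plotting package here are my assumptions, not choices from the original post):
library(SnowballC) # stemming backend used by tm
library(wordcloud)
stemmedCorpus <- tm_map(textCorpus, stemDocument) # e.g. "created"/"creates" both become "creat"
stemmedTDM <- TermDocumentMatrix(stemmedCorpus)
stemmedFreq <- sort(rowSums(as.matrix(stemmedTDM)), decreasing = TRUE)
set.seed(123) # for a reproducible layout
wordcloud(names(stemmedFreq), stemmedFreq, max.words = 100, colors = "dark blue")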