Bible Sentiment Analysis

Introduction

The Bible is the best-selling book of all time, with an estimated 5 billion copies sold. This report will visualize the sentiment of the American Standard Bible using two sentiment lexicons: nrc, which lists words and their associations with eight basic emotions, and afinn, which assigns each word a value from -5 to 5 based on how positive or negative it is.

The American Standard Bible is an English translation of the Bible and a revised version of the King James Bible. I chose it because it’s the most recent revision of the Bible that I could find. It was revised most recently in 2020.

Libraries:

library(tidyverse)
library(ggthemes)
library(tidytext)
library(dplyr)
library(plotly)
library(prettydoc)

Datasets:

This report uses one primary dataset, which contains every book of the American Standard Bible split up by book, chapter, and verse. It can be found here.

Method:

This data was already very clean to start with, so I didn’t have to do much in terms of cleaning. I started by un-nesting each word in the text column so I could use each individual word for my analysis. This also kept the data pertaining to book, chapter, and verse. Typically this is where you would use the stop_words dataset (words like “in”, “the”, “of”, etc.) to take out all the words that would get in the way of a traditional text analysis. The Bible has a few words (mostly archaic English) that aren’t included in stop_words, so I made a separate list of these words and took them out of the dataset as well.

american_standard_full %>% 
  unnest_tokens(word, t) -> american_standard_words

bible_stop_words <- c("saith","shalt","hath", "thou", "thy", "ye", "doth", "hast", "dost", "thine", "till", "thee")
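As a minimal sketch of this step, the snippet below runs the same un-nesting and filtering on a made-up one-row table. The sentence is a toy example rather than a row from the dataset, and the column names b, c, v, and t are assumed to mirror the real data.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for american_standard_full; the values are illustrative only
toy_verse <- tibble(b = 1, c = 1, v = 1, t = "Thou shalt love thy neighbor.")

bible_stop_words <- c("saith", "shalt", "hath", "thou", "thy", "ye",
                      "doth", "hast", "dost", "thine", "till", "thee")

# unnest_tokens() lowercases the text and strips punctuation by default,
# producing one row per word while keeping the b, c, and v columns
toy_words <- toy_verse %>%
  unnest_tokens(word, t) %>%
  filter(!word %in% bible_stop_words)

toy_words$word
```

Only “love” and “neighbor” survive here, because the other three tokens are on the custom list.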

Results:

These graphs use the nrc and afinn sentiment lexicons described above to derive meaning from the Bible, treating the order of the books in the Bible as the chronological order in which they were written.

Common Words:

Before doing any sentiment analysis, I wanted to view the most common words in the Bible overall to get an idea of how the sentiment graphs might skew, if at all.

american_standard_words %>% 
  anti_join(stop_words) %>%
  filter(!word %in% bible_stop_words) %>% 
  count(word, sort=TRUE) %>% 
  head(10) %>% 
  ggplot(aes(reorder(word, n), n, fill = word))+geom_col()+
  coord_flip()+
  labs(title = "Most Common Words in the American Standard Bible")+
  theme_fivethirtyeight()+
  scale_color_fivethirtyeight()+
  theme(legend.position = "none") -> most_common_words_graph

NRC:

The NRC is a sentiment lexicon that lists words and their association to eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). Words can often fit into multiple emotional categories as well as have a positive or negative sentiment. First we’ll look at the entire text of the Bible split into the emotional categories, then we’ll take a look at the positive and negative sentiments.

american_standard_words %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("nrc")) %>% 
  filter(sentiment != "positive" & sentiment != "negative") %>% 
  group_by(sentiment) %>%
  count(sentiment, sort = TRUE) %>%
  ggplot(aes(sentiment, n, fill = sentiment))+
  geom_col()+
  labs(title = "American Standard Bible Sentiment by Category (NRC)", x = "", y = "")+
  theme_fivethirtyeight()+
  scale_color_fivethirtyeight()+
  theme(legend.position = "none") -> emotion_categories_nrc

american_standard_words %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("nrc")) %>% 
  filter(sentiment == "positive" | sentiment == "negative") %>% 
  group_by(sentiment) %>%
  count(sentiment, sort = TRUE) %>% 
  ggplot(aes(sentiment, n, fill = sentiment))+
  geom_col()+
  labs(title = "American Standard Bible Sentiment (NRC)", x = "", y = "")+
  theme_fivethirtyeight()+
  scale_color_fivethirtyeight()+
  theme(legend.position = "none") -> sentiment_pos_neg_nrc

Here I tried to find the NRC classification of the top ten most common words in the Bible. It turns out that most of the top ten words don’t appear in the NRC lexicon. The words that are in the lexicon are “God”, “King”, and “Land”. Of particular note is “God”, which appears five times in the lexicon, in the categories anticipation, fear, joy, trust, and positive, while the other two words appear only in the positive category. Because every occurrence of such a common word is counted once per category it belongs to, this makes the skew towards the positive category more predictable.
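To illustrate why a single word can be counted several times, here is a small sketch using a toy lexicon that holds only the rows described in the paragraph above. The real lexicon would come from get_sentiments("nrc"); the toy_nrc and toy_tokens tables are hypothetical.

```r
library(dplyr)

# Toy lexicon built from the categories named in the text, not the full NRC data
toy_nrc <- tibble(
  word = c(rep("god", 5), "king", "land"),
  sentiment = c("anticipation", "fear", "joy", "trust", "positive",
                "positive", "positive")
)

# One occurrence of each word in a hypothetical text
toy_tokens <- tibble(word = c("god", "king", "land", "people"))

# inner_join() drops words missing from the lexicon and repeats a word
# once for every category it is listed under
toy_matches <- toy_tokens %>%
  inner_join(toy_nrc, by = "word") %>%
  count(word, name = "categories")

toy_matches
```

In this toy case a single “god” token contributes five rows to the joined data, while “people” contributes none.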

Afinn:

The other sentiment lexicon I chose to use was Afinn, which assigns an integer value to each word based on how positive or negative the word is. The scores range from -5 for extreme words like “bastard” up to 5 for words like “breathtaking” and “superb”. Most words tend to fall between -3 and 3, but there can always be outliers. This lexicon assigns only one value to each word, which makes visualizing the data much clearer.

american_standard_words %>% 
  anti_join(stop_words) %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(id) %>% 
  summarise(average = mean(value)) %>% 
  ggplot(aes(id, average, fill = average))+
  geom_col()+
  labs(title = "Average Sentiment by Book of the Bible (afinn)")+
  theme_fivethirtyeight()+
  scale_color_fivethirtyeight()+
  scale_color_gradient2(high = "blue", mid = "purple", low = "red", aesthetics = "fill")+
  theme(legend.position = "none", axis.text.x = element_blank())+
  annotate("segment", x = "40. ", xend = "40. ", y = 0, yend = 1.12)+
  annotate("text", x = "40. ", y = 1.2, label = "End of the Old Testament") -> avg_sentiment_afinn

ggplotly(avg_sentiment_afinn)

This plot shows a clear division in sentiment between the Old and New Testaments of the Bible. This makes sense in context: in the Old Testament, God is more often depicted trying to teach sinners lessons and sending terrible disasters and plagues upon the enemies of his chosen people, while in the New Testament, God sends his son, Jesus, to gently guide sinners to do the right thing and to spread messages of “loving thy neighbor”.

Drilling Down:

When looking at the Afinn graph I noticed a few pretty interesting bits of data:

1. The Song of Solomon is a crazy outlier in a sea of low or negative sentiment values, AKA the Old Testament.
2. The New Testament is obviously more positive, so I wonder whether there is a significant difference in the most common words between the Old and New Testaments.
3. From my little bit of Bible knowledge I assumed the most negative book was going to be Revelation. You know, with the apocalypse and everything. Instead, the most negative book of the Bible was the Book of Obadiah. Why?

Song of Solomon:

After research I found that the Song of Solomon is quite different from other books of the Bible. It’s a poem on the subject of love, longing, and physical desire that just happened to be set in the time of King Solomon. This can be seen in the most common words in the book.

american_standard_words %>% 
  anti_join(stop_words) %>%
  filter(!word %in% bible_stop_words & b == 22) %>% 
  count(word, sort=TRUE) %>% 
  head(10) %>% 
  ggplot(aes(reorder(word, n), n, fill = word))+geom_col()+
  coord_flip()+
  labs(title = "Most Common Words in the Song of Solomon")+
  theme_fivethirtyeight()+
  scale_color_fivethirtyeight()+
  theme(legend.position = "none")

Beloved, love, and fair seem to be the workhorses behind why the Song of Solomon has such high sentiment averages. Beloved and love both have values of three and fair has a value of two. Together the three most common words account for 2.67% of the entire text of the Song of Solomon and have the highest sentiment values throughout the entire book.

Old v. New Testament:

To start our comparison of the Old and New Testaments, I want to go ahead and state that the Old Testament is significantly longer than the New Testament. The Old Testament has 39 books and 929 chapters, averaging 23.8 chapters per book, and 10 of its books are 5 or fewer chapters long. The New Testament has 27 books and 260 chapters, averaging 9.6 chapters per book, and 14 of its books are 5 or fewer chapters long. In other words, more than half the books in the New Testament are 5 or fewer chapters long.
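The averages quoted above can be checked with a few lines of arithmetic. This is only a sketch using the book and chapter counts from the paragraph; the testaments data frame and its column names are made up for illustration.

```r
# Figures taken from the paragraph above
testaments <- data.frame(
  testament = c("Old", "New"),
  books = c(39, 27),
  chapters = c(929, 260),
  short_books = c(10, 14)  # books with 5 or fewer chapters
)

# Average chapters per book, rounded to one decimal place
testaments$chapters_per_book <- round(testaments$chapters / testaments$books, 1)

# Share of books in each testament that are 5 or fewer chapters long
testaments$short_share <- round(testaments$short_books / testaments$books, 2)

testaments
```

The short-book share comes out to roughly a quarter of the Old Testament versus just over half of the New Testament.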

old_testament_words %>% 
  head(15) %>% 
  ggplot(aes(reorder(word, n), n, fill = word))+geom_col()+
  coord_flip()+
  labs(title = "Most Common Words in the Old Testament")+
  theme_fivethirtyeight()+
  scale_color_fivethirtyeight()+
  theme(legend.position = "none")

new_testament_words %>% 
  head(15) %>% 
  ggplot(aes(reorder(word, n), n, fill = word))+geom_col()+
  coord_flip()+
  labs(title = "Most Common Words in the New Testament")+
  theme_fivethirtyeight()+
  scale_color_fivethirtyeight()+
  theme(legend.position = "none")

The obvious observation here is that neither “Jesus” nor “Christ” appears in the Old Testament, because he wasn’t born yet. Another observation is the difference between the types of words in each testament. The Old Testament mentions more concrete nouns such as “land”, “people”, and “house”. The New Testament mentions more abstract nouns such as “spirit”, “heaven”, and “faith”. This shows not only a difference in subject, but also a difference in theme. Where the Old Testament is focused on detailing the history and laws of Israel, the New Testament is more concerned with spirituality and fellowship centered around Jesus Christ.
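The old_testament_words and new_testament_words tables plotted above are not defined in the code shown in this report. One way they might be built is sketched below, assuming the b column holds the book number (1 to 39 for the Old Testament, 40 to 66 for the New Testament). The toy_words table is a hypothetical stand-in for the tokenized data, and the real tables would presumably also have the stop words removed first.

```r
library(dplyr)

# Hypothetical tokenized rows: book number plus one word per row
toy_words <- tibble(
  b = c(1, 1, 19, 40, 40, 66),
  word = c("land", "land", "house", "jesus", "faith", "jesus")
)

# Assumption: books 1-39 make up the Old Testament
old_testament_toy <- toy_words %>%
  filter(b <= 39) %>%
  count(word, sort = TRUE)

# Assumption: books 40-66 make up the New Testament
new_testament_toy <- toy_words %>%
  filter(b >= 40) %>%
  count(word, sort = TRUE)

old_testament_toy
new_testament_toy
```

Splitting on the book number keeps the two word counts independent, so a word like “jesus” can only ever show up in the table for the testament it actually occurs in.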

The Book of Obadiah:

american_standard_words %>% 
  anti_join(stop_words) %>%
  filter(!word %in% bible_stop_words & b == 31) %>% 
  count(word, sort=TRUE) %>% 
  head(10) %>% 
  ggplot(aes(reorder(word, n), n, fill = word))+geom_col()+
  coord_flip()+
  labs(title = "Most Common Words in the Book of Obadiah")+
  theme_fivethirtyeight()+
  scale_color_fivethirtyeight()+
  theme(legend.position = "none")

While this plot is interesting, it highlights a limitation of the Afinn sentiment lexicon. While it is vast and extensive, it doesn’t have everything. Only one of the ten most common words even has a sentiment. Even “calamity”, which I would expect to have a sentiment value, has nothing. So while the subject matter of the Book of Obadiah may be overwhelmingly negative, the full extent of its sentiment, positive or negative, can’t be fully calculated, at least using these methods. As a final point, I’ve attached every word with a sentiment in the Book of Obadiah, with their respective counts and sentiment values, to give you a peek into the data.

word	n	value
cut	4	-1
deceived	2	-3
distress	2	-2
escape	2	-1
battle	1	-1
destroy	1	-3
destruction	1	-3
disaster	1	-2
dismayed	1	-2
drunk	1	-2
fire	1	-2
leave	1	-1
peace	1	2
proudly	1	2
rejoice	1	4
shame	1	-2
steal	1	-2
treasures	1	2
violence	1	-3
vision	1	1
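As a rough check, the table above can be turned into a count-weighted average. This is a sketch that assumes the table lists every scored word in the book, so the result may not match the plotted value exactly if the plot groups the data differently; obadiah_scores is simply the table retyped by hand.

```r
library(dplyr)

# The table above, retyped: word, number of occurrences, Afinn value
obadiah_scores <- tribble(
  ~word,         ~n, ~value,
  "cut",          4,  -1,
  "deceived",     2,  -3,
  "distress",     2,  -2,
  "escape",       2,  -1,
  "battle",       1,  -1,
  "destroy",      1,  -3,
  "destruction",  1,  -3,
  "disaster",     1,  -2,
  "dismayed",     1,  -2,
  "drunk",        1,  -2,
  "fire",         1,  -2,
  "leave",        1,  -1,
  "peace",        1,   2,
  "proudly",      1,   2,
  "rejoice",      1,   4,
  "shame",        1,  -2,
  "steal",        1,  -2,
  "treasures",    1,   2,
  "violence",     1,  -3,
  "vision",       1,   1
)

# Weight each value by how often the word occurs
obadiah_summary <- obadiah_scores %>%
  summarise(scored_words = sum(n), average = sum(n * value) / sum(n))

obadiah_summary
```

With 26 scored occurrences summing to -28, the average works out to roughly -1.08, driven by a handful of words rather than by the book's most frequent vocabulary.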

Conclusion:

The doom and gloom of the Old Testament has pockets of love, and the New Testament somehow remains positive in the face of the apocalypse. By examining the text of the Bible with several different methods, I was able to uncover that it contains a multitude of different types of stories. If you want a detailed history of Israel and its people, look to the Old Testament. If you seek love, look to the Song of Solomon. The data only confirms what billions of people have known for years. The Bible has something for everyone.