04/08/17 9:15pm
I was brought up in a Christian home and have read all the Bible stories more times than I can remember. However, I will be the first to admit that I often lose sight of the bigger story, not just because the Bible is so long and I forget what happened in the previous chapter, but because I often take a piecemeal approach, reading bits and pieces here and there, whatever seems relevant to my life at the time.
To feed my curiosity and to help me read the Bible with its intended context and purpose in mind, I performed a text and sentiment analysis to discover the overarching themes and the emotional rollercoaster ridden by the authors.
I managed to find the ESV Bible in XML format, available here (http://www.opensong.org/home/download).
Key concepts in the following text analysis include creating a corpus, N-gram tokenization, the term-document matrix, the use of lexicons for sentiment classification, and stop word lexicons.
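The code that read this XML into the Bible_ESV data frame isn’t shown here. As a rough illustration only, a minimal sketch using the xml2 package might look like the following, assuming the OpenSong layout of nested book/chapter/verse elements (b, c, v, each carrying an n attribute) and an assumed file name of ESV.xml; the Testament, Book#, and Author columns in the output below were added separately.
library(xml2)
library(tibble)
library(dplyr)
library(purrr)
doc <- read_xml("ESV.xml") # file name is an assumption
# walk each book, then each chapter, and collect its verses into one tibble
Bible_ESV_raw <- xml_find_all(doc, ".//b") %>%
  map_dfr(function(book) {
    xml_find_all(book, ".//c") %>%
      map_dfr(function(chap) {
        verses <- xml_find_all(chap, ".//v")
        tibble(Book    = xml_attr(book, "n"),
               Chapter = as.integer(xml_attr(chap, "n")),
               Verse   = as.integer(xml_attr(verses, "n")),
               Text    = xml_text(verses))
      })
  })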
## # A tibble: 6 x 7
## Testament Book `Book#` Author Chapter Verse Text
## <chr> <fct> <int> <chr> <int> <int> <chr>
## 1 Old Genesis 1 Moses 1 1 In the beginning, God cr~
## 2 Old Genesis 1 Moses 1 2 The earth was without fo~
## 3 Old Genesis 1 Moses 1 3 "And God said, \"Let the~
## 4 Old Genesis 1 Moses 1 4 And God saw that the lig~
## 5 Old Genesis 1 Moses 1 5 God called the light Day~
## 6 Old Genesis 1 Moses 1 6 "And God said, \"Let the~
library(tm)
textCorpus <- VCorpus(VectorSource(Bible_ESV$Text))
Here, I convert the text to lowercase and remove punctuation, numbers, URLs, stop words, and unnecessary whitespace from the corpus.
textCorpus <- tm_map(textCorpus, content_transformer(tolower))
textCorpus <- tm_map(textCorpus, removePunctuation)
textCorpus <- tm_map(textCorpus, removeNumbers)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x) # create remove URL function
textCorpus <- tm_map(textCorpus, content_transformer(removeURL)) # remove URLs using function
textCorpus <- tm_map(textCorpus, removeWords, c(stopwords(kind="en"), "said"))
textCorpus <- tm_map(textCorpus, stripWhitespace)
Just to be sure I have cleaned the data properly, I want to see the first 10 documents of the corpus.
for (i in 1:10) {
cat(paste("[[", i, "]] ", sep = ""))
writeLines(as.character(textCorpus[[i]]))
}
## [[1]] beginning god created heavens earth
## [[2]] earth without form void darkness face deep spirit god hovering face waters
## [[3]] god let light light
## [[4]] god saw light good god separated light darkness
## [[5]] god called light day darkness called night evening morning first day
## [[6]] god let expanse midst waters let separate waters waters
## [[7]] god made expanse separated waters expanse waters expanse
## [[8]] god called expanse heaven evening morning second day
## [[9]] god let waters heavens gathered together one place let dry land appear
## [[10]] god called dry land earth waters gathered together called seas god saw good
The following function lets me specify whether the terms in my term-document matrix will be in unigram, bigram, trigram, or mixed format, by setting the “min” and “max” arguments when I use it.
library(rJava)
library(RWeka)
token_delim <- " \\t\\r\\n.!?,;\"()" # characters treated as word boundaries
# Returns a tokenizer function that splits text into n-grams of length min to max
NgramTokenizer <- function(min, max) {
  result <- function(x) {
    RWeka::NGramTokenizer(x, RWeka::Weka_control(min = min, max = max, delimiters = token_delim))
  }
  return(result)
}
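As a quick sanity check (not part of the original post), the returned tokenizer can be called directly on a short piece of text; with min and max both set to 2, for example, it should return overlapping bigrams.
bigramTokenizer <- NgramTokenizer(min = 2, max = 2) # hypothetical example, reusing the function above
bigramTokenizer("in the beginning god created the heavens")
# expected to return something like "in the", "the beginning", "beginning god", ...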
A term-document matrix (TDM) shows me which terms are used in which documents of a corpus. I’ve set min/max to “1” so my terms are in unigram format (setting min/max to “2” would give bigrams instead; a bigram sketch follows the inspection below).
textTDM<- TermDocumentMatrix(textCorpus, control=list(tokenize=NgramTokenizer(min=1, max=1)))
inspect(textTDM[5180:5190,1:10])
## <<TermDocumentMatrix (terms: 11, documents: 10)>>
## Non-/sparse entries: 10/100
## Sparsity : 91%
## Maximal term length: 10
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 1 10 2 3 4 5 6 7 8 9
## goats 0 0 0 0 0 0 0 0 0 0
## goatskin 0 0 0 0 0 0 0 0 0 0
## goatskins 0 0 0 0 0 0 0 0 0 0
## gob 0 0 0 0 0 0 0 0 0 0
## god 1 2 1 1 2 1 1 1 1 1
## godbut 0 0 0 0 0 0 0 0 0 0
## goddess 0 0 0 0 0 0 0 0 0 0
## godfearing 0 0 0 0 0 0 0 0 0 0
## godhis 0 0 0 0 0 0 0 0 0 0
## godless 0 0 0 0 0 0 0 0 0 0
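For bigrams instead of single words, the same tokenizer can be reused with min and max set to 2 (a sketch only; bigrams are not used in the rest of this analysis).
textTDM_bigram <- TermDocumentMatrix(textCorpus, control=list(tokenize=NgramTokenizer(min=2, max=2)))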
Once I’ve sorted my terms from most to least frequent, I plot the top 10 most used terms.
textTDM_freq <- sort(rowSums(as.matrix(textTDM)), decreasing=TRUE)
barplot(textTDM_freq[1:10], col="dark blue", las=2, main = "Most Frequent Words", ylab = "Frequency")
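To see the actual counts behind the bars (a quick check, not part of the original output), the head of the sorted frequency vector can simply be printed.
head(textTDM_freq, 10) # top 10 terms with their raw frequencies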
The word “god” falls in 4th place. Personally, I find comfort in knowing that the Bible is really all about God, and not just a rule book bashed into our heads about what we can and cannot do.
But what’s more interesting to me is that “lord” is the most frequent word. It is not just used as another name for Jesus or God, but as an address that characterizes the relationship between us humans and God: God’s deity and sovereignty over everything we think is in our control. To clarify, I believe we can choose how we go about our lives, but our choices always lead to a set of results that are pre-authorized by God.
Finally, what’s doubly interesting is that “shall” and “will” are the second and third most frequently used words in the Bible, which tells me that the Bible is actually full of God’s promises for us. Regardless of whether these promises are good things we wait for in anticipation or bad consequences we try hard to avoid by the way we live, the takeaway for me is that there’s hardly a serious case of ambiguity in the Bible.
It’s time to look at how the ESV Bible feels.
I load the following packages.
#install.packages(c("syuzhet", "tidytext", "ggplot2","dplyr"))
library(syuzhet)
library(tidytext)
library(ggplot2)
library(dplyr)
Here I’m using the NRC lexicon to get sentiment scores for each row of text. I then attach the sentiment scores to a copy of my data.
esvSentiment<-get_nrc_sentiment(Bible_ESV$Text)
Bible_ESV_Sentiment <- cbind(Bible_ESV, esvSentiment)
head(Bible_ESV_Sentiment[,7:17])
## Text
## 1 In the beginning, God created the heavens and the earth.
## 2 The earth was without form and void, and darkness was over the face of the deep. And the Spirit of God was hovering over the face of the waters.
## 3 And God said, "Let there be light," and there was light.
## 4 And God saw that the light was good. And God separated the light from the darkness.
## 5 God called the light Day, and the darkness he called Night. And there was evening and there was morning, the first day.
## 6 And God said, "Let there be an expanse in the midst of the waters, and let it separate the waters from the waters."
## anger anticipation disgust fear joy sadness surprise trust negative
## 1 0 1 0 1 2 0 0 2 0
## 2 1 1 0 2 1 1 0 1 1
## 3 0 1 0 1 1 0 0 1 0
## 4 1 2 0 2 2 1 1 2 1
## 5 1 1 0 2 1 1 0 1 1
## 6 0 1 0 1 1 0 0 1 0
## positive
## 1 2
## 2 2
## 3 1
## 4 2
## 5 1
## 6 1
sentimentTotals <- data.frame(colSums(Bible_ESV_Sentiment[,c(8:15)]))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL
sentimentTotals$sentiment <- reorder(sentimentTotals$sentiment, -sentimentTotals$count)
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
geom_bar(aes(fill = sentiment), stat = "identity") +
theme(legend.position = "none") +
xlab("Sentiment") + ylab("Total Count") + ggtitle("Total Sentiment Scores for ESV Bible")
Trust is the dominant sentiment in the Bible by a mile, while surprise is the weakest.
library(stringr)
Bible_ESV_line <- Bible_ESV %>% mutate(linenumber = row_number())
library(tidytext)
tidy_books <- Bible_ESV_line %>% unnest_tokens(word,Text)
library(tidyr)
tidy_books_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(Book, Chapter, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
ggplot(tidy_books_sentiment, aes(Chapter, sentiment, fill = Book)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~Book, ncol = 66, scales = "free_x") +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(),
        axis.ticks.x = element_blank(), plot.title = element_blank(),
        plot.subtitle = element_blank(), strip.text = element_blank())
When reading the Bible from cover to cover, we can see that it starts strongly positive with a sudden strong negative, follows a negative trajectory towards the middle, and then ends on a positive note in the New Testament.
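As a complementary view of this cover-to-cover trajectory (a sketch, not part of the original analysis), the syuzhet package loaded earlier can also smooth verse-level scores into a single narrative arc.
esv_bing <- get_sentiment(Bible_ESV$Text, method = "bing") # verse-level bing scores
simple_plot(esv_bing) # raw and smoothed sentiment trajectories across the whole text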
library(tidytext)
textTDM_tidy <- tidy(textTDM)
head(textTDM_tidy)
## # A tibble: 6 x 3
## term document count
## <chr> <chr> <dbl>
## 1 beginning 1 1.
## 2 created 1 1.
## 3 earth 1 1.
## 4 god 1 1.
## 5 heavens 1 1.
## 6 darkness 2 1.
textTDM_tidy_sentiments <- textTDM_tidy %>% inner_join(get_sentiments("bing"), by = c(term="word"))
head(textTDM_tidy_sentiments)
## # A tibble: 6 x 4
## term document count sentiment
## <chr> <chr> <dbl> <chr>
## 1 darkness 2 1. negative
## 2 darkness 4 1. negative
## 3 good 4 1. positive
## 4 darkness 5 1. negative
## 5 heaven 8 1. positive
## 6 good 10 1. positive
textTDM_tidy_sentiments %>%
count(sentiment, term, wt=count) %>%
ungroup() %>%
filter(n >= 150) %>%
mutate(n=ifelse(sentiment =="negative", -n, n))%>%
mutate(term = reorder(term, n)) %>%
ggplot(aes(term, n, fill=sentiment))+
geom_bar(stat = "identity")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ylab("Contribution to Sentiment")
As identified by the graph, “like” is the most common positive word. However, in nearly all instances in the Bible, “like” is a neutral word used in the context of similes, so I want to include it as a stop word.
First, I briefly review the stop word lists from the three lexicons available in stop_words.
stop_words %>% group_by(lexicon) %>% summarise(noOfWords = n())
## # A tibble: 3 x 2
## lexicon noOfWords
## <chr> <int>
## 1 onix 404
## 2 SMART 571
## 3 snowball 174
get_stopwords(language = "en", source = "snowball")
## # A tibble: 175 x 2
## word lexicon
## <chr> <chr>
## 1 i snowball
## 2 me snowball
## 3 my snowball
## 4 myself snowball
## 5 we snowball
## 6 our snowball
## 7 ours snowball
## 8 ourselves snowball
## 9 you snowball
## 10 your snowball
## # ... with 165 more rows
get_stopwords(language = "en", source = "smart")
## # A tibble: 571 x 2
## word lexicon
## <chr> <chr>
## 1 a smart
## 2 a's smart
## 3 able smart
## 4 about smart
## 5 above smart
## 6 according smart
## 7 accordingly smart
## 8 across smart
## 9 actually smart
## 10 after smart
## # ... with 561 more rows
stop_words %>% filter(lexicon == "onix")
## # A tibble: 404 x 2
## word lexicon
## <chr> <chr>
## 1 a onix
## 2 about onix
## 3 above onix
## 4 across onix
## 5 after onix
## 6 again onix
## 7 against onix
## 8 all onix
## 9 almost onix
## 10 alone onix
## # ... with 394 more rows
A few things I’ve noted: the SMART lexicon already contains “like”, the word I want removed, while the onix lexicon contains sentiment-bearing words such as “good” and “great” that I want to keep. So I build my custom stop word list from the SMART and snowball lexicons only, excluding onix.
tidy_custom_stop_words <- stop_words %>% filter(!lexicon == "onix")
tidy_custom_stop_words %>% group_by(lexicon) %>% summarise(noOfWords = n())
## # A tibble: 2 x 2
## lexicon noOfWords
## <chr> <int>
## 1 SMART 571
## 2 snowball 174
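As a quick check (not in the original post) that “like” will now be treated as a stop word, I can confirm it appears in the SMART lexicon, which I have kept.
tidy_custom_stop_words %>% filter(word == "like") # should return the SMART entry for "like"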
textTDM_tidy_sentiments %>%
anti_join(tidy_custom_stop_words, by=c("term" = "word")) %>%
count(sentiment, term, wt=count) %>%
ungroup() %>%
filter(n >= 150) %>%
mutate(n=ifelse(sentiment =="negative", -n, n))%>%
mutate(term = reorder(term, n)) %>%
ggplot(aes(term, n, fill=sentiment))+
geom_bar(stat = "identity")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
ylab("Contribution to Sentiment")
Here, I see the top positive words are great, good, holy, love, and heaven. And, unsurprisingly, the top negative words are evil, death, sin, fear, and dead.
So what is the overarching theme and the grand plot of the Bible based on my observations through this analysis? Text analysis tells me that it’s all about God, and His lordship over what comes of the way we choose to live our lives. There is little ambiguity, as God frequently makes promises throughout the text. Overall, the sentiment of the Bible has a balance of both positives and negatives. Words frequently used by the authors to characterize God are great, good, holy, and love, while words such as evil, death, and sin are used most to characterize Satan, the earthly world, and the human condition. The authors have a profound sentiment of trust, undoubtedly towards God.
So now when I read the Bible, I will keep this bigger story in mind. A couple of thoughts on where this analysis could go next:
The sentiment scores across the plot trajectory are based on the order in which the Bible is printed; however, it would be interesting to score the plot sentiment in the chronological order in which each book or chapter of the Bible was actually written by the authors.
Another concept not touched on, though not necessary for the purpose of my analysis, is stemming. I may try this at some point in the future and create a word cloud from it, just for fun.
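A rough sketch of what that might look like (not run as part of this analysis; the stemming backend and plotting package here are my assumptions, not choices from the original post):
library(SnowballC) # stemming backend used by tm
library(wordcloud)
stemmedCorpus <- tm_map(textCorpus, stemDocument) # e.g. "created"/"creates" both become "creat"
stemmedTDM <- TermDocumentMatrix(stemmedCorpus)
stemmedFreq <- sort(rowSums(as.matrix(stemmedTDM)), decreasing = TRUE)
set.seed(123) # for a reproducible layout
wordcloud(names(stemmedFreq), stemmedFreq, max.words = 100, colors = "dark blue")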