In my last blog, I told you that for our next adventure we’d take what we learnt about for-loops, scrape one of my absolute favourite books from the web and run some analyses on it. (Link to my last blog on for-loops.) I will get around to a full article on web-scraping at some point, but for now all you need to know is that the Internet is probably the biggest repository of data there is! And I love analyzing data - so let us tap into that repo.

As for that book I told you about, it is none other than the Holy Bible.
“Why the Bible?” you may be asking. To be honest, this book has gotten me through some great times, some not so great times and everything in between. So I thought we could take a closer look at it without reading it from cover to cover (although I’d highly recommend that).
Note: This analysis is in no way meant to be controversial, so please do not take it as such.

Again, I will be using R and some of its packages so please feel free to google them as I won’t go into too much detail on them.

So without further ado, let’s get to scraping.

For my analysis I used https://www.bible.com/ (this is not a sponsored mention; it is simply my personal go-to app for the Bible - both web & mobile)

Now consider the following: https://www.bible.com/bible/111/GEN.1.NIV
Almost all the books and chapters in this application seem to follow that pattern.
It seems to be "https://www.bible.com/bible/111/<Book Name>.<Chapter Number>.<Bible Version>"

<Book Name> is a three letter code. For example GEN represents Genesis.
<Chapter Number> represents the chapter number of the book above.
<Bible Version> is also a three letter code of the version of the Bible. For example NIV represents the New International Version of the Bible.
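
To make the pattern concrete, here is a minimal sketch that glues the three pieces together with paste0. The helper name build_chapter_url is my own invention for illustration, and I am assuming the "111" segment stays fixed because that is what the example URL shows.

```r
# Hypothetical helper: builds a chapter URL from the pattern described above.
# "111" is copied from the example URL and assumed to stay fixed.
build_chapter_url <- function(book, chapter, version = "NIV") {
  paste0("https://www.bible.com/bible/111/", book, ".", chapter, ".", version)
}

build_chapter_url("GEN", 1)
build_chapter_url("EXO", 1:3)  # paste0 is vectorised, so this returns three URLs
```

Nothing is downloaded here; the function only builds the address as text, which is all the scraping step needs as input.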

We can immediately see that we will need the first three letters of each book’s name in caps (give or take a few exceptions that we will patch later) and the number of chapters per book. Can we think about anything we learnt that might help us iterate through books and chapters?
#lightbulbs

You guessed it FOR-LOOPS! We can use for-loops to iterate through books and chapters.
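
Before touching the web, here is a small sketch of the idea: one loop over books, with a second loop inside it over that book's chapters. The two-row table is typed in by hand (Genesis has 50 chapters and Ruth has 4), and the names books and refs are just illustrative.

```r
# Two books typed in by hand; no web requests are made here.
books <- data.frame(Code = c("GEN", "RUT"), Chapters = c(50, 4),
                    stringsAsFactors = FALSE)

refs <- character(0)
for (k in seq_len(nrow(books))) {          # outer loop: one pass per book
  for (j in seq_len(books$Chapters[k])) {  # inner loop: chapters 1..n of that book
    refs <- c(refs, paste0(books$Code[k], ".", j))
  }
}

length(refs)   # 54 references: 50 for GEN plus 4 for RUT
head(refs, 3)
tail(refs, 3)
```

Each entry in refs is exactly the middle part of the URL pattern, so the real scrape only has to swap the paste0 call for a download.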

Let’s gather this data on the books and chapters of the Bible. To do this let’s perform our first webscrape. In the interest of time I am only doing the Old Testament but at the end maybe you can give the New Testament a go!
Now things get a little intense here, but ultimately we just want to get the data into a usable table. So stay with me. I found data that we can use on Wikipedia.

#webscrape old testament
require(stringi)
require(rvest)
require(stringr)
require(tidyr)
require(dplyr)
oldtestdata1 = read_html("https://simple.wikipedia.org/wiki/Old_Testament")%>% 
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/ul') %>% 
  html_text() #Read in old testament summarized data

#data clean up
oldtestdata1 <- str_replace_all(oldtestdata1, "[\n]" , "") #drop line breaks
oldtestdata1 <- str_replace_all(oldtestdata1, c("Chapters") , "") #drop the word "Chapters"...
oldtestdata1 <- str_replace_all(oldtestdata1, c("Chapter") , "") #...and the singular "Chapter"
oldtestdata1 <- data.frame(oldtestdata1)
oldtestdata1 <- separate_rows(oldtestdata1,1,sep = "[)]") #one row per entry: split on ")"
oldtestdata1 <- data.frame(do.call(rbind, str_split(oldtestdata1$oldtestdata1, '[(]'))) #split name and chapter count on "("
oldtestdata1 <- oldtestdata1[!apply(is.na(oldtestdata1) | oldtestdata1 == "", 1, all),] #drop empty rows
oldtestdata1 <- oldtestdata1[-c(16),] #drop row 16 (hard-coded for this page; re-check if the page changes)
oldtestdata1$Code <- toupper(substr(gsub(" ", "", oldtestdata1$X1),1,3)) #code = first three characters, spaces removed, upper case
oldtestdata1 <- oldtestdata1 %>%rename('Book' = X1,'Chapters' = X2)

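#replacing codes for Song of Solomon and Ezekiel
#(the site's codes are not always the first three letters, e.g. Judges may need JDG rather than JUD, so check the rest before scraping every book)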
oldtestdata1$Code<-gsub("THE", "SNG", oldtestdata1$Code)
oldtestdata1$Code<-gsub("EZE", "EZK", oldtestdata1$Code)
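
If more exceptions turn up, one gsub line per book gets tedious. Here is a sketch of an alternative that keeps all the exceptions in a single named vector. The function name make_codes and the sample book names are hypothetical; the two overrides are the same ones patched above.

```r
# Hypothetical helper: default code = first three characters (spaces removed,
# upper case), then any code listed in `overrides` is swapped for its replacement.
make_codes <- function(books, overrides = c(THE = "SNG", EZE = "EZK")) {
  codes <- toupper(substr(gsub(" ", "", books), 1, 3))
  hit <- codes %in% names(overrides)          # which codes need patching
  codes[hit] <- unname(overrides[codes[hit]]) # look up the replacement by name
  codes
}

# Illustrative book names, not the scraped data
make_codes(c("Genesis", "1 Samuel", "The Song of Songs", "Ezekiel"))
```

Adding another exception is then a matter of adding one more name = value pair to overrides.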

Well that was intense! Just to get 39 rows of data - which is the number of books in the Old Testament.
We could have easily copied and pasted the data into Excel, done some formatting and probably saved about fifteen minutes. However, think big picture and long term - what if it was more than 39 rows, say thousands of rows?
The skills developed here would definitely be useful! I will also do an article about cleaning data at some point.
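
To show the heart of that cleanup without the web request, here is a minimal base R sketch. I am assuming each raw entry looks like "Genesis (50 Chapters)", which is what the cleanup code above implies; the function name parse_book_line and the sample strings are made up for illustration.

```r
# Assumed raw format: "<Book> (<N> Chapters)"; sample strings typed by hand.
raw <- c("Genesis (50 Chapters)", "Ruth (4 Chapters)", "Obadiah (1 Chapter)")

# Hypothetical helper: splits each entry into a book name and a chapter count.
parse_book_line <- function(x) {
  data.frame(
    Book     = trimws(sub("\\(.*$", "", x)),                    # text before "("
    Chapters = as.integer(sub("^.*\\(([0-9]+).*$", "\\1", x)),  # digits after "("
    stringsAsFactors = FALSE
  )
}

parse_book_line(raw)
```

This gets the same two columns with two regular expressions, at the cost of being a bit harder to read than the step-by-step version.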

##             Book Chapters Code
## 1      Genesis        50   GEN
## 3       Exodus        40   EXO
## 4    Leviticus        27   LEV
## 5      Numbers        36   NUM
## 6  Deuteronomy        34   DEU
## 7       Joshua        24   JOS
## 8       Judges        21   JUD
## 9         Ruth         4   RUT
## 10    1 Samuel        31   1SA
## 11    2 Samuel        24   2SA

WARNING
Things are about to take a turn for the better here! All of the above was just groundwork so that we can build on the for-loop knowledge we gained in the last blog.

require(xml2)
require(rvest)
require(stringr)
require(tidyr)
require(dplyr)
require(tidyverse)
require(tictoc)
chapters = as.numeric(trimws(oldtestdata1$Chapters)) 
code = as.vector(oldtestdata1$Code)
mat = cbind(code, chapters) 
mat = matrix(mat,39,2) #Create matrix to iterate over two variables
#remove(textdata) #Uncomment this if running in your r-console
textdata = c() #Create an empty object that we keep appending scraped text to on each iteration
k=1 #Row of mat for the current book (starts at the first book)
tic("total")

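for (i in code[1] ){ #code[1] is GEN; use code to loop over every book
  for (j in 1:mat[k,2]){ #chapters 1..n of book k (mat stores text, which : converts back to a number)
    webbible = tryCatch(read_html(paste0("https://my.bible.com/bible/111/", i, ".",j,".NIV")), error = function(e){NA}) 
    if (!inherits(webbible, "xml_document")) next #skip a chapter whose page failed to download
    bibletext = webbible%>%html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "reader", " " ))]')%>% html_text()
    bibletext = str_replace_all(bibletext, "[\n]" , "") 
    textdata = rbind(textdata, bibletext)
  }
  k=k+1 #move on to the next book's row in mat
}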

toc()
## total: 69.3 sec elapsed
bibletextdata = data.frame(textdata)
print(head(gsub(" ", "", bibletextdata[1,]), 1)) 
## [1] "1TheBeginning1InthebeginningGodcreatedtheheavensandtheearth.2Nowtheearthwasformlessandempty,darknesswasoverthesurfaceofthedeep,andtheSpiritofGodwashoveringoverthewaters.3AndGodsaid,“Lettherebelight,”andtherewaslight.4Godsawthatthelightwasgood,andheseparatedthelightfromthedarkness.5Godcalledthelight“day,”andthedarknesshecalled“night.”Andtherewasevening,andtherewasmorning—thefirstday.6AndGodsaid,“Lettherebeavaultbetweenthewaterstoseparatewaterfromwater.”7SoGodmadethevaultandseparatedthewaterunderthevaultfromthewateraboveit.Anditwasso.8Godcalledthevault“sky.”Andtherewasevening,andtherewasmorning—thesecondday.9AndGodsaid,“Letthewaterundertheskybegatheredtooneplace,andletdrygroundappear.”Anditwasso.10Godcalledthedryground“land,”andthegatheredwatershecalled“seas.”AndGodsawthatitwasgood.11ThenGodsaid,“Letthelandproducevegetation:seed-bearingplantsandtreesonthelandthatbearfruitwithseedinit,accordingtotheirvariouskinds.”Anditwasso.12Thelandproducedvegetation:plantsbearingseedaccordingtotheirkindsandtreesbearingfruitwithseedinitaccordingtotheirkinds.AndGodsawthatitwasgood.13Andtherewasevening,andtherewasmorning—thethirdday.14AndGodsaid,“Lettherebelightsinthevaultoftheskytoseparatethedayfromthenight,andletthemserveassignstomarksacredtimes,anddaysandyears,15andletthembelightsinthevaultoftheskytogivelightontheearth.”Anditwasso.16Godmadetwogreatlights—thegreaterlighttogovernthedayandthelesserlighttogovernthenight.Healsomadethestars.17Godsettheminthevaultoftheskytogivelightontheearth,18togovernthedayandthenight,andtoseparatelightfromdarkness.AndGodsawthatitwasgood.19Andtherewasevening,andtherewasmorning—thefourthday.20AndGodsaid,“Letthewaterteemwithlivingcreatures,andletbirdsflyabovetheearthacrossthevaultofthesky.”21SoGodcreatedthegreatcreaturesoftheseaandeverylivingthingwithwhichthewaterteemsandthatmovesaboutinit,accordingtotheirkinds,andeverywingedbirdaccordingtoitskind.AndGodsawthatitwasgood.22Godblessedthemandsaid,“Befruitfulandincreaseinnumberandfillthewaterintheseas,
andletthebirdsincreaseontheearth.”23Andtherewasevening,andtherewasmorning—thefifthday.24AndGodsaid,“Letthelandproducelivingcreaturesaccordingtotheirkinds:thelivestock,thecreaturesthatmovealongtheground,andthewildanimals,eachaccordingtoitskind.”Anditwasso.25Godmadethewildanimalsaccordingtotheirkinds,thelivestockaccordingtotheirkinds,andallthecreaturesthatmovealongthegroundaccordingtotheirkinds.AndGodsawthatitwasgood.26ThenGodsaid,“Letusmakemankindinourimage,inourlikeness,sothattheymayruleoverthefishintheseaandthebirdsinthesky,overthelivestockandallthewildanimals,#1:26ProbablereadingoftheoriginalHebrewtext(seeSyriac);MasoreticTexttheearthandoverallthecreaturesthatmovealongtheground.”27SoGodcreatedmankindinhisownimage,intheimageofGodhecreatedthem;maleandfemalehecreatedthem.28Godblessedthemandsaidtothem,“Befruitfulandincreaseinnumber;filltheearthandsubdueit.Ruleoverthefishintheseaandthebirdsintheskyandovereverylivingcreaturethatmovesontheground.”29ThenGodsaid,“Igiveyoueveryseed-bearingplantonthefaceofthewholeearthandeverytreethathasfruitwithseedinit.Theywillbeyoursforfood.30Andtoallthebeastsoftheearthandallthebirdsintheskyandallthecreaturesthatmovealongtheground—everythingthathasthebreathoflifeinit—Igiveeverygreenplantforfood.”Anditwasso.31Godsawallthathehadmade,anditwasverygood.Andtherewasevening,andtherewasmorning—thesixthday."
## Sample data from the first entry, which comes from Genesis. I removed all the spaces only to keep this printed sample compact; the text stored in bibletextdata still has its spaces.

“KEVAN WHAT?!?!”
Well, before you get all worked up, I just want to walk you through that bit of code, because it may look intimidating but it really does something quite interesting. That for-loop just pulled all the text data from the book of Genesis (look at the beginning of the loop where I used code[1]). We can easily just remove the [1] and let it cycle through all the books.
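
One detail in that loop worth spelling out is tryCatch: it swaps a failed download for NA, so one bad page does not have to stop the whole run. Here is a sketch of that idea with no internet involved. fake_fetch is a hypothetical stand-in for read_html that fails on purpose for one chapter.

```r
# Hypothetical stand-in for read_html(): fails for chapter 2 on purpose.
fake_fetch <- function(chapter) {
  if (chapter == 2) stop("page not found")
  paste("text of chapter", chapter)
}

textdata <- c()
for (j in 1:3) {
  page <- tryCatch(fake_fetch(j), error = function(e) NA)
  if (is.na(page)) next              # skip the failed chapter, keep looping
  textdata <- rbind(textdata, page)  # append the chapter text as a new row
}

nrow(textdata)  # 2 rows: chapters 1 and 3
```

With a real read_html result the check would need to look at the object's class instead, for example inherits(page, "xml_document"), because is.na is meant for plain values like the text used here.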

Well, the first bit before the for-loop just prepares the attributes we want to use, like chapters and code (remember GEN, EXO, etc.).

I also created a matrix there because, guess what, we are going to iterate through those two variables (chapters & codes) - think about it, Genesis has 50 chapters, Exodus has 40 chapters and so on.
So I would want to go from GEN, 1 to 50 then EXO, 1 to 40….

See, we are looping through codes and chapters.
The next bit of code does just that! The for-loop iterates through books and chapters. See if you can detect the similarities between this for-loop and the last for-loop example I posted in the previous blog. Take some time to mull it over and break it down into phases, and please feel free to copy and paste the code into your own R-console and test it out.
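
If you want to see every book and chapter pair before running anything, here is a sketch that builds the full list up front with rep and sequence from base R. The three books are typed in by hand from the table above, and the data frame name refs is just illustrative.

```r
code     <- c("GEN", "EXO", "RUT")
chapters <- c(50, 40, 4)

refs <- data.frame(
  Code    = rep(code, times = chapters),  # each code repeated once per chapter
  Chapter = sequence(chapters),           # 1..50, then 1..40, then 1..4
  stringsAsFactors = FALSE
)

nrow(refs)     # 94 rows: 50 + 40 + 4
refs[49:52, ]  # the rows where GEN ends and EXO begins
```

With a table like this, a single loop over the rows would do the same job as the nested loop and the k counter, and it is easy to eyeball for mistakes first.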

Also included there is a neat package called tictoc, whose tic() and toc() functions let us measure how long the for-loop took to run. It took a little over one minute (69.3 seconds) to scrape the book of Genesis! We could never read Genesis in one minute (at least not me). At that pace of roughly 1.4 seconds per chapter, the 929 chapters of the Old Testament should take somewhere between 20 and 30 minutes to scrape! That’s pretty fast if you ask me and could most likely be faster in other deployments.
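
Here is the back-of-the-envelope arithmetic behind that kind of estimate, as a small sketch. The function name estimate_minutes is made up, the 69.3 seconds and 50 chapters come from the Genesis run above, and 929 is the number of chapters in the Old Testament. It assumes every chapter takes about as long as an average Genesis chapter, which will not be exactly true.

```r
# Hypothetical helper: scales the time measured for one book up to many chapters.
estimate_minutes <- function(seconds_measured, chapters_measured, chapters_total) {
  seconds_per_chapter <- seconds_measured / chapters_measured
  seconds_per_chapter * chapters_total / 60
}

# Genesis: 69.3 seconds for 50 chapters; Old Testament: 929 chapters
round(estimate_minutes(69.3, 50, 929), 1)  # about 21.5 minutes
```

Network speed and page size vary from chapter to chapter, so treat the result as a ballpark figure and leave yourself some headroom.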

Here you have seen the advanced beauty of the for-loop. With code in place of code[1], the same loop extracts all the text of the Old Testament and places it in a dataframe where each row represents a chapter of the Bible, in book and chapter order, using the NIV version.
As mentioned there are other parts of this I will cover in the future (we have time, since at the time of writing this, we are still quarantined).
However in the next part of this blog, I want to analyze this text we just gathered using for-loops as we enter into a new phase called Natural Language Processing & Text Mining.

What can we detect from the Old Testament text? Most frequently used words? Word associations? A summary of the Old Testament? Other questions we could think about would be - how do different versions of the Bible compare? Are they significantly different? There are so many types of analysis we can do with this text data.
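
As a tiny taste of the first question, here is a base R sketch that counts words in a single verse from Genesis 1 (with the quotation marks left out). It is only an illustration of the idea, not the analysis itself, and the variable names are my own.

```r
verse <- "And God said, Let there be light, and there was light."

# lower-case the text, strip punctuation, then split on whitespace
words  <- strsplit(gsub("[[:punct:]]", "", tolower(verse)), "\\s+")[[1]]
counts <- sort(table(words), decreasing = TRUE)

head(counts, 3)  # the most frequent words in this verse
```

Notice that the split relies on the spaces between words, so a count like this would run on the stored text rather than on the space-stripped preview printed earlier.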

All this and more coming next time! See you then!

Some other interesting articles on:
Webscraping
R Packages
Natural Language Processing