SYNOPSIS
1.1
The goal of this activity is to scrape a spell list from the web, read the Harry Potter novels, and run a text mining analysis of spell usage in the books.
1.2
This problem was addressed by scraping the spell list, cleaning the data, and reading the novels in an automated fashion. The web pages are read with the rvest and XML packages; the stringr and tidyr packages, together with regular expressions, are used to clean the data before analysis.
1.3
This activity provides insight into how frequently each spell is used in the Harry Potter books.
2.PACKAGES
The purpose of each package is given in the comment beside it.
library(dplyr) #Data manipulation
library(rvest) #Web scraping
library(XML) #Parsing online pages
library(tidytext) #Text mining
library(compare) #Comparison between lists
library(wordcloud) #Word cloud creation
library(RColorBrewer) #Custom color palette for the word cloud
library(SnowballC) #Text stemming in the word cloud
library(ggplot2) #For graphs
library(tm) #Text mining
library(stringr) #Regular expressions
library(tidyverse) #Tidy universe functions
library(DT) #Datatable
3.Data Preparation
3.1 SOURCE:
Spell List is obtained from http://www.pojo.com/harrypotter/spelist.shtml
The Harry Potter books are obtained from http://www.readbooksvampire.com/J.K._Rowling/Harry_Potter_Series.html
3.2
The Harry Potter series has 7 books, each with between 17 and 38 chapters. Each chapter has a separate page, so a total of 200 pages are read to obtain the text of the books.
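The page total can be checked from the chapter counts. This is a minimal sketch that assumes the per-book counts given in the listBooks vector of the scraping code; the names chapterCounts and totalPages are illustrative only.

```r
# Chapters per book, in series order (illustrative vector, same counts as listBooks)
chapterCounts <- c(17, 19, 22, 37, 38, 30, 37)

# One web page per chapter, so the sum is the number of pages to read
totalPages <- sum(chapterCounts)

range(chapterCounts)  # smallest and largest chapter count
totalPages
```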
SPELL LIST We parse the page and access the node that contains the required information. Junk values and additional characters such as ,,, and @@ had to be removed with regular expressions. The scraped spell list had three types of NA values: " --", "--" and " ".
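The three placeholder strings can be handled in one step. The sketch below runs on a toy vector, not the scraped data; rawSpells, naMarkers and cleanSpells are hypothetical names chosen for illustration.

```r
# Toy vector standing in for the scraped cells; three kinds of placeholder mark missing spells
rawSpells <- c("Accio", " ", "Alohomora", "--", " --", "Lumos")
naMarkers <- c(" ", "--", " --")

# Replace every placeholder with NA in one step, then drop the NAs
rawSpells[rawSpells %in% naMarkers] <- NA
cleanSpells <- rawSpells[!is.na(rawSpells)]
cleanSpells
```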
Anomalies in the spell data:
1. Some of the spells are two words long, such as “Avada Kedavra” and “Expecto Patronum”. These were reduced to a single word (“Kedavra” and “Patronum” respectively) because unnest_tokens, used later, would otherwise treat the two words as separate spells even though they are one.
2. One of the spells is “Pack”, which is also an ordinary English word. It was replaced with a DummySpell placeholder because checking the context of use was outside the scope of this project.
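The first anomaly can be illustrated on a toy sentence. This sketch uses a small base R tokenizer as a stand-in for unnest_tokens; the tokenize function and the spellKey lookup are hypothetical and are not part of the project code.

```r
# Base R stand-in for word tokenization: lower-case the text, then split on non-letters
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words[words != ""]
}

# A two-word spell is split into two separate tokens
tokenize("Avada Kedavra")

# Hypothetical lookup: full spell name -> the single word kept in the dictionary
spellKey <- c("Avada Kedavra" = "kedavra", "Expecto Patronum" = "patronum")

sentence <- "He cried Expecto Patronum and a silver stag burst from his wand"

# Dictionary words that occur among the tokens of the sentence
found <- intersect(tokenize(sentence), unname(spellKey))
found
```

Keeping one distinctive word per spell means a single token match is enough to count the spell once.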
NOVELS In total, 200 pages had to be accessed to fetch the text of the novels.
3.3 DATA CLEANING
For the spell list, the goal was a list of one-word spells that can be used as a dictionary to compare against the words in the novels.
In the novels, the junk characters had to be deleted so that the text could be broken down into one word per row. Afterwards we apply an anti_join with “stop_words”, a dataset of commonly used words. Since our focus is on spells, we can remove these words from our dataset.
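The tokenizing and stop-word steps can be sketched on a single toy sentence. The sentence and the names chapter and tidyWords are made up for illustration; unnest_tokens, anti_join and stop_words are the same functions and dataset used in the project code.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for one chapter of text
chapter <- tibble(text = "Harry raised his wand and shouted Expecto Patronum")

tidyWords <- chapter %>%
  unnest_tokens(word, text) %>%        # one lower-cased word per row
  anti_join(stop_words, by = "word")   # drop common words such as "his" and "and"

tidyWords$word
```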
The web scraping and data cleaning code is below.
spellurl<-read_html("http://www.pojo.com/harrypotter/spelist.shtml")
tbls <- html_nodes(spellurl,"table")
spellbook<- tbls[7] %>% html_text() %>% gsub("[\r\n\t]+", "@@", .)
spelltemp<-str_split(spellbook, pattern = "@@")
spelllist<-spelltemp[[1]][seq(2,325,3)]
#The missing values are given by "--" , " " and " --"
spelllist[spelllist==" "]<-NA
spelllist[spelllist=="--"]<-NA
spelllist[spelllist==" --"]<-NA
spelllist<-na.omit(spelllist)
#Adapting two word spell to one word spell for text comparison
#Example : Avada Kedavra as "Kedavra"; Expecto Patronum as "Patronum"
spelllist[92]<-"Wingardium"
spelllist[87]<-"Revelio"
spelllist[32]<-"Finite"
spelllist[9]<-"Inimicum"
spelllist[6]<-"Kedavra"
spelllist[26]<-"Patronum"
spelllist[38]<-"Revelio"
spelllist[63]<-"Peskipiksi"
spelllist[81]<-"Salvio"
spelllist[70]<-"Horribilis"
spelllist[71]<-"Totalum"
spelllist[67]<-"Priori"
spelllist[64]<-"Petrificus"
spelllist[65]<-"Piertotum"
spelllist[78]<-"Repello"
spelllist[48]<-"Mortis"
spelllist[52]<-"Recanto"
spelllist[68]<-"Incantato"
spelllist[62]<-"DummySpell" #"Pack" is also a common English word, so it is replaced with a placeholder
fin1<-""
#Defining the book name and the number of chapters in the book
listBooks <- c("Harry_Potter_and_the_Philosophers_Stone",17,
"Harry_Potter_and_the_Chamber_of_Secrets",19,
"Harry_Potter_and_the_Prisoner_of_Azkaban",22,
"Harry_Potter_and_the_Goblet_of_Fire",37,
"Harry_Potter_and_the_Order_of_the_Phoenix",38,
"Harry_Potter_and_the_Half-Blood_Prince",30,
"Harry_Potter_and_the_Deathly_Hallows",37)
#Initializing empty vector to append each chapter to overall data
gkz<-vector()
#Function Name: readChapter
#Arguments: 1. bookName 2. Chapters (total chapters in the book)
#Use: 1. Scrapes each chapter of the book from its online link
#     2. Cleans the data
#     3. Returns the text of the complete book, one element per chapter
#Note: the chapters are appended to the gkz vector initialized above
readChapter <- function(bookName, Chapters) {
  for (lesson in seq_len(as.integer(Chapters))) {
    #Chapter pages are zero-padded to two digits: 01.html, 02.html, ..., 38.html
    chapterUrl <- paste0("http://www.readbooksvampire.com/J.K._Rowling/", bookName, "/", sprintf("%02d", lesson), ".html")
    bk1 <- read_html(chapterUrl)
    fl1 <- bk1 %>% html_nodes("td") %>% html_text()
    #The fifth table cell holds the chapter text; strip line breaks and surrounding quotes
    lsn <- fl1[5] %>% gsub("[\r\n]+", "", .) %>% gsub('^"', "", .) %>% gsub('"$', "", .)
    gkz <- c(gkz, lsn)
  }
  gkz
}
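The chapter URLs follow a fixed pattern, so they can be built and inspected without downloading anything. This is a minimal sketch; chapterUrls is a hypothetical helper and is not part of the project code.

```r
# Hypothetical helper: build the chapter URLs for one book without fetching any page
chapterUrls <- function(bookName, chapters) {
  baseUrl <- "http://www.readbooksvampire.com/J.K._Rowling/"
  # sprintf("%02d") zero-pads 1..9 to "01".."09" and leaves 10 and above unchanged
  paste0(baseUrl, bookName, "/", sprintf("%02d", seq_len(chapters)), ".html")
}

urls <- chapterUrls("Harry_Potter_and_the_Philosophers_Stone", 17)
head(urls, 2)
length(urls)
```

Zero-padding with sprintf covers chapters 1 to 9 and chapters 10 and above with the same expression.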
3.4
The prepared spell list contains the names of all the spells. After slicing and dicing the data it is one-dimensional, so we display only its head, because a datatable needs at least two-dimensional data.
#glimpse of spell dictionary
head(spelllist)
## [1] "Accio" "Aguamenti" "Alohomora" "Anapneo" "Aparecium" "Kedavra"
str(spelllist)
## atomic [1:92] Accio Aguamenti Alohomora Anapneo ...
## - attr(*, "na.action")=Class 'omit' int [1:16] 9 16 24 39 42 45 52 59 65 67 ...
class(spelllist)
## [1] "character"
3.5
We can see above that the spell list has 92 spells of class character. Below we break the text of the novels down into individual words and their counts, so the resulting structure has the columns word and n, where n is the number of times the word appears in the books.
bkz <- vector()
#Calling the readChapter function for each book
#Odd positions of listBooks hold the book names, even positions the chapter counts
for (valz in seq(1, 13, by = 2))
{
  print(listBooks[valz])
  #Accumulate the chapters of every book in bkz
  bkz <- c(bkz, readChapter(listBooks[valz], listBooks[valz + 1]))
}
#One word per row across all seven books, with common stop words removed
new_bkz <- tibble(gkz = bkz)
tidy_bkz <- new_bkz %>% unnest_tokens(word, gkz) %>% anti_join(stop_words, by = "word")
checkthis <- tidy_bkz %>% count(word, sort = TRUE)
#Keep only the words that appear in the lower-cased spell dictionary
finalSpellCnt <- checkthis[checkthis$word %in% tolower(spelllist), ]
## [1] "Harry_Potter_and_the_Philosophers_Stone"
## [1] "Harry_Potter_and_the_Chamber_of_Secrets"
## [1] "Harry_Potter_and_the_Prisoner_of_Azkaban"
## [1] "Harry_Potter_and_the_Goblet_of_Fire"
## [1] "Harry_Potter_and_the_Order_of_the_Phoenix"
## [1] "Harry_Potter_and_the_Half-Blood_Prince"
## [1] "Harry_Potter_and_the_Deathly_Hallows"
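The dictionary match itself can be sketched in base R on a toy word list. The vectors words and spells and the result spellCounts are made up for illustration; the project code does the same thing with count and %in% on the full text.

```r
# Toy word list standing in for the tokenized novels, and a three-spell dictionary
words <- c("accio", "wand", "lumos", "accio", "harry", "stupefy", "accio", "lumos")
spells <- c("Accio", "Lumos", "Stupefy")

# Count every word; the result has the columns word and n
counts <- as.data.frame(table(word = words), responseName = "n", stringsAsFactors = FALSE)

# Keep only the words found in the lower-cased dictionary, most frequent first
spellCounts <- counts[counts$word %in% tolower(spells), ]
spellCounts <- spellCounts[order(-spellCounts$n), ]
spellCounts
```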
#Bar graph of the 20 most frequent spells
finalSpellCnt %>% top_n(20, n) %>%
  ggplot() +
  geom_bar(mapping = aes(x = reorder(word, n), y = n), alpha = 0.8, stat = "identity") +
  labs(x = NULL, y = "Spells Count") +
  coord_flip() + ggtitle("Harry Potter Spell Count Top 20")
#Word cloud of the matched spells, plotted in decreasing order of frequency
collo <- brewer.pal(n = 10, name = "RdBu")
wordcloud(finalSpellCnt$word, finalSpellCnt$n, max.words = 100, random.order = FALSE, colors = rev(collo))
#Interactive table of the spell counts
datatable(finalSpellCnt)
4.EXPLORATORY DATA ANALYSIS
4.1
a. We can clearly see from the top 20 spell count bar graph and the word cloud that the patronum spell, “Expecto Patronum”, which conjures a Patronus, has been used the most.
b. Accio, the summoning charm, is the second most used spell in these books.
c. Stupefy (the stunning spell) and Expelliarmus (the disarming spell) are defence spells, and they are used an equal number of times.
d. The forbidden curses, which are dark magic, appear in the table as follows:
 1. Avada ‘Kedavra’ (Killing Curse) is used 21 times
 2. Crucio (Torture Curse) is used 15 times
 3. Imperio (Imperius Curse) is used 6 times
4.2/4.3/4.4
Please see the output
4.5
Some further insight into the data:
The Killing Curse (Avada ‘Kedavra’) appears 21 times, which suggests that up to 21 killings or attempted killings are described in the books. The count is of the word itself, so it includes mentions as well as casts.
Muffliato, the ear-buzzing spell cast to prevent eavesdropping, appears 11 times, which suggests that characters needed to keep a conversation private roughly that many times.
The Expecto ‘Patronum’ charm, which is used to defend against dementors, appears the most, which suggests that dementor attacks are frequent in the books. Lumos, the wand-lighting spell, appears 22 times, which suggests that characters had to enter dark places roughly 22 times.
5.1
The purpose of this project was to analyse the frequency of spell usage in the Harry Potter series. We did this by checking each word of the novels against a spell dictionary.
5.2
I used a bottom-up approach: I wrote a small portion of code for a single page and then built on top of it to cover all scenarios. I modularized the code by creating the readChapter function, which takes the book name and the number of chapters as input and provides the text of the book as output.
I used text mining and regular expression concepts to handle the textual data.
5.3
The most used spell is the one that summons a Patronus. The frequency of the other spells can be seen in the table, graph and word cloud. Expelliarmus and Stupefy are the spells of choice for defence against humans.
Thank you, Professor Brad, for providing me this opportunity. I enjoyed it so much that I coded for 20 hours straight :)
THE END