PACKAGES

library(dplyr)
library(rvest)
library(XML)
library(stringr)

1. Data Description

The purpose of this project is to scrape the Harry Potter books available online and analyze how spells are used throughout the series. The codebook is an online spellbook, from which all the spell names are scraped. The spell table marks some missing values with "--"; a few other irregular entries, caused by the page's markup, were handled as well.

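spellurl <- read_html("http://www.pojo.com/harrypotter/spelist.shtml")

tbls <- html_nodes(spellurl, "table")

# The seventh table holds the spell list; runs of line breaks and tabs become "@@" delimiters
spellbook <- tbls[7] %>% html_text() %>% gsub("[\r\n\t]+", "@@", .)
spelltemp <- str_split(spellbook, pattern = "@@")
# The spell name is every third field, starting from the second
spelllist <- spelltemp[[1]][seq(2, 325, 3)]

# Missing values appear as "--", " " and "  --"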
spelllist[spelllist==" "]<-NA
spelllist[spelllist=="--"]<-NA 
spelllist[spelllist=="  --"]<-NA
grymoire<-na.omit(spelllist)
grymoire
##  [1] "Accio"                 "Aguamenti"            
##  [3] "Alohomora"             "Anapneo"              
##  [5] "Aparecium"             "Avada Kedavra"        
##  [7] "Avifors"               "Avis"                 
##  [9] "Cave Inimicum"         "Colloportus"          
## [11] "Confringo"             "Confundus"            
## [13] "Conjunctivitis"        "Crucio"               
## [15] "Defodio"               "Deletrius"            
## [17] "Densaugeo"             "Deprimo"              
## [19] "Diffindo"              "Dissendium"           
## [21] "Duro"                  "Engorgio"             
## [23] "Ennervate"             "Episkey"              
## [25] "Erecto"                "Expecto Patronum"     
## [27] "Expelliarmus"          "Expulso"              
## [29] " "                     "Ferula"               
## [31] "Fidelius"              "Finite Incantatum"    
## [33] "Flagrate"              "Flipendo"             
## [35] "Furnunculus"           "Geminio"              
## [37] "Glisseo"               "Homenum Revelio"      
## [39] "Homorphus"             "Immobulus"            
## [41] "Impedimenta"           "Imperio"              
## [43] "Impervius"             "Incarcerous"          
## [45] "Incendio"              "Langlock"             
## [47] "Legilimens"            "Levicorpus"           
## [49] "Liberacorpus"          "Locomotor Mortis"     
## [51] "Lumos"                 "Meteolojinx Recanto"  
## [53] "Mobiliarbus"           "Mobilicorpus"         
## [55] "Morsmorde"             "Muffliato"            
## [57] "Nox"                   "Obliviate"            
## [59] "Obscuro"               "Oppugno"              
## [61] "Orchideous"            "Pack"                 
## [63] "Peskipiksi Pesternomi" "Petrificus Totalus"   
## [65] "Piertotum Locomotor"   "Point Me"             
## [67] "Priori Incantatum"     "Prior Incantato"      
## [69] "Protego"               "Protego Horribilis "  
## [71] "Protego Totalum "      "Quietus"              
## [73] "Reducio"               "Reducto"              
## [75] "Relashio"              "Rennervate "          
## [77] "Reparo"                "Repello Muggletum "   
## [79] "Rictusempra"           "Riddikulus"           
## [81] "Salvio Hexia"          "Scruge"               
## [83] "Sectumsempra"          "Serpensortia"         
## [85] "Silencio"              "Sonorus"              
## [87] "Specialis Revelio"     "Stupefy"              
## [89] "Tarantallegra"         "Tergeo"               
## [91] "Waddiwasi"             "Wingardium Leviosa"   
## attr(,"na.action")
##  [1]   9  16  24  39  42  45  52  59  65  67  72  83  85  94 103 106
## attr(,"class")
## [1] "omit"
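
The printed list still contains a blank-looking entry (index 29) and several names with trailing spaces, which suggests that exact string comparison misses some whitespace variants. A minimal sketch of a more tolerant cleanup is shown below, assuming the placeholders reduce to either an empty string or "--" once whitespace is trimmed. The `raw_spells` vector and the `clean_spells` helper are hypothetical names used only for illustration, and the `"[\\h\\v]"` pattern is the broader whitespace class suggested in the `trimws` documentation (R 3.6 or later).

```r
# Hypothetical sample standing in for the scraped spell column
raw_spells <- c("Accio", " ", "--", "  --", "Protego Totalum ", "Expulso", "\t")

clean_spells <- function(x) {
  # Trim any horizontal or vertical whitespace from both ends
  x <- trimws(x, whitespace = "[\\h\\v]")
  # After trimming, every placeholder is either empty or "--"
  x[x == "" | x == "--"] <- NA
  # Drop the missing entries and return a plain character vector
  x[!is.na(x)]
}

clean_spells(raw_spells)
```

Trimming first means a single comparison covers all the placeholder spellings, and the spell names themselves lose their trailing spaces, which matters later when they are matched against the book text.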

2. Importing Data & 3. Data Cleaning

Text from one sample book is scraped and shown here as a demo. In the final project, the text will be broken into tokens and compared against the spells from the spellbook to get the frequency of each spell. The scraped text is cleaned of unnecessary characters such as carriage returns, line breaks, and stray leading or trailing quotation marks. The first novel, Harry Potter and the Philosopher's Stone, is available online. Each novel is split across several web pages, so we iterate through all of its pages to get the complete text. For example, the first novel has 17 pages:

# Reading Book 1: the pages are named 01.html through 17.html
base_url <- "http://www.readbooksvampire.com/J.K._Rowling/Harry_Potter_and_the_Philosophers_Stone/"
book1 <- character(17)

for (lesson in 1:17) {
  # sprintf("%02d") zero-pads single-digit page numbers, so one branch covers every page
  bk1 <- read_html(paste0(base_url, sprintf("%02d", lesson), ".html"))
  fl1 <- bk1 %>% html_nodes("td") %>% html_text()
  # The fifth <td> holds the page text; strip line breaks and wrapping quotes
  lsn <- fl1[5] %>% gsub("[\r\n]+", "", .) %>% gsub('^"', "", .) %>% gsub('"$', "", .)
  # Store each page instead of overwriting it on the next iteration
  book1[lesson] <- lsn
}
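
The comparison of book text against the spellbook can be sketched as follows. This is only one possible approach, not the final implementation: because some spells are two words long ("Wingardium Leviosa"), the sketch counts whole-phrase matches in the text rather than comparing single tokens. The `sample_text`, `sample_spells`, and `count_spell` names are hypothetical, and the sample sentence is made up for illustration.

```r
# Made-up text standing in for one scraped page
sample_text <- paste(
  "Harry shouted Expelliarmus! Then Lumos, and again lumos.",
  "Hermione whispered Wingardium Leviosa."
)
sample_spells <- c("Lumos", "Expelliarmus", "Wingardium Leviosa", "Nox")

count_spell <- function(spell, text) {
  # Word boundaries keep a spell from matching inside a longer word
  pattern <- paste0("\\b", spell, "\\b")
  hits <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)[[1]]
  # gregexpr() returns -1 when there is no match, so count positive positions only
  sum(hits > 0)
}

# Named integer vector: one count per spell
counts <- vapply(sample_spells, count_spell, integer(1), text = sample_text)
sort(counts, decreasing = TRUE)
```

Building the pattern directly from the spell name works here because the names contain only letters and spaces; names with regex metacharacters would need escaping first.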

4. Planned Analysis

The analysis will produce:
1. A sorted list of spell usage across all the books
2. A histogram showing the usage
3. Term frequencies, shown with a facet wrap by book, for the spell usage in each book
This will provide two insights: the most used spell across all 7 books, and the book-wise usage of the spells.
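
The three planned outputs could be sketched roughly as below, assuming the counts end up in a long table with one row per book and spell. The `usage` data frame and its values are placeholders invented for illustration, not results, and the plots assume ggplot2 (which this report does not load) with a bar chart of counts standing in for the histogram of item 2.

```r
# Placeholder long-format counts: one row per (book, spell); values are not real results
usage <- data.frame(
  book  = c("Book 1", "Book 1", "Book 2", "Book 2"),
  spell = c("Lumos", "Alohomora", "Lumos", "Expelliarmus"),
  n     = c(1L, 2L, 3L, 4L),
  stringsAsFactors = FALSE
)

# 1. Sorted list of total usage across all books
totals <- aggregate(n ~ spell, data = usage, FUN = sum)
totals <- totals[order(-totals$n), ]

# 2 and 3. Bar charts of usage, overall and faceted by book (only if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p_all <- ggplot2::ggplot(totals, ggplot2::aes(x = reorder(spell, -n), y = n)) +
    ggplot2::geom_col()
  p_by_book <- ggplot2::ggplot(usage, ggplot2::aes(x = spell, y = n)) +
    ggplot2::geom_col() +
    ggplot2::facet_wrap(~ book)
}
```

Keeping the counts in long format means the same table feeds both the overall totals and the per-book facets.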