We start by loading the R package we need for scraping, “rvest”.

library(rvest)
## Warning: package 'rvest' was built under R version 3.5.2
## Loading required package: xml2
## Warning: package 'xml2' was built under R version 3.5.2

Now let’s take the top-grossing action movies of 2018 at the US box office, ranked 1-50 on IMDb, and specify the page that needs to be scraped.

url <- 'https://www.imdb.com/search/title?title_type=feature&release_date=2018-01-01,2018-12-31&genres=action&sort=boxoffice_gross_us,desc'
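
As a side note, the query string in that URL is just a set of key=value filters. Here is a minimal sketch of how it could be assembled from its parts in base R. The names `base_url`, `params`, and `query` are my own for illustration and are not part of the original code.

```r
# Sketch: rebuild the search URL from its individual filters.
base_url <- "https://www.imdb.com/search/title"

# Each element is one filter taken from the URL above.
params <- c(
  title_type   = "feature",
  release_date = "2018-01-01,2018-12-31",
  genres       = "action",
  sort         = "boxoffice_gross_us,desc"
)

# Join every name/value pair with "=", then join the pairs with "&".
query <- paste(names(params), params, sep = "=", collapse = "&")
url <- paste0(base_url, "?", query)
print(url)
```

Changing the dates or the genre then only means editing one element of `params`.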

Let’s read in the HTML from that URL

webpage <- read_html(url)

Using a CSS selector tool in Chrome, we identified the CSS selector that matches the rank of each movie

rank_data_html <- html_nodes(webpage,'.text-primary')

Now let’s convert it into text

rank_data <- html_text(rank_data_html)
#let's check if that conversion worked 
head(rank_data)
## [1] "1." "2." "3." "4." "5." "6."
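
To see what `html_nodes()` and `html_text()` are doing without hitting the live site, here is a small sketch on a made-up HTML snippet. The markup and the `toy_` names are hypothetical. Only the `.text-primary` class is taken from the real selector above.

```r
library(rvest)

# Hypothetical stand-in for the real page: two entries, each with a rank span.
toy_html <- '<html><body>
  <div><span class="text-primary">1.</span><a>First Film</a></div>
  <div><span class="text-primary">2.</span><a>Second Film</a></div>
</body></html>'

# read_html() also accepts a literal HTML string, not only a URL.
toy_page <- read_html(toy_html)

# Select every element with class "text-primary", then pull out its text.
toy_rank_nodes <- html_nodes(toy_page, ".text-primary")
toy_ranks <- html_text(toy_rank_nodes)
print(toy_ranks)
```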

It seems to work just fine. However, R currently treats the ranks as text rather than numeric data, meaning it reads the numbers as strings of characters. You can tell because R prints each value in quotation marks. I’m sure we’ve all run into this nuisance in Excel at some point, so let’s perform the conversion.

rank_data<-as.numeric(rank_data)
#now let's check it again 
head(rank_data)
## [1] 1 2 3 4 5 6
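
Here is a quick sketch of that conversion on a small hand-typed vector (the variable names are just for illustration). It also shows what to watch out for: anything `as.numeric()` cannot parse turns into `NA`.

```r
# Ranks as they come off the page: character strings with a trailing dot.
rank_text <- c("1.", "2.", "3.")
print(is.character(rank_text))

# as.numeric() parses "1." as the number 1.
rank_num <- as.numeric(rank_text)
print(is.numeric(rank_num))

# A string that is not a number becomes NA (R also raises a warning,
# which is silenced here to keep the example quiet).
bad <- suppressWarnings(as.numeric(c("4.", "n/a")))
print(anyNA(bad))
```

Checking `anyNA()` after a conversion is a cheap way to catch values that did not parse.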

We’re going to repeat the same process for the titles of each film and for the remaining fields. This time I’m going to chunk all of the code together rather than break it out step by step as I did above.

#Titles 
title_data_html <- html_nodes(webpage,'.lister-item-header a')
title_data <- html_text(title_data_html)
head(title_data)
## [1] "Black Panther"                  "Avengers: Infinity War"        
## [3] "Incredibles 2"                  "Jurassic World: Fallen Kingdom"
## [5] "Deadpool 2"                     "Aquaman"
#Descriptions 
description_data_html <- html_nodes(webpage,'.ratings-bar+ .text-muted')
description_data <- html_text(description_data_html)
head(description_data)
## [1] "\n    T'Challa, heir to the hidden but advanced kingdom of Wakanda, must step forward to lead his people into a new future and must confront a challenger from his country's past."                               
## [2] "\n    The Avengers and their allies must be willing to sacrifice all in an attempt to defeat the powerful Thanos before his blitz of devastation and ruin puts an end to the universe."                           
## [3] "\n    The Incredibles hero family takes on a new mission, which involves a change in family roles: Bob Parr (Mr Incredible) must manage the house while his wife Helen (Elastigirl) goes out to save the world."  
## [4] "\n    When the island's dormant volcano begins roaring to life, Owen and Claire mount a campaign to rescue the remaining dinosaurs from this extinction-level event."                                             
## [5] "\n    Foul-mouthed mutant mercenary Wade Wilson (AKA. Deadpool), brings together a team of fellow mutant rogues to protect a young boy with supernatural abilities from the brutal, time-traveling cyborg, Cable."
## [6] "\n    Arthur Curry learns that he is the heir to the underwater kingdom of Atlantis, and must step forward to lead his people and be a hero to the world."
description_data<-gsub("\n","",description_data)
head(description_data)
## [1] "    T'Challa, heir to the hidden but advanced kingdom of Wakanda, must step forward to lead his people into a new future and must confront a challenger from his country's past."                               
## [2] "    The Avengers and their allies must be willing to sacrifice all in an attempt to defeat the powerful Thanos before his blitz of devastation and ruin puts an end to the universe."                           
## [3] "    The Incredibles hero family takes on a new mission, which involves a change in family roles: Bob Parr (Mr Incredible) must manage the house while his wife Helen (Elastigirl) goes out to save the world."  
## [4] "    When the island's dormant volcano begins roaring to life, Owen and Claire mount a campaign to rescue the remaining dinosaurs from this extinction-level event."                                             
## [5] "    Foul-mouthed mutant mercenary Wade Wilson (AKA. Deadpool), brings together a team of fellow mutant rogues to protect a young boy with supernatural abilities from the brutal, time-traveling cyborg, Cable."
## [6] "    Arthur Curry learns that he is the heir to the underwater kingdom of Atlantis, and must step forward to lead his people and be a hero to the world."
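
Notice that removing `\n` still leaves four leading spaces. One possible alternative, sketched here on a made-up string, is base R’s `trimws()`, which strips leading and trailing whitespace (including newlines) in a single call.

```r
# Hypothetical description in the same shape as the scraped ones.
raw_description <- "\n    A short plot summary."

# trimws() removes spaces, tabs and newlines from both ends of the string.
clean_description <- trimws(raw_description)
print(clean_description)
```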
#Runtime
runtime_data_html <- html_nodes(webpage,'.runtime')
runtime_data <- html_text(runtime_data_html)
head(runtime_data)
## [1] "134 min" "149 min" "118 min" "128 min" "119 min" "143 min"
runtime_data<-gsub(" min","",runtime_data)
head(runtime_data)
## [1] "134" "149" "118" "128" "119" "143"
runtime_data<-as.numeric(runtime_data)

#Genre
genre_data_html <- html_nodes(webpage,'.genre')
genre_data <- html_text(genre_data_html)
head(genre_data)
## [1] "\nAction, Adventure, Sci-Fi            "   
## [2] "\nAction, Adventure, Fantasy            "  
## [3] "\nAnimation, Action, Adventure            "
## [4] "\nAction, Adventure, Sci-Fi            "   
## [5] "\nAction, Adventure, Comedy            "   
## [6] "\nAction, Adventure, Fantasy            "
genre_data<-gsub("\n","",genre_data)
genre_data<-gsub(" ","",genre_data)
#taking only the first genre of each movie
genre_data<-gsub(",.*","",genre_data)
#Converting each genre from text to factor
genre_data<-as.factor(genre_data)
head(genre_data)
## [1] Action    Action    Animation Action    Action    Action   
## Levels: Action Animation
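
Here is a sketch of the same genre cleanup on two of the raw strings shown above, using `trimws()` and `sub()`. It also shows, as an optional variation not used in this post, how `strsplit()` could keep every genre instead of only the first. The variable names are illustrative.

```r
# Two raw genre strings copied from the output above.
raw_genre <- c(
  "\nAction, Adventure, Sci-Fi            ",
  "\nAnimation, Action, Adventure            "
)

# Trim the surrounding whitespace, then drop everything from the first comma on.
first_genre <- sub(",.*", "", trimws(raw_genre))
print(first_genre)

# Variation: split on commas to keep all genres for each movie as a list.
all_genres <- strsplit(trimws(raw_genre), ",\\s*")
print(all_genres[[1]])
```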
#Ratings
rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')
rating_data <- html_text(rating_data_html)
head(rating_data)
## [1] "7.4" "8.5" "7.8" "6.2" "7.8" "7.5"
rating_data<-as.numeric(rating_data)

#Votes 
votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')
votes_data <- html_text(votes_data_html)
head(votes_data)
## [1] "439,684" "562,145" "161,993" "189,594" "330,780" "116,668"
votes_data<-gsub(",","",votes_data)
head(votes_data)
## [1] "439684" "562145" "161993" "189594" "330780" "116668"
votes_data<-as.numeric(votes_data)
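
Runtime and votes follow the same pattern: strip the non-numeric characters, then convert. A small helper could capture that. Note that `to_number()` is a hypothetical name of my own, not an rvest function.

```r
# Hypothetical helper: keep only digits and decimal points, then convert.
to_number <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

# Sample values copied from the output above.
votes_example <- to_number(c("439,684", "562,145"))
runtime_example <- to_number(c("134 min", "149 min"))
print(votes_example)
print(runtime_example)
```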

#Metascore
metascore_data_html <- html_nodes(webpage,'.ratings-metascore')
metascore_data <- html_text(metascore_data_html)
metascore_data<-gsub(" ","",metascore_data)
metascore_data<-gsub("\n","",metascore_data)
metascore_data<-gsub("Metascore","",metascore_data)
metascore_data<-as.numeric(metascore_data)
head(metascore_data)
## [1] 88 68 80 51 66 55
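
One thing to be careful about (an assumption on my part, since the output above does not show it): if a movie on the page has no Metascore, `html_nodes()` simply skips it, so the vector could end up shorter than the other columns and no longer line up by row. A possible way around this, sketched on made-up HTML, is to select each movie’s container first and then call `html_node()` (singular) on it, which gives `NA` where nothing matches. The `.lister-item-content` class and the `toy_` names are hypothetical here.

```r
library(rvest)

# Hypothetical page with three movies; the second one has no Metascore.
toy_html <- '<html><body>
  <div class="lister-item-content"><h3>Film A</h3>
    <div class="ratings-metascore"><span>88</span> Metascore</div></div>
  <div class="lister-item-content"><h3>Film B</h3></div>
  <div class="lister-item-content"><h3>Film C</h3>
    <div class="ratings-metascore"><span>66</span> Metascore</div></div>
</body></html>'

toy_page <- read_html(toy_html)

# One node per movie, so the result always has one entry per movie.
toy_items <- html_nodes(toy_page, ".lister-item-content")

# html_node() returns a "missing" node where the selector finds nothing,
# and html_text() turns that into NA.
toy_score_nodes <- html_node(toy_items, ".ratings-metascore span")
toy_metascore <- as.numeric(trimws(html_text(toy_score_nodes)))
print(toy_metascore)
```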

Okay, it looks like I have all the variables I need, so let’s combine them into one coherent data set. Note that Metascore is left out of the data frame below.

movies_df<-data.frame(Rank = rank_data, Title = title_data,
                      Description = description_data, Runtime = runtime_data,
                      Genre = genre_data, Rating = rating_data, Votes = votes_data)

Let’s check the structure of this wonderful dataset, or as R calls it, a “data frame”

str(movies_df)
## 'data.frame':    50 obs. of  7 variables:
##  $ Rank       : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title      : Factor w/ 50 levels "12 Strong","A-X-L",..: 7 6 18 19 11 5 22 4 35 50 ...
##  $ Description: Factor w/ 50 levels "    12 Strong tells the story of the first Special Forces team deployed to Afghanistan after 9/11; under the le"| __truncated__,..: 35 37 39 44 26 17 24 18 23 42 ...
##  $ Runtime    : num  134 149 118 128 119 143 147 118 135 112 ...
##  $ Genre      : Factor w/ 2 levels "Action","Animation": 1 1 2 1 1 1 1 1 1 1 ...
##  $ Rating     : num  7.4 8.5 7.8 6.2 7.8 7.5 7.8 7.2 7 6.8 ...
##  $ Votes      : num  439684 562145 161993 189594 330780 ...
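
Title and Description show up as factors because older versions of R (before 4.0.0) convert strings to factors by default in `data.frame()`. If you would rather keep them as plain text, a minimal sketch is to pass `stringsAsFactors = FALSE`. The `toy_df` below is just a two-row illustration using titles from the output above.

```r
# Two-row illustration; the real call would list all seven columns.
toy_df <- data.frame(
  Rank = c(1, 2),
  Title = c("Black Panther", "Avengers: Infinity War"),
  stringsAsFactors = FALSE  # keep Title as character in every R version
)

print(class(toy_df$Title))
```

Genre is converted with `as.factor()` before the data frame is built, so it would stay a factor either way.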

All that’s left is some analysis. I’ll keep it really simple and compare rating vs. rank.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
ggplot(movies_df,aes(x=Rank,y=Rating))+ geom_point(aes(size=Rating,col=Genre))
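
If you want a number to go with the plot, one option is a rank correlation. This is only a sketch using the first six ranks and ratings printed earlier, so the value itself means little. On the full data you would pass `movies_df$Rank` and `movies_df$Rating` instead.

```r
# First six ranks and ratings, copied from the head() output earlier.
rank_head <- c(1, 2, 3, 4, 5, 6)
rating_head <- c(7.4, 8.5, 7.8, 6.2, 7.8, 7.5)

# Spearman correlation compares the two orderings rather than the raw values.
rho <- cor(rank_head, rating_head, method = "spearman")
print(rho)
```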

My next project will use the description text to create a word graph showing which words are used most.