In this markdown, we will practice web scraping using the rvest package. Web scraping (also called web harvesting or web data extraction) is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser.
We will scrape data from the IMDb Top 250 movies. IMDb (Internet Movie Database) is an online database of information related to films, television programs, home videos, video games, and streaming content online, including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. An additional fan feature, message boards, was abandoned in February 2017. Originally a fan-operated website, the database is owned and operated by IMDb.com, Inc.
Web scraping is just one of various data gathering techniques. Why is data gathering important? Well, it doesn’t matter how ‘expert’ you are as a data scientist: if there’s no data, what are you going to do?
Note: If you plan to use this code for your own cases, please take a closer look at the chunk settings.
The goal of this project is to gather as much information as possible from the IMDb Top 250 movies, such as the name of the movie, director/writer, cast, budget, storyline, and more. We also analyze the gathered data. But the ultimate goal of gathering the data is to let people worldwide use it to exercise their analysis skills, so that we can gain new knowledge from their perspectives.
Web scraping is easier with proper tools. Web scraping basically means matching specific HTML/CSS attributes on a website and gathering whatever you want from them. The ‘matching’ step is rather difficult if you only look at the page source, so I recommend using the Google Chrome browser with the SelectorGadget extension (you can of course use another web browser). You can download and install the SelectorGadget extension here.
Open Google Chrome and go to https://www.imdb.com/chart/top/; this is where we start. We will visit each of the 250 movies on the list and gather the information. To do that, we first need to list the link of every one of the 250 movies. Activate SelectorGadget in the top right of your browser and click one of the movie titles.
The yellow highlight shows which elements are matched by the CSS selector chosen for the corresponding object. The problem is that some other elements are also selected. Click an irrelevant element to tell the gadget to deselect it; after that we get a clean CSS selector for the movie title.
Now we have the CSS selector for the movie titles: "#main a". Remember, we want the link for every movie; we don’t need the title text (yet). Right-click a random area and choose Inspect, or press Ctrl+Shift+I. Press Ctrl+Shift+C to select an element in the page to inspect. Hover over and click one title and you’ll get the full HTML element for the object.
The link is stored in the href attribute. Now we have all we need: the CSS selector that corresponds to the movie titles and the HTML attribute where the link is stored. Next we’ll jump to the rvest package and gather the information.
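library(rvest)    # read_html(), html_nodes(), html_attr(), html_text()
library(dplyr)    # %>%, slice(), distinct(), mutate(), filter(), select()
library(stringr)  # str_detect(), str_replace(), str_squish(), word()
main_url <- "https://www.imdb.com/chart/top/"
# get the html code, similar to 'view page source' in a browser
main_page <- read_html(main_url)
# select the css nodes (in this case #main a), then take their href attribute
movies_link <- html_nodes(main_page, "#main a") %>% html_attr("href")
head(movies_link,10)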
## [1] NA
## [2] "/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=X6KZ0Z5VNXVM6NHSJJPM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1"
## [3] "/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=X6KZ0Z5VNXVM6NHSJJPM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1"
## [4] "/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=X6KZ0Z5VNXVM6NHSJJPM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_2"
## [5] "/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=X6KZ0Z5VNXVM6NHSJJPM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_2"
## [6] "/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=X6KZ0Z5VNXVM6NHSJJPM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_3"
## [7] "/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=X6KZ0Z5VNXVM6NHSJJPM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_3"
## [8] "/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=X6KZ0Z5VNXVM6NHSJJPM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_4"
## [9] "/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=X6KZ0Z5VNXVM6NHSJJPM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_4"
## [10] "/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=X6KZ0Z5VNXVM6NHSJJPM&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_5"
We have a character vector of 502 elements filled with duplicated movie links. The first and last elements are NA (html_attr() returns NA for any matched element that has no href attribute), so we can simply remove them. The links are also incomplete: they are supposed to have “https://www.imdb.com” at the beginning. We can use the paste0 function to add it.
movies_link <- data.frame(movies_link) %>% slice(2:501) %>% distinct() %>%
mutate(link = paste0("https://www.imdb.com",movies_link))
head(movies_link,10)
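Slicing by position works as long as the NA entries stay in the first and last slots. Below is a minimal sketch of a variant that does not depend on position. The sample vector, the removal of the tracking query string, and the helper name build_movie_urls are illustrative assumptions, not part of the original code.

```r
# Hypothetical helper: turn raw href values into clean, absolute movie URLs.
build_movie_urls <- function(hrefs, base = "https://www.imdb.com") {
  hrefs <- hrefs[!is.na(hrefs)]      # drop matched anchors that had no href
  hrefs <- sub("\\?.*$", "", hrefs)  # drop the query string (assumed not needed)
  paste0(base, unique(hrefs))        # de-duplicate, then prefix the domain
}

# Made-up sample that mimics the shape of the scraped vector
sample_hrefs <- c(
  NA,
  "/title/tt0111161/?pf_rd_m=X&ref_=chttp_tt_1",
  "/title/tt0111161/?pf_rd_m=X&ref_=chttp_tt_1",
  "/title/tt0068646/?pf_rd_m=X&ref_=chttp_tt_2",
  NA
)
movie_urls <- build_movie_urls(sample_hrefs)
print(movie_urls)
```

Filtering on the value instead of the row number should keep working if the page ever adds or removes an anchor without an href.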
Voila! We’ve finished the first step. Now we have 250 links, one for every top-rated movie. Next, we will build a data-collector function to gather the information we need and loop the function over every listed link.
The good thing is that every movie page uses the same CSS elements. We can select one page, write code to gather the information, and apply the same code to every page. It will work because they share the same HTML/CSS template.
From now on, the steps are much the same: go to the page, open SelectorGadget, and inspect the element. Hover over the information you need, save the CSS selector or HTML element, and extract the information.
A simple CSS style actually makes the information gathering step harder. Look, the selector for the movie title is just a plain h1. Luckily, h1 is used only for the title. Common elements like h4 and a are everywhere in a blog-styled website like this. If we need to gather information from common elements like those, we will need further string-specific code. Anyway, let’s just take the movie title.
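# movie_1 is assumed to be the parsed page of the first movie on the list
movie_1 <- read_html(movies_link$link[1])
# html_text() extracts the text content of the selected html nodes
title <- html_nodes(movie_1,"h1") %>% html_text()
title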
## [1] "The Shawshank Redemption (1994) "
We got the title, but with extra whitespace. Since we’re going to gather lots of different kinds of text information, let’s build a simple text cleaner function to delete the extra whitespace and the “\n” line breaks that are very common on websites like this.
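The definition of the cleaner is not shown in this write-up. A minimal sketch of what such a function could look like, assuming it only needs to drop line breaks and collapse repeated whitespace, is given below; the body is a guess, not the original implementation.

```r
library(stringr)

# Hypothetical reconstruction of a text cleaner; the original definition is not shown.
name_cleaner <- function(x) {
  x <- str_replace_all(x, "[\\r\\n]+", " ")  # turn line breaks into spaces
  str_squish(x)                               # trim both ends, collapse repeated whitespace
}

# Made-up messy input
cleaned <- name_cleaner("\n    Frank\n Darabont  ")
print(cleaned)
```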
Let’s try the cleaner.
## [1] "The Shawshank Redemption (1994)"
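For the director and writers, there’s no single CSS element that covers only the text. Instead, we select the ‘invisible’ table that holds the information (the “.credit_summary_item” nodes, as used in the per_movies function further down). The ‘table’ contains both director and writers, so one piece of code gives us two pieces of information.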
## [1] "Director: Frank Darabont"
## [2] "Writers: Stephen King (short story \"Rita Hayworth and Shawshank Redemption\"), Frank Darabont (screenplay)"
## [3] "Stars: Tim Robbins, Morgan Freeman, Bob Gunton | See full cast & crew »"
We don’t need the stars/crew, so we will exclude them.
director_writer <- data.frame(summary) %>%
  mutate(Director = summary[1],
         Writers = summary[2]) %>%
  select(c(Director,Writers)) %>% distinct() %>%
  mutate(Director = ifelse(str_detect(Director, "Director:"),
                           name_cleaner(sub("Director: ","",Director)),
                           name_cleaner(sub("Directors: ","",Director))),
         Writers = ifelse(str_detect(Writers,"Writer:"),
                          name_cleaner(sub("Writer: ","",Writers)),
                          name_cleaner(sub("Writers: ","",Writers))))
The problem in web scraping is not only finding a good CSS/HTML element; we also need to deal with the unnecessary strings that come with it. After an inspection, I found that some movies actually have more than one director or writer, in which case the website changes the label to the plural form (e.g. “Writer:” becomes “Writers:”). That’s why I do the text cleaning with the ifelse() function combined with str_detect(). It might not be efficient, but it does get the job done.
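As an alternative to the ifelse() branches, one regular expression can cover the singular and plural labels together. This is only a sketch; the helper name strip_label and the sample strings are made up for illustration.

```r
# Hypothetical helper: remove a leading "Director:", "Directors:",
# "Writer:" or "Writers:" label, whichever is present.
strip_label <- function(x) {
  trimws(sub("^(Director|Writer)s?:\\s*", "", x))
}

# Made-up sample shaped like the credit summary text
credits <- c("Director: Frank Darabont",
             "Writers: Stephen King (short story), Frank Darabont (screenplay)")
stripped <- strip_label(credits)
print(stripped)
```

The optional `s?` in the pattern matches both label forms, so no branching on the label is needed.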
Sometimes we can’t rely on SelectorGadget alone. In this example, the gadget can’t pick a specific CSS selector for the “9.3/10” rating text we want. But it is there, and we can find it in the element inspector. So instead of “.rating_wrapper”, which contains unnecessary information, we are going to use “.ratingValue”, which sits inside the “.rating_wrapper” element.
rating <- data.frame(rating = html_nodes(movie_1,".ratingValue") %>% html_text(trim = T) %>%
str_replace("/10","")) %>% mutate(rating = as.numeric(rating))
rating
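To illustrate why the inner selector is the better choice, here is a small self-contained sketch on a made-up HTML fragment. The markup is an assumption written for this example; the real IMDb page is richer and may be structured differently.

```r
library(rvest)

# Made-up HTML fragment; not the real IMDb markup.
fragment <- read_html('
  <div class="rating_wrapper">
    <div class="ratingValue"><strong><span>9.3</span></strong><span>/</span><span>10</span></div>
    <span class="small">2,200,000 votes</span>
  </div>')

# The outer wrapper returns the rating mixed with unrelated text ...
wrapper_text <- fragment %>% html_nodes(".rating_wrapper") %>% html_text(trim = TRUE)

# ... while the inner node holds only the rating string.
rating_text <- fragment %>% html_nodes(".ratingValue") %>% html_text(trim = TRUE)
rating_value <- as.numeric(sub("/10", "", rating_text, fixed = TRUE))

print(rating_text)
print(rating_value)
```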
As before, the storyline is stored in a very common element, “span”. The good thing is that html_nodes() calls can be chained. So first we select the “#titleStoryLine” block, where lots of information is stored, then we chain it with the storyline element “span”. Then we simply extract the text.
## [1] "\n Edit\n "
## [2] " Chronicles the experiences of a formerly successful banker as a prisoner in the gloomy jailhouse of Shawshank after being found guilty of a crime he did not commit. The film portrays the man's unique way of dealing with his new, torturous life; along the way he befriends a number of fellow prisoners, most notably a wise long-term inmate named Red."
## [3] " \n Plot Summary\n |\n Plot Synopsis\n "
## [4] "|"
## [5] "wrongful imprisonment"
## [6] "|"
## [7] "based on the works of stephen king"
## [8] "|"
## [9] "prison"
## [10] "|"
## [11] "escape from prison"
## [12] "|"
## [13] "voice over narration"
## [14] "|"
## [15] "Rated R for language and prison violence"
## [16] "|"
## [17] "\n See all certifications »\n "
## [18] "\n View content advisory »\n "
But it’s not as easy as it looks. We also get other useless information that we need to remove. We can’t just select row number two, because we can’t take the risk that the row number is different on another movie page. This is where your string-manipulation skills come into play. See, the synopsis should be the longest string. We can count the characters in every row and then simply keep the row with the highest count. That will do the trick.
storyline <- data.frame(story = storyline) %>% mutate(number = nchar(story)) %>%
filter(number == max(number)) %>% select(story)
storyline
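A base R sketch of the same "keep the longest string" idea is shown below, using which.max() on a made-up character vector. Unlike filtering on the maximum, which.max() always returns exactly one position, which may matter in the unlikely case that two strings tie for the longest.

```r
# Made-up sample shaped like the span texts printed earlier
spans <- c("\n Edit\n ",
           " Chronicles the experiences of a formerly successful banker as a prisoner.",
           "|",
           "prison",
           "Rated R for language and prison violence")

# Position of the longest element, then trim the surrounding whitespace
longest <- trimws(spans[which.max(nchar(spans))])
print(longest)
```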
The genre is also a bit tricky. It’s stored in the same block as the storyline but in different nodes. The problem is that one movie can have more than one genre, separated by strings like “\n” or “|”. We need to code the cleaner so that, at the same time, it does no harm to a movie with a single genre. After several attempts, I propose this code. The trick is to manually replace the separator strings with a comma (,).
genres <- html_nodes(movie_1,"#titleStoryLine") %>% html_nodes(".inline") %>% html_text(trim = T) %>%
data.frame() %>% filter(grepl("Genres:\n",.)) %>% str_replace("Genres:\n","") %>%
str_replace_all("\n","") %>%
str_replace_all("[|]",",") %>% str_squish() %>%
data.frame() %>% setNames("genre")
genres
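The separator handling can be tried out in isolation on plain strings. The sketch below assumes the raw text looks like the two made-up samples; clean_field is a hypothetical helper name.

```r
library(stringr)

# Hypothetical helper: drop the label, turn "|" separators (with any
# surrounding whitespace or line breaks) into ", ", then tidy the rest.
clean_field <- function(x, label) {
  x <- str_replace(x, fixed(label), "")
  x <- str_replace_all(x, "\\s*\\|\\s*", ", ")
  str_squish(x)
}

single <- clean_field("Genres:\n Drama", "Genres:")
multi  <- clean_field("Genres:\n Crime\n|\n Drama", "Genres:")
print(single)
print(multi)
```

A value with a single entry passes through untouched apart from the label, which is the property the paragraph above asks for.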
Next we want to gather the release date, budget, worldwide gross revenue, and language. All of them are stored under the same “#titleDetails” element and framed with “.txt-block”. That makes the information easier to retrieve but, as with ‘storyline’, we need to pick the data by its string, not by row number.
details <- html_nodes(movie_1,"#titleDetails") %>% html_nodes(".txt-block") %>% html_text(trim = T) %>%
data.frame() %>% setNames("detail")
details
There’s a lot of unnecessary text in the release date. We need to remove it while keeping the actual date clean. Luckily, all movie pages have the same text and date format. We can remove the ‘Release Date:’ text, take the first three words, and drop the rest.
release_date <- details %>% filter(grepl("Release Date:",detail)) %>%
str_replace("Release Date: ","") %>% str_replace_all("\n","") %>%
word(start = 1,end = 3,sep = " ") %>% data.frame() %>% setNames("release_date")
release_date
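Here is the "first three words" step on its own, applied to a made-up string. The sample text is an assumption about what the raw detail looks like; the date format string is the one used later in the analysis.

```r
library(stringr)

# Made-up raw detail text
raw_date <- "Release Date: 14 October 1994 (USA)\n See more"

no_label  <- str_replace(raw_date, fixed("Release Date: "), "")
date_text <- word(no_label, start = 1, end = 3, sep = " ")  # day, month, year

# as.Date() reads month names in the current locale, so this assumes an English locale
release_date <- as.Date(date_text, format = "%d %B %Y")

print(date_text)
```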
Retrieving the budget is also tricky. There is stray text everywhere, and even the thousands are separated by commas. We need to remove those characters, and the good thing is that we can extract a digits-only string using str_extract with a bit of regex.
budget <- details %>% filter(grepl("Budget:",detail)) %>% str_replace_all(",","") %>%
str_extract("[[:digit:]]+") %>% data.frame() %>% setNames("budget") %>%
mutate(budget = as.numeric(budget))
budget
This works the same as the budget, but after several runs I found out that not every movie has a worldwide revenue record, especially old and non-American/European movies.
worldwide_gross <- details %>% filter(grepl("Cumulative Worldwide Gross:",detail)) %>%
str_replace_all(",","") %>% str_extract("[[:digit:]]+") %>%
data.frame() %>% setNames("gross") %>% mutate(gross = as.numeric(gross))
worldwide_gross
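Because some pages lack a field, it can help to make the missing case explicit. The sketch below works on a plain character vector and returns NA when the label is not found; extract_amount and the sample strings are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: find the entry containing `label`, strip the commas,
# and return the first run of digits as a number (NA if the label is absent).
extract_amount <- function(detail_text, label) {
  hit <- detail_text[grepl(label, detail_text, fixed = TRUE)]
  if (length(hit) == 0) return(NA_real_)
  no_commas <- gsub(",", "", hit[1], fixed = TRUE)
  digits <- regmatches(no_commas, regexpr("[0-9]+", no_commas))
  if (length(digits) == 0) NA_real_ else as.numeric(digits)
}

# Made-up detail strings; this sample has no worldwide gross entry
detail_text <- c("Budget:$25,000,000\n (estimated)", "Runtime: 142 min")

budget_value <- extract_amount(detail_text, "Budget:")
gross_value  <- extract_amount(detail_text, "Cumulative Worldwide Gross:")
print(budget_value)
print(gross_value)
```

Returning NA rather than an empty result keeps every movie at exactly one value per field, which makes combining the fields into one row more predictable.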
Much like genres, some movies can have more than one language, so we need to replace every “\n” or “|” separator with a comma (,).
languange <- details %>% filter(grepl("Language:\n",detail)) %>%
str_replace("Language:\n","") %>% str_replace_all("\n","") %>%
str_replace_all("[|]",",") %>% str_squish() %>%
data.frame() %>% setNames("languange")
languange
We have 10 main pieces of information from one movie. If you think that’s not enough, go ahead: find the CSS nodes, extract the information, and add it on your own. Anyway, let’s combine all of them.
film_1 <- cbind(title,director_writer,rating,storyline,
genres,release_date,budget,worldwide_gross,languange)
film_1
That’s it! We have managed to get 10 pieces of information from one movie. Now let’s make a function from it and apply the function to every one of the 250 movies.
In order to ‘automate’ the process, we need to create a function, apply it to the movie list, and simply sit and wait.
per_movies <- function(url){
  read <- read_html(url)
  # extract movies name
  title <- html_nodes(read,"h1") %>% html_text() %>% name_cleaner()
  # summary contain summary data of film
  summary <- html_nodes(read,".credit_summary_item") %>% html_text() %>% name_cleaner()
  # extract director and writer name
  director_writer <- data.frame(summary) %>%
    mutate(Director = summary[1],
           Writers = summary[2]) %>%
    select(c(Director,Writers)) %>% distinct() %>%
    mutate(Director = ifelse(str_detect(Director, "Director:"),
                             name_cleaner(sub("Director: ","",Director)),
                             name_cleaner(sub("Directors: ","",Director))),
           Writers = ifelse(str_detect(Writers,"Writer:"),
                            name_cleaner(sub("Writer: ","",Writers)),
                            name_cleaner(sub("Writers: ","",Writers))))
  # extract rating
  rating <- data.frame(rating = html_nodes(read,".ratingValue") %>% html_text(trim = T) %>%
                         str_replace("/10","")) %>% mutate(rating = as.numeric(rating))
  # extract storyline
  storyline <- html_nodes(read,"#titleStoryLine") %>% html_nodes("span") %>% html_text()
  storyline <- data.frame(story = storyline) %>% mutate(number = nchar(story)) %>%
    filter(number == max(number)) %>% select(story)
  # extract genre
  genres <- html_nodes(read,"#titleStoryLine") %>% html_nodes(".inline") %>% html_text(trim = T) %>%
    data.frame() %>% filter(grepl("Genres:\n",.)) %>% str_replace("Genres:\n","") %>%
    str_replace_all("\n","") %>%
    str_replace_all("[|]",",") %>% str_squish() %>%
    data.frame() %>% setNames("genre")
  # details contain budget, revenue, languange, etc
  details <- html_nodes(read,"#titleDetails") %>% html_nodes(".txt-block") %>% html_text(trim = T) %>%
    data.frame() %>% setNames("detail")
  # extract release date
  release_date <- details %>% filter(grepl("Release Date:",detail)) %>%
    str_replace("Release Date: ","") %>% str_replace_all("\n","") %>%
    word(start = 1,end = 3,sep = " ") %>% data.frame() %>% setNames("release_date")
  # extract budget
  budget <- details %>% filter(grepl("Budget:",detail)) %>% str_replace_all(",","") %>%
    str_extract("[[:digit:]]+") %>% data.frame() %>% setNames("budget") %>%
    mutate(budget = as.numeric(budget))
  # extract worldwide revenue
  worldwide_gross <- details %>% filter(grepl("Cumulative Worldwide Gross:",detail)) %>%
    str_replace_all(",","") %>% str_extract("[[:digit:]]+") %>%
    data.frame() %>% setNames("gross") %>% mutate(gross = as.numeric(gross))
  # extract languange
  languange <- details %>% filter(grepl("Language:\n",detail)) %>%
    str_replace("Language:\n","") %>% str_replace_all("\n","") %>%
    str_replace_all("[|]",",") %>% str_squish() %>%
    data.frame() %>% setNames("languange")
  return(cbind(title,director_writer,rating,storyline,genres,
               release_date,budget,worldwide_gross,languange))
}
Now we have a function that takes a URL/link as its input. The link is processed with read_html to get its page source, then each block of code does its job of gathering a specific piece of information. Let’s build a loop to apply it to every movie on the 250-movie list.
movie_list <- data.frame()
for (i in seq_along(movies_link$link)){
  message("Gather Movie Info #",i," /",length(movies_link$link))
  page <- per_movies(movies_link$link[i])
  movie_list <- rbind(movie_list,page)
}
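With 250 requests, a single failing page would stop the whole loop. A hedged sketch of a more forgiving variant follows: it wraps each call in tryCatch(), optionally pauses between requests, and binds the results once at the end. The names scrape_all and fake_scraper and the example.com URLs are made up; fake_scraper merely stands in for per_movies so the sketch runs without network access.

```r
# Hypothetical stand-in for per_movies(): fails on one URL to show the error path.
fake_scraper <- function(url) {
  if (grepl("broken", url)) stop("page layout not recognised")
  data.frame(title = basename(url), stringsAsFactors = FALSE)
}

# Apply `scraper` to every URL, skipping the ones that raise an error.
scrape_all <- function(urls, scraper, pause = 0) {
  pages <- lapply(urls, function(u) {
    Sys.sleep(pause)  # optional delay between requests
    tryCatch(
      scraper(u),
      error = function(e) {
        message("Skipping ", u, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  pages <- Filter(Negate(is.null), pages)  # drop the failed pages
  do.call(rbind, pages)                    # bind once instead of growing a data frame
}

urls <- c("https://example.com/a", "https://example.com/broken", "https://example.com/c")
movies <- scrape_all(urls, fake_scraper)
print(movies)
```

In real use one would pass per_movies as the scraper and set a pause of a second or so; both choices are suggestions rather than part of the original workflow.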
To save time, I save the output to a CSV file so I don’t need to run the loop every time I knit the Rmd to HTML.
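The saving code itself is not shown here. One possible caching pattern, sketched under the assumption that a plain CSV file is enough, is to read the file if it already exists and only rebuild it otherwise. load_or_build, build_demo and the temporary file path are hypothetical names used for the demonstration.

```r
# Hypothetical helper: reuse a cached CSV if present, otherwise build and save it.
load_or_build <- function(path, build) {
  if (file.exists(path)) {
    return(read.csv(path, stringsAsFactors = FALSE))
  }
  data <- build()
  write.csv(data, path, row.names = FALSE)
  data
}

# Demo with a temporary file and a counter that records how often we "scrape"
cache_file <- tempfile(fileext = ".csv")
build_calls <- 0
build_demo <- function() {
  build_calls <<- build_calls + 1
  data.frame(title = "The Shawshank Redemption", rating = 9.3)
}

first_run  <- load_or_build(cache_file, build_demo)  # builds and writes the file
second_run <- load_or_build(cache_file, build_demo)  # reads the cached file
print(build_calls)
```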
Let’s see what we got
We’ve succeeded in extracting information from the IMDb Top 250 rated movies! Now it’s up to you to analyze the data or share it with the public. As for me, I’ll do both. I’ll post the dataset together with the code.
Below is an additional analysis of IMDb top 250 movies. Let’s see what we can do
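library(lubridate)  # year()
library(ggplot2)    # plotting
library(scales)     # comma()
library(plotly)     # ggplotly()
movie_list <- movie_list %>% mutate(
  release_date = as.Date(release_date,"%d %B %Y"),
  # note: "revenue" here is worldwide gross minus budget
  revenue = gross - budget,
  century = ifelse(year(release_date) <= 2000,
                   "century_20","century_21")
)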
plot1 <- movie_list %>% group_by(century) %>% top_n(10,revenue) %>% na.omit() %>%
ggplot(aes(x = revenue/1000, y = reorder(title,revenue), group = century)) +
geom_col(aes(fill = century)) +
geom_text(aes(label = paste0("$",comma(revenue/1000))),
size = 2.5, hjust = 1,color = "white") +
scale_fill_manual(values = c("#000000","#f5c518")) +
scale_x_continuous(expand = c(0,0),
labels = comma) + theme_minimal() +
labs(title = "Top 20 Movies Revenue by Century",
subtitle = "In thousand US Dollar",
x = "Revenue", y = "",
fill = "") +
theme(legend.position = "bottom")
plot1
Please keep in mind that there are only 250 movies on our list, not every movie in the world. The IMDb rating system is based on user reviews and does not take into account how well a movie sold. So there’s a chance that a high-grossing movie is actually low-rated and that a high-rated movie didn’t sell well. Anyway, in this plot I put the 10 highest-revenue movies for each century. We can see that 21st-century movies lead the highest revenue of all time. Maybe that’s because more people are interested in entertainment, or because marketing has gotten better over time.
plot_2 <- movie_list %>% na.omit(release_date) %>% select(title, genre, rating, release_date) %>%
tidyr::separate_rows(genre, sep = ",") %>% mutate(genre = str_squish(genre),
genre = as.factor(genre)) %>%
group_by(genre) %>% summarise(n = n()) %>%
ggplot(aes(x = n, y = reorder(genre,n))) +
geom_col(aes(fill = n),show.legend = F) +
geom_text(aes(label = n),size = 3,hjust = 1.25,color = "white") +
scale_x_continuous(expand = c(0,0)) +
scale_fill_continuous(low = "#000000" , high = "#f5c518") +
theme_minimal() +
labs(x = "Frequency", y = "Genre\n",
title = "Top High-Rated Movie Genre*",
subtitle = "*One movie can have more than one genre")
plot_2
From the plot above we can see that people apparently love Drama movies. There are actually two possibilities: either people really like Drama movies, or lots of high-rated movies are tagged as Drama since one movie can have more than one genre. Thriller and Crime are in the top 5; this combination suggests people love movies that get the adrenaline pumping.
plot_3 <- movie_list %>% na.omit(release_date) %>% mutate(
text = paste("Title: ",title,
"<br> Rating: ",rating,
"<br> Genre: ",genre,
"<br> Worldwide Revenue: ",revenue)) %>%
ggplot(aes(x = release_date, y = rating, text = text)) +
geom_point(aes(color = rating),show.legend = F) +
scale_y_continuous(breaks = seq(7.90,10,0.2)) +
scale_color_continuous(low = "#000000" , high = "#f5c518") +
theme_minimal() +
labs(title = "Top Movie Rating of All Time",
subtitle = "Rated by IMDb",
x = "Release Date", y = "Rating\n")
ggplotly(plot_3,tooltip = "text")
The plot above shows the rating of each movie by its release date. We start from the 1920s, when most of the movies are in the drama-comedy genre. From 1960 to 2000 the trend changed to a drama-crime-adventure style, and personally I think this is the peak of the best movies humankind has ever produced. The highest-rated movies come from this era. People at that time loved drama-crime-war movies that not only bring adrenaline but are also relatable to daily life. Apparently they are still relevant today, which is when the people who rated them live.
It might be true that art is always relevant. People don’t judge art by whether it’s old or new. The technique, the artists, the messages, and the way the story is told make movies a form of art we can be proud of. I hope this article helps you gather data from HTML websites and deal with string problems. But before you start gathering data, make sure you read the website’s terms and policies about data scraping or web scraping. Some websites have strict rules about it. If you encounter one (like IMDb), make sure you use the data for personal use only.
Thanks!