IMDb Crawler
Hello everyone! Writing projects in Quarto or Jupyter helps me show the process of building R or Python projects in a way that is easy to follow.
The following packages are used in this project, so make sure to install them.
library(rvest)
library(lubridate)
library(tidyverse)
Crawling through IMDb
The first step in crawling or scraping IMDb is to understand how IMDb retrieves the information it displays. You can get a general idea from the URL.
After that, we can break the task of gathering the information for all seasons into getting the information for one season at a time, then moving on to the next season until we reach the last one.
Retrieving the information for a single season will be one function, and looping through all the seasons will be another function that uses the first one.
Having one loop do everything can be considered bad practice.
Functions are generally used to break a big process into small pieces.
The URL
The format for crawling through any TV series on IMDb is:
https://www.imdb.com/title/ + <id of the series> + /episodes?season= + <season number>
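As a minimal sketch of that format, the pieces can be glued together with str_c(). The helper name season_base_url is hypothetical and not part of the original code. It returns the URL without the season number, which is the form the functions in this post take as their first argument.

```r
library(stringr)

# Hypothetical helper: builds the base URL for a series id.
# The season number is appended later, one season at a time.
season_base_url <- function(series_id) {
  str_c("https://www.imdb.com/title/", series_id, "/episodes?season=")
}

# tt0383795 is the id used for The Joy of Painting in this post.
joy_url <- season_base_url("tt0383795")
joy_url

# Appending a season number gives the full URL for that season.
str_c(joy_url, 3)
```

Keeping the season number out of the helper means the same base URL can be reused for every season of a series.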
Scraping Single Season
As described above, the first thing to do is crawl and retrieve the information for only one season.
This is the job of season_data. It gets the HTML contents for a given series and season.
After that, it parses the information about each episode as text and returns a tibble containing that information.
Another way to read the statements above: get the information about the episodes and return a table with that information.
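The parsing step can be tried offline before touching IMDb. The sketch below uses a small, made-up HTML fragment that imitates the structure the selectors in season_data expect. The fragment and the names sample_page, episodes, and sample_table are assumptions for illustration only, not IMDb's real markup.

```r
library(rvest)
library(tibble)

# Hypothetical fragment: two episodes, the second one without a rating.
sample_page <- minimal_html('
  <div class="list_item">
    <div class="airdate">11 Jan. 1983</div>
    <strong>A Walk in the Woods</strong>
    <span class="ipl-rating-star__rating">9.1</span>
  </div>
  <div class="list_item">
    <div class="airdate">18 Jan. 1983</div>
    <strong>Ebony Sunset</strong>
  </div>
')

# html_elements() returns every matching node: one per episode.
episodes <- sample_page |> html_elements(".list_item")

# html_element() looks inside each episode node and returns one result per node,
# so every column has the same length as the number of episodes.
sample_table <- tibble(
  episode_air_date = episodes |> html_element(".airdate") |> html_text2(),
  episode_name = episodes |> html_element("strong") |> html_text2(),
  episode_rate = episodes |> html_element(".ipl-rating-star__rating") |> html_text2()
)
sample_table
```

Using html_element() (singular) on the episode nodes is what keeps the columns aligned: an episode that lacks a field should come back as a missing value instead of shifting the other rows.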
# The function season_data retrieves the HTML for a given series and season.
# After that, it parses the information about each episode as text and returns a tibble containing that information.
season_data <- function(season_url, season_number) {
  # Build the full URL by appending the season number to the base URL
  current_url <- str_c(season_url, season_number)
  # Each ".list_item" node holds one episode
  season_serie <- read_html(current_url) |>
    html_elements(".list_item")
  # html_element() returns one result per episode node, so the columns stay aligned
  seasons_episodes <- tibble(
    Season = season_number,
    episode_air_date = season_serie |> html_element(".airdate") |> html_text2(),
    episode_name = season_serie |> html_element("strong") |> html_text2(),
    episode_rate = season_serie |> html_element(".ipl-rating-star__rating") |> html_text2(),
    episode_votes = season_serie |> html_element(".ipl-rating-star__total-votes") |> html_text2(),
    episode_description = season_serie |> html_element(".item_description") |> html_text2()
  )
  return(seasons_episodes)
}
Scraping All Seasons
Now we have a general idea of how to get the information for a single season, but we need to loop through several seasons. That is what all_seasons does: it goes through each season, calling season_data until we reach the last season we want.
Another way to read the statements above: go season by season and add each one to the table containing the information for all the seasons.
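The "add it to the table" step relies on bind_rows(), which stacks one table under another. Here is a small offline sketch of that pattern. The tibbles season_one and season_two are made-up stand-ins for what two calls to season_data might return.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for the tables returned for two seasons.
season_one <- tibble(Season = 1L, episode_name = c("First", "Second"))
season_two <- tibble(Season = 2L, episode_name = "Third")

# Start from an empty tibble and stack one season per iteration.
combined <- tibble()
for (season_table in list(season_one, season_two)) {
  combined <- bind_rows(combined, season_table)
}
combined
```

Starting from an empty tibble means the first season needs no special handling, because binding rows to an empty table simply returns the new rows.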
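all_seasons <- function(url, num_seasons) {
  # Start with an empty table and append one season at a time
  seasons_table <- tibble()
  # seq_len() yields no iterations when num_seasons is 0, unlike 1:num_seasons
  for (season in seq_len(num_seasons)) {
    seasons_table <- bind_rows(seasons_table, season_data(url, season))
  }
  return(seasons_table)
}
Examples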
Now you can go to IMDb and search for any series. I will show two examples of well-known series.
The Joy of Painting (Seasons 1-3)
The Joy of Painting: can we say anything more than beautiful oil paintings on canvas by Bob Ross?
Joy_Painting <- all_seasons("https://www.imdb.com/title/tt0383795/episodes?season=", 3)
Joy_Painting
# A tibble: 39 × 6
Season episode_air_date episode_name episode_rate episode_vo…¹ episo…²
<int> <chr> <chr> <chr> <chr> <chr>
1 1 11 Jan. 1983 A Walk in the Woods 9.1 (91) "Bob R…
2 1 11 Jan. 1983 Mt. McKinley 9.4 (74) "Bob p…
3 1 18 Jan. 1983 Ebony Sunset 9.2 (62) "Bob u…
4 1 25 Jan. 1983 Winter Mist 9.2 (56) "Bob p…
5 1 1 Feb. 1983 Quiet Stream 9.1 (55) "Bob p…
6 1 8 Feb. 1983 Winter Moon 9.4 (53) "Anoth…
7 1 15 Feb. 1983 Autumn Mountains 9.4 (49) "An al…
8 1 22 Feb. 1983 Peaceful Valley 9.4 (48) "Bob p…
9 1 1 Mar. 1983 Seascape 9.3 (51) "Bob p…
10 1 8 Mar. 1983 Mountain Lake 9.4 (50) "Bob p…
# … with 29 more rows, and abbreviated variable names ¹episode_votes,
# ²episode_description
Formula 1: Drive to Survive (Seasons 1-2)
An F1 documentary series and a great way to learn more about the drivers, the teams, and everything around them. Lots of drama.
F1_drive <- all_seasons("https://www.imdb.com/title/tt8289930/episodes/?season=", 2)
F1_drive
# A tibble: 20 × 6
Season episode_air_date episode_name episode_rate episode_v…¹ episo…²
<int> <chr> <chr> <chr> <chr> <chr>
1 1 8 Mar. 2019 All to Play For 7.8 (1,141) Driver…
2 1 8 Mar. 2019 The King of Spain 7.7 (998) Team M…
3 1 8 Mar. 2019 Redemption 8.3 (1,012) At the…
4 1 8 Mar. 2019 The Art of War 8.1 (940) The tr…
5 1 8 Mar. 2019 Trouble at the Top 7.6 (862) A team…
6 1 8 Mar. 2019 All or Nothing 7.8 (874) When F…
7 1 8 Mar. 2019 Keeping Your Head 7.6 (833) Perhap…
8 1 8 Mar. 2019 The Next Generation 8.1 (853) Sauber…
9 1 8 Mar. 2019 Stars and Stripes 7.6 (810) The bi…
10 1 8 Mar. 2019 Crossing the Line 7.7 (814) Driver…
11 2 28 Feb. 2020 Lights Out 7.6 (860) The 20…
12 2 28 Feb. 2020 Boiling Point 7.9 (788) As med…
13 2 28 Feb. 2020 Dogfight 7.7 (794) Carlos…
14 2 28 Feb. 2020 Dark Days 8.2 (855) Lewis …
15 2 28 Feb. 2020 Great Expectations 7.9 (792) Red Bu…
16 2 28 Feb. 2020 Raging Bulls 8.6 (939) Alex A…
17 2 28 Feb. 2020 Seeing Red 7.7 (759) Charle…
18 2 28 Feb. 2020 Musical Chairs 7.7 (742) Niko H…
19 2 28 Feb. 2020 Blood, Sweat & Tears 7.5 (731) Team W…
20 2 28 Feb. 2020 Checkered Flag 8.2 (785) Pierre…
# … with abbreviated variable names ¹episode_votes, ²episode_description
Data Cleansing
If you remember, we scraped all the data as text, and we cannot do much with it in that format. We need to clean and transform the data into the correct shape and types.
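To see what that conversion looks like on individual values, here is a short sketch using strings taken from the output above. It assumes an English locale, since dmy() has to recognise the abbreviated month name. The variable names are hypothetical.

```r
library(lubridate)
library(readr)

# Raw strings as they come out of the scraper.
raw_date <- "8 Mar. 2019"
raw_rate <- "7.8"
raw_votes <- "(1,141)"

# dmy() reads day-month-year text and returns a Date.
clean_date <- dmy(raw_date)

# parse_number() drops the surrounding characters and the grouping comma.
clean_rate <- parse_number(raw_rate)
clean_votes <- parse_number(raw_votes)

clean_date
clean_rate
clean_votes
```

Once the columns are real dates and numbers, they can be sorted, averaged, and plotted without further conversion.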
clean_seasons <- function(seasons_table) {
  seasons_table <- seasons_table |>
    # Convert the text columns to date and numeric types
    mutate(
      episode_air_date = dmy(episode_air_date),
      episode_rate = parse_number(episode_rate),
      episode_votes = parse_number(episode_votes)
    ) |>
    # Number the episodes within each season
    group_by(Season) |>
    mutate(episode = row_number()) |>
    select(Season, episode, everything())
  return(seasons_table)
}
Cleansing the F1 data
F1_drive <- clean_seasons(F1_drive)
F1_drive
# A tibble: 20 × 7
# Groups: Season [2]
Season episode episode_air_date episode_name episod…¹ episo…² episo…³
<int> <int> <date> <chr> <dbl> <dbl> <chr>
1 1 1 2019-03-08 All to Play For 7.8 1141 Driver…
2 1 2 2019-03-08 The King of Spain 7.7 998 Team M…
3 1 3 2019-03-08 Redemption 8.3 1012 At the…
4 1 4 2019-03-08 The Art of War 8.1 940 The tr…
5 1 5 2019-03-08 Trouble at the Top 7.6 862 A team…
6 1 6 2019-03-08 All or Nothing 7.8 874 When F…
7 1 7 2019-03-08 Keeping Your Head 7.6 833 Perhap…
8 1 8 2019-03-08 The Next Generation 8.1 853 Sauber…
9 1 9 2019-03-08 Stars and Stripes 7.6 810 The bi…
10 1 10 2019-03-08 Crossing the Line 7.7 814 Driver…
11 2 1 2020-02-28 Lights Out 7.6 860 The 20…
12 2 2 2020-02-28 Boiling Point 7.9 788 As med…
13 2 3 2020-02-28 Dogfight 7.7 794 Carlos…
14 2 4 2020-02-28 Dark Days 8.2 855 Lewis …
15 2 5 2020-02-28 Great Expectations 7.9 792 Red Bu…
16 2 6 2020-02-28 Raging Bulls 8.6 939 Alex A…
17 2 7 2020-02-28 Seeing Red 7.7 759 Charle…
18 2 8 2020-02-28 Musical Chairs 7.7 742 Niko H…
19 2 9 2020-02-28 Blood, Sweat & Tears 7.5 731 Team W…
20 2 10 2020-02-28 Checkered Flag 8.2 785 Pierre…
# … with abbreviated variable names ¹episode_rate, ²episode_votes,
# ³episode_description
What’s Next?
Now the information about all our seasons is clean and ready to be uploaded to a database or saved as a CSV file, an Excel file, or any other file format.
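As one possible way to do the CSV part, readr's write_csv() can save a table to disk. The sketch below uses a made-up two-row table in place of a real cleaned result and writes to a temporary file. The names cleaned, csv_path, and round_trip are hypothetical.

```r
library(readr)
library(tibble)

# Hypothetical two-row stand-in for a cleaned table such as F1_drive.
cleaned <- tibble(
  Season = c(1L, 1L),
  episode = c(1L, 2L),
  episode_name = c("All to Play For", "The King of Spain"),
  episode_rate = c(7.8, 7.7)
)

# tempfile() keeps the example from writing into the working directory.
csv_path <- tempfile(fileext = ".csv")
write_csv(cleaned, csv_path)

# Read the file back to check that the round trip keeps the rows.
round_trip <- read_csv(csv_path, show_col_types = FALSE)
round_trip
```

For a real run, replace csv_path with the path where the file should live.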
Another thing to improve would be to let users type the name of a series and get back its id, maybe with RSelenium or a similar package.
Have fun!