IMDb Crawler

Hello everyone. Building projects in Quarto or Jupyter lets me show the process of creating R or Python projects in a way that is easy to follow.

The following packages will be used for this project, so make sure to install them.

library(rvest)
library(lubridate)
library(tidyverse)

Crawling through IMDb

The first step in crawling or scraping IMDb is to understand how IMDb retrieves the information it displays. You can get a general idea from the URL.

After that, we can break the process of gathering the information for all the seasons into getting the information for one season at a time, then moving on to the next season until we reach the last one.

Retrieving the information for a single season will be one function, and looping through all the seasons will be another function that uses the first one.

Doing everything inside one big loop can be considered bad practice.

Functions are generally used to break a big process into small pieces.

The URL

The format to crawl through any TV series on IMDb is:

https://www.imdb.com/title/ + <id of the series> + /episodes?season= + <season number>
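As a small illustration of the format above, the URL can be assembled from its parts. This is only a sketch: build_season_url is a hypothetical helper name that is not used elsewhere in this post, and the id tt0383795 is the one from The Joy of Painting example further down.

```r
library(stringr)

# Hypothetical helper (not part of the functions in this post):
# builds the season URL from a series id and a season number.
build_season_url <- function(series_id, season_number) {
  str_c("https://www.imdb.com/title/", series_id, "/episodes?season=", season_number)
}

first_season_url <- build_season_url("tt0383795", 1)
first_season_url
# "https://www.imdb.com/title/tt0383795/episodes?season=1"
```

The functions below take the part of the URL up to season= as one string and append the season number themselves, so this helper is optional.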

Scraping Single Season

As described above, the first thing to do is crawl and retrieve the information for only one season.

This is what the season_data function does. It gets the HTML content for a given series and season.

After that, it parses the information about each episode as text and returns a tibble containing that information.

Another way to read the statements above is: get the information about the episodes and return a table with that information.

# The function season_data retrieves the HTML for a given series and season.
# After that, it parses the information about each episode as text and returns a tibble containing that information.
season_data <- function(season_url, season_number){
  current_url <-  str_c(season_url, season_number)
  
  season_serie <- read_html(current_url) |> 
    html_elements(".list_item")
  
  seasons_episodes <- tibble(Season = season_number
                             , episode_air_date = season_serie |> html_element(".airdate") |> html_text2()
                             , episode_name = season_serie |> html_element("strong") |> html_text2()
                             , episode_rate = season_serie |> html_element(".ipl-rating-star__rating") |> html_text2()
                             , episode_votes = season_serie |> html_element(".ipl-rating-star__total-votes") |> html_text2()
                             , episode_description = season_serie |> html_element(".item_description") |> html_text2()
  )
  
  return(seasons_episodes)
}
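To see why season_data first selects every episode block with html_elements and then calls html_element (singular) on each block, here is a minimal offline sketch. The HTML is made up for illustration and only borrows two of the class names used above; it is not real IMDb markup.

```r
library(rvest)

# Made-up HTML that mimics two episode entries; the second one has no rating.
page <- minimal_html('
  <div class="list_item"><strong>Episode A</strong>
    <span class="ipl-rating-star__rating">9.1</span></div>
  <div class="list_item"><strong>Episode B</strong></div>
')

# One node per episode block.
items <- page |> html_elements(".list_item")

# html_element() returns exactly one result per block, so the columns
# stay aligned even when a field is missing (it becomes NA).
episode_name <- items |> html_element("strong") |> html_text2()
episode_rate <- items |> html_element(".ipl-rating-star__rating") |> html_text2()

episode_name
episode_rate
```

Because every column has one entry per episode block, the columns can be combined into a tibble without the rows getting out of step.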

Scraping All Seasons

Now we have the general idea of how to get the information for a single season, but we need to loop through several seasons. That is what all_seasons does: it goes through each season, calling season_data, until it reaches the last season we want.

Another way to read the statements above is: go season by season and add each one to the table containing the information for all the seasons.

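# Scrapes seasons 1 to num_seasons and stacks them into a single tibble.
all_seasons <- function(url, num_seasons){
  seasons <- tibble()

  # seq_len() avoids the 1:0 pitfall when num_seasons is 0
  for(season in seq_len(num_seasons)){
    seasons <- bind_rows(seasons, season_data(url, season))
  }

  return(seasons)
}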
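The same "go season by season and stack the results" idea can also be written without an explicit loop. The sketch below is an alternative, not the approach used in this post: all_seasons_map and fake_season_data are hypothetical names, and the fake function stands in for season_data so the example runs without touching IMDb.

```r
library(tibble)
library(purrr)
library(dplyr)

# Hypothetical offline stand-in for season_data(): two made-up episodes per season.
fake_season_data <- function(season_url, season_number) {
  tibble(Season = season_number, episode_name = c("Episode A", "Episode B"))
}

# Same idea as all_seasons(), with the per-season function passed in as `fetch`.
all_seasons_map <- function(url, num_seasons, fetch) {
  seq_len(num_seasons) |>
    map(\(season) fetch(url, season)) |>  # one tibble per season
    bind_rows()                           # stack them into one table
}

demo <- all_seasons_map("https://www.imdb.com/title/tt0383795/episodes?season=", 3, fake_season_data)
nrow(demo)
# 6
```

Passing season_data as the fetch argument would give the real scraper the same shape.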

Examples

Now you can go to IMDb and search for any series. I will show two examples with well-known series.

The Joy of Painting (Seasons 1-3)

The Joy of Painting: what more can we say than beautiful oil paintings on canvas by Bob Ross?

Joy_Painting <- all_seasons("https://www.imdb.com/title/tt0383795/episodes?season=", 3)

Joy_Painting
# A tibble: 39 × 6
   Season episode_air_date episode_name        episode_rate episode_vo…¹ episo…²
    <int> <chr>            <chr>               <chr>        <chr>        <chr>  
 1      1 11 Jan. 1983     A Walk in the Woods 9.1          (91)         "Bob R…
 2      1 11 Jan. 1983     Mt. McKinley        9.4          (74)         "Bob p…
 3      1 18 Jan. 1983     Ebony Sunset        9.2          (62)         "Bob u…
 4      1 25 Jan. 1983     Winter Mist         9.2          (56)         "Bob p…
 5      1 1 Feb. 1983      Quiet Stream        9.1          (55)         "Bob p…
 6      1 8 Feb. 1983      Winter Moon         9.4          (53)         "Anoth…
 7      1 15 Feb. 1983     Autumn Mountains    9.4          (49)         "An al…
 8      1 22 Feb. 1983     Peaceful Valley     9.4          (48)         "Bob p…
 9      1 1 Mar. 1983      Seascape            9.3          (51)         "Bob p…
10      1 8 Mar. 1983      Mountain Lake       9.4          (50)         "Bob p…
# … with 29 more rows, and abbreviated variable names ¹​episode_votes,
#   ²​episode_description

Formula 1: Drive to Survive (Seasons 1-2)

An F1 documentary series and a great way to learn more about the drivers, the teams, and more. Lots of drama.

F1_drive <- all_seasons("https://www.imdb.com/title/tt8289930/episodes/?season=", 2)

F1_drive
# A tibble: 20 × 6
   Season episode_air_date episode_name         episode_rate episode_v…¹ episo…²
    <int> <chr>            <chr>                <chr>        <chr>       <chr>  
 1      1 8 Mar. 2019      All to Play For      7.8          (1,141)     Driver…
 2      1 8 Mar. 2019      The King of Spain    7.7          (998)       Team M…
 3      1 8 Mar. 2019      Redemption           8.3          (1,012)     At the…
 4      1 8 Mar. 2019      The Art of War       8.1          (940)       The tr…
 5      1 8 Mar. 2019      Trouble at the Top   7.6          (862)       A team…
 6      1 8 Mar. 2019      All or Nothing       7.8          (874)       When F…
 7      1 8 Mar. 2019      Keeping Your Head    7.6          (833)       Perhap…
 8      1 8 Mar. 2019      The Next Generation  8.1          (853)       Sauber…
 9      1 8 Mar. 2019      Stars and Stripes    7.6          (810)       The bi…
10      1 8 Mar. 2019      Crossing the Line    7.7          (814)       Driver…
11      2 28 Feb. 2020     Lights Out           7.6          (860)       The 20…
12      2 28 Feb. 2020     Boiling Point        7.9          (788)       As med…
13      2 28 Feb. 2020     Dogfight             7.7          (794)       Carlos…
14      2 28 Feb. 2020     Dark Days            8.2          (855)       Lewis …
15      2 28 Feb. 2020     Great Expectations   7.9          (792)       Red Bu…
16      2 28 Feb. 2020     Raging Bulls         8.6          (939)       Alex A…
17      2 28 Feb. 2020     Seeing Red           7.7          (759)       Charle…
18      2 28 Feb. 2020     Musical Chairs       7.7          (742)       Niko H…
19      2 28 Feb. 2020     Blood, Sweat & Tears 7.5          (731)       Team W…
20      2 28 Feb. 2020     Checkered Flag       8.2          (785)       Pierre…
# … with abbreviated variable names ¹​episode_votes, ²​episode_description

Data Cleansing

If you remember, we scraped all the data as text, and we cannot do much with it in that format. We need to clean the data and convert each column to the correct type.

clean_seasons <- function(seasons_table){
  seasons_table <- seasons_table |> 
    # Convert the text columns into a date and two numbers
    mutate(episode_air_date = dmy(episode_air_date),
           episode_rate = parse_number(episode_rate),
           episode_votes = parse_number(episode_votes)
           ) |> 
    # Number the episodes within each season (the result stays grouped by Season)
    group_by(Season) |> 
    mutate(episode = row_number()) |> 
    select(Season, episode, everything())
  
  return(seasons_table)
}
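To make the conversions inside clean_seasons more concrete, here is what they do to single values. The strings are taken from the first row of the F1 table above. The sketch assumes an English locale, since dmy has to recognize the month abbreviation.

```r
library(lubridate)
library(readr)

# Values copied from the first row of the scraped F1 table.
air_date <- dmy("8 Mar. 2019")       # text to Date
votes    <- parse_number("(1,141)")  # drops the parentheses and the thousands separator
rate     <- parse_number("7.8")      # text to number

air_date
# "2019-03-08"
votes
# 1141
rate
# 7.8
```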

Cleansing the F1 data

F1_drive <- clean_seasons(F1_drive)

F1_drive
# A tibble: 20 × 7
# Groups:   Season [2]
   Season episode episode_air_date episode_name         episod…¹ episo…² episo…³
    <int>   <int> <date>           <chr>                   <dbl>   <dbl> <chr>  
 1      1       1 2019-03-08       All to Play For           7.8    1141 Driver…
 2      1       2 2019-03-08       The King of Spain         7.7     998 Team M…
 3      1       3 2019-03-08       Redemption                8.3    1012 At the…
 4      1       4 2019-03-08       The Art of War            8.1     940 The tr…
 5      1       5 2019-03-08       Trouble at the Top        7.6     862 A team…
 6      1       6 2019-03-08       All or Nothing            7.8     874 When F…
 7      1       7 2019-03-08       Keeping Your Head         7.6     833 Perhap…
 8      1       8 2019-03-08       The Next Generation       8.1     853 Sauber…
 9      1       9 2019-03-08       Stars and Stripes         7.6     810 The bi…
10      1      10 2019-03-08       Crossing the Line         7.7     814 Driver…
11      2       1 2020-02-28       Lights Out                7.6     860 The 20…
12      2       2 2020-02-28       Boiling Point             7.9     788 As med…
13      2       3 2020-02-28       Dogfight                  7.7     794 Carlos…
14      2       4 2020-02-28       Dark Days                 8.2     855 Lewis …
15      2       5 2020-02-28       Great Expectations        7.9     792 Red Bu…
16      2       6 2020-02-28       Raging Bulls              8.6     939 Alex A…
17      2       7 2020-02-28       Seeing Red                7.7     759 Charle…
18      2       8 2020-02-28       Musical Chairs            7.7     742 Niko H…
19      2       9 2020-02-28       Blood, Sweat & Tears      7.5     731 Team W…
20      2      10 2020-02-28       Checkered Flag            8.2     785 Pierre…
# … with abbreviated variable names ¹​episode_rate, ²​episode_votes,
#   ³​episode_description

What’s Next?

Now the information about all our seasons is clean and ready to be uploaded to a database or saved as a CSV file, an Excel file, or any other file format.

Another improvement would be to let users type the name of a series and get back its id, maybe with RSelenium or a similar package.

Have fun!
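As one possible next step, a cleaned table can be written to a CSV file with write_csv from readr. This is a minimal sketch: the small episodes tibble is made up so the example runs without scraping, and the file goes to a temporary path rather than a real project folder.

```r
library(tibble)
library(readr)

# Made-up stand-in for a cleaned seasons table.
episodes <- tibble(
  Season = 1L,
  episode = 1:2,
  episode_name = c("Episode A", "Episode B")
)

# Write to a temporary file, then read it back to check the round trip.
csv_path <- tempfile(fileext = ".csv")
write_csv(episodes, csv_path)

round_trip <- read_csv(csv_path, show_col_types = FALSE)
nrow(round_trip)
# 2
```

With the real data, replacing episodes with F1_drive and csv_path with a path of your choice should be all that is needed.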