Introduction

IMDB is the world’s most popular and authoritative source for movie, TV, and celebrity content. The data to be extracted covers the top 50 movies of 2020.

Details including each movie’s title, release year, genre, runtime, ratings, and user votes are extracted. We then check whether ratings correlate with user votes (a proxy for popularity). For instance, do the highest-rated movies also have the highest user vote counts?

The webpage to work with is: https://www.imdb.com/search/title/?year=2020&title_type=feature&

Web Scraping

Load the required libraries.

library(rvest)
library(dplyr)
library(ggplot2)
library(stringr)
library(readr)

Read the webpage to harvest data from. Then use CSS selectors to extract the movie titles and release years.

webpage_content <- read_html("https://www.imdb.com/search/title/?year=2020&title_type=feature&")

movie_titles <- webpage_content %>% html_nodes(".lister-item-header a") %>% html_text()
print(movie_titles)
##  [1] "Worth"                     "News of the World"        
##  [3] "The Night House"           "365 Days"                 
##  [5] "A Quiet Place Part II"     "The Courier"              
##  [7] "Tenet"                     "The Father"               
##  [9] "Promising Young Woman"     "Wonder Woman 1984"        
## [11] "Love and Monsters"         "Another Round"            
## [13] "Boss Level"                "Birds of Prey"            
## [15] "Nomadland"                 "Four Good Days"           
## [17] "The Old Ways"              "After We Collided"        
## [19] "Zola"                      "Let Him Go"               
## [21] "The Hunt"                  "The Devil All the Time"   
## [23] "Bill & Ted Face the Music" "Freaky"                   
## [25] "Riders of Justice"         "Underwater"               
## [27] "The Empty Man"             "Monster Hunter"           
## [29] "Palm Springs"              "Hamilton"                 
## [31] "Greenland"                 "The Dry"                  
## [33] "Mulan"                     "The New Mutants"          
## [35] "Soul"                      "Bad Candy"                
## [37] "I Care a Lot"              "The Witches"              
## [39] "The King of Staten Island" "The Craft: Legacy"        
## [41] "Demon Slayer: Mugen Train" "Inheritance"              
## [43] "Dolittle"                  "Run"                      
## [45] "The Old Guard"             "Minari"                   
## [47] "The Croods: A New Age"     "Extraction"               
## [49] "Enola Holmes"              "The Invisible Man"
release_years <- webpage_content %>% html_nodes(".text-muted.unbold") %>% html_text()
#Keep only the parenthesised year, dropping movie part numbers that were extracted along with it
release_years <- str_match(release_years, "(\\(\\d+\\))")
#Remove the parentheses and convert to numeric
release_years <- parse_number(release_years[,1])
print(release_years)
##  [1] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2021 2020 2020
## [16] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020
## [31] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020
## [46] 2020 2020 2020 2020 2020
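
To see what the pattern does, here is a minimal sketch run on hand-written strings rather than the live page. The sample values, including the "(I)" style prefix, and the name sample_years are assumptions for illustration.

```r
library(stringr)
library(readr)

# Hypothetical raw strings of the kind the selector might return
sample_years <- c("(2020)", "(I) (2020)", "(II) (2021)")

# Keep only the parenthesised run of digits, e.g. "(2020)"
matched <- str_match(sample_years, "(\\(\\d+\\))")

# parse_number() drops the parentheses and returns a numeric vector
clean_years <- parse_number(matched[, 1])
print(clean_years)
```

Prefixes such as "(I)" are skipped because the pattern only accepts digits between the parentheses.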

Next, extract the movie’s runtime and genre.

run_time <- webpage_content %>% html_nodes(".runtime") %>% html_text()
#Remove the " min" that every run time character string ends with
run_time <- run_time %>% parse_number()
print(run_time)
##  [1] 118 118 107 114  97 112 150  97 113 151 109 117  94 109 107 100  90 105  86
## [20] 113  90 138  91 102 116  95 137 103  90 160 119 117 115  94 100 100 118 106
## [39] 136  97 117 111 101  90 125 115  95 116 123 124
genre <- webpage_content %>% html_nodes(".genre") %>% html_text()
#Trim white spaces
genre <- str_trim(genre)
print(genre)
##  [1] "Biography, Drama, History"    "Action, Adventure, Drama"    
##  [3] "Horror, Mystery, Thriller"    "Drama, Romance"              
##  [5] "Drama, Horror, Sci-Fi"        "Drama, History, Thriller"    
##  [7] "Action, Sci-Fi, Thriller"     "Drama"                       
##  [9] "Crime, Drama, Thriller"       "Action, Adventure, Fantasy"  
## [11] "Action, Adventure, Comedy"    "Comedy, Drama"               
## [13] "Action, Mystery, Sci-Fi"      "Action, Adventure, Comedy"   
## [15] "Drama"                        "Drama"                       
## [17] "Fantasy, Horror"              "Drama, Romance"              
## [19] "Comedy, Crime, Drama"         "Crime, Drama, Thriller"      
## [21] "Action, Horror, Thriller"     "Crime, Drama, Thriller"      
## [23] "Adventure, Comedy, Music"     "Comedy, Horror, Thriller"    
## [25] "Action, Comedy, Drama"        "Action, Horror, Sci-Fi"      
## [27] "Horror, Mystery, Thriller"    "Action, Adventure, Fantasy"  
## [29] "Comedy, Fantasy, Mystery"     "Biography, Drama, History"   
## [31] "Action, Drama, Thriller"      "Crime, Drama, Mystery"       
## [33] "Action, Adventure, Drama"     "Action, Horror, Mystery"     
## [35] "Animation, Adventure, Comedy" "Horror, Thriller"            
## [37] "Comedy, Crime, Thriller"      "Adventure, Comedy, Family"   
## [39] "Comedy, Drama"                "Drama, Fantasy, Horror"      
## [41] "Animation, Action, Adventure" "Drama, Mystery, Thriller"    
## [43] "Adventure, Comedy, Family"    "Mystery, Thriller"           
## [45] "Action, Adventure, Fantasy"   "Drama"                       
## [47] "Animation, Adventure, Comedy" "Action, Thriller"            
## [49] "Action, Adventure, Crime"     "Drama, Horror, Mystery"

Next, extract the IMDB ratings and metascores. A metascore is a weighted average of published critic reviews, so it summarises how critics received a movie, whereas the IMDB rating comes from users.

imdb_rating <- webpage_content %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
#Convert from character to numeric data type
imdb_rating <- as.numeric(imdb_rating)
print(imdb_rating)
##  [1] 6.8 6.8 6.9 3.3 7.3 7.1 7.4 8.3 7.5 5.4 7.0 7.8 6.8 6.1 7.4 6.6 5.4 5.2 6.5
## [20] 6.7 6.5 7.1 6.0 6.4 7.6 5.8 6.1 5.3 7.4 8.4 6.4 6.9 5.7 5.3 8.1 6.1 6.3 5.3
## [39] 7.1 4.4 8.3 5.5 5.6 6.7 6.7 7.5 7.0 6.7 6.6 7.1
metascore <- webpage_content %>% html_nodes(".metascore") %>% html_text()
#Remove white spaces and convert to numeric data type
metascore <- str_trim(metascore) %>% as.numeric()
print(metascore)
##  [1] 66 73 68 71 65 69 88 73 60 63 79 56 60 93 52 14 76 63 50 55 65 67 81 48 47
## [26] 83 90 64 69 66 43 83 66 47 67 54 75 31 26 67 70 89 56 56 68 72
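
The metascore output above has 46 values for 50 movies because html_nodes() returns only the nodes that exist. One way to keep each score aligned with its movie, sketched here on a made-up HTML fragment, is to select one container per movie first and then look for the score inside each container. The fragment and the `.lister-item-content` container class are assumptions for illustration; the real page structure should be checked before relying on them.

```r
library(rvest)

# Made-up fragment: the second item has no metascore
fragment <- read_html('
  <div class="lister-item-content"><span class="metascore">66</span></div>
  <div class="lister-item-content"></div>
  <div class="lister-item-content"><span class="metascore">68</span></div>
')

# One node per movie ...
items <- html_nodes(fragment, ".lister-item-content")

# ... then at most one metascore per movie. html_node() (singular) returns a
# missing node where nothing matches, which html_text() turns into NA
scores <- items %>% html_node(".metascore") %>% html_text() %>% as.numeric()
print(scores)
```

With this pattern the result has one entry per movie, so no values need to be inserted afterwards.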

Lastly, extract the user votes.

user_votes <- webpage_content %>% html_nodes(".sort-num_votes-visible span:nth-child(2)") %>% html_text()
#Convert from character to numeric data type
user_votes <- parse_number(user_votes)
print(user_votes)
##  [1]   5355  72896   4925  66040 153863  32225 418431  93712 125011 229387
## [11] 106266 107225  47181 207620 125237   2689   3509  24665   5948  20267
## [21]  88475 111914  39924  42661  25464  71558  18000  45928 129059  73104
## [31]  98854  15191 135046  65411 276630     65 113495  32024  52251  10277
## [41]  34922  10977  54458  58406 145233  59219  32664 178330 148655 191426
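
parse_number() does most of the cleaning in the steps above. The short sketch below shows how it treats the three kinds of strings involved; the sample strings are hand-written assumptions, not values copied from the page.

```r
library(readr)

# Hand-written samples: a run time with a unit suffix, a vote count with a
# grouping comma, and a year wrapped in parentheses
samples <- c("118 min", "72,896", "(2020)")

# parse_number() keeps the first number in each string and ignores the
# surrounding non-numeric characters, including the grouping comma
parsed <- parse_number(samples)
print(parsed)
```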

Combining the Vectors

The titles, years, runtimes, genres, IMDB ratings, metascores, and votes have now been obtained from the IMDB web page. Combining these vectors into one data frame fails at this point because they are not all the same length: the metascore vector has 46 values, while the others have 50. The fix is to insert NA into the metascore vector wherever a metascore is missing. Inspecting the webpage shows that movies 4, 17, 27, and 36 do not have a metascore.
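
A quick way to confirm which vector is short, and to see the failure itself, is sketched below with small stand-in vectors. The values are made up, and base R's data.frame() is used as a stand-in for the data frame constructor.

```r
# Stand-in vectors: five titles but only four scores (made-up values)
titles <- c("A", "B", "C", "D", "E")
scores <- c(66, 73, 68, 71)

# lengths() reports the size of each vector so the short one stands out
vector_lengths <- lengths(list(titles = titles, scores = scores))
print(vector_lengths)

# data.frame() refuses to combine columns whose lengths are incompatible;
# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(
  data.frame(title = titles, score = scores),
  error = function(e) conditionMessage(e)
)
print(result)
```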

Base R’s append() inserts at only one position per call, so a small helper function is written to insert values at several positions at once:

append_vector <- function(vector, inserted_indices, values){

  ## Creating the current indices of the vector
  vector_current_indices <- seq_along(vector)

  ## Recycling `values` so that there is one value per insertion point
  values <- rep_len(values, length(inserted_indices))

  ## Adding 0.5 to the `inserted_indices` so that each new element sorts
  ## after the index it follows and before the next whole index
  new_inserted_indices <- inserted_indices + 0.5

  ## Appending the `new_inserted_indices` to the current vector indices
  indices <- c(vector_current_indices, new_inserted_indices)

  ## Ordering the indices
  ordered_indices <- order(indices)

  ## Appending the new values to the existing vector
  new_vector <- c(vector, values)

  ## Ordering the new vector wrt the ordered indices
  new_vector[ordered_indices]
}

The above function takes three arguments: the vector to insert into, the indices after which values should be inserted, and the value(s) to insert.

Each insertion index is offset by a value less than 1, so that it sorts after the index it follows and before the next whole index. The new values are appended to the end of the original vector, and the combined vector is then reordered according to the sorted index vector, which now includes the non-whole numbers.
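
The index trick is easier to follow on a tiny vector. The sketch below traces it by hand for a single insertion; the vector x and the inserted value "X" are illustrative only.

```r
x <- c("a", "b", "c", "d")

# Insert "X" after position 2 by giving it the fractional index 2.5
indices <- c(seq_along(x), 2.5)      # 1 2 3 4 2.5
ordered_indices <- order(indices)    # 1 2 5 3 4

# Append the new value, then reorder everything by the sorted indices
result <- c(x, "X")[ordered_indices]
print(result)
## [1] "a" "b" "X" "c" "d"
```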

Inserting NA values into the metascore vector

#Insert positions are counted in the 46-element metascore vector:
#movies 4, 17, 27, and 36 follow elements 3, 15, 24, and 32 respectively
new_metascore <- append_vector(metascore, c(3, 15, 24, 32), NA)
print(new_metascore)
##  [1] 66 73 68 NA 71 65 69 88 73 60 63 79 56 60 93 52 NA 14 76 63 50 55 65 67 81
## [26] 48 NA 47 83 90 64 69 66 43 83 NA 66 47 67 54 75 31 26 67 70 89 56 56 68 72
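
The helper's second argument counts positions in the shorter, 46-element vector, not final movie numbers. A minimal sketch of converting one into the other follows, with a cross-check that uses base R's append(); the names missing_positions, insert_after, and short_vector are illustrative.

```r
# Final positions (movie numbers) that should end up as NA
missing_positions <- c(4, 17, 27, 36)

# Each gap already accounted for shifts the later ones left by one, so the
# k-th gap sits after element (position - k) of the shorter vector
insert_after <- missing_positions - seq_along(missing_positions)
print(insert_after)
## [1]  3 15 24 32

# Cross-check with append(), working from the last gap backwards so that
# earlier positions are not disturbed
short_vector <- seq_len(46)   # stand-in for the 46 scraped scores
filled <- short_vector
for (i in rev(insert_after)) {
  filled <- append(filled, NA, after = i)
}
print(which(is.na(filled)))
## [1]  4 17 27 36
```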

Now we can combine the vectors to form a data frame.

imdb_df <- tibble(
  `Movie Title` = movie_titles,
  `Release Year` = release_years,
  `Run Time(mins)` = run_time,
  Genre = genre,
  #Round ratings down to whole numbers so the box plot can group by them
  `IMDB Rating` = floor(imdb_rating),
  `Meta Score` = new_metascore,
  `User Votes` = user_votes
)

Visualizing Data

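The box plot groups movies by IMDB rating rounded down to a whole number and shows a loose positive association between rating and user votes: movies rated 7 or higher generally have more votes than those rated below 7.

The relationship is far from strict, though. The most-voted movie, Tenet (418,431 votes), is rated 7.4 rather than falling in the highest rating group, and the least-voted, Bad Candy (65 votes), is rated 6.1 rather than falling in the lowest.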

ggplot(data = imdb_df, aes(x = `IMDB Rating`, y = `User Votes`, group = `IMDB Rating`)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  ggtitle("Correlation between User Votes and IMDB Rating")
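
A single number can complement the plot. The sketch below is an addition rather than part of the original analysis: it re-enters the ratings and vote counts printed earlier and computes a rank correlation with base R's cor(). Choosing Spearman's method is an assumption, made because the vote counts are heavily skewed.

```r
# Ratings and vote counts as printed in the scraping output above
imdb_rating <- c(
  6.8, 6.8, 6.9, 3.3, 7.3, 7.1, 7.4, 8.3, 7.5, 5.4,
  7.0, 7.8, 6.8, 6.1, 7.4, 6.6, 5.4, 5.2, 6.5, 6.7,
  6.5, 7.1, 6.0, 6.4, 7.6, 5.8, 6.1, 5.3, 7.4, 8.4,
  6.4, 6.9, 5.7, 5.3, 8.1, 6.1, 6.3, 5.3, 7.1, 4.4,
  8.3, 5.5, 5.6, 6.7, 6.7, 7.5, 7.0, 6.7, 6.6, 7.1
)
user_votes <- c(
  5355, 72896, 4925, 66040, 153863, 32225, 418431, 93712, 125011, 229387,
  106266, 107225, 47181, 207620, 125237, 2689, 3509, 24665, 5948, 20267,
  88475, 111914, 39924, 42661, 25464, 71558, 18000, 45928, 129059, 73104,
  98854, 15191, 135046, 65411, 276630, 65, 113495, 32024, 52251, 10277,
  34922, 10977, 54458, 58406, 145233, 59219, 32664, 178330, 148655, 191426
)

# Spearman's rho compares ranks, so a few very popular movies do not
# dominate the result
rho <- cor(imdb_rating, user_votes, method = "spearman")
print(rho)
```

A value near 1 would indicate that higher-rated movies consistently attract more votes, while a value near 0 would indicate little monotonic relationship.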

Conclusion

By scraping the IMDB search results for feature films of 2020, we set out to understand whether user votes (popularity among viewers) correlate with IMDB rating. The data was cleaned and combined into a single data frame, then visualized as a box plot of user votes by IMDB rating for the top 50 movies of 2020.