IMDB is the world’s most popular and authoritative source for movie, TV, and celebrity content. The data to be extracted is the top 50 movies of the year 2020.
Several details are extracted for each movie: title, release year, runtime, genre, IMDB rating, Metascore, and user votes. Then, we check whether ratings correlate with user votes (a proxy for popularity). For instance, do the highest-rated movies also have the highest user vote counts?
The webpage to work with is: https://www.imdb.com/search/title/?year=2020&title_type=feature&
Load the required libraries.
library(rvest)
library(dplyr)
library(ggplot2)
library(stringr)
library(readr)
Define the webpage that we are harvesting data from. Next, identify the selectors used to extract the movie titles and release years.
webpage_content <- read_html("https://www.imdb.com/search/title/?year=2020&title_type=feature&")
movie_titles <- webpage_content %>% html_nodes(".lister-item-header a") %>% html_text()
print(movie_titles)
## [1] "Worth" "News of the World"
## [3] "The Night House" "365 Days"
## [5] "A Quiet Place Part II" "The Courier"
## [7] "Tenet" "The Father"
## [9] "Promising Young Woman" "Wonder Woman 1984"
## [11] "Love and Monsters" "Another Round"
## [13] "Boss Level" "Birds of Prey"
## [15] "Nomadland" "Four Good Days"
## [17] "The Old Ways" "After We Collided"
## [19] "Zola" "Let Him Go"
## [21] "The Hunt" "The Devil All the Time"
## [23] "Bill & Ted Face the Music" "Freaky"
## [25] "Riders of Justice" "Underwater"
## [27] "The Empty Man" "Monster Hunter"
## [29] "Palm Springs" "Hamilton"
## [31] "Greenland" "The Dry"
## [33] "Mulan" "The New Mutants"
## [35] "Soul" "Bad Candy"
## [37] "I Care a Lot" "The Witches"
## [39] "The King of Staten Island" "The Craft: Legacy"
## [41] "Demon Slayer: Mugen Train" "Inheritance"
## [43] "Dolittle" "Run"
## [45] "The Old Guard" "Minari"
## [47] "The Croods: A New Age" "Extraction"
## [49] "Enola Holmes" "The Invisible Man"
release_years <- webpage_content %>% html_nodes(".text-muted.unbold") %>% html_text()
# Keep only the parenthesized year, e.g. "(2020)"; this drops the movie part numbers extracted along with some release years
release_years <- str_match(release_years, "(\\(\\d+\\))")
# Remove the parentheses and convert to numeric
release_years <- parse_number(release_years[,1])
print(release_years)
## [1] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2021 2020 2020
## [16] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020
## [31] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020
## [46] 2020 2020 2020 2020 2020
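The year clean-up can be tried on a few sample strings without touching the website. The input strings below are made up for illustration (they are not taken from the page); they imitate entries where an extra parenthesized marker precedes the year.

```r
library(stringr)
library(readr)

# Hypothetical raw strings, imitating what the year selector may return
raw_years <- c("(2020)", "(I) (2020)", "(II) (2021)")

# str_match() returns a matrix; column 1 holds the full match, e.g. "(2020)".
# The pattern requires digits inside the parentheses, so "(I)" is skipped.
year_match <- str_match(raw_years, "(\\(\\d+\\))")

# parse_number() drops the parentheses and returns a numeric vector
clean_years <- parse_number(year_match[, 1])
print(clean_years)
```

Because the pattern only accepts digits between the parentheses, any non-numeric marker is ignored rather than removed in a separate step.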
Next, extract the movie’s runtime and genre.
run_time <- webpage_content %>% html_nodes(".runtime") %>% html_text()
#Remove the " min" that every run time character string ends with
run_time <- run_time %>% parse_number()
print(run_time)
## [1] 118 118 107 114 97 112 150 97 113 151 109 117 94 109 107 100 90 105 86
## [20] 113 90 138 91 102 116 95 137 103 90 160 119 117 115 94 100 100 118 106
## [39] 136 97 117 111 101 90 125 115 95 116 123 124
genre <- webpage_content %>% html_nodes(".genre") %>% html_text()
#Trim white spaces
genre <- str_trim(genre)
print(genre)
## [1] "Biography, Drama, History" "Action, Adventure, Drama"
## [3] "Horror, Mystery, Thriller" "Drama, Romance"
## [5] "Drama, Horror, Sci-Fi" "Drama, History, Thriller"
## [7] "Action, Sci-Fi, Thriller" "Drama"
## [9] "Crime, Drama, Thriller" "Action, Adventure, Fantasy"
## [11] "Action, Adventure, Comedy" "Comedy, Drama"
## [13] "Action, Mystery, Sci-Fi" "Action, Adventure, Comedy"
## [15] "Drama" "Drama"
## [17] "Fantasy, Horror" "Drama, Romance"
## [19] "Comedy, Crime, Drama" "Crime, Drama, Thriller"
## [21] "Action, Horror, Thriller" "Crime, Drama, Thriller"
## [23] "Adventure, Comedy, Music" "Comedy, Horror, Thriller"
## [25] "Action, Comedy, Drama" "Action, Horror, Sci-Fi"
## [27] "Horror, Mystery, Thriller" "Action, Adventure, Fantasy"
## [29] "Comedy, Fantasy, Mystery" "Biography, Drama, History"
## [31] "Action, Drama, Thriller" "Crime, Drama, Mystery"
## [33] "Action, Adventure, Drama" "Action, Horror, Mystery"
## [35] "Animation, Adventure, Comedy" "Horror, Thriller"
## [37] "Comedy, Crime, Thriller" "Adventure, Comedy, Family"
## [39] "Comedy, Drama" "Drama, Fantasy, Horror"
## [41] "Animation, Action, Adventure" "Drama, Mystery, Thriller"
## [43] "Adventure, Comedy, Family" "Mystery, Thriller"
## [45] "Action, Adventure, Fantasy" "Drama"
## [47] "Animation, Adventure, Comedy" "Action, Thriller"
## [49] "Action, Adventure, Crime" "Drama, Horror, Mystery"
Next, extract the IMDB ratings and Metascores. A Metascore is a weighted average of published critic reviews, so it indicates how well critics received a movie, whereas the IMDB rating comes from user ratings.
imdb_rating <- webpage_content %>% html_nodes(".ratings-imdb-rating strong") %>% html_text()
#Convert from character to numeric data type
imdb_rating <- as.numeric(imdb_rating)
print(imdb_rating)
## [1] 6.8 6.8 6.9 3.3 7.3 7.1 7.4 8.3 7.5 5.4 7.0 7.8 6.8 6.1 7.4 6.6 5.4 5.2 6.5
## [20] 6.7 6.5 7.1 6.0 6.4 7.6 5.8 6.1 5.3 7.4 8.4 6.4 6.9 5.7 5.3 8.1 6.1 6.3 5.3
## [39] 7.1 4.4 8.3 5.5 5.6 6.7 6.7 7.5 7.0 6.7 6.6 7.1
metascore <- webpage_content %>% html_nodes(".metascore") %>% html_text()
#Remove white spaces and convert to numeric data type
metascore <- str_trim(metascore) %>% as.numeric()
print(metascore)
## [1] 66 73 68 71 65 69 88 73 60 63 79 56 60 93 52 14 76 63 50 55 65 67 81 48 47
## [26] 83 90 64 69 66 43 83 66 47 67 54 75 31 26 67 70 89 56 56 68 72
Lastly, extract the user votes.
user_votes <- webpage_content %>% html_nodes(".sort-num_votes-visible span:nth-child(2)") %>% html_text()
#Convert from character to numeric data type
user_votes <- parse_number(user_votes)
print(user_votes)
## [1] 5355 72896 4925 66040 153863 32225 418431 93712 125011 229387
## [11] 106266 107225 47181 207620 125237 2689 3509 24665 5948 20267
## [21] 88475 111914 39924 42661 25464 71558 18000 45928 129059 73104
## [31] 98854 15191 135046 65411 276630 65 113495 32024 52251 10277
## [41] 34922 10977 54458 58406 145233 59219 32664 178330 148655 191426
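The vote counts are parsed with parse_number() rather than as.numeric(). A minimal sketch of the difference follows, assuming the page shows vote counts with comma grouping; the sample strings are hypothetical.

```r
library(readr)

# Hypothetical vote strings in a comma-grouped form
raw_votes <- c("5,355", "418,431", "65")

# as.numeric() cannot read the grouping commas and yields NA (with a warning)
naive_votes <- suppressWarnings(as.numeric(raw_votes))

# parse_number() treats "," as a grouping mark in its default locale
clean_votes <- parse_number(raw_votes)

print(naive_votes)
print(clean_votes)
```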
The titles, years, runtimes, genres, IMDB ratings, Metascores, and votes have now been obtained from the IMDB web page. These vectors need to be combined into one data frame, but doing so directly causes an error because the vectors are not all the same length: the metascore vector has 46 values while the others have 50, since some movies have no Metascore. This can be handled by adding NA values to the vector where Metascores are missing. Inspecting the web page shows that movies 4, 17, 27, and 36 do not have a Metascore.
Base R's append() inserts values after only a single position per call, so a small helper function is written to insert NA values at several positions at once:
append_vector <- function(vector, inserted_indices, values){
  ## Current indices of the vector
  vector_current_indices <- seq_along(vector)
  ## Add a small offset (between 0 and 0.9) to each of the `inserted_indices`;
  ## the first offset is 0, and order() leaves that tie in its original order
  new_inserted_indices <- inserted_indices + seq(0, 0.9, length.out = length(inserted_indices))
  ## Append the `new_inserted_indices` to the current vector indices
  indices <- c(vector_current_indices, new_inserted_indices)
  ## Order the indices
  ordered_indices <- order(indices)
  ## Append the new values to the existing vector, recycling `values`
  ## so that there is one value per inserted index
  new_vector <- c(vector, rep_len(values, length(inserted_indices)))
  ## Reorder the new vector according to the ordered indices
  new_vector[ordered_indices]
}
The function takes three arguments: the vector into which values are inserted, the indices after which the values are inserted, and the value(s) to insert.
Each insertion index is increased by a small offset (less than 1), so that it sorts after the index it follows and before the next index. The values are appended to the end of the original vector, and the combined vector is then reordered according to the sorted index vector, which now includes non-whole numbers.
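The fractional-index idea is easier to follow on a small vector. The sketch below uses a condensed, hypothetical variant named append_vector_sketch with a fixed offset of 0.5 (an assumption made for brevity, not the graded offsets described above); the toy data is made up.

```r
# Condensed variant of the fractional-index idea, for illustration only
append_vector_sketch <- function(vector, inserted_indices, values) {
  # Fractional positions sort just after the indices they follow
  new_positions <- inserted_indices + 0.5
  all_positions <- c(seq_along(vector), new_positions)
  # Recycle `values` so there is one value per insertion point
  combined <- c(vector, rep_len(values, length(inserted_indices)))
  combined[order(all_positions)]
}

x <- c(10, 20, 30, 40)

# Insert NA after the 1st and 3rd elements of x
result <- append_vector_sketch(x, c(1, 3), NA)
print(result)

# To make NAs land at final positions 2 and 5, subtract the number of
# insertions made up to each one: after indices 2 - 1 = 1 and 5 - 2 = 3
target_positions <- c(2, 5)
insert_after <- target_positions - seq_along(target_positions)
print(insert_after)
```

The second part matters when the missing entries are known by their position in the final, complete list: every earlier insertion shifts the later positions by one, so the insertion indices must be expressed relative to the shorter original vector.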
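Inserting NA values into the Metascore vector
# Movies 4, 17, 27, and 36 lack a Metascore. Each earlier insertion shifts the later
# positions by one, so the NAs go after indices 3, 15, 24, and 32 of the original vector.
new_metascore <- append_vector(metascore, c(3, 15, 24, 32), NA)
print(new_metascore)
## [1] 66 73 68 NA 71 65 69 88 73 60 63 79 56 60 93 52 NA 14 76 63 50 55 65 67 81
## [26] 48 NA 47 83 90 64 69 66 43 83 NA 66 47 67 54 75 31 26 67 70 89 56 56 68 72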
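Manual inspection of the page can be avoided by extracting the Metascore once per result item, so that items without a score yield NA automatically. The sketch below runs on made-up HTML and assumes that each movie sits in a container with class lister-item; that container class is an assumption, not something confirmed above.

```r
library(rvest)

# Made-up HTML imitating three result items; the second has no Metascore
sample_page <- read_html('<html><body>
  <div class="lister-item"><span class="metascore"> 66 </span></div>
  <div class="lister-item"><span class="other">no score</span></div>
  <div class="lister-item"><span class="metascore"> 73 </span></div>
</body></html>')

# html_nodes() finds every item; html_node() then returns exactly one result
# per item, with a missing node where the selector does not match
items <- html_nodes(sample_page, ".lister-item")
scores <- items %>% html_node(".metascore") %>% html_text(trim = TRUE)

# Missing nodes become NA, so the vector stays aligned with the items
metascore_aligned <- as.numeric(scores)
print(metascore_aligned)
```

With this pattern the result always has one entry per item, so it can be combined with the other vectors without inserting NA values by hand.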
Now, we can combine the vectors to form a data frame.
imdb_df <- tibble(
  `Movie Title` = movie_titles,
  `Release Year` = release_years,
  `Run Time(mins)` = run_time,
  Genre = genre,
  # floor() bins the ratings into whole numbers so the box plot can group by them
  `IMDB Rating` = floor(imdb_rating),
  `Meta Score` = new_metascore,
  `User Votes` = user_votes
)
The box plot groups the movies by whole-number IMDB rating and suggests a relationship between user votes and IMDB rating. The movies with the lowest user vote counts tend to have the lowest IMDB ratings, and the movies with the highest user vote counts tend to have the highest IMDB ratings.
Movies that have an IMDB rating of more than 6 have lower user votes.
ggplot(data = imdb_df, aes(x = `IMDB Rating`, y = `User Votes`, group = `IMDB Rating`)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::comma) +
  ggtitle("Correlation between User Votes and IMDB Rating")
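A box plot of binned ratings only hints at the relationship, so a single correlation coefficient can complement it. The sketch below is not part of the original analysis: it copies the 50 ratings and vote counts from the printed output above into two hypothetical vectors and computes Spearman's rank correlation, chosen here as an assumption because vote counts are heavily skewed.

```r
# Values copied from the printed imdb_rating output above
rating_values <- c(
  6.8, 6.8, 6.9, 3.3, 7.3, 7.1, 7.4, 8.3, 7.5, 5.4,
  7.0, 7.8, 6.8, 6.1, 7.4, 6.6, 5.4, 5.2, 6.5, 6.7,
  6.5, 7.1, 6.0, 6.4, 7.6, 5.8, 6.1, 5.3, 7.4, 8.4,
  6.4, 6.9, 5.7, 5.3, 8.1, 6.1, 6.3, 5.3, 7.1, 4.4,
  8.3, 5.5, 5.6, 6.7, 6.7, 7.5, 7.0, 6.7, 6.6, 7.1
)

# Values copied from the printed user_votes output above
vote_counts <- c(
  5355, 72896, 4925, 66040, 153863, 32225, 418431, 93712, 125011, 229387,
  106266, 107225, 47181, 207620, 125237, 2689, 3509, 24665, 5948, 20267,
  88475, 111914, 39924, 42661, 25464, 71558, 18000, 45928, 129059, 73104,
  98854, 15191, 135046, 65411, 276630, 65, 113495, 32024, 52251, 10277,
  34922, 10977, 54458, 58406, 145233, 59219, 32664, 178330, 148655, 191426
)

# Spearman's rho compares ranks, so a few very large vote counts
# do not dominate the result
rho <- cor(rating_values, vote_counts, method = "spearman")
print(rho)
```

A value near 1 would support the reading that higher-rated movies attract more votes, a value near 0 would suggest little monotonic relationship, and a negative value would point the other way.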
In scraping the IMDB web page for feature films of 2020, we set out to understand whether there is any correlation between user votes (popularity among viewers) and IMDB rating. The data was cleaned and combined into a data frame, and then visualized to show user engagement with IMDB for the top 50 movies of 2020.