IMDb Top 250 Movies Web Scraping

Author

Aaryan Bhatta

Introduction

Movies have been an important form of entertainment for many decades. Whether fictional or non-fictional, they have inspired and affected many people’s lives. What makes a movie great may come down to its plot alone or to more complex factors. Over the decades, reputable sources have critiqued and ranked movies on various criteria. Everyone is bound to have their own opinions when it comes to ranking movies, shows or more general subjects, but at the end of the day we would like to know why certain movies rank higher than others that we prefer.

Hence, this analysis looks into the Internet Movie Database (IMDb) website and explores which factors contribute to the rankings of the Top 250 Movies Ranked All Time.

Research Questions

I will conduct my analysis by first developing research questions about the factors that may contribute to the movies’ rankings. The questions are as follows:

  • Are there certain genres that are more highly rated than others?

  • Are newer movies preferred over older movies?

  • Do movies with more votes have higher ratings?

  • Does the reputation of directors influence rankings?

  • Is there a relationship between a movie’s run time and its rating?

Libraries Setup

To conduct our analysis, we first need the following libraries:

  • tidyverse - A collection of packages that makes data analysis and data cleaning easier

  • httr - Lets us send HTTP requests to websites and authenticate with them

  • rvest - Gives us the tools to scrape HTML and XML

  • magrittr - Provides the pipe operator (%>%), which makes chaining operations easier, including inside loops

Data Collection

To answer these research questions, we are going to scrape data from the Top 250 Movies Ranked All Time page. The values that we are going to extract are as follows:

  • Title = The name of the movie

  • Year = The year the movie was released

  • Director = The director of the movie

  • Age Rating = The age restriction of the movie

  • Star Rating = The rating of the movie out of 10

  • Duration = The overall run time of the movie

  • Vote Count = The number of times the movie was rated

We will use a function that extracts the values listed above and iterates through the movies’ individual pages to grab some of these values.
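
The extraction step could look roughly like the sketch below, assuming rvest is installed. The HTML snippet, the CSS classes (`.movie`, `.title`, `.meta`, `.rating`, `.votes`), the placeholder values and the function name `extract_movies` are all hypothetical stand-ins; the real IMDb page uses different markup, so the selectors would have to be adapted after inspecting the page.

```r
library(rvest)

# Hypothetical markup standing in for a downloaded page. For the live
# site, the page object would instead come from read_html() on the chart URL.
page <- minimal_html('
  <ul>
    <li class="movie">
      <a class="title" href="/title/example-1/">Example Movie A</a>
      <span class="meta">1994 2h 30m R</span>
      <span class="rating">9.0</span>
      <span class="votes">(3M)</span>
    </li>
    <li class="movie">
      <a class="title" href="/title/example-2/">Example Movie B</a>
      <span class="meta">1972 2h 55m R</span>
      <span class="rating">8.5</span>
      <span class="votes">(284K)</span>
    </li>
  </ul>')

# Pull one value per movie node and return the result as a data frame.
extract_movies <- function(page) {
  nodes <- html_elements(page, "li.movie")
  data.frame(
    Title    = html_text2(html_element(nodes, ".title")),
    Link     = html_attr(html_element(nodes, ".title"), "href"),
    Metadata = html_text2(html_element(nodes, ".meta")),
    Rating   = html_text2(html_element(nodes, ".rating")),
    Votes    = html_text2(html_element(nodes, ".votes")),
    stringsAsFactors = FALSE
  )
}

movies_raw <- extract_movies(page)
```

The `Link` column is what a loop over the individual movie pages would use: each link would be read in turn and the remaining values, such as the director, extracted from that page in the same way.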

Data Wrangling

After scraping the website, we need to clean up some of the values so that we are able to conduct analyses on the data. Some of the values that we received are not the correct data type and some of them are not in the correct format.

For example, the year the movie was released, the duration of the movie and the age rating of the movie were all extracted into a single vector, so we need to separate them into their own respective values. From there, we need to change the format of the duration so that it represents the total run time in minutes (150, for example, instead of 2h 30m). Similarly, the vote count values are listed in abbreviated form, such as (3M) or (284K), and we need to convert them to their actual numeric values.

Another part we need to look at is genres. The genres we extracted from the website also contain very specific genres pertaining to each movie, and we do not necessarily need these. So we will standardize the genres that we want to this list:
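
These conversions can be sketched in base R as follows. The function names and the assumed layout of the combined value (a string such as "1994 2h 30m R") are illustrative assumptions, not the exact form the scraper returned.

```r
# Return the number that precedes a unit letter ("h" or "m"), or 0 if absent.
number_before <- function(x, unit) {
  pattern <- paste0("^.*?([0-9]+)", unit, ".*$")
  out <- rep(0, length(x))
  hit <- grepl(pattern, x, perl = TRUE)
  out[hit] <- as.numeric(sub(pattern, "\\1", x[hit], perl = TRUE))
  out
}

# Convert durations such as "2h 30m", "2h" or "45m" to total minutes.
duration_to_minutes <- function(x) {
  number_before(x, "h") * 60 + number_before(x, "m")
}

# Convert abbreviated counts such as "(3M)" or "(284K)" to plain numbers.
votes_to_number <- function(x) {
  clean <- toupper(gsub("[()[:space:]]", "", x))
  value <- as.numeric(sub("[KM]$", "", clean))
  multiplier <- ifelse(grepl("M$", clean), 1e6,
                       ifelse(grepl("K$", clean), 1e3, 1))
  value * multiplier
}

# Split a combined string into year, duration and age rating.
# Assumed layout: four-digit year, then duration, then age rating.
split_metadata <- function(x) {
  pattern <- "^([0-9]{4})\\s+((?:[0-9]+h)?\\s*(?:[0-9]+m)?)\\s*(.*)$"
  data.frame(
    Year      = as.integer(sub(pattern, "\\1", x, perl = TRUE)),
    Duration  = trimws(sub(pattern, "\\2", x, perl = TRUE)),
    AgeRating = trimws(sub(pattern, "\\3", x, perl = TRUE)),
    stringsAsFactors = FALSE
  )
}

duration_to_minutes("2h 30m")          # 150
votes_to_number(c("(3M)", "(284K)"))   # 3000000 and 284000
```

If the scraped values arrive as a flat vector with three entries per movie rather than one string, reshaping with `matrix(x, ncol = 3, byrow = TRUE)` would be the simpler route.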

“Action”, “Adventure”, “Animation”, “Biography”, “Comedy”, “Crime”, “Drama”, “Epic”, “Family”, “Fantasy”, “History”, “Horror”, “Mystery”, “Romance”, “Sci-Fi”, “Thriller”, “War”, “Western”.
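
One way to apply this list is sketched below in base R. It assumes each movie’s genres are stored as a single comma-separated string; the function name `standardize_genres` and the sample input are hypothetical.

```r
# The standard genres listed above.
standard_genres <- c("Action", "Adventure", "Animation", "Biography",
                     "Comedy", "Crime", "Drama", "Epic", "Family",
                     "Fantasy", "History", "Horror", "Mystery", "Romance",
                     "Sci-Fi", "Thriller", "War", "Western")

# Keep only the standard genres from each comma-separated genre string.
standardize_genres <- function(genres) {
  vapply(strsplit(genres, ","), function(parts) {
    parts <- trimws(parts)
    paste(parts[parts %in% standard_genres], collapse = ", ")
  }, character(1))
}

# Hypothetical input containing one overly specific genre.
standardize_genres("Drama, Prison Drama, Crime")   # "Drama, Crime"
```

The result stays comma-separated, so it can still be split into one row per genre later in the analysis.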

Data Analysis

After cleaning, the data is in a data frame format and ready to be used. We will import the data set from a cloud source and begin our analysis by working through our research questions.

Do Higher Vote Counts Equal Higher Star Ratings?

We plotted box plots that show the distribution of movies’ vote counts for each star rating. Based on these findings, there is not much difference in vote counts across the star ratings. The majority of these box plots span roughly the same range of vote counts, which shows that star ratings are not really connected to how many people voted. Although there are fewer and fewer movies at each star rating as the ratings increase, I initially expected some increasing trend that would show vote count to be a factor affecting the rankings. However, this is not the case; hence, vote count may not be a factor that contributes to the rankings on the IMDb Top 250 website.
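
A plot of this kind could be produced along the lines of the sketch below, assuming ggplot2 is installed. The column names `Star_Rating` and `Vote_Count` are assumptions about the cleaned data frame, and the rank correlation is an optional numeric check added here for illustration, not part of the original analysis.

```r
library(ggplot2)

# Box plot of vote count for each star rating.
# Star_Rating and Vote_Count are assumed column names.
plot_votes_by_rating <- function(movies_df) {
  ggplot(movies_df, aes(x = factor(Star_Rating), y = Vote_Count)) +
    geom_boxplot(fill = "lightblue") +
    labs(title = "Vote Count by Star Rating",
         x = "Star Rating",
         y = "Vote Count")
}

# Spearman rank correlation as a numeric companion to the plot:
# values near 0 indicate little monotonic relationship.
votes_rating_correlation <- function(movies_df) {
  cor(movies_df$Vote_Count, movies_df$Star_Rating, method = "spearman")
}
```

Treating the star rating as a factor gives one box per distinct rating, which suits a variable that takes only a handful of values in the Top 250.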

Does Duration Influence Star Ratings?

We plotted box plots that show the distribution of movies’ durations for each star rating. Based on our findings, there is an increasing trend between Duration and Star Ratings. This suggests that movies that run at least 2 hours are highly rated by the public. Not every movie will be rated highly because of its duration, since there are other variables to consider, such as the quality of the plot, pacing, acting and much more. However, this can suggest that longer films, which have room for deeper narratives and character development, can lead to higher audience satisfaction.

Are New Movies More Preferred than Older Movies?

To answer this question, we plotted two graphs. First, we wanted to see how many movies come from each decade, to get a good idea of which decades are represented most often; there are more movies from the 2000s and 2010s than from other decades. From there, we looked at how movies were rated in each decade on average. The average star ratings rise with the decades at first and begin to fluctuate past 1950; however, the linear regression still tells us that star ratings increase as the decades increase.

Directors’ Influence on Rankings

To see whether a director’s reputation has an effect on the ranking, we plotted two bar graphs. First, we wanted to see the top ten directors by the number of movies they directed. From there, we looked at the weighted average star ratings for these directors to see if there was any influence. Based on our findings, the bars all fall within a similar range of weighted averages. This means that regardless of who directs these films, the director’s reputation does not guarantee a high ranking.

Are Certain Genres More Preferred?
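
The decade summary and the regression can be sketched in base R as follows. The column names `Year` and `Star_Rating` and both function names are assumptions made for illustration.

```r
# Count movies and average the star rating within each decade.
summarise_by_decade <- function(movies_df) {
  decade <- floor(movies_df$Year / 10) * 10
  data.frame(
    Decade      = sort(unique(decade)),
    Movies      = as.vector(table(decade)),
    Mean_Rating = as.vector(tapply(movies_df$Star_Rating, decade, mean))
  )
}

# Slope of a linear regression of mean rating on decade;
# a positive slope means ratings rise as the decades increase.
decade_trend <- function(decade_summary) {
  fit <- lm(Mean_Rating ~ Decade, data = decade_summary)
  unname(coef(fit)["Decade"])
}
```

Because the regression is fitted to decade averages, every decade carries equal weight regardless of how many movies it contains; fitting on the individual movies instead would be a reasonable alternative.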
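
The director summary could be computed roughly as in the base R sketch below. The text does not say what the average is weighted by; this sketch assumes the vote count is the weight, and the column names `Director`, `Star_Rating` and `Vote_Count` are likewise assumptions.

```r
# Number of movies and vote-weighted mean rating per director,
# limited to the top_n directors by number of movies.
director_summary <- function(movies_df, top_n = 10) {
  groups <- split(movies_df, movies_df$Director)
  out <- data.frame(
    Director = names(groups),
    Movies = vapply(groups, nrow, integer(1)),
    Weighted_Rating = vapply(
      groups,
      function(g) weighted.mean(g$Star_Rating, g$Vote_Count),
      numeric(1)
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
  out <- out[order(-out$Movies), ]
  head(out, top_n)
}
```

With this weighting, a director’s heavily voted movies count for more than their lesser-known ones, so the average reflects the ratings that most voters actually gave.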

# Separate genres onto a new row for each movie and count the total movies for each genre
genres_count <- movies_df %>%
  separate_rows(Genres, sep = ",") %>%
  mutate(Genres = str_trim(Genres)) %>%
  count(Genres) %>%
  arrange(desc(n))

# Plot the total number of movies for each genre, ordered by count
ggplot(genres_count, aes(x = reorder(Genres, n), y = n)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(title = "Distribution of Genres",
       x = "Genres",
       y = "Number of Movies") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

We plotted a bar graph that shows the total number of movies that belong to each genre. Based on our findings, an overwhelming number of movies have the genre drama. The counts rise only slightly from genre to genre, from Western up to Adventure, but drama far exceeds all of them. This is probably because of drama’s appeal and its reach to a wide audience, making it a popular genre in the movie industry.

Conclusion

This analysis has revealed valuable insights about the factors that influence movies’ positions in the IMDb rankings.

Firstly, we saw that a higher vote count does not mean a higher star rating, which shows that popularity alone does not determine a movie’s star rating. Therefore, vote count is not a direct factor influencing the rankings. Similarly, directors’ reputations do not seem to influence the rankings, which suggests that it is not the director’s reputation that matters but the quality of the movie and the factors that contribute to it.

Another observation is that longer movies tend to have higher star ratings. This probably reflects the idea that the extra time allows for deeper storytelling, character development and more, which leads audiences to rank these movies highly.

Looking at movies over time, we can see that recent movies were preferred over movies from the 1900s. This could reflect the audience’s taste for more recent movies and how film making has changed over time.

Additionally, drama appears to be the dominant genre in the IMDb rankings, which reflects how highly today’s audience regards the genre due to its broad appeal and emotionally driven narratives.

Overall, factors like star ratings, the release year, the duration of the movie and genre all appear to influence the rankings on the IMDb Top 250 Movies website. In contrast, directors’ reputations and the number of people rating the movie do not seem to contribute to the rankings. Going forward, I suggest that more research is required to identify further factors behind the rankings we see on the IMDb website.