Abstract

This project analyzes the IMDb Top 250 Movies dataset to explore trends in ratings, genre distributions, and the relationships between runtime, release decade, budget, and IMDb rating. It answers a set of key business questions in order to identify patterns that contribute to a movie’s success and to provide insight into global cinema trends.

In addition to analyzing movie characteristics, this project examines the top directors and actors who frequently appear in the IMDb Top 250 list, highlighting the most impactful contributors to cinema.

Introduction

The IMDb Top 250 Movies dataset offers a glimpse into highly rated films worldwide. By analyzing metadata such as ratings, genres, release years, budgets, and runtimes, this project explores key factors contributing to a movie’s success. The study focuses on understanding:

How movie characteristics like runtime, budget, and release decade influence IMDb ratings.
How genres have evolved over time.
Which genres have the highest percentage of “Excellent” movies.

In addition, this study identifies the directors and actors who appear most frequently in the IMDb Top 250 list, pointing to individuals whose work has consistently shaped the landscape of highly rated films.

Intro to the Data

library(tidyverse)
imdb_data <- read_csv("IMDB Top 250 Movies.csv")

glimpse(imdb_data)

## Rows: 250
## Columns: 13
## $ rank        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ name        <chr> "The Shawshank Redemption", "The Godfather", "The Dark Kni…
## $ year        <dbl> 1994, 1972, 2008, 1974, 1957, 1993, 2003, 1994, 2001, 1966…
## $ rating      <dbl> 9.3, 9.2, 9.0, 9.0, 9.0, 9.0, 9.0, 8.9, 8.8, 8.8, 8.8, 8.8…
## $ genre       <chr> "Drama", "Crime,Drama", "Action,Crime,Drama", "Crime,Drama…
## $ certificate <chr> "R", "R", "PG-13", "R", "Approved", "R", "PG-13", "R", "PG…
## $ run_time    <chr> "2h 22m", "2h 55m", "2h 32m", "3h 22m", "1h 36m", "3h 15m"…
## $ tagline     <chr> "Fear can hold you prisoner. Hope can set you free.", "An …
## $ budget      <chr> "25000000", "6000000", "185000000", "13000000", "350000", …
## $ box_office  <chr> "28884504", "250341816", "1006234167", "47961919", "955", …
## $ casts       <chr> "Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,Clan…
## $ directors   <chr> "Frank Darabont", "Francis Ford Coppola", "Christopher Nol…
## $ writers     <chr> "Stephen King,Frank Darabont", "Mario Puzo,Francis Ford Co…

summary(imdb_data)

##       rank            name                year          rating     
##  Min.   :  1.00   Length:250         Min.   :1921   Min.   :8.000  
##  1st Qu.: 63.25   Class :character   1st Qu.:1966   1st Qu.:8.100  
##  Median :125.50   Mode  :character   Median :1994   Median :8.200  
##  Mean   :125.50                      Mean   :1986   Mean   :8.307  
##  3rd Qu.:187.75                      3rd Qu.:2006   3rd Qu.:8.400  
##  Max.   :250.00                      Max.   :2022   Max.   :9.300  
##     genre           certificate          run_time           tagline         
##  Length:250         Length:250         Length:250         Length:250        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     budget           box_office           casts            directors        
##  Length:250         Length:250         Length:250         Length:250        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    writers         
##  Length:250        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

This dataset was sourced from Kaggle, uploaded by user rajugc, and last updated in 2023.
It contains 250 rows and 13 columns, where each row represents a single movie.
The dataset spans a wide time frame, with release years running from 1921 to 2022, and covers films from around the world.
Key variables include movie titles, genres, release years, IMDb ratings, budgets, and runtimes.

Data Cleaning / Preparation

# Count true NA values (text placeholders such as "not available" are not counted here)
sum(is.na(imdb_data))

## [1] 0
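
A result of zero only shows that no cell is a true NA. The budget cleaning step in this section checks for the string "not available", which suggests that some gaps are stored as text and are therefore invisible to is.na(). The following is a minimal sketch of counting such placeholders per column. The toy tibble and the set of placeholder strings are assumptions for illustration, not values taken from the dataset.

```r
library(dplyr)
library(tibble)

# Toy stand-in for imdb_data; the values are made up for illustration.
toy <- tibble(
  budget   = c("25000000", "Not Available", ""),
  run_time = c("2h 22m", "Not Available", "2h")
)

# Assumed placeholder strings; adjust to whatever the file actually contains.
placeholders <- c("", "not available")

# Count placeholder strings in every character column, ignoring case
# and surrounding whitespace.
placeholder_counts <- toy %>%
  summarize(across(
    where(is.character),
    ~ sum(tolower(trimws(.x)) %in% placeholders)
  ))

print(placeholder_counts)
```

Running the same summarize() call on imdb_data would show which character columns need explicit handling before conversion to numeric.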

# Create a new variable: Decade
imdb_data <- imdb_data %>%
  mutate(decade = floor(year / 10) * 10)

# Binning ratings into categories
imdb_data <- imdb_data %>%
  mutate(rating_category = case_when(
    rating >= 9.0 ~ "Excellent",
    rating >= 8.5 ~ "Great",
    TRUE ~ "Good"
  ))
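
As a quick check of the two derived variables, the sketch below applies the same decade arithmetic and rating bins to a few made-up values that sit on the bin boundaries. It is an illustration only; the toy values are not taken from the dataset.

```r
library(dplyr)
library(tibble)

# Made-up years and ratings chosen to sit on the bin boundaries.
toy <- tibble(
  year   = c(1921, 1994, 2000, 2019),
  rating = c(8.0, 8.5, 8.9, 9.0)
)

toy <- toy %>%
  mutate(
    # Integer-divide the year by 10, then scale back up: 1994 -> 1990.
    decade = floor(year / 10) * 10,
    # case_when() returns the first matching branch, so the
    # conditions must run from the highest cut-off down.
    rating_category = case_when(
      rating >= 9.0 ~ "Excellent",
      rating >= 8.5 ~ "Great",
      TRUE ~ "Good"
    )
  )

print(toy)
```

Because the first matching branch wins, reversing the order of the two conditions would label every movie rated 8.5 or higher as "Great".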

library(stringr)

# Convert `run_time` into `runtime_minutes`
imdb_data <- imdb_data %>%
  mutate(runtime_minutes = case_when(
    grepl("\\d+h \\d+m", run_time) ~ 
      str_extract(run_time, "\\d+h") %>% 
        str_replace("h", "") %>% 
        as.numeric() * 60 +
      str_extract(run_time, "\\d+m") %>% 
        str_replace("m", "") %>% 
        as.numeric(),
    grepl("\\d+m", run_time) ~ 
      str_extract(run_time, "\\d+m") %>% 
        str_replace("m", "") %>% 
        as.numeric(),
    TRUE ~ NA_real_ # Assign NA for invalid or missing formats
  ))
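
Note that the conversion above returns NA for a runtime that has hours but no minutes, such as "2h", because neither pattern matches it. Whether the file contains such values is not visible in the output shown here. A minimal sketch of a parser that also accepts hours-only strings follows; parse_runtime() is a hypothetical helper, not part of the original analysis.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: convert strings such as "2h 22m", "2h" or "45m"
# to minutes. Anything without an hour or minute component becomes NA.
parse_runtime <- function(x) {
  hours   <- as.numeric(str_extract(x, "\\d+(?=h)"))
  minutes <- as.numeric(str_extract(x, "\\d+(?=m)"))
  if_else(
    is.na(hours) & is.na(minutes),
    NA_real_,
    coalesce(hours, 0) * 60 + coalesce(minutes, 0)
  )
}

# Made-up inputs covering each format.
runtime_check <- parse_runtime(c("2h 22m", "2h", "45m", "Not Available"))
print(runtime_check)
```

It could be used in place of the case_when() block as mutate(runtime_minutes = parse_runtime(run_time)).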

# Clean and convert `budget` to numeric
imdb_data <- imdb_data %>%
  mutate(budget_numeric = case_when(
    !is.na(budget) & budget != "" & budget != "not available" ~ as.numeric(str_replace_all(budget, "[^0-9]", "")),
    TRUE ~ NA_real_ # Assign NA for missing or invalid values
  ))

library(tidyr)

# Split combined genres into individual rows
imdb_genres <- imdb_data %>%
  separate_rows(genre, sep = ",") %>%
  group_by(decade, genre) %>%
  summarize(count = n(), .groups = "drop")

# Split the 'directors' column into individual rows for analysis
directors_data <- imdb_data %>%
  separate_rows(directors, sep = ",")  # Split multiple directors into separate rows

# Split the 'casts' column into individual rows for analysis
actors_data <- imdb_data %>%
  separate_rows(casts, sep = ",")  # Split multiple actors into separate rows
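
To make the effect of separate_rows() concrete, the sketch below runs it on two made-up films. It illustrates the reshaping only; the film names and genres are invented.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Two made-up films, one with two genres and one with a single genre.
toy <- tibble(
  name  = c("Film A", "Film B"),
  genre = c("Crime,Drama", "Drama")
)

# One output row per film-genre pair.
long <- toy %>%
  separate_rows(genre, sep = ",")

# Per-genre frequencies, most common first.
genre_counts <- long %>%
  count(genre, sort = TRUE)

print(long)
print(genre_counts)
```

A film with several genres is counted once per genre, so per-genre totals add up to more than the number of films. If the values had spaces after the commas, sep = ",\\s*" would strip them, since sep is treated as a regular expression.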

# Check cleaned dataset
glimpse(imdb_data)

## Rows: 250
## Columns: 17
## $ rank            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ name            <chr> "The Shawshank Redemption", "The Godfather", "The Dark…
## $ year            <dbl> 1994, 1972, 2008, 1974, 1957, 1993, 2003, 1994, 2001, …
## $ rating          <dbl> 9.3, 9.2, 9.0, 9.0, 9.0, 9.0, 9.0, 8.9, 8.8, 8.8, 8.8,…
## $ genre           <chr> "Drama", "Crime,Drama", "Action,Crime,Drama", "Crime,D…
## $ certificate     <chr> "R", "R", "PG-13", "R", "Approved", "R", "PG-13", "R",…
## $ run_time        <chr> "2h 22m", "2h 55m", "2h 32m", "3h 22m", "1h 36m", "3h …
## $ tagline         <chr> "Fear can hold you prisoner. Hope can set you free.", …
## $ budget          <chr> "25000000", "6000000", "185000000", "13000000", "35000…
## $ box_office      <chr> "28884504", "250341816", "1006234167", "47961919", "95…
## $ casts           <chr> "Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,…
## $ directors       <chr> "Frank Darabont", "Francis Ford Coppola", "Christopher…
## $ writers         <chr> "Stephen King,Frank Darabont", "Mario Puzo,Francis For…
## $ decade          <dbl> 1990, 1970, 2000, 1970, 1950, 1990, 2000, 1990, 2000, …
## $ rating_category <chr> "Excellent", "Excellent", "Excellent", "Excellent", "E…
## $ runtime_minutes <dbl> 142, 175, 152, 202, 96, 195, 201, 154, 178, 178, 142, …
## $ budget_numeric  <dbl> 2.50e+07, 6.00e+06, 1.85e+08, 1.30e+07, 3.50e+05, 2.20…

Modifications Made with Data Cleaning:

Created a new variable “decade” to group movies by decade.
Binned IMDb ratings into categories (Excellent, Great, Good) for easier interpretability.
Transformed run_time into a numeric column (runtime_minutes) for precise analysis.
Cleaned and converted budget into numeric (budget_numeric), handling missing values.
Split combined genre into individual rows to analyze per-genre frequencies.
Split the directors column into individual rows for accurate counting of each director.
Split the casts column into individual rows to analyze actors individually.

Business Questions

How does a movie’s release decade impact its IMDb rating?
Is there a significant relationship between movie runtime and IMDb ratings?
Do higher-budget movies generally receive better ratings?
How have movie genres evolved over time?
Which genres have the highest percentage of “Excellent” movies?
Who are the top directors in the IMDb Top 250 list?
Who are the top actors in the IMDb Top 250 list?

Analysis & Results

Question 1: How does a movie’s release decade impact its IMDb rating?

# Create a summary of average IMDb ratings by decade
rating_by_decade <- imdb_data %>%
  group_by(decade) %>%
  summarize(avg_rating = mean(rating, na.rm = TRUE))  

# Bar chart for average IMDb rating by decade
ggplot(rating_by_decade, aes(x = decade, y = avg_rating, fill = as.factor(decade))) +
  geom_bar(stat = "identity", color = "black") +  
  geom_text(aes(label = round(avg_rating, 2)), vjust = -0.5, size = 3) +  
  scale_fill_brewer(palette = "Set3") +  
  labs(
    title = "Average IMDb Rating by Decade",
    subtitle = "IMDb ratings have remained consistent across decades, with slight variations",
    x = "Decade",
    y = "Average Rating",
    fill = "Decade"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",  
    axis.text.x = element_text(angle = 45, hjust = 1)  # Rotate x-axis labels
  )
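
An average per decade is easier to interpret next to the number of movies it is based on, since a decade with only a few entries can swing on a single title. The sketch below adds a count and a standard deviation to the summary. The ratings are made up for illustration.

```r
library(dplyr)
library(tibble)

# Made-up ratings for two decades.
toy <- tibble(
  decade = c(1950, 1950, 1990, 1990, 1990),
  rating = c(8.1, 8.3, 8.2, 9.3, 8.6)
)

decade_summary <- toy %>%
  group_by(decade) %>%
  summarize(
    n_movies   = n(),                      # how many films back the average
    avg_rating = mean(rating, na.rm = TRUE),
    sd_rating  = sd(rating, na.rm = TRUE), # spread within the decade
    .groups = "drop"
  )

print(decade_summary)
```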

Question 2: Is there a significant relationship between movie runtime and IMDb ratings?

# Relationship between runtime (in minutes) and IMDb ratings
ggplot(imdb_data, aes(x = runtime_minutes, y = rating)) +
  geom_point(alpha = 0.3) +  
  geom_smooth(method = "lm", color = "red", linewidth = 1.2) +  # `linewidth` replaces `size` for lines since ggplot2 3.4.0
  scale_x_continuous(breaks = seq(50, 200, by = 30)) +  
  scale_y_continuous(labels = scales::number_format(accuracy = 0.1)) +  
  labs(
    title = "Relationship Between Runtime and IMDb Rating",
    subtitle = "Longer movies tend to have slightly higher IMDb ratings",
    x = "Runtime (minutes)",
    y = "IMDb Rating",
    caption = "Data Source: IMDb Top 250 Movies"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_point()`).
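
The question asks about significance, which a scatter plot with a fitted line cannot settle on its own. One common way to check is a Pearson correlation test. The sketch below uses base R's cor.test() on made-up values, so the estimate it prints says nothing about the real dataset.

```r
# Made-up runtimes and ratings for illustration; one runtime is missing.
toy <- data.frame(
  runtime_minutes = c(96, 120, 142, 152, 175, 195, 202, NA),
  rating          = c(9.0, 8.1, 9.3, 9.0, 9.2, 9.0, 9.0, 8.5)
)

# The formula interface drops incomplete rows before testing.
result <- cor.test(~ runtime_minutes + rating, data = toy, method = "pearson")

print(result$estimate)  # correlation coefficient
print(result$p.value)   # p-value for the null hypothesis of zero correlation
```

Applied to imdb_data, the same call would report the strength of the linear association and whether it differs reliably from zero.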

Question 3: Do higher-budget movies generally receive better ratings?

# Scatter plot for budget vs IMDb rating
ggplot(imdb_data, aes(x = budget_numeric, y = rating)) +
  geom_point(alpha = 0.4, color = "blue") +  
  geom_smooth(method = "lm", col = "darkgreen", linewidth = 1.2) +  
  scale_x_log10(
    breaks = c(1e6, 1e7, 1e8, 1e9),
    labels = c("1M", "10M", "100M", "1B")
  ) +  
  labs(
    title = "Relationship Between Budget and IMDb Rating",
    subtitle = "Higher budgets have a weak positive correlation with IMDb ratings",
    x = "Budget (Log Scale)",
    y = "IMDb Rating",
    caption = "Data Source: IMDb Top 250 Movies"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  
    plot.title = element_text(size = 14, face = "bold"),  
    plot.subtitle = element_text(size = 12, color = "gray40")  
  )

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 39 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 39 rows containing missing values or values outside the scale range
## (`geom_point()`).
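
Because the plot uses a log scale for budget, a matching numeric check is a linear model of rating on log10(budget). The sketch below fits one on made-up values; it illustrates the approach and does not report results for the real data. It also assumes the budgets are comparable figures, which may not hold across decades or currencies.

```r
# Made-up budgets and ratings for illustration; one budget is missing.
toy <- data.frame(
  budget_numeric = c(3.5e5, 6e6, 1.3e7, 2.5e7, 1.85e8, NA),
  rating         = c(9.0, 9.2, 9.0, 9.3, 9.0, 8.4)
)

# lm() drops rows with missing values by default.
fit <- lm(rating ~ log10(budget_numeric), data = toy)

# The slope is the expected change in rating per tenfold increase in budget.
print(coef(summary(fit)))
```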

Question 4: How have movie genres evolved over time?

# Identify the five most frequent genres overall
top_genres <- imdb_genres %>%
  group_by(genre) %>%
  summarize(total_count = sum(count)) %>%
  slice_max(total_count, n = 5)

# Keep only those genres, then fix the legend order (reverse alphabetical).
# The factor is set on the filtered data so that it affects the plot.
imdb_genres_filtered <- imdb_genres %>%
  filter(genre %in% top_genres$genre) %>%
  mutate(genre = factor(genre, levels = rev(sort(unique(genre)))))

# Stacked bar chart for top genres by decade
ggplot(imdb_genres_filtered, aes(x = decade, y = count, fill = genre)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
    title = "Top Genres Distribution Over Decades",
    subtitle = "Drama dominates across decades, while Action and Adventure surge post-1970s",
    x = "Decade",
    y = "Number of Movies",
    fill = "Genre"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 8),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Question 5: Which genres have the highest percentage of “Excellent” movies?

# Calculate the percentage of "Excellent" movies for each genre
genre_rating_percentage <- imdb_data %>%
  separate_rows(genre, sep = ",") %>%  
  filter(rating_category == "Excellent") %>%  
  group_by(genre) %>%
  summarize(excellent_count = n(), .groups = "drop") %>%  
  left_join(
    imdb_data %>%
      separate_rows(genre, sep = ",") %>%
      group_by(genre) %>%
      summarize(total_count = n(), .groups = "drop"), 
    by = "genre"
  ) %>%
  mutate(percentage = (excellent_count / total_count) * 100) %>%  
  arrange(desc(percentage))

# Plot the percentage of Excellent movies by genre
ggplot(genre_rating_percentage, aes(x = reorder(genre, percentage), y = percentage, fill = percentage)) +
  geom_bar(stat = "identity", color = "black") +  
  scale_fill_gradient(low = "lightblue", high = "blue") +  
  labs(
    title = "Percentage of Excellent Movies by Genre",
    subtitle = "History leads in the 'Excellent' category, followed by Crime and Action genres",
    x = "Genre",
    y = "Percentage of Movies",
    fill = "Percentage"
  ) +
  coord_flip() +  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),  
    plot.subtitle = element_text(size = 12, color = "gray40"),  
    legend.position = "none"  
  )
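
The same percentage can be computed in a single grouped summary, which also keeps genres that have no "Excellent" movies and shows the denominator behind each share. This is an alternative sketch on made-up rows, not the code used for the figure.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up films for illustration.
toy <- tibble(
  genre           = c("Crime,Drama", "Drama", "Comedy"),
  rating_category = c("Excellent", "Good", "Good")
)

genre_share <- toy %>%
  separate_rows(genre, sep = ",") %>%
  group_by(genre) %>%
  summarize(
    total_count     = n(),
    excellent_count = sum(rating_category == "Excellent"),
    # The mean of a logical vector is the share of TRUE values.
    percentage      = 100 * mean(rating_category == "Excellent"),
    .groups = "drop"
  ) %>%
  arrange(desc(percentage))

print(genre_share)
```

In this toy example Crime reaches 100% on the strength of one film, which is why total_count is worth reading alongside the percentage.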

Question 6: Who are the top directors in the IMDb Top 250 list?

# Analyze the top directors
top_directors <- imdb_data %>%
  separate_rows(directors, sep = ",") %>%  
  group_by(directors) %>%
  summarize(movie_count = n(), .groups = "drop") %>%  
  arrange(desc(movie_count)) %>%  
  slice_head(n = 10)

# Plot the top 10 directors
ggplot(top_directors, aes(x = reorder(directors, movie_count), y = movie_count, fill = movie_count)) +
  geom_bar(stat = "identity", color = "black") +  
  scale_fill_gradient(low = "lightblue", high = "steelblue") +  
  labs(
    title = "Top 10 Directors by Number of Movies in IMDb Top 250",
    subtitle = "Steven Spielberg leads the list, followed by Stanley Kubrick and Martin Scorsese",
    x = "Director",
    y = "Number of Movies",
    fill = "Movies Directed"
  ) +
  coord_flip() +  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),  
    plot.subtitle = element_text(size = 12, color = "gray40"),  
    legend.position = "none"  
  )

Question 7: Who are the top actors in the IMDb Top 250 list?

# Analyze the top actors
top_actors <- imdb_data %>%
  separate_rows(casts, sep = ",") %>%  
  group_by(casts) %>%
  summarize(movie_count = n(), .groups = "drop") %>%  
  arrange(desc(movie_count)) %>%  
  slice_head(n = 10)

# Plot the top 10 actors 
ggplot(top_actors, aes(x = reorder(casts, movie_count), y = movie_count, fill = movie_count)) +
  geom_bar(stat = "identity", color = "black") +  
  scale_fill_gradient(low = "pink", high = "red") +  
  labs(
    title = "Top 10 Actors by Number of Movies in IMDb Top 250",
    subtitle = "Robert De Niro leads the list, followed by Morgan Freeman and John Ratzenberger",
    x = "Actor",
    y = "Number of Movies",
    fill = "Movies Acted"
  ) +
  coord_flip() +  
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),  
    plot.subtitle = element_text(size = 10, color = "gray40"),  
    legend.position = "none"  
  )
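
When several people share the same number of films, slice_head(n = 10) cuts the list at an arbitrary point among the tied names. A sketch of a tie-aware alternative using slice_max() is shown below on made-up cast lists; the same pattern would apply to the directors ranking.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up cast lists for illustration.
toy <- tibble(
  name  = c("Film A", "Film B", "Film C"),
  casts = c("Actor One,Actor Two", "Actor One,Actor Three", "Actor One")
)

actor_counts <- toy %>%
  separate_rows(casts, sep = ",") %>%
  count(casts, name = "movie_count", sort = TRUE)

# slice_head() keeps exactly two rows, dropping one of the tied actors.
top_fixed <- actor_counts %>% slice_head(n = 2)

# slice_max() keeps every row tied at the cut-off (with_ties = TRUE by default).
top_with_ties <- actor_counts %>% slice_max(movie_count, n = 2)

print(top_fixed)
print(top_with_ties)
```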

Conclusion

This analysis reveals:

IMDb ratings have been consistently high across all decades, ranging from 8.2 to 8.4.
Longer runtimes and higher budgets show a weak positive association with ratings, but their impact is limited.
Drama dominates the IMDb Top 250, while Action and Adventure have grown in recent decades.
Genres like Western and Film-Noir were more prominent in older decades but have since declined in popularity.
The History, Crime, and Action genres have the highest share of “Excellent” movies, reflecting how strongly IMDb users rate the top films in these genres.
Directors like Christopher Nolan, Steven Spielberg, and Martin Scorsese dominate the IMDb Top 250 list, consistently creating highly rated and impactful films that have shaped the world of cinema.
Actors like Morgan Freeman, Leonardo DiCaprio, and Robert De Niro are among the most frequently featured in the IMDb Top 250, reflecting their significant contributions to critically acclaimed movies.

These insights offer guidance for filmmakers and producers by highlighting the evolving dynamics of cinema, the factors associated with a movie’s success, and the directors and actors who appear most often among the highest-rated films.

IMDb Top 250 Movies Analysis

Mathias Schilbred-Eriksen

2024-12-03