IMDb Top 100 Movies

Author

Ronie Ntale

Introduction

This project explores data from the IMDb Top 100 Movies (2025 Edition), originally sourced from the IMDb website and published as a public dataset on Kaggle by Shayan Zakaria.
The dataset includes information such as:
- Movie title
- IMDb rating
- Release year
- Runtime (in minutes)
- Genres

For this analysis, I’ll focus on examining whether there is a relationship between a film’s release year and its IMDb rating, as well as how this relationship might change based on genre. The goal is to see whether newer films tend to be rated higher or lower than older ones, and whether certain genres consistently perform better.

Load the Library

library(tidyverse)

Load the Dataset

movies <- read_csv("top_100_movies_full_best_effort.csv")

Rows: 100 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Title, Genre(s), Director, Main Actor(s), Country, Language
dbl (8): Rank, Year, IMDb Rating, Rotten Tomatoes %, Runtime (mins), Oscars ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This code cleans and prepares the movie dataset for analysis.

I renamed columns for easier use and removed rows missing key data (the IMDb rating or the release year).

I also created a new column called “primary_genre” by extracting the first listed genre for each movie, which I treat as its main genre.

movies_cleaned <- movies |>
  rename(
    imdb_rating = `IMDb Rating`,
    run_time_mins = `Runtime (mins)`,
    genres = `Genre(s)`,
    release_year = Year
  ) |>
  # Keep only rows that have both a rating and a release year
  filter(!is.na(imdb_rating) & !is.na(release_year)) |>
  mutate(
    # Main genre: the first run of letters in the genres string, stored as a factor
    primary_genre = str_extract(genres, "[A-Za-z]+") |> factor()
  )

head(movies_cleaned)

# A tibble: 6 × 15
   Rank Title   release_year genres Director `Main Actor(s)` Country imdb_rating
  <dbl> <chr>          <dbl> <chr>  <chr>    <chr>           <chr>         <dbl>
1     1 The Sh…         1994 Drama  Frank D… Tim Robbins|Mo… United…         9.3
2     2 The Go…         1972 Crime… Francis… Marlon Brando|… United…         9.2
3     3 The Da…         2008 Actio… Christo… Christian Bale… United…         9  
4     4 The Go…         1974 Crime… Francis… Al Pacino|Robe… United…         9  
5     5 12 Ang…         1957 Crime… Sidney … Henry Fonda|Le… United…         9  
6     6 The Lo…         2003 Adven… Peter J… Elijah Wood|Vi… New Ze…         8.9
# ℹ 7 more variables: `Rotten Tomatoes %` <dbl>, run_time_mins <dbl>,
#   Language <chr>, `Oscars Won` <dbl>, `Box Office ($M)` <dbl>,
#   `Metacritic Score` <dbl>, primary_genre <fct>
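
One thing worth checking in the cleaning step is how the pattern `[A-Za-z]+` behaves on hyphenated genre names, since it stops at the first non-letter. The sketch below uses a few made-up genre strings (the separators and values are assumptions, not rows taken from the dataset) and compares the original pattern with a hypothetical alternative that reads everything up to the first comma, pipe, or slash.

```r
library(stringr)

# Illustrative genre strings only; the real file's separator is not shown above
genres_example <- c("Drama", "Crime|Drama", "Sci-Fi, Action", "Film-Noir")

# Pattern used in the cleaning step: first run of letters
first_letters <- str_extract(genres_example, "[A-Za-z]+")

# Hypothetical alternative: everything before the first , | or /
first_entry <- str_extract(genres_example, "^[^,|/]+") |> str_trim()

data.frame(genres_example, first_letters, first_entry)
```

If the dataset contains genres such as Sci-Fi, the original pattern would label them `Sci`; running `count(movies_cleaned, primary_genre)` would show whether that happens.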

I ran a linear regression to see whether there is a relationship between a movie’s release year and its IMDb rating. The regression model helps check whether newer movies generally get higher or lower ratings.

regression_model <- lm(imdb_rating ~ release_year, data = movies_cleaned)
summary(regression_model)


Call:
lm(formula = imdb_rating ~ release_year, data = movies_cleaned)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.82403 -0.19999 -0.00422  0.14644  0.84361 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.027676   2.541656   0.011   0.9913   
release_year 0.004227   0.001285   3.289   0.0014 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2908 on 97 degrees of freedom
Multiple R-squared:  0.1003,    Adjusted R-squared:  0.09104 
F-statistic: 10.82 on 1 and 97 DF,  p-value: 0.001403
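
To make the slope easier to read, the sketch below plugs the coefficient estimates from the summary above into a few lines of arithmetic: the expected change per decade, an approximate 95% confidence interval for the slope, and fitted ratings for two example years. The years 1950 and 2020 are arbitrary choices for illustration.

```r
# Values copied from the regression summary above
intercept <- 0.027676
slope     <- 0.004227
slope_se  <- 0.001285
resid_df  <- 97

# Expected change in rating over ten years
per_decade <- slope * 10

# Approximate 95% confidence interval for the slope (t distribution)
t_crit   <- qt(0.975, df = resid_df)
slope_ci <- c(lower = slope - t_crit * slope_se,
              upper = slope + t_crit * slope_se)

# Fitted ratings for two example years (arbitrary, for illustration)
fitted_1950 <- intercept + slope * 1950
fitted_2020 <- intercept + slope * 2020

per_decade
slope_ci
c(fitted_1950, fitted_2020)
```

The interval stays above zero, which matches the significant p-value, while the per-decade change of about 0.04 points shows how small the effect is. With the fitted model available, `confint(regression_model)` gives the same interval directly.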

This plot shows the relationship between release year and IMDb rating, colored by genre. Each point is a movie, and the black line represents the overall trend (linear regression line). I customized it with new colors (Set1 palette), a white-background theme (theme_bw()), and clear labels for the title, axes, legend, and caption.

chart1 <- movies_cleaned |>
  ggplot(aes(x = release_year, y = imdb_rating, color = primary_genre)) +
  geom_point(alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 0.5) +
  scale_color_brewer(palette = "Set1") +
  theme_bw() + 
  labs(
    title = "Top 100 Movie Ratings Over Time, Grouped by Primary Genre",
    x = "Release Year",
    y = "IMDb Rating",
    color = "Primary Genre",
    caption = "Data Source: top_100_movies_full_best_effort.csv (via IMDb)."
  )

chart1

`geom_smooth()` using formula = 'y ~ x'

Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
Returning the palette you asked for with that many colors

Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).
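
The two warnings above are probably related: Set1 provides at most 9 colors, so when `primary_genre` has more than 9 levels the extra genres receive no color and their points are likely the ones dropped. One possible workaround, sketched here on a made-up genre vector rather than the real data, is to keep the 8 most common genres and lump the rest into an "Other" level with `forcats::fct_lump_n()` so the palette has enough colors.

```r
library(forcats)

# Made-up genre labels with 11 distinct levels (illustration only)
genre_example <- factor(c(
  rep("Drama", 5), rep("Crime", 4), rep("Action", 3),
  rep("Adventure", 2), rep("Animation", 2), rep("Biography", 2),
  rep("Comedy", 2), rep("Horror", 2),
  "Film", "Mystery", "Western"
))

# Keep the 8 most frequent genres; everything else becomes "Other".
# ties.method = "first" guarantees that no more than 8 levels are kept.
genre_lumped <- fct_lump_n(genre_example, n = 8, ties.method = "first")

nlevels(genre_example)  # 11
nlevels(genre_lumped)   # 9, which fits within Set1
```

In the report this would mean adding something like `primary_genre = fct_lump_n(primary_genre, n = 8, ties.method = "first")` in the `mutate()` step before plotting. Another option is a palette without a fixed size, such as `scale_color_viridis_d()`.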

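Analysis

After cleaning the data and running the regression, I found a slight positive trend: within this list, newer movies tend to hold marginally higher IMDb ratings than older ones. The slope is about 0.004 rating points per year (roughly 0.04 per decade) and is statistically significant (p ≈ 0.0014), but release year explains only about 10% of the variation in ratings (R-squared = 0.10), so the effect is weak. The regression does not include genre, so whether the trend changes by genre is based only on the plot and remains tentative. Genres like Drama and Action appear often among the top-rated films.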
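
The Analysis mentions that the relationship between year and rating may depend on genre, but the regression above has no genre term. A minimal sketch of how that could be tested is an interaction model compared against a shared-slope model. The data frame below is synthetic and only stands in for `movies_cleaned`; its numbers are invented so the block runs on its own.

```r
set.seed(1)

# Synthetic stand-in for movies_cleaned (NOT the real data)
n <- 90
toy <- data.frame(
  release_year  = sample(1940:2020, n, replace = TRUE),
  primary_genre = factor(rep(c("Drama", "Action", "Crime"), each = n / 3))
)
toy$imdb_rating <- 8.3 + 0.004 * (toy$release_year - 1980) + rnorm(n, sd = 0.3)

# Model with a single shared slope, plus genre-specific intercepts
fit_shared <- lm(imdb_rating ~ release_year + primary_genre, data = toy)

# Model that lets the slope differ by genre (interaction term)
fit_by_genre <- lm(imdb_rating ~ release_year * primary_genre, data = toy)

# F-test: does allowing separate slopes improve the fit?
comparison <- anova(fit_shared, fit_by_genre)
comparison
```

On the real data the same two `lm()` calls would use `data = movies_cleaned`. Genres with only one or two films would give unstable slopes, so lumping rare genres first may be sensible.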