Applied Statistics to the Best Ranked Movies Until 2019

Introduction

The objective of this project is to analyze a sample of 831 movies with above-average ratings, released between 1921 and 2019 and rated by users of the Internet Movie Database (IMDB), using statistical tools in R (via RStudio).

The variables to be analyzed are the average revenue, the number of films per director, the rating given by IMDB users, and the release year of all movies selected from the list of the 831 top-rated IMDB movies.

Data and Data Cleaning

The data were obtained from the following link: data.

The only library attached was tidyverse, a collection of packages that includes ggplot2, dplyr, and tidyr, among others.

library(tidyverse)

theme_set(theme_bw())
imdb <- read_csv("imdb_top_1000.csv")
imdb_clean <- imdb %>% 
  # One row has "PG" in the year column; rows like this are removed.
  filter(Released_Year != "PG") %>% 
  # Convert the year from character to numeric.
  mutate(Released_Year = as.numeric(Released_Year)) %>% 
  # Keep only movies released between 1921 and 2019.
  filter(Released_Year >= 1921, Released_Year <= 2019) %>% 
  # Drop movies with a missing revenue or rating.
  filter(!is.na(Gross), !is.na(IMDB_Rating))
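
As a quick illustration of what each filter removes, the same steps can be run on a few made-up rows. This is only a minimal sketch: the `toy` tibble, its titles, and its values are hypothetical and are not taken from imdb_top_1000.csv.

```r
library(dplyr)

# Hypothetical rows that mimic the raw columns used in the cleaning step.
toy <- tibble(
  Series_Title  = c("Film A", "Film B", "Film C", "Film D"),
  Released_Year = c("1994", "PG", "2020", "1957"),
  Gross         = c(28000000, 1000000, 5000000, NA),
  IMDB_Rating   = c(9.3, 7.7, 8.1, 9.0)
)

toy_clean <- toy %>%
  filter(Released_Year != "PG") %>%                       # drops Film B
  mutate(Released_Year = as.numeric(Released_Year)) %>%
  filter(Released_Year >= 1921, Released_Year <= 2019) %>% # drops Film C
  filter(!is.na(Gross), !is.na(IMDB_Rating))               # drops Film D

toy_clean$Series_Title
```

Only "Film A" survives all three filters, which makes it easy to see which rule is responsible for each dropped row.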

Rating Distribution

imdb_clean %>% 
  ggplot(aes(x = IMDB_Rating)) +
  geom_histogram(binwidth = 1/5, color="darkgreen", fill="grey") +
  labs(y = "Frequency", x = "Rating")

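To select the top-rated IMDB movies released between 1921 and 2019, the movies were sorted by rating in descending order and at most the first 831 rows were kept. Only 830 movies remained after the filters above, so all of them are retained: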

imdb_clean <- imdb_clean %>% 
  arrange(desc(IMDB_Rating)) %>% 
  slice(1:831)
print(paste0("Movies to analyze after filters: ", nrow(imdb_clean)))

## [1] "Movies to analyze after filters: 830"

Exploratory Analysis

The selected variables to explore are:

Average revenue
Number of movies per director
Rating given by IMDB users

Average Revenue

Revenue per genre was calculated first. Because some movies are assigned to more than one genre, the revenue of such a movie is counted once under each of its genres.

# Largest number of genres assigned to a single movie (separator count + 1).
max_genres <- max(str_count(imdb_clean$Genre, pattern = ", ")) + 1

generos <- imdb_clean %>% 
  select(Genre, Gross, Series_Title) %>% 
  separate(Genre, sep=", ", into = paste0("Genre_", 1:max_genres)) %>% 
  pivot_longer(-c(Series_Title, Gross), values_to = "Genre") %>% 
  select(-name) %>% 
  filter(!is.na(Genre))

The generos table has one row per movie and genre, so movies with multiple genres appear more than once. For example, “The Godfather” appears in both the crime and drama genres.
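
A possible shortcut for building the same kind of long table, assuming tidyr 1.3.0 or later is installed, is `separate_longer_delim()`, which splits a delimited column directly into rows. The sketch below uses hypothetical titles and revenue values, not rows from the dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical movies: one with two genres, one with a single genre.
toy <- tibble(
  Series_Title = c("Film A", "Film B"),
  Genre        = c("Crime, Drama", "Comedy"),
  Gross        = c(100, 50)
)

# One output row per movie-genre pair; Gross is repeated for each genre.
toy_long <- toy %>%
  separate_longer_delim(Genre, delim = ", ")

toy_long
```

"Film A" ends up in two rows (Crime and Drama) with the same Gross value, which is the repetition described above.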

generos %>%
  # Order genres by median revenue so the boxes appear sorted.
  ggplot(aes(x = reorder(Genre, Gross, median), y = Gross, fill = Genre)) +
  geom_boxplot() +
  scale_y_log10() +
  coord_flip() +
  labs(
    y = "Revenue in dollars \n logarithmic scale",
    x = "Genre"
  ) +
  theme(legend.position = "none")

From this, we can hypothesize:

Adventure films generate more revenue than Mystery films.

To test this hypothesis, we perform a t-test:

hip1 <- generos %>% 
  filter(Genre %in% c("Adventure", "Mystery"))

t.test(Gross ~ Genre, data = hip1)

## 
##  Welch Two Sample t-test
## 
## data:  Gross by Genre
## t = 8.3943, df = 230.36, p-value = 4.782e-15
## alternative hypothesis: true difference in means between group Adventure and group Mystery is not equal to 0
## 95 percent confidence interval:
##   95207479 153610377
## sample estimates:
## mean in group Adventure   mean in group Mystery 
##               165683310                41274382

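The test gives a p-value far below 0.05 (4.782e-15), so we reject the null hypothesis that the two genres have equal mean revenue. Since the sample mean for Adventure (about 165.7 million dollars) is well above that for Mystery (about 41.3 million dollars), the data support the hypothesis that Adventure films generate more revenue than Mystery films.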
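
Because the hypothesis is directional and revenue is strongly right-skewed (the boxplot needs a logarithmic axis), one variation worth considering is a one-sided Welch test on log10 revenue. The following is a minimal sketch on simulated data: the `toy` data frame and its lognormal parameters are assumptions for illustration and do not reproduce the real values.

```r
set.seed(1)

# Simulated revenue for two genres; parameters are arbitrary.
toy <- data.frame(
  Genre = rep(c("Adventure", "Mystery"), each = 50),
  Gross = c(rlnorm(50, meanlog = 18, sdlog = 1),
            rlnorm(50, meanlog = 17, sdlog = 1))
)

# Groups are ordered alphabetically, so "greater" tests whether the
# Adventure mean of log10(Gross) exceeds the Mystery mean.
res <- t.test(log10(Gross) ~ Genre, data = toy, alternative = "greater")
res$p.value
```

With the real data, the same call would use `data = hip1` in place of `toy`.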

Number of Movies Per Director

Next, movies were grouped by director, which also made it possible to total the revenue of the movies by each director. The following table lists the 10 directors with the highest total revenue, together with the number of their movies in the sample.

imdb_clean %>% 
  group_by(Director) %>% 
  summarise(Movies = n(), Total_Revenue = sum(Gross)) %>% 
  arrange(-Total_Revenue) %>% 
  slice(1:10) %>% 
  knitr::kable(caption = "Total_Revenue is in US dollars $.")

Total_Revenue is in US dollars $.
Director	Movies	Total_Revenue
Steven Spielberg	13	2478133165
Anthony Russo	4	2205039403
Christopher Nolan	8	1937454106
James Cameron	5	1748236602
Peter Jackson	5	1597312443
J.J. Abrams	3	1423170905
Brad Bird	4	1099627795
Robert Zemeckis	5	1049446456
David Yates	3	978953721
Pete Docter	3	939382131

Steven Spielberg holds the top spot, with 13 films in the sample and the highest total revenue.
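
Total revenue favors directors with many films in the sample. As a complementary view, revenue per movie can be derived from the table above. The sketch below re-enters the first three rows of the table by hand; the `Revenue_Per_Movie` column name is an assumption made for illustration.

```r
# First three rows of the table above, typed in by hand.
top <- data.frame(
  Director      = c("Steven Spielberg", "Anthony Russo", "Christopher Nolan"),
  Movies        = c(13, 4, 8),
  Total_Revenue = c(2478133165, 2205039403, 1937454106)
)

# Average revenue per movie, then rank from highest to lowest.
top$Revenue_Per_Movie <- top$Total_Revenue / top$Movies
ranked <- top[order(-top$Revenue_Per_Movie), ]
ranked
```

On a per-movie basis the order among these three changes: Anthony Russo comes first (about 551 million dollars per movie), followed by Christopher Nolan (about 242 million) and Steven Spielberg (about 191 million).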

Rating Given by IMDB Users

In this section, we examined whether the duration of a movie in minutes is correlated with its IMDB rating, to check the informal claim that longer movies are more boring.

We tested the hypothesis:

There is a correlation between movie duration and rating.

imdb_clean <- imdb_clean %>% 
  mutate(duration_min = str_replace(Runtime, " min", "") %>% as.numeric())

imdb_clean %>% ggplot(aes(x = duration_min, y = IMDB_Rating, color = log(Gross))) +
  geom_point(position = "jitter") +
  geom_smooth(method = "lm") +
  scale_color_viridis_c() +
  theme(legend.position = "none") +
  labs(x = "Movie duration in minutes", y = "IMDB Rating")

cor.test(x = imdb_clean$duration_min, y = imdb_clean$IMDB_Rating)

## 
##  Pearson's product-moment correlation
## 
## data:  imdb_clean$duration_min and imdb_clean$IMDB_Rating
## t = 7.3627, df = 828, p-value = 4.352e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1829228 0.3106948
## sample estimates:
##       cor 
## 0.2478865

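The plot and the test show a weak but statistically significant positive correlation (r ≈ 0.25, p < 0.001). In other words, longer movies in this sample tend to receive slightly higher ratings, which does not support the idea that long movies are boring and therefore rated lower.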
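
To get a feel for the size of this relationship, the correlation coefficient can be recovered from the reported t statistic and degrees of freedom using r = t / sqrt(t^2 + df), and then squared to estimate the share of rating variance associated with duration. The numbers below are copied from the test output above.

```r
t_stat <- 7.3627  # t value reported by cor.test
df     <- 828     # degrees of freedom reported by cor.test

# Pearson correlation implied by t and df.
r <- t_stat / sqrt(t_stat^2 + df)
round(r, 3)    # about 0.248, matching the reported estimate

# Share of variance in rating associated with duration.
round(r^2, 3)  # about 0.061
```

Duration is therefore associated with roughly 6% of the variation in rating, so the relationship is statistically clear but small.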

Next, we examined whether this effect holds once movie genre is taken into account, reusing the genre table built earlier.

data2 <- inner_join(imdb_clean, generos, by = "Series_Title")

lm(IMDB_Rating ~ duration_min + Genre.y, data = data2) %>% 
  summary()

## 
## Call:
## lm(formula = IMDB_Rating ~ duration_min + Genre.y, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54826 -0.20952 -0.04453  0.15511  1.31060 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.6200931  0.0358818 212.366   <2e-16 ***
## duration_min      0.0025887  0.0002266  11.423   <2e-16 ***
## Genre.yAdventure  0.0183194  0.0300467   0.610    0.542    
## Genre.yAnimation  0.0457861  0.0400036   1.145    0.253    
## Genre.yBiography -0.0503323  0.0351933  -1.430    0.153    
## Genre.yComedy    -0.0093691  0.0294194  -0.318    0.750    
## Genre.yCrime      0.0183355  0.0299660   0.612    0.541    
## Genre.yDrama      0.0017165  0.0243436   0.071    0.944    
## Genre.yFamily    -0.0266413  0.0447477  -0.595    0.552    
## Genre.yFantasy   -0.0173315  0.0422100  -0.411    0.681    
## Genre.yFilm-Noir  0.1266021  0.0846103   1.496    0.135    
## Genre.yHistory   -0.0618685  0.0472347  -1.310    0.190    
## Genre.yHorror    -0.0207480  0.0617784  -0.336    0.737    
## Genre.yMusic     -0.0253279  0.0518641  -0.488    0.625    
## Genre.yMusical   -0.0826419  0.0756294  -1.093    0.275    
## Genre.yMystery    0.0296868  0.0366296   0.810    0.418    
## Genre.yRomance   -0.0031479  0.0339904  -0.093    0.926    
## Genre.ySci-Fi     0.0539844  0.0408736   1.321    0.187    
## Genre.ySport     -0.0371150  0.0657729  -0.564    0.573    
## Genre.yThriller  -0.0102464  0.0337210  -0.304    0.761    
## Genre.yWar        0.0443897  0.0495448   0.896    0.370    
## Genre.yWestern    0.0690708  0.0710651   0.972    0.331    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2706 on 2084 degrees of freedom
## Multiple R-squared:  0.06883,    Adjusted R-squared:  0.05945 
## F-statistic: 7.335 on 21 and 2084 DF,  p-value: < 2.2e-16
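
One way to read this output: the duration_min coefficient remains positive and significant with genre in the model, none of the genre coefficients differs significantly from the baseline genre (the level absent from the coefficient list), and the R-squared of about 0.069 indicates that the model explains little of the variation in rating. The short sketch below translates the slope into rating points for a hypothetical one-hour difference and checks the row count implied by the output; the 60-minute comparison is an assumption chosen for illustration.

```r
slope_per_min <- 0.0025887  # duration_min estimate from the summary above
extra_minutes <- 60         # hypothetical comparison: one extra hour

# Expected rating difference for one extra hour, holding genre fixed.
rating_change <- slope_per_min * extra_minutes
round(rating_change, 3)     # about 0.155 rating points

# Rows used in the fit: residual degrees of freedom + number of coefficients.
n_rows <- 2084 + 22
n_rows                      # 2106
```

The 2106 rows exceed the 830 movies because data2 holds one row per movie and genre, so movies with several genres enter the regression more than once.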

data2 %>% 
  ggplot(aes(x = duration_min, y = IMDB_Rating, color = log(Gross.y))) +
  geom_point(position = "jitter") +
  geom_smooth(method = "lm") +
  scale_color_viridis_c() +
  facet_wrap(~Genre.y, scales = "free_x") +
  theme(legend.position = "none")

The blue line in each panel shows the linear trend for that genre. The color of the points represents the logarithm of the Gross variable.

Recent Movies and Revenue

Finally, we investigated whether newer movies generate more revenue:

imdb_clean %>% ggplot(aes(x = Released_Year, y = Gross)) +
  geom_point(position = "jitter") + scale_y_log10() + geom_smooth(method = "lm") +
  labs(x = "Release Year", y = "Revenue in dollars \n logarithmic scale")

cor.test(x = imdb_clean$Released_Year, y = imdb_clean$Gross)

## 
##  Pearson's product-moment correlation
## 
## data:  imdb_clean$Released_Year and imdb_clean$Gross
## t = 6.9021, df = 828, p-value = 1.019e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1678649 0.2965914
## sample estimates:
##       cor 
## 0.2332498
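
The test reports a weak but statistically significant positive correlation (r ≈ 0.23) between release year and revenue. Note that the plot uses a logarithmic revenue axis while the test uses raw dollars, so it may be worth running the correlation on log10 revenue as well. The following is a minimal sketch on simulated data; the `toy` data frame and the growth rate used to generate it are assumptions, not estimates from the dataset.

```r
set.seed(42)

# Simulated release years and revenue that grows with year on a log scale.
toy <- data.frame(Released_Year = sample(1921:2019, 200, replace = TRUE))
toy$Gross <- 10^(5 + 0.02 * (toy$Released_Year - 1921) + rnorm(200))

# Correlation on raw dollars versus on the log10 scale used in the plot.
raw    <- cor.test(toy$Released_Year, toy$Gross)
logged <- cor.test(toy$Released_Year, log10(toy$Gross))

c(raw = unname(raw$estimate), log10 = unname(logged$estimate))
```

With the real data, the second call would be `cor.test(imdb_clean$Released_Year, log10(imdb_clean$Gross))`, which matches the scale of the fitted line in the plot.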

Reproducibility

This analysis was performed under the following R session:

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Mexico_City
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
##  [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
##  [9] ggplot2_3.5.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9        utf8_1.2.4        generics_0.1.3    lattice_0.22-6   
##  [5] stringi_1.8.4     hms_1.1.3         digest_0.6.37     magrittr_2.0.3   
##  [9] evaluate_0.24.0   grid_4.4.1        timechange_0.3.0  fastmap_1.2.0    
## [13] Matrix_1.7-0      jsonlite_1.8.8    mgcv_1.9-1        fansi_1.0.6      
## [17] viridisLite_0.4.2 scales_1.3.0      jquerylib_0.1.4   cli_3.6.3        
## [21] rlang_1.1.4       crayon_1.5.3      splines_4.4.1     bit64_4.0.5      
## [25] munsell_0.5.1     withr_3.0.1       cachem_1.1.0      yaml_2.3.10      
## [29] tools_4.4.1       parallel_4.4.1    tzdb_0.4.0        colorspace_2.1-1 
## [33] vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   bit_4.0.5        
## [37] vroom_1.6.5       pkgconfig_2.0.3   pillar_1.9.0      bslib_0.8.0      
## [41] gtable_0.3.5      glue_1.7.0        xfun_0.47         tidyselect_1.2.1 
## [45] highr_0.11        rstudioapi_0.16.0 knitr_1.48        farver_2.1.2     
## [49] nlme_3.1-164      htmltools_0.5.8.1 rmarkdown_2.28    labeling_0.4.3   
## [53] compiler_4.4.1
