The objective of this project is to analyze movies with an above-average rating, drawn from a sample of 831 movies released between 1921 and 2019 and rated by users of the Internet Movie Database (IMDB), using statistical tools in R (via RStudio).
The variables analyzed are revenue, the number of films per director, the rating given by IMDB users, and the release year of the movies selected from the list of the 831 top-rated IMDB movies.
The data were obtained from the following link: data.
The following libraries were used.
The tidyverse meta-package loads a set of libraries such as ggplot2, dplyr, and tidyr.
library(tidyverse)
theme_set(theme_bw())
imdb <- read_csv("imdb_top_1000.csv")
imdb_clean <- imdb %>%
  # One movie has "PG" in the year column; drop it.
  filter(Released_Year != "PG") %>%
  # Convert the year from character to numeric.
  mutate(Released_Year = as.numeric(Released_Year)) %>%
  # Keep movies released between 1921 and 2019.
  filter(Released_Year >= 1921, Released_Year <= 2019) %>%
  # Drop movies with missing revenue or rating.
  filter(!is.na(Gross), !is.na(IMDB_Rating))
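The same cleaning idea can be written so that any non-numeric year is dropped, not only "PG". The following is a minimal sketch in base R on a small made-up table; the titles and values are hypothetical and are not rows from the dataset.

```r
# Hypothetical toy data: the second row has a certificate in the year column.
toy <- data.frame(
  Series_Title  = c("Film A", "Film B", "Film C", "Film D"),
  Released_Year = c("1994", "PG", "2020", "1950"),
  Gross         = c(1e6, 2e6, 3e6, NA),
  stringsAsFactors = FALSE
)

# Convert to numeric; anything that is not a number becomes NA
# (the coercion warning is silenced on purpose).
toy$Released_Year <- suppressWarnings(as.numeric(toy$Released_Year))

# Keep rows with a valid year in the 1921-2019 window and a known revenue.
toy_clean <- subset(
  toy,
  !is.na(Released_Year) & Released_Year >= 1921 &
    Released_Year <= 2019 & !is.na(Gross)
)

print(toy_clean)
```

Only "Film A" survives here: the other rows fail the numeric-year check, the year window, or the revenue check.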
# Rating distribution
imdb_clean %>%
  ggplot(aes(x = IMDB_Rating)) +
  geom_histogram(binwidth = 1/5, color = "darkgreen", fill = "grey") +
  labs(y = "Frequency", x = "Rating")
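To select the top-rated IMDB movies between 1921 and 2019, the cleaned data were sorted by rating and at most the first 831 rows were kept. Only 830 movies remain after the filters above, so the rest of the analysis uses 830 movies: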
imdb_clean <- imdb_clean %>%
  arrange(desc(IMDB_Rating)) %>%
  slice(1:831)
print(paste0("Movies to analyze after filters: ", nrow(imdb_clean)))
## [1] "Movies to analyze after filters: 830"
The selected variables to explore are:
Revenue per genre was calculated first. Some movies are assigned to more than one genre, so the revenue of such a movie is counted once in each of its genres.
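# Maximum number of genres per movie (number of separators plus one).
max_genres <- max(str_count(imdb_clean$Genre, pattern = ", ")) + 1
generos <- imdb_clean %>%
  select(Genre, Gross, Series_Title) %>%
  # Split the genre string into one column per genre; missing slots become NA.
  separate(Genre, sep = ", ", into = paste0("Genre_", 1:max_genres), fill = "right") %>%
  # Reshape to one row per movie-genre pair.
  pivot_longer(-c(Series_Title, Gross), values_to = "Genre") %>%
  select(-name) %>%
  filter(!is.na(Genre))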
The resulting table has one row per movie-genre pair, so movies with multiple genres appear more than once. For example, “The Godfather” appears in both the crime and drama genres.
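A one-row-per-genre table of this kind can also be built in a single step. The sketch below uses `tidyr::separate_rows()` as an alternative to the split-and-pivot approach; the toy data and its revenue figures are made up for illustration.

```r
library(tidyr)

# Hypothetical toy data; revenue figures are made up.
toy <- data.frame(
  Series_Title = c("Film A", "Film B"),
  Genre        = c("Crime, Drama", "Adventure"),
  Gross        = c(100, 250)
)

# One row per (movie, genre) pair; Gross is repeated for each genre of a movie.
toy_long <- separate_rows(toy, Genre, sep = ", ")
print(toy_long)
```

"Film A" contributes two rows (Crime and Drama) with the same revenue, which is the repetition described above.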
generos %>%
  # mutate(
  #   Genre = fct_reorder(Genre, Gross, .fun = median, .desc = TRUE)
  # ) %>%
  ggplot(aes(x = reorder(Genre, Gross, median), y = Gross, fill = Genre)) +
  geom_boxplot() +
  scale_y_log10() +
  coord_flip() +
  labs(
    y = "Revenue in dollars \n logarithmic scale",
    x = "Genre"
  ) +
  theme(legend.position = "none")
From this, we can hypothesize:
Adventure films generate more revenue than Mystery films.
To test this hypothesis, we perform a t-test:
hip1 <- generos %>%
  filter(Genre %in% c("Adventure", "Mystery"))
t.test(Gross ~ Genre, data = hip1)
##
## Welch Two Sample t-test
##
## data: Gross by Genre
## t = 8.3943, df = 230.36, p-value = 4.782e-15
## alternative hypothesis: true difference in means between group Adventure and group Mystery is not equal to 0
## 95 percent confidence interval:
## 95207479 153610377
## sample estimates:
## mean in group Adventure mean in group Mystery
## 165683310 41274382
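The p-value (4.782e-15) is far below 0.05, so we reject the null hypothesis that the two genres have equal mean revenue. The sample means (about $165.7 million for Adventure versus $41.3 million for Mystery) indicate that the Adventure genre generates more revenue on average than the Mystery genre.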
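The test above is two-sided, while the hypothesis is directional. As a hedged sketch of how a one-sided Welch test on the log scale could be set up, the example below uses simulated revenues (not the IMDB data); the choice of the log scale is an assumption made here to match the boxplot axis.

```r
set.seed(1)

# Simulated, hypothetical revenues (log-normal), not the IMDB data.
toy <- data.frame(
  Genre = rep(c("Adventure", "Mystery"), each = 50),
  Gross = c(rlnorm(50, meanlog = 18, sdlog = 1),
            rlnorm(50, meanlog = 16, sdlog = 1))
)

# With a formula, groups are ordered alphabetically (Adventure, then Mystery),
# so alternative = "greater" tests mean(Adventure) > mean(Mystery).
res <- t.test(log10(Gross) ~ Genre, data = toy, alternative = "greater")
print(res)
```

Because revenue is strongly right-skewed, comparing means of `log10(Gross)` is less dominated by a few blockbusters than comparing means in raw dollars.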
Next, movies are grouped by director, which also gives the total revenue of each director’s movies. The following table lists the 10 directors whose movies generated the most revenue, along with the number of movies each directed.
imdb_clean %>%
  group_by(Director) %>%
  summarise(Movies = n(), Total_Revenue = sum(Gross)) %>%
  arrange(-Total_Revenue) %>%
  slice(1:10) %>%
  knitr::kable(caption = "Total_Revenue is in US dollars $.")
| Director | Movies | Total_Revenue |
|---|---|---|
| Steven Spielberg | 13 | 2478133165 |
| Anthony Russo | 4 | 2205039403 |
| Christopher Nolan | 8 | 1937454106 |
| James Cameron | 5 | 1748236602 |
| Peter Jackson | 5 | 1597312443 |
| J.J. Abrams | 3 | 1423170905 |
| Brad Bird | 4 | 1099627795 |
| Robert Zemeckis | 5 | 1049446456 |
| David Yates | 3 | 978953721 |
| Pete Docter | 3 | 939382131 |
Steven Spielberg holds the top spot, with 13 films and a total revenue of about $2.48 billion.
In this section, we examined if there is a correlation between a movie’s duration in minutes and its rating on IMDB, to check if: Longer movies are more boring.
We tested the hypothesis:
There is a correlation between movie duration and rating.
# Extract the runtime in minutes as a number by removing the " min" suffix.
imdb_clean <- imdb_clean %>%
  mutate(duration_min = str_replace(Runtime, " min", "") %>% as.numeric())
imdb_clean %>%
  ggplot(aes(x = duration_min, y = IMDB_Rating, color = log(Gross))) +
  geom_point(position = "jitter") +
  geom_smooth(method = "lm") +
  scale_color_viridis_c() +
  theme(legend.position = "none") +
  labs(x = "Movie duration in minutes", y = "IMDB Rating")
cor.test(x = imdb_clean$duration_min, y = imdb_clean$IMDB_Rating)
##
## Pearson's product-moment correlation
##
## data: imdb_clean$duration_min and imdb_clean$IMDB_Rating
## t = 7.3627, df = 828, p-value = 4.352e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1829228 0.3106948
## sample estimates:
## cor
## 0.2478865
The graph and the test show a weak but significant positive correlation (r = 0.25, p = 4.352e-13). In other words, longer movies tend to receive slightly higher ratings, which does not support the idea that longer movies are more boring.
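As a robustness check on a Pearson result, a rank-based (Spearman) correlation can be computed with the same function. The sketch below uses simulated durations and ratings, so the variable names and numbers are hypothetical.

```r
set.seed(2)

# Simulated, hypothetical durations and ratings with a positive relationship.
duration_min <- round(runif(200, min = 80, max = 200))
rating       <- 7.6 + 0.004 * duration_min + rnorm(200, sd = 0.25)

# Pearson measures linear association; Spearman uses ranks and is
# less sensitive to outliers. exact = FALSE avoids the ties warning.
pearson  <- cor.test(duration_min, rating)
spearman <- cor.test(duration_min, rating, method = "spearman", exact = FALSE)

# A positive estimate means longer films tend to have higher ratings.
print(c(pearson  = unname(pearson$estimate),
        spearman = unname(spearman$estimate)))
```

If both estimates have the same sign and similar size, the conclusion does not hinge on a few extreme movies.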
Following this line of thought, it was studied whether this effect holds with the movie genre variable.
The previous genre table was used.
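# Attach the genres to each movie. Columns present in both tables
# get the suffixes .x and .y (hence Genre.y and Gross.y below).
data2 <- inner_join(imdb_clean, generos, by = "Series_Title")
lm(IMDB_Rating ~ duration_min + Genre.y, data = data2) %>%
  summary()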
##
## Call:
## lm(formula = IMDB_Rating ~ duration_min + Genre.y, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54826 -0.20952 -0.04453 0.15511 1.31060
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6200931 0.0358818 212.366 <2e-16 ***
## duration_min 0.0025887 0.0002266 11.423 <2e-16 ***
## Genre.yAdventure 0.0183194 0.0300467 0.610 0.542
## Genre.yAnimation 0.0457861 0.0400036 1.145 0.253
## Genre.yBiography -0.0503323 0.0351933 -1.430 0.153
## Genre.yComedy -0.0093691 0.0294194 -0.318 0.750
## Genre.yCrime 0.0183355 0.0299660 0.612 0.541
## Genre.yDrama 0.0017165 0.0243436 0.071 0.944
## Genre.yFamily -0.0266413 0.0447477 -0.595 0.552
## Genre.yFantasy -0.0173315 0.0422100 -0.411 0.681
## Genre.yFilm-Noir 0.1266021 0.0846103 1.496 0.135
## Genre.yHistory -0.0618685 0.0472347 -1.310 0.190
## Genre.yHorror -0.0207480 0.0617784 -0.336 0.737
## Genre.yMusic -0.0253279 0.0518641 -0.488 0.625
## Genre.yMusical -0.0826419 0.0756294 -1.093 0.275
## Genre.yMystery 0.0296868 0.0366296 0.810 0.418
## Genre.yRomance -0.0031479 0.0339904 -0.093 0.926
## Genre.ySci-Fi 0.0539844 0.0408736 1.321 0.187
## Genre.ySport -0.0371150 0.0657729 -0.564 0.573
## Genre.yThriller -0.0102464 0.0337210 -0.304 0.761
## Genre.yWar 0.0443897 0.0495448 0.896 0.370
## Genre.yWestern 0.0690708 0.0710651 0.972 0.331
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2706 on 2084 degrees of freedom
## Multiple R-squared: 0.06883, Adjusted R-squared: 0.05945
## F-statistic: 7.335 on 21 and 2084 DF, p-value: < 2.2e-16
data2 %>%
  ggplot(aes(x = duration_min, y = IMDB_Rating, color = log(Gross.y))) +
  geom_point(position = "jitter") +
  geom_smooth(method = "lm") +
  scale_color_viridis_c() +
  facet_wrap(~Genre.y, scales = "free_x") +
  theme(legend.position = "none")
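The blue line in each panel shows the linear trend, and the color of the points represents the logarithm of the Gross variable. In the regression, duration remains significant after adjusting for genre (about 0.0026 rating points per additional minute, p < 2e-16), while none of the genre coefficients is significant at the 0.05 level. The model explains only about 7% of the variance in rating (R-squared = 0.069).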
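The additive model above assumes the same duration slope in every genre. One possible way to check whether the slope differs by genre is to compare it with an interaction model. This is a sketch on simulated data, with hypothetical genres and coefficients, and it is not part of the original analysis.

```r
set.seed(3)

# Simulated, hypothetical data: three genres sharing the same duration slope.
n <- 300
toy <- data.frame(
  duration_min = round(runif(n, 80, 200)),
  Genre        = sample(c("Action", "Drama", "Mystery"), n, replace = TRUE)
)
toy$IMDB_Rating <- 7.6 + 0.003 * toy$duration_min + rnorm(n, sd = 0.25)

# Additive model: one common slope, genre only shifts the intercept.
fit_add <- lm(IMDB_Rating ~ duration_min + Genre, data = toy)
# Interaction model: each genre gets its own duration slope.
fit_int <- lm(IMDB_Rating ~ duration_min * Genre, data = toy)

# F-test comparing the two nested models.
cmp <- anova(fit_add, fit_int)
print(cmp)
```

A large p-value in the comparison would suggest that a single common slope is adequate; a small one would suggest that the duration effect varies across genres.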
Finally, we investigated whether newer movies generate more revenue, using a scatter plot with a linear trend and a Pearson correlation test:
imdb_clean %>%
  ggplot(aes(x = Released_Year, y = Gross)) +
  geom_point(position = "jitter") +
  scale_y_log10() +
  geom_smooth(method = "lm") +
  labs(x = "Release Year", y = "Revenue in dollars \n logarithmic scale")
cor.test(x = imdb_clean$Released_Year, y = imdb_clean$Gross)
##
## Pearson's product-moment correlation
##
## data: imdb_clean$Released_Year and imdb_clean$Gross
## t = 6.9021, df = 828, p-value = 1.019e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1678649 0.2965914
## sample estimates:
## cor
## 0.2332498
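The plot above uses a logarithmic revenue axis, while the correlation test uses revenue in raw dollars. As a sketch of how the two scales can be compared, the example below runs the test on both; the data are simulated and every variable name is hypothetical.

```r
set.seed(4)

# Simulated, hypothetical data: revenue grows on average with release year.
year  <- sample(1921:2019, 300, replace = TRUE)
gross <- 10^(5 + 0.02 * (year - 1921) + rnorm(300, sd = 0.8))

# Correlation on the raw dollar scale and on the log10 scale used by the plot.
raw_scale <- cor.test(year, gross)
log_scale <- cor.test(year, log10(gross))

print(c(raw   = unname(raw_scale$estimate),
        log10 = unname(log_scale$estimate)))
```

With skewed revenues the two estimates can differ noticeably, so it helps to report the scale on which a correlation was computed.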
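The correlation is positive and significant (r = 0.23, p = 1.019e-11), so newer movies tend to generate more revenue, although the association is weak.
This analysis was performed under the following R session: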
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Mexico_City
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [9] ggplot2_3.5.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 lattice_0.22-6
## [5] stringi_1.8.4 hms_1.1.3 digest_0.6.37 magrittr_2.0.3
## [9] evaluate_0.24.0 grid_4.4.1 timechange_0.3.0 fastmap_1.2.0
## [13] Matrix_1.7-0 jsonlite_1.8.8 mgcv_1.9-1 fansi_1.0.6
## [17] viridisLite_0.4.2 scales_1.3.0 jquerylib_0.1.4 cli_3.6.3
## [21] rlang_1.1.4 crayon_1.5.3 splines_4.4.1 bit64_4.0.5
## [25] munsell_0.5.1 withr_3.0.1 cachem_1.1.0 yaml_2.3.10
## [29] tools_4.4.1 parallel_4.4.1 tzdb_0.4.0 colorspace_2.1-1
## [33] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4 bit_4.0.5
## [37] vroom_1.6.5 pkgconfig_2.0.3 pillar_1.9.0 bslib_0.8.0
## [41] gtable_0.3.5 glue_1.7.0 xfun_0.47 tidyselect_1.2.1
## [45] highr_0.11 rstudioapi_0.16.0 knitr_1.48 farver_2.1.2
## [49] nlme_3.1-164 htmltools_0.5.8.1 rmarkdown_2.28 labeling_0.4.3
## [53] compiler_4.4.1