# Load the packages used throughout the analysis
pacotes <- c("readr", "dplyr", "ggplot2", "PerformanceAnalytics", "tidyr")
invisible(lapply(pacotes, library, character.only = TRUE))
# Read the IMDb dataset
filmes <- read_csv("imdb_movies.csv")
# Strip the " min" suffix so Runtime can be stored as a number
filmes$Runtime <- as.numeric(gsub(" min", "", filmes$Runtime))
# Non-numeric year entries become NA here
filmes$Released_Year <- as.numeric(as.character(filmes$Released_Year))
summary(filmes)
## ...1 Series_Title Released_Year Certificate
## Min. : 1.0 Length:999 Min. :1920 Length:999
## 1st Qu.:250.5 Class :character 1st Qu.:1976 Class :character
## Median :500.0 Mode :character Median :1999 Mode :character
## Mean :500.0 Mean :1991
## 3rd Qu.:749.5 3rd Qu.:2009
## Max. :999.0 Max. :2020
## NA's :1
## Runtime Genre IMDB_Rating Overview
## Min. : 45.0 Length:999 Min. :7.600 Length:999
## 1st Qu.:103.0 Class :character 1st Qu.:7.700 Class :character
## Median :119.0 Mode :character Median :7.900 Mode :character
## Mean :122.9 Mean :7.948
## 3rd Qu.:137.0 3rd Qu.:8.100
## Max. :321.0 Max. :9.200
##
## Meta_score Director Star1 Star2
## Min. : 28.00 Length:999 Length:999 Length:999
## 1st Qu.: 70.00 Class :character Class :character Class :character
## Median : 79.00 Mode :character Mode :character Mode :character
## Mean : 77.97
## 3rd Qu.: 87.00
## Max. :100.00
## NA's :157
## Star3 Star4 No_of_Votes Gross
## Length:999 Length:999 Min. : 25088 Min. : 1305
## Class :character Class :character 1st Qu.: 55472 1st Qu.: 3245338
## Mode :character Mode :character Median : 138356 Median : 23457440
## Mean : 271621 Mean : 68082574
## 3rd Qu.: 373168 3rd Qu.: 80876340
## Max. :2303232 Max. :936662225
## NA's :169
Apart from a few NAs that will need to be handled before building a predictive model, the summary statistics reveal some interesting facts: the oldest movie in the dataset is from 1920, and one movie runs for 321 minutes!
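As a minimal sketch of how those NAs could be counted and set aside, the snippet below uses a small toy data frame standing in for `filmes`. Its values, including the non-numeric year string, are hypothetical, and the choice of which columns a model would need is an assumption.

```r
# Toy stand-in for `filmes` (hypothetical values, not taken from the dataset)
toy <- data.frame(
  Series_Title  = c("A", "B", "C", "D"),
  Released_Year = c("1994", "PG", "2008", "1972"),
  Meta_score    = c(80, NA, 84, 100),
  Gross         = c(28341469, 1000000, NA, 134966411),
  stringsAsFactors = FALSE
)

# Non-numeric year strings become NA (the coercion warning is silenced here)
toy$Released_Year <- suppressWarnings(as.numeric(toy$Released_Year))

# Number of missing values per column
na_por_coluna <- colSums(is.na(toy))
print(na_por_coluna)

# Keep only the rows that are complete in the columns a model would use
colunas_modelo <- c("Released_Year", "Meta_score", "Gross")
toy_completo <- toy[complete.cases(toy[, colunas_modelo]), ]
print(nrow(toy_completo))
```

On the real data, the same two calls can be applied to `filmes` directly.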
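Additionally, Runtime, IMDB_Rating, and Meta_score stay within fairly narrow ranges. Gross and No_of_Votes, however, are strongly right-skewed: in both cases the mean is roughly two to three times the median, so a few blockbuster titles could pull averages and model fits upward.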
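One quick way to check a variable for skew and extreme values is sketched below. The revenue vector is hypothetical (it is not the real `Gross` column), and the 1.5 × IQR cutoff is simply Tukey's conventional rule of thumb, not something the analysis above prescribes.

```r
# Hypothetical revenue values with one very large film (not real data)
gross <- c(1305, 3.2e6, 2.3e7, 8.1e7, 9.4e8)

# A mean far above the median is a quick sign of right skew
razao <- mean(gross) / median(gross)
print(razao)

# Tukey's rule: flag values above Q3 + 1.5 * IQR
limite <- unname(quantile(gross, 0.75)) + 1.5 * IQR(gross)
extremos <- gross[gross > limite]
print(extremos)

# A log10 transform compresses the long right tail
print(round(log10(gross), 2))
```

If the same check flags many films in `filmes$Gross`, modeling `log10(Gross)` instead of the raw values is one common option.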
Analysis of Variance (ANOVA) is a fundamental statistical technique used to compare the means of three or more independent groups to determine if there is a statistically significant difference between them.
If ANOVA finds significant differences between the group means, it indicates that the categorical variable is associated with differences in the target variable.
For example, if an ANOVA of IMDB_Rating by genre yields a p-value < 0.01, we reject the hypothesis of equal mean ratings across genres at the 1% significance level (a 99% confidence level), which suggests that the genre of a film influences its rating.
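To make the decision rule concrete, here is a small sketch on simulated data. The three genres, the group means, and the sample sizes are all hypothetical. One group is deliberately simulated with a higher mean so that the test has something to detect.

```r
set.seed(42)

# Hypothetical ratings for three genres; "Drama" is simulated with a higher mean
toy <- data.frame(
  Genre = rep(c("Action", "Comedy", "Drama"), each = 30),
  IMDB_Rating = c(rnorm(30, mean = 7.8, sd = 0.2),
                  rnorm(30, mean = 7.8, sd = 0.2),
                  rnorm(30, mean = 8.2, sd = 0.2))
)

modelo <- aov(IMDB_Rating ~ Genre, data = toy)

# summary() returns a list whose first element is the ANOVA table
tabela <- summary(modelo)[[1]]
p_valor <- tabela[["Pr(>F)"]][1]

# TRUE means the equal-means hypothesis is rejected at the 1% level
print(p_valor < 0.01)
```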
anova1 <- aov(IMDB_Rating ~ Genre, data = filmes)
anova2 <- aov(IMDB_Rating ~ Director, data = filmes)
anova3 <- aov(IMDB_Rating ~ Series_Title, data = filmes)
anova4 <- aov(IMDB_Rating ~ Overview, data = filmes)
anova5 <- aov(IMDB_Rating ~ Star1, data = filmes)
summary(anova1)
## Df Sum Sq Mean Sq F value Pr(>F)
## Genre 201 14.61 0.07266 0.975 0.58
## Residuals 797 59.39 0.07451
summary(anova2)
## Df Sum Sq Mean Sq F value Pr(>F)
## Director 547 42.83 0.07829 1.133 0.0836 .
## Residuals 451 31.17 0.06911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(anova3)
## Df Sum Sq Mean Sq F value Pr(>F)
## Series_Title 997 73.99 0.07421 14.84 0.205
## Residuals 1 0.00 0.00500
summary(anova4)
## Df Sum Sq Mean Sq
## Overview 998 73.99 0.07414
summary(anova5)
## Df Sum Sq Mean Sq F value Pr(>F)
## Star1 658 48.79 0.07415 1 0.503
## Residuals 340 25.20 0.07413
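We can see that every p-value obtained was greater than 0.01, so at the 1% significance level (a 99% confidence level) none of the tested variables shows a statistically significant effect on the IMDB_Rating variable. These tests have little power, though, because each variable has hundreds of levels, many of them containing only a handful of films. Series_Title leaves a single residual degree of freedom, and Overview leaves none, so no F statistic or p-value can be computed for it at all.
In summary, the analysis of variance (ANOVA) provides no evidence that movie title, overview, genre, director, or lead actor influences the IMDB_Rating variable, although with categories this sparse the absence of evidence should be read with caution.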
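Since the Genre column stores comma-separated combinations, one option for a less sparse test is to group films by the first genre listed. The sketch below is not part of the original analysis: the toy rows are hypothetical, and treating the first listed genre as the "primary" one is an assumption.

```r
# Hypothetical rows mimicking the comma-separated Genre strings
toy <- data.frame(
  Genre = c("Action, Adventure, Sci-Fi", "Action, Crime, Drama", "Drama",
            "Drama, Romance", "Comedy, Drama", "Comedy"),
  IMDB_Rating = c(8.4, 9.0, 9.3, 8.1, 7.9, 7.8),
  stringsAsFactors = FALSE
)

# Every row has its own genre combination, so almost nothing is left for the residuals
n_niveis <- length(unique(toy$Genre))
print(n_niveis)

# Assumption: the first listed genre is treated as the primary genre
toy$genero_principal <- sub(",.*", "", toy$Genre)
print(table(toy$genero_principal))

# The coarser grouping leaves residual degrees of freedom for the F test
modelo <- aov(IMDB_Rating ~ genero_principal, data = toy)
print(df.residual(modelo))
```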
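# Same columns as positions 3, 5, 9, 15, 16, 7, selected by name for readability
chart.Correlation(filmes[, c("Released_Year", "Runtime", "Meta_score", "No_of_Votes", "Gross", "IMDB_Rating")], histogram = TRUE)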
The correlation chart shows how the numerical variables relate to one another, which helps us understand the data in question.
The variable most strongly correlated with IMDb rating is the number of votes. One possible explanation is that people are more likely to make the effort to go to the IMDb website and vote when they watch a movie they enjoy.
The number of votes also has a positive relationship with revenue, indicating that movies with more votes tend to be those that earn more. Interestingly, revenue has a low correlation with IMDb ratings. This suggests that the highest-grossing movies are generally the most popular ones, those with the largest number of votes, but not necessarily the ones that please the audience the most.
Another interesting relationship is the release year in relation to IMDb ratings and critics’ scores (Meta Score). We can observe that older movies were rated slightly better by both the audience and critics. This may indicate a decline in the quality of films over the years, or at least a decline in how the public and critics perceive film quality.
We can also see that audience ratings and critics’ ratings are only weakly related. This is interesting, as one might expect a stronger relationship given that both are evaluations made by people who watched the movies. The weak relationship suggests that the two groups likely use evaluation criteria that differ significantly from each other.
Overall, despite the interesting insights that can be drawn from the correlation chart, there are no very high correlations between any of the variables and IMDB_Rating, or among the variables themselves.
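The coefficients behind such a chart can also be inspected as plain numbers. The sketch below uses simulated data, so the distributions and column relationships are hypothetical and do not reproduce the real values. It also computes Spearman rank correlations, which are less sensitive to skewed variables such as revenue and vote counts.

```r
set.seed(1)
n <- 200

# Simulated, right-skewed stand-ins for the real columns (hypothetical values)
votos <- rlnorm(n, meanlog = 12, sdlog = 1)
toy <- data.frame(
  No_of_Votes = votos,
  Gross       = votos * 100 * rlnorm(n, meanlog = 0, sdlog = 1),
  IMDB_Rating = 7.6 + 0.1 * log10(votos) + rnorm(n, mean = 0, sd = 0.2)
)

# Introduce some missing revenue values, as in the real dataset
toy$Gross[sample(n, 30)] <- NA

# Pairwise-complete observations keep rows that are complete for each pair
pearson  <- cor(toy, use = "pairwise.complete.obs")
spearman <- cor(toy, use = "pairwise.complete.obs", method = "spearman")

print(round(pearson, 2))
print(round(spearman, 2))
```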
# Analyzing directors in relation to revenue
diretor_faturamento <- filmes %>%
  group_by(Director) %>%
  summarise(media_faturamento = mean(Gross, na.rm = TRUE)) %>%
  arrange(desc(media_faturamento))
# Displaying the top 20 directors with the highest average revenue
head(diretor_faturamento, 20)
## # A tibble: 20 × 2
## Director media_faturamento
## <chr> <dbl>
## 1 Anthony Russo 551259851.
## 2 Gareth Edwards 532177324
## 3 J.J. Abrams 474390302.
## 4 Josh Cooley 434038008
## 5 Roger Allers 422783777
## 6 Tim Miller 363070709
## 7 James Gunn 361494850.
## 8 James Cameron 349647320.
## 9 Byron Howard 341268248
## 10 David Yates 326317907
## 11 David Leitch 324591735
## 12 Joss Whedon 324397032
## 13 George Lucas 322740140
## 14 Peter Jackson 319462489.
## 15 Jon Favreau 318412101
## 16 Pete Docter 313127377
## 17 Lee Unkrich 312365448.
## 18 Richard Marquand 309125409
## 19 Todd Phillips 306386907
## 20 Gore Verbinski 305413918
In addition to the insights derived from numerical variables, valuable insights can also be gained from the categorical variables in the dataset. Since the goal of studios is to achieve profit from their productions and considering that ratings have little influence on revenue, it makes more sense to analyze which directors worked on the highest-grossing films rather than focusing on ratings. This way, the studio can choose directors with a history of contributing to high financial returns when planning new projects.
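An average based on a single film says little about a director's track record, so it can help to report the number of films alongside the mean. The sketch below shows one way to do that. The director names and revenue figures are hypothetical, and the minimum of two films is an arbitrary threshold chosen for illustration.

```r
library(dplyr)

# Hypothetical directors and revenues (not taken from the dataset)
toy <- data.frame(
  Director = c("Diretor A", "Diretor B", "Diretor B", "Diretor B",
               "Diretor C", "Diretor C"),
  Gross    = c(900, 300, 400, 500, 100, NA)
)

diretor_resumo <- toy %>%
  group_by(Director) %>%
  summarise(
    n_filmes          = sum(!is.na(Gross)),        # films with known revenue
    media_faturamento = mean(Gross, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(n_filmes >= 2) %>%                        # assumed minimum track record
  arrange(desc(media_faturamento))

print(diretor_resumo)
```

In this toy example only "Diretor B" remains, because the other two directors have a single film with known revenue.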
To further enhance this analysis, I will incorporate considerations based on the Value Above Replacement (VAR) statistic, as discussed in the work of data scientist Jeremy Lee. VAR, in summary, quantifies how often an actor/actress or director has been involved in films with above-average profitability.
Link to the VAR of directors
The top five directors in the VAR ranking all appear in the top-20 list above, which reinforces the reliability and relevance of this list.
# Analyzing actors in relation to revenue
atores <- filmes %>%
  select(Gross, Star1, Star2, Star3, Star4) %>%
  gather(key = "posicao", value = "ator", Star1:Star4) %>%
  group_by(ator) %>%
  summarise(media_faturamento = mean(Gross, na.rm = TRUE)) %>%
  arrange(desc(media_faturamento))
# Displaying the top 60 actors with the highest average revenue
top_60_atores <- atores[1:60, ]
num_rows <- ceiling(nrow(top_60_atores) / 3) # number of rows needed for 3 columns
atores_matrix <- matrix(paste0(seq_along(top_60_atores$ator), ". ", top_60_atores$ator), nrow = num_rows, byrow = TRUE)
print(atores_matrix)
## [,1] [,2] [,3]
## [1,] "1. Daisy Ridley" "2. John Boyega" "3. Michelle Rodriguez"
## [2,] "4. Billy Zane" "5. Huck Milner" "6. Sarah Vowell"
## [3,] "7. Joe Russo" "8. Aaron Eckhart" "9. Diego Luna"
## [4,] "10. Donnie Yen" "11. Oscar Isaac" "12. Robert Downey Jr."
## [5,] "13. Dee Wallace" "14. Drew Barrymore" "15. Henry Thomas"
## [6,] "16. Peter Coyote" "17. Craig T. Nelson" "18. Holly Hunter"
## [7,] "19. Annie Potts" "20. Tony Hale" "21. Zoe Saldana"
## [8,] "22. James Earl Jones" "23. Rob Minkoff" "24. Joan Cusack"
## [9,] "25. Sam Worthington" "26. Jeff Goldblum" "27. Chris Evans"
## [10,] "28. Alexander Gould" "29. Ellen DeGeneres" "30. Ed Skrein"
## [11,] "31. T.J. Miller" "32. Bill Hader" "33. Lewis Black"
## [12,] "34. Ronnie Del Carmen" "35. Elijah Wood" "36. Morena Baccarin"
## [13,] "37. Ryan Reynolds" "38. Mark Ruffalo" "39. Michael Gambon"
## [14,] "40. Jared Bush" "41. Jason Bateman" "42. Rich Moore"
## [15,] "43. Chris Hemsworth" "44. Frances Conroy" "45. Zazie Beetz"
## [16,] "46. Orlando Bloom" "47. Sally Field" "48. Chris Pratt"
## [17,] "49. Tim Allen" "50. Terrence Howard" "51. Anne Hathaway"
## [18,] "52. Maggie Smith" "53. Sean Bean" "54. Tom Hiddleston"
## [19,] "55. Heath Ledger" "56. Mark Hamill" "57. Daniel Radcliffe"
## [20,] "58. Rupert Grint" "59. Lee Unkrich" "60. Billy Dee Williams"
Just as was done with directors, it is also necessary to analyze actors in relation to revenue. This way, the studio can have a list of actors who are most associated with high-grossing films, making it easier to select talent for future productions.
Link to the VAR of actors
Chris Hemsworth, Robert Downey Jr., Elijah Wood, and Zoe Saldana are some of the actors well-positioned on the VAR list and also appear in the above list. Just like with directors, this fact adds reliability to the exploratory data analysis (EDA) done here, as it shows that similar results were found by other researchers using different datasets.
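The actor ranking can also be built with `pivot_longer()`, which the tidyr documentation recommends over the superseded `gather()`, while counting how many films each average rests on. This is a sketch on hypothetical data with only two star columns. Sorting by film count first is an assumption about what a studio might care about, not the method used above.

```r
library(dplyr)
library(tidyr)

# Hypothetical films with two billed stars each (not taken from the dataset)
toy <- data.frame(
  Gross = c(800, 600, 100),
  Star1 = c("Ator A", "Ator A", "Ator C"),
  Star2 = c("Ator B", "Ator D", "Ator B"),
  stringsAsFactors = FALSE
)

atores_resumo <- toy %>%
  # One row per (film, actor) pair
  pivot_longer(c(Star1, Star2), names_to = "posicao", values_to = "ator") %>%
  group_by(ator) %>%
  summarise(
    n_filmes          = n(),
    media_faturamento = mean(Gross, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(n_filmes), desc(media_faturamento))

print(atores_resumo)
```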
# Group by genre and calculate the average revenue
genero_faturamento <- filmes %>%
  group_by(Genre) %>%
  summarise(media_faturamento = mean(Gross, na.rm = TRUE)) %>%
  arrange(desc(media_faturamento))
# Displaying the top 20 genres with the highest average revenue
head(genero_faturamento, 20)
## # A tibble: 20 × 2
## Genre media_faturamento
## <chr> <dbl>
## 1 Family, Sci-Fi 435110554
## 2 Action, Adventure, Fantasy 352723505.
## 3 Action, Adventure, Family 301959197
## 4 Action, Adventure, Sci-Fi 280888546.
## 5 Adventure, Fantasy 280685212.
## 6 Adventure, Thriller 260000000
## 7 Animation, Comedy, Crime 251513985
## 8 Action, Adventure 229507242.
## 9 Animation, Adventure, Comedy 225166880.
## 10 Action, Adventure, Drama 222402969.
## 11 Action, Adventure, Comedy 213379327.
## 12 Action, Adventure, Mystery 209028679
## 13 Animation, Action, Adventure 200869452.
## 14 Action, Mystery, Thriller 175124898
## 15 Adventure, Family, Fantasy 174928484.
## 16 Action, Adventure, Thriller 168611780.
## 17 Adventure, Comedy, Sci-Fi 164554881
## 18 Comedy, Family 153183226
## 19 Adventure, Drama, Sci-Fi 150668688.
## 20 Action, Drama, Sci-Fi 148394052.
Finally, it is important to analyze which genres tend to gross the most. Action or Adventure appears in 17 of the 20 genre combinations listed above, so these films are among the most lucrative. Therefore, it is advisable for the studio, if prioritizing revenue, to consider these genres when creating a new film.
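Because each film carries a combination of genres, average revenue can also be computed per individual genre by splitting the combined strings. The sketch below uses hypothetical rows and assumes the genres are always separated by a comma and a space, as in the output above.

```r
library(dplyr)
library(tidyr)

# Hypothetical films with comma-separated genres (not taken from the dataset)
toy <- data.frame(
  Genre = c("Action, Adventure, Sci-Fi", "Action, Drama", "Drama", "Comedy, Drama"),
  Gross = c(600, 200, 50, 100),
  stringsAsFactors = FALSE
)

genero_individual <- toy %>%
  # One row per (film, genre) pair; assumes ", " is the separator
  separate_rows(Genre, sep = ", ") %>%
  group_by(Genre) %>%
  summarise(
    n_filmes          = n(),
    media_faturamento = mean(Gross, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(media_faturamento))

print(genero_individual)
```

A film with three genres contributes to three averages here, so the per-genre means are not independent of one another.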
Through this brief exploratory data analysis (EDA), valuable guidance can be provided to the studio for planning its next films.
We can even propose an “action plan” for the studio’s next project! Considering a new film with a primary focus on revenue, a good scope for the studio’s next project, based on the data explored here and on the VAR statistic, would be:
Genre: A family-friendly Action and Adventure film with a Sci-Fi context
Director: Anthony Russo
Lead Actors: Chris Hemsworth, Robert Downey Jr., Elijah Wood, Zoe Saldana, and Anne Hathaway.
Overview: In a distant future, a group of unlikely adventurers sets off on a thrilling quest of action and adventure to rescue their home planet.
An example that illustrates the power of data analysis for business decisions is Guardians of the Galaxy Vol. 3, the 4th highest-grossing film of 2023, which is very similar in scope and context to the film suggested here.
It also features a director and actors with styles similar to those proposed for this film. In fact, one of its stars is Zoe Saldana herself!
Highest-Grossing Films of 2023