Loading packages and dataset

pacotes <- c("readr", "dplyr", "ggplot2", "PerformanceAnalytics", "tidyr" )
lapply(pacotes, library, character.only = TRUE)
filmes <- read_csv("imdb_movies.csv")
filmes$Runtime <- gsub(" min", "", filmes$Runtime)
filmes$Runtime <- as.numeric(filmes$Runtime)
filmes$Released_Year <- as.numeric(as.character(filmes$Released_Year))

Looking at some statistics of the variables

summary(filmes)
##       ...1       Series_Title       Released_Year  Certificate       
##  Min.   :  1.0   Length:999         Min.   :1920   Length:999        
##  1st Qu.:250.5   Class :character   1st Qu.:1976   Class :character  
##  Median :500.0   Mode  :character   Median :1999   Mode  :character  
##  Mean   :500.0                      Mean   :1991                     
##  3rd Qu.:749.5                      3rd Qu.:2009                     
##  Max.   :999.0                      Max.   :2020                     
##                                     NA's   :1                        
##     Runtime         Genre            IMDB_Rating      Overview        
##  Min.   : 45.0   Length:999         Min.   :7.600   Length:999        
##  1st Qu.:103.0   Class :character   1st Qu.:7.700   Class :character  
##  Median :119.0   Mode  :character   Median :7.900   Mode  :character  
##  Mean   :122.9                      Mean   :7.948                     
##  3rd Qu.:137.0                      3rd Qu.:8.100                     
##  Max.   :321.0                      Max.   :9.200                     
##                                                                       
##    Meta_score       Director            Star1              Star2          
##  Min.   : 28.00   Length:999         Length:999         Length:999        
##  1st Qu.: 70.00   Class :character   Class :character   Class :character  
##  Median : 79.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 77.97                                                           
##  3rd Qu.: 87.00                                                           
##  Max.   :100.00                                                           
##  NA's   :157                                                              
##     Star3              Star4            No_of_Votes          Gross          
##  Length:999         Length:999         Min.   :  25088   Min.   :     1305  
##  Class :character   Class :character   1st Qu.:  55472   1st Qu.:  3245338  
##  Mode  :character   Mode  :character   Median : 138356   Median : 23457440  
##                                        Mean   : 271621   Mean   : 68082574  
##                                        3rd Qu.: 373168   3rd Qu.: 80876340  
##                                        Max.   :2303232   Max.   :936662225  
##                                                          NA's   :169

Apart from the NAs (1 in Released_Year, 157 in Meta_score, and 169 in Gross), which will need to be handled before building a predictive model, the summary reveals some interesting facts: the oldest movie in the dataset is from 1920, and one movie runs for 321 minutes!

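Additionally, most variables show moderate variation, although Gross and No_of_Votes are strongly right-skewed (their means are well above their medians), so a handful of blockbusters can pull averages upward in the analyses that follow.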
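The two checks discussed above (missing values and the spread of each numeric column) can be sketched as follows. This is a minimal illustration on an invented toy data frame, not on `imdb_movies.csv`; the names `toy`, `na_por_coluna`, and `razao_media_mediana` are hypothetical.

```r
# Toy stand-in for `filmes`; the values are invented for illustration only.
toy <- data.frame(
  Runtime    = c(95, 120, 142, 110, 321),
  Meta_score = c(80, NA, 66, 91, NA),
  Gross      = c(1305, 3.2e6, 2.3e7, NA, 9.4e8)
)

# Number of missing values per column
na_por_coluna <- colSums(is.na(toy))
print(na_por_coluna)

# Mean/median ratio per column: values well above 1 point to a long right tail
razao_media_mediana <- sapply(
  toy,
  function(x) mean(x, na.rm = TRUE) / median(x, na.rm = TRUE)
)
print(round(razao_media_mediana, 2))
```

On the real data, the same two calls could be applied to the numeric columns of `filmes` to decide which variables might benefit from a log transformation before modeling.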

ANOVA

Analysis of Variance (ANOVA) is a fundamental statistical technique used to compare the means of three or more independent groups to determine if there is a statistically significant difference between them.

If ANOVA finds significant differences between the group means, it indicates that the target variable differs across the levels of the categorical variable.

For example, if an ANOVA of IMDB_Rating by genre yields a p-value < 0.01, there is a significant difference between the mean IMDB ratings of the different genres, suggesting that, at the 1% significance level (99% confidence), the genre of the film is related to its rating.
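To make the decision rule concrete, here is a small sketch on simulated data. The three genres, their ratings, and the names `toy`, `modelo`, `tabela`, and `valor_p` are all invented for illustration; only `aov()` and `summary()` come from base R.

```r
# Toy data: three hypothetical genres with simulated ratings
set.seed(42)
toy <- data.frame(
  genero = rep(c("Drama", "Comedy", "Action"), each = 20),
  nota   = c(rnorm(20, mean = 8.0, sd = 0.2),
             rnorm(20, mean = 7.9, sd = 0.2),
             rnorm(20, mean = 8.3, sd = 0.2))
)

modelo <- aov(nota ~ genero, data = toy)
tabela <- summary(modelo)[[1]]       # ANOVA table as a data frame

valor_p <- tabela[["Pr(>F)"]][1]     # p-value of the genre term (first row)
print(tabela)

# TRUE would mean: reject the hypothesis of equal group means at the 1% level
print(valor_p < 0.01)
```

Extracting the p-value from the table this way makes it easy to compare several models against the same threshold instead of reading each printed summary by eye.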

anova1 <- aov(IMDB_Rating ~ Genre, data = filmes) 
anova2 <- aov(IMDB_Rating ~ Director, data = filmes) 
anova3 <- aov(IMDB_Rating ~ Series_Title, data = filmes) 
anova4 <- aov(IMDB_Rating ~ Overview, data = filmes) 
anova5 <- aov(IMDB_Rating ~ Star1, data = filmes) 

summary(anova1) 
##              Df Sum Sq Mean Sq F value Pr(>F)
## Genre       201  14.61 0.07266   0.975   0.58
## Residuals   797  59.39 0.07451
summary(anova2)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## Director    547  42.83 0.07829   1.133 0.0836 .
## Residuals   451  31.17 0.06911                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(anova3)
##               Df Sum Sq Mean Sq F value Pr(>F)
## Series_Title 997  73.99 0.07421   14.84  0.205
## Residuals      1   0.00 0.00500
summary(anova4)
##              Df Sum Sq Mean Sq
## Overview    998  73.99 0.07414
summary(anova5)
##              Df Sum Sq Mean Sq F value Pr(>F)
## Star1       658  48.79 0.07415       1  0.503
## Residuals   340  25.20 0.07413

We can see that none of the p-values obtained were below 0.01, so at the 1% significance level we cannot reject the hypothesis of equal mean ratings for any of the tested variables. Note that no F value could be computed for Overview, because every film has a unique overview and no residual degrees of freedom remain; Series_Title is in nearly the same situation, with a single residual degree of freedom.

In summary, the analysis of variance (ANOVA) finds no statistically significant evidence that movie title, overview, genre, director, or lead actor is associated with the IMDB_Rating variable.
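The empty F column in the Overview output has a mechanical explanation: when every row is its own group, no within-group variation is left to estimate the error term. A minimal sketch on invented data (the names `toy`, `niveis`, and `gl_residual` are hypothetical):

```r
# Toy data: every "title" appears exactly once, so each row is its own group
toy <- data.frame(
  titulo = paste("Film", 1:6),
  nota   = c(7.6, 7.9, 8.1, 8.4, 8.8, 9.2)
)

niveis      <- length(unique(toy$titulo))  # number of groups
gl_residual <- nrow(toy) - niveis          # residual degrees of freedom

print(c(linhas = nrow(toy), niveis = niveis, gl_residual = gl_residual))

modelo <- aov(nota ~ titulo, data = toy)
print(df.residual(modelo))  # 0: no residual variation left to form an F ratio
```

Counting distinct levels before fitting is a cheap way to spot categorical variables that behave like row identifiers and are therefore unsuitable for this kind of test.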

Analyzing the correlation between numerical variables

chart.Correlation(
  filmes[, c("Released_Year", "Runtime", "Meta_score", "No_of_Votes", "Gross", "IMDB_Rating")],
  histogram = TRUE
)

The correlation chart shows how the numerical variables relate to one another, which helps us understand the data.

The variable most strongly correlated with the IMDb rating is the number of votes. One possible explanation is that people are more likely to make the effort to go to the IMDb website and vote when they watch a movie they enjoy.

The number of votes is also positively correlated with revenue: movies with more votes tend to be those that earn more. Interestingly, revenue has a low correlation with IMDb ratings. This suggests that the highest-grossing movies are generally the most popular ones, with a larger number of votes, but not necessarily the ones that please the audience the most.

Another interesting relationship is the release year in relation to IMDb ratings and critics’ scores (Meta Score). We can observe that older movies were slightly better rated by both the audience and critics. This relationship indicates that there may have been a decline in the quality of films over the years, or at least a decline in the perception of film quality by the public and critics.

We can also see that audience ratings and critics’ ratings are only weakly correlated. This is interesting, as one might initially expect a stronger relationship, given that both represent evaluations made by people who watched the movies. The weak relationship suggests that these two groups likely apply quite different evaluation criteria.

Overall, despite the interesting insights that can be drawn from the correlation chart, no pair of numerical variables is very highly correlated, whether with IMDB_Rating or among themselves.
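To read the correlations as numbers rather than from a chart, a matrix can be computed with base R's `cor()`. The sketch below uses an invented toy data frame; `toy`, `correlacoes`, and `com_nota` are hypothetical names, and the choice of Pearson correlation with pairwise deletion of missing values is an assumption made for this example.

```r
# Toy stand-in for the numeric columns of `filmes`; values are invented
toy <- data.frame(
  IMDB_Rating = c(7.6, 7.8, 8.0, 8.3, 8.7, 9.0),
  No_of_Votes = c(30000, 90000, 150000, 400000, 900000, 2000000),
  Gross       = c(2e6, NA, 5e7, 3e7, NA, 1.3e8),
  Meta_score  = c(70, 85, NA, 78, 90, 82)
)

# Pairwise deletion: each pair of columns uses all rows complete for that pair
correlacoes <- cor(toy, use = "pairwise.complete.obs", method = "pearson")
print(round(correlacoes, 2))

# Correlation of every variable with the rating, strongest first
com_nota <- sort(correlacoes[, "IMDB_Rating"], decreasing = TRUE)
print(round(com_nota, 2))
```

Sorting one column of the matrix gives a direct ranking of which variables move most closely with the rating, which is harder to judge from the chart alone.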

Which directors worked on the highest-grossing films?

# Analyzing directors in relation to revenue
diretor_faturamento <- filmes %>%
  group_by(Director) %>%
  summarise(media_faturamento = mean(Gross, na.rm = TRUE)) %>%
  arrange(desc(media_faturamento))

# Displaying the top 20 directors with the highest average revenue
head(diretor_faturamento, 20)
## # A tibble: 20 × 2
##    Director         media_faturamento
##    <chr>                        <dbl>
##  1 Anthony Russo           551259851.
##  2 Gareth Edwards          532177324 
##  3 J.J. Abrams             474390302.
##  4 Josh Cooley             434038008 
##  5 Roger Allers            422783777 
##  6 Tim Miller              363070709 
##  7 James Gunn              361494850.
##  8 James Cameron           349647320.
##  9 Byron Howard            341268248 
## 10 David Yates             326317907 
## 11 David Leitch            324591735 
## 12 Joss Whedon             324397032 
## 13 George Lucas            322740140 
## 14 Peter Jackson           319462489.
## 15 Jon Favreau             318412101 
## 16 Pete Docter             313127377 
## 17 Lee Unkrich             312365448.
## 18 Richard Marquand        309125409 
## 19 Todd Phillips           306386907 
## 20 Gore Verbinski          305413918

In addition to the insights derived from numerical variables, valuable insights can also be gained from the categorical variables in the dataset. Since the goal of studios is to achieve profit from their productions and considering that ratings have little influence on revenue, it makes more sense to analyze which directors worked on the highest-grossing films rather than focusing on ratings. This way, the studio can choose directors with a history of contributing to high financial returns when planning new projects.
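One caveat of a plain mean is that a director with a single blockbuster ranks as high as one with a long track record. A possible refinement, sketched here on invented data, is to report the number of films next to the average and require a minimum count. The threshold `min_filmes` and the names `toy` and `diretor_resumo` are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for `filmes`; directors and values are invented
toy <- tibble(
  Director = c("A", "A", "A", "B", "C", "C"),
  Gross    = c(2e8, 3e8, NA, 9e8, 1e8, 1.5e8)
)

min_filmes <- 2  # hypothetical threshold, chosen only for this example

diretor_resumo <- toy %>%
  filter(!is.na(Gross)) %>%                 # keep films with known revenue
  group_by(Director) %>%
  summarise(
    n_filmes          = n(),                # films with known revenue
    media_faturamento = mean(Gross),
    .groups = "drop"
  ) %>%
  filter(n_filmes >= min_filmes) %>%        # drop one-film directors
  arrange(desc(media_faturamento))

print(diretor_resumo)
```

In this toy example, director "B" has the single highest-grossing film but is excluded because only one film supports the average.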

To further enhance this analysis, I will incorporate considerations based on the Value Above Replacement (VAR) statistic, as discussed in the work of data scientist Jeremy Lee. VAR, in summary, quantifies how often an actor/actress or director has been involved in films with above-average profitability.

Link to the VAR of directors

It is observed that the top five in the VAR are all included in the list of the top 20 directors provided earlier, which reinforces the reliability and relevance of this list.

Which actors worked in the highest-grossing films?

# Analyzing actors in relation to revenue
atores <- filmes %>%
  select(Gross, Star1, Star2, Star3, Star4) %>%
  gather(key = "posicao", value = "ator", Star1:Star4) %>%
  group_by(ator) %>%
  summarise(media_faturamento = mean(Gross, na.rm = TRUE)) %>%
  arrange(desc(media_faturamento))

# Displaying the top 60 actors with the highest average revenue
top_60_atores <- atores[1:60, ]

num_rows <- ceiling(nrow(top_60_atores) / 3)  # number of rows needed for a 3-column layout
atores_matrix <- matrix(paste0(seq_along(top_60_atores$ator), ". ", top_60_atores$ator), nrow = num_rows, byrow = TRUE)

print(atores_matrix)
##       [,1]                    [,2]                  [,3]                    
##  [1,] "1. Daisy Ridley"       "2. John Boyega"      "3. Michelle Rodriguez" 
##  [2,] "4. Billy Zane"         "5. Huck Milner"      "6. Sarah Vowell"       
##  [3,] "7. Joe Russo"          "8. Aaron Eckhart"    "9. Diego Luna"         
##  [4,] "10. Donnie Yen"        "11. Oscar Isaac"     "12. Robert Downey Jr." 
##  [5,] "13. Dee Wallace"       "14. Drew Barrymore"  "15. Henry Thomas"      
##  [6,] "16. Peter Coyote"      "17. Craig T. Nelson" "18. Holly Hunter"      
##  [7,] "19. Annie Potts"       "20. Tony Hale"       "21. Zoe Saldana"       
##  [8,] "22. James Earl Jones"  "23. Rob Minkoff"     "24. Joan Cusack"       
##  [9,] "25. Sam Worthington"   "26. Jeff Goldblum"   "27. Chris Evans"       
## [10,] "28. Alexander Gould"   "29. Ellen DeGeneres" "30. Ed Skrein"         
## [11,] "31. T.J. Miller"       "32. Bill Hader"      "33. Lewis Black"       
## [12,] "34. Ronnie Del Carmen" "35. Elijah Wood"     "36. Morena Baccarin"   
## [13,] "37. Ryan Reynolds"     "38. Mark Ruffalo"    "39. Michael Gambon"    
## [14,] "40. Jared Bush"        "41. Jason Bateman"   "42. Rich Moore"        
## [15,] "43. Chris Hemsworth"   "44. Frances Conroy"  "45. Zazie Beetz"       
## [16,] "46. Orlando Bloom"     "47. Sally Field"     "48. Chris Pratt"       
## [17,] "49. Tim Allen"         "50. Terrence Howard" "51. Anne Hathaway"     
## [18,] "52. Maggie Smith"      "53. Sean Bean"       "54. Tom Hiddleston"    
## [19,] "55. Heath Ledger"      "56. Mark Hamill"     "57. Daniel Radcliffe"  
## [20,] "58. Rupert Grint"      "59. Lee Unkrich"     "60. Billy Dee Williams"

Just as was done with directors, it is also necessary to analyze actors in relation to revenue. This way, the studio can have a list of actors who are most associated with high-grossing films, making it easier to select talent for future productions.
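The reshaping step used for the actors (stacking Star1 to Star4 into one column) can also be written with `tidyr::pivot_longer()`, which tidyr recommends over `gather()` for new code. The sketch below runs on invented data and additionally counts the films behind each average; `toy`, `atores_toy`, and `n_filmes` are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for `filmes`; actors and values are invented
toy <- tibble(
  Gross = c(9e8, 1e8, NA),
  Star1 = c("Actor A", "Actor B", "Actor A"),
  Star2 = c("Actor B", "Actor C", "Actor D"),
  Star3 = c("Actor C", "Actor D", "Actor E"),
  Star4 = c("Actor D", "Actor E", "Actor F")
)

atores_toy <- toy %>%
  # One row per (film, actor) pair
  pivot_longer(cols = Star1:Star4, names_to = "posicao", values_to = "ator") %>%
  group_by(ator) %>%
  summarise(
    n_filmes          = sum(!is.na(Gross)),        # films with known revenue
    media_faturamento = mean(Gross, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(n_filmes > 0) %>%                         # drop actors with no revenue data
  arrange(desc(media_faturamento))

print(atores_toy)
```

Keeping `n_filmes` in the result helps distinguish actors who appear in many high-grossing films from those who appear in just one.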

Link to the VAR of actors

Chris Hemsworth, Robert Downey Jr., Elijah Wood, and Zoe Saldana are some of the actors well-positioned on the VAR list and also appear in the above list. Just like with directors, this fact adds reliability to the exploratory data analysis (EDA) done here, as it shows that similar results were found by other researchers using different datasets.

Which film genres achieved the highest revenue?

# Group by genre and calculate the average revenue
genero_faturamento <- filmes %>%
  group_by(Genre) %>%
  summarise(media_faturamento = mean(Gross, na.rm = TRUE)) %>%
  arrange(desc(media_faturamento))

# Displaying the top 20 genres with the highest average revenue
head(genero_faturamento, 20)
## # A tibble: 20 × 2
##    Genre                        media_faturamento
##    <chr>                                    <dbl>
##  1 Family, Sci-Fi                      435110554 
##  2 Action, Adventure, Fantasy          352723505.
##  3 Action, Adventure, Family           301959197 
##  4 Action, Adventure, Sci-Fi           280888546.
##  5 Adventure, Fantasy                  280685212.
##  6 Adventure, Thriller                 260000000 
##  7 Animation, Comedy, Crime            251513985 
##  8 Action, Adventure                   229507242.
##  9 Animation, Adventure, Comedy        225166880.
## 10 Action, Adventure, Drama            222402969.
## 11 Action, Adventure, Comedy           213379327.
## 12 Action, Adventure, Mystery          209028679 
## 13 Animation, Action, Adventure        200869452.
## 14 Action, Mystery, Thriller           175124898 
## 15 Adventure, Family, Fantasy          174928484.
## 16 Action, Adventure, Thriller         168611780.
## 17 Adventure, Comedy, Sci-Fi           164554881 
## 18 Comedy, Family                      153183226 
## 19 Adventure, Drama, Sci-Fi            150668688.
## 20 Action, Drama, Sci-Fi               148394052.

Finally, it is important to analyze which genres tend to be more profitable. It is noted that action and adventure films are among the most lucrative. Therefore, it is advisable for the studio, if prioritizing revenue, to consider these genres when creating a new film.
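Because the Genre column stores combinations such as "Action, Adventure, Sci-Fi", the ranking above compares combinations rather than individual genres. One way to check the claim about action and adventure directly is to split each combination into one row per genre before averaging. The sketch below uses invented data; `toy` and `genero_individual` are hypothetical names, and the comma separator is an assumption based on the printed output.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for `filmes`; values are invented
toy <- tibble(
  Genre = c("Action, Adventure, Sci-Fi", "Drama", "Action, Drama", "Family, Sci-Fi"),
  Gross = c(6e8, 5e7, 1.5e8, 4e8)
)

genero_individual <- toy %>%
  # One row per (film, genre) pair; the separator is a comma plus optional spaces
  separate_rows(Genre, sep = ",\\s*") %>%
  group_by(Genre) %>%
  summarise(
    n_filmes          = n(),
    media_faturamento = mean(Gross, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(media_faturamento))

print(genero_individual)
```

With this layout a film counts once for each of its genres, so the averages are not independent of one another, but each genre is backed by more films than any single combination.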

What should the studio’s next film look like if focused mainly on revenue?

Through this brief exploratory data analysis (EDA), valuable guidance can be provided to the studio for planning its next films.

We can even propose an “action plan” for the studio’s next project! Considering a new film with a primary focus on revenue, a good scope for the studio’s next project, based on the data explored here and on the VAR statistic, would be:

Genre: A family-friendly Action and Adventure film with a Sci-Fi context

Director: Anthony Russo

Lead Actors: Chris Hemsworth, Robert Downey Jr., Elijah Wood, Zoe Saldana, and Anne Hathaway.

Overview: In a distant future, a group of unlikely adventurers sets off on a thrilling quest of action and adventure to rescue their home planet.

An example that illustrates the power of data analysis for business decisions: Guardians of the Galaxy Vol. 3, the 4th highest-grossing film of 2023, is very similar in scope and context to the film suggested here, and it also features a director and actors with styles similar to those proposed. In fact, one of its stars is Zoe Saldana herself!

Highest-Grossing Films of 2023