November 02, 2025

Problem Statement

Core Question: Given historical movie data, can we accurately predict success metrics and identify what factors contribute to higher ratings in the film industry?


Industry Context: The movie industry is a multi-billion-dollar global market in which predicting success is crucial for studios, investors, and creators. Understanding what drives audience reception can inform production decisions and investment strategies.


Key Research Questions

  • Genre Performance: What genres consistently receive higher ratings?
  • Temporal Trends: How has movie quality evolved over time?
  • Budget Impact: Does production budget correlate with better ratings?
  • Predictive Modeling: Can we build an accurate rating prediction model?
  • Genre Combinations: Do specific genre combinations perform better?

Dataset Overview

Overview

  • 58,788 movies with 24 variables including:
    • Title, year, budget, length
    • Rating, votes, genre classifications
  • Time Span: Early cinema (1890s) to modern films (2000s)
  • Data Types: Mixed numerical (budget, rating) and categorical (genres)


Sample Data Preview

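# Load the packages and the dataset used in the project
library(ggplot2movies) # provides the movies dataset
library(dplyr)         # filter(), mutate(), %>%
library(ggplot2)       # regression plot
library(plotly)        # 3D plot
data(movies)

# List the columns used in the project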
key <- c("title", "year", "rating", "budget")
key2 <- c("Action", "Animation", "Comedy", "Drama", "Documentary", "Romance", "Short")

# Fix titles
movies$title <- ifelse(
  grepl(", The$", movies$title),
  paste("The", gsub(", The$", "", movies$title)),
  paste(movies$title)
)

# Data used for equations and graphs
result <- as.data.frame(head(movies[key], 5))
result$title <- as.character(result$title) # Prevent right alignment
print(result, row.names = FALSE, right = FALSE)
 title                   year rating budget
 $                       1971 6.4    NA    
 $1000 a Touchdown       1939 6.0    NA    
 $21 a Day Once a Month  1941 8.2    NA    
 $40,000                 1996 8.2    NA    
 The $50,000 Climax Show 1975 3.4    NA    
# Types of genres in the dataset
result2 <- as.data.frame(head(movies[key2], 5))
print(result2, row.names = FALSE, right = FALSE)
 Action Animation Comedy Drama Documentary Romance Short
 0      0         1      1     0           0       0    
 0      0         1      0     0           0       0    
 0      1         0      0     0           0       1    
 0      0         1      0     0           0       0    
 0      0         0      0     0           0       0    

Genre Performance

Objective: Analyze which movie genres achieve higher audience ratings to identify patterns of success.


Analysis Approach

  • Calculate average ratings for the genres
  • Compare genre performance using bar plots
  • Identify top-performing genres based on historical data
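
As a minimal sketch of the first step, the per-genre averages can be computed in one pass over the 0/1 genre columns. The helper name `genre_summary` and the toy data frame are illustrative assumptions, not part of the original analysis; with the real data you would pass `movies` and the genre columns listed earlier. Reporting the number of movies next to each mean makes it easier to judge how much weight an average deserves.

```r
# Hypothetical helper: mean rating and movie count for each 0/1 genre column
genre_summary <- function(df, genres) {
  stats <- lapply(genres, function(g) {
    in_genre <- which(df[[g]] == 1)  # row indices of movies in this genre
    data.frame(
      genre = g,
      n = length(in_genre),
      avg_rating = mean(df$rating[in_genre], na.rm = TRUE)
    )
  })
  out <- do.call(rbind, stats)
  out[order(-out$avg_rating), ]      # highest average first
}

# Toy data for illustration only (not taken from the movies dataset)
toy <- data.frame(
  rating = c(6, 8, 4, 5, 7),
  Action = c(1, 0, 1, 0, 0),
  Drama  = c(0, 1, 0, 1, 1)
)

toy_summary <- genre_summary(toy, c("Action", "Drama"))
print(toy_summary, row.names = FALSE)
```

In the toy data, Drama (three movies) ranks above Action (two movies); the same call on the full dataset would return one row per genre.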


Bar Plot

# Average rating for each genre
average <- c(
  action = mean(movies$rating[movies$Action == 1], na.rm = TRUE),
  animation = mean(movies$rating[movies$Animation == 1], na.rm = TRUE),
  comedy = mean(movies$rating[movies$Comedy == 1], na.rm = TRUE),
  drama = mean(movies$rating[movies$Drama == 1], na.rm = TRUE),
  documentary = mean(movies$rating[movies$Documentary == 1], na.rm = TRUE),
  romance = mean(movies$rating[movies$Romance == 1], na.rm = TRUE),
  short = mean(movies$rating[movies$Short == 1], na.rm = TRUE)
)

# Create colors
pastels <- c("#FFB3BA", # Action 
             "#FFDFBA", # Animation
             "#FFFFBA", # Comedy
             "#BAFFC9", # Drama
             "#BAE1FF", # Documentary
             "#E2BAFF", # Romance
             "#D3D3D3") # Short

# Bar plot
bp <- barplot(average,
              main = "Average Rating by Genre", 
              xlab = "Genre", 
              ylab = "Average Rating (1-10)",
              ylim = c(0, max(average) * 1.2),
              names.arg = names(average), 
              cex.names = 0.7,
              col = pastels)

# Add numbers on top of bars
text(x = bp, y = average, label = round(average, 2), pos = 3)

Graph Analysis: Documentary films have the highest average rating at 6.65/10, suggesting that informative content resonates with audiences. At the other end, action films have the lowest average rating at 5.29/10, which may reflect a genre that prioritizes spectacle over the qualities that raters reward. The contrast shows how genre shapes audience ratings, although these averages do not show how many films each genre contains or how much ratings vary within it.

Temporal Trends

Budget Impact

Objective: Investigate whether higher production budgets lead to improved audience ratings.


Analysis Approach

  • Examine budget and rating data for films
  • Use logarithmic scale for budget to account for spending differences
  • Calculate correlation coefficient to measure relationship strength
  • Visualize with scatter plot and trend line
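
The correlation step in this list can be sketched as follows. `log_budget_cor` is a hypothetical helper, and the simulated data frame only stands in for the filtered movies data; the numbers it produces say nothing about real films. `cor.test()` is base R and reports a confidence interval with the estimate, which helps distinguish a weak relationship from no relationship.

```r
# Hypothetical helper: Pearson correlation between log(budget) and rating
log_budget_cor <- function(df) {
  ok <- !is.na(df$budget) & df$budget > 0 & !is.na(df$rating)
  cor.test(log(df$budget[ok]), df$rating[ok])
}

# Simulated stand-in data (illustration only)
set.seed(1)
sim <- data.frame(budget = exp(runif(200, log(1e5), log(1e8))))
sim$rating <- 6.5 - 0.05 * log(sim$budget) + rnorm(200, sd = 1.5)
sim$budget[1:5] <- NA  # missing budgets are dropped by the helper

res <- log_budget_cor(sim)
cat("r =", round(unname(res$estimate), 3), "\n")
cat("95% CI:", round(res$conf.int, 3), "\n")
```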


\[\Large Rating = \beta_0 + \beta_1 \cdot \log(Budget) + \varepsilon\]

  • Rating: Dependent variable - This is what we’re trying to predict.
  • \(\beta_0\): Intercept - Predicted rating when log(budget) = 0.
  • \(\beta_1\): Slope coefficient - How much rating changes for each 1-unit increase in log(Budget).
  • log(Budget): Independent variable - Natural log of the movie’s production budget.
  • \(\varepsilon\): Error term - Difference between predicted and actual rating.


Why log(Budget)?

Using log(Budget) instead of Budget is necessary because movie budgets are highly skewed. Most films have low-to-medium budgets, while a few blockbusters have extremely high ones.

A linear model using raw Budget would be dominated by these extreme values. The effect of an extra dollar is also not constant: going from $1 million to $2 million doubles the budget, while going from $200 million to $201 million is a 0.5% change.

The log transformation fixes this by measuring proportional changes rather than dollar changes. This means our model interprets a budget doubling (e.g., $10M to $20M) as the same increase, regardless of the starting point, which matches how we intuitively understand budget impact.
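
To make the proportional interpretation concrete, the sketch below plugs in the rounded coefficients that the fitted model reports in the Example output (6.7658 and -0.0438). Under this model, doubling the budget shifts the predicted rating by \(\beta_1 \cdot \log(2)\), whatever the starting budget. The function name `predict_rating` is an illustrative assumption.

```r
# Rounded coefficients from the fitted model rating ~ log(budget)
b0 <- 6.7658
b1 <- -0.0438

# Hypothetical helper: predicted rating for a given budget in dollars
predict_rating <- function(budget) b0 + b1 * log(budget)

# Change in predicted rating when the budget doubles
change_10m_to_20m   <- predict_rating(20e6)  - predict_rating(10e6)
change_100m_to_200m <- predict_rating(200e6) - predict_rating(100e6)

print(round(c(change_10m_to_20m, change_100m_to_200m, b1 * log(2)), 4))
# all three values are -0.0304
```

A doubling therefore moves the predicted rating by about 0.03 points, which is small next to the 1-10 rating scale.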


Example

# Remove rows with N/A values
budget_data <- movies %>% filter(!is.na(budget) & budget > 0)

# Linear regression
rating_model <- lm(rating ~ log(budget), data = budget_data)
b0 <- round(coef(rating_model)[1], 4)
b1 <- round(coef(rating_model)[2], 4)

cat("NOTE: For prediction, we typically assume epsilon = 0 (the average case)\n
GENERAL EQUATION
Rating = ", b0, " + ", b1, " * log(Budget)\n\n", sep = "")
NOTE: For prediction, we typically assume epsilon = 0 (the average case)

GENERAL EQUATION
Rating = 6.7658 + -0.0438 * log(Budget)
# Solve for Rating (example)
example_budget <- 50000000
prediction = b0 + b1 * log(example_budget)

cat("EXAMPLE CALCULATION
If Budget: $", format(example_budget, scientific = FALSE), "
Rating = ", b0, " + ", b1, " * log(", format(example_budget, scientific = FALSE), ")
Rating = ", round(prediction, 2), "/10\n\n", sep = "")
EXAMPLE CALCULATION
If Budget: $50000000
Rating = 6.7658 + -0.0438 * log(50000000)
Rating = 5.99/10


Linear Regression (ggplot)

# Remove rows with N/A values
budget_data <- movies %>% filter(!is.na(budget) & budget > 0 & !is.na(rating))

# Linear model
rating_model <- lm(rating ~ log(budget), data = budget_data)

# Linear regression plot (using ggplot)
ggplot(budget_data, aes(x = budget, y = rating)) +
  geom_point(color = "steelblue", alpha = 0.6, size = 2.2) +
  geom_smooth(method = "lm", color = "red", se = TRUE, linetype = "dashed") +
  scale_x_continuous(trans = "log10", labels = scales::dollar_format(), breaks = c(1e5, 1e6, 1e7, 1e8)) +
  labs(title = "Movie Rating vs. Budget",
       subtitle = "Little to no relationship between movie budget and average ratings",
       x = "Budget (Log Scale)",
       y = "Average Rating (1–10)") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    axis.title.x = element_text(size = 11, vjust = -1),
    axis.title.y = element_text(size = 11, vjust = 2)
    )

Graph Analysis: The fitted regression line shows a very weak, slightly negative relationship between budget and rating (about -0.04 rating points per unit of log(Budget) in the model fitted earlier). Higher production budgets therefore do not translate into higher audience approval; if anything, movies with larger budgets receive slightly lower ratings on average. Financial investment alone is not a reliable indicator of a film’s perceived quality, and other factors such as storytelling, originality, and genre appeal likely have a stronger influence on audience reception.

Predictive Modeling

Objective: Develop and evaluate a predictive model to estimate movie ratings.


Analysis Approach

  • Use multiple regression with key predictors (budget, year)
  • Calculate correlation coefficient to measure relationship strength
  • Compare predicted vs. actual ratings
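
One way to carry out the comparison of predicted and actual ratings is to hold out part of the data, fit on the rest, and score the predictions. The sketch below assumes a data frame with `rating`, `budget`, and `year` columns; `evaluate_model` and the simulated data are illustrative assumptions, and the 80/20 split is a common default rather than something the original analysis specifies.

```r
# Hypothetical helper: fit on a training split, score on the held-out rows
evaluate_model <- function(df, train_frac = 0.8, seed = 42) {
  set.seed(seed)
  train_idx <- sample(nrow(df), size = floor(train_frac * nrow(df)))
  train <- df[train_idx, ]
  test  <- df[-train_idx, ]

  model <- lm(rating ~ log(budget) + year, data = train)
  pred  <- predict(model, newdata = test)

  list(
    rmse     = sqrt(mean((test$rating - pred)^2)), # typical prediction error
    cor      = cor(test$rating, pred),             # predicted vs. actual
    r2_train = summary(model)$r.squared            # variance explained (training)
  )
}

# Simulated stand-in data (illustration only)
set.seed(7)
n <- 500
sim <- data.frame(
  budget = exp(runif(n, log(1e5), log(1e8))),
  year   = sample(1950:2005, n, replace = TRUE)
)
sim$rating <- 13 - 0.04 * log(sim$budget) - 0.003 * sim$year + rnorm(n, sd = 1.5)

metrics <- evaluate_model(sim)
str(metrics)
```

An RMSE close to the standard deviation of the ratings, together with a low R², would indicate that the predictors add little beyond predicting the mean.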


\[\Large Rating = \beta_0 + \beta_1 \cdot \log(Budget) + \beta_2 \cdot Year + \varepsilon\]

  • Rating: Dependent variable - This is what we’re trying to predict.
  • \(\beta_0\): Intercept - Predicted rating when all variables are 0.
  • \(\beta_1\): Budget coefficient - How much rating changes for each 1-unit increase in log(budget).
  • log(Budget): Independent variable - Natural log of the movie’s production budget.
  • \(\beta_2\): Year coefficient - How much rating changes for each additional year.
  • Year: Temporal predictor - Release year of the movie.
  • \(\varepsilon\): Error term - Difference between predicted and actual rating.


Example

# Remove rows with N/A values
plot_data <- movies %>%
  filter(!is.na(rating) & !is.na(budget) & budget > 0)

# Multiple regression
rating_model <- lm(rating ~ log(budget) + year, data = plot_data)
b0 <- round(coef(rating_model)[1], 4)
b1 <- round(coef(rating_model)[2], 4)
b2 <- round(coef(rating_model)[3], 4)

cat("NOTE: For prediction, we typically assume epsilon = 0 (the average case)\n
GENERAL EQUATION
Rating = ", b0, " + ", b1, " * log(Budget) + ", b2, " * Year\n\n", sep = "")
NOTE: For prediction, we typically assume epsilon = 0 (the average case)

GENERAL EQUATION
Rating = 13.488 + -0.0387 * log(Budget) + -0.0034 * Year
# Solve for Rating (example)
example_budget <- 50000000
example_year <- 2020
prediction = b0 + b1 * log(example_budget) + b2 * example_year

cat("EXAMPLE CALCULATION
If Budget: $", format(example_budget, scientific = FALSE), "
If Year: ", example_year, "
Rating = ", b0, " + ", 
          b1, " * log(", format(example_budget, scientific = FALSE), ") + ",
          b2, " * ", example_year, "
Rating = ", round(prediction, 2), "/10\n\n", sep = "")
EXAMPLE CALCULATION
If Budget: $50000000
If Year: 2020
Rating = 13.488 + -0.0387 * log(50000000) + -0.0034 * 2020
Rating = 5.93/10
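
Because the example multiplies a coefficient rounded to four decimals by a year near 2000, rounding alone can move the result. The sketch below bounds that effect using only the inputs of the worked example; in practice, calling `predict()` on the fitted model avoids the issue because it uses the unrounded coefficients. Note also that 2020 lies beyond the years covered by the dataset, so this prediction is an extrapolation.

```r
# Inputs from the worked example
example_budget <- 50000000
example_year   <- 2020

# A value rounded to 4 decimals can be off by up to 0.00005
max_round_err <- 0.00005

# Worst-case shift in the prediction from rounding each coefficient
err_intercept <- max_round_err
err_budget    <- max_round_err * log(example_budget)
err_year      <- max_round_err * example_year

worst_case <- err_intercept + err_budget + err_year
print(round(c(budget = err_budget, year = err_year, total = worst_case), 4))
```

The year term accounts for almost all of the worst-case error (about 0.1 rating points), so the printed 5.93 is best read as approximate.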

Multiple Linear Regression (plotly)

# Remove rows with N/A values
plot_data <- movies %>%
  filter(!is.na(rating) & !is.na(budget) & budget > 0)

# Multiple regression model
plot_model <- lm(rating ~ log(budget) + year, data = plot_data)

# Create ranges for log(budget) instead of raw budget
log_budget_range <- seq(min(log(plot_data$budget)), max(log(plot_data$budget)), length.out = 20)
year_range <- seq(min(plot_data$year), max(plot_data$year), length.out = 20)
grid <- expand.grid(log_budget = log_budget_range, year = year_range)
grid$prediction <- predict(plot_model, newdata = data.frame(
  budget = exp(grid$log_budget),
  year = grid$year
))

# Convert log values back to dollars
budget_breaks <- c(1e5, 1e6, 1e7, 1e8, 1e9)  # $100K, $1M, $10M, $100M, $1B
log_budget_breaks <- log(budget_breaks)

# 3D scatter plot
plot_ly() %>%
  add_trace(x = log(plot_data$budget), y = plot_data$year, z = plot_data$rating,
            type = "scatter3d", mode = "markers",
            marker = list(color = "steelblue", size = 3, opacity = 0.5),
            showlegend = FALSE) %>%
  # plotly expects z[i, j] to match y[i] and x[j], so fill one row per year
  add_surface(x = log_budget_range, y = year_range, 
              z = matrix(grid$prediction, nrow = length(year_range), ncol = length(log_budget_range), byrow = TRUE),
              colorscale = list(list(0, "red"), list(1, "red")),
              opacity = 0.4,
              showscale = FALSE,
              name = "Prediction Plane") %>%
  layout(title = list(text = "Movie Ratings vs. Budget and Release Year",
                      x = 0.5, y = 0.95,
                      font = list(size = 20)),
         scene = list(xaxis = list(title = "Budget (Log Scale)", 
                                   tickvals = log_budget_breaks,
                                   ticktext = c("$100K", "$1M", "$10M", "$100M", "$1B")),
                      yaxis = list(title = "Release Year"), 
                      zaxis = list(title = "Rating (1-10)"),
                      aspectmode = "cube",
                      camera = list(eye = list(x = 1.5, y = -1.5, z = 1.5))
                      ),
         annotations = list(
           text = "Neither budget increases nor temporal trends significantly affect audience ratings",
           x = 0.5, 
           y = 0.93,
           xref = "paper",
           yref = "paper",
           showarrow = FALSE,
           font = list(size = 15, color = "gray")
         ))

Graph Analysis: The visualization shows a weak predictive relationship between budget, release year, and movie ratings. The nearly flat red prediction plane indicates that neither larger budgets nor later release years meaningfully raise audience ratings. The blue data points form a wide cloud around the plane, indicating substantial variability in ratings that budget and year do not explain. Although budgets range from about $100K to over $100 million, most ratings fall between 4 and 8 out of 10.

Genre Combinations

Objective: Analyze whether certain genre pairings achieve higher ratings than individual genres alone.


Analysis Approach

  • Identify best and worst genre combinations in the dataset
  • Calculate average ratings for genre pairs
  • Compare combination ratings against single-genre baselines


Top 10 Genre Combinations

# Remove rows with N/A values
genre_combos <- movies %>%
  filter(!is.na(rating)) %>%
  mutate(
    genre_count = Action + Animation + Comedy + Drama + Documentary + Romance + Short,
    
    # All possible 2-genre combinations
    action_animation = ifelse(Action == 1 & Animation == 1, 1, 0),
    action_comedy = ifelse(Action == 1 & Comedy == 1, 1, 0),
    action_drama = ifelse(Action == 1 & Drama == 1, 1, 0),
    action_documentary = ifelse(Action == 1 & Documentary == 1, 1, 0),
    action_romance = ifelse(Action == 1 & Romance == 1, 1, 0),
    action_short = ifelse(Action == 1 & Short == 1, 1, 0),
    animation_comedy = ifelse(Animation == 1 & Comedy == 1, 1, 0),
    animation_drama = ifelse(Animation == 1 & Drama == 1, 1, 0),
    animation_documentary = ifelse(Animation == 1 & Documentary == 1, 1, 0),
    animation_romance = ifelse(Animation == 1 & Romance == 1, 1, 0),
    animation_short = ifelse(Animation == 1 & Short == 1, 1, 0),
    comedy_drama = ifelse(Comedy == 1 & Drama == 1, 1, 0),
    comedy_documentary = ifelse(Comedy == 1 & Documentary == 1, 1, 0),
    comedy_romance = ifelse(Comedy == 1 & Romance == 1, 1, 0),
    comedy_short = ifelse(Comedy == 1 & Short == 1, 1, 0),
    drama_documentary = ifelse(Drama == 1 & Documentary == 1, 1, 0),
    drama_romance = ifelse(Drama == 1 & Romance == 1, 1, 0),
    drama_short = ifelse(Drama == 1 & Short == 1, 1, 0),
    documentary_romance = ifelse(Documentary == 1 & Romance == 1, 1, 0),
    documentary_short = ifelse(Documentary == 1 & Short == 1, 1, 0),
    romance_short = ifelse(Romance == 1 & Short == 1, 1, 0)
  )

# Average ratings for all combinations
combo_ratings <- data.frame(
  Combination = c(
    "Action_Animation", "Action_Comedy", "Action_Drama", "Action_Documentary", "Action_Romance", 
    "Action_Short","Animation_Comedy", "Animation_Drama", "Animation_Documentary", "Animation_Romance", 
    "Animation_Short","Comedy_Drama", "Comedy_Documentary", "Comedy_Romance", "Comedy_Short",
    "Drama_Documentary", "Drama_Romance", "Drama_Short", "Documentary_Romance", "Documentary_Short",
    "Romance_Short", "Single_Genre", "Multiple_Genres"
  ),
  
  Average_Rating = c(
    mean(genre_combos$rating[genre_combos$action_animation == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$action_comedy == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$action_drama == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$action_documentary == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$action_romance == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$action_short == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$animation_comedy == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$animation_drama == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$animation_documentary == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$animation_romance == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$animation_short == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$comedy_drama == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$comedy_documentary == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$comedy_romance == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$comedy_short == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$drama_documentary == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$drama_romance == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$drama_short == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$documentary_romance == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$documentary_short == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$romance_short == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$genre_count == 1], na.rm = TRUE),
    mean(genre_combos$rating[genre_combos$genre_count > 1], na.rm = TRUE)
  )
)

# Top 10 combinations
combo_ratings_sorted <- combo_ratings[order(-combo_ratings$Average_Rating), ]
top_10_combinations <- head(combo_ratings_sorted, 10)
bottom_10_combinations <- tail(combo_ratings_sorted, 10)

cat("TOP 10 GENRE COMBINATIONS BY AVERAGE RATING:\n", 
    capture.output(print(top_10_combinations, row.names = FALSE)), 
    "\nBOTTOM 10 GENRE COMBINATIONS BY AVERAGE RATING:\n",
    capture.output(print(bottom_10_combinations, row.names = FALSE)),
    sep = "\n")
TOP 10 GENRE COMBINATIONS BY AVERAGE RATING:

           Combination Average_Rating
     Animation_Romance       7.092500
       Animation_Drama       6.928148
 Animation_Documentary       6.897674
    Action_Documentary       6.800000
   Documentary_Romance       6.770000
       Animation_Short       6.637548
      Animation_Comedy       6.618081
          Comedy_Short       6.612835
           Drama_Short       6.569791
     Drama_Documentary       6.559055

BOTTOM 10 GENRE COMBINATIONS BY AVERAGE RATING:

       Combination Average_Rating
 Documentary_Short       6.345444
     Drama_Romance       6.292737
  Action_Animation       6.285714
   Multiple_Genres       6.255300
      Comedy_Drama       6.222814
    Comedy_Romance       6.096310
    Action_Romance       6.005776
      Single_Genre       5.971873
      Action_Drama       5.640523
     Action_Comedy       5.567526
# Count the movies behind one combination label
count_movies <- function(combo) {
  if (combo == "Single_Genre") {
    sum(genre_combos$genre_count == 1)
  } else if (combo == "Multiple_Genres") {
    sum(genre_combos$genre_count > 1)
  } else {
    # Pair columns are named with the lower-case label, e.g. action_comedy
    sum(genre_combos[[tolower(combo)]] == 1, na.rm = TRUE)
  }
}

# Build one "<combination>: <n> movies" line per combination
count_lines <- function(combos) {
  vapply(as.character(combos),
         function(combo) paste0(combo, ": ", count_movies(combo), " movies"),
         character(1), USE.NAMES = FALSE)
}

# Count movies in top and bottom combinations
output_lines <- c(
  "NUMBER OF MOVIES IN TOP COMBINATIONS:\n",
  count_lines(top_10_combinations$Combination),
  "\nNUMBER OF MOVIES IN BOTTOM COMBINATIONS:\n",
  count_lines(bottom_10_combinations$Combination)
)

# Print everything at once
cat(output_lines, sep = "\n")
NUMBER OF MOVIES IN TOP COMBINATIONS:

Animation_Romance: 40 movies
Animation_Drama: 135 movies
Animation_Documentary: 43 movies
Action_Documentary: 16 movies
Documentary_Romance: 10 movies
Animation_Short: 3116 movies
Animation_Comedy: 2251 movies
Comedy_Short: 3880 movies
Drama_Short: 1099 movies
Drama_Documentary: 127 movies

NUMBER OF MOVIES IN BOTTOM COMBINATIONS:

Documentary_Short: 867 movies
Drama_Romance: 2561 movies
Action_Animation: 84 movies
Multiple_Genres: 15537 movies
Comedy_Drama: 3099 movies
Comedy_Romance: 2195 movies
Action_Romance: 277 movies
Single_Genre: 30465 movies
Action_Drama: 1799 movies
Action_Comedy: 776 movies


Bar Plot (Top Combinations)

# Create colors
pastels <- c("#FFB3BA", # Animation_Romance
             "#FFDFBA", # Animation_Drama
             "#FFFFBA", # Animation_Documentary
             "#E6FFCC", # Action_Documentary
             "#BAFFC9", # Documentary_Romance
             "#C9E7FF", # Animation_Short
             "#BAE1FF", # Animation_Comedy
             "#E2BAFF", # Comedy_Short
             "#EEE5FF", # Drama_Short
             "#D3D3D3") # Drama_Documentary

# Set larger margins
par(mar = c(6, 4, 4, 2))

# Bar plot
bp <- barplot(top_10_combinations$Average_Rating,
        col = pastels,
        main = "Top 10 Genre Combinations by Average Rating",
        ylab = "Average Rating (1-10)",
        ylim = c(0, max(top_10_combinations$Average_Rating) * 1.15),
        xaxt = "n")

# X-axis labels rotated 45-degrees
text(x = bp, 
     y = par("usr")[3] - (max(top_10_combinations$Average_Rating) * 0.05),
     labels = top_10_combinations$Combination,
     srt = 45,
     adj = 1,
     xpd = TRUE,
     cex = 0.7)

# Add x-axis title
mtext("Genre Combination", side = 1, line = 5, cex = 1) 

# Add numbers on top of bars
text(x = bp, 
     y = top_10_combinations$Average_Rating + (max(top_10_combinations$Average_Rating) * 0.02),
     labels = round(top_10_combinations$Average_Rating, 2),
     pos = 3, 
     cex = 0.8, 
     xpd = TRUE)

Graph Analysis: Animation-based genre combinations dominate the highest-rated pairings, with Animation-Romance leading at an average of 7.09/10. Documentary and Drama combinations also perform well, particularly when paired with Animation or Romance. This suggests that blending a visual genre (Animation) with emotionally driven genres (Romance, Drama) resonates with audiences. The top five pairings rest on small samples, however (10 to 135 movies each, versus thousands for Animation-Short or Comedy-Short), so their averages are less stable than those of the larger groups.


Bar Plot (Bottom Combinations)

# Create colors
pastels <- c("#FFB3BA", # Documentary_Short
             "#FFDFBA", # Drama_Romance
             "#FFFFBA", # Action_Animation
             "#E6FFCC", # Multiple_Genres
             "#BAFFC9", # Comedy_Drama
             "#C9E7FF", # Comedy_Romance
             "#BAE1FF", # Action_Romance
             "#E2BAFF", # Single_Genre
             "#EEE5FF", # Action_Drama
             "#D3D3D3") # Action_Comedy

# Set larger margins
par(mar = c(6.5, 4, 4, 2))

# Bar plot
bp <- barplot(bottom_10_combinations$Average_Rating,
        col = pastels,
        main = "Bottom 10 Genre Combinations by Average Rating",
        ylab = "Average Rating (1-10)",
        ylim = c(0, max(bottom_10_combinations$Average_Rating) * 1.15),
        xaxt = "n")

# X-axis labels rotated 45-degrees
text(x = bp, 
     y = par("usr")[3] - (max(bottom_10_combinations$Average_Rating) * 0.05),
     labels = bottom_10_combinations$Combination,
     srt = 45,
     adj = 1,
     xpd = TRUE,
     cex = 0.7)

# Add x-axis title
mtext("Genre Combination", side = 1, line = 5, cex = 1) 

# Add numbers on top of bars
text(x = bp, 
     y = bottom_10_combinations$Average_Rating + (max(bottom_10_combinations$Average_Rating) * 0.02),
     labels = round(bottom_10_combinations$Average_Rating, 2),
     pos = 3, 
     cex = 0.8, 
     xpd = TRUE)

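Graph Analysis: Action-based combinations consistently underperform, with Action-Drama (5.64) and Action-Comedy (5.57) the lowest-rated pairings. The two aggregate categories also appear in this list: single-genre films average 5.97, below the 6.26 average for films with more than one genre and below most specific pairings. This indicates that single-genre films tend to rate lower than well-matched genre pairs. Because the list covers 10 of only 23 entries, its upper rows (around 6.3) sit mid-table rather than at the bottom.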

Conclusion

Core Question: Given historical movie data, can we accurately predict success metrics and identify what factors contribute to higher ratings in the film industry?


Key Findings

  • (Slide 4) Genre Performance Matters Most: Documentary and Animation films consistently achieve the highest ratings, while Action films receive the lowest.
  • (Slide 5) Temporal Trends Are Stable: Movie ratings have remained consistent since the 1950s, with a horizontal trend line indicating stable audience standards despite technological advancements.
  • (Slide 6) Budget Has Minimal Impact: Production budget shows only a weak correlation with ratings. Higher spending does not guarantee better audience reception, challenging conventional industry wisdom.
  • (Slide 7) Limited Predictive Accuracy: Our multiple regression model achieves modest predictive power, indicating that while we can identify influential factors, precise rating prediction remains challenging with basic features.
  • (Slide 8) Strategic Genre Combinations Outperform: Animation-Romance and Animation-Drama combinations achieve the highest ratings, suggesting that thoughtful genre pairing can boost audience ratings compared to single-genre films.


Industry Implications

  • For Studios and Investors:
    • Prioritize genre strategy over budget increases
    • Consider Animation and Documentary projects for critical acclaim
    • Avoid over-reliance on Action genre for quality expectations
    • Explore genre hybrids (Animation + Emotional genres) for maximum impact
  • For Predictive Modeling:
    • Basic features (budget, year, genre) provide limited predictive power
    • Future models should incorporate additional factors:
      • Director track record
      • Screenplay quality
      • Marketing spend
      • Social media buzz
    • Ensemble methods and machine learning may improve accuracy beyond linear regression


Final Assessment

We can successfully identify what factors contribute to higher ratings, with genre emerging as the dominant influence. However, accurate prediction of exact ratings remains limited with the available features. The film industry’s complexity requires more sophisticated modeling approaches, but our analysis provides valuable strategic insights for content development and investment decisions.

The most reliable success formula: Combine artistic genres (Animation/Documentary) with emotional storytelling (Romance/Drama) while maintaining reasonable budgets, as financial scale alone does not drive audience satisfaction.


Explore the Data Yourself!

https://vivixntrxn.shinyapps.io/vtran32_project1/

Select any genre to see how it performs alone and in combination with other genres.